Opened 11 years ago

Closed 9 years ago

Last modified 9 years ago

#5239 closed defect (fixed)

MSSQLSpatial output mangles non-ASCII characters

Reported by: iainelder Owned by: tamas
Priority: high Milestone: 2.0.0
Component: OGR_SF Version: 1.9.2
Severity: normal Keywords:
Cc: tamas

Description (last modified by iainelder)

This problem was originally described in the gis.se question How to preserve non-ASCII characters in ogr2ogr import?.

I want to import a list of international place names and their locations into a SQL Server database using ogr2ogr.

The source data is a CSV file encoded in UTF-8.

I've prepared a five-record sample file called london_points.csv. It contains the records for London in Arabic, Czech, English. Russian, and Chinese. It looks like this:

LanguageCode,Name,Longitude,Latitude
"ar","لندن","-0.143555","51.577222"
"cz","Londýn","-0.143555","51.577222"
"en","London","-0.143555","51.577222"
"ru","Лондон","-0.143555","51.577222"
"zw","倫敦","-0.143555","51.577222"

(Don't worry if it looks like the Arabic line has swapped the Name and Latitude values. It's just a quirk of bidirectional text rendering.)

To import the CSV file using ogr2ogr I created a VRT file called london_points.vrt. It tells ogr2ogr to import the data and derive a point column from the Longitude and Latitude columns. It looks like this:

<OGRVRTDataSource>
    <OGRVRTLayer name="london_points">
        <SrcDataSource>london_points.csv</SrcDataSource>
        <GeometryType>wkbPoint</GeometryType>
        <LayerSRS>WGS84</LayerSRS>
        <GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude"/>
    </OGRVRTLayer>
</OGRVRTDataSource>

I used a command like this to import the records:

ogr2ogr
    --DEBUG ON
    -s_srs EPSG:4326
    -lco GEOM_TYPE=geography
    -overwrite
    -f MSSQLSpatial "MSSQL:server=.;database=tempdb;trusted_connection=yes"
    london_points.vrt

For clarity I've put each parameter on its own line.

The output shows that ogr2ogr successfully imports five rows:

OGR: OGROpen(london_points.vrt/014AFF40) succeeded as VRT.
OGR_MSSQLSpatial: EstablishSession(Connection:"server=.;database=tempdb;trusted_connection=yes")
ODBC: SQLDriverConnect(DRIVER=SQL Server;server=.;database=tempdb;trusted_connection=yes)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR: OGROpen(MSSQL:server=.;database=tempdb;trusted_connection=yes/0338CFF0) succeeded as MSSQLSpatial.
OGR: OGROpen(london_points.csv/033A41C8) succeeded as CSV.
MSSQLSpatial: DeleteLayer(london_points)
OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points.
OGR2OGR: 5 features written in layer 'london_points'
ODBC: SQLDisconnect()
VRT: 5 features read on layer 'london_points'.
CSV: 5 features read on layer 'london_points'.

I use a query like this to inspect the results in the database:

SELECT languagecode, name, longitude, latitude
FROM tempdb.dbo.london_points;

The result set looks like this:

languagecode name         longitude latitude
------------ ------------ --------- ---------
ar           لندن     -0.143555 51.577222
cz           Londýn      -0.143555 51.577222
en           London       -0.143555 51.577222
ru           Лондон -0.143555 51.577222
zw           倫敦       -0.143555 51.577222

The result set contains the correct number of rows. The languagecode, longitude, and latitude columns, which contain only ASCII characters, contain the correct values.

But all the non-ASCII characters in the name column are mangled!

I traced the import using SQL Server Profiler to capture the SQL statements that ogr2ogr uses to insert data into the london_points table.

The trace captured these insert statements:

INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'ar', 'لندن', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'cz', 'Londýn', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'en', 'London', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'ru', 'Лондон', '-0.143555', '51.577222');
go
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude])
VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'zw', '倫敦', '-0.143555', '51.577222');
go

For clarity I split each INSERT VALUES statement across two lines.

Inspecting the SQL reveals two problems.

ogr2ogr mangles the values for the name column before they reach the server.

ogr2ogr passes the values as varchar literals ('abc') rather than nvarchar literals (N'abc'). Even if the query text contained the correct characters, the server would replace with a question mark any character that don't exist in the default code page.

This query illustrates how SQL Server loses information when it processes varchar literals containing non-existent characters:

SELECT 'لندن' AS varchar_value, N'لندن' AS nvarchar_value
UNION ALL
SELECT 'Londýn', N'Londýn'
UNION ALL
SELECT 'London', N'London'
UNION ALL
SELECT 'Лондон', N'Лондон'
UNION ALL
SELECT '倫敦', N'倫敦';

varchar_value nvarchar_value
------------- --------------
????          لندن
Londýn        Londýn
London        London
??????        Лондон
??            倫敦

My default code page is Windows-1252. It contains characters for English and Czech, so varchar string literals are fine for these languages. Arabic, Russian, and Traditional Chinese supported only in Unicode, so all the characters are replaced by question marks.

The MSSQLSpatial driver should be able to generate query text containing the correct nvarchar literal values.

Attachments (3)

london_points.csv (240 bytes ) - added by iainelder 11 years ago.
The sample data that I want to import.
london_points.vrt (321 bytes ) - added by iainelder 11 years ago.
The VRT that ogr2ogr needs to import the CSV.
ogr2ogr_import.trc (22.5 KB ) - added by iainelder 11 years ago.
SQL Server Profiler trace capturing the faulty INSERT statements.

Download all attachments as: .zip

Change History (11)

by iainelder, 11 years ago

Attachment: london_points.csv added

The sample data that I want to import.

by iainelder, 11 years ago

Attachment: london_points.vrt added

The VRT that ogr2ogr needs to import the CSV.

by iainelder, 11 years ago

Attachment: ogr2ogr_import.trc added

SQL Server Profiler trace capturing the faulty INSERT statements.

comment:1 by iainelder, 11 years ago

Description: modified (diff)

comment:2 by martinl, 10 years ago

Component: defaultOGR_SF
Milestone: 1.11.1
Priority: normalhigh

Any progress? Thanks for info in advance.

comment:3 by Even Rouault, 10 years ago

Cc: tamas added

comment:4 by tamas, 10 years ago

Owner: changed from warmerdam to tamas

comment:5 by tamas, 10 years ago

Fixed in trunk (r27746)

comment:6 by tamas, 10 years ago

Milestone: 1.11.12.0
Status: newassigned

comment:7 by tamas, 9 years ago

Resolution: fixed
Status: assignedclosed

closing

comment:8 by Even Rouault, 9 years ago

Milestone: 2.02.0.0

Milestone renamed

Note: See TracTickets for help on using tickets.