Opened 11 years ago
Last modified 9 years ago
#5239 closed defect
MSSQLSpatial output mangles non-ASCII characters — at Initial Version
Reported by: | iainelder | Owned by: | warmerdam |
---|---|---|---|
Priority: | high | Milestone: | 2.0.0 |
Component: | OGR_SF | Version: | 1.9.2 |
Severity: | normal | Keywords: | |
Cc: | tamas |
Description
This problem was originally described on gis.se.
I want to import a list of international place names and their locations into a SQL Server database using ogr2ogr.
The source data is a CSV file encoded in UTF-8.
I've prepared a five-record sample file called london_points.csv
. It contains the records for London in Arabic, Czech, English. Russian, and Chinese. It looks like this:
LanguageCode,Name,Longitude,Latitude "ar","لندن","-0.143555","51.577222" "cz","Londýn","-0.143555","51.577222" "en","London","-0.143555","51.577222" "ru","Лондон","-0.143555","51.577222" "zw","倫敦","-0.143555","51.577222"
(Don't worry if it looks like the Arabic line has swapped the Name
and Latitude
values. It's just a quirk of bidirectional text rendering.)
To import the CSV file using ogr2ogr I created a VRT file called london_points.vrt
. It tells ogr2ogr to import the data and derive a point column from the Longitude and Latitude columns. It looks like this:
<OGRVRTDataSource> <OGRVRTLayer name="london_points"> <SrcDataSource>london_points.csv</SrcDataSource> <GeometryType>wkbPoint</GeometryType> <LayerSRS>WGS84</LayerSRS> <GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude"/> </OGRVRTLayer> </OGRVRTDataSource>
I used a command like this to import the records:
ogr2ogr --DEBUG ON -s_srs EPSG:4326 -lco GEOM_TYPE=geography -overwrite -f MSSQLSpatial "MSSQL:server=.;database=tempdb;trusted_connection=yes" london_points.vrt
For clarity I've put each parameter on its own line.
The output shows that ogr2ogr successfully imports five rows:
OGR: OGROpen(london_points.vrt/014AFF40) succeeded as VRT. OGR_MSSQLSpatial: EstablishSession(Connection:"server=.;database=tempdb;trusted_connection=yes") ODBC: SQLDriverConnect(DRIVER=SQL Server;server=.;database=tempdb;trusted_connection=yes) OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points. OGR: OGROpen(MSSQL:server=.;database=tempdb;trusted_connection=yes/0338CFF0) succeeded as MSSQLSpatial. OGR: OGROpen(london_points.csv/033A41C8) succeeded as CSV. MSSQLSpatial: DeleteLayer(london_points) OGR_MSSQLSpatial: Using column ogr_fid as FID for table london_points. OGR2OGR: 5 features written in layer 'london_points' ODBC: SQLDisconnect() VRT: 5 features read on layer 'london_points'. CSV: 5 features read on layer 'london_points'.
I use a query like this to inspect the results in the database:
SELECT languagecode, name, longitude, latitude FROM tempdb.dbo.london_points;
The result set looks like this:
languagecode name longitude latitude ------------ ------------ --------- --------- ar لندن -0.143555 51.577222 cz Londýn -0.143555 51.577222 en London -0.143555 51.577222 ru Лондон -0.143555 51.577222 zw 倫敦 -0.143555 51.577222
The result set contains the correct number of rows. The languagecode
, longitude
, and latitude
columns, which contain only ASCII characters, contain the correct values.
But all the non-ASCII characters in the name
column are mangled!
I traced the import using SQL Server Profiler to capture the SQL statements that ogr2ogr uses to insert data into the london_points
table.
The trace captured these insert statements:
INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude]) VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'ar', 'لندن', '-0.143555', '51.577222'); go INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude]) VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'cz', 'Londýn', '-0.143555', '51.577222'); go INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude]) VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'en', 'London', '-0.143555', '51.577222'); go INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude]) VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'ru', 'Лондон', '-0.143555', '51.577222'); go INSERT INTO [dbo].[london_points] (ogr_geometry, [languagecode], [name], [longitude], [latitude]) VALUES (geography::STGeomFromText('POINT (-0.143555 51.577222)',4326), 'zw', '倫敦', '-0.143555', '51.577222'); go
For clarity I split each INSERT VALUES statement across two lines.
Inspecting the SQL reveals two problems.
ogr2ogr mangles the values for the name column before they reach the server.
ogr2ogr passes the values as varchar literals ('abc'
) rather than nvarchar literals (N'abc'
). Even if the query text contained the correct characters, the server would replace with a question mark any character that don't exist in the default code page.
This query illustrates how SQL Server loses information when it processes varchar literals containing non-existent characters:
SELECT 'لندن' AS varchar_value, N'لندن' AS nvarchar_value UNION ALL SELECT 'Londýn', N'Londýn' UNION ALL SELECT 'London', N'London' UNION ALL SELECT 'Лондон', N'Лондон' UNION ALL SELECT '倫敦', N'倫敦'; varchar_value nvarchar_value ------------- -------------- ???? لندن Londýn Londýn London London ?????? Лондон ?? 倫敦
My default code page is Windows-1252. It contains characters for English and Czech, so varchar string literals are fine for these languages. Arabic, Russian, and Traditional Chinese supported only in Unicode, so all the characters are replaced by question marks.
The MSSQLSpatial driver should be able to generate query text containing the correct nvarchar literal values.
Change History (3)
by , 11 years ago
Attachment: | london_points.csv added |
---|
by , 11 years ago
Attachment: | london_points.vrt added |
---|
The VRT that ogr2ogr needs to import the CSV.
by , 11 years ago
Attachment: | ogr2ogr_import.trc added |
---|
SQL Server Profiler trace capturing the faulty INSERT statements.
The sample data that I want to import.