Opened 13 years ago
Closed 13 years ago
#1465 closed defect (fixed)
loader/Latin1 regression failure with database SQL_ASCII encoding
Reported by: | gdt | Owned by: | mcayland
---|---|---|---
Priority: | medium | Milestone: | PostGIS 2.0.0
Component: | utils/loader-dumper | Version: | master
Keywords: | | Cc: |
Description
$ ./run_test loader/Latin1
Creating spatial db postgis_reg
TMPDIR is /tmp/pgis_reg_20444
PATH is /home/gdt/ccache:/usr/pkg/java/sun-6/bin:/usr/amanda/bin:/usr/amanda/sbin:/home/gdt/bin:/home/gdt/bin/i386-NetBSD:/usr/y0/sbin:/usr/y0/bin:/usr/pkg/sbin:/usr/pkg/bin:/usr/X11R7/bin:/usr/local/sbin:/usr/local/bin:/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/pilot/bin
PostgreSQL 8.4.10 on i386--netbsdelf, compiled by GCC gcc (GCC) 4.1.3 20080704 prerelease (NetBSD nb3 20111107), 32-bit
Postgis 2.0.0SVN - 2012-01-13 20:25:58
GEOS: 3.3.2-CAPI-1.7.2
PROJ: Rel. 4.7.1, 23 September 2009
Running tests
 loader/Latin1 ... failed ( wkb test: running shp2pgsql output: /tmp/pgis_reg_20444/loader.err)
 uninstall .. ok (3697)
Run tests: 2
Failed: 1
SET CLIENT_ENCODING TO UTF8;
SET STANDARD_CONFORMING_STRINGS TO ON;
BEGIN;
CREATE TABLE "loadedshp" (gid serial, "address" varchar(32));
ALTER TABLE "loadedshp" ADD PRIMARY KEY (gid);
INSERT INTO "loadedshp" ("address") VALUES ('Tårneby in Våler I Solør kommune');
COMMIT;
SET
SET
BEGIN
psql:/tmp/pgis_reg_20444/loader.out:5: NOTICE: CREATE TABLE will create implicit sequence "loadedshp_gid_seq" for serial column "loadedshp.gid"
CREATE TABLE
psql:/tmp/pgis_reg_20444/loader.out:6: NOTICE: ALTER TABLE / ADD PRIMARY KEY will create implicit index "loadedshp_pkey" for table "loadedshp"
ALTER TABLE
psql:/tmp/pgis_reg_20444/loader.out:7: ERROR: value too long for type character varying(32)
The issue might be that the encoded string is longer than 32 bytes:
regress > wc
Tårneby in Våler I Solør kommune
1 6 36
(which is 35 bytes for the string itself, since I added a trailing newline).
I don't know enough about beyond-ASCII character sets to understand what's wrong, but the loader-generated SQL looks ok.
Is varchar(N) supposed to handle N bytes, or N possibly-wide characters?
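A minimal sketch that makes the two counts visible, assuming a psql session against the regression database (char_length() and octet_length() are PostgreSQL's character- and byte-count functions):

-- Sketch only: character count vs. byte count of the test address.
-- In a UTF-8 database this returns 32 and 35, so varchar(32) accepts the value;
-- in a SQL_ASCII database every byte is treated as a character, so both are 35.
SELECT char_length('Tårneby in Våler I Solør kommune') AS chars,
       octet_length('Tårneby in Våler I Solør kommune') AS bytes;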
Change History (12)
comment:1 by , 13 years ago
comment:2 by , 13 years ago
I reinitialized postgresql with LC_ALL=en_US.ISO8859-1, and the string can now be inserted, but the output differs from the expected output. I'm 99% sure the output is 8859-1 (the bytes match what's in Latin1.dbf), but the expected output has 35 bytes and appears to be UTF-8 encoded.
With this change, the test passes:

Index: regress/loader/Latin1.select.sql
===================================================================
--- regress/loader/Latin1.select.sql	(revision 8814)
+++ regress/loader/Latin1.select.sql	(working copy)
@@ -1 +1,2 @@
+set client_encoding to UTF8;
 SELECT * FROM loadedshp;
So there seem to be two issues:

- Inserting UTF-8 into a SQL_ASCII database requires a bigger varchar to account for the expansion.
- Unless there's a requirement for the database to have a particular encoding, both loading and dumping have to set the client encoding (see the sketch below).
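A minimal sketch of the dump/select side of that second point, assuming a database created with a real server encoding such as LATIN1 and the loadedshp table from the generated SQL above (with SQL_ASCII the server performs no conversion at all, which is the other issue). Without the explicit SET, psql returns the address in the server's encoding, so the byte-for-byte comparison against the UTF-8 expected output fails:

-- Sketch only: make the query side of the round trip encoding-explicit.
SHOW server_encoding;           -- whatever the cluster was initialized with, e.g. LATIN1
SET CLIENT_ENCODING TO UTF8;    -- ask for results converted to UTF-8 on output
SELECT address FROM loadedshp;  -- now comparable byte-for-byte with the UTF-8 expected file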
comment:3 by , 13 years ago
If I nuke the pgsql database dir and recreate with LC_ALL=en_US.UTF-8, then the regression test passes without my modification. But even with the modification, creating the database without LC_ALL set (and hence SQL_ASCII encoding) causes the test to fail with value too long (because there are 35 bytes in the UTF-8 encoding of the test address, which don't fit in varchar(32)).
comment:4 by , 13 years ago
Right, so we don't use PostgreSQL to do the encoding conversion but use iconv instead. Hence we force everything to UTF-8 encoding in the output and indicate this to PostgreSQL.
Now I can't say that I have any shapefiles lying around with non-7-bit-ASCII names, so it wouldn't surprise me if there are errors in this area. I'll try to take a look at this; however, I'm fairly short on time this week.
comment:5 by , 13 years ago
To clarify, there are two separate bugs in the regression test:
- Selecting the string doesn't set the client encoding, so e.g. if the database's native encoding is ISO8859-1, the ISO8859-1-encoded version of that address doesn't match the UTF-8 version. This is easy to fix as above.
- If the database is SQL_ASCII, then the 35 bytes of the UTF-8 version of the 32-character address don't fit in a varchar(32). The fix is either to use a shorter string, a longer varchar, or to explicitly create the regression database with a particular encoding (the last option is sketched below).
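A sketch in plain SQL of that last option, assuming the regression database name from the log above (postgis_reg); the patches later in this ticket do the equivalent from regress/run_test via createdb:

-- Sketch only: create the regression database with an explicit encoding
-- instead of inheriting SQL_ASCII from the cluster; template0 is required
-- when the requested encoding differs from the template database's.
CREATE DATABASE postgis_reg
  WITH TEMPLATE template0
       ENCODING 'UTF8'
       LC_COLLATE 'C'
       LC_CTYPE 'C';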
comment:6 by , 13 years ago
Creating the DB with a known encoding seems good to me. We really need to be more in control of the environment our testsuite runs in, as much as I'd like to create a whole cluster to enforce such control.
@gdt: the DB creation code is in regress/run_test, if you want to come up with a patch. Make sure it works from 8.4 to 9.2
comment:7 by , 13 years ago
comment:8 by , 13 years ago
Summary changed: loader/Latin1 regression failure (NetBSD, postgresql 8.4.10) → loader/Latin1 regression failure with database SQL_ASCII encoding
With the following, the loader/Latin1 test passes even when pgsql was initialized with no locale and hence defaults to SQL_ASCII. Note that I've chosen to set the encoding to ISO8859-1 instead of UTF-8, to show that the translation on both load and query works too.
(Of course, the commented-out --locale line is not needed; that fails on my system and I'm not sure why, so it's left as something to try.)
Index: regress/run_test
===================================================================
--- regress/run_test	(revision 8924)
+++ regress/run_test	(working copy)
@@ -559,7 +559,8 @@
 	# ON_ERROR_STOP is used by psql to return non-0 on an error
 	_psql_opts="--no-psqlrc --variable ON_ERROR_STOP=true"
 
-	createdb --template=template0 --lc-collate="C" "${DB}" > ${TMPDIR}/regress_log
+	# --locale en_US.UTF-8
+	createdb --encoding=ISO8859-1 --template=template0 --lc-collate="C" "${DB}" > ${TMPDIR}/regress_log
 	createlang plpgsql "${DB}" >> ${TMPDIR}/regress_log
 
 	# Count database objects before installing anything
Index: regress/loader/Latin1.select.sql
===================================================================
--- regress/loader/Latin1.select.sql	(revision 8924)
+++ regress/loader/Latin1.select.sql	(working copy)
@@ -1 +1,2 @@
+SET CLIENT_ENCODING to UTF8;
 SELECT * FROM loadedshp;
comment:9 by , 13 years ago
I believe the --locale line fails on my system because LC_COLLATE is not implemented well enough to satisfy postgresql. But just setting the encoding is enough for the test, because it's just about encoding round trips and no actual locale processing is needed.
comment:10 by , 13 years ago
Does forcing the encoding to UTF-8 work just as well? I don't want to test against ISO8859-1 when that's not a best practice; we prefer everyone to use UTF-8 encoding.
comment:11 by , 13 years ago
Yes, forcing to UTF-8 works too (with the SET CLIENT_ENCODING in the select file). Clean patch:
Index: regress/run_test
===================================================================
--- regress/run_test	(revision 9084)
+++ regress/run_test	(working copy)
@@ -559,7 +559,7 @@
 	# ON_ERROR_STOP is used by psql to return non-0 on an error
 	_psql_opts="--no-psqlrc --variable ON_ERROR_STOP=true"
 
-	createdb --template=template0 --lc-collate="C" "${DB}" > ${TMPDIR}/regress_log
+	createdb --encoding=UTF-8 --template=template0 --lc-collate="C" "${DB}" > ${TMPDIR}/regress_log
 	createlang plpgsql "${DB}" >> ${TMPDIR}/regress_log
 
 	# Count database objects before installing anything
Index: regress/loader/Latin1.select.sql
===================================================================
--- regress/loader/Latin1.select.sql	(revision 9084)
+++ regress/loader/Latin1.select.sql	(working copy)
@@ -1 +1,2 @@
+SET CLIENT_ENCODING to UTF8;
 SELECT * FROM loadedshp;
Editing loader.out and re-running, I find that with the column declared as varchar(35) the insert works, and with varchar(34) it errors. So this seems to be a postgresql issue (as compiled; i18n is hard, it seems), not a postgis issue, and I'm looking into it there.
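For the record, a minimal sketch of that observation outside loader.out (t34 and t35 are hypothetical scratch tables, not part of the test): in a SQL_ASCII database the length limit is effectively counted in bytes, so the 35-byte UTF-8 string fits in varchar(35) but not varchar(34), consistent with the byte counting described in comment:5.

-- Sketch only (SQL_ASCII database, hypothetical table names):
SET CLIENT_ENCODING TO UTF8;
CREATE TABLE t35 (address varchar(35));
CREATE TABLE t34 (address varchar(34));
INSERT INTO t35 (address) VALUES ('Tårneby in Våler I Solør kommune');  -- 35 bytes, fits
INSERT INTO t34 (address) VALUES ('Tårneby in Våler I Solør kommune');  -- ERROR: value too long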