Opened 10 years ago

Closed 10 years ago

Last modified 10 years ago

#5361 closed defect (fixed)

/vsizip: troubles with cyrillic filenames

Reported by: oleinik
Owned by: warmerdam
Priority: normal
Milestone: 1.11.0
Component: GDAL_Raster
Version: 1.10.1
Severity: normal
Keywords:
Cc:

Description (last modified by oleinik)

Most programs on Windows create ZIP archives with filenames in the DOS CP866 charset. Some archives additionally store the filenames in Unicode. GDAL neither converts between the DOS (OEM) and ANSI charsets nor uses the Unicode version of the name, which causes trouble. To open the dataset '/vsizip/archive.zip/image_name' with a Cyrillic name, we first have to convert image_name from ANSI to OEM, then concatenate it with the prefix and the archive name, and finally convert the result to UTF-8. This is not easy to use. Moreover, different parts of the resulting dataset description end up in different charsets, so the PAM .aux.xml and .ovr filenames are not usable by the OS.
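To illustrate the mismatch described above, here is a minimal Python sketch (not GDAL code) that encodes one Cyrillic name in the three charsets involved. The helper name `encode_all` and the sample filename are hypothetical; cp1251 is assumed to be the ANSI codepage and cp866 the OEM codepage, as on a Russian Windows installation.

```python
def encode_all(name: str) -> dict[str, bytes]:
    """Encode a filename in the OEM (cp866), ANSI (cp1251) and UTF-8 charsets."""
    return {enc: name.encode(enc) for enc in ("cp866", "cp1251", "utf-8")}


if __name__ == "__main__":
    # Hypothetical sample name; each charset yields a different byte string,
    # so a byte-wise comparison against the name stored in the archive fails
    # unless both sides use the same charset.
    for enc, raw in encode_all("снимок.tif").items():
        print(f"{enc:7} {raw.hex(' ')}")
```

The same name yields three different byte sequences, which is why a name typed in one charset does not match the name stored in the archive in another.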

Attachments (5)

a.zip (326 bytes ) - added by oleinik 10 years ago.
created by linux, uses general purpose bit 11
b.zip (294 bytes ) - added by oleinik 10 years ago.
created by 7zip, uses general purpose bit 11
c.zip (354 bytes ) - added by oleinik 10 years ago.
created by winzip, uses combination cp866 + utf-8
d.zip (354 bytes ) - added by oleinik 10 years ago.
created by winrar, uses combination cp866 + utf-8
e.zip (270 bytes ) - added by oleinik 10 years ago.
This archive contains only CP866 names


Change History (16)

comment:1 by oleinik, 10 years ago

Description: modified (diff)

comment:2 by Even Rouault, 10 years ago

Is the encoding of filenames inside a ZIP file indicated somewhere in the ZIP ? If not, there's no way to address that issue automatically.

in reply to:  2 comment:3 by oleinik, 10 years ago

Look at "APPENDIX D" of the latest ZIP specification: http://www.pkware.com/documents/casestudies/APPNOTE.TXT
An application can choose how to store filenames in Unicode. I created zip archives on Windows and Ubuntu Linux with Linux zip, the Windows archiver, 7zip, winrar and winzip. 7zip and Linux zip use general purpose bit 11 to indicate UTF-8 encoding. The Windows archiver, winrar and winzip store names in the cp866 charset. Additionally, winrar and winzip store UTF-8 names in an extra field.
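A quick way to check whether an archive uses general purpose bit 11 is to read the flags with Python's standard `zipfile` module. This is only an inspection sketch, not part of GDAL; the helper names `utf8_flags` and `make_sample` are hypothetical, and the sample archive is built in memory rather than taken from the attachments.

```python
import io
import zipfile

UTF8_FLAG = 0x800  # general purpose bit 11: filename and comment are UTF-8


def utf8_flags(zip_bytes: bytes) -> dict[str, bool]:
    """Map each member name to whether general purpose bit 11 is set."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_FLAG)
            for info in zf.infolist()
        }


def make_sample() -> bytes:
    """Build a small in-memory archive with one Cyrillic and one ASCII name."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # Python's zipfile sets bit 11 itself when a name is not pure ASCII.
        zf.writestr("снимок.tif", b"")
        zf.writestr("image.tif", b"")
    return buf.getvalue()


if __name__ == "__main__":
    for name, is_utf8 in utf8_flags(make_sample()).items():
        print(name, "bit 11 set" if is_utf8 else "bit 11 unset")
```

Note that when bit 11 is unset, `zipfile` decodes the stored name as CP437, so a CP866 name read this way shows up as mojibake.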

comment:4 by oleinik, 10 years ago

Winrar and winzip store the UTF-8 name in the 0x7075 extra field (Info-ZIP Unicode Path Extra Field).
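For reference, a sketch of how the 0x7075 field can be read from a member's extra data. It assumes the layout given in the ZIP specification (1-byte version, 4-byte CRC-32 of the name stored in the standard header, then the UTF-8 name). The function name `unicode_path_from_extra` is hypothetical and this is not the code GDAL uses.

```python
import struct
import zlib

INFOZIP_UNICODE_PATH = 0x7075  # Info-ZIP Unicode Path Extra Field


def unicode_path_from_extra(extra: bytes, raw_name: bytes):
    """Return the UTF-8 name from a 0x7075 extra field, or None.

    raw_name is the undecoded name from the standard header; the field is
    ignored when its CRC-32 does not match that name.
    """
    pos = 0
    while pos + 4 <= len(extra):
        # Every extra field starts with a 2-byte ID and a 2-byte data size.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        if header_id != INFOZIP_UNICODE_PATH or len(data) < 5:
            continue
        version, name_crc = struct.unpack_from("<BI", data, 0)
        if version != 1 or name_crc != (zlib.crc32(raw_name) & 0xFFFFFFFF):
            continue  # unknown version or stale field
        return data[5:].decode("utf-8")
    return None
```

With Python's `zipfile`, the extra data is available as `ZipInfo.extra`; the raw name has to be recovered separately, since `zipfile` decodes it on read.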

comment:5 by Even Rouault, 10 years ago

Do you have an example of such zip with cp866 + utf-8 names ?

by oleinik, 10 years ago

Attachment: a.zip added

created by linux, uses general purpose bit 11

by oleinik, 10 years ago

Attachment: b.zip added

created by 7zip, uses general purpose bit 11

by oleinik, 10 years ago

Attachment: c.zip added

created by winzip, uses combination cp866 + utf-8

by oleinik, 10 years ago

Attachment: d.zip added

created by winrar, uses combination cp866 + utf-8

comment:6 by Even Rouault, 10 years ago

Milestone: 2.0
Resolution: fixed
Status: new → closed

trunk r26882 "/vsizip/: use the extented field for UTF-8 filenames to retrieve UTF-8 filenames when the base filename field contains CP437 filename (#5361)"

in reply to:  6 comment:7 by oleinik, 10 years ago

Resolution: fixed
Status: closed → reopened

What about general purpose bit 11? Did you add a conversion to UTF-8 for the case where this bit isn't set?

comment:8 by Even Rouault, 10 years ago

In the a.zip and b.zip samples, the filenames are already in UTF-8. In the c.zip and d.zip samples, there's a 0x7075 extra field with the filenames in UTF-8. That's what I've implemented.

So conversion to UTF-8 would be needed if there's no 0x7075 and the general purpose bit 11 is not set ? Do you have such samples ?

And you mention CP866, but the ZIP spec only mentions CP437 as the default encoding.

by oleinik, 10 years ago

Attachment: e.zip added

This archive contains only CP866 names

in reply to:  8 comment:9 by oleinik, 10 years ago

Replying to rouault:

So conversion to UTF-8 would be needed if there's no 0x7075 and the general purpose bit 11 is not set ?

YES.

And you mention CP866, but the ZIP spec only mentions CP437 as the default encoding.

I think the best way (at least on Windows) is to use the CP_OEMCP codepage as the source codepage when converting to UTF-8.

comment:10 by Even Rouault, 10 years ago

Resolution: fixed
Status: reopened → closed

trunk r26889 "/vsizip/ : recode filenames from CP437 (or CP_OEMCP on Windows), or encoding indicated in CPL_ZIP_ENCODING config option if there are not already in UTF-8 (general purpose bit 11 unset) and that extra field 0x7075 is not available (#5361)"
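The precedence described in this commit message can be sketched in Python as follows. This is an illustration of the decision order only, not the actual GDAL implementation (which is C++); the function name `resolve_member_name` and its parameters are hypothetical, and `unicode_path` stands for the name already extracted from a 0x7075 extra field, if any.

```python
from typing import Optional

UTF8_FLAG = 0x800  # general purpose bit 11


def resolve_member_name(
    raw_name: bytes,
    flag_bits: int,
    unicode_path: Optional[str] = None,
    fallback_encoding: str = "cp437",
) -> str:
    """Pick the UTF-8 name of a ZIP member.

    1. Bit 11 set: the stored name is already UTF-8.
    2. Otherwise, use the name from the 0x7075 extra field if present.
    3. Otherwise, recode from the fallback encoding (CP437 by default;
       the caller may pass another one, e.g. "cp866").
    """
    if flag_bits & UTF8_FLAG:
        return raw_name.decode("utf-8")
    if unicode_path is not None:
        return unicode_path
    return raw_name.decode(fallback_encoding)


if __name__ == "__main__":
    # Case of e.zip: CP866 name only, no bit 11, no extra field.
    stored = "тест.tif".encode("cp866")
    print(resolve_member_name(stored, 0, None, fallback_encoding="cp866"))
```

In GDAL the fallback encoding is chosen through the CPL_ZIP_ENCODING config option named in the commit message; the `fallback_encoding` argument here only mimics that choice.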

comment:11 by Even Rouault, 10 years ago

Milestone: 2.0 → 1.11.0