Opened 4 years ago

Closed 4 years ago

Last modified 3 years ago

#5361 closed defect (fixed)

/vsizip: troubles with cyrillic filenames

Reported by: oleinik
Owned by: warmerdam
Priority: normal
Milestone: 1.11.0
Component: GDAL_Raster
Version: 1.10.1
Severity: normal
Keywords:
Cc:

Description (last modified by oleinik)

Most programs on Windows create zip archives with filenames in the DOS CP866 charset. Some zip archives additionally contain the filenames in Unicode. GDAL neither converts between the DOS (OEM) and ANSI charsets nor uses the Unicode version of the name, and this causes trouble. To open the dataset '/vsizip/archive.zip/image_name' when image_name is Cyrillic, we first have to convert image_name from ANSI to OEM, then concatenate it with the prefix and the archive name, and then convert the result to UTF-8. This is not easy to use. Moreover, different parts of the resulting dataset description are in different charsets, so the PAM .aux.xml and .ovr filenames are not suitable for the OS.
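To illustrate why the names fail to match, here is a minimal Python sketch. The file name is hypothetical, and CP1251 is assumed as the ANSI code page (as on a Russian-locale Windows); neither comes from the attached archives.

```python
# Hypothetical file name, used only for illustration.
name = "снимок.tif"

oem_bytes = name.encode("cp866")    # DOS/OEM charset that the archivers store
ansi_bytes = name.encode("cp1251")  # assumed ANSI code page (Russian locale)
utf8_bytes = name.encode("utf-8")   # encoding the final dataset name must use

# The three byte sequences are all different, so a byte-wise
# comparison between them cannot succeed without a conversion.
distinct = len({oem_bytes, ansi_bytes, utf8_bytes})

# Reading the OEM bytes with the wrong code page gives mojibake,
# while reading them as CP866 recovers the original name.
misread = oem_bytes.decode("cp1251")
recovered = oem_bytes.decode("cp866")
```

Only the Cyrillic part of the name is affected; the ASCII suffix ".tif" has the same bytes in all three encodings.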

Attachments (5)

a.zip (326 bytes) - added by oleinik 4 years ago.
created by linux, uses general purpose bit 11
b.zip (294 bytes) - added by oleinik 4 years ago.
created by 7zip, uses general purpose bit 11
c.zip (354 bytes) - added by oleinik 4 years ago.
created by winzip, uses combination cp866 + utf-8
d.zip (354 bytes) - added by oleinik 4 years ago.
created by winrar, uses combination cp866 + utf-8
e.zip (270 bytes) - added by oleinik 4 years ago.
This archive contains only CP866 names


Change History (16)

comment:1 Changed 4 years ago by oleinik

Description: modified (diff)

comment:2 Changed 4 years ago by Even Rouault

Is the encoding of filenames inside a ZIP file indicated somewhere in the ZIP ? If not, there's no way to address that issue automatically.

comment:3 in reply to:  2 Changed 4 years ago by oleinik

See "APPENDIX D" of the latest zip specification: http://www.pkware.com/documents/casestudies/APPNOTE.TXT An application can choose how it stores filenames in Unicode. I created zip archives on Windows and Ubuntu Linux with linux zip, the Windows archiver, 7zip, winrar and winzip. 7zip and linux zip use general purpose bit 11 to indicate UTF-8 encoding. The Windows archiver, winrar and winzip store names in the CP866 charset. In addition, winrar and winzip store the UTF-8 names in an extra field.

comment:4 Changed 4 years ago by oleinik

Winrar and winzip store the UTF-8 name in the 0x7075 extra field (Info-ZIP Unicode Path Extra Field).
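One way to check which of these mechanisms an archive uses is sketched below in Python with the standard zipfile module. The helper names `unicode_path_from_extra` and `describe` are hypothetical. The field layout (1 byte version, 4 bytes CRC-32 of the legacy name, then the UTF-8 name) follows the Info-ZIP Unicode Path Extra Field description in APPNOTE; this sketch does not verify the CRC.

```python
import struct
import zipfile

UTF8_FLAG = 0x800          # general purpose bit 11
UNICODE_PATH_ID = 0x7075   # Info-ZIP Unicode Path Extra Field


def unicode_path_from_extra(extra: bytes):
    """Return the UTF-8 name from a 0x7075 extra field, or None if absent."""
    pos = 0
    while pos + 4 <= len(extra):
        # Each extra field starts with a little-endian header ID and data size.
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        if header_id == UNICODE_PATH_ID and len(data) >= 5:
            # Skip 1 byte of version and 4 bytes of CRC-32 (not checked here).
            return data[5:].decode("utf-8")
        pos += 4 + size
    return None


def describe(path):
    """List (stored name, bit 11 set?, name from 0x7075 or None) per entry."""
    with zipfile.ZipFile(path) as zf:
        return [
            (info.filename,
             bool(info.flag_bits & UTF8_FLAG),
             unicode_path_from_extra(info.extra))
            for info in zf.infolist()
        ]
```

Calling `describe` on an archive shows, for each entry, whether bit 11 is set and whether a 0x7075 field is present. The stored name in the first column is whatever zipfile decoded, so for entries without bit 11 it may itself be mojibake.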

comment:5 Changed 4 years ago by Even Rouault

Do you have an example of such zip with cp866 + utf-8 names ?

Changed 4 years ago by oleinik

Attachment: a.zip added

created by linux, uses general purpose bit 11

Changed 4 years ago by oleinik

Attachment: b.zip added

created by 7zip, uses general purpose bit 11

Changed 4 years ago by oleinik

Attachment: c.zip added

created by winzip, uses combination cp866 + utf-8

Changed 4 years ago by oleinik

Attachment: d.zip added

created by winrar, uses combination cp866 + utf-8

comment:6 Changed 4 years ago by Even Rouault

Milestone: 2.0
Resolution: fixed
Status: new → closed

trunk r26882 "/vsizip/: use the extented field for UTF-8 filenames to retrieve UTF-8 filenames when the base filename field contains CP437 filename (#5361)"

comment:7 in reply to:  6 Changed 4 years ago by oleinik

Resolution: fixed (removed)
Status: closed → reopened

What about general purpose bit 11? Did you add an additional conversion to UTF-8 for the case where this bit isn't set?

comment:8 Changed 4 years ago by Even Rouault

In the a.zip and b.zip samples, the filenames are already in UTF-8. In the c.zip and d.zip samples, there's a 0x7075 extra field with the filenames in UTF-8. That's what I've implemented.

So conversion to UTF-8 would be needed if there's no 0x7075 and the general purpose bit 11 is not set ? Do you have such samples ?

And you mention CP866, but the ZIP spec only mentions CP437 as the default encoding.

Changed 4 years ago by oleinik

Attachment: e.zip added

This archive contains only CP866 names

comment:9 in reply to:  8 Changed 4 years ago by oleinik

Replying to rouault:

> So conversion to UTF-8 would be needed if there's no 0x7075 and the general purpose bit 11 is not set ?

YES.

> And you mention CP866, but the ZIP spec only mentions CP437 as the default encoding.

I think the best way (at least on Windows) is to use the CP_OEMCP code page as the source code page during conversion to UTF-8.

comment:10 Changed 4 years ago by Even Rouault

Resolution: fixed
Status: reopened → closed

trunk r26889 "/vsizip/ : recode filenames from CP437 (or CP_OEMCP on Windows), or encoding indicated in CPL_ZIP_ENCODING config option if there are not already in UTF-8 (general purpose bit 11 unset) and that extra field 0x7075 is not available (#5361)"
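The order of checks described in this commit message can be sketched in Python as follows. This is an illustration, not GDAL's actual code: the function `resolve_name` and its parameters are hypothetical, and `fallback_encoding` stands in for CP437, CP_OEMCP or the value of CPL_ZIP_ENCODING.

```python
def resolve_name(raw_name: bytes, flag_bits: int,
                 unicode_path=None, fallback_encoding="cp437"):
    """Pick the UTF-8 name of a zip entry, following the order in r26889."""
    # 1. General purpose bit 11 set: the stored name is already UTF-8.
    if flag_bits & 0x800:
        return raw_name.decode("utf-8")
    # 2. A 0x7075 extra field is available: use its UTF-8 name.
    if unicode_path is not None:
        return unicode_path
    # 3. Otherwise recode from the fallback code page.
    return raw_name.decode(fallback_encoding)
```

For an archive like e.zip, which contains only CP866 names, the third branch applies, so the result is correct only if the fallback encoding is actually CP866. The default CP437 would decode without error but yield the wrong characters.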

comment:11 Changed 3 years ago by Even Rouault

Milestone: 2.0 → 1.11.0