Opened 9 years ago
Closed 9 years ago
#2692 closed defect (fixed)
v.in.ascii does not handle text in quotes
Reported by: | wenzeslaus | Owned by: | |
---|---|---|---|
Priority: | normal | Milestone: | 7.0.3 |
Component: | Default | Version: | svn-trunk |
Keywords: | CSV, doublequote, singlequote, text delimiter | Cc: | |
CPU: | Unspecified | Platform: | Unspecified |
Description
Since r63581 G7:v.in.ascii should be able to handle "CSV with quoted text fields" (text option), but the quotes are stored in the database. This is the result from v.db.select (same thing in sqliteman):
100|437343.6704|4061363.41525|"High Erosion"|"Low Deposition"
To reproduce, run:
$ cd grass/src
$ ./bin.../grass71 ... -text ...
> cd vector/v.in.ascii/testsuite/
> python test_csv.py
I added these tests in r65426 and r65427 if you wish to review them.
Test results online:
G_tokenize2()
v.in.ascii
Attachments (3)
Change History (19)
follow-up: 2 comment:1 by , 9 years ago
by , 9 years ago
Attachment: | tokenize_with_textdelim.diff added |
---|
follow-up: 3 comment:2 by , 9 years ago
Replying to mlennert:
Replying to wenzeslaus:
Since r63581 G7:v.in.ascii should be able to handle "CSV with quoted text fields" (text option), but the quotes are stored in the database. This is the result from v.db.select (same thing in sqliteman):
100|437343.6704|4061363.41525|"High Erosion"|"Low Deposition"
IIUC, this comes from this part of lib/gis/token.c. The p++ just advances the pointer like for any other character. I guess that in order to skip it, one would have to create a second pointer which takes a copy of the first, but advances only when the first one is not equal to the character you want to eliminate. Or something like that. But my pointer foo is much too low to handle this directly. Anyone more C-savvy who could do this in 5 minutes?
The attached patch seems to solve the problem for me, but it feels like a quick fix and there probably is a better way...
Moritz
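For illustration only (the real fix belongs in C, in lib/gis/token.c), the copy-and-skip idea described above could be sketched in Python roughly like this; the function name and signature are invented for this sketch:

```python
def strip_text_delimiters(field, quote="'"):
    """Mimic a two-pointer copy: the read pointer advances for every
    character, the write pointer advances only for non-quote characters,
    so quote characters are simply dropped from the output."""
    out = []
    for ch in field:        # read pointer: always advances
        if ch != quote:     # write pointer: skips the text delimiter
            out.append(ch)
    return "".join(out)

print(strip_text_delimiters("'High Erosion'"))  # High Erosion
```

As the follow-up comments point out, dropping quotes this way is only a quick fix: it also destroys quotes that are part of the field content.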
follow-ups: 4 6 comment:3 by , 9 years ago
Replying to mlennert:
The attached patch seems to solve the problem for me, but it feels like a quick fix and there probably is a better way...
Simply removing the quotes isn't sufficient to support CSV. It's necessary to treat the field separator as a normal character when it occurs within a quoted field (that's the point of using quotes), and also to replace two consecutive quotes with a single quote (so that the quote character itself can occur within a field).
For that, an explicit state machine is likely to be more legible than ad-hoc logic.
Off the top of my head:
classes: delimiter, separator, newline, other
states: start, in_quote, after_quote, error
actions: no_op, add_char, new_field, end_record, error

start,delimiter       -> in_quote,no_op
start,separator       -> start,new_field
start,newline         -> start,end_record
start,other           -> start,add_char
in_quote,delimiter    -> after_quote,no_op
in_quote,separator    -> in_quote,add_char
in_quote,newline      -> error,error
in_quote,other        -> in_quote,add_char
after_quote,delimiter -> in_quote,add_char
after_quote,separator -> start,new_field
after_quote,newline   -> start,end_record
after_quote,other     -> error,error
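A minimal Python sketch of that transition table (illustration only; the actual implementation is C in lib/gis/token.c, and newline handling is folded into end-of-string here). In the table, "delimiter" is the text delimiter (quote character) and "separator" is the field separator:

```python
def tokenize(line, sep="|", quote='"'):
    """Split one record following the state machine above."""
    S_START, S_IN_QUOTE, S_AFTER_QUOTE = range(3)
    state = S_START
    fields, buf = [], []

    def new_field():
        fields.append("".join(buf))
        buf.clear()

    for ch in line:
        if state == S_START:
            if ch == quote:            # start,delimiter -> in_quote,no_op
                state = S_IN_QUOTE
            elif ch == sep:            # start,separator -> start,new_field
                new_field()
            else:                      # start,other -> start,add_char
                buf.append(ch)
        elif state == S_IN_QUOTE:
            if ch == quote:            # in_quote,delimiter -> after_quote,no_op
                state = S_AFTER_QUOTE
            else:                      # separator and other chars are literal
                buf.append(ch)
        else:  # S_AFTER_QUOTE
            if ch == quote:            # doubled quote -> literal quote char
                buf.append(ch)
                state = S_IN_QUOTE
            elif ch == sep:            # after_quote,separator -> start,new_field
                new_field()
                state = S_START
            else:                      # after_quote,other -> error
                raise ValueError("unexpected character after closing quote")
    new_field()                        # end of string acts as end_record
    return fields

print(tokenize('1|"High Erosion"|"say ""hi""|ok"'))
# ['1', 'High Erosion', 'say "hi"|ok']
```

Note how a separator inside quotes stays literal and a doubled quote collapses to a single quote character, which is exactly what stripping quotes alone cannot do.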
follow-up: 5 comment:4 by , 9 years ago
Replying to glynn:
Replying to mlennert:
The attached patch seems to solve the problem for me, but it feels like a quick fix and there probably is a better way...
Simply removing the quotes isn't sufficient to support CSV. It's necessary to treat the field separator as a normal character when it occurs within a quoted field (that's the point of using quotes), and also to replace two consecutive quotes with a single quote (so that the quote character itself can occur within a field).
For that, an explicit state machine is likely to be more legible than ad-hoc logic.
Sounds like a fun programming challenge (for me it's a challenge, at least), but I wonder whether we couldn't just integrate the work of others in the same domain, notably libcsv. It's GPL (2 or higher) and it's fairly lightweight, so we could just integrate it into the GRASS source (although Debian ships it as a package as well; I don't know about other distributions, nor about OSX and Windows). AFAICS, there have been no modifications since 2013, so it does not look like a heavy dependency to track.
comment:5 by , 9 years ago
Replying to mlennert:
Replying to glynn:
Replying to mlennert:
The attached patch seems to solve the problem for me, but it feels like a quick fix and there probably is a better way...
Simply removing the quotes isn't sufficient to support CSV. It's necessary to treat the field separator as a normal character when it occurs within a quoted field (that's the point of using quotes), and also to replace two consecutive quotes with a single quote (so that the quote character itself can occur within a field).
For that, an explicit state machine is likely to be more legible than ad-hoc logic.
Sounds like a fun programming challenge (for me it's a challenge, at least), but I wonder whether we couldn't just integrate the work of others in the same domain, notably libcsv. It's GPL (2 or higher) and it's fairly lightweight, so we could just integrate it into the GRASS source (although Debian ships it as a package as well; I don't know about other distributions, nor about OSX and Windows). AFAICS, there have been no modifications since 2013, so it does not look like a heavy dependency to track.
Another option would be to use Python's csv module and rewrite v.in.ascii in Python...
Moritz
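For what it's worth, Python's csv module already implements the quoting rules discussed above (separator kept literal inside quotes, doubled quote collapsed to one). A quick sketch; the embedded separator and doubled quotes in this line are invented for the example, not taken from the ticket data:

```python
import csv
import io

# A pipe-separated record where one quoted field contains both the
# separator and a doubled (escaped) quote character.
line = '100|437343.6704|"High Erosion"|"Low|""wet"" Deposition"'
row = next(csv.reader(io.StringIO(line), delimiter="|", quotechar='"'))
print(row)
# ['100', '437343.6704', 'High Erosion', 'Low|"wet" Deposition']
```

The reader's doublequote option (on by default) is what turns `""` inside a quoted field into a literal quote.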
follow-up: 7 comment:6 by , 9 years ago
Replying to glynn:
For that, an explicit state machine is likely to be more legible than ad-hoc logic.
Please test attachment:tokenise.diff
An external library may be worth using for improved fault-tolerance (CSV is a rather loose "standard", to say the least). But any such dependency should be
- on specific modules (e.g. v.in.ascii), not lib/gis (i.e. G_tokenize), and
- an optional alternative to G_tokenize(), i.e. modules should still compile and work if the library isn't available.
Python is far too heavyweight a dependency for such a task.
follow-up: 8 comment:7 by , 9 years ago
Replying to glynn:
Replying to glynn:
For that, an explicit state machine is likely to be more legible than ad-hoc logic.
Please test attachment:tokenise.diff
Great, thanks!
I propose two small changes (attached tokenise_corrected.diff). One fixes what seems to be just a typo (in case A_END_RECORD, "*q++ - '\0';" should be "*q++ = '\0';"). The other comes from the fact that when we are in state AFTER_QUOTE and we reach a separator, we have to go back to state S_START; otherwise, if the next field starts again with a quote, that quote is treated as a second (escaped) quote.
Using the following example:
echo "123|123|1|test1|'test2'|'\"test3\"'|'test''4'" | v.in.ascii in=- out=testtext text=singlequote --o
With your patch:
> v.db.select testtext
cat|int_1|int_2|int_3|str_1|str_2|str_3|str_4
1|123|123|1|test1|test2|'"test3"|'test'4t''4'
With the correction:
> v.db.select testtext
cat|int_1|int_2|int_3|str_1|str_2|str_3|str_4
1|123|123|1|test1|test2|"test3"|test'4
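As an independent cross-check of the corrected output (not part of the patch), Python's csv module splits the same line the same way when the quote character is a single quote:

```python
import csv
import io

# The line from the echo example above, with shell escaping resolved.
line = "123|123|1|test1|'test2'|'\"test3\"'|'test''4'"
row = next(csv.reader(io.StringIO(line), delimiter="|", quotechar="'"))
print(row)
# ['123', '123', '1', 'test1', 'test2', '"test3"', "test'4"]
```

The doubled single quote in `'test''4'` collapses to one literal quote, matching str_4 in the corrected v.db.select output.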
An external library may be worth using for improved fault-tolerance (CSV is a rather loose "standard", to say the least). But any such dependency should be
- on specific modules (e.g. v.in.ascii), not lib/gis (i.e. G_tokenize), and
- an optional alternative to G_tokenize(), i.e. modules should still compile and work if the library isn't available.
Python is far too heavyweight a dependency for such a task.
Well, I thought about a new module v.in.ascii2/v.in.csv which would be based on the Python csv module. As Python is a dependency anyway at the module level, this shouldn't be a problem. But I think that with your patch this particular bug is solved, and we can leave the handling of more complex CSV files to other tools, which people can use to prepare the data for v.in.ascii.
by , 9 years ago
Attachment: | tokenise_corrected.diff added |
---|
follow-up: 11 comment:8 by , 9 years ago
Replying to mlennert:
I propose two small changes (attached tokenise_corrected.diff), one seems just a typo (in case A_END_RECORD: "*q++ - '\0';") and the other comes from the fact that when we are in state AFTER_QUOTE and we reach a delimiter, we have to go back to state S_START. Otherwise if the next field starts again with a quote, this quote is treated as a second quote.
Indeed, good catches.
Applied as r65499.
follow-up: 12 comment:11 by , 9 years ago
Replying to glynn:
Replying to mlennert:
I propose two small changes (attached tokenise_corrected.diff), one seems just a typo (in case A_END_RECORD: "*q++ - '\0';") and the other comes from the fact that when we are in state AFTER_QUOTE and we reach a delimiter, we have to go back to state S_START. Otherwise if the next field starts again with a quote, this quote is treated as a second quote.
Indeed, good catches.
Applied as r65499.
The backport is still pending. Sync lib/gis/token.c to trunk?
follow-up: 13 comment:12 by , 9 years ago
comment:13 by , 9 years ago
comment:15 by , 9 years ago
Replying to martinl:
Can we close the ticket?
The tests I added when opening the ticket are running fine, and the fix is now backported, but perhaps Moritz can review whether the tests cover his use cases:
Replying to wenzeslaus:
IIUC, this comes from this part of lib/gis/token.c. The p++ just advances the pointer like for any other character. I guess that in order to skip it, one would have to create a second pointer which takes a copy of the first, but advances only when the first one is not equal to the character you want to eliminate. Or something like that. But my pointer foo is much too low to handle this directly. Anyone more C-savvy who could do this in 5 minutes?