Opened 6 years ago

Closed 6 years ago

#2692 closed defect (fixed)

v.in.ascii does not handle text in qoutes

Reported by: wenzeslaus Owned by: grass-dev@…
Priority: normal Milestone: 7.0.3
Component: Default Version: svn-trunk
Keywords: CSV, doublequote, singlequote, text delimiter Cc:
CPU: Unspecified Platform: Unspecified

Description

Since r63581 G7:v.in.ascii should be able to handle "CSV with quoted text fields" (text option) but the quotes are stored to the database. This is the result from v.db.select (same thing in sqliteman):

100|437343.6704|4061363.41525|"High Erosion"|"Low Deposition"

To reproduce, run:

$ cd grass/src
$ ./bin.../grass71 ... -text
...
> cd vector/v.in.ascii/testsuite/
> python test_csv.py

I added these tests in r65426 and r65427 if you wish to review them.

G_tokenize2() results online:

v.in.ascii results online:

Attachments (3)

tokenize_with_textdelim.diff (467 bytes) - added by mlennert 6 years ago.
tokenise.diff (2.5 KB) - added by glynn 6 years ago.
improved tokeniser
tokenise_corrected.diff (2.5 KB) - added by mlennert 6 years ago.

Download all attachments as: .zip

Change History (19)

comment:1 in reply to:  description ; Changed 6 years ago by mlennert

Replying to wenzeslaus:

Since r63581 G7:v.in.ascii should be able to handle "CSV with quoted text fields" (text option) but the quotes are stored to the database. This is the result from v.db.select (same thing in sqliteman):

100|437343.6704|4061363.41525|"High Erosion"|"Low Deposition"

IIUC, this comes from this part of lib/gis/token.c. The p++ just advances the pointer like for any other character. I guess that in order to skip it, one would have to create a second pointer which takes a copy of the first, but advances only when the first one is not equal to the character you want to eliminate. Or something like that. But my pointer foo is much too low to handle this directly. Anyone more C-savy who could do this in 5 minutes ?

Changed 6 years ago by mlennert

comment:2 in reply to:  1 ; Changed 6 years ago by mlennert

Replying to mlennert:

Replying to wenzeslaus:

Since r63581 G7:v.in.ascii should be able to handle "CSV with quoted text fields" (text option) but the quotes are stored to the database. This is the result from v.db.select (same thing in sqliteman):

100|437343.6704|4061363.41525|"High Erosion"|"Low Deposition"

IIUC, this comes from this part of lib/gis/token.c. The p++ just advances the pointer like for any other character. I guess that in order to skip it, one would have to create a second pointer which takes a copy of the first, but advances only when the first one is not equal to the character you want to eliminate. Or something like that. But my pointer foo is much too low to handle this directly. Anyone more C-savy who could do this in 5 minutes ?

The attached patch seems to solve the problem for me, but it feels like a quick fix and there probably is a better way...

Moritz

comment:3 in reply to:  2 ; Changed 6 years ago by glynn

Replying to mlennert:

The attached patch seems to solve the problem for me, but it feels like a quick fix and there probably is a better way...

Simply removing the quotes isn't sufficient to support CSV. It's necessary to treat the field separator as a normal character when it occurs within a quoted field (that's the point of using quotes), and also to replace two consecutive quotes with a single quote (so that the quote character itself can occur within a field).

For that, an explicit state machine is likely to be more legible than ad-hoc logic.

Off the top of my head:

classes: delimiter, separator, newline, other
states: start, in_quote, after_quote, error
actions: no_op, add_char, new_field, end_record, error

start,delimiter -> in_quote,no_op
start,separator -> start,new_field
start,newline -> start,end_record
start,other -> start,add_char
in_quote,delimiter -> after_quote,no_op
in_quote,separator -> in_quote,add_char
in_quote,newline -> error,error
in_quote,other -> in_quote,add_char
after_quote,delimiter -> in_quote,add_char
after_quote,separator -> start,new_field
after_quote,newline -> start,end_record
after_quote,other -> error,error

comment:4 in reply to:  3 ; Changed 6 years ago by mlennert

Replying to glynn:

Replying to mlennert:

The attached patch seems to solve the problem for me, but it feels like a quick fix and there probably is a better way...

Simply removing the quotes isn't sufficient to support CSV. It's necessary to treat the field separator as a normal character when it occurs within a quoted field (that's the point of using quotes), and also to replace two consecutive quotes with a single quote (so that the quote character itself can occur within a field).

For that, an explicit state machine is likely to be more legible than ad-hoc logic.

Sounds like a fun programming challenge (for it's a challenge at least), but I wonder whether we couldn't just integrate the work of others in the same domain, notably libcsv. It's GPL (2 or higher) and its fairly lightweight, so we could just integrate it into the GRASS source (although Debian ships it as a package as well, don't know for other distributions, nor for OSX and Windows). AFAICS, there have been no modifications since 2013, so it does not look like a heavy dependency to track.

comment:5 in reply to:  4 Changed 6 years ago by mlennert

Replying to mlennert:

Replying to glynn:

Replying to mlennert:

The attached patch seems to solve the problem for me, but it feels like a quick fix and there probably is a better way...

Simply removing the quotes isn't sufficient to support CSV. It's necessary to treat the field separator as a normal character when it occurs within a quoted field (that's the point of using quotes), and also to replace two consecutive quotes with a single quote (so that the quote character itself can occur within a field).

For that, an explicit state machine is likely to be more legible than ad-hoc logic.

Sounds like a fun programming challenge (for it's a challenge at least), but I wonder whether we couldn't just integrate the work of others in the same domain, notably libcsv. It's GPL (2 or higher) and its fairly lightweight, so we could just integrate it into the GRASS source (although Debian ships it as a package as well, don't know for other distributions, nor for OSX and Windows). AFAICS, there have been no modifications since 2013, so it does not look like a heavy dependency to track.

Another option would be to use Python's csv module and rewrite v.in.ascii in Python...

Moritz

Changed 6 years ago by glynn

Attachment: tokenise.diff added

improved tokeniser

comment:6 in reply to:  3 ; Changed 6 years ago by glynn

Replying to glynn:

For that, an explicit state machine is likely to be more legible than ad-hoc logic.

Please test attachment:tokenise.diff

An external library may be worth using for improved fault-tolerance (CSV is a rather loose "standard", to say the least). But any such dependency should be

  1. on specific modules (e.g. v.in.ascii), not lib/gis (i.e. G_tokenize), and
  2. an optional alternative to G_tokenize(), i.e. modules should still compile and work if the library isn't available.

Python is far too heavyweight a dependency for such a task.

comment:7 in reply to:  6 ; Changed 6 years ago by mlennert

Replying to glynn:

Replying to glynn:

For that, an explicit state machine is likely to be more legible than ad-hoc logic.

Please test attachment:tokenise.diff

Great, thanks !

I propose two small changes (attached tokenise_corrected.diff), one seems just a typo (in case A_END_RECORD: "*q++ - '\0';") and the other comes from the fact that when we are in state AFTER_QUOTE and we reach a delimiter, we have to go back to state S_START. Otherwise if the next field starts again with a quote, this quote is treated as a second quote.

Using the following example:

echo "123|123|1|test1|'test2'|'\"test3\"'|'test''4'"  | v.in.ascii in=- out=testtext text=singlequote --o

With your patch:

> v.db.select testtextcat|int_1|int_2|int_3|str_1|str_2|str_3|str_4
1|123|123|1|test1|test2|'"test3"|'test'4t''4'

With the correction:

> v.db.select testtextcat|int_1|int_2|int_3|str_1|str_2|str_3|str_4
1|123|123|1|test1|test2|"test3"|test'4

An external library may be worth using for improved fault-tolerance (CSV is a rather loose "standard", to say the least). But any such dependency should be

  1. on specific modules (e.g. v.in.ascii), not lib/gis (i.e. G_tokenize), and
  2. an optional alternative to G_tokenize(), i.e. modules should still compile and work if the library isn't available.

Python is far too heavyweight a dependency for such a task.

Well, I thought about a new module v.in.ascii2/v.in.csv which would be based on the Python csv module. As Python is a dependency anyway so on module level, this shouldn't be a problem. But I think that with your patch this particular bug is solved, and that we can leave handling of more complex csv files to other tools which people can use to prepare the data for v.in.ascii.

Changed 6 years ago by mlennert

Attachment: tokenise_corrected.diff added

comment:8 in reply to:  7 ; Changed 6 years ago by glynn

Replying to mlennert:

I propose two small changes (attached tokenise_corrected.diff), one seems just a typo (in case A_END_RECORD: "*q++ - '\0';") and the other comes from the fact that when we are in state AFTER_QUOTE and we reach a delimiter, we have to go back to state S_START. Otherwise if the next field starts again with a quote, this quote is treated as a second quote.

Indeed, good catches.

Applied as r65499.

comment:9 Changed 6 years ago by neteler

Milestone: 7.0.17.0.2

Ticket retargeted after 7.0.1 milestone closed

comment:10 Changed 6 years ago by neteler

Milestone: 7.0.27.0.3

Ticket retargeted after milestone closed

comment:11 in reply to:  8 ; Changed 6 years ago by neteler

Replying to glynn:

Replying to mlennert:

I propose two small changes (attached tokenise_corrected.diff), one seems just a typo (in case A_END_RECORD: "*q++ - '\0';") and the other comes from the fact that when we are in state AFTER_QUOTE and we reach a delimiter, we have to go back to state S_START. Otherwise if the next field starts again with a quote, this quote is treated as a second quote.

Indeed, good catches.

Applied as r65499.

The backport is still pending. Sync lib/gis/token.c to trunk?

comment:12 in reply to:  11 ; Changed 6 years ago by glynn

Replying to neteler:

The backport is still pending. Sync lib/gis/token.c to trunk?

I think so.

comment:13 in reply to:  12 Changed 6 years ago by martinl

Replying to glynn:

Replying to neteler:

The backport is still pending. Sync lib/gis/token.c to trunk?

I think so.

done in r67322

comment:14 Changed 6 years ago by martinl

Can we close the ticket?

comment:15 in reply to:  14 Changed 6 years ago by wenzeslaus

Replying to martinl:

Can we close the ticket?

The tests I added when opening the ticket are running fine and it is now backported but perhaps Moritz can review the tests if they cover his use cases:

comment:16 Changed 6 years ago by mlennert

Resolution: fixed
Status: newclosed

Looks good to me.

Closing.

Note: See TracTickets for help on using tickets.