Opened 6 months ago
Closed 5 weeks ago
#5695 closed defect (fixed)
standardize_address incorrect handling of directionals address_standardizer_data_us us_lex
Reported by: | tmcsys | Owned by: | robe |
---|---|---|---|
Priority: | medium | Milestone: | PostGIS 3.3.7 |
Component: | pagc_address_parser | Version: | 3.4.x |
Keywords: | Cc: | tmcsys |
Description
standardize_address in PostGIS 3.4.1 returns different results than the example on page 210 of PostGIS in Action, 2nd edition.
It fails to identify basic pre-directionals, instead including them in the output as part of the house number.
The only pre-directional that appears to be correctly recognized by standardize_address is ’S’ as ‘SOUTH’.
debug_standardize_address correctly identifies ‘E’ or ‘W’ as a directional token.
# SELECT * FROM standardize_address('us_lex', 'us_gaz', 'us_rules', 'ONE E PIMA ST STE 999, TUCSON, AZ'); ┌──────────┬───────────┬────────┬──────┬─────────┬──────┬─────────┬────────┬────────────┬───────┬────────┬─────────┬─────────┬──────────┬─────┬───────────┐ │ building │ house_num │ predir │ qual │ pretype │ name │ suftype │ sufdir │ ruralroute │ extra │ city │ state │ country │ postcode │ box │ unit │ ├──────────┼───────────┼────────┼──────┼─────────┼──────┼─────────┼────────┼────────────┼───────┼────────┼─────────┼─────────┼──────────┼─────┼───────────┤ │ ¤ │ 1 E │ ¤ │ ¤ │ ¤ │ PIMA │ STREET │ ¤ │ ¤ │ ¤ │ TUCSON │ ARIZONA │ USA │ ¤ │ ¤ │ SUITE 999 │ └──────────┴───────────┴────────┴──────┴─────────┴──────┴─────────┴────────┴────────────┴───────┴────────┴─────────┴─────────┴──────────┴─────┴───────────┘ (1 row) # SELECT * FROM standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 W TROPICAIRE BLVD,NORTH PORT,FL, 34291'); ┌──────────┬───────────┬────────┬──────┬─────────┬────────────┬───────────┬────────┬────────────┬───────┬────────────┬─────────┬─────────┬──────────┬─────┬──────┐ │ building │ house_num │ predir │ qual │ pretype │ name │ suftype │ sufdir │ ruralroute │ extra │ city │ state │ country │ postcode │ box │ unit │ ├──────────┼───────────┼────────┼──────┼─────────┼────────────┼───────────┼────────┼────────────┼───────┼────────────┼─────────┼─────────┼──────────┼─────┼──────┤ │ ¤ │ 8907 W │ ¤ │ ¤ │ ¤ │ TROPICAIRE │ BOULEVARD │ ¤ │ ¤ │ ¤ │ NORTH PORT │ FLORIDA │ USA │ 34291 │ ¤ │ ¤ │ └──────────┴───────────┴────────┴──────┴─────────┴────────────┴───────────┴────────┴────────────┴───────┴────────────┴─────────┴─────────┴──────────┴─────┴──────┘ (1 row) # SELECT it->>'pos' AS position, it->>'word' AS word, it->>'stdword' AS standardized_word, [more]2-# it->>'token' AS token, it->>'token-code' AS token_code [more]3-# FROM jsonb( [more]3(# debug_standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 W TROPICAIRE BLVD, NORTH PORT,FL, 34291') [more]3(# ) AS s, jsonb_array_elements(s->'input_tokens') AS it; ┌──────────┬────────────┬───────────────────┬────────┬────────────┐ │ position │ word │ standardized_word │ token │ token_code │ ├──────────┼────────────┼───────────────────┼────────┼────────────┤ │ 0 │ 8907 │ 8907 │ NUMBER │ 0 │ │ 0 │ 8907 │ 8907 │ QUAD │ 29 │ │ 1 │ W │ W │ SINGLE │ 18 │ │ 1 │ W │ WEST │ DIRECT │ 22 │ │ 2 │ TROPICAIRE │ TROPICAIRE │ WORD │ 1 │ │ 3 │ BLVD │ BOULEVARD │ TYPE │ 2 │ └──────────┴────────────┴───────────────────┴────────┴────────────┘ (6 rows) # SELECT * FROM standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 S TROPICAIRE BLVD,NORTH PORT,FL, 34291'); ┌──────────┬───────────┬────────┬──────┬─────────┬────────────┬───────────┬────────┬────────────┬───────┬────────────┬─────────┬─────────┬──────────┬─────┬──────┐ │ building │ house_num │ predir │ qual │ pretype │ name │ suftype │ sufdir │ ruralroute │ extra │ city │ state │ country │ postcode │ box │ unit │ ├──────────┼───────────┼────────┼──────┼─────────┼────────────┼───────────┼────────┼────────────┼───────┼────────────┼─────────┼─────────┼──────────┼─────┼──────┤ │ ¤ │ 8907 │ SOUTH │ ¤ │ ¤ │ TROPICAIRE │ BOULEVARD │ ¤ │ ¤ │ ¤ │ NORTH PORT │ FLORIDA │ USA │ 34291 │ ¤ │ ¤ │ └──────────┴───────────┴────────┴──────┴─────────┴────────────┴───────────┴────────┴────────────┴───────┴────────────┴─────────┴─────────┴──────────┴─────┴──────┘ (1 row) SELECT city, country FROM parse_address('55 Laurier Avenue East, Room 3156, Ottawa ON K1N 6N5');
This potential defect was first noticed in PostgresAPP. You can read the original issue at https://github.com/PostgresApp/PostgresApp/issues/756 For more details expand the “My output looks like this:” button in the reply.
Change History (8)
comment:1 by , 6 months ago
Component: | postgis → pagc_address_parser |
---|---|
Owner: | changed from | to
comment:2 by , 6 months ago
comment:3 by , 6 months ago
I isolated the issue down to the change in us_lex, by confirming an old backup I have if I swap out the new us_lex with the old us_lex, the directional processing is correct.
However after purging the two entries I had introduced in #5299, the directional processing is still wrong.
The main difference between the old and new (besides the 2 entries I added) is the autogenerated id field which I had mistakenly assumed didn't matter but somehow even though the values are in same order as they were before, records are no longer in same order and that matters.
So I guess two bugs, here, but work around for now I'll force the insert order, if I force the insert order of entries to be as it was before, then those examples you have seem to behave correctly.
comment:5 by , 6 months ago
Milestone: | PostGIS 3.4.3 → PostGIS 3.3.7 |
---|
comment:6 by , 6 months ago
Summary: | standardize_address suspected of returning incorrect results → standardize_address incorrect handling of directionals address_standardizer_data_us us_lex |
---|
comment:8 by , 5 weeks ago
Resolution: | → fixed |
---|---|
Status: | new → closed |
Must be something about the data set that changed I'm guessing the weight of one of the items which I'll have to investigate possible my change in #5299 inadvertently introduced something when I was resorting the values.
I note that the one shipped with tiger_geocoder does the right thing with the direction.
outputs:
Which is why I'm thinking it's isolated to the address_data_us extension dataset.
I was thinking maybe the switched in regex library to newer pcre2, but I get the same failures you describe with a system running the old pcre and new pcre libraries.