Opened 2 months ago

Last modified 2 months ago

#5695 new defect

standardize_address incorrect handling of directionals address_standardizer_data_us us_lex

Reported by: tmcsys Owned by: robe
Priority: medium Milestone: PostGIS 3.3.7
Component: pagc_address_parser Version: 3.4.x
Keywords: Cc: tmcsys

Description

standardize_address in PostGIS 3.4.1 returns different results than the example on page 210 of PostGIS in Action, 2nd edition.

It fails to identify basic pre-directionals, instead including them in the output as part of the house number.

The only pre-directional that appears to be correctly recognized by standardize_address is ’S’ as ‘SOUTH’.

debug_standardize_address correctly identifies ‘E’ or ‘W’ as a directional token.

# SELECT * FROM standardize_address('us_lex', 'us_gaz', 'us_rules', 'ONE E PIMA ST STE 999, TUCSON, AZ');
┌──────────┬───────────┬────────┬──────┬─────────┬──────┬─────────┬────────┬────────────┬───────┬────────┬─────────┬─────────┬──────────┬─────┬───────────┐
│ building │ house_num │ predir │ qual │ pretype │ name │ suftype │ sufdir │ ruralroute │ extra │  city  │  state  │ country │ postcode │ box │   unit    │
├──────────┼───────────┼────────┼──────┼─────────┼──────┼─────────┼────────┼────────────┼───────┼────────┼─────────┼─────────┼──────────┼─────┼───────────┤
│ ¤        │ 1 E       │ ¤      │ ¤    │ ¤       │ PIMA │ STREET  │ ¤      │ ¤          │ ¤     │ TUCSON │ ARIZONA │ USA     │ ¤        │ ¤   │ SUITE 999 │
└──────────┴───────────┴────────┴──────┴─────────┴──────┴─────────┴────────┴────────────┴───────┴────────┴─────────┴─────────┴──────────┴─────┴───────────┘
(1 row)

# SELECT * FROM standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 W TROPICAIRE BLVD,NORTH PORT,FL, 34291');
┌──────────┬───────────┬────────┬──────┬─────────┬────────────┬───────────┬────────┬────────────┬───────┬────────────┬─────────┬─────────┬──────────┬─────┬──────┐
│ building │ house_num │ predir │ qual │ pretype │    name    │  suftype  │ sufdir │ ruralroute │ extra │    city    │  state  │ country │ postcode │ box │ unit │
├──────────┼───────────┼────────┼──────┼─────────┼────────────┼───────────┼────────┼────────────┼───────┼────────────┼─────────┼─────────┼──────────┼─────┼──────┤
│ ¤        │ 8907 W    │ ¤      │ ¤    │ ¤       │ TROPICAIRE │ BOULEVARD │ ¤      │ ¤          │ ¤     │ NORTH PORT │ FLORIDA │ USA     │ 34291    │ ¤   │ ¤    │
└──────────┴───────────┴────────┴──────┴─────────┴────────────┴───────────┴────────┴────────────┴───────┴────────────┴─────────┴─────────┴──────────┴─────┴──────┘
(1 row)

# SELECT it->>'pos' AS position, it->>'word' AS word, it->>'stdword' AS standardized_word,
[more]2-#          it->>'token' AS token, it->>'token-code' AS token_code
[more]3-#     FROM jsonb(
[more]3(#          debug_standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 W TROPICAIRE BLVD, NORTH PORT,FL, 34291')
[more]3(#     ) AS s, jsonb_array_elements(s->'input_tokens') AS it;
┌──────────┬────────────┬───────────────────┬────────┬────────────┐
│ position │    word    │ standardized_word │ token  │ token_code │
├──────────┼────────────┼───────────────────┼────────┼────────────┤
│ 0        │ 8907       │ 8907              │ NUMBER │ 0          │
│ 0        │ 8907       │ 8907              │ QUAD   │ 29         │
│ 1        │ W          │ W                 │ SINGLE │ 18         │
│ 1        │ W          │ WEST              │ DIRECT │ 22         │
│ 2        │ TROPICAIRE │ TROPICAIRE        │ WORD   │ 1          │
│ 3        │ BLVD       │ BOULEVARD         │ TYPE   │ 2          │
└──────────┴────────────┴───────────────────┴────────┴────────────┘
(6 rows)

# SELECT * FROM standardize_address('us_lex', 'us_gaz', 'us_rules', '8907 S TROPICAIRE BLVD,NORTH PORT,FL, 34291');
┌──────────┬───────────┬────────┬──────┬─────────┬────────────┬───────────┬────────┬────────────┬───────┬────────────┬─────────┬─────────┬──────────┬─────┬──────┐
│ building │ house_num │ predir │ qual │ pretype │    name    │  suftype  │ sufdir │ ruralroute │ extra │    city    │  state  │ country │ postcode │ box │ unit │
├──────────┼───────────┼────────┼──────┼─────────┼────────────┼───────────┼────────┼────────────┼───────┼────────────┼─────────┼─────────┼──────────┼─────┼──────┤
│ ¤        │ 8907      │ SOUTH  │ ¤    │ ¤       │ TROPICAIRE │ BOULEVARD │ ¤      │ ¤          │ ¤     │ NORTH PORT │ FLORIDA │ USA     │ 34291    │ ¤   │ ¤    │
└──────────┴───────────┴────────┴──────┴─────────┴────────────┴───────────┴────────┴────────────┴───────┴────────────┴─────────┴─────────┴──────────┴─────┴──────┘
(1 row)
SELECT city, country
FROM parse_address('55 Laurier Avenue East, Room 3156, Ottawa ON K1N 6N5');

This potential defect was first noticed in PostgresAPP. You can read the original issue at https://github.com/PostgresApp/PostgresApp/issues/756 For more details expand the “My output looks like this:” button in the reply.

Change History (7)

comment:1 by robe, 2 months ago

Component: postgispagc_address_parser
Owner: changed from pramsey to robe

comment:2 by robe, 2 months ago

Must be something about the data set that changed I'm guessing the weight of one of the items which I'll have to investigate possible my change in #5299 inadvertently introduced something when I was resorting the values.

I note that the one shipped with tiger_geocoder does the right thing with the direction.

CREATE EXTENSION postgis_tiger_geocoder CASCADE;
SELECT * FROM standardize_address('tiger.pagc_lex', 'tiger.pagc_gaz', 'tiger.pagc_rules', 'ONE E PIMA ST STE 999, TUCSON, AZ');

outputs:

building | house_num | predir | qual | pretype | name | suftype | sufdir | ruralroute | extra |  city  | state | country | postcode | box |   unit
----------+-----------+--------+------+---------+------+---------+--------+------------+-------+--------+-------+---------+----------+-----+-----------
          | 1         | E      |      |         | PIMA | ST      |        |            |       | TUCSON | AZ    | USA     |          |     | SUITE 999
(1 row)
SELECT * FROM standardize_address('tiger.pagc_lex', 'tiger.pagc_gaz', 'tiger.pagc_rules', '8907 W TROPICAIRE BLVD,NORTH PORT,FL, 34291');


 building | house_num | predir | qual | pretype |    name    | suftype | sufdir | ruralroute | extra |    city    | state | country | postcode | box | unit
----------+-----------+--------+------+---------+------------+---------+--------+------------+-------+------------+-------+---------+----------+-----+------
          | 8907      | W      |      |         | TROPICAIRE | BLVD    |        |            |       | NORTH PORT | FL    | USA     | 34291    |     |
(1 row)

Which is why I'm thinking it's isolated to the address_data_us extension dataset.

I was thinking maybe the switched in regex library to newer pcre2, but I get the same failures you describe with a system running the old pcre and new pcre libraries.

comment:3 by robe, 2 months ago

I isolated the issue down to the change in us_lex, by confirming an old backup I have if I swap out the new us_lex with the old us_lex, the directional processing is correct.

However after purging the two entries I had introduced in #5299, the directional processing is still wrong.

[722d8873/git#file3]

The main difference between the old and new (besides the 2 entries I added) is the autogenerated id field which I had mistakenly assumed didn't matter but somehow even though the values are in same order as they were before, records are no longer in same order and that matters.

So I guess two bugs, here, but work around for now I'll force the insert order, if I force the insert order of entries to be as it was before, then those examples you have seem to behave correctly.

comment:4 by Regina Obe <lr@…>, 2 months ago

In 990c57c8/git:

Guarantee order of us_lex items
References #5695 for PostGIS 3.5.0

comment:5 by robe, 2 months ago

Milestone: PostGIS 3.4.3PostGIS 3.3.7

comment:6 by robe, 2 months ago

Summary: standardize_address suspected of returning incorrect resultsstandardize_address incorrect handling of directionals address_standardizer_data_us us_lex

comment:7 by Regina Obe <lr@…>, 2 months ago

In ad75e78/git:

address_standardizer_data_us incorrect handling
of directionals
References #5695 for PostGIS 3.4.3

Note: See TracTickets for help on using tickets.