#5397 closed enhancement (fixed)

debug_standardize_address

Reported by: robe
Owned by: robe
Priority: high
Milestone: PostGIS 3.4.0
Component: pagc_address_parser
Version: master
Keywords:
Cc:

Description

This is a new function for the address_standardizer extension that lets you debug why an address is standardized the way it is.

It's most useful when an abbreviation or word can resolve to more than one kind of token, e.g. ST for STREET vs. ST for SAINT.
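
With the sample us_lex/us_gaz/us_rules tables loaded, you can see the two resolutions side by side. A minimal sketch (the address here is hypothetical, chosen to trigger the ambiguity; actual results depend on the lexicon and gazetteer contents):

SELECT s.name, s.suftype
FROM standardize_address('us_lex', 'us_gaz', 'us_rules',
     '1 St Charles St', 'Boston, MA 02109') AS s;
-- "St" inside the street name should resolve to SAINT,
-- while the trailing "St" should resolve to STREET (the suffix type).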

The way the address standardizer works, it generates a standardized list of all the tokens in an address and then applies rules to them. Often more than one rule can produce an answer, and the address standardizer picks the one with the highest score.

This function allows you to look into the guts of what it is doing.

I'm still working locally on fitting all the bits together and am almost there. One open question is whether I should expose this as plain json text (similar to ST_AsGeoJSON) or as a bona fide json or jsonb. One downside of jsonb is that it loses the ordering of the keys.

I went with JSON, per Leo's suggestion, because it allows maximum flexibility in terms of structure without having to change the output of the function.

For example:

SELECT s::jsonb
FROM debug_standardize_address('us_lex', 'us_gaz', 'us_rules',
     'One Devonshire Pl PH 301', 'Boston, MA 02109') AS s;

At the moment this returns:

{
  "rules": [
    {
      "no": 0,
      "score": 0.87625,
      "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 1,
      "score": 0.81125,
      "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 2,
      "score": 0.754167,
      "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 3,
      "score": 0.735937,
      "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 4,
      "score": 0.724167,
      "rule_string": "0 1 2 17 0 -1 1 5 5 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "STREET",
          "input-token-code": 2,
          "output-token-code": 5
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 5,
      "score": 0.70625,
      "rule_string": "0 1 2 2 0 -1 1 5 5 6 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "STREET",
          "input-token-code": 2,
          "output-token-code": 5
        },
        {
          "pos": 3,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    }
  ],
  "input_tokens": [
    {
      "pos": 0,
      "token": "NUMBER",
      "stdword": "1",
      "token-code": 0
    },
    {
      "pos": 0,
      "token": "WORD",
      "stdword": "1",
      "token-code": 1
    },
    {
      "pos": 1,
      "token": "WORD",
      "stdword": "DEVONSHIRE",
      "token-code": 1
    },
    {
      "pos": 2,
      "token": "TYPE",
      "stdword": "PLACE",
      "token-code": 2
    },
    {
      "pos": 3,
      "token": "TYPE",
      "stdword": "PATH",
      "token-code": 2
    },
    {
      "pos": 3,
      "token": "UNITT",
      "stdword": "PENTHOUSE",
      "token-code": 17
    },
    {
      "pos": 4,
      "token": "NUMBER",
      "stdword": "301",
      "token-code": 0
    }
  ]
}
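
Since the output is plain JSON, it can be expanded relationally with the stock jsonb functions, e.g. to pull the input tokens out into rows (a sketch using the key names shown above):

SELECT t->>'pos' AS pos, t->>'token' AS token, t->>'stdword' AS stdword
FROM debug_standardize_address('us_lex', 'us_gaz', 'us_rules',
     'One Devonshire Pl PH 301', 'Boston, MA 02109') AS s,
     jsonb_array_elements(s::jsonb->'input_tokens') AS t;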

Changes I need to make before I commit my work

1) Add an option for a single-address version. This will be the same function, but the last argument will become a default arg; when it is left blank, the function will run the input through the address parser first (much like the current standardize_address does when given only one address argument).

SELECT s::jsonb
FROM debug_standardize_address('us_lex', 'us_gaz', 'us_rules',
     'One Devonshire Pl PH 301, Boston, MA 02109') AS s;

2) Output the macro rules. (Not sure why they aren't coming through at the moment; these are the rules that parse the "Boston, MA 02109" part.)

3) Define for each rule what kind of rule it is.

4) Finish off the rule_string. Right now the rule_string is missing the type of the rule and the weight of the rule, which are needed to match it up easily in the rule table (see the sketch below). I also think there might be something wrong with my looping through the rules, since some rules appear to be repeated.
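
Once the type and weight are appended, matching a rule back to the rule table becomes a simple equality lookup. A sketch, assuming the packaged us_rules table with its id and rule columns (until then, the partial string only supports a prefix search):

SELECT id, rule
FROM us_rules
WHERE rule LIKE '0 1 2 17 0 -1 1 5 6 17 17%';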

5) Output the results, much like what this returns:

SELECT s.*
FROM standardize_address('us_lex', 'us_gaz', 'us_rules',
     'One Devonshire Pl PH 301', 'Boston, MA 02109') AS s;

This will be another array node in the document, where each element will be a jsonified stdaddr.

e.g.

"standardized_addresses": [ {"building":null,"house_num":"1","predir":null,"qual":null,"pretype":null,"name":"DEVONSHIRE","suftype":"PLACE","sufdir":null,"ruralroute":null,"extra":null,"city":"BOSTON","state":"MASSACHUSETTS","country":null,"postcode":"02109","box":null,"unit":"# PENTHOUSE 301"} ]

I was thinking of skipping the nulls and eventually adding the lower-ranking addresses after.
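
Building that node should mostly fall out of jsonifying the stdaddr composite. A sketch with stock PostgreSQL functions (jsonb_strip_nulls handles the null-skipping):

SELECT jsonb_strip_nulls(to_jsonb(s))
FROM standardize_address('us_lex', 'us_gaz', 'us_rules',
     'One Devonshire Pl PH 301', 'Boston, MA 02109') AS s;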

Change History (9)

comment:1 by Regina Obe <lr@…>, 11 months ago

In 06bed87/git:

debug_standardize_address function for debugging
address standardizer rules. JSON-formatted output.
References #5397

Includes

  • rule_string (currently missing the rule type and weight)
  • tokens, which include the input words and the corresponding standardized output
  • the stdaddr (best guess) address

comment:2 by Regina Obe <lr@…>, 11 months ago

In 58cd6af/git:

Cleanup
References #5397

  • Get rid of unused variable and declaration warnings
  • Revise documentation to show how you can expand the json document

comment:3 by Regina Obe <lr@…>, 11 months ago

In 6ec7d17/git:

Fix doco link. References #5397

comment:4 by Regina Obe <lr@…>, 11 months ago

In 1caae1c/git:

Fix doco link. References #5397

comment:5 by Regina Obe <lr@…>, 11 months ago

In fe02ad9/git:

Fix doco link. References #5397

comment:6 by robe, 10 months ago

Type: defect → enhancement

comment:7 by Regina Obe <lr@…>, 10 months ago

In 8cebf5d6/git:

CHANGES for debug_standardize_address
References #5397

  • Add tests for debug_standardize_address
  • Try to fix gitlab complaints
  • Turn back on address_standardizer testing on winnie
  • Fix postgis_tiger_geocoder regress not working. References #5254
  • Turn back on winnie postgis_tiger_geocoder regression (it's going to fail because of #5408)

comment:8 by robe, 10 months ago

I still need to add in the rule id and the full rule string, and I'm also hoping to figure out adding the rules for the macro part (not sure why they are not coming through).

comment:9 by Regina Obe <lr@…>, 10 months ago

Resolution: fixed
Status: new → closed

In d901678/git:

debug_standardize_address

  • Add rule_string and rule_id, by looking them up in the rule table
  • If no matching rule is found (shouldn't happen, but it does), return -1 for rule_id
  • Also return rule_stub_string, which is the filter used to search for the rule in the rule table
  • Add test to output rule_string/rule_stub_string

Closes #5397
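
With rule_id exposed, each rule in the debug output can be joined back to the rule table. A sketch, assuming rule_id appears on each element of the rules array and the packaged us_rules table is present (a rule_id of -1 simply finds no match):

SELECT r->>'rule_id' AS rule_id,
       r->>'rule_string' AS rule_string,
       ur.rule
FROM debug_standardize_address('us_lex', 'us_gaz', 'us_rules',
     'One Devonshire Pl PH 301', 'Boston, MA 02109') AS s,
     jsonb_array_elements(s::jsonb->'rules') AS r
LEFT JOIN us_rules ur ON ur.id = (r->>'rule_id')::int;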
