#5397 closed enhancement (fixed)

debug_standardize_address

Reported by: robe Owned by: robe
Priority: high Milestone: PostGIS 3.4.0
Component: pagc_address_parser Version: master
Keywords: Cc:

Description

This is a new function for address_standardizer extension that allows you to debug why an address is standardized the way it is.

It's most useful when abbreviations or words can resolve to more than one kind of token - e.g. ST for Street vs. ST for SAINT.

The way the address standardizer works, it generates a standardized list of all tokens in an address and then applies rules to them. In many cases many rules can result in some answer and the address standardizer picks the one with the highest score.

This function allows you to look into the guts of what it is doing.

I'm still working locally fitting all the bits and am almost there. One thought is if I should expose this as just json text (similar to ST_AsGeoJSON) or as a bonified jsonb or json. I don't like that jsonb looses the sorting of the keys so that is one downside.

I went with JSON as Leo's suggestion cause it allows maximum flexibility in turns of structure without having to change the output of the function.

For example

SELECT s::jsonb
FROM debug_standardize_address( 'us_lex', 'us_gaz','us_rules','One Devonshire Pl PH 301', 'Boston, MA 02109') AS s;

returns at the moment:

{
  "rules": [
    {
      "no": 0,
      "score": 0.87625,
      "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 1,
      "score": 0.81125,
      "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 2,
      "score": 0.754167,
      "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 3,
      "score": 0.735937,
      "rule_string": "0 1 2 17 0 -1 1 5 6 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 4,
      "score": 0.724167,
      "rule_string": "0 1 2 17 0 -1 1 5 5 17 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "STREET",
          "input-token-code": 2,
          "output-token-code": 5
        },
        {
          "pos": 3,
          "input-token": "UNITT",
          "output-token": "UNITT",
          "input-token-code": 17,
          "output-token-code": 17
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    },
    {
      "no": 5,
      "score": 0.70625,
      "rule_string": "0 1 2 2 0 -1 1 5 5 6 17",
      "rule_tokens": [
        {
          "pos": 0,
          "input-token": "NUMBER",
          "output-token": "HOUSE",
          "input-token-code": 0,
          "output-token-code": 1
        },
        {
          "pos": 1,
          "input-token": "WORD",
          "output-token": "STREET",
          "input-token-code": 1,
          "output-token-code": 5
        },
        {
          "pos": 2,
          "input-token": "TYPE",
          "output-token": "STREET",
          "input-token-code": 2,
          "output-token-code": 5
        },
        {
          "pos": 3,
          "input-token": "TYPE",
          "output-token": "SUFTYP",
          "input-token-code": 2,
          "output-token-code": 6
        },
        {
          "pos": 4,
          "input-token": "NUMBER",
          "output-token": "UNITT",
          "input-token-code": 0,
          "output-token-code": 17
        }
      ]
    }
  ],
  "input_tokens": [
    {
      "pos": 0,
      "token": "NUMBER",
      "stdword": "1",
      "token-code": 0
    },
    {
      "pos": 0,
      "token": "WORD",
      "stdword": "1",
      "token-code": 1
    },
    {
      "pos": 1,
      "token": "WORD",
      "stdword": "DEVONSHIRE",
      "token-code": 1
    },
    {
      "pos": 2,
      "token": "TYPE",
      "stdword": "PLACE",
      "token-code": 2
    },
    {
      "pos": 3,
      "token": "TYPE",
      "stdword": "PATH",
      "token-code": 2
    },
    {
      "pos": 3,
      "token": "UNITT",
      "stdword": "PENTHOUSE",
      "token-code": 17
    },
    {
      "pos": 4,
      "token": "NUMBER",
      "stdword": "301",
      "token-code": 0
    }
  ]
}

Changes I need to make before I commit my work

1) Add an option for the single address version (this will be the same function, but the last will be a default arg, which when left blank will call parse address first (much like current one does when only one arg).

SELECT s::jsonb
FROM debug_standardize_address( 'us_lex', 'us_gaz','us_rules','One Devonshire Pl PH 301, Boston, MA 02109') AS s;

2) Output the macro rules (Not sure why it isn't at moment. This is the one that parses Boston, MA 02109 part 3) Define for each rule what kind of rule it is 4) Finish off the rulestring. Right now the rule_string is missing the type of rule and the weight of the rule to allow it to be easily matched up in the rule table. I also think there might be something wrong with my looping thru the rules since some rules appear to be repeated.

5) Output the results much like what:

SELECT s.*
FROM standardize_address( 'us_lex', 'us_gaz','us_rules','One Devonshire Pl PH 301', 'Boston, MA 02109') AS s;

This will be another array node in the document: where each element with be a jsonified stdaddress

e.g.

"standardized_addresses": [ {"building":null,"house_num":"1","predir":null,"qual":null,"pretype":null,"name":"DEVONSHIRE","suftype":"PLACE","sufdir":null,"ruralroute":null,"extra":null,"city":"BOSTON","state":"MASSACHUSETTS","country":null,"postcode":"02109","box":null,"unit":"# PENTHOUSE 301"} ]

I was thinking of skipping the nulls and eventually adding the lower ranking addresses after.

Change History (9)

comment:1 by Regina Obe <lr@…>, 15 months ago

In 06bed87/git:

debug_standardize_address function for debugging
address standardizer rules. Json formatted output
References #5397

Includes

  • rulestring (currently missing the rule type and weight
  • tokens which includes input words and corresponding standardize output
  • the stdaddr (best guess) address

comment:2 by Regina Obe <lr@…>, 15 months ago

In 58cd6af/git:

Cleanup
References #5397

  • Get rid of unused variable and declaration warnings
  • Revise documentation to show how you can expand the json document

comment:3 by Regina Obe <lr@…>, 15 months ago

In 6ec7d17/git:

Fix doco link. References #5397

comment:4 by Regina Obe <lr@…>, 15 months ago

In 1caae1c/git:

Fix doco link. References #5397

comment:5 by Regina Obe <lr@…>, 15 months ago

In fe02ad9/git:

Fix doco link. References #5397

comment:6 by robe, 15 months ago

Type: defectenhancement

comment:7 by Regina Obe <lr@…>, 15 months ago

In 8cebf5d6/git:

CHANGES for debug_standardize_address
References #5397

  • Add tests for debug_standardize_address
  • Try to fix gitlab complaints
  • Turn back on address_standardizer testing on winnie
  • Fix postgis_tiger_geocoder regress not working. References #5254
  • Turn back on winnie postgis_tiger_geocoder regression (it's going to fail because of #5408)

comment:8 by robe, 15 months ago

I still need to add in the rule id and full rule string and then also hoping to figure out adding the rules for macro (not sure why they are not coming thru)

comment:9 by Regina Obe <lr@…>, 15 months ago

Resolution: fixed
Status: newclosed

In d901678/git:

debug_standardize_address

  • Add rule_string and rule_id, by looking up from rule table
  • If no matching rule found (shouldn't happen but it does) return -1 for rule_id
  • Also return rule_stub_string which is the filter used to search for the rule in rule table
  • Add test to output rule_string/rule_stub_string

Closes #5397

Note: See TracTickets for help on using tickets.