Opened 13 years ago

Closed 11 years ago

#1075 closed defect (wontfix)

Too much info in address breaks geocoder

Reported by: mikepease Owned by: robe
Priority: medium Milestone: PostGIS 2.1.0
Component: tiger geocoder Version: master
Keywords: Cc: woodbri

Description

Sometimes, addresses have extra "junk" at the end of the street address. This extra junk can throw off the geocoder.

-- Works great!
select (addy).*,* from geocode('156 Galewski Dr, Winona, MN 55987')

-- Too much information! It breaks down.
select (addy).*,* from geocode('156 Galewski Dr Airport Industrial Park, Winona, MN 55987')

Change History (7)

comment:1 by robe, 13 years ago

Milestone: PostGIS 2.0.0 → PostGIS Future

comment:2 by woodbri, 11 years ago

Cc: woodbri added

comment:3 by woodbri, 11 years ago

The PAGC tools solve this.

comment:4 by robe, 11 years ago

Milestone: PostGIS Future → PostGIS 2.1.0
Version: 1.5.X → trunk

Steven,

thanks for the work on the parser. I got it to compile and install on my ming32. I haven't tried it yet on my ming64, but since that is what you are using, I assume there should be no issue. I'm a bit puzzled by the standardizer. How do I get it to give the abbreviated form?

e.g.

select * from standardize_address(
        'select seq, word::text, stdword::text, token from lex order by id',
        'select seq, word::text, stdword::text, token from gaz order by id',
        'select * from rules order by id',
        'select 0::int4 as id, ''156 Galewski Dr Airport Industrial Park''::text as micro, ''Winona, MN 55987''::text as macro');

Yields:

 id |        building         | house_num | predir | qual | pretype |   name   | suftype | sufdir | ruralroute | extra |  city  |   state   | country | postcode | box | unit
----+-------------------------+-----------+--------+------+---------+----------+---------+--------+------------+-------+--------+-----------+---------+----------+-----+------
  0 | AIRPORT INDUSTRIAL PARK | 156       |        |      |         | GALEWSKI | DRIVE   |        |            |       | WINONA | MINNESOTA |         | 55987    |     |

Anyway going to start putting in tickets to integrate the PAGC parser.

comment:5 by woodbri, 11 years ago

I was building on ming64 and it was crashing there, but it might be that my build environment is not clean. Let me know if it works or not on ming64.

Regarding abbreviations, or more generally how things get standardized: if you look at the lex and gaz tables, they have these columns:

  id serial NOT NULL,
  seq integer,
  word character varying,    -- word to find in input text
  stdword character varying, -- word to standardize it to
  token integer,             -- token classification for the word

Input symbols are classified as follows (see "pagc_api.h"; these are a mix of input and output symbols):

#define NUMBER 0
#define WORD 1
#define TYPE 2

#define ROAD 6
#define STOPWORD 7

#define DASH 9
#define CITY 10
#define PROV 11
#define NATION 12
#define AMPERS 13

#define ORD 15

#define SINGLE 18
#define BUILDH 19
#define MILE 20
#define DOUBLE 21
#define DIRECT 22
#define MIXED 23
#define BUILDT 24
#define FRACT 25
#define PCT 26
#define PCH 27
#define QUINT 28
#define QUAD 29
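As a rough illustration of how the word, stdword, and token columns might fit together, here is a minimal Python sketch of a lexicon lookup. It is not the PAGC implementation: the rows in `LEX_ROWS`, the helper names `build_lexicon` and `classify`, and the fallback to NUMBER or WORD for unknown input are all assumptions made for the example. Only the token codes come from the list above.

```python
# Token codes copied from the list above (pagc_api.h).
NUMBER, WORD, TYPE, SINGLE = 0, 1, 2, 18

# Hypothetical lexicon rows as (word, stdword, token).
# These are illustrative only, not the contents of the real lex table.
LEX_ROWS = [
    ("DR", "DRIVE", TYPE),
    ("DRIVE", "DRIVE", TYPE),
    ("A", "A", WORD),
    ("A", "A", SINGLE),
]


def build_lexicon(rows):
    """Group rows by input word, keeping every (stdword, token) pair."""
    lexicon = {}
    for word, stdword, token in rows:
        lexicon.setdefault(word.upper(), []).append((stdword, token))
    return lexicon


def classify(word, lexicon):
    """Return the candidate (stdword, token) pairs for one input word."""
    key = word.upper()
    if key in lexicon:
        return lexicon[key]
    if key.isdigit():
        # Assumption for this sketch: bare digits become a NUMBER token.
        return [(key, NUMBER)]
    # Assumption for this sketch: unknown words become a plain WORD token.
    return [(key, WORD)]


lexicon = build_lexicon(LEX_ROWS)
print(classify("Dr", lexicon))  # [('DRIVE', 2)]
```

If the real tables behave like this sketch, the standardized spelling comes entirely from stdword, so an abbreviated result would presumably require storing the abbreviation there (for example a stdword of "DR" instead of "DRIVE").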

It is fine for a word to be classified as multiple tokens. When words have multiple tokens, you get all the possible combinations. So if you had something like:

23 A Street  -->  0,[1, 18], 2

and this would get evaluated as two sequences of tokens:

0, 1, 2
0, 18, 2
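The expansion above is a Cartesian product over the per-word token alternatives. A minimal Python sketch, where the function name `expand` is made up for illustration and does not come from PAGC:

```python
from itertools import product


def expand(token_options):
    """Expand per-word token alternatives into every full token sequence."""
    return [list(seq) for seq in product(*token_options)]


# "23 A Street": 23 is NUMBER (0), A is WORD (1) or SINGLE (18), Street is TYPE (2).
sequences = expand([[0], [1, 18], [2]])
print(sequences)  # [[0, 1, 2], [0, 18, 2]]
```

The number of sequences is the product of the number of alternatives per word, so a few ambiguous words can multiply the candidates quickly.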

The evaluation code uses the rules table to transform input sequences into output sequences, then scores them using the probabilities assigned to the rules to find the most likely sequence. This is done in gamma.c.
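To make the idea of scoring concrete, here is a toy Python sketch that picks the highest-scoring candidate sequence. It does not reflect the actual rules table layout or the logic in gamma.c: the `RULES` dictionary, its scores, the output labels, and the function `best_sequence` are all hypothetical.

```python
# Hypothetical rules: input token sequence -> (output fields, score).
# The scores and the mapping are invented for illustration only.
RULES = {
    (0, 1, 2): (("house_num", "name", "suftype"), 0.9),
    (0, 18, 2): (("house_num", "name", "suftype"), 0.6),
}


def best_sequence(candidates, rules):
    """Return (input tokens, output fields, score) for the best-scoring match."""
    scored = [
        (rules[tuple(c)][1], tuple(c)) for c in candidates if tuple(c) in rules
    ]
    if not scored:
        return None  # no rule matched any candidate sequence
    score, tokens = max(scored)
    return tokens, rules[tokens][0], score


result = best_sequence([[0, 1, 2], [0, 18, 2]], RULES)
print(result)
```

With these made-up scores, the sequence that treats "A" as a WORD wins over the one that treats it as a SINGLE letter.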

Anyway, you can add to the lexicon, the gazetteer, and the rules as needed. The tables I provided are a good starting point for Tiger, but they are not perfect, and I generally find it worthwhile to use two tables:

tiger data table → standardized table

Then I can look at which records did not standardize to determine whether new records in the lex or gaz tables or new rules are needed. All my queries then run against the standardized table, except when I need the address ranges or geometry, which I get by a join to the tiger table using the gid, which is common to both.
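The two-table workflow can be sketched with an in-memory SQLite database from Python. Everything here is an assumption for illustration: the table names `tiger_edges` and `std_edges`, their columns, and the sample rows (including the house number ranges) are invented and are not the real Tiger schema; only the shared gid key comes from the description above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical source table with address ranges.
    CREATE TABLE tiger_edges (gid INTEGER PRIMARY KEY, fullname TEXT,
                              fromhn INTEGER, tohn INTEGER);
    -- Hypothetical standardized table sharing the same gid.
    CREATE TABLE std_edges (gid INTEGER PRIMARY KEY, name TEXT, suftype TEXT);
    INSERT INTO tiger_edges VALUES (1, 'Galewski Dr', 100, 198),
                                   (2, 'Xyzzy Qq', 1, 99);
    INSERT INTO std_edges VALUES (1, 'GALEWSKI', 'DRIVE'),
                                 (2, NULL, NULL);
""")

# Records that did not standardize: candidates for new lex/gaz rows or rules.
failed = conn.execute("""
    SELECT t.gid, t.fullname
    FROM tiger_edges t JOIN std_edges s ON s.gid = t.gid
    WHERE s.name IS NULL
    ORDER BY t.gid
""").fetchall()

# Search the standardized table, joining back by gid only for the ranges.
ranges = conn.execute("""
    SELECT s.name, s.suftype, t.fromhn, t.tohn
    FROM std_edges s JOIN tiger_edges t ON t.gid = s.gid
    WHERE s.name = 'GALEWSKI'
""").fetchall()

print(failed)
print(ranges)
```

Keeping the standardized columns in their own table means the lexicon and rules can be revised and the standardized table rebuilt without touching the source data.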

comment:6 by robe, 11 years ago

Interesting, it wasn't crashing on ming32. I think, though, that 64-bit is more sensitive to issues with releasing memory, since I see issues in 64-bit that don't show up in 32-bit. In the end the 64-bit build is usually right and things aren't being closed properly.

comment:7 by robe, 11 years ago

Resolution: wontfix
Status: new → closed

will be replaced with PAGC.
