Opened 12 years ago

Closed 11 years ago

#1614 closed defect (wontfix)

More normalizer tricks

Reported by: mikepease
Owned by: robe
Priority: medium
Milestone: PostGIS 2.1.0
Component: tiger geocoder
Version: 1.5.X
Keywords:
Cc: woodbri

Description


-- Specifying St. Paul but not MN misinterprets the state
select * from normalize_address('933 Vandalia Ave, St Paul');
-- Output: PALAU (PW)??
-- 933 "" "Vandalia" "" "" "" "St" "PW" "" t

select * from normalize_address('933 Vandalia, St. Paul, 55304');
-- Still goes to Palau

select * from normalize_address('933 Vandalia, St. Paul, MN');
-- Works properly


-- Why is the syntax so sensitive here?
-- None of these work right:
select * from normalize_address('901 Mainstreet, Fl 2, Hopkins MN 55343');
select * from normalize_address('901 Mainstreet Fl 2, Hopkins, MN 55343');
select * from normalize_address('901 Mainstreet Fl 2 Hopkins, MN 55343');
select * from normalize_address('901 Mainstreet Fl 2 Hopkins, MN 55343');
select * from normalize_address('901 Mainstreet, Fl 2 Hopkins, MN 55343');
select * from normalize_address('901 Mainstreet St, Fl 2 Hopkins, MN 55343');
-- This one does:
select * from normalize_address('901 Mainstreet St Fl 2, Hopkins, MN 55343');

Change History (8)

comment:1 by robe, 12 years ago

Milestone: PostGIS 2.0.0 → PostGIS 2.1.0

I'm going to push this back, but I might get to it before then. I just don't want people yelling at me about the 2.0.0 space being cluttered.

comment:2 by mikepease, 12 years ago

County Road syntax is sensitive. An address database we have uses a different syntax for listing county roads. Example: 8435 COUNTY 20 RD SE, ROCHESTER, MN 55904

This normalizes differently than: 8435 COUNTY RD 20 SE, ROCHESTER, MN 55904

But this second syntax stumps the normalizer. If you write it this way, then it works: 8435 COUNTY ROAD 20 SE, ROCHESTER, MN 55904

select * from normalize_address('8435 COUNTY 20 RD SE, ROCHESTER, MN 55904')

select * from normalize_address('8435 COUNTY RD 20 SE, ROCHESTER, MN 55904')

select * from normalize_address('8435 COUNTY ROAD 20 SE, ROCHESTER, MN 55904')

I can see why the first syntax may produce a reasonable, if not the desired, result. But the second syntax shouldn't get stumped.

It looks like the lookup table for road types needs more spelling variants, including COUNTY RD as well as COUNTY ROAD.

Perhaps this is true for other street types too?

comment:3 by mikepease, 12 years ago

Google uses "U.S." as its formal syntax for a US Hwy. If I add this to the street_type_lookup, I get some more matches in my address database.

select * from normalize_address('3208 U.S. 52, Rochester, MN 55901')

comment:4 by robe, 12 years ago

Added County Rd and U.S. to the list at r10309.

comment:5 by woodbri, 12 years ago

Cc: woodbri added

comment:6 by woodbri, 11 years ago

When I loaded and standardized all of Tiger for the whole US using the PAGC standardizer, I looked at the records that failed to standardize so I could add entries to the lexicon, gazetteer, and parser rules. I found a lot of garbage in these records: things like the ones you mention above (COUNTY 20 RD vs COUNTY RD 20), and things like the street type appearing in BOTH the name and the type fields. I think there were about 9,000 such records out of 50 million, so I have not waded through them yet, as I had other higher-priority items. I also think some simple regex checking and cleaning of these in the loading process is the best way to deal with them. In other words, spend the time once to deal with these, so the search code is cleaner, simpler, and faster.
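A minimal sketch of what such load-time regex cleaning might look like, written in Python purely for illustration. The function name `clean_street` and both rules are hypothetical; they are not taken from PAGC or from the tiger loader, and a real rule set would have to be derived from the records that actually fail to standardize.

```python
import re
from typing import Tuple

# Hypothetical rule 1: "COUNTY 20 RD" -> "COUNTY RD 20"
# (the road word was written after the route number).
_COUNTY_SWAPPED = re.compile(r"\b(COUNTY)\s+(\d+)\s+(RD|ROAD)\b", re.IGNORECASE)


def clean_street(name: str, street_type: str) -> Tuple[str, str]:
    """Return a (name, type) pair with two illustrative cleanups applied."""
    name = " ".join(name.split())  # collapse runs of whitespace
    street_type = street_type.strip()

    # Move the route number so it follows the road word.
    name = _COUNTY_SWAPPED.sub(r"\1 \3 \2", name)

    # Hypothetical rule 2: drop the street type when it is also
    # repeated at the end of the name field.
    if street_type:
        suffix = re.compile(r"\s+" + re.escape(street_type) + r"$", re.IGNORECASE)
        name = suffix.sub("", name)

    return name, street_type


if __name__ == "__main__":
    print(clean_street("COUNTY 20 RD", ""))
    print(clean_street("Main St", "St"))
```

Running rules like these once against the staging data would keep the special cases out of the search code, which is the trade-off argued for above.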

comment:7 by robe, 11 years ago

Steve,

That's a great idea. I haven't given much thought to doing that, mostly because I haven't worked out what cleaning rules to institute. In theory it should be easy to inject these preprocessing steps, since the loader loads the shapefile into a staging table before pushing to the final tables. So all this cleaning can be done in staging.

comment:8 by robe, 11 years ago

Resolution: wontfix
Status: new → closed

Going to focus my effort on integrating PAGC.
