Opened 9 years ago

Last modified 4 years ago

#2289 new enhancement

Redesign Geocoder

Reported by: woodbri Owned by: robe
Priority: medium Milestone: PostGIS Fund Me
Component: tiger geocoder Version: 2.0.x
Keywords: Cc: robe


I migrated these comments from #1118 as they really should be in a separate ticket. The discussion around the issue in #1118 is really a symptom of the fact that we need to redesign the tiger geocoder to better use the new pagc_address_standardizer, and to make it possible in the future to be less Tiger centric (as a secondary goal.

Ticket #1118 is a problem of not standardizing the reference dataset and relying on the existing standardization. This is a process bug, not a code bug. If you take a random address and ask some people to standardize it into components, you will surely get some different results because the people will have a different set of rules in mind. So we take Tiger data which has been standardized by 3300 different counties where it was collect and given to Census and you will not even find consistency within Tiger. So relying on the pre-parsed standardization is the wrong way to approach this problem.

The way to fix this is to load the tiger data, then clump the name attributes into a single string and give it to the standardizer to parse and then save that. When we get a query request, we standardize that using our same standardizer and rules and we match those results against our standardized reference set.

Then we don't care if the standardization is right or wrong, because if it is wrong, it will be wrong in both cases and will still match.

This process also has the benefit that you can analyze those records that failed to standardize because of missing lexicon, gazeteer or rules and add those that we might need to improve the tools over time. This part can be done separate from the automated loading process. I should be done as part of the bug fixing and enhancements to the geocoder over time.

While the pagc address standardizer improves things and proves some easy tool to change the behavior if you don't make this process change. You will have an endless list of bugs like this that have nothing to do with the code. While you might be able to fix some of these with change to lex, gaz and rules you also might be breaking other cases that are not obvious when you make changes. DAMHIK.

I know the plan it to move forward without making this process change, but it should be planned for sometime in the future.

Yah I was thinking of it in future. I'll ticket that I'm leaning toward using hstore to store the normalized hash for the tiger set possibly only doing it for the obvious ambiguities.

The issue I have with doing it for after load and for all

1) inserting is a lot less painful than updating since updating requires both an insert and delete. So its faster to do on load.

2) Since this is in flux, they'll be a lot of updating going on initially so I don't want to push that on users until things are more stable, plus it complicates update script with update requiring user data changes -- something I kind of want to stay away from until I have my upgrade bullet proof.

3) I actually don't think its necessary to standardize all tiger (I would say about 85% or more of it is fine). For the most part there aren't that many ambiguities and a lot of those would be long and painful to itemize and doing it by lex is probably not the right way.

Clearly for things like Camino etc that would be the right thing.

so I'm thinking more along a hybrid. It would also make my hstore index way shorter and faster to scan if its only the questionable problematic ones that need to be changed. Anyway I'll put in a separate future ticket. For PostGIS 2.1 I would like to change the norm_addy structure since that is part issue that I am mixing pre abbrev with post abbrevs.

I don't do any updates. I load the tiger data into a table, I then standardize that into a stdaddr table that is linked by the primary of the tiger table. If I make changes to the lex, gaz, or rules, I drop the stdaddr table and recreate it. All searches are done only on the stdaddr table and only when I have candidate records do I join those back to the tiger data to get the geometry and compute the location.

So for production, you install a "standard" set of tables for lex, gaz and rules. you load your data, create the stdaddr table and you are done. Users should not be modifying the lex, gaz or rules unless they are developing a different geocoder and then they are not you normal user and they have to understand the process for doing this including the fact that they need to recreate the stdaddr table if they make changes.

While this may require a lot of changes in the current geocoder to move to this structure, long term it is good because it moves you away from being Tiger centric. If our northern neighbors want to use it for Canadian data, then can make a loader for that data, standardize it into stdaddr table and your geocoder will work on that too.

This simplicity will also translate into cleaner and simpler code which will be easier to maintain and in all likelihood be faster also.

Change History (6)

comment:1 Changed 9 years ago by robe

Milestone: PostGIS Future
Summary: Redesign Tiger GeocoderRedesign Geocoder

I've retitled this :)

It is my plan for a more generic geocoder, but not for the tiger one and I expect both to live in parallel for a while. The problem is that this requires an overhaul of how I'm doing queries and also my reverse geocoder logic. Since its working well for ME :) and the project that originally funded it (mostly geocoding and reverse geocoding highways and ramps in MA), I really don't want to mess with it until I get more funding for it.

What you are describing I consider more of a whole rewrite project with consideration of what is salvageable in current tiger geocoder and porting that over. The queries which is the brunt of my effort on it is NOT SALVAGEABLE unfortunately as it relies on a lot of tables from tiger. It uses featnames, edges, faces, addr etc. to determine side of street alternative names, street range etc etc.

Out of curiosity how many rows does your stdaddr table have for example to satisfy just MA?

comment:2 Changed 9 years ago by woodbri

I have 756,778 segments in stdaddr and 757,766 in tiger for MA. The difference is 1000+ that did not standardize. I simple left join and distinct on the name will show you the records that need to be considered when updating the the lexicon and/or rules.

I store my tiger records as single-sided denormalized records. So if a segment has 4 names and 3 alt address ranges, and 2 placenames, I have 4*3*2 records for it. Each of my tiger records has a side field so I can I can offset if needed. My goal was to preprocess the data to make the geocoder design simple and straight forward.

comment:3 Changed 9 years ago by woodbri

Correction: The 1000+ delta was because the tiger data did not have enough data or standardize, for example 434 had no name info and others probably had a required field that was null.

To look at the record that failed to stanadardize I used the following query:

select count(*) as cnt,
       array_to_string(ARRAY[predirabrv, pretypabrv, prequalabr, name,
                             sufdirabrv, suftypabrv, sufqualabr], ' ') as micro
  from streets
 where gid in (
    select id
      from stdstreets
     where coalesce(building, house_num, predir, qual, pretype, name,
                    suftype, sufdir, ruralroute, extra,
                    city, state, country, postcode, box, unit) is null
      and ac2='MA'
 group by micro
 order by micro;

comment:4 Changed 9 years ago by woodbri

Here are the road names that failed to stanadardize. The leading number is the count of segments. Some of these are obviously bad form, others not so much. I common issue is single and double letters are special tokens typically need some additional rules. At some point we should sit down together so I can go over how to define and add rules.

4;"(Army Corps of Engineers Road)"
96;"1st Parish Rd"
2;"6066 I U S Hill Rd"
11;"Acc Rd 3/windsor Pd Rd"
12;"Ap Newcomb Rd"
1;"Arthur J Remillard Jr Way"
2;"Arthur Remillard Jr Way"
23;"Ashburnham State Rd"
16;"Ashby State Rd"
19;"Baldwinville State Rd"
4;"Bay View Bb Ave"
32;"Berkshires To the Capes Trl"
2;"Double Ox Rd"
2;"F and L Right of Way"
5;"Farm To Market Rd"
25;"First Parish Rd"
4;"Gina Jo Way"
12;"Granby Rd / State Rt 202"
9;"Granby Rd / W State / Rt 202"
8;"Harmony Homes Access Rd"
12;"Henry Graf Jr Rd"
4;"Highland Bb Ave"
5;"Industrial Park E Rd"
6;"Industrial Park W Rd"
4;"Jacksonville State Rd"
2;"John Glass Jr Sq"
4;"Jones Cross E Rd"
21;"Labor in Vain Rd"
20;"Mass State Hwy"
32;"Mbta Green Line Trolley (Suspended)"
1;"Mile Hill Rd"
2;"Mile Ln"
2;"Mile Rd"
2;"Nan Ke Rafe Path"
36;"New State Hwy"
12;"New State Rd"
17;"RR Oxford Ave"
10;"Rhodes To Bailey Rd"
1;"Ridgevale Right of Way"
16;"Right Rd Old"
17;"Rindge State Rd"
17;"S Washington State Rd"
2;"US Bates Rd"
2;"Val Go Way"
6;"W Ox Pasture Ln"
6;"Way To the River Rd"
2;"William C Gould Jr Way"
2;"William Sands Jr Rd"

comment:5 Changed 9 years ago by robe

Which tiger year is that? I have in featnames (I think I have 2012 loaded) ( 564,976 records with non-empty fullnames for MA) granted some might not be things usually addressable. so I would expect your table to be a little bigger. Maybe not. Anyway I like the idea of a sideline table rather than adding a column to a tiger table so I might start with that approach.

The main reason I haven't denormalized it is I'm not convinced it makes speed faster for PostgreSQL (for MySQL or SQLite Of course :) ). Although there is more joining - PostgreSQL is pretty good at that. What you get with multiple tables is possibility of using multiple disk seeks (since you can have the different sets on different disks) and possibility of loading more records in shared memory since you won't need to duplicate.

So before going that route, I'd have to benchmark how the loss in the aforementioned is improved with a single table set. Granted it will make query easier to understand to others, but that's the least of my problems :)

Aside from that, disk space is a big concern for me mostly because 1) I have to worry about cloud disks which are often limited in size and scaled based on processor (except for Amazon disks which you pay for by the super suckiness). 2) It takes longer to copy my tables from one install to another the larger the dataset.

comment:6 Changed 4 years ago by robe

Milestone: PostGIS FuturePostGIS Fund Me

Milestone renamed

Note: See TracTickets for help on using tickets.