|Reported by:||woodbri||Owned by:||robe|
|Priority:||medium||Milestone:||PostGIS Fund Me|
I migrated these comments from #1118 as they really should be in a separate ticket. The discussion around the issue in #1118 is really a symptom of the fact that we need to redesign the tiger geocoder to better use the new pagc_address_standardizer and, as a secondary goal, to make it possible in the future to be less Tiger centric.
Ticket #1118 is a problem of not standardizing the reference dataset and relying on the existing standardization. This is a process bug, not a code bug. If you take a random address and ask several people to standardize it into components, you will surely get different results, because each person will have a different set of rules in mind. Tiger data has been standardized by 3300 different counties where it was collected and given to Census, so you will not even find consistency within Tiger. Relying on the pre-parsed standardization is therefore the wrong way to approach this problem.
The way to fix this is to load the tiger data, then clump the name attributes into a single string and give it to the standardizer to parse and then save that. When we get a query request, we standardize that using our same standardizer and rules and we match those results against our standardized reference set.
Then we don't care if the standardization is right or wrong, because if it is wrong, it will be wrong in both cases and will still match.
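The process above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `standardize`, `build_reference`, `lookup`, and the tiny `LEXICON` are hypothetical stand-ins, not the pagc_address_standardizer or its lex/gaz/rules tables. The point it shows is that the reference names are clumped into one string and run through the same function as the query, so both sides always agree.

```python
# Toy stand-in for a real standardizer; this lexicon is invented for illustration.
LEXICON = {
    "street": "ST", "st": "ST",
    "avenue": "AVE", "ave": "AVE",
    "north": "N", "n": "N",
}

def standardize(text):
    """Normalize a free-form street string into a canonical token tuple."""
    tokens = text.lower().replace(".", " ").replace(",", " ").split()
    return tuple(LEXICON.get(tok, tok.upper()) for tok in tokens)

def build_reference(rows):
    """Index reference rows of (record_id, [name attribute strings]) by standardized form."""
    index = {}
    for record_id, parts in rows:
        # Clump the pre-parsed attributes back into a single string, ignoring
        # however the data provider chose to split them.
        clumped = " ".join(p for p in parts if p)
        index.setdefault(standardize(clumped), []).append(record_id)
    return index

def lookup(index, query):
    """Standardize the query with the same function and match on the result."""
    return index.get(standardize(query), [])

# Two providers split the same street name differently.
reference_rows = [
    (1, ["N", "Main", "St"]),
    (2, ["", "North Main", "Street"]),
    (3, ["", "Camino Real", ""]),
]
index = build_reference(reference_rows)
print(lookup(index, "north main street"))
print(lookup(index, "Camino Real"))
```

Records 1 and 2 land on the same key even though their source attributes were split differently, which is the inconsistency the reference data suffers from.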
This process also has the benefit that you can analyze those records that failed to standardize because of missing lexicon, gazetteer or rules entries, and add the ones we need to improve the tools over time. This part can be done separately from the automated loading process. It should be done as part of the bug fixing and enhancements to the geocoder over time.
The pagc address standardizer improves things and provides some easy tools to change the behavior, but if you don't make this process change you will have an endless list of bugs like this that have nothing to do with the code. While you might be able to fix some of these with changes to lex, gaz and rules, you also might be breaking other cases that are not obvious when you make changes. DAMHIK.
I know the plan is to move forward without making this process change, but it should be planned for sometime in the future.
Yah, I was thinking of it for the future. I'll ticket that. I'm leaning toward using hstore to store the normalized hash for the tiger set, possibly only doing it for the obvious ambiguities.
The issues I have with doing it after load and for all records:
1) Inserting is a lot less painful than updating, since updating requires both an insert and a delete. So it's faster to do on load.
2) Since this is in flux, there'll be a lot of updating going on initially, so I don't want to push that on users until things are more stable. It also complicates the update script, with updates requiring user data changes, something I kind of want to stay away from until I have my upgrade bulletproof.
3) I actually don't think it's necessary to standardize all of tiger (I would say about 85% or more of it is fine). For the most part there aren't that many ambiguities, and a lot of those would be long and painful to itemize, and doing it by lex is probably not the right way.
Clearly for things like Camino etc. that would be the right thing.
So I'm thinking more along the lines of a hybrid. It would also make my hstore index way shorter and faster to scan if it's only the questionable, problematic ones that need to be changed. Anyway, I'll put in a separate future ticket. For PostGIS 2.1 I would like to change the norm_addy structure, since part of the issue is that I am mixing pre-abbreviation values with post-abbreviation values.
I don't do any updates. I load the tiger data into a table, then standardize that into a stdaddr table that is linked by the primary key of the tiger table. If I make changes to the lex, gaz, or rules, I drop the stdaddr table and recreate it. All searches are done only on the stdaddr table, and only when I have candidate records do I join those back to the tiger data to get the geometry and compute the location.
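As a rough sketch of that workflow, here is a self-contained Python example using an in-memory SQLite database in place of PostgreSQL. Everything named here is an assumption made for illustration: the `ref_edges` table and its columns, the `std_name` column, the `geom_wkt` text column standing in for a real geometry, and the toy `standardize` function standing in for the real standardizer. Only the stdaddr name and the drop-and-recreate pattern come from the description above.

```python
import sqlite3

def standardize(name):
    """Toy stand-in for the real standardizer; the lexicon is invented."""
    lex = {"street": "ST", "st": "ST", "road": "RD", "rd": "RD"}
    return " ".join(lex.get(tok, tok.upper()) for tok in name.lower().split())

def rebuild_stdaddr(conn):
    """Drop and recreate stdaddr from the reference table; no in-place updates."""
    conn.execute("DROP TABLE IF EXISTS stdaddr")
    conn.execute(
        "CREATE TABLE stdaddr ("
        " ref_id INTEGER PRIMARY KEY REFERENCES ref_edges(id),"
        " std_name TEXT)"
    )
    rows = conn.execute("SELECT id, fullname FROM ref_edges").fetchall()
    conn.executemany(
        "INSERT INTO stdaddr (ref_id, std_name) VALUES (?, ?)",
        [(ref_id, standardize(name)) for ref_id, name in rows],
    )
    conn.execute("CREATE INDEX stdaddr_name_idx ON stdaddr (std_name)")
    conn.commit()

def geocode(conn, query):
    """Search stdaddr only, then join candidates back for the geometry."""
    return conn.execute(
        "SELECT r.id, r.geom_wkt FROM stdaddr s"
        " JOIN ref_edges r ON r.id = s.ref_id"
        " WHERE s.std_name = ? ORDER BY r.id",
        (standardize(query),),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ref_edges (id INTEGER PRIMARY KEY, fullname TEXT, geom_wkt TEXT)"
)
conn.executemany(
    "INSERT INTO ref_edges VALUES (?, ?, ?)",
    [
        (1, "Elm Street", "LINESTRING(0 0, 1 0)"),
        (2, "Oak Rd", "LINESTRING(0 1, 1 1)"),
    ],
)
rebuild_stdaddr(conn)
print(geocode(conn, "elm st"))
```

After a change to the lexicon or rules, calling `rebuild_stdaddr(conn)` again regenerates the whole table, and the reference table itself is never modified.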
So for production, you install a "standard" set of tables for lex, gaz and rules. You load your data, create the stdaddr table and you are done. Users should not be modifying the lex, gaz or rules unless they are developing a different geocoder, in which case they are not your normal user and they have to understand the process for doing this, including the fact that they need to recreate the stdaddr table if they make changes.
While this may require a lot of changes in the current geocoder to move to this structure, long term it is good because it moves you away from being Tiger centric. If our northern neighbors want to use it for Canadian data, they can make a loader for that data, standardize it into a stdaddr table, and your geocoder will work on that too.
This simplicity will also translate into cleaner and simpler code which will be easier to maintain and in all likelihood be faster also.