Opened 13 years ago

Closed 13 years ago

#1109 closed defect (fixed)

Lake Drive is good; Lake DR - not so good

Reported by: mikepease Owned by: robe
Priority: high Milestone: PostGIS 2.0.0
Component: tiger geocoder Version: 1.5.X
Keywords: Cc:

Description

--this query works great
select * from geocode('4373 LAKE DRIVE, ROBBINSDALE, MN 55422')

--but if DRIVE is abbreviated to DR, then the query runs a very long time (2 minutes) and doesn't return the right results

select * from geocode('4373 LAKE DR, ROBBINSDALE, MN 55422')

Any idea why?

Change History (12)

comment:1 by mikepease, 13 years ago

Component: postgis → tiger geocoder
Owner: changed from pramsey to robe

Oops. Missed Tiger component. Sorry Paul!

comment:2 by robe, 13 years ago

I haven't looked but I suspect DR is missing from street_type_lookup table and just needs to be added.

comment:3 by mikepease, 13 years ago

That's what I thought too, but I found DR is already in the street_type_lookup table.

"DR";"Dr";FALSE

"DRIV";"Dr";FALSE

"DRIVE";"Dr";FALSE

comment:4 by robe, 13 years ago

ah it's this entry:

INSERT INTO street_type_lookup (name, abbrev) VALUES ('LAKE', 'Lk');

I'll have to think about what to do about that. Are there really lakes for street types, or do you think that can be deleted?

comment:5 by chodgson, 13 years ago

This seems like a bigger problem; many street types could also be potential street names, for example consider:

west crescent crescent rd

I think the real problem is in the expectation of there being only a single way to normalize an input string; there are guaranteed to be ambiguous cases. Note that having multiple normalizations doesn't necessarily mean there are multiple results: there may not be a 100 1st St in 'Paul', but there may be a 100 1 St in St. Paul. It could also be the other way around; I don't know of a city named 'Paul', but there are a lot of Saints, so sometimes it might be true.

Sometimes the existence of a given street address on a given street name in a given city or zip code will reduce the number of possible results. Sometimes, however, the multiple parsings/normalizations will produce many additional possible results. Really, the geocoding part of the code should know about the different normalizations, and the normalizations might need a score/rating as well, which would carry through into the geocoding.

From my experience with geocoding, assuming there is a single "normalized" interpretation for any given input string is just not going to work for many cases. And I'm not even talking about really random stuff, I'm talking about inputs that would actually make sense to a human (and a local resident would be able to identify easily).
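As a rough illustration of the idea above (several scored normalizations instead of one), here is a minimal Python sketch. It is not the tiger geocoder's logic: the lookup dictionary, the function name candidate_normalizations, and the penalty values are all hypothetical, chosen only to show the shape of the output.

```python
# Hypothetical subset of a street-type lookup (name -> abbreviation).
# These rows are illustrative, not a dump of street_type_lookup.
STREET_TYPES = {"DR": "Dr", "DRIVE": "Dr", "LAKE": "Lk",
                "RD": "Rd", "CRESCENT": "Cres"}


def candidate_normalizations(street):
    """Return every (name, street_type, penalty) reading of a street string.

    The penalty is an arbitrary illustrative score: lower is better.
    """
    tokens = street.upper().split()
    # Reading 1: no street type at all; the whole string is the name.
    candidates = [(" ".join(tokens), None, 1)]
    # Reading 2: the final token is the street type, provided the lookup
    # knows it and at least one token is left over for the name.
    if len(tokens) > 1 and tokens[-1] in STREET_TYPES:
        abbrev = STREET_TYPES[tokens[-1]]
        candidates.append((" ".join(tokens[:-1]), abbrev, 0))
    # Best (lowest penalty) reading first.
    return sorted(candidates, key=lambda c: c[2])


if __name__ == "__main__":
    for street in ("Lake Dr", "West Crescent"):
        print(street, "->", candidate_normalizations(street))
```

Both readings are handed to the caller, so the geocoding step could try each one and fold the penalty into its own rating rather than committing to a single parse up front.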

comment:6 by robe, 13 years ago

Yah, I know; that is why I haven't closed it out. I know how I broke it too. That logic was working before I touched things.

comment:7 by mikepease, 13 years ago

I think the previous comment about needing to match against multiple possible normalizations is a very good observation.

If it helps, here's another example where the normalizer doesn't find the right match. Parsing addresses is a pain in the hind-quarters. Thanks, Regina!

209 TURNERS CROSSROAD S, GOLDEN VALLEY, MN 55416

The actual street name is "TURNERS CROSSROAD", but the normalizer is parsing CROSSROAD as street type XRD.

select abs(ST_X(geomout)::numeric(8,5)) || 'W, ' || ST_Y(geomout)::numeric(8,5) || 'N' as lat_lon, *
from geocode('209 TURNERS CROSSROAD S, GOLDEN VALLEY, MN 55416')

select abs(ST_X(geomout)::numeric(8,5)) || 'W, ' || ST_Y(geomout)::numeric(8,5) || 'N' as lat_lon, *
from geocode('209 TURNERSCROSSROAD S, GOLDEN VALLEY, MN 55416')

--this finds the right address, but it is far lower ranked than the top (incorrect) result
select * from normalize_address('209 TURNERS CROSSROAD S, GOLDEN VALLEY, MN 55416')

--this works!
select * from normalize_address('209 TURNERSCROSSROAD S, GOLDEN VALLEY, MN 55416')

comment:8 by robe, 13 years ago

Well I guess we could create a normalizer function that is capable of returning multiple results if there are ambiguities or at least 2 anyway.

I personally would as a human think CROSSROAD is a street type in that address since there is no street type given :)

comment:9 by robe, 13 years ago

Okay, I think I fixed the Lake-like cases such as Lake Dr, Park Avenue, etc. at r7613 without breaking any of the existing regression tests I put in place. Hopefully I didn't introduce any more problems.

The CROSSROAD case is a bit trickier since, like I said, that in itself is ambiguous to a human.

I do have the region filter piece done, which I am testing a bit more before I submit to trunk. So that should help a bit with these false positives.

comment:10 by mikepease, 13 years ago

Thanks, Regina. I'll get your latest version and re-run my list. Seems like we're increasing the percentage of accurate geocodes. I'll do some comparisons to see.

I'm interested to try this region filter. Along this concept, have you considered an optional parameter for favoring parts of the address? For example, maybe you know that the zip is accurate or maybe the city. So, could the geocoder do a similar filter as the region filter based on the desired components of the address?

For example (not in correct SQL syntax):

favorOptions.zip = true;

favorOptions.city = true;

geocode('212 N 3rd Ave, Minneapolis, MN 55401', favorOptions);

comment:11 by robe, 13 years ago

Yep, that was what I was thinking, with a rating weight object you pass in.

The rates are essentially just weighting the elements of the normalized address, so the rate object I had in mind is similar to your favorOptions. It would take on more or less the same shape as addy, except that the fields hold number weights instead of values.
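To make the weighting idea concrete, here is a small Python sketch of a rate object shaped like a normalized address but holding numeric weights. Everything in it is an assumption for illustration: the Weights class, the rating function, the three field names, and the sample values are hypothetical and do not reflect the geocoder's actual rating code.

```python
from dataclasses import dataclass


@dataclass
class Weights:
    # Hypothetical per-field weights; a larger weight means a mismatch
    # in that field costs more. Field names are illustrative only.
    street_name: float = 1.0
    city: float = 1.0
    zip: float = 1.0


def rating(parsed, candidate, weights):
    """Sum the weights of the fields where candidate differs from parsed.

    Lower is better; 0.0 means every compared field matched.
    """
    total = 0.0
    for field in ("street_name", "city", "zip"):
        if parsed.get(field) != candidate.get(field):
            total += getattr(weights, field)
    return total


if __name__ == "__main__":
    # Made-up values, used only to exercise the function.
    parsed = {"street_name": "3RD", "city": "CITY A", "zip": "00001"}
    wrong_zip = {"street_name": "3RD", "city": "CITY A", "zip": "00002"}
    wrong_city = {"street_name": "3RD", "city": "CITY B", "zip": "00001"}

    favor_zip = Weights(zip=5.0)  # caller trusts the zip more than the city
    print(rating(parsed, wrong_zip, favor_zip))
    print(rating(parsed, wrong_city, favor_zip))
```

With the zip weighted more heavily, a candidate in the wrong zip is penalized more than one in the wrong city, which is the behavior the favorOptions suggestion asks for.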

comment:12 by robe, 13 years ago

Resolution: fixed
Status: new → closed

Closing this out. The specific issue mentioned in this ticket was resolved a while ago, but that just raised a whole bunch of other issues of a similar class that could only be easily resolved with a more configurable rating system. Will put in a separate ticket.
