Opened 13 years ago

Closed 13 years ago

#1148 closed defect (duplicate)

Geocoder has poor results with correct inputs

Reported by: darkblueb
Owned by: robe
Priority: medium
Milestone: PostGIS 2.0.0
Component: tiger geocoder
Version: master
Keywords:
Cc:

Description

please see attachment pdf for discussion and example

Attachments (1)

onGeocoding.pdf (217.4 KB ) - added by darkblueb 13 years ago.


Change History (19)

in reply to:  description comment:1 by bitner, 13 years ago

Replying to darkblueb:

please see attachment pdf for discussion and example

What is the specific defect being reported here? A defect should be a very specific repeatable issue preferably with data that can be used as a test case.

comment:2 by darkblueb, 13 years ago

Hi Bitner - I've been putting all of the data for these test cases on

http://download.osgeo.org/postgis

comment:3 by robe, 13 years ago

Bryan,

Just to add to what Bitner is saying. It's nice to have all these examples, but would be even nicer if you isolated 5-10 representative population cases I can stick in my regress script. I wouldn't need as much time with that as trying to digest a 6 MB file and loading it. That is useful for stress testing, but not terribly useful during development phase for me.

So my ideal bug reports are like the one you submitted about the normalize issues (#1077, with a set of examples I could easily test) and the tickets that Mike Pease has been submitting, like #1145: a small set of concrete examples demonstrating the point, which I can chew on and dump into regress.

Less is more :)

comment:4 by darkblueb, 13 years ago

"ZIP Code Tabulation Areas (ZCTA) are quasi-statistical areas which attempt to approximate, but are by no means the same as, the USPS ZIP codes."

wikipedia - Topologically_Integrated_Geographic_Encoding_and_Referencing

This PDF was written before I read that. At any rate, a couple of well-chosen example addresses for the regression SQL, definitely; but what I was also trying to do was use data mining techniques to look at patterns of behavior over large, controlled sets of inputs. The samples on the download site are what I was using as inputs.

comment:5 by robe, 13 years ago

Give r7689 a try and see if it improves things. Also don't forget to post specific examples with issues. It's much easier to develop regress with a small representative sample than a lot.

by darkblueb, 13 years ago

Attachment: onGeocoding.pdf added

comment:6 by darkblueb, 13 years ago

Notes on Geocoder 7691:
===============================================

Input data: 400,000 correct residential addresses in Alameda County (or subsets thereof)

Two runs: no limit on results ("L_N"), and result limit = 1 ("L_1")

Comparison geocoder: Brand-Y commercial geocoder ("Brand_Y")

Observations:


  • Fewer than 0.25% of addresses were imperfectly normalized by normalize_address() (measured as the Levenshtein distance between the input address and the pretty-printed normalized address)
  • L_1 has roughly 3% of geocoded points not near the actual address (Brand_Y: < 0.026% for the same set)
  • L_N has > 10% of points not near the actual address
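The normalization check in the first observation can be sketched roughly as follows. This is a minimal illustration in Python, not the script that produced the numbers above: the function names and the case/punctuation folding are assumptions made for the example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def _canon(address: str) -> str:
    # Assumption: ignore case, commas and repeated whitespace when comparing.
    return " ".join(address.upper().replace(",", " ").split())


def normalization_distance(input_address: str, pretty_printed: str) -> int:
    """0 means the pretty-printed address matches the input after folding."""
    return levenshtein(_canon(input_address), _canon(pretty_printed))
```

An address would then count as an "imperfect reading" when `normalization_distance(input, pprint_addy_output)` is greater than zero (or greater than some small tolerance).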

comment:7 by robe, 13 years ago

Some example sets would be nice too. 4 or 5 samplings of address and expected vs. returned.

comment:8 by robe, 13 years ago

Mostly I just want to gauge whether it's the imperfection of the TIGER data (which not much can be done about) or something that can be improved in the algorithm. For example, TIGER is missing some alias street names that Navteq has, and the address ranges and zips are also imperfect.

There are some cases I know can be improved on, such as handling prequalifiers and postqualifiers.

comment:9 by darkblueb, 13 years ago

one example, two dozen addresses on

"El Cajon Ave, Fremont, CA 94536"

were geocoded by L_1 as

"El Cajon Ave, Shasta Lake, CA 94536"

4533,4541,4595,4552,4528,4549,4568 etc…

comment:10 by darkblueb, 13 years ago

please see http://download.osgeo.org/postgis/alameda_cajons.csv.bz2 for details

comment:11 by robe, 13 years ago

Okay I think this one is a typo in tiger data.

This query should give me all the streets that are around the right location of El Cajon Fremont, CA

SELECT f.*, e.lfromadd, e.ltoadd, e.rfromadd, e.zipl, ST_AsText(ST_Centroid(e.the_geom)) AS b
	FROM tiger_data.ca_edges AS e
		INNER JOIN tiger_data.ca_featnames AS f ON e.tlid = f.tlid
	WHERE ST_DWithin(ST_GeomFromText('POINT(-122.024396 37.558286)',4269), e.the_geom, 0.005)
ORDER BY ST_Distance(ST_GeomFromText('POINT(-122.024396 37.558286)',4269), e.the_geom), f.fullname
LIMIT 10;

If I run the above query, the first match I get is

El Canon Ave

Looking at Google, I think that should be El Cajon.

I haven't had a chance to look at your other entries. Does that file contain only failures, or is it a mishmash of failures and successes?

comment:12 by robe, 13 years ago

And to demonstrate — if I insert this record:

INSERT INTO tiger_data.ca_featnames( tlid, fullname, "name", predirabrv, pretypabrv, prequalabr, 
       sufdirabrv, suftypabrv, sufqualabr, predir, pretyp, prequal, 
       sufdir, suftyp, sufqual, linearid, mtfcc, paflag, statefp)
SELECT  tlid, 'el Cajon Ave' AS fullname,'el Cajon' AS name, predirabrv, pretypabrv, prequalabr, 
       sufdirabrv, suftypabrv, sufqualabr, predir, pretyp, prequal, 
       sufdir, suftyp, sufqual, linearid, mtfcc, paflag, statefp
  FROM tiger_data.ca_featnames
  WHERE tlid = '125027794' AND name = 'el Canon';

And then run this query:

SELECT pprint_addy(addy), ST_AsText(geomout) As coord, rating 
	FROM geocode('4533 El Cajon Ave, Fremont, CA 94536',1) As g;

The answer comes back pretty fast: 175 ms on my crappy drive with many other states loaded. Without that entry it's very slow (about 30 secs) and comes back, as you said, with the wrong answers. With the entry, the result is:

             pprint_addy              |          coord           | rating
--------------------------------------+--------------------------+--------
 4533 el Cajon Ave, Fremont, CA 94536 | POINT(-122.0241 37.5582) |      0

comment:13 by robe, 13 years ago

I think I answered my question about these being all example errors. Another one I looked at I know about. This is the prequal issue I was talking about that I haven't corrected yet:

For: 1798 PASEO DEL CAJON, PLEASANTON, CA 94566

I have this one itemized as #1118, which is pretty endemic in California data because of the Spanish influence.

comment:14 by robe, 13 years ago

Brian, I looked at the others in this csv file, and all the other failures seem to be the result of incorrect normalization as mentioned in #1118.

Interestingly you have this one in your list, which does geocode to the right location.

SELECT pprint_addy(addy), ST_AsText(ST_SnapToGrid(geomout,0.0001)) As coord, rating 
	FROM geocode('5551 CORTE DEL CAJON, PLEASANTON, CA 94566',1) As g;

gives:

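                pprint_addy                 |          coord           | rating
--------------------------------------------+--------------------------+--------
 5551 del Cajon Corte, Pleasanton, CA 94566 | POINT(-121.8895 37.6681) |     60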

Which is right next to the Google answer of -121.889295, 37.668137 (lon, lat) and your Geocoder Y answer of 37.667858 -121.889823 (lat lon).

(In fact, Google and the tiger geocoder place it on the same parcel, and Geocoder Y is one parcel away.) But the rating it gives for the result is poor. I presume you have this listed because the normalize/pretty print is wrong and/or the rating it gave is so low.
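To put rough numbers on "right next to", the three quoted coordinates can be compared with a great-circle distance. The sketch below is only an illustration: it assumes a spherical Earth, ignores the small NAD83 (SRID 4269) versus WGS84 difference, and the tiger point was snapped to a 0.0001 degree grid, so the results are good to about ten metres at best. It says nothing about parcel boundaries.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius; spherical approximation (assumption)


def haversine_m(lon1, lat1, lon2, lat2):
    """Great-circle distance in metres between two lon/lat points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Coordinates quoted in this comment, all stored here as (lon, lat).
google = (-121.889295, 37.668137)
tiger = (-121.8895, 37.6681)        # snapped to a 0.0001 degree grid
brand_y = (-121.889823, 37.667858)  # quoted as "lat lon"; swapped to (lon, lat)

d_tiger = haversine_m(*tiger, *google)
d_brand_y = haversine_m(*brand_y, *google)
print(f"tiger geocoder vs Google: {d_tiger:.1f} m")
print(f"Geocoder Y vs Google:     {d_brand_y:.1f} m")
```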

This again is the issue described in #1118, but it just proves that even with this problem the geocoder sometimes gets the location right but displays the name wrong, because the street type should go in front of the street name instead of after.

I think these two facts explain all the errors in the file: 1) TIGER data error (should be reported to Census); 2) pretypabrv not being normalized correctly (on my list to fix).

Do you have other examples with poor results that don't fit case 2?

comment:15 by darkblueb, 13 years ago

2461 forino drive, dublin, ca 94568

comment:16 by robe, 13 years ago

This one, I think, is again the result of poor TIGER data around that area (or something wrong with my loader). I ran this query, centered on where Google says that address should fall:

SELECT e.fullname, e.tlid, f.*, e.lfromadd, e.ltoadd, e.rfromadd, e.zipl
	FROM tiger_data.ca_edges AS e
		INNER JOIN ST_GeomFromText('POINT(-121.851763 37.720674)',4269) AS target_geom
			ON ST_DWithin(target_geom, e.the_geom, 0.005)
		LEFT JOIN tiger_data.ca_featnames AS f ON (e.tlid = f.tlid)
ORDER BY ST_Distance(target_geom, e.the_geom), f.fullname
LIMIT 10;

The best matches are all unnamed streets. I presume the first on the list is probably the right one. Are these fairly new streets? I think the TIGER data may be incomplete for newly developed streets, and given that it's sort of in a corner, it looks like it might be in an expanding region of California.

comment:17 by darkblueb, 13 years ago

Priority: high → medium

I have written a new tool, alameda_tiger_peek.py. It operates on the output of the alameda_geocode.py pass (which records detailed information on all stages of the geocoder).

Loop over the output of the geocoder and work on elements where the input address starts with a number and the Manhattan distance between the result POINT and the expected POINT is greater than a threshold (0.01 now). From those, gather the UNIQUE street names from the input addresses (the street name components 1-4 in norm_addy, combined).

Loop over the output of the geocoder again. If the input address is on a street identified above, "peek" at the TIGER edges JOIN featurenames rows that are within a small buffer of the POINT of the expected geocode result, write those "peeks" to a CSV, and remove that street name from the list of street names.
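The two passes described above might look roughly like the sketch below. It is not alameda_tiger_peek.py itself: the `GeocodeRecord` shape, the function names, and the sample rows (made-up addresses and coordinates) are hypothetical, and the actual TIGER "peek" query and CSV output are left out.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

THRESHOLD = 0.01  # in degrees; the value quoted in this comment

Point = Tuple[float, float]  # (lon, lat)


@dataclass
class GeocodeRecord:
    """Hypothetical shape of one row of recorded geocoder output."""
    input_address: str
    street_name: str  # combined street-name components from norm_addy
    expected: Point
    result: Point


def manhattan(p: Point, q: Point) -> float:
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def is_suspect(rec: GeocodeRecord, threshold: float = THRESHOLD) -> bool:
    """Input starts with a house number and the result is far from the expected point."""
    return (rec.input_address[:1].isdigit()
            and manhattan(rec.result, rec.expected) > threshold)


def suspect_streets(records: List[GeocodeRecord]) -> Set[str]:
    """First pass: unique street names that have at least one suspect result."""
    return {rec.street_name for rec in records if is_suspect(rec)}


def peek_targets(records: List[GeocodeRecord], streets: Set[str]) -> List[GeocodeRecord]:
    """Second pass: one record per suspect street; its expected point is where to peek."""
    remaining = set(streets)
    targets = []
    for rec in records:
        if rec.street_name in remaining:
            targets.append(rec)
            remaining.discard(rec.street_name)  # each street is peeked at only once
    return targets


# Made-up sample rows, for illustration only.
records = [
    GeocodeRecord("100 Example St", "Example St", (-122.00, 37.50), (-122.30, 40.60)),
    GeocodeRecord("102 Example St", "Example St", (-122.00, 37.50), (-122.30, 40.60)),
    GeocodeRecord("7 Sample Ave", "Sample Ave", (-121.85, 37.72), (-121.8501, 37.7201)),
    GeocodeRecord("PO Box 12", "", (-121.90, 37.60), (-120.00, 39.00)),
]
streets = suspect_streets(records)
targets = peek_targets(records, streets)
```

Each entry in `targets` would then drive one buffered query against the TIGER edges and featnames tables around `rec.expected`.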

The purpose of this is to help identify all cases where the geocoder performance is poor due to lack of information in TIGER, versus where the geocoder is actually failing in some way. As noted previously, there could be a problem with the way the geocoder is loading data from TIGER that is not yet found.

Bug Priority changed to medium, because some issues have been addressed.

comment:18 by robe, 13 years ago

Resolution: duplicate
Status: new → closed

I think this ticket is too vague, so I'm closing it out: a lot of the bad results are due to the TIGER data itself or to other issues already detailed more granularly in other tickets.
