Opened 12 years ago

Closed 12 years ago

#1382 closed defect (fixed)

Some addresses take a long time to geocode or normalize

Reported by: raviada Owned by: robe
Priority: medium Milestone: PostGIS 2.0.0
Component: tiger geocoder Version: master
Keywords: Cc:

Description

Regina,

Thanks for helping me on this, normalize_address return slow but when I turn in the debug on, the out seems normal, because the address was parsed correctly and all fields got populated with the right address components.

I am attaching some of the address here which were taking a little longer. This is the output from postgis_full_version().

"POSTGIS="2.0.0SVN" GEOS="3.2.2-CAPI-1.6.2" PROJ="Rel. 4.7.1, 23 September 2009" LIBXML="2.7.8" USE_STATS"

Thanks Ravi Ada

4770 RTE. 209, ACCORD, NY 12404 509 METTACAHONTS ROAD, ACCORD, NY 12404 13859 PROGRESS BLVD STE 100, ALACHUA, FL 326159403 17330 113TH AVE, ADDISLEIGH PARK, NY 114334003 179TH ST, ADDISLEIGH PARK, NY 114341413 3510 HWY 186, ALIX, AR 728200061 1250 S 18TH ST, AMELIA ISLAND, FL 320341902 4780 NE 105TH PL, ANTHONY, FL 326173156 233 ROUTE 6, COLUMBIA, CT 062371125 11623 NE 11TH PL, BISCAYNE PARK, FL 331616725 2410 HIGHWAY 65 N, MC GEHEE, AR 716549437 5430 NW 33RD AVE, FT LAUDERDALE, FL 333096349

Attachments (1)

sample_addresses.txt (507 bytes ) - added by raviada 12 years ago.

Download all attachments as: .zip

Change History (12)

by raviada, 12 years ago

Attachment: sample_addresses.txt added

comment:1 by robe, 12 years ago

Status: newassigned
Version: 1.5.Xtrunk

comment:2 by robe, 12 years ago

Ravi, Thanks for the examples. I'm still analyzing these. The ones I've tested the normalize_address functions on returned fairly fast. All under 7ms. So I'm not seeing a normalize speed issue here. If you are you might be running an older version of the geocoder. In the normalize_address code you should see this:

normalize_address.sql 8252 2011-11-29 08:49:06Z robe 

That is the revision number of the latest normalizer. If you don't have it at all or an older number yours is out of date.

Here is what I think is wrong with some of them:

1) 179TH ST, ADDISLEIGH PARK, NY 114341413, I think I can improve on the normalizing logic and I might have a ticket for it already.

is just normalizing incorrectly so its putting 179 in the street number and ST as the street name. This is because this address has no street number. So that is why that one is so slow and comes back with wrong answer. It's still a bit slow on my test box took about 15 secs for 2 reasons

When I do this:

select pprint_addy(addy), rating, ST_AsText(geomout)
 from geocode('0 179TH ST, ADDISLEIGH PARK, NY 114341413',1);
-- which returns this --
0 179th St, New York, NY 11434	11	POINT(-73.7666465 40.663195)

a) the street name is short b) I didn't specify a valid street range b) and ADDISLEIGH PARK doesn't match anything in tiger.

2) This one 509 METTACAHONTS ROAD, ACCORD, NY 12404 — runs fairly fast on my box take 90ms to geocode returning

SELECT pprint_addy(addy), rating, ST_AsText(geomout)
 from geocode('509 METTACAHONTS ROAD, ACCORD, NY 12404', 1);

23 Mettacahonts Rd, Accord, NY 12404	10	POINT(-74.2487999333536 41.7948737121302)

-- this took 78 ms but probably faster because of caching effects
select pprint_addy(addy), rating, ST_AsText(geomout)
 from geocode('509 METTACAHONTS ROAD, ACCORD, NY 12404');

108 Mettacahonts Rd, Accord, NY 12404	9	POINT(-74.2469427078796 41.795613863851)

So the speeds are pretty decent though the address doesn't match. I suspect this is more of a tiger data issue than logic issue. The fact it gives different addresses between limit 1 and none is that to improve speed I have inner limit limitting as well but if there is no perfect match or close to perfect match you run the issue of the gvie me one answer returning slightly worse than the full. I'm not sure there is much I can do about that without compromising speed and the benefit is low.

3) This one select pprint_addy(addy), rating, ST_AsText(geomout)

from geocode('17330 113TH AVE, ADDISLEIGH PARK, NY 114334003',1);

Did take 20,483 ms and came back with only the street.

The reason is because the address for this is really: 173-30

and our geocode doesn't support that kind of street number yet. It would require the same structural changes as #886. I'll see what I can do about it though as a lot of NY addresses will have this issue. But it wouldn't help you much since you don't have the - in your address.

— Note to Steve Woodbridge: Would your C normalizer help in this case? If we were to embed it in?

comment:3 by raviada, 12 years ago

Thanks Regina for looking into this. Appreciate your help. I checked the normalize_address.sql and found out this. Seems like I have the older version. What is your recommendation on upgrading all of the functions and scripts, just in case other scripts may have been older versions too.

—$Id: normalize_address.sql 7801 2011-09-01 14:58:04Z robe $-

Thanks Ravi Ada

comment:4 by darkblueb, 12 years ago

Regina, was each of your examples passing a 1 (one) for max results ?

comment:5 by robe, 12 years ago

Ravi, 1) Download the latest postgis 2.0 tar ball or if you have svn svn update your postgis. 2) I fyou are on linux/unix edit and run the tiger_2010/upgrade_geocoder.sh, for windows there is a similar upgrade_geocoder.bat file.

Then you should be up to date.

Brian, Yes I was using mostly limit 1 but not all the time. Limit 1 tends to perfrom better than all but it varies a lot how much better it is but the speed is usually equal to or better.

in reply to:  2 comment:6 by raviada, 12 years ago

Regina, I upgraded to the new build which is Revision 8450. the output from postgis_full_version is "POSTGIS="2.0.0SVN" GEOS="3.2.2-CAPI-1.6.2" PROJ="Rel. 4.7.1, 23 September 2009" LIBXML="2.7.8" USE_STATS"

Also I see —$Id: normalize_address.sql 8252 2011-11-29 08:49:06Z robe $- in normalize_address function.

select geocode('2601 24TH AVE, ASTORIA, NY 111022337'); returns two rows Total query runtime: 20943 ms. 2 rows retrieved.

"("(24th,Ave,""New York"",NY,11204,t)",0101000020AD10000036CF279D807E52C0511C69D0424E4440,21)" "("(24th,Ave,""New York"",NY,11214,t)",0101000020AD100000FA12A7B3417F52C0699E6C55CF4C4440,21)"

I think it is still a problem in using the geocode function. normalize function is now returning real fast, in 30-60 ms. Geococe is the one taking 20 secs. All NY addresses are like this.

Thanks Ravi Ada

comment:7 by raviada, 12 years ago

select geocode_address(normalize_address('2734 21ST ST, ASTORIA, NY 111023653'),1)

"("(2734,E,21st,St,""New York"",NY,11235,t)",0101000020AD100000097F608BA97C52C06D6463B6F24A4440,13)" "("(273421st,St,""Niagara Falls"",NY,14305,t)",0101000020AD100000625E1B7D62C253C06AD712AC798E4540,17)" "("(21st,St,Troy,NY,12180,t)",0101000020AD100000AA30B0E8D16A52C09CD542B2EC5E4540,18)"

NOTICE: Bad zip newzip range: '{}' NOTICE: Ignore new zip range that is bad too: '{}' NOTICE: Zip range based on only considering city: '{}' NOTICE: stmt: WITH a AS

( SELECT *

FROM (SELECT f.*, ad.side, ad.zip, ad.fromhn, ad.tohn,

RANK() OVER(ORDER BY diff_zip(ad.zip,'111023653') + CASE WHEN lower(f.name) = lower('21ST') THEN 0 ELSE levenshtein_ignore_case(f.name, lower('21ST') ) END +

levenshtein_ignore_case(f.fullname, lower('21ST'
' ' COALESCE('St',)) )

+ CASE WHEN (greatest_hn(ad.fromhn,ad.tohn) % 2)::integer = ('2734' % 2)::integer THEN 0 ELSE 1 END + CASE WHEN '2734'::integer BETWEEN least_hn(ad.fromhn,ad.tohn) AND greatest_hn(ad.fromhn, ad.tohn)

THEN 0 ELSE 4 END + CASE WHEN lower('St') = lower(f.suftypabrv) OR lower('St') = lower(f.pretypabrv) THEN 0 ELSE 1 END + rate_attributes(NULL, f.predirabrv, '21ST', f.name , 'St', suftypabrv , NULL, sufdirabrv, prequalabr) )

As rank

FROM featnames As f INNER JOIN addr As ad ON (f.tlid = ad.tlid)

WHERE '36' = f.statefp AND '36' = ad.statefp

AND lower(f.name) = lower('21ST') AND ( ad.zip = ANY('{}'::varchar[]) ) ) AS foo ORDER BY rank LIMIT 3 )

SELECT * FROM (

SELECT DISTINCT ON (sub.predirabrv,sub.fename,COALESCE(sub.suftypabrv, sub.pretypabrv) ,sub.sufdirabrv,sub.place,s.stusps,sub.zip) sub.predirabrv as fedirp, sub.fename, COALESCE(sub.suftypabrv, sub.pretypabrv) as fetype, sub.sufdirabrv as fedirs, sub.place , s.stusps as state, sub.zip as zip, interpolate_from_address('2734', sub.fromhn, sub.tohn, sub.the_geom, sub.side) as address_geom, sub.sub_rating + least(coalesce(diff_zip('111023653' , sub.zip),0), 10)::integer + coalesce(levenshtein_ignore_case('ASTORIA', sub.place),5) as sub_rating, sub.exact_address as exact_address, sub.tohn, sub.fromhn FROM ( SELECT tlid, predirabrv, COALESCE(b.prequalabr
' ', ) b.name As fename, suftypabrv, sufdirabrv, fromhn, tohn,

side, zip, rate_attributes(NULL, predirabrv, '21ST', b.name , 'St', suftypabrv , NULL, sufdirabrv, prequalabr) + CASE WHEN '2734'::integer IS NULL OR b.fromhn IS NULL THEN 20 WHEN '2734'::integer ≥ least_hn(b.fromhn, b.tohn) AND '2734'::integer ⇐ greatest_hn(b.fromhn,b.tohn) AND ('2734'::integer % 2) = (to_number(b.fromhn,'99999999') % 2)::integer THEN 0 WHEN '2734'::integer ≥ least_hn(b.fromhn,b.tohn) AND '2734'::integer ⇐ greatest_hn(b.fromhn,b.tohn) THEN 2 ELSE ((1.0 - (least_hn('2734'::text,least_hn(b.fromhn,b.tohn)::text)::numeric / (greatest(1,greatest_hn('2734'::text,greatest_hn(b.fromhn,b.tohn)::text))) )) * 5)::integer + 5 END as sub_rating,'2734'::integer ≥ least_hn(b.fromhn,b.tohn) AND '2734'::integer ⇐ greatest_hn(b.fromhn,b.tohn) AND ('2734' % 2)::numeric::integer = (to_number(b.fromhn,'99999999') % 2) as exact_address, b.name, b.prequalabr, b.pretypabrv, b.tfidr, b.tfidl, b.the_geom, b.place FROM

(SELECT a.tlid, a.fullname, a.name, a.predirabrv, a.suftypabrv, a.sufdirabrv, a.prequalabr, a.pretypabrv,

b.the_geom, tfidr, tfidl, a.side , a.fromhn, a.tohn, a.zip, p.name as place

FROM a INNER JOIN edges As b ON (a.statefp = b.statefp AND a.tlid = b.tlid )

INNER JOIN faces AS f ON ('36' = f.statefp AND ( (b.tfidl = f.tfid AND a.side = 'L') OR (b.tfidr = f.tfid AND a.side = 'R' ) )) INNER JOIN place p ON ('36' = p.statefp AND f.placefp = p.placefp )

WHERE a.statefp = '36' AND b.statefp = '36'

) As b

ORDER BY 10 , 11 DESC LIMIT 20

) AS sub

JOIN state s ON ('36' = s.statefp)

ORDER BY 1,2,3,4,5,6,7,9

LIMIT 20) As foo ORDER BY sub_rating, exact_address DESC LIMIT 1

NOTICE: SELECT DISTINCT ON (sub.predirabrv,sub.fename,COALESCE(sub.suftypabrv, sub.pretypabrv) ,sub.sufdirabrv,coalesce(p.name,zip.city,cs.name,co.name),s.stusps,sub.zip) sub.predirabrv as fedirp, sub.fename, COALESCE(sub.suftypabrv, sub.pretypabrv) as fetype, sub.sufdirabrv as fedirs, coalesce(p.name,zip.city,cs.name,co.name)::varchar as place, s.stusps as state, sub.zip as zip, interpolate_from_address($1, sub.fromhn, sub.tohn, e.the_geom, sub.side) as address_geom, sub.sub_rating + least((coalesce(diff_zip($7 , sub.zip),0) *1.00/2)::integer, coalesce(levenshtein_ignore_case($7, sub.zip),0) ) + coalesce(least(levenshtein_ignore_case($3, coalesce(p.name,zip.city,cs.name,co.name)), levenshtein_ignore_case($3, coalesce(cs.name,co.name))),5) as sub_rating, sub.exact_address as exact_address FROM ( SELECT a.tlid, predirabrv, COALESCE(a.prequalabr
' ', ) a.name As fename, suftypabrv, sufdirabrv, fromhn, tohn,
side, a.statefp, zip, rate_attributes($5, a.predirabrv, $2, a.name , $4, a.suftypabrv , $6, a.sufdirabrv, a.prequalabr) + CASE WHEN $1::integer IS NULL OR b.fromhn IS NULL THEN 20 WHEN $1::integer ≥ least_hn(b.fromhn, b.tohn) AND $1::integer ⇐ greatest_hn(b.fromhn,b.tohn) AND ($1::integer % 2) = (to_number(b.fromhn,'99999999') % 2)::integer THEN 0 WHEN $1::integer ≥ least_hn(b.fromhn,b.tohn) AND $1::integer ⇐ greatest_hn(b.fromhn,b.tohn) THEN 2 ELSE ((1.0 - (least_hn($1::text,least_hn(b.fromhn,b.tohn)::text)::numeric / greatest(1,greatest_hn($1::text,greatest_hn(b.fromhn,b.tohn)::text)))) * 5)::integer + 5 END as sub_rating,$1::integer ≥ least_hn(b.fromhn,b.tohn) AND $1::integer ⇐ greatest_hn(b.fromhn,b.tohn) AND ($1 % 2)::numeric::integer = (to_number(b.fromhn,'99999999') % 2) as exact_address, a.name, a.prequalabr, a.pretypabrv FROM featnames a join addr b ON (a.tlid = b.tlid AND a.statefp = b.statefp ) WHERE a.statefp = '36' AND a.mtfcc LIKE 'S%' AND ( soundex($2) = soundex(a.name) OR ( (length($2) > 15 or (length($2) > 7 AND a.prequalabr > ) ) AND lower(a.fullname) LIKE lower(substring($2,1,15))
'%' ) OR numeric_streets_equal($2, a.name) ) ORDER BY 11 LIMIT 20 ) AS sub JOIN edges e ON ('36' = e.statefp AND sub.tlid = e.tlid AND e.mtfcc LIKE 'S%' ) JOIN state s ON ('36' = s.statefp) JOIN faces f ON ('36' = f.statefp AND (e.tfidl = f.tfid OR e.tfidr = f.tfid)) LEFT JOIN zip_lookup_base zip ON (sub.zip = zip.zip AND zip.statefp='36') LEFT JOIN place p ON ('36' = p.statefp AND f.placefp = p.placefp) LEFT JOIN county co ON ('36' = co.statefp AND f.countyfp = co.countyfp) LEFT JOIN cousub cs ON ('36' = cs.statefp AND cs.cosbidfp = sub.statefp co.countyfp f.cousubfp) WHERE ( (sub.side = 'L' and e.tfidl = f.tfid) OR (sub.side = 'R' and e.tfidr = f.tfid) ) ORDER BY 1,2,3,4,5,6,7,9 LIMIT 10
NOTICE: PREPARE query_base_geo(integer, varchar,varchar,varchar,varchar,varchar,varchar,geometry) As SELECT DISTINCT ON (sub.predirabrv,sub.fename,COALESCE(sub.suftypabrv, sub.pretypabrv) ,sub.sufdirabrv,coalesce(p.name,zip.city,cs.name,co.name),s.stusps,sub.zip) sub.predirabrv as fedirp, sub.fename, COALESCE(sub.suftypabrv, sub.pretypabrv) as fetype, sub.sufdirabrv as fedirs, coalesce(p.name,zip.city,cs.name,co.name)::varchar as place, s.stusps as state, sub.zip as zip, interpolate_from_address($1, sub.fromhn, sub.tohn, e.the_geom, sub.side) as address_geom, sub.sub_rating + least((coalesce(diff_zip($7 , sub.zip),0) *1.00/2)::integer, coalesce(levenshtein_ignore_case($7, sub.zip),0) ) + coalesce(least(levenshtein_ignore_case($3, coalesce(p.name,zip.city,cs.name,co.name)), levenshtein_ignore_case($3, coalesce(cs.name,co.name))),5) as sub_rating, sub.exact_address as exact_address FROM ( SELECT a.tlid, predirabrv, COALESCE(a.prequalabr
' ', ) a.name As fename, suftypabrv, sufdirabrv, fromhn, tohn,
side, a.statefp, zip, rate_attributes($5, a.predirabrv, $2, a.name , $4, a.suftypabrv , $6, a.sufdirabrv, a.prequalabr) + CASE WHEN $1::integer IS NULL OR b.fromhn IS NULL THEN 20 WHEN $1::integer ≥ least_hn(b.fromhn, b.tohn) AND $1::integer ⇐ greatest_hn(b.fromhn,b.tohn) AND ($1::integer % 2) = (to_number(b.fromhn,'99999999') % 2)::integer THEN 0 WHEN $1::integer ≥ least_hn(b.fromhn,b.tohn) AND $1::integer ⇐ greatest_hn(b.fromhn,b.tohn) THEN 2 ELSE ((1.0 - (least_hn($1::text,least_hn(b.fromhn,b.tohn)::text)::numeric / greatest(1,greatest_hn($1::text,greatest_hn(b.fromhn,b.tohn)::text)))) * 5)::integer + 5 END as sub_rating,$1::integer ≥ least_hn(b.fromhn,b.tohn) AND $1::integer ⇐ greatest_hn(b.fromhn,b.tohn) AND ($1 % 2)::numeric::integer = (to_number(b.fromhn,'99999999') % 2) as exact_address, a.name, a.prequalabr, a.pretypabrv FROM featnames a join addr b ON (a.tlid = b.tlid AND a.statefp = b.statefp ) WHERE a.statefp = '36' AND a.mtfcc LIKE 'S%' AND ( soundex($2) = soundex(a.name) OR ( (length($2) > 15 or (length($2) > 7 AND a.prequalabr > ) ) AND lower(a.fullname) LIKE lower(substring($2,1,15))
'%' ) OR numeric_streets_equal($2, a.name) ) ORDER BY 11 LIMIT 20 ) AS sub JOIN edges e ON ('36' = e.statefp AND sub.tlid = e.tlid AND e.mtfcc LIKE 'S%' ) JOIN state s ON ('36' = s.statefp) JOIN faces f ON ('36' = f.statefp AND (e.tfidl = f.tfid OR e.tfidr = f.tfid)) LEFT JOIN zip_lookup_base zip ON (sub.zip = zip.zip AND zip.statefp='36') LEFT JOIN place p ON ('36' = p.statefp AND f.placefp = p.placefp) LEFT JOIN county co ON ('36' = co.statefp AND f.countyfp = co.countyfp) LEFT JOIN cousub cs ON ('36' = cs.statefp AND cs.cosbidfp = sub.statefp co.countyfp f.cousubfp) WHERE ( (sub.side = 'L' and e.tfidl = f.tfid) OR (sub.side = 'R' and e.tfidr = f.tfid) ) ORDER BY 1,2,3,4,5,6,7,9 LIMIT 10

NOTICE: EXECUTE query_base_geo(2734,'21ST','ASTORIA','St',NULL,NULL,'111023653',NULL); NOTICE: DEALLOCATE query_base_geo; Total query runtime: 14172 ms. 3 rows retrieved.

comment:8 by darkblueb, 12 years ago

Ravi - what sort of Shared Buffer settings do you have in postgresql.conf? working memory ? The default Postgres settings are very minimal

comment:9 by robe, 12 years ago

Ravi,

Okay that seems to agree with around what I am getting for that address. I get 19 seconds for no limit and 18 seconds for limit. But the address it returns is completely wrong. I'll put in a separate ticket for this. I think it might be the Astoria throwing it off.

since if I run this:

select pprint_addy(addy), ST_AsText(geomout), rating FROM geocode('2601 24TH AVE,  NY 111022337',2);

-- it returns this:
0 24th Ave, New York, NY 11102	POINT(-73.9184479285714 40.7744440102041)	15

Which is much more accurate.

If I reverse geocode:

-- answer tiger geocoder is returning (with Astoria)
SELECT pprint_addy(addy[1]),  array_to_string(street, '|') from reverse_geocode('POINT(-73.9750505 40.613036)', true) As g;

-- 
6148 24th Ave, NY 11204	1404 - 1416 Dahill Rd|2301 - 2399 62nd St

— to rule out a tiger data issue, I reverse geocoded what google returns. — though the place seems suspiciously missing, thought that could be a reverse geocode issue.

SELECT pprint_addy(addy[1]),  array_to_string(street, '|') As cross_streets from reverse_geocode('POINT(-73.918402 40.774441)', true) As g;
---
2607 24th Ave, NY 11102	2301 - 2399 26th St|2300 - 2398 27th St

comment:10 by robe, 12 years ago

forgot to mention without astoria it runs in about 172 ms for me.

comment:11 by robe, 12 years ago

Resolution: fixed
Status: assignedclosed

I'm closing this ticket out and continuing on #1384.

Note: See TracTickets for help on using tickets.