Opened 3 years ago

Closed 3 years ago

#4826 closed defect (fixed)

Geocoder gives goofy resutls for 1 Main St, Hanover, MA

Reported by: robe Owned by: robe
Priority: medium Milestone: PostGIS 3.1.2
Component: tiger geocoder Version: 3.1.x
Keywords: Cc:

Description (last modified by robe)

tiger geocoder as mentioned on IRC

http://irclogs.geoapt.com/postgis/%23postgis.2021-01-05.log

sorry for the repeat from last night, but if anyone's here and knows where I can look for an answer - my TIGER PostGIS setup is returning weird results? when I run geocode('1 Main St, Hanover, MA 02239') it returns a good result, but when I skip the zip code, it returns a result in Andover, which is obviously wrong even though the correct result does exist

One observation is that Hanover is not in the place table, but is in the cousub table.

-- yields right answer
SELECT *
FROM geocode('1 Main St, Hanover, MA 02339',1);

/** (1,,Main,St,,,Hanover,MA,02339,t,,)	0101000020AD1000007FB2EFD615B651C06A1D3D91020F4540	0
**/
-- yields wrong answer
SELECT *
FROM geocode('1 Main St, Hanover, MA',1);

/**
(1,E,Main,St,,,Ware,MA,01082,t,,)	0101000020AD1000002FB5DC320F0F52C0E0E8EF375B214540	10
**/

Change History (12)

comment:1 by robe, 3 years ago

Description: modified (diff)

comment:2 by robe, 3 years ago

Adding the following fixes the issue. Main St:

INSERT INTO tiger_data.ma_zip_lookup_base(zip, state, county, city, statefp)
SELECT '02339', 'MA', c.name As  county, cs.name AS city, cs.statefp
FROM tiger.county AS c INNER JOIN tiger_data.ma_cousub AS cs ON (c.countyfp = cs.countyfp AND c.statefp = cs.statefp)
WHERE cs.name = 'Hanover';

now yields right answer:

SELECT *
FROM geocode('1 Main St, Hanover, MA',1);

(1,,Main,St,,,Hanover,MA,02339,t,,)	0101000020AD1000007FB2EFD615B651C06A1D3D91020F4540	3

comment:3 by robe, 3 years ago

Okay this is a more generic query that should handle all these. I put in a custom table to distinguish from main. If this works well I'll incorporate into the main prep logic add add as part of the main state lookups.

It basically adds in all the county sub areas zips that aren't already present. So should handle all cases where the place is a county sub.

CREATE TABLE tiger_data.custom_zip_lookup_base_custom( CONSTRAINT pk_custom_zip_state_loc_city PRIMARY KEY (zip, state, county, city, statefp)) INHERITS (tiger.zip_lookup_base);

INSERT INTO tiger_data.custom_zip_lookup_base_custom(zip, state, county, city, statefp)
WITH csub AS (SELECT DISTINCT COALESCE(me.zipl, me.zipr) AS zip, s.stusps AS state, c.name As  county, cs.name AS city, cs.statefp
FROM tiger_data.ma_faces AS f INNER JOIN tiger.county AS c ON (c.countyfp = f.countyfp AND c.statefp = f.statefp)
			  INNER JOIN tiger_data.ma_cousub AS cs ON (f.cousubfp = cs.cousubfp AND cs.countyfp = c.countyfp AND cs.statefp = c.statefp)
	INNER JOIN tiger.state As s ON s.statefp = f.statefp
			INNER JOIN tiger_data.ma_edges AS me ON me.tfidl = f.tfid
		WHERE  (me.zipl IS NOT NULL or me.zipr IS NOT NULL) )
SELECT csub.zip, csub.state, csub.county, csub.city, csub.statefp
FROM csub LEFT JOIN tiger.zip_lookup_base AS zb ON (csub.city =zb.city AND csub.statefp = zb.statefp  AND zb.zip = csub.zip)
WHERE zb.zip IS NULL;
Version 0, edited 3 years ago by robe (next)

comment:4 by turova, 3 years ago

Thanks for the quick turnaround! I think the city now works correctly, but the result as a whole becomes a mix of the old (incorrect) result and the new one. e.g.

# SELECT ST_X(ST_SnapToGrid(g.geomout, 0.00001)) As longitude, ST_Y(ST_SnapToGrid(g.geomout, 0.00001)) As latitude,addy FROM geocode('1 Main St, Hanover, MA',1) as g;
     longitude      | latitude |                addy                 
--------------------+----------+-------------------------------------
 -70.84437000000001 | 42.15834 | (1,,Main,St,,,Hanover,MA,02061,t,,)

The city looks right, but the coordinates and zip code point to what I'm guessing would be the result if the custom zip lookup table wasn't present - Jacobs Lane, Norwell, MA.

(ST_SnapToGrid also doesn't seem to abbreviate the result properly the way I'm using it, but that's a question I'll address separately)

comment:5 by turova, 3 years ago

Reverse engineering your solution a bit, I think I see where this might go wrong, though I don't yet have a clear understanding of how to solve it. I think the issue is that cities can have multiple zip codes and having the wrong one for the street can give a result that's in a different city. e.g. for the example we've been looking at, I see this:

# SELECT * FROM zip_lookup_base WHERE city = 'Hanover';
  zip  | state |   county   |  city   | statefp 
-------+-------+------------+---------+---------
... Unrelated rows skipped ...
 02061 | MA    | Plymouth   | Hanover | 25
 02239 | MA    | Plymouth   | Hanover | 25
 02339 | MA    | Plymouth   | Hanover | 25

So Hanover MA is partially in 3 different zip codes (and shares 02061 with Norwell, which is the city that has a 1 Main St at that zip code). Hanover's 1 Main St is in 02339 and the correct result is returned if 02339 (or even 02239, which I can't find on a map) is explicitly passed. If 02061 is passed explicitly or implicitly via the new lookup table, the result comes back as Norwell.

I believe this means that you can't have a zip→city lookup with the current geocode function because it will only be correct for some of the addresses. Slow solutions would be to look the address up with each zip code and then return the one with the matching city, or to have a lookup table of city,street→zip. Adjusting geocode() to use the zip code to find the proper city, but then somehow correctly adjust the zip code during the search sounds like a possible solution if the data allows it.

If having a big lookup table ends up taking up 2x the space, that's not a big deal to me as long as it returns the correct results. Likewise, if there's a short-term way to do 2 sequential lookups that would get me the correct result, I could make that work as well, but hopefully you have a better idea of how to wrangle this data to get the desired result.

comment:6 by turova, 3 years ago

I actually do see a correct result show up in geocode, when I print all the results:

# SELECT rating, ST_X(g.geomout), ST_Y(g.geomout), addy from geocode('1 Main st, Hanover, MA', 100) as g;
 rating |        st_x        |        st_y        |                    addy                     
--------+--------------------+--------------------+---------------------------------------------
      3 | -70.84580124395251 |  42.11770418442778 | (1,,Main,St,,,Norwell,MA,02339,t,,)
      3 | -70.84580124395251 |  42.11770418442778 | (1,,Main,St,,,Hanover,MA,02339,t,,)
      3 | -70.84437177401414 |  42.15834177639814 | (1,,Main,St,,,Hanover,MA,02061,t,,)
      9 | -70.84437177401414 |  42.15834177639814 | (1,,Main,St,,,Norwell,MA,02061,t,,)
     10 | -72.59865296286365 | 42.215393856763214 | (1,N,Main,St,,,Chicopee,MA,01075,t,,)
     10 |  -72.9734286784644 |  42.49264706267212 | (1,W,Main,St,,,Windsor,MA,01026,t,,)
     10 | -72.87441852410113 |  42.50110417638284 | (1,E,Main,St,,,Windsor,MA,01070,t,,)
     12 |  -72.9734286784644 |  42.49264706267212 | (1,W,Main,St,,,Plainfield,MA,01026,t,,)
     12 | -72.87441852410113 |  42.50110417638284 | (1,E,Main,St,,,Plainfield,MA,01070,t,,)
     13 | -72.87736680839346 |  42.23484564776451 | (1,E,Main,St,,,Huntington,MA,01050,t,,)
     14 |  -72.9734286784644 |  42.49264706267212 | (1,W,Main,St,,,Worthington,MA,01026,t,,)
     14 |  -72.9734286784644 |  42.49264706267212 | (1,W,Main,St,,,Cummington,MA,01026,t,,)
     14 |  -72.9734286784644 |  42.49264706267212 | (1,W,Main,St,,,Chesterfield,MA,01026,t,,)
     15 | -72.59865296286365 | 42.215393856763214 | (1,N,Main,St,,,"South Hadley",MA,01075,t,,)
(14 rows)

The the result I want is the 2nd one in this case, though I saw it as the 3rd result when set the second argument of geocode to 3. I think the new lookup table will now give an extra result for any city sharing the related zip code, adding the correct location to the results list, but also making it programmatically indistinguishable from the incorrect result.

comment:7 by robe, 3 years ago

Can you try this:

DELETE FROM tiger_data.custom_zip_lookup_base_custom
WHERE zip IN(SELECT zip FROM tiger.zip_lookup_base WHERE tableoid::regclass::text != 'tiger_data.custom_zip_lookup_base_custom');

and then retest. Hopefully that should get rid of all the ones that are unlikely.

comment:8 by turova, 3 years ago

@robe, that did get rid of ~580 rows, but the top 4 results are still the 4 variations of correct/incorrect city and zip.

comment:9 by robe, 3 years ago

hmm which one are you testing with? After I ran the delete and tried this:

 SELECT ST_X(ST_SnapToGrid(g.geomout, 0.00001)) As longitude, ST_Y(ST_SnapToGrid(g.geomout, 0.00001)) As latitude,addy FROM geocode('1 Main St, Hanover, MA',1) as g;

outputs

-70.84508000000001	42.117270000000005	(1,,Main,St,,,Hanover,MA,02339,t,,)
 SELECT ST_X(ST_SnapToGrid(g.geomout, 0.00001)) As longitude, ST_Y(ST_SnapToGrid(g.geomout, 0.00001)) As latitude,addy FROM geocode('1 Main St, Hanover, MA 02339',1) as g;

outputs:

-70.84508000000001	42.117270000000005	(1,,Main,St,,,Hanover,MA,02339,t,,)

comment:10 by robe, 3 years ago

Try replacing your geocode_address with this one - make sure to prefix with tiger.

[ba2f659c001142687caf05a3999501ecd37e307a/git]

With the above patch, my results for:

SELECT rating, ST_X(g.geomout), ST_Y(g.geomout), addy from geocode('1 Main st, Hanover, MA', 100) as g;

Look like:

3	-70.84508298309264	42.11726584896208	(1,,Main,St,,,Hanover,MA,02339,t,,)
9	-70.84437177401415	42.15834177639816	(1,,Main,St,,,Norwell,MA,02061,t,,)
10	-72.23530265382281	42.260596267836945	(1,E,Main,St,,,Ware,MA,01082,t,,)
11	-72.79730604439531	42.43541221562752	(1,S,Main,St,,,Goshen,MA,01096,t,,)
12	-72.87424511880401	42.500985260983285	(1,E,Main,St,,,Plainfield,MA,01070,t,,)
14	-72.40080978104714	42.276919680561925	(1,S,Main,St,,,Belchertown,MA,01007,t,,)
14	-72.65981893720061	42.328655966035754	(1,S,Main,St,,,Northampton,MA,01062,t,,)
14	-72.40355779373047	42.27997542236744	(1,N,Main,St,,,Belchertown,MA,01007,t,,)
14	-72.3321618003692	42.18801537873117	(1,S,Main,St,,,"Palmer Town",MA,01079,t,,)
15	-72.59183570986102	42.214047537431945	(1,S,Main,St,,,"South Hadley",MA,01075,t,,)
15	-72.73052592739397	42.39331433997838	(1,N,Main,St,,,Williamsburg,MA,01096,t,,)
15	-72.59865296286365	42.215393856763214	(1,N,Main,St,,,"South Hadley",MA,01075,t,,)
21	-70.75985323124343	42.16990860652304	(,,Main,St,,,Scituate,MA,02061,t,,)

comment:11 by pramsey, 3 years ago

Milestone: PostGIS 3.1.1PostGIS 3.1.2

comment:12 by Regina Obe <lr@…>, 3 years ago

Resolution: fixed
Status: newclosed

In 3b6b318/git:

Better answers when no zip is provided. Closes #4826 for PostGIS 3.1.2

Note: See TracTickets for help on using tickets.