How to set up address search by coordinates (and where to get the necessary directory)





In the spring, we added the “Reverse Geocoding” feature, also called “Address by Coordinates”, to the DaData.ru API. The name says it all: the method takes geographic coordinates and returns data about the address.



Yandex offers a solid product with the same functionality, called Geocoder. But the Yandex service is free only for open non-commercial projects. The standard plan, from 120,000 ₽ per year, does not suit everyone.



We thought that if we made a free or low-cost alternative to Geocoder, developers would surely thank us. And they did. In this article I’ll explain how “Address by Coordinates” works: how we set up the search, put together the directory, and packaged it all into a ready-made method.



Where we get the data and how we search for addresses



When we approached the task, we studied ready-made solutions: where to get a directory of addresses with coordinates, and how to then search it for geographic objects. It turned out we didn’t have to go far for the right tools.



We take address objects from FIAS, the Federal Information Address System. It is the most comprehensive of the open, official address directories. We have already written about it in detail on Habr, and four facts matter here:





The address objects downloaded from FIAS, together with their IDs, are the basis of our directory for reverse geocoding.



We download coordinates from OpenStreetMap (OSM). OSM is a project with a free license: enthusiasts collect the coordinates of various objects and publish them for everyone.



In simple terms, OSM is a set of points, lines, and polygons on a map. Each object has its own description, type and set of coordinates. OSM data for Russia is available at needgeo.com, osm.sbin.ru/osm_dump/ and osmosis.svimik.com/latest/.





The list of sources is published on a dedicated page of the project's wiki



The dump consists of PBF files: this format is used instead of XML because it is more compact. Converting PBF to OSM XML is easy; a number of community-approved utilities can do it.



For our own directory, we take address objects from FIAS and then look up their coordinates in OSM. If we find them, we save the combined data. The result is an intersection of FIAS and OSM.
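The intersection described above can be pictured with a minimal sketch. It assumes, purely for illustration, that FIAS objects and OSM coordinates can be matched by an identical address string; the names `DirectoryRow` and `intersect` and the second, placeholder ID are hypothetical, and the real matching is more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryRow:
    fias_id: str
    address: str
    lat: float
    lon: float


def intersect(fias_objects, osm_coordinates):
    """Keep only the FIAS objects for which OSM has coordinates.

    fias_objects:    {fias_id: address}
    osm_coordinates: {address: (lat, lon)}  # matching by string is an assumption
    """
    rows = []
    for fias_id, address in fias_objects.items():
        coords = osm_coordinates.get(address)
        if coords is None:
            continue  # nothing found in OSM: the object stays out of the directory
        lat, lon = coords
        rows.append(DirectoryRow(fias_id, address, lat, lon))
    return rows


fias = {
    "a8b6a52f-e96d-4ec3-a0ff-641013ab0445": "Bulatnikovo village, 103 Central Street",
    "hypothetical-id-without-coordinates": "Some address that OSM does not know",
}
osm = {"Bulatnikovo village, 103 Central Street": (55.558773, 37.667103)}
directory = intersect(fias, osm)
```

Only the first object ends up in `directory`, because it is the only one present in both sources.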



All this is wonderful, but there is one problem: data quality in OSM is uneven. The coordinates of objects often do not match reality. For example, the polygons for regions and districts are adequate, but for cities and anything below, not so much.





Polygons are closed shapes that delimit areas on a map. They consist of a linked set of points with coordinates. Polygons mark the boundaries of regions, districts, cities, and even buildings



The main work, by a wide margin, is collecting adequate data from OSM and weeding out the defects. The task is so large that I have given it a separate section of the article.



We also take from OSM the houses that are missing from FIAS. As I said above, FIAS lacks tens of thousands of houses. This is not even a problem, just a fact of life. So we supplement our directory with houses from OSM, but only those whose street exists in FIAS. Buildings that came from OSM have no FIAS ID, so we identify them as the parent's FIAS code + house number.



We search the directory with the help of the wonderful Lucene, our longtime helper. Thanks for the tip go to a knowledgeable Indian author who wrote the post Indexing Geographical Data With Lucene (a good companion read is A dive into spatial search algorithms, about the k-d trees on which the search algorithm is built).



Once we found out about Lucene, the search problem practically solved itself. All that remained was the finishing work.



  1. We loaded our directory of coordinates and addresses into Lucene and got a search index. To keep it light, we stripped almost everything from it, leaving only the address IDs and coordinates.
  2. We set up a search over the index: coordinates in, IDs of the found address objects out. The search returns no other information, because the index is so heavily trimmed.
  3. We enriched the output by loading data from the "big" FIAS by the found IDs. We add a lot, from the one-line address everyone needs to a flag marking regional capitals.
  4. We worked out how to sort and return the objects found.
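The split between a slim index and a "big" directory can be sketched as follows. This is not Lucene: a brute-force haversine scan stands in for the spatial index so that the example stays self-contained, and the names `INDEX`, `FULL_DIRECTORY`, `search_ids` and `enrich` are hypothetical. The two sample rows are the ones from the table later in the article.

```python
import math

EARTH_RADIUS_M = 6_371_000


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


# Step 1: the slim index holds nothing but IDs and coordinates.
INDEX = [
    ("a8b6a52f-e96d-4ec3-a0ff-641013ab0445", 55.558773, 37.667103),
    ("8c925e61-9173-48b3-999e-dc85c86d89e7", 55.737096, 37.597190),
]

# The "big" directory, keyed by the same IDs, holds everything else.
FULL_DIRECTORY = {
    "a8b6a52f-e96d-4ec3-a0ff-641013ab0445": {"address": "Bulatnikovo village, Central St., 103"},
    "8c925e61-9173-48b3-999e-dc85c86d89e7": {"address": "Moscow, Turchaninov lane, 6 bldg. 2"},
}


def search_ids(lat, lon, radius_m):
    """Step 2: coordinates in, IDs out, nearest first."""
    hits = []
    for object_id, obj_lat, obj_lon in INDEX:
        distance = haversine_m(lat, lon, obj_lat, obj_lon)
        if distance <= radius_m:
            hits.append((distance, object_id))
    return [object_id for _, object_id in sorted(hits)]


def enrich(ids):
    """Step 3: attach the full data by ID."""
    return [{"fias_id": object_id, **FULL_DIRECTORY[object_id]} for object_id in ids]


results = enrich(search_ids(55.5588, 37.6671, radius_m=100))
```

Keeping the index this small means the spatial search touches only what it needs, and the heavier data is fetched once, by ID, for the few objects that matched.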


So far everything looks simple, but this is only a small part of the work. No search by coordinates would work if we had not compiled a decent directory.



How we built the database of coordinates and addresses



First, let me be upfront: you will not be able to quickly build a similar directory after reading this article. We have been collecting ours since 2014 and keep adding to it. I’ll tell you about this damn long road.



The hardest part of compiling the directory is sorting out the coordinates that came from OSM. At the start, we verified them however we could, including by hand. The main goal back then was to get reliable points in large cities and build a reference directory from them. Now that there are many such points, new data almost never needs manual checking. In a single batch we add 200,000-300,000 addresses with coordinates to the reference directory, and here is how we do it.



We build complete addresses from OSM tags. In OSM dumps, the parts of an address are scattered across different tags:





We run through the tags and assemble the full address from them: Bulatnikovo village, 103 Central Street.
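A minimal sketch of assembling an address from tags is shown here. The keys `addr:city`, `addr:place`, `addr:street` and `addr:housenumber` are standard OSM address tags, but which tags the real pipeline reads, and in what order it joins them, is an assumption; `full_address` is a hypothetical name.

```python
def full_address(tags):
    """Join the address parts found in OSM tags into one line."""
    parts = []

    # The settlement may sit in addr:city or, for places without streets, in addr:place.
    place = tags.get("addr:city") or tags.get("addr:place")
    if place:
        parts.append(place)

    street = tags.get("addr:street")
    house = tags.get("addr:housenumber")
    if street and house:
        parts.append(f"{house} {street}")
    elif street:
        parts.append(street)
    elif house:
        parts.append(house)

    return ", ".join(parts)


osm_tags = {
    "building": "yes",  # non-address tags are simply ignored
    "addr:city": "Bulatnikovo village",
    "addr:street": "Central Street",
    "addr:housenumber": "103",
}
address = full_address(osm_tags)
```

The resulting string is deliberately rough: it only has to be good enough for the standardization step that follows.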



We run each new address through the DaData standardization API. The service converts addresses to a single format, “as in FIAS”:





The addresses that come back from the API are clean enough to put on a letter or parcel right away.

Before standardization | After
Bulatnikovo village, 103 Central Street | 142718, Moscow region, Leninsky district, Bulatnikovo village, Central St., 103 (FIAS code: a8b6a52f-e96d-4ec3-a0ff-641013ab0445)





We store standardized houses, streets and settlements as a single point each. For a street or a settlement, this point is its center. As a result, all address objects sit in one table that holds the address, FIAS ID, latitude and longitude.

Address | FIAS ID | Latitude | Longitude
142718, Moscow region, Leninsky district, Bulatnikovo village, Central St., 103 | a8b6a52f-e96d-4ec3-a0ff-641013ab0445 | 55.558773 | 37.667103
119034, Moscow, Turchaninov lane, 6 bldg. 2 | 8c925e61-9173-48b3-999e-dc85c86d89e7 | 55.737096 | 37.597190

We analyze the addresses that DaData could not standardize. The service flags addresses it could not match to FIAS. We check them manually, and there are several possible cases.



  1. The address came not in the proper tags of the OSM dump but goodness knows where. We have seen empty address tags, a city in the street tag, and much more.
  2. OSM holds an exotic object such as a playground, a college football field or even a cemetery. FIAS has nothing of the kind, and such results are no use for our purposes. We simply filter these objects out.
  3. An error that is not really an error. For example, OSM gave us a city district that does not exist in FIAS. Or in OSM the object sits in a settlement, but in FIAS that settlement was merged into a city and removed. In that case we adjust the algorithm for the loaded data and run it again.




We parsed the dump and found the tags in a mess



We check how plausible the loaded coordinates are. To do this, a special utility checks whether the coordinates of the new object fall inside the polygon of its parent region or district. If the address says the object is in the Omsk region, it had better land inside that region's polygon. Falling inside the city polygon is not required: not every city is outlined accurately in OSM, and for many the data is never updated.



We load the reference polygons from OSM and store them as is, in GeoJSON format. To choose which polygon to test a point against, we look in a separate table that maps KLADR code prefixes to polygon IDs: find the KLADR code for the address, and you see which polygon to use.





The KLADR code is a unique identifier that was used before FIAS. Plenty of services can find this code for an address
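The lookup of a polygon by KLADR prefix might look like the sketch below. The table contents, the polygon IDs and the "longest matching prefix wins" rule are all assumptions made for illustration; the only fact relied on is that a KLADR code starts with the region code, followed by the district code.

```python
# Hypothetical lookup table: KLADR code prefix -> ID of the reference polygon.
POLYGON_BY_KLADR_PREFIX = {
    "55": "polygon-of-region-55",          # a whole region
    "55002": "polygon-of-district-55002",  # one district inside that region
}


def polygon_for(kladr_code):
    """Return the polygon ID for the longest prefix that matches the code."""
    for length in range(len(kladr_code), 0, -1):
        polygon_id = POLYGON_BY_KLADR_PREFIX.get(kladr_code[:length])
        if polygon_id is not None:
            return polygon_id
    return None  # no reference polygon known for this address


district_polygon = polygon_for("5500200000000")
region_polygon = polygon_for("5500100000000")
```

With this rule, an address gets the most specific polygon available and falls back to the region when the district has none.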



The utility lets an object sit up to 1,700 meters outside the polygon. We added this rule because of highways, which often cross the borders of a region. But a distance of more than 1,700 meters is a sign of an error, our statistics say.
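A minimal sketch of such a check, under stated assumptions: the polygon is a single outer ring given as (lat, lon) pairs (note that GeoJSON itself stores [lon, lat]), holes are ignored, and distances are measured in a flat local projection centered on the tested point, which is adequate at a scale of a few kilometers. All function names are hypothetical; this is not the utility itself.

```python
import math

EARTH_RADIUS_M = 6_371_000
TOLERANCE_M = 1_700


def to_local_xy(lat, lon, lat0, lon0):
    """Project a point to meters on a flat plane centered at (lat0, lon0)."""
    x = math.radians(lon - lon0) * math.cos(math.radians(lat0)) * EARTH_RADIUS_M
    y = math.radians(lat - lat0) * EARTH_RADIUS_M
    return x, y


def edges(ring):
    """Yield consecutive vertex pairs, closing the ring."""
    return zip(ring, ring[1:] + ring[:1])


def origin_in_ring(ring):
    """Ray casting: is the point (0, 0) inside the ring?"""
    inside = False
    for (x1, y1), (x2, y2) in edges(ring):
        if (y1 > 0) != (y2 > 0):  # the edge crosses the horizontal ray
            x_cross = x1 - y1 * (x2 - x1) / (y2 - y1)
            if x_cross > 0:
                inside = not inside
    return inside


def origin_to_segment(x1, y1, x2, y2):
    """Distance from (0, 0) to the segment (x1, y1)-(x2, y2)."""
    dx, dy = x2 - x1, y2 - y1
    length_sq = dx * dx + dy * dy
    t = 0.0 if length_sq == 0 else max(0.0, min(1.0, -(x1 * dx + y1 * dy) / length_sq))
    return math.hypot(x1 + t * dx, y1 + t * dy)


def fits_polygon(lat, lon, ring_latlon, tolerance_m=TOLERANCE_M):
    """True if the point is inside the ring or within tolerance_m of its border."""
    ring = [to_local_xy(p_lat, p_lon, lat, lon) for p_lat, p_lon in ring_latlon]
    if origin_in_ring(ring):
        return True
    nearest = min(origin_to_segment(*a, *b) for a, b in edges(ring))
    return nearest <= tolerance_m


# A made-up square "region", one degree on each side.
REGION = [(55.0, 73.0), (55.0, 74.0), (56.0, 74.0), (56.0, 73.0)]
```

For this square, a point about 1.1 km north of the border (`fits_polygon(56.01, 73.5, REGION)`) still passes, while one about 2.2 km out (`fits_polygon(56.02, 73.5, REGION)`) is rejected.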



This is the end of the test for cities and streets.



House coordinates we check once more, and more strictly. The same utility comes into play again, and this is what it does.



  1. It takes the address of the new house and finds its neighbors in the reference directory.
  2. Using the coordinates, it calculates the distance between the unverified new house and its reliable neighbors.




Finding neighbors is easy: 1. We take the new house and find the FIAS ID of its parent. 2. We select from the reference directory the houses whose parent has the same FIAS ID



Only houses that stand no more than 150 meters from reliable neighbors pass the check. Moreover, each newly approved house is taken into account when the following ones are analyzed. Here's how it works.



Suppose the reference directory holds houses No. 1, 2 and 3 on Kommunarov Street. The new data brings houses No. 5, 6 and 7 on the same street. Judging by the coordinates, the new houses are close by. The utility sees that house No. 5 is next to houses No. 1, 2 and 3 and adds it to the reference directory. Houses No. 6 and 7 are then checked too, now with house No. 5 counted among the reliable ones.
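The Kommunarov Street example can be replayed with a small sketch. Two things are assumptions here: that being within 150 meters of at least one reliable neighbor is enough, and that the check repeats in passes until no more houses are approved. The parent ID, the coordinates and the name `approve_houses` are made up for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000
MAX_NEIGHBOR_DISTANCE_M = 150


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in meters."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def approve_houses(reference, candidates):
    """Approve candidates that have a trusted neighbor with the same parent nearby.

    Every house is a tuple (parent_fias_id, house_number, lat, lon).
    Each approved house becomes a trusted neighbor for the remaining ones.
    """
    trusted = list(reference)
    pending = list(candidates)
    approved = []
    progress = True
    while progress and pending:
        progress = False
        still_pending = []
        for house in pending:
            parent, _, lat, lon = house
            has_neighbor = any(
                t_parent == parent
                and haversine_m(lat, lon, t_lat, t_lon) <= MAX_NEIGHBOR_DISTANCE_M
                for t_parent, _, t_lat, t_lon in trusted
            )
            if has_neighbor:
                trusted.append(house)
                approved.append(house)
                progress = True
            else:
                still_pending.append(house)
        pending = still_pending
    return approved


STREET = "hypothetical-fias-id-of-kommunarov-street"
reference = [(STREET, "1", 55.0000, 37.0), (STREET, "2", 55.0005, 37.0), (STREET, "3", 55.0010, 37.0)]
candidates = [
    (STREET, "7", 55.0040, 37.0),   # reachable only after house 6 is approved
    (STREET, "6", 55.0030, 37.0),   # reachable only after house 5 is approved
    (STREET, "5", 55.0020, 37.0),   # about 111 m from house 3
    (STREET, "99", 55.1000, 37.0),  # kilometers away: stays rejected
]
approved_numbers = [number for _, number, _, _ in approve_houses(reference, candidates)]
```

Here houses 5, 6 and 7 are approved one after another, each relying on the previous one, while house 99 never finds a neighbor.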



And then the fate of the data that came from OSM is decided:





We split the verified objects into two parts. They go into different tables of our reference directory.





The first table holds all objects with a FIAS ID above house level: regions, settlements, streets. The second holds houses, each with a link to its parent in the first table



Two tables are needed to assign keys to houses that are missing from FIAS. They have no FIAS code of their own, so here is what we do:





As a result, a building without a FIAS code is identified by the key “parent's FIAS ID + house number”.
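A sketch of such a key, assuming a colon as the separator and a lowercased, trimmed house number; the article only says that the key combines the parent's FIAS ID and the house number, so the exact format and the name `house_key` are hypothetical.

```python
def house_key(parent_fias_id, house_number, own_fias_id=None):
    """Return the key of a house in the second table.

    Houses known to FIAS keep their own FIAS ID. Houses that came only
    from OSM get a composite key built from the parent street or
    settlement and the house number.
    """
    if own_fias_id:
        return own_fias_id
    normalized_number = house_number.strip().lower()
    return f"{parent_fias_id}:{normalized_number}"


PARENT = "hypothetical-fias-id-of-the-street"
osm_only_key = house_key(PARENT, " 12A ")
fias_key = house_key(PARENT, "103", own_fias_id="a8b6a52f-e96d-4ec3-a0ff-641013ab0445")
```

Because the parent always lives in the first table, the composite key stays resolvable even though the house itself is unknown to FIAS.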



The directory is ready; all that remains is to test it. Overnight we run the service's functional tests and check performance. We measure speed in Moscow by requesting all houses within a radius of three kilometers, just to be sure. And of course, everything is covered with autotests.



The main thing is that an update must not make things worse.



Reverse geocoding through the eyes of the user



As input, the method takes three parameters: coordinates, the number of results, and the search radius. The default radius is 100 meters, the maximum is one kilometer. The exact value is set in the settings.



curl -X POST \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Authorization: Token ${API_KEY}" \
  -d '{ "lat": 55.878, "lon": 37.653, "radius_meters": 50 }' \
  https://suggestions.dadata.ru/suggestions/api/4_1/rs/geolocate/address





The method returns the objects it found: houses, streets and settlements. It sorts them in descending order of precision.



  1. Houses.
  2. Streets.
  3. Settlements.
  4. Cities.


Then it sorts once more, by distance from the given coordinates. If the method finds four houses and a street, the houses come first, ordered by distance from the given point. The street comes after them.
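The two-level ordering can be expressed as a single sort with a composite key. This is a minimal sketch: the field names `type` and `distance_m` and the type labels are hypothetical, chosen only to mirror the list above.

```python
# Lower rank means a more precise object, as in the list above.
TYPE_RANK = {"house": 0, "street": 1, "settlement": 2, "city": 3}


def order_results(found):
    """Sort by object type first, then by distance from the requested point."""
    return sorted(found, key=lambda obj: (TYPE_RANK[obj["type"]], obj["distance_m"]))


found = [
    {"name": "street", "type": "street", "distance_m": 5},
    {"name": "house C", "type": "house", "distance_m": 40},
    {"name": "house A", "type": "house", "distance_m": 12},
    {"name": "house B", "type": "house", "distance_m": 25},
]
ordered_names = [obj["name"] for obj in order_results(found)]
```

The street ends up last even though it is the closest object, because type takes priority over distance.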



After all this shuffling, the method finally returns the objects it found.



{
  "suggestions": [
    {
      "value": " ,  ,  11",
      "unrestricted_value": " ,  ,  11",
      "data": {...}
    },
    {
      "value": " ,  ,  11",
      "unrestricted_value": " ,  ,  11",
      "data": {...}
    }
  ]
}
      
      





Inside is a wealth of detail about the objects found: strings with the full and abbreviated address, current and former names, postal code, FIAS code of the parent object, and so on.





All the data the method returns is described in the documentation



Coordinate coverage varies by region. Here is how it looks for houses:





And here is the coverage for streets:





We did not count cities: at the scale of Russia, even the question of what bears the proud title of city turned out to be shaky. For example, Yaroslavl region, Poshekhonsky district, Fedorkovsky s/o is a city according to the official FIAS directory. In reality, and judging by the address, it is a rural district. Physically, a rural district looks like several villages merged into one big blot. It is hard not only to determine its center but even to find the settlement on a map.



We are already thinking about what to add to the method: filtering by object type, returning the distance to the given point, and more. We are watching demand and deciding whether to invest.



Otherwise, everything is already in production. Up to 10,000 requests per day are free; beyond that, a subscription starts at 5,000 ₽ per year. If you need addresses by coordinates for a commercial project and Geocoder is too expensive, try the DaData API.



The original article is published on the HFLabs blog.


