HOW OLD IS THIS HOUSE
Big City Russia

A project of Kon-Tiki Publishers about the age of buildings in the 80 largest cities of the country.

🤖📊🛠️🕵️‍♀️
The process of data collection and processing
Arthur Kislitsyn
curator of how-old-is-this.house
After 14 maps dedicated to individual Russian cities were published, a large interactive all-Russian map of building ages appeared as part of how-old-is-this.house, a special project of Kon-Tiki Publishers. In addition to the article about the research results, I have prepared a behind-the-scenes story about the data.

I will tell you about how the data was collected and how it can be used.

🎬📚 Prologue

The life of the how-old-is-this.house project began with the release of the map of St. Petersburg in 2020, which was compiled by Nikita Slavin, the founder of the Kon-Tiki publishing house. Then, 7 more maps were released for Moscow, Vladimir, Kazan, Yekaterinburg, Voronezh, Kaliningrad, and Arkhangelsk, for which data were collected from various sources and meticulously processed by the authors (the Kaliningrad map was my first painstaking project). In the spring of 2021, we met programmer Alexander Kachkaev, who became interested in the project and wanted to automate the process of creating maps of the age of houses. This is how the script came into being.

Tooling for how-old-is-this.house is a script and toolkit written in TypeScript that can be used via the command-line interface. A configuration file needs to be created for each individual territory.

The automation brought significant relief to the authors and had several important advantages:
  • Data collection became faster, taking no more than 2-3 days;
  • A unified algorithm allows for data standardization and eliminates the human factor;
  • The script conducts address standardization, minimizing the reliance on geocoding;
  • The script provides a convenient tool for collecting data from the Rosreestr (Federal Service for State Registration, Cadastre and Cartography).
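
The idea behind address standardization can be illustrated with a minimal sketch. It is not taken from the actual tooling: the function name `normalizeAddress` and the abbreviation table are hypothetical, and the real rules are certainly more extensive. The point is only that two differently written addresses can be reduced to the same string and matched without geocoding.

```typescript
// Hypothetical sketch, not the actual tooling code.
// A tiny table mapping common Russian abbreviations to one canonical form.
const abbreviations = new Map<string, string>([
  ["ул", "улица"], // street
  ["ул.", "улица"],
  ["пр-т", "проспект"], // avenue
  ["просп.", "проспект"],
  ["д.", "дом"], // house
]);

function normalizeAddress(raw: string): string {
  return raw
    .toLowerCase()
    .replace(/,/g, " ") // commas carry no meaning for matching
    .split(/\s+/) // split on any run of whitespace
    .filter((token) => token.length > 0)
    .map((token) => abbreviations.get(token) ?? token)
    .join(" ");
}
```

With this table, "ул. Ленина, д. 5" and "Улица Ленина, дом 5" reduce to the same string, so records from two sources can be joined on the normalized value.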

With the help of the new script assistant, maps of Penza, Nizhny Novgorod, Kirov, Krasnodar, Volgograd, and Tomsk have appeared. However, in practice, the creative role of the author has not lost its significance: solving technical problems related to the source data, as well as extensive post-processing, still rests on the shoulders of the cartographers.

🌐🚀 New scale

The emergence of a universal tool led us to wonder whether we could gather all the information about buildings in Russia in one go.

The answer to this ambitious question turned out to be negative. Russia is very vast, computational resources are limited, and the open data available are imperfect.

Instead, we chose a different task: to collect data on Russia's largest cities, which together are home to almost half of the country's population. The list was based on the 2010 census data, to which we added several fast-growing cities based on our own expert judgment.
We also enriched and standardized the data for those cities whose maps were published earlier.

🛠️🤖 Techniques

Despite its apparent ease, collecting data for dozens of cities turned out to be a long and challenging task.

How the script works is described in an engaging way in Kachkaev's long read about the Penza map project. Here, I'll describe only the main structure.

The script collects various data from the following seven sources:
  • "MINZHKH" Project
  • Ministry of Culture Registry
  • OpenStreetMap (OSM)
  • Rosreestr (Federal Service for State Registration, Cadastre and Cartography)
  • Wikidata
  • Wikimapia
  • Wikivoyage
Data sources. By Alexander Kachkaev
Rosreestr is the main source of information about the age and number of floors of non-residential buildings and individual houses. Unfortunately, this part of data collection caused the most trouble because parsing one large city required more than two days of continuous scraping. As of January 2024, the script's tools cannot parse objects via the Rosreestr API, because doing so would require reworking the script and its requests.

Other problems were related to the script's limitations in processing the boundaries of cities with complex geometry and to shortcomings in the open data. Both required manual intervention throughout the process.

The next task was to merge the previously collected data with the new data, further standardize it, search for errors and outliers, and analyze the entire dataset.

🔍✨ What useful information can be found here?

Despite all the obstacles, after a year I managed to collect a huge dataset: 3 million buildings across 80 cities, about 1.5 gigabytes in total.

In addition to the cool interactive map that lets you find information about specific buildings in your city, there are several other useful features. In the prepared dataset, along with data on the construction period of the building, we also saved values for:
  • number of floors;
  • number of apartments;
  • status of the apartment building;
  • information about architects and architectural styles;
  • links to cultural heritage object cards;
  • and photos.
We believe that such massive aggregation of open and semi-open data will serve as the foundation for exciting future research on Russian cities.

😬⚡ Why you should be careful

An honest data project should openly declare the limitations imposed by the conditions of its data sources. Our dataset is not devoid of errors: a significant percentage of objects may have inaccuracies in their name, address, photographs, floor count, status of multi-story buildings, or, most importantly, in the year of construction.

There are several reasons for this:
  • Government open data in Russia, primarily that of Rosreestr, cannot be considered a source of satisfactory quality. A low proportion of geotagged objects and insufficient accuracy of address lists sometimes lead to unpleasant outcomes when merging layers of information: some buildings literally end up in the wrong location within a block.
  • Address lists of cultural heritage objects from the Ministry of Culture also require careful attention. An analyst who relies on automatic tools risks losing a lot of data, because groups of heritage objects, such as parts of large estates and monasteries, may have addresses with different letters and lack geotagging.
  • OpenStreetMap, the main portal for open spatial data, developed by volunteers over many years, naturally cannot guarantee complete coverage of Russian cities with building geometry data. While coverage gaps are minimal in major cities, in smaller cities (e.g., Vladikavkaz), entire blocks, especially of individual residential buildings, may remain unmapped.

In short, when working with data at the scale of urban blocks, careful cross-checking of information is necessary.

Now, let's address the global issues with our dataset:
  • Data on the construction year of most pre-revolutionary buildings was lost a century ago — the databases for them indicate the year of 1917. The problem is also evident in Kaliningrad and Vyborg, with erroneous years of 1945 and 1940, respectively (Vyborg District is not included in the dataset but may be added in the future). In Kaliningrad, an artificial dating of 1900 is sometimes used as well.
  • The data contain outliers, primarily inherited from Rosreestr. The only major flaw identified so far is that about fifty Soviet-era buildings in Chelyabinsk are dated to the year 1900 (these dates were corrected for the data analysis, but the original values remain in the final dataset).
  • The script rounds off inaccurate dating (e.g., early 19th century) for the purposes of quantitative analysis and data visualization. The textual format is preserved in a separate field.
  • Information about the number of floors is missing for a significant portion of buildings. During analysis, such buildings were considered single-story.
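
The caveats in this list can be turned into a few defensive helpers when working with the dataset. The sketch below is an illustration under assumptions, not the script's actual logic: the field names `completionTime` and `floorCount`, the function names, the supported phrasings, and the representative years chosen for vague datings (for instance, treating "early 19th century" as 1810) are all hypothetical.

```typescript
// Hypothetical sketch; the real tooling may round and flag differently.
interface BuildingRecord {
  completionTime?: string; // textual dating, preserved as-is
  floorCount?: number;
}

// Years the article names as artificial datings (1917, 1945, 1940, 1900).
// A single global set is a simplification: some apply only to certain cities.
const placeholderYears = new Set<number>([1900, 1917, 1940, 1945]);

function isPossiblePlaceholderYear(year: number): boolean {
  return placeholderYears.has(year);
}

// Assumed offsets within a century for vague qualifiers.
const qualifierOffsets = new Map<string, number>([
  ["early", 10],
  ["mid", 50],
  ["late", 90],
]);

function deriveYear(completionTime: string | undefined): number | undefined {
  if (completionTime === undefined) return undefined;
  const text = completionTime.trim().toLowerCase();

  // Exact four-digit year, e.g. "1912".
  const exact = /^(\d{4})$/.exec(text);
  if (exact) return Number(exact[1]);

  // Century with an optional qualifier, e.g. "early 19th century".
  const century = /^(?:(early|mid|late)\s+)?(\d{1,2})(?:st|nd|rd|th)\s+century$/.exec(text);
  if (century) {
    const centuryStart = (Number(century[2]) - 1) * 100;
    const qualifier = century[1];
    const offset = qualifier ? (qualifierOffsets.get(qualifier) ?? 50) : 50;
    return centuryStart + offset;
  }
  return undefined; // unrecognized phrasing: leave the year unknown
}

// Buildings without a floor count are treated as single-story in the analysis.
function floorsForAnalysis(record: BuildingRecord): number {
  return record.floorCount ?? 1;
}
```

Keeping the textual value next to the derived number makes it possible to tell a precisely dated building from one whose year was rounded, and the placeholder check helps decide which records deserve manual verification.
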
And a few words about the pleasant features of our dataset:
  • The dataset of Arkhangelsk also includes data for Severodvinsk and Novodvinsk (not included in the data analysis).
  • The dataset of Moscow includes data for New Moscow. In the data analysis, these two parts of the capital are treated separately.

🙅‍♂️📉 When can you not use the data?

With our data, it is not possible to assess the volumes of construction in different historical periods because we lack information on how many buildings from each period were demolished throughout the history of the city. To compare cities by their size and growth rate, historical (archival) plans and statistics should be used.

Additionally, absolute values should be used with caution when comparing and analyzing cities.
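
One way to rely less on absolute values is to compare shares instead of counts. The following is a minimal sketch under that assumption; the function name `shareByDecade` and the grouping by decade are illustrative choices, not part of the project's analysis.

```typescript
// Hypothetical sketch: share of surviving buildings per construction decade.
function shareByDecade(years: number[]): Map<number, number> {
  const counts = new Map<number, number>();
  for (const year of years) {
    const decade = Math.floor(year / 10) * 10;
    counts.set(decade, (counts.get(decade) ?? 0) + 1);
  }

  // Convert counts to fractions of all buildings with a known year.
  const shares = new Map<number, number>();
  counts.forEach((count, decade) => {
    shares.set(decade, count / years.length);
  });
  return shares;
}
```

For example, `shareByDecade([1961, 1965, 1972, 1988])` assigns 0.5 to the 1960s and 0.25 each to the 1970s and 1980s. Such shares still describe only the buildings that survive today, so the limitation about demolished buildings applies to them as well.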

🗺️🎮 Interactive mapping

We have prepared a seamless interactive map, which exists alongside the local projects for each city or agglomeration. For the all-Russian project, a special color scale was developed: we took the very first palette, that of St. Petersburg, as the basis and slightly smoothed its shades so that each historical period could be easily identified on the map of any city. Additionally, this palette is accessible to colorblind users.
You can read about the data here.
I express my sincere gratitude to Alexander Kachkaev for providing us with a powerful script; to Nikita Slavin for creating how-old-is-this.house; to the Geosemantica team for supporting the project technically, and to all twelve initial authors of house age maps!