Combining TERYT code changes with maps

Changes in TERYT codes

In the last note I showed how to merge Polish BDL local data with a municipality map of Poland and how to make nice-looking maps out of these data. Here I want to take this topic further.

When merging the 2016 data with the 2018 municipality map we encountered some problems, because not all municipality numbers stay the same. Generally, municipalities in Poland are coded with a 7-digit TERYT code in the following format:

WWPPGGT, where:

  • WW: two digits for region (województwo) code;
  • PP: two digits for county (powiat) code;
  • GG: two digits for municipality code;
  • T: one digit for the municipality type. There are several types:
    • 1: cities;
    • 2: villages;
    • 3: mix of cities and villages;
    • 4: city area in the city-village mix;
    • 5: village area in the city-village mix;
    • 8 or 9: district code for larger cities.

Understandably, these codes can change. Moreover, municipalities can change counties, regions and types. Sometimes they merge, are created, or are deleted. Municipalities can also change their borders, but we are unable to account for that with these data.
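To make the layout concrete, here is a minimal sketch (the helper name split_teryt is made up for illustration) that slices a 7-digit TERYT code into its parts:

```python
def split_teryt(code: str) -> dict:
    """Split a 7-digit TERYT code into its WWPPGGT parts
    (hypothetical helper, for illustration only)."""
    assert len(code) == 7 and code.isdigit()
    return {
        'region': code[0:2],        # WW: wojewodztwo
        'county': code[2:4],        # PP: powiat
        'municipality': code[4:6],  # GG: gmina
        'type': code[6],            # T: municipality type (1-5, 8, 9)
    }

# '0220023': region 02, county 20, municipality 02, type 3 (city-village mix)
print(split_teryt('0220023'))
```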

Thankfully, GUS lets us download all code changes since 1999, the year of the administrative reform. We can use this information to create recode tables.

First we load the needed libraries.

In [25]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import geopandas as gpd

from os.path import join as pj

# My local project directory
pDir = "/home/mateusz/projects/gis/"

The updates are provided in XML format, with each change described as a single element with children. For example (a fragment):

    <NazwaDodatkowaPrzed>gmina wiejska</NazwaDodatkowaPrzed>
    <NazwaPo />
    <NazwaDodatkowaPo>gmina miejsko-wiejska</NazwaDodatkowaPo>
    <WyodrebnionoZIdentyfikatora1 />
    <WyodrebnionoZIdentyfikatora2 />
    <WyodrebnionoZIdentyfikatora3 />
    <WlaczonoDoIdentyfikatora1 />
    <WlaczonoDoIdentyfikatora2 />
    <WlaczonoDoIdentyfikatora3 />

The trick now is to read them into a Pandas DataFrame. We will be using Python's built-in XML parser.

In [4]:
import xml.etree.ElementTree as et

parsedXML = et.parse(pj(pDir, 'partitions', 'TERC_Urzedowy_zmiany_1999-01-01_2018-07-15.xml'))
parsedXML
<xml.etree.ElementTree.ElementTree object at 0x7fde3a643a90>

Listing the children of the first parsed element shows them as Element objects. We can access an element's contents through its .text attribute.

In [7]:
root = parsedXML.getroot()
print(list(root[0]))
print('TypKorekty', root[0].find('TypKorekty').text)
[<Element 'TypKorekty' at 0x7fde36dc2598>, <Element 'WojPrzed' at 0x7fde36dc2638>, <Element 'PowPrzed' at 0x7fde36dc2728>, <Element 'GmiPrzed' at 0x7fde36dc2778>, <Element 'RodzPrzed' at 0x7fde36dc27c8>, <Element 'NazwaPrzed' at 0x7fde36dc2818>, <Element 'NazwaDodatkowaPrzed' at 0x7fde36dc2868>, <Element 'StanPrzed' at 0x7fde36dc28b8>, <Element 'WojPo' at 0x7fde36dc2908>, <Element 'PowPo' at 0x7fde36dc2958>, <Element 'GmiPo' at 0x7fde36dc29a8>, <Element 'RodzPo' at 0x7fde36dc29f8>, <Element 'NazwaPo' at 0x7fde36dc2a48>, <Element 'NazwaDodatkowaPo' at 0x7fde36dc2a98>, <Element 'WyodrebnionoZIdentyfikatora1' at 0x7fde36dc2b38>, <Element 'WyodrebnionoZIdentyfikatora2' at 0x7fde36dc2bd8>, <Element 'WyodrebnionoZIdentyfikatora3' at 0x7fde36dc2c78>, <Element 'WlaczonoDoIdentyfikatora1' at 0x7fde36dc2d18>, <Element 'WlaczonoDoIdentyfikatora2' at 0x7fde36dc2db8>, <Element 'WlaczonoDoIdentyfikatora3' at 0x7fde36dc2e58>, <Element 'StanPo' at 0x7fde36dc2ea8>]
TypKorekty M

What we need to do is to iterate by elements of the parsed XML.

We can get the names of the fields from GUS. Some keys come in before (Przed) and after (Po) variants.

In [10]:
ch_st = ['Stan', 'Woj', 'Pow', 'Gmi', 'Rodz', 'Nazwa', 'NazwaDodatkowa']
ch_oth = ['TypKorekty', 'WlaczonoDoIdentyfikatora1', 'WlaczonoDoIdentyfikatora2', 
          'WlaczonoDoIdentyfikatora3', 'WyodrebnionoZIdentyfikatora1', 
          'WyodrebnionoZIdentyfikatora2', 'WyodrebnionoZIdentyfikatora3']
teryt_keys = ch_oth + [ch + it for it in ['Przed', 'Po'] for ch in ch_st]
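As a quick sanity check (a sketch, not part of the original notebook), we can verify that the comprehension yields all 14 Przed/Po variants plus the 7 unpaired keys, 21 in total, matching the 21 children of each zmiana element:

```python
ch_st = ['Stan', 'Woj', 'Pow', 'Gmi', 'Rodz', 'Nazwa', 'NazwaDodatkowa']
ch_oth = ['TypKorekty', 'WlaczonoDoIdentyfikatora1', 'WlaczonoDoIdentyfikatora2',
          'WlaczonoDoIdentyfikatora3', 'WyodrebnionoZIdentyfikatora1',
          'WyodrebnionoZIdentyfikatora2', 'WyodrebnionoZIdentyfikatora3']
teryt_keys = ch_oth + [ch + it for it in ['Przed', 'Po'] for ch in ch_st]

# 7 unpaired keys + 7 stems x 2 suffixes = 21 keys in total;
# the outer loop runs over suffixes, so all Przed keys come before the Po keys
print(len(teryt_keys))
print(teryt_keys[7], teryt_keys[14])
```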

We construct a simple generator that returns a dictionary of values for each change, using the yield keyword.

In [12]:
def iter_changes(doc, keys):
    for zmiana in doc.iter('zmiana'):
        out = {}
        for k in keys:
            out[k] = zmiana.find(k).text
        yield out
doc_df = pd.DataFrame(list(iter_changes(parsedXML.getroot(), teryt_keys)))

# Add full TERYT codes
for p in ['Przed', 'Po']:
    doc_df['TERYT' + p] = doc_df['Woj'+p] + doc_df['Pow'+p] + doc_df['Gmi'+p] + doc_df['Rodz'+p]
doc_df.head()
GmiPo GmiPrzed NazwaDodatkowaPo NazwaDodatkowaPrzed NazwaPo NazwaPrzed PowPo PowPrzed RodzPo RodzPrzed ... WlaczonoDoIdentyfikatora1 WlaczonoDoIdentyfikatora2 WlaczonoDoIdentyfikatora3 WojPo WojPrzed WyodrebnionoZIdentyfikatora1 WyodrebnionoZIdentyfikatora2 WyodrebnionoZIdentyfikatora3 TERYTPrzed TERYTPo
0 02 02 gmina miejsko-wiejska gmina wiejska None Prusice 20 20 3 2 ... None None None 02 02 None None None 0220022 0220023
1 02 None miasto None Prusice None 20 None 4 None ... None None None 02 None None None None NaN 0220024
2 02 None obszar wiejski None Prusice None 20 None 5 None ... None None None 02 None None None None NaN 0220025
3 12 12 gmina miejsko-wiejska gmina wiejska None Tyszowce 18 18 3 2 ... None None None 06 06 None None None 0618122 0618123
4 12 None miasto None Tyszowce None 18 None 4 None ... None None None 06 None None None None NaN 0618124

5 rows × 23 columns

At this stage we are interested only in the before and after TERYT codes and in when the change was made.

In [13]:
changes_99_18 = doc_df[['TERYTPrzed', 'TERYTPo', 'StanPo', 'StanPrzed']].dropna()
changes_99_18.head()
TERYTPrzed TERYTPo StanPo StanPrzed
0 0220022 0220023 2000-01-01 1999-01-01
3 0618122 0618123 2000-01-01 1999-01-01
6 1202032 1202033 2000-01-01 1999-01-01
11 1429052 1429053 2000-01-01 1999-01-01
14 3030032 3030033 2000-01-01 1999-01-01

Next we want to keep only the changes made after 2016, i.e. up to 2018.

In [14]:
# Build final recode table including cumulative changes
keep = changes_99_18['StanPo'].apply(lambda x: int(x[0:4]))
# Keep changes for years greater than 2016
changes_16_18 = changes_99_18[keep > 2016]

There is still a possibility that a municipality changed its number more than once. To keep only the last change, we first self-merge changes_16_18, matching the after columns against the before columns together with the date the change occurred.
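To see how this self-merge chains sequential changes, consider a toy table with made-up codes, where A changed to B and then B changed to C:

```python
import pandas as pd

# Made-up example: code changes A -> B (effective 2017) and B -> C (effective 2018)
toy = pd.DataFrame({
    'TERYTPrzed': ['A', 'B'],
    'TERYTPo':    ['B', 'C'],
    'StanPrzed':  ['2016-01-01', '2017-01-01'],
    'StanPo':     ['2017-01-01', '2018-01-01'],
})

# A row matches another row only when its "after" code and date equal
# the other row's "before" code and date
chained = toy.merge(toy, left_on=['TERYTPo', 'StanPo'],
                    right_on=['TERYTPrzed', 'StanPrzed'])

# The single matched row links the initial code A to the final code C
print(chained[['TERYTPrzed_x', 'TERYTPo_y']])
```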

In [15]:
# Merge sequential changes
dif = changes_16_18.merge(changes_16_18, left_on=['TERYTPo', 'StanPo'], right_on=['TERYTPrzed', 'StanPrzed'])
dif
TERYTPrzed_x TERYTPo_x StanPo_x StanPrzed_x TERYTPrzed_y TERYTPo_y StanPo_y StanPrzed_y
0 1210022 1210023 2018-01-01 2017-01-01 1210023 1210022 2018-01-02 2018-01-01

From this we can see that there was a municipality that on 1 Jan 2018 changed its type from village to city-village mix and then reverted back a day later. Administration can be daunting...

Next we merge the final state TERYTPo_y into the initial table by the before-change values TERYTPrzed_x. Then we merge the initial values against the intermediate after-change values TERYTPo_x; this second merge is used only to drop rows that are not the final change.

In [20]:
# Merge initial TERYT with final TERYT
teryt_recode = changes_16_18.merge(
        dif[['TERYTPrzed_x', 'TERYTPo_y']], how='left', indicator=True,
        left_on='TERYTPrzed', right_on='TERYTPrzed_x',
        ).drop(columns='TERYTPrzed_x').rename(columns={'_merge': 'update', 'TERYTPo_y': 'last'})
# Drop intermediate (+1) changes
teryt_recode = teryt_recode.merge(
        dif, how='left', indicator=True,
        left_on='TERYTPrzed', right_on='TERYTPo_x')

teryt_recode.loc[teryt_recode['update'] == 'both', 'TERYTPo'] = teryt_recode['last']
teryt_recode = teryt_recode[teryt_recode['_merge'] != "both"]
teryt_recode = teryt_recode[['TERYTPrzed', 'TERYTPo']]

print("Number of changes: {0}".format(teryt_recode.shape[0]))
teryt_recode.head()

Number of changes: 13
0 0603152 0603153
1 2211021 2211023
2 2604122 2604123
3 3007082 3007083
4 3209052 3209053

Previously we were missing 12 municipalities from the map. Given that one of the changes was reverted in the end, we can expect that we have found all of our missing municipalities. Let's check.

In [21]:
# Load SHP files
pl = gpd.read_file(pj(pDir, 'data', 'prg2018', 'Państwo.shp'))
pl_gminy = gpd.read_file(pj(pDir, 'data', 'prg2018', 'gminy_utf.shp'))

# Merge and update TERYT codes
merged = pl_gminy.merge(teryt_recode, left_on='jpt_kod_je', right_on='TERYTPo', how='left', indicator=True)
merged.loc[merged['TERYTPrzed'].notnull(), 'jpt_kod_je'] = merged['TERYTPrzed']
merged = merged.drop(columns=['TERYTPrzed', 'TERYTPo'])
merged['_merge'].value_counts()

left_only     2465
both            13
right_only       0
Name: _merge, dtype: int64

We see that we got all of our missing municipalities recoded.

Let's check it on a map!

In [29]:
# Load BDL data
bdl_pers = pd.read_csv(
        pj(pDir, 'data', 'bdl', 'LUDN_1336_CREL_20180713222221.csv'), 
        sep=';', dtype={'Kod': str}).drop(columns=['Atrybut', 'Unnamed: 10'])
bdl_pers = bdl_pers[bdl_pers['Płeć']=='ogółem']
bdl_pers = bdl_pers[['Kod', 'Wartosc']]
bdl_pers.columns = ['Kod', 'pers']

# Get area in m^2 (projected map units)
merged['area'] = merged['geometry'].area

# Merge BDL & map
pl_merged = merged.merge(bdl_pers, left_on='jpt_kod_je', right_on='Kod', how='left')
# Calculate density
pl_merged['pop_dens'] = pl_merged['pers'] / pl_merged['area'] * 1000

# Plot population density
fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='pop_dens', cmap='Spectral', legend=True, scheme='quantiles', k=10)
ax.set_title('Population density in 2016, 1000 pers per km$^2$', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
[Figure: choropleth map of population density in 2016 by municipality]
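A note on the units behind pop_dens: with areas in square metres (the projected map units), persons / m² × 1000 equals thousands of persons per km², which matches the plot title. A quick check with made-up numbers:

```python
# Made-up example: 50,000 persons living on 100 km^2 (= 1e8 m^2)
pers = 50_000
area_m2 = 100 * 1_000_000

pop_dens = pers / area_m2 * 1000            # formula used in the notebook
expected = (pers / 1000) / (area_m2 / 1e6)  # thousands of persons per km^2

# Both expressions are algebraically identical: pers * 1000 / area_m2
assert abs(pop_dens - expected) < 1e-12
print(pop_dens)
```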


In the end we did not get any 'black holes' on our map. The merging worked well, and it can probably be extended to other years. Unfortunately, we still miss municipalities that merged or ceased to exist, as well as border changes over the years. Moreover, information about changes before 1999 is missing in the current approach, which can be burdensome when our local data go back to 1995.

A full database of changes, built with a different approach, can be found on Tomasz Żółtak's private website. Those data are additionally expanded with information on partitions, although the author notes that the methodology behind this partition classification is unknown. I'll try to replicate this classification using QGIS soon...