Changes in TERYT codes¶

In last note I showed how to merge Polish BDL local data with municipality map for Poland and how to make nicely looking maps out of these data. Here I want to continue this topic further.

When merging 2016 data with 2018 municipality map we encountered some problems due to the fact that not all municipality numbers stay the same. Generally municipalities in Poland are coded using 7 digit TERYT code with the following format:

WWPPGGT, where:

WW: two digits for region (województwo) code;
PP: two digits for county (powiat) code;
GG: two digits for municipality code;
T: One digit for type of municipality code. There are several types:
- 1: cities;
- 2: villages;
- 3: mix of cities and villages;
- 4: city area in the city-village mix;
- 5: village area in the city-village mix;
- 8 or 9: district code for larger cities.

Understandingly these codes can change. Moreover municipalities can change counties, regions and types. Sometimes they can merge, be created and be deleted. Also municipalities can change areas, but we are unable to account for this with this data.

Thankfully GUS allows to download code changes since 1999, before the administrative reform. We can use this information to create recode tables.

First we load the needed libraries.

In [25]:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import geopandas as gpd

from os.path import join as pj

# My local project directory
pDir = "/home/mateusz/projects/gis/"

The update lines are provided in XML format with each change described as a single element with children. For example:

<zmiana>
    <TypKorekty>M</TypKorekty>
    <WojPrzed>02</WojPrzed>
    <PowPrzed>20</PowPrzed>
    <GmiPrzed>02</GmiPrzed>
    <RodzPrzed>2</RodzPrzed>
    <NazwaPrzed>Prusice</NazwaPrzed>
    <NazwaDodatkowaPrzed>gmina wiejska</NazwaDodatkowaPrzed>
    <StanPrzed>1999-01-01</StanPrzed>
    <WojPo>02</WojPo>
    <PowPo>20</PowPo>
    <GmiPo>02</GmiPo>
    <RodzPo>3</RodzPo>
    <NazwaPo />
    <NazwaDodatkowaPo>gmina miejsko-wiejska</NazwaDodatkowaPo>
    <WyodrebnionoZIdentyfikatora1 />
    <WyodrebnionoZIdentyfikatora2 />
    <WyodrebnionoZIdentyfikatora3 />
    <WlaczonoDoIdentyfikatora1 />
    <WlaczonoDoIdentyfikatora2 />
    <WlaczonoDoIdentyfikatora3 />
    <StanPo>2000-01-01</StanPo>
  </zmiana>

The trick is now to read them into Pandas dataframe. We will be using Python's builtin XML parser.

In [4]:

import xml.etree.ElementTree as et

parsedXML = et.parse(pj(pDir, 'partitions', 'TERC_Urzedowy_zmiany_1999-01-01_2018-07-15.xml'))
print(parsedXML)

<xml.etree.ElementTree.ElementTree object at 0x7fde3a643a90>

We can see the first parsed elements as Element class. We can access its contents using the .text attribute.

In [7]:

root = parsedXML.getroot()
print(root[0].getchildren())
print('TypKorekty', root[0].find('TypKorekty').text)

[<Element 'TypKorekty' at 0x7fde36dc2598>, <Element 'WojPrzed' at 0x7fde36dc2638>, <Element 'PowPrzed' at 0x7fde36dc2728>, <Element 'GmiPrzed' at 0x7fde36dc2778>, <Element 'RodzPrzed' at 0x7fde36dc27c8>, <Element 'NazwaPrzed' at 0x7fde36dc2818>, <Element 'NazwaDodatkowaPrzed' at 0x7fde36dc2868>, <Element 'StanPrzed' at 0x7fde36dc28b8>, <Element 'WojPo' at 0x7fde36dc2908>, <Element 'PowPo' at 0x7fde36dc2958>, <Element 'GmiPo' at 0x7fde36dc29a8>, <Element 'RodzPo' at 0x7fde36dc29f8>, <Element 'NazwaPo' at 0x7fde36dc2a48>, <Element 'NazwaDodatkowaPo' at 0x7fde36dc2a98>, <Element 'WyodrebnionoZIdentyfikatora1' at 0x7fde36dc2b38>, <Element 'WyodrebnionoZIdentyfikatora2' at 0x7fde36dc2bd8>, <Element 'WyodrebnionoZIdentyfikatora3' at 0x7fde36dc2c78>, <Element 'WlaczonoDoIdentyfikatora1' at 0x7fde36dc2d18>, <Element 'WlaczonoDoIdentyfikatora2' at 0x7fde36dc2db8>, <Element 'WlaczonoDoIdentyfikatora3' at 0x7fde36dc2e58>, <Element 'StanPo' at 0x7fde36dc2ea8>]
TypKorekty M

What we need to do is to iterate by elements of the parsed XML.

We can get the names of attributes from GUS. Some keys are coded as before (Przed) and after ( Po ).

In [10]:

ch_st = ['Stan', 'Woj', 'Pow', 'Gmi', 'Rodz', 'Nazwa', 'NazwaDodatkowa']
ch_oth = ['TypKorekty', 'WlaczonoDoIdentyfikatora1', 'WlaczonoDoIdentyfikatora2', 
          'WlaczonoDoIdentyfikatora3', 'WyodrebnionoZIdentyfikatora1', 
          'WyodrebnionoZIdentyfikatora2', 'WyodrebnionoZIdentyfikatora3']
teryt_keys = ch_oth + [ch + it for it in ['Przed', 'Po'] for ch in ch_st]

We construct a simple iterator returning a dictionary with values for each change usingthe yield keyword.

In [12]:

def iter_changes(doc, keys):
    for zmiana in doc.iter('zmiana'):
        out = {}
        for k in keys:
            out[k] = zmiana.find(k).text
        yield out
        
doc_df = pd.DataFrame(list(iter_changes(parsedXML.getroot(), teryt_keys)))

# Add full TERYT codes
for p in ['Przed', 'Po']:
    doc_df['TERYT' + p] = doc_df['Woj'+p] + doc_df['Pow'+p] + doc_df['Gmi'+p] + doc_df['Rodz'+p]
doc_df.head()

Out[12]:

	GmiPo	GmiPrzed	NazwaDodatkowaPo	NazwaDodatkowaPrzed	NazwaPo	NazwaPrzed	PowPo	PowPrzed	RodzPo	RodzPrzed	...	WlaczonoDoIdentyfikatora1	WlaczonoDoIdentyfikatora2	WlaczonoDoIdentyfikatora3	WojPo	WojPrzed	WyodrebnionoZIdentyfikatora1	WyodrebnionoZIdentyfikatora2	WyodrebnionoZIdentyfikatora3	TERYTPrzed	TERYTPo
0	02	02	gmina miejsko-wiejska	gmina wiejska	None	Prusice	20	20	3	2	...	None	None	None	02	02	None	None	None	0220022	0220023
1	02	None	miasto	None	Prusice	None	20	None	4	None	...	None	None	None	02	None	None	None	None	NaN	0220024
2	02	None	obszar wiejski	None	Prusice	None	20	None	5	None	...	None	None	None	02	None	None	None	None	NaN	0220025
3	12	12	gmina miejsko-wiejska	gmina wiejska	None	Tyszowce	18	18	3	2	...	None	None	None	06	06	None	None	None	0618122	0618123
4	12	None	miasto	None	Tyszowce	None	18	None	4	None	...	None	None	None	06	None	None	None	None	NaN	0618124

5 rows × 23 columns

At this stage we are interested only in before and after TERYT codes and when the change has been made.

In [13]:

changes_99_18 = doc_df[['TERYTPrzed', 'TERYTPo', 'StanPo', 'StanPrzed']].dropna()
changes_99_18.head()

Out[13]:

	TERYTPrzed	TERYTPo	StanPo	StanPrzed
0	0220022	0220023	2000-01-01	1999-01-01
3	0618122	0618123	2000-01-01	1999-01-01
6	1202032	1202033	2000-01-01	1999-01-01
11	1429052	1429053	2000-01-01	1999-01-01
14	3030032	3030033	2000-01-01	1999-01-01

We next want to keep only changes after 2016 to 2018.

In [14]:

# Build final recode table including cumulative changes
keep = changes_99_18['StanPo'].apply(lambda x: int(x[0:4]))
# Keep changes for years greater than 2016
changes_16_18 = changes_99_18[keep  > 2016]

There still is a possibility that a municipality changed it's number more than once. To leave only the last change we merge first the changes_16_18 before column with after column with respect to when the change occured.

In [15]:

# Merge sequential changes
dif = changes_16_18.merge(changes_16_18, left_on=['TERYTPo', 'StanPo'], right_on=['TERYTPrzed', 'StanPrzed'])
dif.head()

Out[15]:

	TERYTPrzed_x	TERYTPo_x	StanPo_x	StanPrzed_x	TERYTPrzed_y	TERYTPo_y	StanPo_y	StanPrzed_y
0	1210022	1210023	2018-01-01	2017-01-01	1210023	1210022	2018-01-02	2018-01-01

From this we can se that there was a municipality that in 1st Jan 2018 changed its type from village to village-city mix and then reverteb back to it a day later. Administration can be daunting...

So next we merge the final state TERYTPo_y to the initial dataset by the before change values TERYTPrzed_x. We next merge the initial values with duplicating after change values TERYTPo_x. We use this information only to delete rows that do not have the final change.

In [20]:

# Merge initial TERYT with final TERYT
teryt_recode = changes_16_18.merge(
        dif[['TERYTPrzed_x', 'TERYTPo_y']], how='left',  
        left_on='TERYTPrzed', right_on='TERYTPrzed_x', 
        indicator=True
        ).drop(columns='TERYTPrzed_x').rename(columns={'_merge': 'update', 'TERYTPo_y':'last'})
# Drop +1 changes
teryt_recode = teryt_recode.merge(
        dif, how='left',  
        left_on='TERYTPrzed', right_on='TERYTPo_x', 
        indicator=True).drop(columns=dif.columns)

teryt_recode.loc[teryt_recode['update'] == 'both', 'TERYTPo'] = teryt_recode['last']
teryt_recode = teryt_recode[teryt_recode['_merge'] != "both"]
teryt_recode = teryt_recode[['TERYTPrzed', 'TERYTPo']]

print("Number of changes: {0}".format(teryt_recode.shape[0]))

teryt_recode.head()

Number of changes: 13

Out[20]:

	TERYTPrzed	TERYTPo
0	0603152	0603153
1	2211021	2211023
2	2604122	2604123
3	3007082	3007083
4	3209052	3209053

Previously we were missing 12 municipalities from the map. Given that one of the changes did not happen in th end we can say that we have found are missing municipalities. Let's find that out.

In [21]:

# Load SHP files
pl = gpd.read_file(pj(pDir, 'data', 'prg2018', 'Państwo.shp'))
pl_gminy = gpd.read_file(pj(pDir, 'data', 'prg2018', 'gminy_utf.shp'))

# Merge and update TERYT codes
merged = pl_gminy.merge(teryt_recode, left_on='jpt_kod_je', right_on='TERYTPo', how='left', indicator=True)
merged.loc[merged['TERYTPrzed'].notnull(), 'jpt_kod_je'] = merged['TERYTPrzed']
merged = merged.drop(columns=['TERYTPrzed', 'TERYTPo'])

merged['_merge'].value_counts()

Out[21]:

left_only     2465
both            13
right_only       0
Name: _merge, dtype: int64

We see that we got all of our missing municipalities recoded.

Lets check it on a map!

In [29]:

# Load BDL data
bdl_pers = pd.read_csv(
        pj(pDir, 'data', 'bdl', 'LUDN_1336_CREL_20180713222221.csv'), 
        sep=';', dtype={'Kod': str}).drop(columns=['Atrybut', 'Unnamed: 10'])
bdl_pers = bdl_pers[bdl_pers['Płeć']=='ogółem']
bdl_pers = bdl_pers[['Kod', 'Wartosc']]
bdl_pers.columns = ['Kod', 'pers']

# Get area im km^2
merged['area'] = merged['geometry'].area

# Merge BDL & map
pl_merged = merged.merge(bdl_pers, left_on='jpt_kod_je', right_on='Kod', how='left')
# Calculate density
pl_merged['pop_dens'] = pl_merged['pers'] / pl_merged['area'] * 1000

# Plot population density
fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='pop_dens', cmap='Spectral', legend=True, scheme='quantiles', k=10)
ax.set_title('Population density in 2016, 1000 pers per km$^2$', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')

Out[29]:

(137166.64559966067, 896406.6565782153, 101133.95412515872, 807108.9159854322)

Summary¶

In the end we did not get any 'black holes' on our map. The merging worked well. It can probably be extended to other years. Unfortunately we still miss municipalities that had merged or ceased to exist. We also miss border changes during years. Moreover the information before 1999 bout changes is missing in current approach. This can be burdensome when our local data are from 1995.

A full database with changes made with a different approach can be found on Tomasz Żółtak's private website. This data is additionally expanded by information on partitions. The author claims that the methodology of how this partition classification has been made is unknown. I'll try to replicate this classification using QGIS soon...

Combinig TERYT code changes with maps

Changes in TERYT codes¶

Summary¶