Changes in TERYT codes¶
In last note I showed how to merge Polish BDL local data with municipality map for Poland and how to make nicely looking maps out of these data. Here I want to continue this topic further.
When merging 2016 data with 2018 municipality map we encountered some problems due to the fact that not all municipality numbers stay the same. Generally municipalities in Poland are coded using 7 digit TERYT code with the following format:
WWPPGGT, where:
- WW: two digits for region (województwo) code;
- PP: two digits for county (powiat) code;
- GG: two digits for municipality code;
- T: One digit for type of municipality code. There are several types:
- 1: cities;
- 2: villages;
- 3: mix of cities and villages;
- 4: city area in the city-village mix;
- 5: village area in the city-village mix;
- 8 or 9: district code for larger cities.
Understandingly these codes can change. Moreover municipalities can change counties, regions and types. Sometimes they can merge, be created and be deleted. Also municipalities can change areas, but we are unable to account for this with this data.
Thankfully GUS allows to download code changes since 1999, before the administrative reform. We can use this information to create recode tables.
First we load the needed libraries.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import geopandas as gpd
from os.path import join as pj
# My local project directory
pDir = "/home/mateusz/projects/gis/"
The update lines are provided in XML format with each change described as a single element with children. For example:
<zmiana>
<TypKorekty>M</TypKorekty>
<WojPrzed>02</WojPrzed>
<PowPrzed>20</PowPrzed>
<GmiPrzed>02</GmiPrzed>
<RodzPrzed>2</RodzPrzed>
<NazwaPrzed>Prusice</NazwaPrzed>
<NazwaDodatkowaPrzed>gmina wiejska</NazwaDodatkowaPrzed>
<StanPrzed>1999-01-01</StanPrzed>
<WojPo>02</WojPo>
<PowPo>20</PowPo>
<GmiPo>02</GmiPo>
<RodzPo>3</RodzPo>
<NazwaPo />
<NazwaDodatkowaPo>gmina miejsko-wiejska</NazwaDodatkowaPo>
<WyodrebnionoZIdentyfikatora1 />
<WyodrebnionoZIdentyfikatora2 />
<WyodrebnionoZIdentyfikatora3 />
<WlaczonoDoIdentyfikatora1 />
<WlaczonoDoIdentyfikatora2 />
<WlaczonoDoIdentyfikatora3 />
<StanPo>2000-01-01</StanPo>
</zmiana>
The trick is now to read them into Pandas dataframe. We will be using Python's builtin XML parser.
import xml.etree.ElementTree as et
parsedXML = et.parse(pj(pDir, 'partitions', 'TERC_Urzedowy_zmiany_1999-01-01_2018-07-15.xml'))
print(parsedXML)
We can see the first parsed elements as Element
class. We can access its contents using the .text
attribute.
root = parsedXML.getroot()
print(root[0].getchildren())
print('TypKorekty', root[0].find('TypKorekty').text)
What we need to do is to iterate by elements of the parsed XML.
We can get the names of attributes from GUS. Some keys are coded as before (Przed) and after ( Po ).
ch_st = ['Stan', 'Woj', 'Pow', 'Gmi', 'Rodz', 'Nazwa', 'NazwaDodatkowa']
ch_oth = ['TypKorekty', 'WlaczonoDoIdentyfikatora1', 'WlaczonoDoIdentyfikatora2',
'WlaczonoDoIdentyfikatora3', 'WyodrebnionoZIdentyfikatora1',
'WyodrebnionoZIdentyfikatora2', 'WyodrebnionoZIdentyfikatora3']
teryt_keys = ch_oth + [ch + it for it in ['Przed', 'Po'] for ch in ch_st]
We construct a simple iterator returning a dictionary with values for each change usingthe yield
keyword.
def iter_changes(doc, keys):
for zmiana in doc.iter('zmiana'):
out = {}
for k in keys:
out[k] = zmiana.find(k).text
yield out
doc_df = pd.DataFrame(list(iter_changes(parsedXML.getroot(), teryt_keys)))
# Add full TERYT codes
for p in ['Przed', 'Po']:
doc_df['TERYT' + p] = doc_df['Woj'+p] + doc_df['Pow'+p] + doc_df['Gmi'+p] + doc_df['Rodz'+p]
doc_df.head()
At this stage we are interested only in before and after TERYT codes and when the change has been made.
changes_99_18 = doc_df[['TERYTPrzed', 'TERYTPo', 'StanPo', 'StanPrzed']].dropna()
changes_99_18.head()
We next want to keep only changes after 2016 to 2018.
# Build final recode table including cumulative changes
keep = changes_99_18['StanPo'].apply(lambda x: int(x[0:4]))
# Keep changes for years greater than 2016
changes_16_18 = changes_99_18[keep > 2016]
There still is a possibility that a municipality changed it's number more than once. To leave only the last change we merge first the changes_16_18
before column with after column with respect to when the change occured.
# Merge sequential changes
dif = changes_16_18.merge(changes_16_18, left_on=['TERYTPo', 'StanPo'], right_on=['TERYTPrzed', 'StanPrzed'])
dif.head()
From this we can se that there was a municipality that in 1st Jan 2018 changed its type from village to village-city mix and then reverteb back to it a day later. Administration can be daunting...
So next we merge the final state TERYTPo_y
to the initial dataset by the before change values TERYTPrzed_x
. We next merge the initial values with duplicating after change values TERYTPo_x
. We use this information only to delete rows that do not have the final change.
# Merge initial TERYT with final TERYT
teryt_recode = changes_16_18.merge(
dif[['TERYTPrzed_x', 'TERYTPo_y']], how='left',
left_on='TERYTPrzed', right_on='TERYTPrzed_x',
indicator=True
).drop(columns='TERYTPrzed_x').rename(columns={'_merge': 'update', 'TERYTPo_y':'last'})
# Drop +1 changes
teryt_recode = teryt_recode.merge(
dif, how='left',
left_on='TERYTPrzed', right_on='TERYTPo_x',
indicator=True).drop(columns=dif.columns)
teryt_recode.loc[teryt_recode['update'] == 'both', 'TERYTPo'] = teryt_recode['last']
teryt_recode = teryt_recode[teryt_recode['_merge'] != "both"]
teryt_recode = teryt_recode[['TERYTPrzed', 'TERYTPo']]
print("Number of changes: {0}".format(teryt_recode.shape[0]))
teryt_recode.head()
Previously we were missing 12 municipalities from the map. Given that one of the changes did not happen in th end we can say that we have found are missing municipalities. Let's find that out.
# Load SHP files
pl = gpd.read_file(pj(pDir, 'data', 'prg2018', 'Państwo.shp'))
pl_gminy = gpd.read_file(pj(pDir, 'data', 'prg2018', 'gminy_utf.shp'))
# Merge and update TERYT codes
merged = pl_gminy.merge(teryt_recode, left_on='jpt_kod_je', right_on='TERYTPo', how='left', indicator=True)
merged.loc[merged['TERYTPrzed'].notnull(), 'jpt_kod_je'] = merged['TERYTPrzed']
merged = merged.drop(columns=['TERYTPrzed', 'TERYTPo'])
merged['_merge'].value_counts()
We see that we got all of our missing municipalities recoded.
Lets check it on a map!
# Load BDL data
bdl_pers = pd.read_csv(
pj(pDir, 'data', 'bdl', 'LUDN_1336_CREL_20180713222221.csv'),
sep=';', dtype={'Kod': str}).drop(columns=['Atrybut', 'Unnamed: 10'])
bdl_pers = bdl_pers[bdl_pers['Płeć']=='ogółem']
bdl_pers = bdl_pers[['Kod', 'Wartosc']]
bdl_pers.columns = ['Kod', 'pers']
# Get area im km^2
merged['area'] = merged['geometry'].area
# Merge BDL & map
pl_merged = merged.merge(bdl_pers, left_on='jpt_kod_je', right_on='Kod', how='left')
# Calculate density
pl_merged['pop_dens'] = pl_merged['pers'] / pl_merged['area'] * 1000
# Plot population density
fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='pop_dens', cmap='Spectral', legend=True, scheme='quantiles', k=10)
ax.set_title('Population density in 2016, 1000 pers per km$^2$', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')
Summary¶
In the end we did not get any 'black holes' on our map. The merging worked well. It can probably be extended to other years. Unfortunately we still miss municipalities that had merged or ceased to exist. We also miss border changes during years. Moreover the information before 1999 bout changes is missing in current approach. This can be burdensome when our local data are from 1995.
A full database with changes made with a different approach can be found on Tomasz Żółtak's private website. This data is additionally expanded by information on partitions. The author claims that the methodology of how this partition classification has been made is unknown. I'll try to replicate this classification using QGIS soon...