Recently I've been working a lot with maps and geographical data. Most of the work has been dome in Stata and I've been using mostly QGIS for visualization. Also I did some simple spatial operations in GDAL and ArcGIS. But when it comes to producing maps, QGIS was my tool of choice. Ideas change quite a lot and QGIS is nice tool for data visualization, but unfortunately I've found it's Python API a bit troublesome to help me automatize production of maps. So I kept doing things manually.
Tools¶
Here I turn to Python and its libraries to help me automatize map production. It turn out not much effort is needed and the maps being produced are visually appealing just after some minor tweaks.
In this short excercise I use Geopandas which enables a simple way to work with Geo data. Except that I also use the standard scientific libraries like Matplotlib, Pandas and Numpy.
Data sources¶
In this example I use administrative boundaries for Poland and connect it with Local Data Bank, BDL data from the Polish national statistics office.
Shapefiles for administrative boundaries for Poland are available for free to download here. Unfortunately only last year is available, but still it's something.
First we need to setup our environment. I always try to use absolute paths, so it's easy for me to switch between computers.
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import geopandas as gpd
import os
# Project directory
pDir = "/home/mateusz/projects/gis/"
Openinig a Shapefiles is extremly easy. Geopandas documentation says that the only thing that you jus need one line of code to do that. Lets try reading the municipalities data...
try:
pl_gminy = gpd.read_file(os.path.join(pDir, 'data', 'prg2018', 'gminy.shp'))
except ValueError as e:
print(e)
Unfortunately I got an error. That's probably because the SHP files are created on Windows machines using Polish localization. The default encoding in use is the CP1250. To be able to read the municipalities file I had to open the original first in QGIS indicating the encoding and then save the file with UTF-8 encoding selected. It worked like a charm.
I also left just the columns with names and municipality codes (TERYT) jpt_kod_je
.
pl_gminy = gpd.read_file(os.path.join(pDir, 'data', 'prg2018', 'gminy_utf.shp'))
pl_gminy.head()
So the geometry is coded in the geometry
column. Nice to know. Let's see what is the projection in use.
pl_gminy.crs
The used units are meters, so we can easily calculate area.
pl_gminy['area'] = pl_gminy['geometry'].area
pl_gminy['area'].describe()
np.sum(pl_gminy['area'])/1e6 # I divide by 1000x1000 m to get km^2
Google says that area of Poland is 312 679 km$^2$, so there is lack of precission here.
Let's try to plot the map.
pl_gminy.plot()
It worked, but the map still needs some minor improvements.
fig, ax = plt.subplots(1, figsize=(14, 14))
pl_gminy.plot(ax=ax, linewidth=0.5, edgecolor='0.4', color='0.8')
ax.axis('off')
Much better! Lets try to add some color! First let's plot the area.
fig, ax = plt.subplots(1, figsize=(14, 14))
pl_gminy.plot(ax=ax, edgecolor='0.4', linewidth=0.5,
column='area', cmap='Greens', legend=True, scheme='quantiles', k=5)
ax.axis('off')
It seems that biggest municipalities are located in the north and west of Poland. This somehow seems as the municipalities located outside the pre WWII borders of Poland are generally larger.
We can download 1931 SHP files for Poland from MPIDR Population History GIS Collection Data Files. The site requires registration, but the material they provide is worth it!
pl_31 = gpd.read_file(os.path.join(pDir, 'data', 'Poland 1770-1978', 'Poland_1931_v.1.0.shp'))
print(pl_31.crs)
pl_31.head()
pl_31.plot()
The map looks OK, but we still need to reproject it to fit with the Polish projection. We will also load the current country shapefile and intersect it with a dissolved 1931 map.
# Change projection
pl_31 = pl_31.to_crs(pl_gminy.crs)
# Dissolve into one polygon
pl_31dis = pl_31
pl_31dis['one'] = 1
pl_31dis = pl_31dis[['geometry', 'one']].dissolve('one')
# Read Poland current boundaries
pl = gpd.read_file(os.path.join(pDir, 'data', 'prg2018', 'Państwo.shp'))
# Intersect the two
pl31_int = gpd.overlay(pl, pl_31dis, how='intersection')
pl31_int.plot()
fig, ax = plt.subplots(1, figsize=(14, 14))
pl_gminy.plot(ax=ax, edgecolor='0.4', linewidth=0.5,
column='area', cmap='Greens', legend=True, scheme='quantiles', k=5)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.axis('off')
There are some errors visible on the dissolved maps, but the general picture is clear, that on lands that Poland got after WWII the municipalities seem to be bigger.
Merging with BDL data¶
The BDL website contains data from 1995 on various topics. I am interested mostly in population and municipal finance. The latest dataset with municipal revenues is available from 2016. I take total population data from this year as well. It is important to switch to Polish NTS 2013 identifiers to match ids used in the 2018 boundary map.
The data are downloaded in CSV format for easy input to Python. I will be using Pandas, because it nicely integrates with Geopandas.
bdl_own = pd.read_csv(
os.path.join(pDir, 'data', 'bdl', 'FINA_2622_CREL_20180713215941.csv'),
sep=';', dtype={'Kod': str}, decimal=',').drop(columns=['Atrybut', 'Unnamed: 8'])
bdl_own.head()
# Types of own income
bdl_own['Rodzaje dochodów'].unique()
OK, so I want only total own revenues and revenues coming from PIT and CIT. The latter are proportional to peoples' incomes and they can use as a nice indicator of which municipality is richer.
keep = ['razem', 'udziały w podatkach stanowiących dochody budżetu państwa razem']
bdl_own = bdl_own.pivot('Kod', 'Rodzaje dochodów', 'Wartosc')[keep]
bdl_own.columns = ['own_total', 'own_bud']
bdl_own.head()
We do similar transformations to population data.
bdl_pers = pd.read_csv(
os.path.join(pDir, 'data', 'bdl', 'LUDN_1336_CREL_20180713222221.csv'),
sep=';', dtype={'Kod': str}, decimal=',').drop(columns=['Atrybut', 'Unnamed: 10'])
bdl_pers.head()
bdl_pers = bdl_pers[bdl_pers['Płeć']=='ogółem']
bdl_pers = bdl_pers[['Kod', 'Wartosc']]
bdl_pers.columns = ['Kod', 'pers']
bdl_pers.head()
In the next step I merge everyting together and plot the maps.
# Merge data with gminy_dataset
pl_merged = pl_gminy.merge(bdl_pers, left_on='jpt_kod_je', right_on='Kod', how='left')
pl_merged = pl_merged.merge(bdl_own, left_on='jpt_kod_je', right_index=True, how='left')
pl_merged.isnull().sum()
Above we can see that 12 municipalities did not merge correctly. This is porbably due to changes in TERYT coding between 2016 and 2016. We continue nevertheless. They wittl appear as transparent, so we add black Poland in background.
# Plot number of people
fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='pers', cmap='coolwarm', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('Number of persons in 2016', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')
# Plot population density
pl_merged['pop_dens'] = pl_merged['pers'] / pl_merged['area'] * 1000
fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='pop_dens', cmap='Reds', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('Population density in 2016, 1000 pers per km$^2$', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')
# Plot own revenues
pl_merged['rev_pc'] = pl_merged['own_total'] / pl_merged['pers']
fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='rev_pc', cmap='Spectral_r', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('Own revenues p.c. in 2016', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')
# Plot PIT+CIT revenues
pl_merged['rev_bud_pc'] = pl_merged['own_bud'] / pl_merged['pers']
fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='rev_bud_pc', cmap='Spectral_r', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('PIT+CIT share p.c. in 2016', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')
We want to identify the richest and poorest municipality in 2016. The best option is to look on PIT+CIT revenues in 2016.
pl_merged[['rev_pc', 'rev_bud_pc']].describe()
poorest_id = pl_merged['rev_bud_pc'].idxmin()
richest_id = pl_merged['rev_bud_pc'].idxmax()
pl_merged.iloc[[poorest_id, richest_id]]
Conclusions¶
So! Now we know that the wealthiest municipality in Poland in 2016 was Podkowa Leśna. Incomes in general have a pattern similar to Polish partitions before 1918.
Geopandas is a nice way to load and modify maps in Python. Adding to that Pandas functionality we can get a nice pipeline. It of course requires much more work than other GUI tools, but has the share advantage of repeatability. It is easy to modify the whole work process and not to have to click everything through from the beginning. One drawback is in geo data preparation. Here QGIS offers more tools for geometric manipulation. Combining the two can be quite a time saving experience.