Recently I've been working a lot with maps and geographical data. Most of the work has been dome in Stata and I've been using mostly QGIS for visualization. Also I did some simple spatial operations in GDAL and ArcGIS. But when it comes to producing maps, QGIS was my tool of choice. Ideas change quite a lot and QGIS is nice tool for data visualization, but unfortunately I've found it's Python API a bit troublesome to help me automatize production of maps. So I kept doing things manually.

Tools¶

Here I turn to Python and its libraries to help me automatize map production. It turn out not much effort is needed and the maps being produced are visually appealing just after some minor tweaks.

In this short excercise I use Geopandas which enables a simple way to work with Geo data. Except that I also use the standard scientific libraries like Matplotlib, Pandas and Numpy.

Data sources¶

In this example I use administrative boundaries for Poland and connect it with Local Data Bank, BDL data from the Polish national statistics office.

Shapefiles for administrative boundaries for Poland are available for free to download here. Unfortunately only last year is available, but still it's something.

First we need to setup our environment. I always try to use absolute paths, so it's easy for me to switch between computers.

In [1]:

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import geopandas as gpd

import os

# Project directory
pDir = "/home/mateusz/projects/gis/"

Openinig a Shapefiles is extremly easy. Geopandas documentation says that the only thing that you jus need one line of code to do that. Lets try reading the municipalities data...

In [2]:

try:
    pl_gminy = gpd.read_file(os.path.join(pDir, 'data', 'prg2018', 'gminy.shp'))
except ValueError as e:
    print(e)

year 0 is out of range

Unfortunately I got an error. That's probably because the SHP files are created on Windows machines using Polish localization. The default encoding in use is the CP1250. To be able to read the municipalities file I had to open the original first in QGIS indicating the encoding and then save the file with UTF-8 encoding selected. It worked like a charm.

I also left just the columns with names and municipality codes (TERYT) jpt_kod_je.

In [3]:

pl_gminy = gpd.read_file(os.path.join(pDir, 'data', 'prg2018', 'gminy_utf.shp'))
pl_gminy.head()

Out[3]:

	jpt_kod_je	jpt_nazwa_	geometry
0	0222033	Wołów	POLYGON ((319872.625979769 383951.756238694, 3...
1	0214062	Oleśnica	POLYGON ((380436.400932976 368874.817494596, 3...
2	0223022	Długołęka	POLYGON ((365569.581931299 376008.352757658, 3...
3	0214073	Syców	POLYGON ((397867.372079491 383092.820273452, 3...
4	0223012	Czernica	POLYGON ((370352.948437282 358781.965472659, 3...

So the geometry is coded in the geometry column. Nice to know. Let's see what is the projection in use.

In [4]:

pl_gminy.crs

Out[4]:

{'proj': 'tmerc',
 'lat_0': 0,
 'lon_0': 19,
 'k': 0.9993,
 'x_0': 500000,
 'y_0': -5300000,
 'ellps': 'GRS80',
 'units': 'm',
 'no_defs': True}

The used units are meters, so we can easily calculate area.

In [5]:

pl_gminy['area'] = pl_gminy['geometry'].area
pl_gminy['area'].describe()

Out[5]:

count    2.478000e+03
mean     1.261083e+08
std      7.858655e+07
min      3.312067e+06
25%      7.676658e+07
50%      1.117959e+08
75%      1.594393e+08
max      6.340781e+08
Name: area, dtype: float64

In [6]:

np.sum(pl_gminy['area'])/1e6 # I divide by 1000x1000 m to get km^2

Out[6]:

312496.4701659419

Google says that area of Poland is 312 679 km$^2$, so there is lack of precission here.

Let's try to plot the map.

In [7]:

pl_gminy.plot()

Out[7]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f70891f5518>

It worked, but the map still needs some minor improvements.

In [8]:

fig, ax = plt.subplots(1, figsize=(14, 14))
pl_gminy.plot(ax=ax, linewidth=0.5, edgecolor='0.4', color='0.8')
ax.axis('off')

Out[8]:

(137166.64559966073, 896406.6565782153, 101133.95416049592, 807108.9159837493)

Much better! Lets try to add some color! First let's plot the area.

In [9]:

fig, ax = plt.subplots(1, figsize=(14, 14))
pl_gminy.plot(ax=ax, edgecolor='0.4', linewidth=0.5, 
              column='area', cmap='Greens', legend=True, scheme='quantiles', k=5)
ax.axis('off')

Out[9]:

(137166.64559966073, 896406.6565782153, 101133.95416049592, 807108.9159837493)

It seems that biggest municipalities are located in the north and west of Poland. This somehow seems as the municipalities located outside the pre WWII borders of Poland are generally larger.

We can download 1931 SHP files for Poland from MPIDR Population History GIS Collection Data Files. The site requires registration, but the material they provide is worth it!

In [10]:

pl_31 = gpd.read_file(os.path.join(pDir, 'data', 'Poland 1770-1978', 'Poland_1931_v.1.0.shp'))
print(pl_31.crs)
pl_31.head()

{'init': 'epsg:32633'}

Out[10]:

	ID_	NAME1_	Control	AREA	PERIMETER	geometry
0	P1403	Kowel district	ja	5713.195797	526.052173	POLYGON ((1170828.9243 5764522.9299, 1169997.6...
1	P1401	Luboml district	ja	2085.227769	262.833301	POLYGON ((1106927.4986 5762288.3598, 1107340.9...
2	P1410	Kostopol district	ja	3829.038222	325.007644	POLYGON ((1304054.5048 5744319.3704, 1304437.6...
3	P1405	£uck district	ja	4963.729259	508.742412	POLYGON ((1248251.6271 5746498.978399999, 1248...
4	P1407	Krzemieniec district	ja	2701.112880	277.950716	POLYGON ((1296489.327 5633844.818600001, 12979...

In [11]:

pl_31.plot()

Out[11]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f706f81aba8>

The map looks OK, but we still need to reproject it to fit with the Polish projection. We will also load the current country shapefile and intersect it with a dissolved 1931 map.

In [12]:

# Change projection
pl_31 = pl_31.to_crs(pl_gminy.crs)

# Dissolve into one polygon
pl_31dis = pl_31
pl_31dis['one'] = 1
pl_31dis = pl_31dis[['geometry', 'one']].dissolve('one')

# Read Poland current boundaries
pl = gpd.read_file(os.path.join(pDir, 'data', 'prg2018', 'Państwo.shp'))

# Intersect the two
pl31_int = gpd.overlay(pl, pl_31dis, how='intersection')
pl31_int.plot()

/usr/lib/python3.6/site-packages/geopandas/base.py:75: UserWarning: Cannot generate spatial index: Missing package `rtree`.
  warn("Cannot generate spatial index: Missing package `rtree`.")

Out[12]:

<matplotlib.axes._subplots.AxesSubplot at 0x7f706f796898>

In [13]:

fig, ax = plt.subplots(1, figsize=(14, 14))
pl_gminy.plot(ax=ax, edgecolor='0.4', linewidth=0.5, 
              column='area', cmap='Greens', legend=True, scheme='quantiles', k=5)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.axis('off')

Out[13]:

(137166.64559966073, 896406.6565782153, 101133.95416049592, 807108.9159837493)

There are some errors visible on the dissolved maps, but the general picture is clear, that on lands that Poland got after WWII the municipalities seem to be bigger.

Merging with BDL data¶

The BDL website contains data from 1995 on various topics. I am interested mostly in population and municipal finance. The latest dataset with municipal revenues is available from 2016. I take total population data from this year as well. It is important to switch to Polish NTS 2013 identifiers to match ids used in the 2018 boundary map.

The data are downloaded in CSV format for easy input to Python. I will be using Pandas, because it nicely integrates with Geopandas.

In [14]:

bdl_own = pd.read_csv(
        os.path.join(pDir, 'data', 'bdl', 'FINA_2622_CREL_20180713215941.csv'), 
        sep=';', dtype={'Kod': str}, decimal=',').drop(columns=['Atrybut', 'Unnamed: 8'])
bdl_own.head()

Out[14]:

	Nazwa	Jednostki terytorialne	Rodzaje dochodów	Rok	Wartosc	Jednostka miary
0	POLSKA	gminy łącznie z miastami na prawach powiatu	razem	2016	9.100380e+10	zł
1	POLSKA	gminy łącznie z miastami na prawach powiatu	dochody podatkowe - podatek rolny	2016	1.513458e+09	zł
2	POLSKA	gminy łącznie z miastami na prawach powiatu	dochody podatkowe - podatek leśny	2016	2.959435e+08	zł
3	POLSKA	gminy łącznie z miastami na prawach powiatu	dochody podatkowe - podatek od nieruchomości	2016	2.077447e+10	zł
4	POLSKA	gminy łącznie z miastami na prawach powiatu	dochody podatkowe - podatek od środków transpo...	2016	1.055224e+09	zł

In [15]:

# Types of own income
bdl_own['Rodzaje dochodów'].unique()

Out[15]:

array(['razem', 'dochody podatkowe - podatek rolny',
       'dochody podatkowe - podatek leśny',
       'dochody podatkowe - podatek od nieruchomości',
       'dochody podatkowe - podatek od środków transportowych',
       'wpływy z opłaty skarbowej',
       'dochody podatkowe - podatek od czynności cywilnoprawnych',
       'dochody podatkowe - podatek od działalności gospodarczej osób fizycznych, opłacany w formie karty podatkowej',
       'wpływy z opłaty eksploatacyjnej', 'wpływy z opłaty targowej',
       'dochody z majątku',
       'udziały w podatkach stanowiących dochody budżetu państwa razem',
       'udziały w podatkach stanowiących dochody budżetu państwa podatek dochodowy od osób fizycznych',
       'udziały w podatkach stanowiących dochody budżetu państwa podatek dochodowy od osób prawnych',
       'wpływy z innych lokalnych opłat pobieranych przez jednostki samorządu terytorialnego na podstawie odrębnych ustaw',
       'wpływy z usług',
       'pozostałe dochody - środki na dofinansowanie własnych zadań pozyskane z innych źródeł - razem',
       'pozostałe dochody - środki na dofinansowanie własnych zadań pozyskane z innych źródeł - inwestycyjne',
       'dochody z majątku - dochody z najmu i dzierżawy składników majątkowych JST oraz innych umów o podobnym charakterze',
       'dochody podatkowe - ustalone i pobierane na podstawie odrębnych ustaw'],
      dtype=object)

OK, so I want only total own revenues and revenues coming from PIT and CIT. The latter are proportional to peoples' incomes and they can use as a nice indicator of which municipality is richer.

In [16]:

keep = ['razem', 'udziały w podatkach stanowiących dochody budżetu państwa razem']
bdl_own = bdl_own.pivot('Kod', 'Rodzaje dochodów', 'Wartosc')[keep]
bdl_own.columns = ['own_total', 'own_bud']
bdl_own.head()

Out[16]:

	own_total	own_bud
Kod
0000000	9.100380e+10	3.775226e+10
0200000	7.924310e+09	3.017760e+09
0201000	2.015347e+08	6.388365e+07
0201011	9.658595e+07	3.560506e+07
0201022	2.913815e+07	1.013819e+07

We do similar transformations to population data.

In [17]:

bdl_pers = pd.read_csv(
        os.path.join(pDir, 'data', 'bdl', 'LUDN_1336_CREL_20180713222221.csv'), 
        sep=';', dtype={'Kod': str}, decimal=',').drop(columns=['Atrybut', 'Unnamed: 10'])
bdl_pers.head()

Out[17]:

	Kod	Nazwa	Lokalizacje	Miejsce zamieszkania/zameldowania	Stan na dzień	Płeć	Rok	Wartosc	Jednostka miary
0	0000000	POLSKA	ogółem	miejsce zamieszkania	stan na 31 XII	ogółem	2016	38432992	osoba
1	0000000	POLSKA	ogółem	miejsce zamieszkania	stan na 31 XII	mężczyźni	2016	18593166	osoba
2	0000000	POLSKA	ogółem	miejsce zamieszkania	stan na 31 XII	kobiety	2016	19839826	osoba
3	0200000	DOLNOŚLĄSKIE	ogółem	miejsce zamieszkania	stan na 31 XII	ogółem	2016	2903710	osoba
4	0200000	DOLNOŚLĄSKIE	ogółem	miejsce zamieszkania	stan na 31 XII	mężczyźni	2016	1395960	osoba

In [18]:

bdl_pers = bdl_pers[bdl_pers['Płeć']=='ogółem']
bdl_pers = bdl_pers[['Kod', 'Wartosc']]
bdl_pers.columns = ['Kod', 'pers']
bdl_pers.head()

Out[18]:

	Kod	pers
0	0000000	38432992
3	0200000	2903710
6	0201000	90180
9	0201011	39167
12	0201022	14210

In the next step I merge everyting together and plot the maps.

In [19]:

# Merge data with gminy_dataset
pl_merged = pl_gminy.merge(bdl_pers, left_on='jpt_kod_je', right_on='Kod', how='left')
pl_merged = pl_merged.merge(bdl_own, left_on='jpt_kod_je', right_index=True, how='left')

pl_merged.isnull().sum()

Out[19]:

jpt_kod_je     0
jpt_nazwa_     0
geometry       0
area           0
Kod           12
pers          12
own_total     12
own_bud       12
dtype: int64

Above we can see that 12 municipalities did not merge correctly. This is porbably due to changes in TERYT coding between 2016 and 2016. We continue nevertheless. They wittl appear as transparent, so we add black Poland in background.

In [20]:

# Plot number of people
fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='pers', cmap='coolwarm', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('Number of persons in 2016', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')

Out[20]:

(137166.64559966067, 896406.6565782153, 101133.95412515872, 807108.9159854322)

In [21]:

# Plot population density
pl_merged['pop_dens'] = pl_merged['pers'] / pl_merged['area'] * 1000

fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='pop_dens', cmap='Reds', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('Population density in 2016, 1000 pers per km$^2$', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')

Out[21]:

(137166.64559966067, 896406.6565782153, 101133.95412515872, 807108.9159854322)

In [22]:

# Plot  own revenues
pl_merged['rev_pc'] = pl_merged['own_total'] / pl_merged['pers'] 

fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='rev_pc', cmap='Spectral_r', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('Own revenues p.c. in 2016', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')

Out[22]:

(137166.64559966067, 896406.6565782153, 101133.95412515872, 807108.9159854322)

In [23]:

# Plot PIT+CIT revenues
pl_merged['rev_bud_pc'] = pl_merged['own_bud'] / pl_merged['pers'] 

fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='rev_bud_pc', cmap='Spectral_r', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('PIT+CIT share p.c. in 2016', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')

Out[23]:

(137166.64559966067, 896406.6565782153, 101133.95412515872, 807108.9159854322)

We want to identify the richest and poorest municipality in 2016. The best option is to look on PIT+CIT revenues in 2016.

In [24]:

pl_merged[['rev_pc', 'rev_bud_pc']].describe()

Out[24]:

	rev_pc	rev_bud_pc
count	2466.000000	2466.000000
mean	1608.749698	565.500313
std	1226.769594	312.981626
min	508.167882	137.535044
25%	1088.032933	360.831469
50%	1399.670300	480.206368
75%	1841.646733	683.643590
max	44239.790076	3211.449533

In [25]:

poorest_id = pl_merged['rev_bud_pc'].idxmin()
richest_id = pl_merged['rev_bud_pc'].idxmax()

pl_merged.iloc[[poorest_id, richest_id]]

Out[25]:

	jpt_kod_je	jpt_nazwa_	geometry	area	Kod	pers	own_total	own_bud	pop_dens	rev_pc	rev_bud_pc
1583	2007062	Przytuły	POLYGON ((715188.922301696 622743.605402404, 7...	7.114442e+07	2007062	2167.0	1227927.69	298038.44	0.030459	566.648680	137.535044
1213	1405021	Podkowa Leśna	POLYGON ((616354.062299964 474650.463718515, 6...	1.009900e+07	1405021	3853.0	19006721.24	12373715.05	0.381523	4932.966841	3211.449533

Conclusions¶

So! Now we know that the wealthiest municipality in Poland in 2016 was Podkowa Leśna. Incomes in general have a pattern similar to Polish partitions before 1918.

Geopandas is a nice way to load and modify maps in Python. Adding to that Pandas functionality we can get a nice pipeline. It of course requires much more work than other GUI tools, but has the share advantage of repeatability. It is easy to modify the whole work process and not to have to click everything through from the beginning. One drawback is in geo data preparation. Here QGIS offers more tools for geometric manipulation. Combining the two can be quite a time saving experience.

Maps and Python

Tools¶

Data sources¶

Merging with BDL data¶

Conclusions¶