Maps and Python

Recently I've been working a lot with maps and geographical data. Most of the work has been dome in Stata and I've been using mostly QGIS for visualization. Also I did some simple spatial operations in GDAL and ArcGIS. But when it comes to producing maps, QGIS was my tool of choice. Ideas change quite a lot and QGIS is nice tool for data visualization, but unfortunately I've found it's Python API a bit troublesome to help me automatize production of maps. So I kept doing things manually.

Tools

Here I turn to Python and its libraries to help me automatize map production. It turn out not much effort is needed and the maps being produced are visually appealing just after some minor tweaks.

In this short excercise I use Geopandas which enables a simple way to work with Geo data. Except that I also use the standard scientific libraries like Matplotlib, Pandas and Numpy.

Data sources

In this example I use administrative boundaries for Poland and connect it with Local Data Bank, BDL data from the Polish national statistics office.

Shapefiles for administrative boundaries for Poland are available for free to download here. Unfortunately only last year is available, but still it's something.

First we need to setup our environment. I always try to use absolute paths, so it's easy for me to switch between computers.

In [1]:
%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import geopandas as gpd

import os

# Project directory
pDir = "/home/mateusz/projects/gis/"

Openinig a Shapefiles is extremly easy. Geopandas documentation says that the only thing that you jus need one line of code to do that. Lets try reading the municipalities data...

In [2]:
try:
    pl_gminy = gpd.read_file(os.path.join(pDir, 'data', 'prg2018', 'gminy.shp'))
except ValueError as e:
    print(e)
year 0 is out of range

Unfortunately I got an error. That's probably because the SHP files are created on Windows machines using Polish localization. The default encoding in use is the CP1250. To be able to read the municipalities file I had to open the original first in QGIS indicating the encoding and then save the file with UTF-8 encoding selected. It worked like a charm.

I also left just the columns with names and municipality codes (TERYT) jpt_kod_je.

In [3]:
pl_gminy = gpd.read_file(os.path.join(pDir, 'data', 'prg2018', 'gminy_utf.shp'))
pl_gminy.head()
Out[3]:
jpt_kod_je jpt_nazwa_ geometry
0 0222033 Wołów POLYGON ((319872.625979769 383951.756238694, 3...
1 0214062 Oleśnica POLYGON ((380436.400932976 368874.817494596, 3...
2 0223022 Długołęka POLYGON ((365569.581931299 376008.352757658, 3...
3 0214073 Syców POLYGON ((397867.372079491 383092.820273452, 3...
4 0223012 Czernica POLYGON ((370352.948437282 358781.965472659, 3...

So the geometry is coded in the geometry column. Nice to know. Let's see what is the projection in use.

In [4]:
pl_gminy.crs
Out[4]:
{'proj': 'tmerc',
 'lat_0': 0,
 'lon_0': 19,
 'k': 0.9993,
 'x_0': 500000,
 'y_0': -5300000,
 'ellps': 'GRS80',
 'units': 'm',
 'no_defs': True}

The used units are meters, so we can easily calculate area.

In [5]:
pl_gminy['area'] = pl_gminy['geometry'].area
pl_gminy['area'].describe()
Out[5]:
count    2.478000e+03
mean     1.261083e+08
std      7.858655e+07
min      3.312067e+06
25%      7.676658e+07
50%      1.117959e+08
75%      1.594393e+08
max      6.340781e+08
Name: area, dtype: float64
In [6]:
np.sum(pl_gminy['area'])/1e6 # I divide by 1000x1000 m to get km^2
Out[6]:
312496.4701659419

Google says that area of Poland is 312 679 km$^2$, so there is lack of precission here.

Let's try to plot the map.

In [7]:
pl_gminy.plot()
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f70891f5518>

It worked, but the map still needs some minor improvements.

In [8]:
fig, ax = plt.subplots(1, figsize=(14, 14))
pl_gminy.plot(ax=ax, linewidth=0.5, edgecolor='0.4', color='0.8')
ax.axis('off')
Out[8]:
(137166.64559966073, 896406.6565782153, 101133.95416049592, 807108.9159837493)

Much better! Lets try to add some color! First let's plot the area.

In [9]:
fig, ax = plt.subplots(1, figsize=(14, 14))
pl_gminy.plot(ax=ax, edgecolor='0.4', linewidth=0.5, 
              column='area', cmap='Greens', legend=True, scheme='quantiles', k=5)
ax.axis('off')
Out[9]:
(137166.64559966073, 896406.6565782153, 101133.95416049592, 807108.9159837493)

It seems that biggest municipalities are located in the north and west of Poland. This somehow seems as the municipalities located outside the pre WWII borders of Poland are generally larger.

We can download 1931 SHP files for Poland from MPIDR Population History GIS Collection Data Files. The site requires registration, but the material they provide is worth it!

In [10]:
pl_31 = gpd.read_file(os.path.join(pDir, 'data', 'Poland 1770-1978', 'Poland_1931_v.1.0.shp'))
print(pl_31.crs)
pl_31.head()
{'init': 'epsg:32633'}
Out[10]:
ID_ NAME1_ Control AREA PERIMETER geometry
0 P1403 Kowel district ja 5713.195797 526.052173 POLYGON ((1170828.9243 5764522.9299, 1169997.6...
1 P1401 Luboml district ja 2085.227769 262.833301 POLYGON ((1106927.4986 5762288.3598, 1107340.9...
2 P1410 Kostopol district ja 3829.038222 325.007644 POLYGON ((1304054.5048 5744319.3704, 1304437.6...
3 P1405 £uck district ja 4963.729259 508.742412 POLYGON ((1248251.6271 5746498.978399999, 1248...
4 P1407 Krzemieniec district ja 2701.112880 277.950716 POLYGON ((1296489.327 5633844.818600001, 12979...
In [11]:
pl_31.plot()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f706f81aba8>

The map looks OK, but we still need to reproject it to fit with the Polish projection. We will also load the current country shapefile and intersect it with a dissolved 1931 map.

In [12]:
# Change projection
pl_31 = pl_31.to_crs(pl_gminy.crs)

# Dissolve into one polygon
pl_31dis = pl_31
pl_31dis['one'] = 1
pl_31dis = pl_31dis[['geometry', 'one']].dissolve('one')

# Read Poland current boundaries
pl = gpd.read_file(os.path.join(pDir, 'data', 'prg2018', 'Państwo.shp'))

# Intersect the two
pl31_int = gpd.overlay(pl, pl_31dis, how='intersection')
pl31_int.plot()
/usr/lib/python3.6/site-packages/geopandas/base.py:75: UserWarning: Cannot generate spatial index: Missing package `rtree`.
  warn("Cannot generate spatial index: Missing package `rtree`.")
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f706f796898>
In [13]:
fig, ax = plt.subplots(1, figsize=(14, 14))
pl_gminy.plot(ax=ax, edgecolor='0.4', linewidth=0.5, 
              column='area', cmap='Greens', legend=True, scheme='quantiles', k=5)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.axis('off')
Out[13]:
(137166.64559966073, 896406.6565782153, 101133.95416049592, 807108.9159837493)

There are some errors visible on the dissolved maps, but the general picture is clear, that on lands that Poland got after WWII the municipalities seem to be bigger.

Merging with BDL data

The BDL website contains data from 1995 on various topics. I am interested mostly in population and municipal finance. The latest dataset with municipal revenues is available from 2016. I take total population data from this year as well. It is important to switch to Polish NTS 2013 identifiers to match ids used in the 2018 boundary map.

The data are downloaded in CSV format for easy input to Python. I will be using Pandas, because it nicely integrates with Geopandas.

In [14]:
bdl_own = pd.read_csv(
        os.path.join(pDir, 'data', 'bdl', 'FINA_2622_CREL_20180713215941.csv'), 
        sep=';', dtype={'Kod': str}, decimal=',').drop(columns=['Atrybut', 'Unnamed: 8'])
bdl_own.head()
Out[14]:
Kod Nazwa Jednostki terytorialne Rodzaje dochodów Rok Wartosc Jednostka miary
0 0000000 POLSKA gminy łącznie z miastami na prawach powiatu razem 2016 9.100380e+10
1 0000000 POLSKA gminy łącznie z miastami na prawach powiatu dochody podatkowe - podatek rolny 2016 1.513458e+09
2 0000000 POLSKA gminy łącznie z miastami na prawach powiatu dochody podatkowe - podatek leśny 2016 2.959435e+08
3 0000000 POLSKA gminy łącznie z miastami na prawach powiatu dochody podatkowe - podatek od nieruchomości 2016 2.077447e+10
4 0000000 POLSKA gminy łącznie z miastami na prawach powiatu dochody podatkowe - podatek od środków transpo... 2016 1.055224e+09
In [15]:
# Types of own income
bdl_own['Rodzaje dochodów'].unique()
Out[15]:
array(['razem', 'dochody podatkowe - podatek rolny',
       'dochody podatkowe - podatek leśny',
       'dochody podatkowe - podatek od nieruchomości',
       'dochody podatkowe - podatek od środków transportowych',
       'wpływy z opłaty skarbowej',
       'dochody podatkowe - podatek od czynności cywilnoprawnych',
       'dochody podatkowe - podatek od działalności gospodarczej osób fizycznych, opłacany w formie karty podatkowej',
       'wpływy z opłaty eksploatacyjnej', 'wpływy z opłaty targowej',
       'dochody z majątku',
       'udziały w podatkach stanowiących dochody budżetu państwa razem',
       'udziały w podatkach stanowiących dochody budżetu państwa podatek dochodowy od osób fizycznych',
       'udziały w podatkach stanowiących dochody budżetu państwa podatek dochodowy od osób prawnych',
       'wpływy z innych lokalnych opłat pobieranych przez jednostki samorządu terytorialnego na podstawie odrębnych ustaw',
       'wpływy z usług',
       'pozostałe dochody - środki na dofinansowanie własnych zadań pozyskane z innych źródeł - razem',
       'pozostałe dochody - środki na dofinansowanie własnych zadań pozyskane z innych źródeł - inwestycyjne',
       'dochody z majątku - dochody z najmu i dzierżawy składników majątkowych JST oraz innych umów o podobnym charakterze',
       'dochody podatkowe - ustalone i pobierane na podstawie odrębnych ustaw'],
      dtype=object)

OK, so I want only total own revenues and revenues coming from PIT and CIT. The latter are proportional to peoples' incomes and they can use as a nice indicator of which municipality is richer.

In [16]:
keep = ['razem', 'udziały w podatkach stanowiących dochody budżetu państwa razem']
bdl_own = bdl_own.pivot('Kod', 'Rodzaje dochodów', 'Wartosc')[keep]
bdl_own.columns = ['own_total', 'own_bud']
bdl_own.head()
Out[16]:
own_total own_bud
Kod
0000000 9.100380e+10 3.775226e+10
0200000 7.924310e+09 3.017760e+09
0201000 2.015347e+08 6.388365e+07
0201011 9.658595e+07 3.560506e+07
0201022 2.913815e+07 1.013819e+07

We do similar transformations to population data.

In [17]:
bdl_pers = pd.read_csv(
        os.path.join(pDir, 'data', 'bdl', 'LUDN_1336_CREL_20180713222221.csv'), 
        sep=';', dtype={'Kod': str}, decimal=',').drop(columns=['Atrybut', 'Unnamed: 10'])
bdl_pers.head()
Out[17]:
Kod Nazwa Lokalizacje Miejsce zamieszkania/zameldowania Stan na dzień Płeć Rok Wartosc Jednostka miary
0 0000000 POLSKA ogółem miejsce zamieszkania stan na 31 XII ogółem 2016 38432992 osoba
1 0000000 POLSKA ogółem miejsce zamieszkania stan na 31 XII mężczyźni 2016 18593166 osoba
2 0000000 POLSKA ogółem miejsce zamieszkania stan na 31 XII kobiety 2016 19839826 osoba
3 0200000 DOLNOŚLĄSKIE ogółem miejsce zamieszkania stan na 31 XII ogółem 2016 2903710 osoba
4 0200000 DOLNOŚLĄSKIE ogółem miejsce zamieszkania stan na 31 XII mężczyźni 2016 1395960 osoba
In [18]:
bdl_pers = bdl_pers[bdl_pers['Płeć']=='ogółem']
bdl_pers = bdl_pers[['Kod', 'Wartosc']]
bdl_pers.columns = ['Kod', 'pers']
bdl_pers.head()
Out[18]:
Kod pers
0 0000000 38432992
3 0200000 2903710
6 0201000 90180
9 0201011 39167
12 0201022 14210

In the next step I merge everyting together and plot the maps.

In [19]:
# Merge data with gminy_dataset
pl_merged = pl_gminy.merge(bdl_pers, left_on='jpt_kod_je', right_on='Kod', how='left')
pl_merged = pl_merged.merge(bdl_own, left_on='jpt_kod_je', right_index=True, how='left')

pl_merged.isnull().sum()
Out[19]:
jpt_kod_je     0
jpt_nazwa_     0
geometry       0
area           0
Kod           12
pers          12
own_total     12
own_bud       12
dtype: int64

Above we can see that 12 municipalities did not merge correctly. This is porbably due to changes in TERYT coding between 2016 and 2016. We continue nevertheless. They wittl appear as transparent, so we add black Poland in background.

In [20]:
# Plot number of people
fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='pers', cmap='coolwarm', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('Number of persons in 2016', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')
Out[20]:
(137166.64559966067, 896406.6565782153, 101133.95412515872, 807108.9159854322)
In [21]:
# Plot population density
pl_merged['pop_dens'] = pl_merged['pers'] / pl_merged['area'] * 1000

fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='pop_dens', cmap='Reds', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('Population density in 2016, 1000 pers per km$^2$', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')
Out[21]:
(137166.64559966067, 896406.6565782153, 101133.95412515872, 807108.9159854322)
In [22]:
# Plot  own revenues
pl_merged['rev_pc'] = pl_merged['own_total'] / pl_merged['pers'] 

fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='rev_pc', cmap='Spectral_r', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('Own revenues p.c. in 2016', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')
Out[22]:
(137166.64559966067, 896406.6565782153, 101133.95412515872, 807108.9159854322)
In [23]:
# Plot PIT+CIT revenues
pl_merged['rev_bud_pc'] = pl_merged['own_bud'] / pl_merged['pers'] 

fig, ax = plt.subplots(1, figsize=(14, 14))
pl.plot(ax=ax, linewidth=0.5, edgecolor='0', color='0')
pl_merged.dropna().plot(ax=ax, column='rev_bud_pc', cmap='Spectral_r', legend=True, scheme='quantiles', k=10)
pl31_int.plot(ax=ax, edgecolor='blue', linewidth=2.5, color=(0, 0,0,0))
ax.set_title('PIT+CIT share p.c. in 2016', fontdict={'fontsize': 20}, loc='left')
ax.get_legend().set_bbox_to_anchor((.2, .3))
ax.axis('off')
Out[23]:
(137166.64559966067, 896406.6565782153, 101133.95412515872, 807108.9159854322)

We want to identify the richest and poorest municipality in 2016. The best option is to look on PIT+CIT revenues in 2016.

In [24]:
pl_merged[['rev_pc', 'rev_bud_pc']].describe()
Out[24]:
rev_pc rev_bud_pc
count 2466.000000 2466.000000
mean 1608.749698 565.500313
std 1226.769594 312.981626
min 508.167882 137.535044
25% 1088.032933 360.831469
50% 1399.670300 480.206368
75% 1841.646733 683.643590
max 44239.790076 3211.449533
In [25]:
poorest_id = pl_merged['rev_bud_pc'].idxmin()
richest_id = pl_merged['rev_bud_pc'].idxmax()

pl_merged.iloc[[poorest_id, richest_id]]
Out[25]:
jpt_kod_je jpt_nazwa_ geometry area Kod pers own_total own_bud pop_dens rev_pc rev_bud_pc
1583 2007062 Przytuły POLYGON ((715188.922301696 622743.605402404, 7... 7.114442e+07 2007062 2167.0 1227927.69 298038.44 0.030459 566.648680 137.535044
1213 1405021 Podkowa Leśna POLYGON ((616354.062299964 474650.463718515, 6... 1.009900e+07 1405021 3853.0 19006721.24 12373715.05 0.381523 4932.966841 3211.449533

Conclusions

So! Now we know that the wealthiest municipality in Poland in 2016 was Podkowa Leśna. Incomes in general have a pattern similar to Polish partitions before 1918.

Geopandas is a nice way to load and modify maps in Python. Adding to that Pandas functionality we can get a nice pipeline. It of course requires much more work than other GUI tools, but has the share advantage of repeatability. It is easy to modify the whole work process and not to have to click everything through from the beginning. One drawback is in geo data preparation. Here QGIS offers more tools for geometric manipulation. Combining the two can be quite a time saving experience.