Housing data analysis using Machine Learning - adding geo data

I started my analysis in part 1 with simple data exploration and the most basic OLS regression. Then in part 2 I proceeded to include more advanced Machine Learning methods from the scikit-learn library.

In today's part I will move on to augmenting my data with geographical characteristics. For that I will explore the OpenStreetMap data for Szczecin and merge it with the downloaded housing ads. I want to see whether adding information about proximity to schools, shops, entertainment and other amenities has any influence on housing prices.

Most of the tools used in this chapter I learned from two books. I already knew some basic SQL syntax, but by studying PostgreSQL: Up and Running I familiarized myself with some peculiarities of PostgreSQL. I have tried MySQL, SQLite and MongoDB, but somehow I grew a liking for PostgreSQL. It might simply be that this was the first time I learned a relational database systematically from a book instead of web tutorials. Also, whenever I forget a particular topic it is relatively easy for me to find it again in a book.

I have also used a lot of GIS recently at work, mostly with QGIS and ArcGIS. I quickly got interested in easy routing calculations using pgRouting, and to go deeper into it I naturally needed to learn PostGIS, which I studied using PostGIS in Action. It was a lifesaver! The book is brilliantly written, and I enjoyed going through it, learning more both about PostgreSQL and about geo-information processing in general. I will implement some of the ideas from the book further in this notebook.

In the end I will compare different regression models. Finally, I will construct a simple neural network regression model and compare its results with OLS.

In [8]:
import numpy as np
import psycopg2 as psql
import pandas as pd
import geopandas as gpd

# Matplotlib options
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)

# Helper function to display PostGIS geometries
def disp_poly(sql, con, geom="geom"):
    """
    Load PostGIS geometries returned by an SQL query into a GeoDataFrame (for plotting).
    """
    df = gpd.GeoDataFrame.from_postgis(sql, con, geom_col=geom)
    if 'id' not in df.columns:
        df['id'] = df.index
    return df

Creating distance data

After loading the basic libraries I will use the IPython SQL magic to connect to PostgreSQL, and I will also create a psycopg2 connection. Additionally, I use the pgspecial Python module to enable execution of psql commands.

In [9]:
%load_ext sql
%sql postgresql://mateusz@localhost/szczecin_data
con = psql.connect(database="szczecin_data", user="mateusz", host="localhost")
The sql extension is already loaded. To reload it, use:
  %reload_ext sql

Before running this exercise I downloaded the OpenStreetMap data for the Zachodniopomorskie region from Geofabrik.

Next I used osm2pgsql to load the OSM data into PostgreSQL with PostGIS. The resulting tables can be seen as the planet_osm_* tables, with points, lines, roads and polygons stored separately.

In [10]:
%sql \dt
 * postgresql://mateusz@localhost/szczecin_data
8 rows affected.
Out[10]:
Schema Name Type Owner
public adds table mateusz
public dates table mateusz
public planet_osm_line table mateusz
public planet_osm_point table mateusz
public planet_osm_polygon table mateusz
public planet_osm_roads table mateusz
public routing table mateusz
public spatial_ref_sys table mateusz

Remembering the previous exercise, I wanted to filter out some of the observations. I created a view that includes only the filtered ads, so that I don't need to repeat the filtering query every time.

DROP VIEW IF EXISTS filtered;
CREATE VIEW filtered AS
SELECT id, price, pow, rooms, floor, data_lat, data_lon, details, description, geom
FROM adds
WHERE
    pow < 200 AND
    price < 1e6 AND
    data_lat > 53.01 AND
    data_lat < 53.6 AND
    data_lon > 14.01 AND
    data_lon < 15 AND
    details != '';

Plotting maps in Python

I can now plot the selected ads according to their geographic coordinates. To achieve this I use the wrapper function around the GeoPandas module defined at the beginning.

In [11]:
query = 'SELECT id, pow, price, geom FROM filtered;'
df = disp_poly(query, con)
print(df.shape)
df.head()
(39373, 4)
Out[11]:
id pow price geom
0 13612 45.94 248000.0 POINT (14.51815711691893 53.38403163098405)
1 14847 90.00 309000.0 POINT (14.47106095 53.38718211)
2 15814 98.60 530000.0 POINT (14.4633004 53.3852974)
3 17486 53.00 289000.0 POINT (14.4755835 53.3887893)
4 17833 100.00 360000.0 POINT (14.4633004 53.3852974)
In [12]:
df.plot(column='price')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff5a63d6c88>

I will next add a basemap in the background, following the GeoPandas documentation.

In [13]:
# Add Helper function for plotting
import contextily as ctx

def add_basemap(ax, zoom, url='http://tile.stamen.com/terrain/tileZ/tileX/tileY.png'):
    xmin, xmax, ymin, ymax = ax.axis()
    basemap, extent = ctx.bounds2img(xmin, ymin, xmax, ymax, zoom=zoom, url=url)
    ax.imshow(basemap, extent=extent, interpolation='bilinear')
    # restore original x/y limits
    ax.axis((xmin, xmax, ymin, ymax))

# Reproject geo data from WGS84 to Web Mercator
df = df.to_crs(epsg=3857)
# Plot the map with data
ax = df.plot(column='price', alpha=0.1)
minx, miny, maxx, maxy = df.total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
add_basemap(ax, zoom=10, url=getattr(ctx.sources, 'OSM_A'))
ax.set_axis_off()

That's nice, but there's a lot going on in the centre. I want to zoom in on a region I defined in a GeoJSON file.

In [14]:
# Read centre
centre = gpd.read_file('/home/mateusz/projects/gis/szczecin_geo_analysis/centre.geojson')
ax = df.plot(column='price', alpha=0.1)
minx, miny, maxx, maxy = centre.to_crs(epsg=3857).total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
add_basemap(ax, zoom=12, url=getattr(ctx.sources, 'OSM_A'))
ax.set_axis_off()

From these simple maps we can already see that some of the locations are miscoded. There are also many locations with exactly the same coordinates. Apparently many locations were provided based on some POI, or were only approximated by the neighborhood, ending up with the same centre.

Also, a couple of observations appear to be located on water or in a cemetery. It would be advisable to remove them in a future analysis.

Below is a simple query counting ads that share the same location.

In [15]:
%sql SELECT COUNT (*) as sum FROM filtered group by geom order by sum desc limit 10;
 * postgresql://mateusz@localhost/szczecin_data
10 rows affected.
Out[15]:
sum
3153
1569
1395
1023
985
912
854
819
687
388

Before continuing with the geographical analyses in PostGIS I would like to add a spatial index to speed up my queries.

It is done simply in SQL:

DROP INDEX IF EXISTS adds_gix;
CREATE INDEX adds_gix ON adds USING GIST (geom);
VACUUM ANALYZE adds;

Selecting amenities from OSM

I will be using the amenity key in OSM to first count the interesting amenities. One thing I noticed is that the same amenity can appear as a point, as a polygon, or as both. What I want to do next is to see how many amenities are defined as each of those geometries.

In constructing my query I want to include amenities only within the extent of the filtered ads (1). Then I count the total number of amenities in the polygon and point tables (2). Finally I join the counts by amenity type for polygons and points (3).

The results show that many amenities are unique to a single geometry type (having NaN counts in the sum column), but there are still many amenities that can have both types. This means that I will have to handle cases where the same amenity is defined both as a polygon and as a point, and make sure that no double counting occurs.

To be frank, this should not be a problem when the only thing I'm interested in is the minimal distance to the closest amenity. Double counting aside, it will still be the closest amenity to a house. Dropping duplicates will just speed up my query.

In [16]:
# Count amenities
query = """
WITH ext as (
    -- 1. Select only amenities within the adds extent.
    SELECT ST_Transform(
            ST_SetSRID(ST_Extent(geom), 4326), 
            3857
        ) as ext
    FROM filtered
)
SELECT 
    pt.amenity, 
    poly.amenity,
    poly.sum as poly_sum,
    pt.sum as point_sum
FROM
(
    -- 2. Select amenities within the extent and count them by amenity type.
    SELECT amenity, count (*) as sum
    FROM planet_osm_polygon, ext
    WHERE amenity is not NULL
    AND way && ext.ext
    GROUP BY amenity
) as poly
-- 3. Join points and polygon amenities counts by amenity type.
FULL OUTER JOIN
(
    SELECT amenity, count (*) as sum
    FROM planet_osm_point, ext
    WHERE amenity is not NULL
    AND way && ext.ext
    GROUP BY amenity
) as pt
ON (pt.amenity=poly.amenity)
order by poly_sum desc
;
"""
amenities = pd.read_sql(query, con)
amenities
Out[16]:
amenity amenity poly_sum point_sum
0 nightclub None NaN 22.0
1 animal_boarding None NaN 1.0
2 animal_breeding None NaN 1.0
3 weighbridge None NaN 5.0
4 water_point None NaN 2.0
5 atm None NaN 162.0
6 baby_hatch None NaN 1.0
7 data_center None NaN 1.0
8 dentist None NaN 56.0
9 dive_centre None NaN 3.0
10 dojo None NaN 2.0
11 drinking_water None NaN 12.0
12 driving_school None NaN 4.0
13 embassy None NaN 5.0
14 ferry_terminal None NaN 2.0
15 food_court None NaN 2.0
16 gym None NaN 1.0
17 accounting_office None NaN 1.0
18 hunting_stand None NaN 24.0
19 parking_entrance None NaN 106.0
20 jobcentre None NaN 1.0
21 language_school None NaN 3.0
22 motorcycle_parking None NaN 1.0
23 waste_basket None NaN 59.0
24 bbq None NaN 11.0
25 vending_machine None NaN 213.0
26 bicycle_rental None NaN 87.0
27 trade_school None NaN 1.0
28 boat_rental None NaN 2.0
29 bureau_de_change None NaN 32.0
... ... ... ... ...
65 pharmacy pharmacy 6.0 165.0
66 clinic clinic 5.0 17.0
67 grave_yard grave_yard 5.0 1.0
68 bank bank 5.0 131.0
69 toilets toilets 5.0 48.0
70 theatre theatre 5.0 4.0
71 townhall townhall 4.0 2.0
72 post_office post_office 4.0 56.0
73 pub pub 4.0 59.0
74 cafe cafe 4.0 88.0
75 childcare childcare 3.0 3.0
76 nursing_home nursing_home 3.0 1.0
77 courthouse courthouse 3.0 4.0
78 community_centre community_centre 3.0 10.0
79 veterinary veterinary 3.0 22.0
80 bar bar 3.0 28.0
81 arts_centre arts_centre 3.0 6.0
82 None animal_shelter 2.0 NaN
83 None public_garden 1.0 NaN
84 None yes 1.0 NaN
85 biergarten biergarten 1.0 1.0
86 None proposed 1.0 NaN
87 vehicle_inspection vehicle_inspection 1.0 6.0
88 None prison 1.0 NaN
89 None monastery 1.0 NaN
90 ice_cream ice_cream 1.0 15.0
91 None car_service 1.0 NaN
92 None casino 1.0 NaN
93 clock clock 1.0 6.0
94 bus_station bus_station 1.0 1.0

95 rows × 4 columns

Just to see if this type of amenity selection works, I will load schools both as points and as polygons. First I select all polygons tagged as schools. Next I UNION them with the points that do not intersect (are outside) the already selected polygons. To achieve this I LEFT JOIN the points with intersecting polygons and keep only those points that did not join; these can be identified by a NULL polygon identifier.

In [17]:
# Load schools
query = """
WITH polys AS (
    -- select 'school' polygons as CTE
    SELECT osm_id, way, name, tags
    FROM planet_osm_polygon
    WHERE amenity = 'school'
)
SELECT osm_id, ST_Centroid(way) as geom, name, tags
FROM 
(
    SELECT * FROM polys
    UNION ALL
    (
        SELECT pt.osm_id as osm_id, pt.way, pt.name, pt.tags
        -- Select points
        FROM planet_osm_point as pt
        -- Join them with polygons with which they intersect
        LEFT JOIN polys as pol
        on ST_Intersects(pt.way, pol.way)
        -- Leave only schools (points) which did not intersect with polygons
        WHERE pt.amenity = 'school' and pol.osm_id is NULL
    )
) as foo;
"""
schools = gpd.GeoDataFrame.from_postgis(query, con,  geom_col='geom')
In [18]:
ax = schools.to_crs(epsg=3857).plot(c='green')
minx, miny, maxx, maxy = centre.to_crs(epsg=3857).total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
add_basemap(ax, zoom=12, url=getattr(ctx.sources, 'OSM_A'))
ax.set_axis_off()

The previous example looks promising. I will next summarize amenities within the ads' extent in the same way, selecting only unique points (centroids for polygons). I will be using the PostGIS && operator for bounding-box intersection.

In [19]:
# Load all amenities
query = """
-- Define adds extent
WITH ext as (
    SELECT ST_Transform(
            ST_SetSRID(ST_Extent(geom), 4326), 
            3857
        ) as ext
    FROM filtered
),
-- Define polygons intersecting with extent
polys AS (
    SELECT osm_id, amenity, way, name, tags
    FROM planet_osm_polygon, ext
    WHERE amenity is not NULL
        AND way && ext.ext
)
SELECT osm_id, amenity, name, tags, ST_Centroid(way) as geom
FROM 
(   
    -- Union polygons with points NOT within those polygons
    SELECT * FROM polys
    UNION ALL
    (
        SELECT pt.osm_id, pt.amenity, pt.way, pt.name, pt.tags
        FROM ext, planet_osm_point as pt
        LEFT JOIN polys as pol
        on ST_Intersects(pt.way, pol.way)
            AND pt.amenity = pol.amenity
        WHERE 
            pt.amenity is not null 
            and pol.osm_id is NULL
            AND pt.way && ext.ext
    )
) as foo;
"""

amenities = pd.read_sql(query, con)
amenities['amenity'].value_counts()
Out[19]:
parking               2419
bench                  452
shelter                256
recycling              234
restaurant             231
fast_food              230
vending_machine        213
waste_disposal         190
pharmacy               171
atm                    162
bicycle_parking        148
school                 145
bank                   136
place_of_worship       131
doctors                108
fuel                   107
parking_entrance       106
cafe                    92
post_box                90
bicycle_rental          87
kindergarten            76
car_wash                75
pub                     63
parking_space           62
post_office             60
waste_basket            59
dentist                 56
toilets                 53
telephone               46
university              45
                      ... 
car_rental               3
dive_centre              3
language_school          3
boat_rental              2
shower                   2
animal_shelter           2
studio                   2
biergarten               2
water_point              2
bus_station              2
food_court               2
ferry_terminal           2
dojo                     2
baby_hatch               1
casino                   1
data_center              1
proposed                 1
accounting_office        1
public_garden            1
gym                      1
trade_school             1
motorcycle_parking       1
yes                      1
car_service              1
jobcentre                1
animal_breeding          1
public_bookcase          1
monastery                1
animal_boarding          1
prison                   1
Name: amenity, Length: 95, dtype: int64

Next I will build a table of amenities for my analysis. I arbitrarily chose the following:

  • schools;
  • kindergartens;
  • university;
  • theatre;
  • restaurant;
  • library;
  • pub;
  • post_office;
  • place_of_worship;
  • parking;
  • pharmacy;
  • fuel;
  • cafe;
  • cinema;
  • bank;
  • atm;

Also I will add bus stops using the highway key.

I will create a new poi schema and populate it with amenities, using only the ads' extent for the selection. At the end I add a primary key and a spatial index for faster geometry lookups.

The final SQL query to select amenities is the following:

---- Build amenities table in the poi schema
DROP schema if exists poi cascade;
CREATE SCHEMA poi;
CREATE TABLE poi.amenities AS
WITH ext as (
    SELECT ST_Transform(
        ST_SetSRID(ST_Extent(geom), 4326),
        3857
    ) as ext
    FROM filtered
),
-- Define polygons intersecting with extent
polys AS (
    SELECT osm_id, amenity, way, name, tags
    FROM planet_osm_polygon, ext
    WHERE amenity is not NULL
    AND way && ext.ext
)
SELECT osm_id, amenity, name, tags, ST_Centroid(way) as geom
FROM
(
    -- Union polygons with points NOT within those polygons
    SELECT * FROM polys
    UNION ALL
    (
        SELECT pt.osm_id, pt.amenity, pt.way, pt.name, pt.tags
        FROM ext, planet_osm_point as pt
        LEFT JOIN polys as pol
        ON
        ST_Intersects(pt.way, pol.way)
        AND pt.amenity = pol.amenity
        WHERE
        pt.amenity is not null
        and pol.osm_id is NULL
        AND pt.way && ext.ext
    )
) as foo
WHERE
    amenity IN (
        'school',
        'kindergarten',
        'university',
        'theatre',
        'restaurant',
        'library',
        'pub',
        'post_office',
        'place_of_worship',
        'parking',
        'pharmacy',
        'fuel',
        'cafe',
        'cinema',
        'bank',
        'atm'
    )
;
-- Add bus stops
WITH ext as (
    SELECT ST_Transform(
        ST_SetSRID(ST_Extent(geom), 4326),
        3857
    ) as ext
    FROM filtered
)
INSERT INTO poi.amenities
SELECT osm_id, highway as amenity, name, tags, way as geom
FROM planet_osm_point as p, ext
WHERE
highway = 'bus_stop'
AND way && ext.ext;
--- Add id key
ALTER TABLE poi.amenities ADD COLUMN aid serial PRIMARY KEY;
--- Add spatial index
DROP INDEX IF EXISTS idx_poi_amenities_geom cascade;
CREATE INDEX idx_poi_amenities_geom
ON poi.amenities USING GIST (geom);
VACUUM ANALYZE poi.amenities;

Calculating distances

Having my amenities moved into a separate table, I will now calculate, for each advertisement, the minimal distance to the nearest amenity of each type in the poi.amenities table.

To do this I will use a PL/pgSQL loop over all amenity types. In the following query I create a poi.distances table to hold all distances for each ad. I want a table in which each row is an ad and each column is the distance to one amenity type, i.e. a nicely wide-formatted table. A similar query can be constructed without resorting to PL/pgSQL, but it returns a long-formatted table, which I want to avoid.

To calculate the minimal distances I will be using LATERAL CROSS JOINs together with the PostGIS <-> distance operator. Inside the lateral subquery I order by the distance to the current geometry and keep only the first row. The drawback of this approach is that it effectively cross joins each ad with each amenity, producing a huge intermediate result whose size is the product of the two tables' row counts. This could be mitigated by selecting only amenities within a specified distance before the join, but in my case I do not know the minimal distances in advance. The long-format alternative, without PL/pgSQL, is sketched below.
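
For reference, here is a rough sketch of that long-format alternative using a plain LATERAL query (not executed in this notebook); the wide table could then be rebuilt with a pandas pivot. Table and column names follow the objects defined above.

# Sketch only: long-format nearest-amenity distances without PL/pgSQL.
# One row per (ad, amenity type) pair instead of one wide row per ad.
query_long = """
SELECT a.id, t.amenity, r.dist
FROM filtered AS a
CROSS JOIN (SELECT DISTINCT amenity FROM poi.amenities) AS t
CROSS JOIN LATERAL (
    SELECT ST_Distance(ST_Transform(a.geom, 3857), poi.geom) AS dist
    FROM poi.amenities AS poi
    WHERE poi.amenity = t.amenity
    ORDER BY ST_Transform(a.geom, 3857) <-> poi.geom
    LIMIT 1
) AS r;
"""
# dist_long = pd.read_sql(query_long, con)
# Reshape to the wide format in pandas:
# dist_wide = dist_long.pivot(index='id', columns='amenity', values='dist')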

-- Create wide table using PL/pgSQL
DO $$
DECLARE
    am record;
BEGIN
    -- Create distances table
    DROP TABLE IF EXISTS poi.distances;
    CREATE TABLE poi.distances AS SELECT id FROM filtered ORDER BY id;
    ALTER TABLE poi.distances ADD PRIMARY KEY (id);

    -- Loop over distinct amenity types
    FOR am IN SELECT DISTINCT amenity FROM poi.amenities
    LOOP
        -- Add distance column to the table
        EXECUTE 'ALTER TABLE poi.distances
            ADD COLUMN '|| am.amenity ||' real';
        -- Calculate minimal distances
        EXECUTE 'UPDATE poi.distances
            SET '|| am.amenity ||' = res.dist
            FROM
            (
                SELECT a.id,
                ST_Distance(ST_Transform(a.geom, 3857), r.geom) as dist
                FROM filtered as a
                CROSS JOIN LATERAL
                (
                    SELECT amenity, geom
                    FROM
                    poi.amenities as poi
                    WHERE amenity = '''|| am.amenity ||'''
                    ORDER BY ST_Transform(a.geom, 3857) <-> poi.geom
                    LIMIT 1
                ) as r
            ) as res
            WHERE
            poi.distances.id = res.id
            ';
        RAISE NOTICE 'Added amenity: %', am.amenity;
    END LOOP;
END
$$;

After executing the query I load the distances table to check whether the results make sense. Seeing that on average houses have a rather short distance to parking lots and bus stops seems reasonable. Seeing that houses are on average quite far from a theatre or a cinema is also a good sign, because there are not many of them in Szczecin.

In [20]:
dist = pd.read_sql('SELECT * from poi.distances;', con)
In [21]:
ax = dist.drop(columns='id').boxplot(return_type='axes', rot=45)
ax.set_ylabel("Distance [m]")
ax.set_yscale('log')
In [22]:
dist.drop(columns='id').describe().T
Out[22]:
count mean std min 25% 50% 75% max
theatre 39373.0 4241.286157 4279.667634 15.59080 769.681 2775.170 6302.340 36483.4
parking 39373.0 319.178194 451.610156 2.42636 102.473 164.301 357.678 14625.8
pub 39373.0 1925.989122 2211.144687 5.27204 330.650 1129.640 3124.790 28508.3
restaurant 39373.0 898.060149 965.736749 1.02008 187.036 591.896 1345.550 14690.0
kindergarten 39373.0 1278.232999 1286.057842 16.56040 540.658 910.684 1577.810 18008.9
fuel 39373.0 1188.197709 1020.230043 18.96350 485.988 924.042 1359.870 16737.8
atm 39373.0 881.009394 1094.209644 1.21013 202.523 540.397 1106.740 27183.1
school 39373.0 880.173768 892.018482 2.45281 309.925 560.521 1104.800 19217.4
place_of_worship 39373.0 977.524951 700.478951 10.97100 512.877 749.948 1243.370 10177.2
bank 39373.0 1412.203324 1828.883744 2.60237 366.094 857.935 1911.550 35192.3
library 39373.0 2106.069218 2018.212559 6.04991 542.046 1295.550 3156.330 23153.4
cinema 39373.0 3971.808676 3292.823386 58.20230 1030.520 3203.570 6181.470 37308.0
cafe 39373.0 1732.083662 2005.281291 2.21388 279.945 1196.040 2329.300 33420.5
bus_stop 39373.0 271.290598 315.583659 2.91645 105.322 205.364 339.057 12128.2
post_office 39373.0 1048.982426 1103.853554 12.55760 403.886 763.631 1306.700 18341.9
pharmacy 39373.0 772.613112 1037.299844 6.89423 205.596 432.107 940.677 24812.7
university 39373.0 2928.275677 4177.907081 8.42144 734.685 1359.920 3169.820 46649.4
In [23]:
dist.plot.scatter('place_of_worship', 'university')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff5a5a462e8>

Distance PCA

Next I want to see how the distances relate to each other. I will plot a correlation matrix of the distances to see whether there are groups of amenities whose distances move together.

In [24]:
import seaborn as sns
cm = dist.drop(columns='id').corr()
hm = sns.heatmap(
    cm,
    cbar=True,
    annot=True,
    square=False,
    fmt='.2f',
    cmap=sns.cm.rocket_r
)

There are evident clusters of features that go together, like banks, libraries, cinemas and cafes. This can be explained by Szczecin having functional clusters of businesses establishing themselves close to each other. It is also interesting that bus stops and churches have some of the lowest correlations with the other amenities. This can be explained by the ubiquity of those amenities: they are simply everywhere in Szczecin, not concentrated in any particular area.

Moreover, it is somewhat amusing that the distance to a church is the least correlated with the distance to the university...

I will next search for clusters using Principal Component Analysis. I will be following the example from S. Raschka's Python Machine Learning book.

In [25]:
from sklearn.preprocessing import StandardScaler
cols = dist.drop(columns='id').columns
x_dist = dist.drop(columns='id').values
# I standardize features
sc = StandardScaler()
x_dist_std = sc.fit_transform(x_dist)
# I calculate the eigenvalues by hand
cov_mat = np.cov(x_dist_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
print(eigen_vals)
# Example from Raschka, S., Mirjalili, V., "Python Machine Learning", p. 147, 2nd Edition
tot = sum(eigen_vals)
n_vals = len(eigen_vals)+1
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
plt.bar(
    range(1,n_vals),
    var_exp,
    alpha=0.5,
    align='center',
    label='individual explained variance'
)
plt.step(
    range(1,n_vals),
    cum_var_exp,
    alpha=0.8,
    where='mid',
    label='cumulative explained variance'
)
plt.ylabel('Explained variance')
plt.xlabel('Principal component')
plt.legend(loc='best')
[10.53782665  1.63914579  0.9924005   0.73868067  0.60038032  0.47985076
  0.45163674  0.01979887  0.04428182  0.07931634  0.29463112  0.28214146
  0.12409725  0.14462393  0.15912227  0.18680119  0.22569609]
Out[25]:
<matplotlib.legend.Legend at 0x7ff5ad030d68>

It turns out that over 60% of the variance can be explained by a single component. This could be used to construct a single measure of distance to the centre. Including the next two components (whose eigenvalues are close to or above 1) would preserve nearly 80% of the variance.
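
As a side note, such a single "centrality" feature could be extracted with scikit-learn's PCA roughly like this (a sketch only, reusing the standardized x_dist_std from above; I do not use this feature further in the notebook):

# Sketch: compress the distance columns into one principal component
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
centrality = pca.fit_transform(x_dist_std)      # shape: (n_ads, 1)
print(pca.explained_variance_ratio_)            # should be roughly 0.6
# dist['centrality'] = centrality                # optional extra feature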

In the end, though, I will include all the available information to improve the fit as much as possible.

Merging with housing data

In the previous note I ended up with a regression model including housing parameters, which I then extended with a vectorized advertisement representation. This proved useful, decreasing the Mean Squared Error (MSE) on the test set from an initial 0.0429 to 0.0238 and raising $R^2$ from 0.6906 to 0.8280. This was a huge improvement.

In this step I will add the distances information to the model and check if this will bring a further improvement to the model.

I start by pulling in all the relevant data from the SQL database.

In [26]:
# Pull in the data from SQL database
query = '''
    SELECT price, 
        pow, 
        rooms, 
        floor, 
        data_lat, 
        data_lon,
        d.*, 
        details
    FROM filtered as f
    INNER JOIN
    poi.distances as d
    ON f.id=d.id
    ;
'''
df = pd.read_sql_query(query, con)
df['lprice'] = np.log(df['price'])
df['lpow'] = np.log(df['pow'])
df = df.drop(columns=['price', 'pow'])
print(df.shape)
df.head()
(39373, 25)
Out[26]:
rooms floor data_lat data_lon id theatre parking pub restaurant kindergarten ... library cinema cafe bus_stop post_office pharmacy university details lprice lpow
0 2 0 53.384032 14.518157 13612 8446.02 811.0100 5559.87 4078.640 3313.04 ... 5253.22 8448.19 1324.02 133.4190 2757.04 1208.150 4144.97 Piękne z ogródkiem mieszkanie dwupokojowe  o p... 12.421184 3.827336
1 1 0 53.385297 14.463300 31694 12203.50 1146.8100 9105.86 2010.300 1443.41 ... 1954.56 11723.70 4744.07 152.1240 1734.30 1970.070 5089.04 Oferujemy do sprzedaży piękną kawalerkę z ogro... 12.201060 3.475067
2 1 0 53.388789 14.475584 39792 10725.80 62.7438 7612.96 774.926 1922.46 ... 2324.20 10277.00 3463.15 196.9160 1584.14 810.551 3642.52 Do sprzedaży komfortowe mieszkanie na prestiżo... 12.180755 3.555348
3 2 2 53.388831 14.475330 40051 10742.90 37.5584 7631.26 803.760 1921.99 ... 2328.14 10292.00 3492.43 195.1190 1593.82 836.770 3656.69 Na sprzedaż piękne, dwupoziomowe, dwupokojowe ... 12.445090 3.766535
4 2 2 53.384557 14.463620 40456 12259.80 1038.0100 9155.52 1929.120 1310.24 ... 1819.35 11789.50 4706.89 66.7791 1633.13 1947.060 5152.23 Dwupokojowe mieszkanie z balkonem na Warzymica... 12.608199 3.912023

5 rows × 25 columns

Having my data loaded, I next apply the word preprocessor to the advertisements. After that I split the data into training and test sets.

In [27]:
import re
import string
def preprocessor(text):
    # Remove punctuation
    text = re.sub('[' + string.punctuation + ']+', ' ', text)
    # Replace numbers
    text = re.sub(r'\d+', '_liczba ', text)
    # Lowercase and strip spaces
    text = text.lower().strip()
    return text

df['details'] = df['details'].apply(preprocessor)
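
A quick illustrative check of the preprocessor (not part of the original notebook output):

# Illustration only: punctuation removed, numbers replaced, text lowercased
sample = "Piękne mieszkanie, 3 pokoje, 65 m2!"
print(preprocessor(sample))
# -> roughly: 'piękne mieszkanie _liczba pokoje _liczba m_liczba' (modulo extra spaces)
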
In [28]:
# Generate full train and test data
from sklearn.model_selection import train_test_split
# Get column ids
y_col = [df.columns.get_loc('lprice')]
x_col = [df.columns.get_loc(i) for i in ['lpow', 'rooms', 'floor', 'data_lat', 'data_lon']]
text_col = [df.columns.get_loc('details')]
dist_col = [df.columns.get_loc(i) for i in dist.drop(columns='id').columns]

x = df.iloc[:, x_col + dist_col + text_col].values
y = df.iloc[:, y_col].values

x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=.2,
    random_state=12345
)

I then create the word vectorization pipeline, limiting the final number of components to 400.

In [29]:
# Create word vectorization pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from nltk.corpus import stopwords
stop = stopwords.words('polish')

vec_words = Pipeline([
    ('tfidf', TfidfVectorizer(
        use_idf=True,
        norm='l2',
        max_df=1.0,
        smooth_idf=True,
        stop_words=stop
    )),
    ('svd', TruncatedSVD(
        n_components=400
    ))
])

Here I fit the vectorizer to my text data. I end up with 30181 unique words, which are then reduced to 400 features.

In [30]:
# Train word vectorizer and add results to train and test dataset
text_col_id = x_train.shape[1]-1
x_train_word_vect = vec_words.fit_transform(x_train[:, text_col_id])
print('Number of words vectorized: ', vec_words.named_steps['tfidf'].idf_.shape[0])
print('Train vectorized text data shape: ', x_train_word_vect.shape)
x_train = np.hstack((x_train[:, :text_col_id], x_train_word_vect))

x_test_word_vect = vec_words.transform(x_test[:, text_col_id])
print('Test vectorized text data shape: ', x_test_word_vect.shape)
x_test = np.hstack((x_test[:, :text_col_id], x_test_word_vect))
Number of words vectorized:  30181
Train vectorized text data shape:  (31498, 400)
Test vectorized text data shape:  (7875, 400)

Simple OLS

I fit a simple OLS model using only information about the area, floor, coordinates and number of rooms.

In [31]:
# Simple OLS
from sklearn.linear_model import LinearRegression as OLS
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2

ols_simple = OLS()
ols_simple.fit(x_train[:, :5], y_train)

y_pred_test = ols_simple.predict(x_test[:, :5])
y_pred_train = ols_simple.predict(x_train[:, :5])


print('OLS - area, floor, coords, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('OLS - area, floor, coords, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
OLS - area, floor, coords, train MSE: 0.0430 R2: 0.6855
OLS - area, floor, coords, test MSE: 0.0425 R2: 0.6964

OLS with distances

Here I add the extracted distances to the previous model.

Overall the MSE on the test set improved, but only by a meager 6.1%. That is not much compared to the improvement provided by the vectorized ads.

In [32]:
# OLS - with distances
ols_dist = OLS()
ols_dist.fit(x_train[:, :text_col_id], y_train)

y_pred_test = ols_dist.predict(x_test[:, :text_col_id])
y_pred_train = ols_dist.predict(x_train[:, :text_col_id])


print('OLS - with distances, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('OLS - with distances, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
OLS - with distances, train MSE: 0.0403 R2: 0.7051
OLS - with distances, test MSE: 0.0398 R2: 0.7152

OLS with distances and ads - full model

In the end I included all variables in the OLS model. The MSE and $R^2$ on the test set improved from 0.0238 and 0.8280 to 0.0216 and 0.8387 respectively. It seems small, but overall the MSE improved by over 9%. This is not negligible, yet I expected a bigger improvement. It looks like including latitude and longitude in the model is enough to capture most of the information.

A possible explanation can be found in the shape of Szczecin, which is divided by the Odra river. This separation can be approximated by a plane in lat-lon coordinates.

In [33]:
# All variables
ols_full = OLS()
ols_full.fit(x_train, y_train)

y_pred_test = ols_full.predict(x_test)
y_pred_train = ols_full.predict(x_train)


print('OLS - all variables, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('OLS - all variables, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
OLS - all variables, train MSE: 0.0225 R2: 0.8358
OLS - all variables, test MSE: 0.0221 R2: 0.8422

Random Forest Regression

To further explore my model I tried fitting a random forest estimator. Unfortunately, the estimator did worse than OLS. I did not tune any hyperparameters (note the very shallow max_depth of 2), so it should be possible to obtain a better fit by tuning them thoroughly (see the sketch after the results below).

In [34]:
# Regression with RandomForest regressor

from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(max_depth=2, random_state=0, n_estimators=100)
forest.fit(x_train, y_train)

y_pred_test = forest.predict(x_test)
y_pred_train = forest.predict(x_train)


print('Random Forest - all variables, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('Random Forest - all variables, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
/home/mateusz/.venv/ml36/lib/python3.6/site-packages/ipykernel_launcher.py:6: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  
Random Forest - all variables, train MSE: 0.0515 R2: 0.6232
Random Forest - all variables, test MSE: 0.0517 R2: 0.6301
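
A possible (untested here) hyperparameter search over the forest could look roughly like this; the parameter grid below is just an illustration, not a tuned choice:

# Sketch only: randomized search over a few RandomForestRegressor hyperparameters
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 200, 400],
    'max_depth': [4, 8, 16, None],
    'max_features': ['sqrt', 0.3, 0.5],
    'min_samples_leaf': [1, 5, 20],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=20,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=0
)
# .ravel() also avoids the column-vector warning seen above
# search.fit(x_train, y_train.ravel())
# print(search.best_params_, -search.best_score_)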

Neural net

Finally, I wanted to try fitting my model with a neural network approach. The examples I encountered in Python Machine Learning and Hands-On Machine Learning with Scikit-Learn and TensorFlow dealt only with using neural networks for classification. After a short search I found an example of using a NN for regression modelling. It used Keras and was quite simple to implement.

I did add some modifications that came to mind after reading Hands-On Machine Learning with Scikit-Learn and TensorFlow. First I used a StandardScaler to standardize the input data. Next I changed the size of each layer to account for the larger number of inputs. I used the He normal initializer to obtain more efficient random weights at initialization, the ELU activation function to reduce the vanishing gradient problem, and batch normalization on each layer to avoid exploding gradients.

I trained the model for 200 epochs, saving the best weights whenever the validation loss improved.

The final best weights fare much better than OLS. However, comparing MSE and $R^2$ between the training and test data shows considerable differences, suggesting that the model is overfitting the training data. This calls for further tuning of the network. Adding a dropout layer would be a good idea, but I have not implemented it yet for the sake of speed (a sketch is shown below).
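
For completeness, a dropout layer could be slotted between the hidden layers roughly like this (a sketch only; the model trained below does not include it):

# Sketch only: dropout between hidden blocks to reduce overfitting
from keras.layers import Dropout

# ...inside the model definition, after a hidden block:
# NN_model.add(Dense(256, kernel_initializer='he_normal'))
# NN_model.add(BatchNormalization())
# NN_model.add(Activation('elu'))
# NN_model.add(Dropout(0.3))   # randomly drop 30% of activations during training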

The MSE for the test dataset has fallen to 0.0099, that is 54.2% less than for the full OLS. This result is impressive despite the simplicity of the NN model, which still needs a lot of tuning.

In [35]:
# NN
# https://towardsdatascience.com/deep-neural-networks-for-regression-problems-81321897ca33
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_train_std = sc.fit_transform(x_train)
x_test_std = sc.transform(x_test)

NN_model = Sequential()

# The Input Layer :
NN_model.add(Dense(
    256, 
    kernel_initializer='he_normal',
    input_dim=x_train_std.shape[1], 
))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))

# The Hidden Layers :
NN_model.add(Dense(256, kernel_initializer='he_normal'))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))

#NN_model.add(Dropout(0.5))
NN_model.add(Dense(256, kernel_initializer='he_normal'))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))

NN_model.add(Dense(128, kernel_initializer='he_normal'))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))

# The Output Layer :
NN_model.add(Dense(1, kernel_initializer='he_normal',activation='linear'))

# Compile the network :
NN_model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_absolute_error'])
NN_model.summary()
Using TensorFlow backend.
/home/mateusz/.venv/ml36/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype object was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
/home/mateusz/.venv/ml36/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype object was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
/home/mateusz/.venv/ml36/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype object was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 256)               108288    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               65792     
_________________________________________________________________
batch_normalization_2 (Batch (None, 256)               1024      
_________________________________________________________________
activation_2 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 256)               65792     
_________________________________________________________________
batch_normalization_3 (Batch (None, 256)               1024      
_________________________________________________________________
activation_3 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               32896     
_________________________________________________________________
batch_normalization_4 (Batch (None, 128)               512       
_________________________________________________________________
activation_4 (Activation)    (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 129       
=================================================================
Total params: 276,481
Trainable params: 274,689
Non-trainable params: 1,792
_________________________________________________________________
In [36]:
checkpoint_name = './weights/Weights-{epoch:03d}--{val_loss:.5f}.hdf5' 
checkpoint = ModelCheckpoint(
    checkpoint_name, 
    monitor='val_loss', 
    verbose = 1, 
    save_best_only = True, 
    mode ='auto'
    )
callbacks_list = [checkpoint]
In [37]:
hist = NN_model.fit(
    x_train_std, 
    y_train, 
    epochs=200, 
    batch_size=64, 
    validation_split = 0.2, 
    callbacks=callbacks_list, 
    verbose=2
)
Train on 25198 samples, validate on 6300 samples
Epoch 1/200
 - 16s - loss: 40.7007 - mean_absolute_error: 4.8475 - val_loss: 1.0702 - val_mean_absolute_error: 0.8036

Epoch 00001: val_loss improved from inf to 1.07023, saving model to ./weights/Weights-001--1.07023.hdf5
Epoch 2/200
 - 11s - loss: 0.2873 - mean_absolute_error: 0.4188 - val_loss: 0.1766 - val_mean_absolute_error: 0.3199

Epoch 00002: val_loss improved from 1.07023 to 0.17655, saving model to ./weights/Weights-002--0.17655.hdf5
...
Epoch 198/200
 - 9s - loss: 0.0042 - mean_absolute_error: 0.0517 - val_loss: 0.0105 - val_mean_absolute_error: 0.0632

Epoch 00198: val_loss improved from 0.01085 to 0.01055, saving model to ./weights/Weights-198--0.01055.hdf5
Epoch 199/200
 - 9s - loss: 0.0041 - mean_absolute_error: 0.0504 - val_loss: 0.0107 - val_mean_absolute_error: 0.0635

Epoch 00199: val_loss did not improve from 0.01055
Epoch 200/200
 - 10s - loss: 0.0041 - mean_absolute_error: 0.0502 - val_loss: 0.0118 - val_mean_absolute_error: 0.0717

Epoch 00200: val_loss did not improve from 0.01055
In [39]:
# Load the weights file of the best model:
weights_file = './weights/Weights-198--0.01055.hdf5' # choose the best checkpoint 
NN_model.load_weights(weights_file) # load it
NN_model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_absolute_error'])
In [40]:
y_pred_test = NN_model.predict(x_test_std, batch_size=64)
y_pred_train = NN_model.predict(x_train_std, batch_size=64)

print('Neural network, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('Neural network, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
Neural network, train MSE: 0.0030 R2: 0.9780
Neural network, test MSE: 0.0099 R2: 0.9289

Looking at the training and validation loss over time, the overfitting becomes clearly visible. There is little improvement in the validation set MSE after epoch 50.

In [41]:
epoch = range(len(hist.history['loss'][1:]))
plt.plot(epoch, hist.history['val_loss'][1:], label='Validation set MSE')
plt.plot(epoch, hist.history['loss'][1:], label='Training set MSE')
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.legend(loc='best', prop={'size': 20})
Out[41]:
<matplotlib.legend.Legend at 0x7ff538226d30>

Also, looking at the prediction errors for the test set, we can notice some visible patterns in the errors. This is a clear indication of overfitting and suggests that we should use dropout in the network in the future.

In [42]:
# Analyze errors
plt.scatter(
    y_pred_test,
    y_pred_test - y_test,
    c='steelblue',
    marker='o',
    edgecolor='white',
    alpha=.4
)
plt.hlines(0.0, 11., 14., color='black', lw=2, linestyle='--')
Out[42]:
<matplotlib.collections.LineCollection at 0x7ff5381f30f0>

Looking at the geographical distribution of prediction errors, the clear spatial pattern of errors visible in the OLS results from the previous entry seems to be gone. There are still some outliers visible.

In [43]:
# Plot geographical distribution of errors
err =  y_pred_train - y_train
lat_train = x_train_std[:, 3]
lon_train = x_train_std[:, 4]

plt.figure(figsize=(12*1.1,12))
plt.hexbin(lon_train, lat_train, err, gridsize=40, cmap='bwr', vmin=-0.5, vmax=0.5, reduce_C_function=np.mean)
plt.colorbar()
Out[43]:
<matplotlib.colorbar.Colorbar at 0x7ff53810d358>

In the end I save the elements of the model for possible future use with the joblib module.

In [44]:
import joblib
joblib.dump(
    {
        'word_pipe': vec_words,
        'nn_scaler': sc,
        'nn_model': NN_model
    },
    'nn_szczecin_model.pkl'
)
Out[44]:
['nn_szczecin_model.pkl']
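
For later reuse, the saved bundle could be loaded and applied roughly as follows (a sketch; new_rows is a hypothetical array laid out like x_train before word vectorization, and unpickling the Keras model this way assumes compatible Keras/TensorFlow versions):

# Sketch only: reload the pickled pipeline and predict prices for new ads
bundle = joblib.load('nn_szczecin_model.pkl')
word_pipe, nn_scaler, nn_model = bundle['word_pipe'], bundle['nn_scaler'], bundle['nn_model']

# new_rows: [lpow, rooms, floor, data_lat, data_lon, 17 distances, raw ad text]
# new_text = word_pipe.transform(new_rows[:, -1])
# new_x = np.hstack((new_rows[:, :-1], new_text))
# log_price = nn_model.predict(nn_scaler.transform(new_x), batch_size=64)
# price_pln = np.exp(log_price)    # back from log-price to price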

Out of curiosity I want to inspect the input layer of my model. I will plot the weight matrix for roughly the first 200 input rows (the first 22 of which are the housing and distance variables, the rest word features) as an image.

Below I can see that some input neurons have strong weights on the coordinate variables, while most of the distance variables received low weights.

In [69]:
weights, bias = NN_model.layers[0].get_weights()
plt.imshow(weights[:201, :], cmap='seismic', clim=(-3, 3))
plt.colorbar()
plt.xlabel('2nd layer')
plt.ylabel('1st layer (variable input)')
Out[69]:
Text(0, 0.5, '1st layer (variable input)')

Conclusions

In the end, the benefit of adding more geographical information was small. There is some improvement, but its scale makes further pursuit of this direction questionable. What does improve the prediction error is training a neural network on this data. Still, a simple out-of-the-box model is not enough, because such a simple model is prone to overfitting the training dataset. This will require model tuning in the future.

It would be perfect to use the pickled model to estimate prices from real data. What would be great is a web service that, given a link to an ad, would extract all the information and estimate the price based on it. It's doable, but maybe next time... ;)

There are some possible enhancements to the model:

  1. Use road / travel time distances instead of simple Euclidean ones (see the sketch after this list);
  2. Exclude observations located in uninhabited areas;
  3. Add Polish word stemmer for ad content vectorization;
  4. Fine-tune the neural network model;
  5. Add time dependence.
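
For point 1, pgRouting could provide network distances or travel times instead of the Euclidean ones. A very rough sketch, assuming osm2pgrouting-style tables ways (with source, target and a cost column) and ways_vertices_pgr, which may well differ from the routing table actually present in this database; the two coordinates are purely illustrative:

# Hypothetical sketch: routed cost between one ad location and one amenity
query = """
WITH start_v AS (   -- nearest road-network vertex to the ad
    SELECT id FROM ways_vertices_pgr
    ORDER BY the_geom <-> ST_SetSRID(ST_MakePoint(14.55, 53.43), 4326) LIMIT 1
), end_v AS (       -- nearest road-network vertex to the amenity
    SELECT id FROM ways_vertices_pgr
    ORDER BY the_geom <-> ST_SetSRID(ST_MakePoint(14.50, 53.44), 4326) LIMIT 1
)
SELECT SUM(cost) AS route_cost
FROM pgr_dijkstra(
    'SELECT gid AS id, source, target, length_m AS cost FROM ways',
    (SELECT id FROM start_v), (SELECT id FROM end_v), false
);
"""
# pd.read_sql(query, con)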