Housing data analysis using Machine Learning - adding geo data

I started my analysis in part 1 with simple data exploration and the most basic OLS regression. Then in part 2 I proceeded to include more advanced Machine Learning methods from the scikit-learn library.

In today's part I will move on to augmenting my data with geographical characteristics. For that I will explore the OpenStreetMap data for Szczecin and merge it with the downloaded housing ads. I want to see whether adding information about proximity to schools, shops, entertainment and other amenities has any influence on housing prices.

Most of the tools used in this chapter I learned from two books. I already knew some basic SQL syntax, but by studying PostgreSQL: Up and Running I familiarized myself with some peculiarities of PostgreSQL. I have tried MySQL, SQLite and MongoDB, but somehow I grew a liking for PostgreSQL. It might simply be that this was the first time I learned a relational database systematically from a book instead of web tutorials. Also, whenever I forget a particular topic it is relatively easy for me to find it again in a book.

I have also used a lot of GIS recently at work, mostly with QGIS and ArcGIS. I quickly got interested in easy routing calculations using pgRouting, and to go deeper into it I naturally needed to learn PostGIS, which I studied using PostGIS in Action. It was a lifesaver! The book is brilliantly written, and I enjoyed going through it, learning more both about PostgreSQL and about geo-information processing in general. I will implement some of the ideas from the book further in this notebook.

In the end I will compare different regression models. Finally, I will construct a simple neural network regression model and compare its results with OLS.

In [8]:
import numpy as np
import psycopg2 as psql
import pandas as pd
import geopandas as gpd

# Matplotlib options
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)

# Helper function to display PostGIS geometries
def disp_poly(sql, con, geom="geom"):
    """
    Load PostGIS geometries returned by an SQL query into a GeoDataFrame (for plotting).
    """
    df = gpd.GeoDataFrame.from_postgis(sql, con, geom_col=geom)
    if 'id' not in df.columns:
        df['id'] = df.index
    return df

Creating distance data

After loading the basic libraries I will use the IPython SQL magic to connect to PostgreSQL, and I will also create a psycopg2 connection. Additionally, I use the pgspecial Python module to enable execution of psql commands.

In [9]:
%load_ext sql
%sql postgresql://mateusz@localhost/szczecin_data
con = psql.connect(database="szczecin_data", user="mateusz", host="localhost")
The sql extension is already loaded. To reload it, use:
  %reload_ext sql

Before running this exercise I downloaded the OpenStreetMap data for the Zachodniopomorskie region from Geofabrik.

Next I used osm2pgsql to load the OSM data into PostgreSQL with PostGIS. The resulting tables can be seen as the planet_osm_* tables, with points, lines, roads and polygons stored separately.

In [10]:
%sql \dt
 * postgresql://mateusz@localhost/szczecin_data
8 rows affected.
Out[10]:
Schema Name Type Owner
public adds table mateusz
public dates table mateusz
public planet_osm_line table mateusz
public planet_osm_point table mateusz
public planet_osm_polygon table mateusz
public planet_osm_roads table mateusz
public routing table mateusz
public spatial_ref_sys table mateusz

Remembering the previous exercise, I wanted to filter out some of the observations. I created a view that includes only the filtered ads, so that I don't need to repeat the filtering query every time.

DROP VIEW IF EXISTS filtered;
CREATE VIEW filtered AS
SELECT id, price, pow, rooms, floor, data_lat, data_lon, details, description, geom
FROM adds
WHERE
    pow < 200 AND
    price < 1e6 AND
    data_lat > 53.01 AND
    data_lat < 53.6 AND
    data_lon > 14.01 AND
    data_lon < 15 AND
    details != '';

Plotting maps in Python

I can now plot the selected ads according to their geographic coordinates. To achieve this I use the wrapper function around the GeoPandas module defined at the beginning.

In [11]:
query = 'SELECT id, pow, price, geom FROM filtered;'
df = disp_poly(query, con)
print(df.shape)
df.head()
(39373, 4)
Out[11]:
id pow price geom
0 13612 45.94 248000.0 POINT (14.51815711691893 53.38403163098405)
1 14847 90.00 309000.0 POINT (14.47106095 53.38718211)
2 15814 98.60 530000.0 POINT (14.4633004 53.3852974)
3 17486 53.00 289000.0 POINT (14.4755835 53.3887893)
4 17833 100.00 360000.0 POINT (14.4633004 53.3852974)
In [12]:
df.plot(column='price')
Out[12]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff5a63d6c88>

I will next add a basemap in the background, following the GeoPandas documentation.

In [13]:
# Add Helper function for plotting
import contextily as ctx

def add_basemap(ax, zoom, url='http://tile.stamen.com/terrain/tileZ/tileX/tileY.png'):
    xmin, xmax, ymin, ymax = ax.axis()
    basemap, extent = ctx.bounds2img(xmin, ymin, xmax, ymax, zoom=zoom, url=url)
    ax.imshow(basemap, extent=extent, interpolation='bilinear')
    # restore original x/y limits
    ax.axis((xmin, xmax, ymin, ymax))

# Reproject geo data from WGS84 to Web Mercator
df = df.to_crs(epsg=3857)
# Plot the map with data
ax = df.plot(column='price', alpha=0.1)
minx, miny, maxx, maxy = df.total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
add_basemap(ax, zoom=10, url=getattr(ctx.sources, 'OSM_A'))
ax.set_axis_off()

That's nice, but there's a lot going on in the centre. I want to zoom in on a region I defined in a GeoJSON file.

In [14]:
# Read centre
centre = gpd.read_file('/home/mateusz/projects/gis/szczecin_geo_analysis/centre.geojson')
ax = df.plot(column='price', alpha=0.1)
minx, miny, maxx, maxy = centre.to_crs(epsg=3857).total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
add_basemap(ax, zoom=12, url=getattr(ctx.sources, 'OSM_A'))
ax.set_axis_off()

From these simple maps we can already see that some of the locations are miscoded. There are also many locations with exactly the same coordinates. Apparently many locations were provided based on some POI, or were only approximated by the neighborhood, ending up with the same centre.

Also, a couple of observations appear to be located on water or in a cemetery. It would be advisable to remove them in a future analysis.

Below is a simple query counting ads that share the same location.

In [15]:
%sql SELECT COUNT (*) as sum FROM filtered group by geom order by sum desc limit 10;
 * postgresql://mateusz@localhost/szczecin_data
10 rows affected.
Out[15]:
sum
3153
1569
1395
1023
985
912
854
819
687
388

Before continuing with the geographical analyses in PostGIS I would like to add a spatial index to speed up my queries.

It is done simply in SQL:

DROP INDEX IF EXISTS adds_gix;
CREATE INDEX adds_gix ON adds USING GIST (geom);
VACUUM ANALYZE adds;

Selecting amenities from OSM

I will be using the amenity key in OSM to first count the interesting amenities. One thing I noticed is that the same amenity can appear as a point, as a polygon, or as both. What I want to do next is to see how many amenities are defined as each of those geometries.

In constructing my query I want to include amenities only within the extent of the filtered ads (1). Then I count the total number of amenities in the polygon and point tables (2). Finally I join the counts by amenity type for polygons and points (3).

The results show that many amenities are unique to a single geometry type (having NaN counts in the sum column), but there are still many amenities that can have both types. This means that I will have to handle cases where the same amenity is defined both as a polygon and as a point, and make sure that no double counting occurs.

To be frank, this should not be a problem when the only thing I'm interested in is the minimal distance to the closest amenity. Double counting aside, it will still be the closest amenity to a house. Dropping duplicates will just speed up my query.

In [16]:
# Count amenities
query = """
WITH ext as (
    -- 1. Select only amenities within the adds extent.
    SELECT ST_Transform(
            ST_SetSRID(ST_Extent(geom), 4326), 
            3857
        ) as ext
    FROM filtered
)
SELECT 
    pt.amenity, 
    poly.amenity,
    poly.sum as poly_sum,
    pt.sum as point_sum
FROM
(
    -- 2. Select amenities within the extent and count them by amenity type.
    SELECT amenity, count (*) as sum
    FROM planet_osm_polygon, ext
    WHERE amenity is not NULL
    AND way && ext.ext
    GROUP BY amenity
) as poly
-- 3. Join points and polygon amenities counts by amenity type.
FULL OUTER JOIN
(
    SELECT amenity, count (*) as sum
    FROM planet_osm_point, ext
    WHERE amenity is not NULL
    AND way && ext.ext
    GROUP BY amenity
) as pt
ON (pt.amenity=poly.amenity)
order by poly_sum desc
;
"""
amenities = pd.read_sql(query, con)
amenities
Out[16]:
amenity amenity poly_sum point_sum
0 nightclub None NaN 22.0
1 animal_boarding None NaN 1.0
2 animal_breeding None NaN 1.0
3 weighbridge None NaN 5.0
4 water_point None NaN 2.0
5 atm None NaN 162.0
6 baby_hatch None NaN 1.0
7 data_center None NaN 1.0
8 dentist None NaN 56.0
9 dive_centre None NaN 3.0
10 dojo None NaN 2.0
11 drinking_water None NaN 12.0
12 driving_school None NaN 4.0
13 embassy None NaN 5.0
14 ferry_terminal None NaN 2.0
15 food_court None NaN 2.0
16 gym None NaN 1.0
17 accounting_office None NaN 1.0
18 hunting_stand None NaN 24.0
19 parking_entrance None NaN 106.0
20 jobcentre None NaN 1.0
21 language_school None NaN 3.0
22 motorcycle_parking None NaN 1.0
23 waste_basket None NaN 59.0
24 bbq None NaN 11.0
25 vending_machine None NaN 213.0
26 bicycle_rental None NaN 87.0
27 trade_school None NaN 1.0
28 boat_rental None NaN 2.0
29 bureau_de_change None NaN 32.0
... ... ... ... ...
65 pharmacy pharmacy 6.0 165.0
66 clinic clinic 5.0 17.0
67 grave_yard grave_yard 5.0 1.0
68 bank bank 5.0 131.0
69 toilets toilets 5.0 48.0
70 theatre theatre 5.0 4.0
71 townhall townhall 4.0 2.0
72 post_office post_office 4.0 56.0
73 pub pub 4.0 59.0
74 cafe cafe 4.0 88.0
75 childcare childcare 3.0 3.0
76 nursing_home nursing_home 3.0 1.0
77 courthouse courthouse 3.0 4.0
78 community_centre community_centre 3.0 10.0
79 veterinary veterinary 3.0 22.0
80 bar bar 3.0 28.0
81 arts_centre arts_centre 3.0 6.0
82 None animal_shelter 2.0 NaN
83 None public_garden 1.0 NaN
84 None yes 1.0 NaN
85 biergarten biergarten 1.0 1.0
86 None proposed 1.0 NaN
87 vehicle_inspection vehicle_inspection 1.0 6.0
88 None prison 1.0 NaN
89 None monastery 1.0 NaN
90 ice_cream ice_cream 1.0 15.0
91 None car_service 1.0 NaN
92 None casino 1.0 NaN
93 clock clock 1.0 6.0
94 bus_station bus_station 1.0 1.0

95 rows × 4 columns

Just to see if this type of amenity selection works, I will load schools both as points and as polygons. First I select all polygons tagged as schools. Next I UNION them with the points that do not intersect (are outside) the already selected polygons. To achieve this I LEFT JOIN the points with intersecting polygons and keep only those points that did not join; these can be identified by a NULL polygon identifier.

In [17]:
# Load schools
query = """
WITH polys AS (
    -- select 'school' polygons as CTE
    SELECT osm_id, way, name, tags
    FROM planet_osm_polygon
    WHERE amenity = 'school'
)
SELECT osm_id, ST_Centroid(way) as geom, name, tags
FROM 
(
    SELECT * FROM polys
    UNION ALL
    (
        SELECT pt.osm_id as osm_id, pt.way, pt.name, pt.tags
        -- Select points
        FROM planet_osm_point as pt
        -- Join them with polygons with which they intersect
        LEFT JOIN polys as pol
        on ST_Intersects(pt.way, pol.way)
        -- Leave only schools (points) which did not intersect with polygons
        WHERE pt.amenity = 'school' and pol.osm_id is NULL
    )
) as foo;
"""
schools = gpd.GeoDataFrame.from_postgis(query, con,  geom_col='geom')
In [18]:
ax = schools.to_crs(epsg=3857).plot(c='green')
minx, miny, maxx, maxy = centre.to_crs(epsg=3857).total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
add_basemap(ax, zoom=12, url=getattr(ctx.sources, 'OSM_A'))
ax.set_axis_off()

The previous example looks promising. I will next summarize amenities within the ads' extent in the same way, selecting only unique points (centroids for polygons). I will be using the PostGIS && operator for bounding-box intersection.

In [19]:
# Load all amenities
query = """
-- Define adds extent
WITH ext as (
    SELECT ST_Transform(
            ST_SetSRID(ST_Extent(geom), 4326), 
            3857
        ) as ext
    FROM filtered
),
-- Define polygons intersecting with extent
polys AS (
    SELECT osm_id, amenity, way, name, tags
    FROM planet_osm_polygon, ext
    WHERE amenity is not NULL
        AND way && ext.ext
)
SELECT osm_id, amenity, name, tags, ST_Centroid(way) as geom
FROM 
(   
    -- Union polygons with points NOT within those polygons
    SELECT * FROM polys
    UNION ALL
    (
        SELECT pt.osm_id, pt.amenity, pt.way, pt.name, pt.tags
        FROM ext, planet_osm_point as pt
        LEFT JOIN polys as pol
        on ST_Intersects(pt.way, pol.way)
            AND pt.amenity = pol.amenity
        WHERE 
            pt.amenity is not null 
            and pol.osm_id is NULL
            AND pt.way && ext.ext
    )
) as foo;
"""

amenities = pd.read_sql(query, con)
amenities['amenity'].value_counts()
Out[19]:
parking               2419
bench                  452
shelter                256
recycling              234
restaurant             231
fast_food              230
vending_machine        213
waste_disposal         190
pharmacy               171
atm                    162
bicycle_parking        148
school                 145
bank                   136
place_of_worship       131
doctors                108
fuel                   107
parking_entrance       106
cafe                    92
post_box                90
bicycle_rental          87
kindergarten            76
car_wash                75
pub                     63
parking_space           62
post_office             60
waste_basket            59
dentist                 56
toilets                 53
telephone               46
university              45
                      ... 
car_rental               3
dive_centre              3
language_school          3
boat_rental              2
shower                   2
animal_shelter           2
studio                   2
biergarten               2
water_point              2
bus_station              2
food_court               2
ferry_terminal           2
dojo                     2
baby_hatch               1
casino                   1
data_center              1
proposed                 1
accounting_office        1
public_garden            1
gym                      1
trade_school             1
motorcycle_parking       1
yes                      1
car_service              1
jobcentre                1
animal_breeding          1
public_bookcase          1
monastery                1
animal_boarding          1
prison                   1
Name: amenity, Length: 95, dtype: int64

Next I will build a table of amenities for my analysis. I arbitrarily chose the following:

  • schools;
  • kindergartens;
  • university;
  • theatre;
  • restaurant;
  • library;
  • pub;
  • post_office;
  • place_of_worship;
  • parking;
  • pharmacy;
  • fuel;
  • cafe;
  • cinema;
  • bank;
  • atm;

Also I will add bus stops using the highway key.

I will create a new poi schema and populate it with amenities, using only the ads' extent for the selection. At the end I add a primary key and a spatial index for faster geometry lookups.

The final SQL query to select amenities is the following:

---- Build amenities table in the poi schema
DROP schema if exists poi cascade;
CREATE SCHEMA poi;
CREATE TABLE poi.amenities AS
WITH ext as (
    SELECT ST_Transform(
        ST_SetSRID(ST_Extent(geom), 4326),
        3857
    ) as ext
    FROM filtered
),
-- Define polygons intersecting with extent
polys AS (
    SELECT osm_id, amenity, way, name, tags
    FROM planet_osm_polygon, ext
    WHERE amenity is not NULL
    AND way && ext.ext
)
SELECT osm_id, amenity, name, tags, ST_Centroid(way) as geom
FROM
(
    -- Union polygons with points NOT within those polygons
    SELECT * FROM polys
    UNION ALL
    (
        SELECT pt.osm_id, pt.amenity, pt.way, pt.name, pt.tags
        FROM ext, planet_osm_point as pt
        LEFT JOIN polys as pol
        ON
        ST_Intersects(pt.way, pol.way)
        AND pt.amenity = pol.amenity
        WHERE
        pt.amenity is not null
        and pol.osm_id is NULL
        AND pt.way && ext.ext
    )
) as foo
WHERE
    amenity IN (
        'school',
        'kindergarten',
        'university',
        'theatre',
        'restaurant',
        'library',
        'pub',
        'post_office',
        'place_of_worship',
        'parking',
        'pharmacy',
        'fuel',
        'cafe',
        'cinema',
        'bank',
        'atm'
    )
;
-- Add bus stops
WITH ext as (
    SELECT ST_Transform(
        ST_SetSRID(ST_Extent(geom), 4326),
        3857
    ) as ext
    FROM filtered
)
INSERT INTO poi.amenities
SELECT osm_id, highway as amenity, name, tags, way as geom
FROM planet_osm_point as p, ext
WHERE
highway = 'bus_stop'
AND way && ext.ext;
--- Add id key
ALTER TABLE poi.amenities ADD COLUMN aid serial PRIMARY KEY;
--- Add spatial index
DROP INDEX IF EXISTS idx_poi_amenities_geom cascade;
CREATE INDEX idx_poi_amenities_geom
ON poi.amenities USING GIST (geom);
VACUUM ANALYZE poi.amenities;

Calculating distances

Having my amenities moved into a separate table, I will now calculate, for each advertisement, the minimal distance to the nearest amenity of each type in the poi.amenities table.

To do this I will use a PL/pgSQL loop over all amenity types. In the following query I create a poi.distances table to hold all distances for each ad. I want a table in which each row is an ad and each column is the distance to one amenity type, i.e. a nicely wide-formatted table. A similar query can be constructed without resorting to PL/pgSQL, but it returns a long-formatted table, which I want to avoid.

To calculate the minimal distances I will be using LATERAL CROSS JOINs together with the PostGIS <-> distance operator. Inside the lateral subquery I order by the distance to the current geometry and keep only the first row. The drawback of this approach is that it effectively cross joins each ad with each amenity, producing a huge intermediate result whose size is the product of the two tables' row counts. This could be mitigated by selecting only amenities within a specified distance before the join, but in my case I do not know the minimal distances in advance. The long-format alternative, without PL/pgSQL, is sketched below.
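
For reference, here is a rough sketch of that long-format alternative using a plain LATERAL query (not executed in this notebook); the wide table could then be rebuilt with a pandas pivot. Table and column names follow the objects defined above.

# Sketch only: long-format nearest-amenity distances without PL/pgSQL.
# One row per (ad, amenity type) pair instead of one wide row per ad.
query_long = """
SELECT a.id, t.amenity, r.dist
FROM filtered AS a
CROSS JOIN (SELECT DISTINCT amenity FROM poi.amenities) AS t
CROSS JOIN LATERAL (
    SELECT ST_Distance(ST_Transform(a.geom, 3857), poi.geom) AS dist
    FROM poi.amenities AS poi
    WHERE poi.amenity = t.amenity
    ORDER BY ST_Transform(a.geom, 3857) <-> poi.geom
    LIMIT 1
) AS r;
"""
# dist_long = pd.read_sql(query_long, con)
# Reshape to the wide format in pandas:
# dist_wide = dist_long.pivot(index='id', columns='amenity', values='dist')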

-- Create wide table using PL/pgSQL
DO $$
DECLARE
    am record;
BEGIN
    -- Create distances table
    DROP TABLE IF EXISTS poi.distances;
    CREATE TABLE poi.distances AS SELECT id FROM filtered ORDER BY id;
    ALTER TABLE poi.distances ADD PRIMARY KEY (id);

    -- Loop over distinct amenity types
    FOR am IN SELECT DISTINCT amenity FROM poi.amenities
    LOOP
        -- Add distance column to the table
        EXECUTE 'ALTER TABLE poi.distances
            ADD COLUMN '|| am.amenity ||' real';
        -- Calculate minimal distances
        EXECUTE 'UPDATE poi.distances
            SET '|| am.amenity ||' = res.dist
            FROM
            (
                SELECT a.id,
                ST_Distance(ST_Transform(a.geom, 3857), r.geom) as dist
                FROM filtered as a
                CROSS JOIN LATERAL
                (
                    SELECT amenity, geom
                    FROM
                    poi.amenities as poi
                    WHERE amenity = '''|| am.amenity ||'''
                    ORDER BY ST_Transform(a.geom, 3857) <-> poi.geom
                    LIMIT 1
                ) as r
            ) as res
            WHERE
            poi.distances.id = res.id
            ';
        RAISE NOTICE 'Added amenity: %', am.amenity;
    END LOOP;
END
$$;

After executing the query I load the distances table to check whether the results make sense. Seeing that on average houses have a rather short distance to parking lots and bus stops seems reasonable. Seeing that houses are on average quite far from a theatre or a cinema is also a good sign, because there are not many of them in Szczecin.

In [20]:
dist = pd.read_sql('SELECT * from poi.distances;', con)
In [21]:
ax = dist.drop(columns='id').boxplot(return_type='axes', rot=45)
ax.set_ylabel("Distance [m]")
ax.set_yscale('log')
In [22]:
dist.drop(columns='id').describe().T
Out[22]:
count mean std min 25% 50% 75% max
theatre 39373.0 4241.286157 4279.667634 15.59080 769.681 2775.170 6302.340 36483.4
parking 39373.0 319.178194 451.610156 2.42636 102.473 164.301 357.678 14625.8
pub 39373.0 1925.989122 2211.144687 5.27204 330.650 1129.640 3124.790 28508.3
restaurant 39373.0 898.060149 965.736749 1.02008 187.036 591.896 1345.550 14690.0
kindergarten 39373.0 1278.232999 1286.057842 16.56040 540.658 910.684 1577.810 18008.9
fuel 39373.0 1188.197709 1020.230043 18.96350 485.988 924.042 1359.870 16737.8
atm 39373.0 881.009394 1094.209644 1.21013 202.523 540.397 1106.740 27183.1
school 39373.0 880.173768 892.018482 2.45281 309.925 560.521 1104.800 19217.4
place_of_worship 39373.0 977.524951 700.478951 10.97100 512.877 749.948 1243.370 10177.2
bank 39373.0 1412.203324 1828.883744 2.60237 366.094 857.935 1911.550 35192.3
library 39373.0 2106.069218 2018.212559 6.04991 542.046 1295.550 3156.330 23153.4
cinema 39373.0 3971.808676 3292.823386 58.20230 1030.520 3203.570 6181.470 37308.0
cafe 39373.0 1732.083662 2005.281291 2.21388 279.945 1196.040 2329.300 33420.5
bus_stop 39373.0 271.290598 315.583659 2.91645 105.322 205.364 339.057 12128.2
post_office 39373.0 1048.982426 1103.853554 12.55760 403.886 763.631 1306.700 18341.9
pharmacy 39373.0 772.613112 1037.299844 6.89423 205.596 432.107 940.677 24812.7
university 39373.0 2928.275677 4177.907081 8.42144 734.685 1359.920 3169.820 46649.4
In [23]:
dist.plot.scatter('place_of_worship', 'university')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7ff5a5a462e8>

Distance PCA

Next I want to see how the distances relate to each other. I will plot a correlation matrix of the distances to see whether there are groups of amenities whose distances move together.

In [24]:
import seaborn as sns
cm = dist.drop(columns='id').corr()
hm = sns.heatmap(
    cm,
    cbar=True,
    annot=True,
    square=False,
    fmt='.2f',
    cmap=sns.cm.rocket_r
)

There are evident clusters of features that go together, like banks, libraries, cinemas and cafes. This can be explained by Szczecin having functional clusters of businesses establishing themselves close to each other. It is also interesting that bus stops and churches have some of the lowest correlations with the other amenities. This can be explained by the ubiquity of those amenities: they are simply everywhere in Szczecin, not concentrated in any particular area.

Moreover, it is somewhat amusing that the distance to a church is the least correlated with the distance to the university...

I will next search for clusters using Principal Component Analysis. I will be following the example from S. Raschka's Python Machine Learning book.

In [25]:
from sklearn.preprocessing import StandardScaler
cols = dist.drop(columns='id').columns
x_dist = dist.drop(columns='id').values
# I standardize features
sc = StandardScaler()
x_dist_std = sc.fit_transform(x_dist)
# I calculate the eigenvalues by hand
cov_mat = np.cov(x_dist_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
print(eigen_vals)
# Example from Raschka, S., Mirjalili, V., "Python Machine Learning", p. 147, 2nd Edition
tot = sum(eigen_vals)
n_vals = len(eigen_vals)+1
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
plt.bar(
    range(1,n_vals),
    var_exp,
    alpha=0.5,
    align='center',
    label='individual explained variance'
)
plt.step(
    range(1,n_vals),
    cum_var_exp,
    alpha=0.8,
    where='mid',
    label='cumulative explained variance'
)
plt.ylabel('Explained variance')
plt.xlabel('Principal component')
plt.legend(loc='best')
[10.53782665  1.63914579  0.9924005   0.73868067  0.60038032  0.47985076
  0.45163674  0.01979887  0.04428182  0.07931634  0.29463112  0.28214146
  0.12409725  0.14462393  0.15912227  0.18680119  0.22569609]
Out[25]:
<matplotlib.legend.Legend at 0x7ff5ad030d68>

It turns out that over 60% of the variance can be explained by a single component. This could be used to construct a single measure of distance to the centre. Including the next two components (whose eigenvalues are close to or above 1) would preserve nearly 80% of the variance.
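
As a side note, such a single "centrality" feature could be extracted with scikit-learn's PCA roughly like this (a sketch only, reusing the standardized x_dist_std from above; I do not use this feature further in the notebook):

# Sketch: compress the distance columns into one principal component
from sklearn.decomposition import PCA

pca = PCA(n_components=1)
centrality = pca.fit_transform(x_dist_std)      # shape: (n_ads, 1)
print(pca.explained_variance_ratio_)            # should be roughly 0.6
# dist['centrality'] = centrality                # optional extra feature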

In the end, though, I will include all the available information to improve the fit as much as possible.

Merging with housing data

In the previous note I ended up with a regression model including housing parameters, which I then extended with a vectorized advertisement representation. This proved useful, decreasing the Mean Squared Error (MSE) on the test set from an initial 0.0429 to 0.0238 and raising $R^2$ from 0.6906 to 0.8280. This was a huge improvement.

In this step I will add the distances information to the model and check if this will bring a further improvement to the model.

I start by pulling in all the relevant data from the SQL database.

In [26]:
# Pull in the data from SQL database
query = '''
    SELECT price, 
        pow, 
        rooms, 
        floor, 
        data_lat, 
        data_lon,
        d.*, 
        details
    FROM filtered as f
    INNER JOIN
    poi.distances as d
    ON f.id=d.id
    ;
'''
df = pd.read_sql_query(query, con)
df['lprice'] = np.log(df['price'])
df['lpow'] = np.log(df['pow'])
df = df.drop(columns=['price', 'pow'])
print(df.shape)
df.head()
(39373, 25)
Out[26]:
rooms floor data_lat data_lon id theatre parking pub restaurant kindergarten ... library cinema cafe bus_stop post_office pharmacy university details lprice lpow
0 2 0 53.384032 14.518157 13612 8446.02 811.0100 5559.87 4078.640 3313.04 ... 5253.22 8448.19 1324.02 133.4190 2757.04 1208.150 4144.97 Piękne z ogródkiem mieszkanie dwupokojowe  o p... 12.421184 3.827336
1 1 0 53.385297 14.463300 31694 12203.50 1146.8100 9105.86 2010.300 1443.41 ... 1954.56 11723.70 4744.07 152.1240 1734.30 1970.070 5089.04 Oferujemy do sprzedaży piękną kawalerkę z ogro... 12.201060 3.475067
2 1 0 53.388789 14.475584 39792 10725.80 62.7438 7612.96 774.926 1922.46 ... 2324.20 10277.00 3463.15 196.9160 1584.14 810.551 3642.52 Do sprzedaży komfortowe mieszkanie na prestiżo... 12.180755 3.555348
3 2 2 53.388831 14.475330 40051 10742.90 37.5584 7631.26 803.760 1921.99 ... 2328.14 10292.00 3492.43 195.1190 1593.82 836.770 3656.69 Na sprzedaż piękne, dwupoziomowe, dwupokojowe ... 12.445090 3.766535
4 2 2 53.384557 14.463620 40456 12259.80 1038.0100 9155.52 1929.120 1310.24 ... 1819.35 11789.50 4706.89 66.7791 1633.13 1947.060 5152.23 Dwupokojowe mieszkanie z balkonem na Warzymica... 12.608199 3.912023

5 rows × 25 columns

Having my data loaded, I next apply the word preprocessor to the advertisements. After that I split the data into training and test sets.

In [27]:
import re
import string
def preprocessor(text):
    # Remove punctuation
    text = re.sub('[' + string.punctuation + ']+', ' ', text)
    # Replace numbers
    text = re.sub(r'\d+', '_liczba ', text)
    # Lowercase and strip spaces
    text = text.lower().strip()
    return text

df['details'] = df['details'].apply(preprocessor)
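
A quick illustrative check of the preprocessor (not part of the original notebook output):

# Illustration only: punctuation removed, numbers replaced, text lowercased
sample = "Piękne mieszkanie, 3 pokoje, 65 m2!"
print(preprocessor(sample))
# -> roughly: 'piękne mieszkanie _liczba pokoje _liczba m_liczba' (modulo extra spaces)
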
In [28]:
# Generate full train and test data
from sklearn.model_selection import train_test_split
# Get column ids
y_col = [df.columns.get_loc('lprice')]
x_col = [df.columns.get_loc(i) for i in ['lpow', 'rooms', 'floor', 'data_lat', 'data_lon']]
text_col = [df.columns.get_loc('details')]
dist_col = [df.columns.get_loc(i) for i in dist.drop(columns='id').columns]

x = df.iloc[:, x_col + dist_col + text_col].values
y = df.iloc[:, y_col].values

x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=.2,
    random_state=12345
)

I then create the word vectorization pipeline, limiting the final number of components to 400.

In [29]:
# Create word vectorization pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

from nltk.corpus import stopwords
stop = stopwords.words('polish')

vec_words = Pipeline([
    ('tfidf', TfidfVectorizer(
        use_idf=True,
        norm='l2',
        max_df=1.0,
        smooth_idf=True,
        stop_words=stop
    )),
    ('svd', TruncatedSVD(
        n_components=400
    ))
])

Here I fit the vectorizer to my text data. I end up with 30181 unique words, which are then reduced to 400 features.

In [30]:
# Train word vectorizer and add results to train and test dataset
text_col_id = x_train.shape[1]-1
x_train_word_vect = vec_words.fit_transform(x_train[:, text_col_id])
print('Number of words vectorized: ', vec_words.named_steps['tfidf'].idf_.shape[0])
print('Train vectorized text data shape: ', x_train_word_vect.shape)
x_train = np.hstack((x_train[:, :text_col_id], x_train_word_vect))

x_test_word_vect = vec_words.transform(x_test[:, text_col_id])
print('Test vectorized text data shape: ', x_test_word_vect.shape)
x_test = np.hstack((x_test[:, :text_col_id], x_test_word_vect))
Number of words vectorized:  30181
Train vectorized text data shape:  (31498, 400)
Test vectorized text data shape:  (7875, 400)

Simple OLS

I fit a simple OLS model using only information about the area, floor, coordinates and number of rooms.

In [31]:
# Simple OLS
from sklearn.linear_model import LinearRegression as OLS
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2

ols_simple = OLS()
ols_simple.fit(x_train[:, :5], y_train)

y_pred_test = ols_simple.predict(x_test[:, :5])
y_pred_train = ols_simple.predict(x_train[:, :5])


print('OLS - area, floor, coords, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('OLS - area, floor, coords, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
OLS - area, floor, coords, train MSE: 0.0430 R2: 0.6855
OLS - area, floor, coords, test MSE: 0.0425 R2: 0.6964

OLS with distances

Here I add the extracted distances to the previous model.

Overall the MSE on the test set improved, but only by a meager 6.1%. That is not much compared to the improvement provided by the vectorized ads.

In [32]:
# OLS - with distances
ols_dist = OLS()
ols_dist.fit(x_train[:, :text_col_id], y_train)

y_pred_test = ols_dist.predict(x_test[:, :text_col_id])
y_pred_train = ols_dist.predict(x_train[:, :text_col_id])


print('OLS - with distances, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('OLS - with distances, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
OLS - with distances, train MSE: 0.0403 R2: 0.7051
OLS - with distances, test MSE: 0.0398 R2: 0.7152

OLS with distances and ads - full model

In the end I included all variables in the OLS model. The MSE and $R^2$ on the test set improved from 0.0238 and 0.8280 to 0.0216 and 0.8387 respectively. It seems small, but overall the MSE improved by over 9%. This is not negligible, yet I expected a bigger improvement. It looks like including latitude and longitude in the model is enough to capture most of the information.

A possible explanation can be found in the shape of Szczecin, which is divided by the Odra river. This separation can be approximated by a plane in lat-lon coordinates.

In [33]:
# All variables
ols_full = OLS()
ols_full.fit(x_train, y_train)

y_pred_test = ols_full.predict(x_test)
y_pred_train = ols_full.predict(x_train)


print('OLS - all variables, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('OLS - all variables, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
OLS - all variables, train MSE: 0.0225 R2: 0.8358
OLS - all variables, test MSE: 0.0221 R2: 0.8422

Random Forest Regression

To further explore my model I tried fitting a random forest estimator. Unfortunately, the estimator did worse than OLS. I did not tune any hyperparameters (note the very shallow max_depth of 2), so it should be possible to obtain a better fit by tuning them thoroughly (see the sketch after the results below).

In [34]:
# Regression with RandomForest regressor

from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor(max_depth=2, random_state=0, n_estimators=100)
forest.fit(x_train, y_train)

y_pred_test = forest.predict(x_test)
y_pred_train = forest.predict(x_train)


print('Random Forest - all variables, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('Random Forest - all variables, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
/home/mateusz/.venv/ml36/lib/python3.6/site-packages/ipykernel_launcher.py:6: DataConversionWarning: A column-vector y was passed when a 1d array was expected. Please change the shape of y to (n_samples,), for example using ravel().
  
Random Forest - all variables, train MSE: 0.0515 R2: 0.6232
Random Forest - all variables, test MSE: 0.0517 R2: 0.6301
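
A possible (untested here) hyperparameter search over the forest could look roughly like this; the parameter grid below is just an illustration, not a tuned choice:

# Sketch only: randomized search over a few RandomForestRegressor hyperparameters
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [100, 200, 400],
    'max_depth': [4, 8, 16, None],
    'max_features': ['sqrt', 0.3, 0.5],
    'min_samples_leaf': [1, 5, 20],
}
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=20,
    cv=3,
    scoring='neg_mean_squared_error',
    n_jobs=-1,
    random_state=0
)
# .ravel() also avoids the column-vector warning seen above
# search.fit(x_train, y_train.ravel())
# print(search.best_params_, -search.best_score_)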

Neural net

Finally, I wanted to try fitting my model with a neural network approach. The examples I encountered in Python Machine Learning and Hands-On Machine Learning with Scikit-Learn and TensorFlow dealt only with using neural networks for classification. After a short search I found an example of using a NN for regression modelling. It used Keras and was quite simple to implement.

I did add some modifications that came to mind after reading Hands-On Machine Learning with Scikit-Learn and TensorFlow. First I used a StandardScaler to standardize the input data. Next I changed the size of each layer to account for the larger number of inputs. I used the He normal initializer to obtain more efficient random weights at initialization, the ELU activation function to reduce the vanishing gradient problem, and batch normalization on each layer to avoid exploding gradients.

I trained the model for 200 epochs, saving the best weights whenever the validation loss improved.

The final best weights fare much better than OLS. However, comparing MSE and $R^2$ between the training and test data shows considerable differences, suggesting that the model is overfitting the training data. This calls for further tuning of the network. Adding a dropout layer would be a good idea, but I have not implemented it yet for the sake of speed (a sketch is shown below).
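
For completeness, a dropout layer could be slotted between the hidden layers roughly like this (a sketch only; the model trained below does not include it):

# Sketch only: dropout between hidden blocks to reduce overfitting
from keras.layers import Dropout

# ...inside the model definition, after a hidden block:
# NN_model.add(Dense(256, kernel_initializer='he_normal'))
# NN_model.add(BatchNormalization())
# NN_model.add(Activation('elu'))
# NN_model.add(Dropout(0.3))   # randomly drop 30% of activations during training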

The MSE for the test dataset has fallen to 0.0099, that is 54.2% less than for the full OLS. This result is impressive despite the simplicity of the NN model, which still needs a lot of tuning.

In [35]:
# NN
# https://towardsdatascience.com/deep-neural-networks-for-regression-problems-81321897ca33
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization

from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

x_train_std = sc.fit_transform(x_train)
x_test_std = sc.transform(x_test)

NN_model = Sequential()

# The Input Layer :
NN_model.add(Dense(
    256, 
    kernel_initializer='he_normal',
    input_dim=x_train_std.shape[1], 
))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))

# The Hidden Layers :
NN_model.add(Dense(256, kernel_initializer='he_normal'))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))

#NN_model.add(Dropout(0.5))
NN_model.add(Dense(256, kernel_initializer='he_normal'))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))

NN_model.add(Dense(128, kernel_initializer='he_normal'))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))

# The Output Layer :
NN_model.add(Dense(1, kernel_initializer='he_normal',activation='linear'))

# Compile the network :
NN_model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_absolute_error'])
NN_model.summary()
Using TensorFlow backend.
/home/mateusz/.venv/ml36/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype object was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
/home/mateusz/.venv/ml36/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype object was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
/home/mateusz/.venv/ml36/lib/python3.6/site-packages/sklearn/utils/validation.py:595: DataConversionWarning: Data with input dtype object was converted to float64 by StandardScaler.
  warnings.warn(msg, DataConversionWarning)
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 256)               108288    
_________________________________________________________________
batch_normalization_1 (Batch (None, 256)               1024      
_________________________________________________________________
activation_1 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_2 (Dense)              (None, 256)               65792     
_________________________________________________________________
batch_normalization_2 (Batch (None, 256)               1024      
_________________________________________________________________
activation_2 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_3 (Dense)              (None, 256)               65792     
_________________________________________________________________
batch_normalization_3 (Batch (None, 256)               1024      
_________________________________________________________________
activation_3 (Activation)    (None, 256)               0         
_________________________________________________________________
dense_4 (Dense)              (None, 128)               32896     
_________________________________________________________________
batch_normalization_4 (Batch (None, 128)               512       
_________________________________________________________________
activation_4 (Activation)    (None, 128)               0         
_________________________________________________________________
dense_5 (Dense)              (None, 1)                 129       
=================================================================
Total params: 276,481
Trainable params: 274,689
Non-trainable params: 1,792
_________________________________________________________________
In [36]:
checkpoint_name = './weights/Weights-{epoch:03d}--{val_loss:.5f}.hdf5' 
checkpoint = ModelCheckpoint(
    checkpoint_name, 
    monitor='val_loss', 
    verbose = 1, 
    save_best_only = True, 
    mode ='auto'
    )
callbacks_list = [checkpoint]
In [37]:
hist = NN_model.fit(
    x_train_std, 
    y_train, 
    epochs=200, 
    batch_size=64, 
    validation_split = 0.2, 
    callbacks=callbacks_list, 
    verbose=2
)
Train on 25198 samples, validate on 6300 samples
Epoch 1/200
 - 16s - loss: 40.7007 - mean_absolute_error: 4.8475 - val_loss: 1.0702 - val_mean_absolute_error: 0.8036

Epoch 00001: val_loss improved from inf to 1.07023, saving model to ./weights/Weights-001--1.07023.hdf5
Epoch 2/200
 - 11s - loss: 0.2873 - mean_absolute_error: 0.4188 - val_loss: 0.1766 - val_mean_absolute_error: 0.3199

Epoch 00002: val_loss improved from 1.07023 to 0.17655, saving model to ./weights/Weights-002--0.17655.hdf5
...
Epoch 198/200
 - 9s - loss: 0.0042 - mean_absolute_error: 0.0517 - val_loss: 0.0105 - val_mean_absolute_error: 0.0632

Epoch 00198: val_loss improved from 0.01085 to 0.01055, saving model to ./weights/Weights-198--0.01055.hdf5
Epoch 199/200
 - 9s - loss: 0.0041 - mean_absolute_error: 0.0504 - val_loss: 0.0107 - val_mean_absolute_error: 0.0635

Epoch 00199: val_loss did not improve from 0.01055
Epoch 200/200
 - 10s - loss: 0.0041 - mean_absolute_error: 0.0502 - val_loss: 0.0118 - val_mean_absolute_error: 0.0717

Epoch 00200: val_loss did not improve from 0.01055
In [39]:
# Load the weights file of the best model:
weights_file = './weights/Weights-198--0.01055.hdf5' # choose the best checkpoint 
NN_model.load_weights(weights_file) # load it
NN_model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_absolute_error'])
In [40]:
y_pred_test = NN_model.predict(x_test_std, batch_size=64)
y_pred_train = NN_model.predict(x_train_std, batch_size=64)

print('Neural network, train MSE: %.4f R2: %.4f' % (
    mse(y_train, y_pred_train),
    r2(y_train, y_pred_train)
))
print('Neural network, test MSE: %.4f R2: %.4f' % (
    mse(y_test, y_pred_test),
    r2(y_test, y_pred_test)
))
Neural network, train MSE: 0.0030 R2: 0.9780
Neural network, test MSE: 0.0099 R2: 0.9289

Looking at the training and validation loss over time, the overfitting becomes clearly visible. There is little improvement in the validation set MSE after epoch 50.

In [41]:
epoch = range(len(hist.history['loss'][1:]))
plt.plot(epoch, hist.history['val_loss'][1:], label='Validation set MSE')
plt.plot(epoch, hist.history['loss'][1:], label='Training set MSE')
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.legend(loc='best', prop={'size': 20})
Out[41]:
<matplotlib.legend.Legend at 0x7ff538226d30>

Also, looking at the prediction errors for the test set, we can notice some visible patterns in the errors. This is a clear indication of overfitting and suggests that we should use dropout in the network in the future.

In [42]:
# Analyze errors
plt.scatter(
    y_pred_test,
    y_pred_test - y_test,
    c='steelblue',
    marker='o',
    edgecolor='white',
    alpha=.4
)
plt.hlines(0.0, 11., 14., color='black', lw=2, linestyle='--')
Out[42]:
<matplotlib.collections.LineCollection at 0x7ff5381f30f0>

Looking at the geographical distribution of prediction errors, the clear spatial pattern of errors visible in the OLS results from the previous entry seems to be gone. There are still some outliers visible.

In [43]:
# Plot geographical distribution of errors
err =  y_pred_train - y_train
lat_train = x_train_std[:, 3]
lon_train = x_train_std[:, 4]

plt.figure(figsize=(12*1.1,12))
plt.hexbin(lon_train, lat_train, err, gridsize=40, cmap='bwr', vmin=-0.5, vmax=0.5, reduce_C_function=np.mean)
plt.colorbar()
Out[43]:
<matplotlib.colorbar.Colorbar at 0x7ff53810d358>

In the end I save the elements of the model for possible future use with the joblib module.

In [44]:
import joblib
joblib.dump(
    {
        'word_pipe': vec_words,
        'nn_scaler': sc,
        'nn_model': NN_model
    },
    'nn_szczecin_model.pkl'
)
Out[44]:
['nn_szczecin_model.pkl']
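
For later reuse, the saved bundle could be loaded and applied roughly as follows (a sketch; new_rows is a hypothetical array laid out like x_train before word vectorization, and unpickling the Keras model this way assumes compatible Keras/TensorFlow versions):

# Sketch only: reload the pickled pipeline and predict prices for new ads
bundle = joblib.load('nn_szczecin_model.pkl')
word_pipe, nn_scaler, nn_model = bundle['word_pipe'], bundle['nn_scaler'], bundle['nn_model']

# new_rows: [lpow, rooms, floor, data_lat, data_lon, 17 distances, raw ad text]
# new_text = word_pipe.transform(new_rows[:, -1])
# new_x = np.hstack((new_rows[:, :-1], new_text))
# log_price = nn_model.predict(nn_scaler.transform(new_x), batch_size=64)
# price_pln = np.exp(log_price)    # back from log-price to price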

Out of curiosity I want to inspect the input layer of my model. I will plot the weight matrix for roughly the first 200 input rows (the first 22 of which are the housing and distance variables, the rest word features) as an image.

Below I can see that some input neurons have strong weights on the coordinate variables, while most of the distance variables received low weights.

In [69]:
weights, bias = NN_model.layers[0].get_weights()
plt.imshow(weights[:201, :], cmap='seismic', clim=(-3, 3))
plt.colorbar()
plt.xlabel('2nd layer')
plt.ylabel('1st layer (variable input)')
Out[69]:
Text(0, 0.5, '1st layer (variable input)')

Conclusions

In the end, the benefit of adding more geographical information was small. There is some improvement, but its scale makes further pursuit of this direction questionable. What does improve the prediction error is training a neural network on this data. Still, a simple out-of-the-box model is not enough, because such a simple model is prone to overfitting the training dataset. This will require model tuning in the future.

It would be perfect to use the pickled model to estimate prices from real data. What would be great is a web service that, given a link to an ad, would extract all the information and estimate the price based on it. It's doable, but maybe next time... ;)

There are some possible enhancements to the model:

  1. Use road / travel time distances instead of simple Euclidean ones (see the sketch after this list);
  2. Exclude observations located in uninhabited areas;
  3. Add Polish word stemmer for ad content vectorization;
  4. Fine-tune the neural network model;
  5. Add time dependence.
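
For point 1, pgRouting could provide network distances or travel times instead of the Euclidean ones. A very rough sketch, assuming osm2pgrouting-style tables ways (with source, target and a cost column) and ways_vertices_pgr, which may well differ from the routing table actually present in this database; the two coordinates are purely illustrative:

# Hypothetical sketch: routed cost between one ad location and one amenity
query = """
WITH start_v AS (   -- nearest road-network vertex to the ad
    SELECT id FROM ways_vertices_pgr
    ORDER BY the_geom <-> ST_SetSRID(ST_MakePoint(14.55, 53.43), 4326) LIMIT 1
), end_v AS (       -- nearest road-network vertex to the amenity
    SELECT id FROM ways_vertices_pgr
    ORDER BY the_geom <-> ST_SetSRID(ST_MakePoint(14.50, 53.44), 4326) LIMIT 1
)
SELECT SUM(cost) AS route_cost
FROM pgr_dijkstra(
    'SELECT gid AS id, source, target, length_m AS cost FROM ways',
    (SELECT id FROM start_v), (SELECT id FROM end_v), false
);
"""
# pd.read_sql(query, con)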