I started my analysis in part 1 with simple data exploration and the most basic OLS regression. In part 2 I proceeded to more advanced machine learning methods from the scikit-learn library.
In today's part I will move on to augmenting my data with geographical characteristics. For that I will explore OpenStreetMap data for Szczecin and merge it with the downloaded housing ads. I want to see whether adding information about proximity to schools, shops, entertainment and other amenities has any influence on housing prices.
Most of the tools used in this chapter I learned from two books. I already knew some basic SQL syntax, but by studying PostgreSQL: Up and Running I familiarized myself with some peculiarities of PostgreSQL. I have tried MySQL, SQLite and MongoDB, but somehow I grew a liking for PostgreSQL. It might simply be that this was the first time I learned a relational database systematically from a book instead of web tutorials. Also, whenever I forget a particular topic it is relatively easy for me to find it in the book.
I have also used GIS a lot at work recently, working mostly with QGIS and ArcGIS. I quickly got interested in easy routing calculations using pgRouting, and to go deeper into it I naturally needed to learn PostGIS. For that I studied PostGIS in Action. It was a lifesaver and brilliantly written! I enjoyed going through it, learning more both about PostgreSQL and about geographic information processing in general. I will implement some of the ideas from the book further in this notebook.
In the end I will compare different regression models, and finally construct a simple neural network regression model and compare its results with OLS.
import numpy as np
import psycopg2 as psql
import pandas as pd
import geopandas as gpd
# Matplotlib options
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
# Helper function to display PostGIS geometries
def disp_poly(sql, con, geom="geom"):
    """
    Display polygon resulting from PostGIS SQL results.
    """
    df = gpd.GeoDataFrame.from_postgis(sql, con, geom_col=geom)
    if 'id' not in df.columns:
        df['id'] = df.index
    return df
Creating distance data
After loading basic libraries I will use the IPython SQL magic to connect to PostgreSQL, and I will also create a psycopg2 connection. In addition I will use the pgspecial Python module to enable execution of psql commands.
%load_ext sql
%sql postgresql://mateusz@localhost/szczecin_data
con = psql.connect(database="szczecin_data", user="mateusz", host="localhost")
Before running this exercise I downloaded the OpenStreetMap data for the Zachodniopomorskie region from Geofabrik. Next I used osm2pgsql to load the OSM data into PostgreSQL with PostGIS. The resulting tables appear as the planet_osm_* tables, with points, lines, roads and polygons stored separately.
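For reference, the load step could look roughly like the call sketched below. I wrap it in subprocess only to keep everything in Python; the extract file name and the exact flag choices are assumptions (--hstore is what populates the tags column used later, and osm2pgsql writes the geometries in Web Mercator by default).
# Hedged sketch (not run in this notebook): loading the Geofabrik extract with
# osm2pgsql. The file name and flags are assumptions; --slim uses intermediate
# tables and --hstore stores extra OSM tags in the hstore `tags` column.
import subprocess
subprocess.run([
    'osm2pgsql', '--slim', '--hstore',
    '-d', 'szczecin_data',
    '-U', 'mateusz',
    '-H', 'localhost',
    'zachodniopomorskie-latest.osm.pbf'
], check=True)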
%sql \dt
Remembering the previous exercise, I want to filter out some of the observations. I created a view containing only the filtered ads, so that I don't need to repeat the filtering query every time.
DROP VIEW IF EXISTS filtered;
CREATE VIEW filtered AS
SELECT id, price, pow, rooms, floor, data_lat, data_lon, details, description, geom
FROM adds
WHERE
pow < 200 AND
price < 1e6 AND
data_lat > 53.01 AND
data_lat < 53.6 AND
data_lon > 14.01 AND
data_lon < 15 AND
details != '';
Plotting maps in Python
I can now plot the selected ads according to their geographic coordinates. To achieve this I will use the wrapper function around the GeoPandas module defined at the beginning.
query = 'SELECT id, pow, price, geom FROM filtered;'
df = disp_poly(query, con)
print(df.shape)
df.head()
df.plot(column='price')
I will next add a map in the background following the GeoPandas
documentation.
# Add Helper function for plotting
import contextily as ctx
def add_basemap(ax, zoom, url='http://tile.stamen.com/terrain/tileZ/tileX/tileY.png'):
    xmin, xmax, ymin, ymax = ax.axis()
    basemap, extent = ctx.bounds2img(xmin, ymin, xmax, ymax, zoom=zoom, url=url)
    ax.imshow(basemap, extent=extent, interpolation='bilinear')
    # restore original x/y limits
    ax.axis((xmin, xmax, ymin, ymax))
# Reproject geo data from WGS84 to Web Mercator
df = df.to_crs(epsg=3857)
# Plot the map with data
ax = df.plot(column='price', alpha=0.1)
minx, miny, maxx, maxy = df.total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
add_basemap(ax, zoom=10, url=getattr(ctx.sources, 'OSM_A'))
ax.set_axis_off()
That's nice, but there's a lot going on in the centre. I want to zoom in on a region I defined in a geojson file.
# Read centre
centre = gpd.read_file('/home/mateusz/projects/gis/szczecin_geo_analysis/centre.geojson')
ax = df.plot(column='price', alpha=0.1)
minx, miny, maxx, maxy = centre.to_crs(epsg=3857).total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
add_basemap(ax, zoom=12, url=getattr(ctx.sources, 'OSM_A'))
ax.set_axis_off()
From these simple maps we can already see that some of the locations are miscoded. There are also many locations with exactly the same coordinates; apparently many of them were provided based on some POI or only approximated by the neighbourhood, ending up with the same centre.
A couple of observations also appear to be located on water or at a cemetery. It would be advisable to remove them for future analysis.
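One way to do that (a hedged sketch only, not executed here) would be to flag ads whose point falls inside OSM water or cemetery polygons; the tag filter below is an assumption based on common OSM tagging.
# Sketch (not run here): flag ads located on water or cemetery polygons.
# Uses the same planet_osm_polygon table as above; the tag filter is an assumption.
flag_query = """
SELECT f.id
FROM filtered AS f
JOIN planet_osm_polygon AS p
  ON ST_Intersects(ST_Transform(f.geom, 3857), p.way)
WHERE p."natural" = 'water' OR p.landuse = 'cemetery';
"""
suspicious_ids = pd.read_sql(flag_query, con)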
Below is a simple query counting ads that share the same location.
%sql SELECT COUNT (*) as sum FROM filtered group by geom order by sum desc limit 10;
Before continuing with the geographical analyses in PostGIS I would like to add a spatial index to speed up my queries.
It is done simply in SQL:
DROP INDEX IF EXISTS adds_gix;
CREATE INDEX adds_gix ON adds USING GIST (geom);
VACUUM ANALYZE adds;
Selecting amenities from OSM
I will be using the amenity key in OSM to first count interesting amenities. One thing I noticed is that the same amenity type can appear as points, as polygons, or as both. What I want to do next is to see how many amenities are defined as each geometry type.
In constructing my query I want to include only amenities within the extent of the filtered ads (1). Then I count the total number of amenities in the polygon and point tables (2). Finally I join the counts for polygons and points by amenity type (3).
The results show that many amenities are unique to one geometry type (having NaN counts in the sum column), but there are still many amenity types that appear as both. This means I will have to handle cases where the same amenity is defined both as a polygon and as a point, and make sure that no double counting occurs.
To be frank, this should not be a big problem when the only thing I am interested in is the minimal distance to the closest amenity. Double counting aside, it would still be the closest amenity to a house; by dropping duplicates I will just speed up my query.
# Count amenities
query = """
WITH ext as (
-- 1. Select only amenities within the adds extent.
SELECT ST_Transform(
ST_SetSRID(ST_Extent(geom), 4326),
3857
) as ext
FROM filtered
)
SELECT
pt.amenity,
poly.amenity,
poly.sum as poly_sum,
pt.sum as point_sum
FROM
(
-- 2. Select amenities within the extent and count them by amenity type.
SELECT amenity, count (*) as sum
FROM planet_osm_polygon, ext
WHERE amenity is not NULL
AND way && ext.ext
GROUP BY amenity
) as poly
-- 3. Join points and polygon amenities counts by amenity type.
FULL OUTER JOIN
(
SELECT amenity, count (*) as sum
FROM planet_osm_point, ext
WHERE amenity is not NULL
AND way && ext.ext
GROUP BY amenity
) as pt
ON (pt.amenity=poly.amenity)
order by poly_sum desc
;
"""
amenities = pd.read_sql(query, con)
amenities
Just to see if this type of amenity selection works, I will load schools both as points and as polygons. First I select all school polygons. Next I UNION them with the points that do not intersect (i.e. are outside) the already selected polygons. To achieve this I LEFT JOIN the points with the intersecting polygons and keep only those points that did not join; these can be identified by a NULL polygon identifier.
# Load schools
query = """
WITH polys AS (
-- select 'school' polygons as CTE
SELECT osm_id, way, name, tags
FROM planet_osm_polygon
WHERE amenity = 'school'
)
SELECT osm_id, ST_Centroid(way) as geom, name, tags
FROM
(
SELECT * FROM polys
UNION ALL
(
SELECT pt.osm_id as osm_id, pt.way, pt.name, pt.tags
-- Select points
FROM planet_osm_point as pt
-- Join them with polygons with which they intersect
LEFT JOIN polys as pol
on ST_Intersects(pt.way, pol.way)
-- Leave only schools (points) which did not intersect with polygons
WHERE pt.amenity = 'school' and pol.osm_id is NULL
)
) as foo;
"""
schools = gpd.GeoDataFrame.from_postgis(query, con, geom_col='geom')
ax = schools.to_crs(epsg=3857).plot(c='green')
minx, miny, maxx, maxy = centre.to_crs(epsg=3857).total_bounds
ax.set_xlim(minx, maxx)
ax.set_ylim(miny, maxy)
add_basemap(ax, zoom=12, url=getattr(ctx.sources, 'OSM_A'))
ax.set_axis_off()
The previous example looks promising. I will next summarize all amenities within the ads extent in the same way, selecting only unique points (centroids for polygons). I will be using the PostGIS && operator for bounding-box intersection.
# Load all amenities
query = """
-- Define adds extent
WITH ext as (
SELECT ST_Transform(
ST_SetSRID(ST_Extent(geom), 4326),
3857
) as ext
FROM filtered
),
-- Define polygons intersecting with extent
polys AS (
SELECT osm_id, amenity, way, name, tags
FROM planet_osm_polygon, ext
WHERE amenity is not NULL
AND way && ext.ext
)
SELECT osm_id, amenity, name, tags, ST_Centroid(way) as geom
FROM
(
-- Union polygons with points NOT within those polygons
SELECT * FROM polys
UNION ALL
(
SELECT pt.osm_id, pt.amenity, pt.way, pt.name, pt.tags
FROM ext, planet_osm_point as pt
LEFT JOIN polys as pol
on ST_Intersects(pt.way, pol.way)
AND pt.amenity = pol.amenity
WHERE
pt.amenity is not null
and pol.osm_id is NULL
AND pt.way && ext.ext
)
) as foo;
"""
amenities = pd.read_sql(query, con)
amenities['amenity'].value_counts()
Next I will build a table of amenities for my analysis. I arbitrarily choose the following:
- schools;
- kindergartens;
- university;
- theatre;
- restaurant;
- library;
- pub;
- post_office;
- place_of_worship;
- parking;
- pharmacy;
- fuel;
- cafe;
- cinema;
- bank;
- atm;
I will also add bus stops using the highway key.
I will create a new poi schema and populate it with the amenities, using only the ads extent for selection. At the end I will add a primary key and a spatial index for faster geometric queries.
The final SQL query to select the amenities is the following:
---- Build amenities table in the poi schema
DROP schema if exists poi cascade;
CREATE SCHEMA poi;
CREATE TABLE poi.amenities AS
WITH ext as (
SELECT ST_Transform(
ST_SetSRID(ST_Extent(geom), 4326),
3857
) as ext
FROM filtered
),
-- Define polygons intersecting with extent
polys AS (
SELECT osm_id, amenity, way, name, tags
FROM planet_osm_polygon, ext
WHERE amenity is not NULL
AND way && ext.ext
)
SELECT osm_id, amenity, name, tags, ST_Centroid(way) as geom
FROM
(
-- Union polygons with points NOT within those polygons
SELECT * FROM polys
UNION ALL
(
SELECT pt.osm_id, pt.amenity, pt.way, pt.name, pt.tags
FROM ext, planet_osm_point as pt
LEFT JOIN polys as pol
ON
ST_Intersects(pt.way, pol.way)
AND pt.amenity = pol.amenity
WHERE
pt.amenity is not null
and pol.osm_id is NULL
AND pt.way && ext.ext
)
) as foo
WHERE
amenity IN (
'school',
'kindergarten',
'university',
'theatre',
'restaurant',
'library',
'pub',
'post_office',
'place_of_worship',
'parking',
'pharmacy',
'fuel',
'cafe',
'cinema',
'bank',
'atm'
)
;
-- Add bus stops
WITH ext as (
SELECT ST_Transform(
ST_SetSRID(ST_Extent(geom), 4326),
3857
) as ext
FROM filtered
)
INSERT INTO poi.amenities
SELECT osm_id, highway as amenity, name, tags, way as geom
FROM planet_osm_point as p, ext
WHERE
highway = 'bus_stop'
AND way && ext.ext;
--- Add id key
ALTER TABLE poi.amenities ADD COLUMN aid serial PRIMARY KEY;
--- Add spatial index
DROP INDEX IF EXISTS idx_poi_amenities_geom cascade;
CREATE INDEX idx_poi_amenities_geom
ON poi.amenities USING GIST (geom);
VACUUM ANALYZE poi.amenities;
Calculating distances
Having moved my amenities into a separate table, I will now calculate, for each advertisement, the minimal distance to the nearest amenity of each type in the poi.amenities table.
To do this I will use a PL/pgSQL loop over all amenity types. In the following query I create a poi.distances table to hold all distances for each ad: each row is an ad and each column is the distance to one amenity type. This way I obtain a nicely wide-formatted table. A similar query can be constructed without resorting to PL, but it returns a long-formatted table, which I want to avoid.
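For completeness, the non-PL variant mentioned above could look roughly like this (a hedged sketch, not run here); it returns one row per (ad, amenity type) pair and would still need pivoting into the wide format.
# Hedged sketch (not run here): long-format distances without PL/pgSQL.
long_query = """
SELECT a.id, poi.amenity,
MIN(ST_Distance(ST_Transform(a.geom, 3857), poi.geom)) AS dist
FROM filtered AS a
CROSS JOIN poi.amenities AS poi
GROUP BY a.id, poi.amenity;
"""
# pd.read_sql(long_query, con).pivot(index='id', columns='amenity', values='dist')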
To calculate the minimal distances I will be using a LATERAL CROSS JOIN together with the PostGIS <-> distance operator. Inside the joined subquery I order by the distance to the current geometry and keep only the first row. The drawback of this approach is that each ad is effectively matched against every amenity of a given type, so the number of comparisons is the product of the sizes of the two tables. This can be mitigated by selecting only amenities within a specified distance before the join, but in my case I do not know the minimal distances in advance.
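As an aside, a hedged sketch of that mitigation for a single amenity type is shown below (not used in the loop that follows); the 5000 m ST_DWithin radius is an arbitrary assumption.
# Hedged sketch (not run here): nearest-school distance with an ST_DWithin
# pre-filter, so the KNN search only considers candidates within 5 km.
dwithin_query = """
SELECT a.id, r.dist
FROM filtered AS a
CROSS JOIN LATERAL (
    SELECT ST_Distance(ST_Transform(a.geom, 3857), poi.geom) AS dist
    FROM poi.amenities AS poi
    WHERE poi.amenity = 'school'
      AND ST_DWithin(ST_Transform(a.geom, 3857), poi.geom, 5000)
    ORDER BY ST_Transform(a.geom, 3857) <-> poi.geom
    LIMIT 1
) AS r;
"""
# pd.read_sql(dwithin_query, con).head()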
-- Create wide table using PL/pgSQL
DO $$
DECLARE
am record;
BEGIN
-- Create distances table
DROP TABLE IF EXISTS poi.distances;
CREATE TABLE poi.distances AS SELECT id FROM filtered ORDER BY id;
ALTER TABLE poi.distances ADD PRIMARY KEY (id);
-- Loop over distinct amenity types
FOR am IN SELECT DISTINCT amenity FROM poi.amenities
LOOP
-- Add distance column to the table
EXECUTE 'ALTER TABLE poi.distances
ADD COLUMN '|| am.amenity ||' real';
-- Calculate minimal distances
EXECUTE 'UPDATE poi.distances
SET '|| am.amenity ||' = res.dist
FROM
(
SELECT a.id,
ST_Distance(ST_Transform(a.geom, 3857), r.geom) as dist
FROM filtered as a
CROSS JOIN LATERAL
(
SELECT amenity, geom
FROM
poi.amenities as poi
WHERE amenity = '''|| am.amenity ||'''
ORDER BY ST_Transform(a.geom, 3857) <-> poi.geom
LIMIT 1
) as r
) as res
WHERE
poi.distances.id = res.id
';
RAISE NOTICE 'Added amenity: %', am.amenity;
END LOOP;
END
$$;
After executing the query I load my distances table to check whether the results make sense. Seeing that on average homes are rather close to parking lots and bus stops seems reasonable. Seeing that homes can on average be far from a theatre or a cinema is also a good sign, because there are not that many of them in Szczecin.
dist = pd.read_sql('SELECT * from poi.distances;', con)
ax = dist.drop(columns='id').boxplot(return_type='axes', rot=45)
ax.set_ylabel("Distance [m]")
ax.set_yscale('log')
dist.drop(columns='id').describe().T
dist.plot.scatter('place_of_worship', 'university')
Distance PCA
Next I want to see how the distances relate to each other. I will plot a correlation matrix of the distances to see whether there are groups of amenities whose distances are correlated.
import seaborn as sns
cm = dist.drop(columns='id').corr()
hm = sns.heatmap(
cm,
cbar=True,
annot=True,
square=False,
fmt='.2f',
cmap=sns.cm.rocket_r
)
There are evident clusters of features that go together, like banks, libraries, cinemas and cafes. This can be explained by Szczecin having functional clusters of businesses establishing themselves close to each other. It is also interesting that bus stops and churches have some of the lowest correlations with the other amenities, which can be explained by their ubiquity: they are just everywhere in Szczecin, not concentrated in any way.
Moreover, it is somewhat amusing that the distance to a church is the least correlated with the distance to the university...
I will next search for clusters using Principal Component Analysis. I will be following the example from S. Raschka's Python Machine Learning book.
from sklearn.preprocessing import StandardScaler
cols = dist.drop(columns='id').columns
x_dist = dist.drop(columns='id').values
# I standardize features
sc = StandardScaler()
x_dist_std = sc.fit_transform(x_dist)
# I calculate the eigenvalues by hand
cov_mat = np.cov(x_dist_std.T)
eigen_vals, eigen_vecs = np.linalg.eig(cov_mat)
print(eigen_vals)
# Example from Raschka, S., Mirjalili, V., "Python Machine Learning", p. 147, 2nd Edition
tot = sum(eigen_vals)
n_vals = len(eigen_vals)+1
var_exp = [(i / tot) for i in sorted(eigen_vals, reverse=True)]
cum_var_exp = np.cumsum(var_exp)
plt.bar(
range(1,n_vals),
var_exp,
alpha=0.5,
align='center',
label='individual explained variance'
)
plt.step(
range(1,n_vals),
cum_var_exp,
alpha=0.8,
where='mid',
label='cumulative explained variance'
)
plt.ylabel('Explained variance')
plt.xlabel('Principal component')
plt.legend(loc='best')
It turns out that over 60% of the variance can be explained using a single component. This could be used to construct a single measure of distance to the centre; a sketch of such a measure is shown below. Including the next two components (whose eigenvalues are close to or above 1) would preserve nearly 80% of the variance.
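As an illustration, the same decomposition can be done directly with scikit-learn's PCA; the single-component score below is a sketch of such a "distance to the centre"-like measure and is not used further in this notebook.
# Sketch: project the standardized distances onto the first principal component.
from sklearn.decomposition import PCA
pca1 = PCA(n_components=1)
dist_pc1 = pca1.fit_transform(x_dist_std)
print('Explained variance ratio of PC1:', pca1.explained_variance_ratio_[0])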
In the end, though, I will include all available information to improve the fit as much as possible.
Merging with housing data
In the previous note I ended up with a regression model including housing parameters, which I then extended with a vectorized representation of the advertisement text. This proved useful, lowering the Mean Squared Error (MSE) on the test set from the initial 0.0429 to 0.0238 and raising $R^2$ from 0.6906 to 0.8280. This was a huge improvement.
In this step I will add the distances information to the model and check if this will bring a further improvement to the model.
I start from pulling in all the relevant data from SQL database.
# Pull in the data from SQL database
query = '''
SELECT price,
pow,
rooms,
floor,
data_lat,
data_lon,
d.*,
details
FROM filtered as f
INNER JOIN
poi.distances as d
ON f.id=d.id
;
'''
df = pd.read_sql_query(query, con)
df['lprice'] = np.log(df['price'])
df['lpow'] = np.log(df['pow'])
df = df.drop(columns=['price', 'pow'])
print(df.shape)
df.head()
Having my data loaded, I next apply the word preprocessor to the advertisements. After that I split the data into training and test sets.
import re
import string
def preprocessor(text):
    # Remove punctuation
    text = re.sub('[' + string.punctuation + ']+', ' ', text)
    # Replace numbers
    text = re.sub(r'\d+', '_liczba ', text)
    # Lowercase and strip spaces
    text = text.lower().strip()
    return text
df['details'] = df['details'].apply(preprocessor)
# Generate full train and test data
from sklearn.model_selection import train_test_split
# Get column ids
y_col = [df.columns.get_loc('lprice')]
x_col = [df.columns.get_loc(i) for i in ['lpow', 'rooms', 'floor', 'data_lat', 'data_lon']]
text_col = [df.columns.get_loc('details')]
dist_col = [df.columns.get_loc(i) for i in dist.drop(columns='id').columns]
x = df.iloc[:, x_col + dist_col + text_col].values
y = df.iloc[:, y_col].values
x_train, x_test, y_train, y_test = train_test_split(
x,
y,
test_size=.2,
random_state=12345
)
I then create the word vectorization pipeline, limiting the final number of components to 400.
# Create word vectorization pipeline
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from nltk.corpus import stopwords
stop = stopwords.words('polish')
vec_words = Pipeline([
('tfidf', TfidfVectorizer(
use_idf=True,
norm='l2',
max_df=1.0,
smooth_idf=True,
stop_words=stop
)),
('svd', TruncatedSVD(
n_components=400
))
])
Here I fit the vectorizer to my text data. I end up with 30121 unique words, which are then reduced to 400 features.
# Train word vectorizer and add results to train and test dataset
text_col_id = x_train.shape[1]-1
x_train_word_vect = vec_words.fit_transform(x_train[:, text_col_id])
print('Number of words vectorized: ', vec_words.named_steps['tfidf'].idf_.shape[0])
print('Train vectorized text data shape: ', x_train_word_vect.shape)
x_train = np.hstack((x_train[:, :text_col_id], x_train_word_vect))
x_test_word_vect = vec_words.transform(x_test[:, text_col_id])
print('Test vectorized text data shape: ', x_test_word_vect.shape)
x_test = np.hstack((x_test[:, :text_col_id], x_test_word_vect))
Simple OLS
I fit a simple OLS model using only the information about area, number of rooms, floor and coordinates.
# Simple OLS
from sklearn.linear_model import LinearRegression as OLS
from sklearn.metrics import mean_squared_error as mse
from sklearn.metrics import r2_score as r2
ols_simple = OLS()
ols_simple.fit(x_train[:, :5], y_train)
y_pred_test = ols_simple.predict(x_test[:, :5])
y_pred_train = ols_simple.predict(x_train[:, :5])
print('OLS - area, floor, coords, train MSE: %.4f R2: %.4f' % (
mse(y_train, y_pred_train),
r2(y_train, y_pred_train)
))
print('OLS - area, floor, coords, test MSE: %.4f R2: %.4f' % (
mse(y_test, y_pred_test),
r2(y_test, y_pred_test)
))
OLS with distances
Here I add the extracted distances to the previous model.
Overall the MSE on the test set improved, but only by a meager 6.1%. That is not much compared to the improvement provided by the vectorized ads.
# OLS - with distances
ols_dist = OLS()
ols_dist.fit(x_train[:, :text_col_id], y_train)
y_pred_test = ols_dist.predict(x_test[:, :text_col_id])
y_pred_train = ols_dist.predict(x_train[:, :text_col_id])
print('OLS - with distances, train MSE: %.4f R2: %.4f' % (
mse(y_train, y_pred_train),
r2(y_train, y_pred_train)
))
print('OLS - with distances, test MSE: %.4f R2: %.4f' % (
mse(y_test, y_pred_test),
r2(y_test, y_pred_test)
))
OLS with distances and ads - full model
In the end I included all variables in the OLS model. The MSE and $R^2$ on the test set improved from 0.0238 and 0.8280 to 0.0216 and 0.8387 respectively. It seems small, but overall the MSE improved by over 9%. This is not negligible, yet I expected a bigger improvement. It looks like including latitude and longitude in the model is enough to capture most of the spatial information.
A possible explanation can be found in the shape of Szczecin, which is divided by the Odra river. This separation can be approximated by a plane in lat-lon coordinates.
# All variables
ols_full = OLS()
ols_full.fit(x_train, y_train)
y_pred_test = ols_full.predict(x_test)
y_pred_train = ols_full.predict(x_train)
print('OLS - all variables, train MSE: %.4f R2: %.4f' % (
mse(y_train, y_pred_train),
r2(y_train, y_pred_train)
))
print('OLS - all variables, test MSE: %.4f R2: %.4f' % (
mse(y_test, y_pred_test),
r2(y_test, y_pred_test)
))
Random Forest Regression
To further explore the data I tried to fit a random forest estimator. Unfortunately, it did worse than OLS. I did not tune any hyperparameters, so it may be possible to obtain a better fit by tuning them thoroughly (see the sketch after the results below).
# Regression with RandomForest regressor
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(max_depth=2, random_state=0, n_estimators=100)
forest.fit(x_train, y_train)
y_pred_test = forest.predict(x_test)
y_pred_train = forest.predict(x_train)
print('Random Forest - all variables, train MSE: %.4f R2: %.4f' % (
mse(y_train, y_pred_train),
r2(y_train, y_pred_train)
))
print('Random Forest - all variables, test MSE: %.4f R2: %.4f' % (
mse(y_test, y_pred_test),
r2(y_test, y_pred_test)
))
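For reference, the tuning mentioned above could be sketched with a randomized search like the one below; the parameter grid is an arbitrary assumption and the search is not run here.
# Hedged sketch (not run here): randomized hyperparameter search for the forest.
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV
param_dist = {
    'n_estimators': [100, 300, 500],
    'max_depth': [2, 5, 10, None],
    'min_samples_leaf': [1, 2, 5],
    'max_features': ['sqrt', 0.3, 0.6],
}
forest_search = RandomizedSearchCV(
    RandomForestRegressor(random_state=0),
    param_distributions=param_dist,
    n_iter=20,
    cv=3,
    scoring='neg_mean_squared_error',
    random_state=0,
    n_jobs=-1
)
# forest_search.fit(x_train, y_train.ravel())
# print(forest_search.best_params_, -forest_search.best_score_)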
Neural net
Finally I wanted to try fitting the model with a neural network. The examples I encountered in Python Machine Learning and Hands-On Machine Learning with Scikit-Learn and TensorFlow dealt only with neural networks for classification. After a short search I found an example of using a NN for regression modeling. It used Keras and was quite simple to implement.
I added some modifications that came to mind after reading Hands-On Machine Learning with Scikit-Learn and TensorFlow. First I used a StandardScaler to standardize my input data. Next I changed the size of each layer to account for the larger number of parameters. I used the He normal initializer to obtain more efficient random weights at initialization, the ELU activation function to reduce the vanishing gradient problem, and batch normalization for each layer to avoid exploding gradients.
I trained the model for 200 epochs, saving the best weights along the way.
The best weights fare much better than OLS. However, comparing MSE and $R^2$ between the training and test data, there are considerable differences, suggesting that the model is overfitting the training data. This calls for further tuning of the network. Adding a dropout layer would be a good idea, but I have not implemented it yet for the sake of speed.
The MSE on the test dataset has fallen to 0.0099, that is 54.2% less than for the full OLS. This result is impressive despite the simplicity of the NN model, which still needs a lot of tuning.
# NN
# https://towardsdatascience.com/deep-neural-networks-for-regression-problems-81321897ca33
from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train_std = sc.fit_transform(x_train)
x_test_std = sc.transform(x_test)
NN_model = Sequential()
# The Input Layer :
NN_model.add(Dense(
256,
kernel_initializer='he_normal',
input_dim=x_train_std.shape[1],
))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))
# The Hidden Layers :
NN_model.add(Dense(256, kernel_initializer='he_normal'))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))
#NN_model.add(Dropout(0.5))
NN_model.add(Dense(256, kernel_initializer='he_normal'))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))
NN_model.add(Dense(128, kernel_initializer='he_normal'))
NN_model.add(BatchNormalization())
NN_model.add(Activation('elu'))
# The Output Layer :
NN_model.add(Dense(1, kernel_initializer='he_normal',activation='linear'))
# Compile the network :
NN_model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_absolute_error'])
NN_model.summary()
checkpoint_name = './weights/Weights-{epoch:03d}--{val_loss:.5f}.hdf5'
checkpoint = ModelCheckpoint(
checkpoint_name,
monitor='val_loss',
verbose = 1,
save_best_only = True,
mode ='auto'
)
callbacks_list = [checkpoint]
hist = NN_model.fit(
x_train_std,
y_train,
epochs=200,
batch_size=64,
validation_split = 0.2,
callbacks=callbacks_list,
verbose=2
)
# Load weights file of the best model :
weights_file = './weights/Weights-198--0.01055.hdf5' # choose the best checkpoint
NN_model.load_weights(weights_file) # load it
NN_model.compile(loss='mean_squared_error', optimizer='adam', metrics=['mean_absolute_error'])
y_pred_test = NN_model.predict(x_test_std, batch_size=64)
y_pred_train = NN_model.predict(x_train_std, batch_size=64)
print('Neural network, train MSE: %.4f R2: %.4f' % (
mse(y_train, y_pred_train),
r2(y_train, y_pred_train)
))
print('Neural network, test MSE: %.4f R2: %.4f' % (
mse(y_test, y_pred_test),
r2(y_test, y_pred_test)
))
Looking at the training and validation loss over time, the overfitting becomes clearly visible. There is little improvement in the validation set MSE after epoch 50.
epoch = range(len(hist.history['loss'][1:]))
plt.plot(epoch, hist.history['val_loss'][1:], label='Validation set MSE')
plt.plot(epoch, hist.history['loss'][1:], label='Training set MSE')
plt.xlabel('Epoch')
plt.ylabel('MSE')
plt.legend(loc='best', prop={'size': 20})
Also, looking at the prediction errors for the test set, we can notice some visible patterns in the errors. This is a clear indication of overfitting, again suggesting that we should use dropout in the network in the future (see the sketch after the plot below).
# Analyze errors
plt.scatter(
y_pred_test,
y_pred_test - y_test,
c='steelblue',
marker='o',
edgecolor='white',
alpha=.4
)
plt.hlines(0.0, 11., 14., color='black', lw=2, linestyle='--')
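Following up on the dropout suggestion, here is a hedged sketch of the same architecture with Dropout added after each hidden activation (not trained here; the 0.3 rate is an arbitrary assumption).
# Hedged sketch (not trained here): the same network with Dropout layers added.
from keras.models import Sequential
from keras.layers import Dense, Activation, BatchNormalization, Dropout
NN_model_drop = Sequential()
NN_model_drop.add(Dense(256, kernel_initializer='he_normal',
                        input_dim=x_train_std.shape[1]))
NN_model_drop.add(BatchNormalization())
NN_model_drop.add(Activation('elu'))
NN_model_drop.add(Dropout(0.3))
for units in (256, 256, 128):
    NN_model_drop.add(Dense(units, kernel_initializer='he_normal'))
    NN_model_drop.add(BatchNormalization())
    NN_model_drop.add(Activation('elu'))
    NN_model_drop.add(Dropout(0.3))
NN_model_drop.add(Dense(1, kernel_initializer='he_normal', activation='linear'))
NN_model_drop.compile(loss='mean_squared_error', optimizer='adam',
                      metrics=['mean_absolute_error'])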
Looking at the geographical distribution of prediction errors, the clear spatial pattern of errors visible for the OLS results in the previous entry seems to be gone. There are still some outliers visible.
# Plot geographical distribution of errors
err = y_pred_train - y_train
lat_train = x_train_std[:, 3]
lon_train = x_train_std[:, 4]
plt.figure(figsize=(12*1.1,12))
plt.hexbin(lon_train, lat_train, err, gridsize=40, cmap='bwr', vmin=-0.5, vmax=0.5, reduce_C_function=np.mean)
plt.colorbar()
In the end I save the elements of the model for possible future use, using the joblib module.
import joblib
joblib.dump(
{
'word_pipe': vec_words,
'nn_scaler': sc,
'nn_model': NN_model
},
'nn_szczecin_model.pkl'
)
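For completeness, a minimal sketch of how this bundle could be loaded back and used on a new ad; new_row is a hypothetical array laid out like one row of x above (numeric features first, raw ad text in the last slot).
# Sketch: reload the saved bundle and score a hypothetical new observation.
bundle = joblib.load('nn_szczecin_model.pkl')
# word_vec = bundle['word_pipe'].transform(new_row[:, -1])
# features = np.hstack((new_row[:, :-1], word_vec))
# lprice_pred = bundle['nn_model'].predict(bundle['nn_scaler'].transform(features))
# price_pred = np.exp(lprice_pred)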
Out of curiosity I want to inspect the input layer of my model. I will plot part of the first-layer weight matrix (the first rows correspond to the numeric input variables, the rest to word components) as an image.
Below I can see that some input neurons have relatively large weights for the coordinate variables, while at the same time most of the distance variables have low weights.
weights, bias = NN_model.layers[0].get_weights()
plt.imshow(weights[:201, :], cmap='seismic', clim=(-3, 3))
plt.colorbar()
plt.xlabel('2nd layer')
plt.ylabel('1st layer (variable input)')
Conclusions
In the end the benefit of adding more geographical information was small. There seems to be some improvement coming from it, but its scale makes further pursuit of this direction questionable. What does improve the prediction error is training a neural network on these data. Still, a simple out-of-the-box model is not enough, because such a simple model is prone to overfitting the training dataset. This will require model tuning in the future.
It would be great to use the pickled model to estimate prices from real data: a web service that, given a link to an ad, would extract all the information and estimate the price based on it. It's doable, but maybe next time... ;)
There are some possible enhancements to the model:
- Use road / travel time distances instead of simple Euclidean ones;
- Exclude observations located in uninhabited areas;
- Add Polish word stemmer for ad content vectorization;
- Fine-tune the neural network model;
- Add time dependence.