Szczecin housing market data introduction

When learning Python I got really interested in data scraping. It happened to coincide with an urge to buy a flat in Szczecin, and since I work a lot with data I wanted to at least analyze what the housing market in Szczecin looked like. So I got into web scraping and downloaded all the housing advertisements for Szczecin from otodom.pl. This turned out to be a really nice database. I also use this database to create a price heatmap of Szczecin.

I automated the process and have been downloading the data weekly, every Saturday night, to balance load on the site. I ended up with a nice database of Szczecin advertisements spanning nearly two years.

I no longer wish to buy a flat in Szczecin, but it is still really interesting to show what is in the data.

Note: These are the results of gathering publicly available data from otodom.pl. I do not use this data for commercial purposes and have no claim to ownership of it. Below is just a simple analysis, and given that buying a house is a serious investment it should not be used in any decision making. I also take no responsibility for these results. All of this work has been done in my spare time and has not been financed by anyone. I stopped gathering data as of July 2018.

In [1]:
import os, re, datetime, json
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import pandas as pd
import numpy as np

%matplotlib inline
plt.rcParams["figure.figsize"] = (10,6)

ddir  = r"/home/mateusz/projects/szczecin_analysis/data/"
In [2]:
files = os.listdir(ddir)
# Filter files
data = {
    'file': [],
    'date': [],
    'lines':[]
        }
for f in files:
    # Keep only files named like szczecin_DDMMYY.jl
    if not re.match(r"^szczecin_[0-9]+\.jl", f):
        continue
    data['file'].append(f)
    # The file name encodes the download date as DDMMYY
    date = datetime.datetime.strptime(f[9:15], "%d%m%y").date()
    data['date'].append(date)
    # One JSON line per advertisement
    with open(os.path.join(ddir, f)) as fh:
        num_lines = sum(1 for line in fh)
    data['lines'].append(num_lines)
In [3]:
df = pd.DataFrame(data)
df = df.sort_values(by='date')
# df = df.set_index('date')
df.describe()
Out[3]:
lines
count 96.000000
mean 4312.854167
std 2498.183669
min 0.000000
25% 1626.250000
50% 4743.500000
75% 6723.250000
max 7675.000000

We see that there are observations with 0 lines, where for some reason the data was not downloaded. We next keep only observations with more than 500 lines.

In [4]:
df = df[df['lines'] > 500]
df.describe()
Out[4]:
lines
count 85.000000
mean 4867.105882
std 2084.254309
min 1431.000000
25% 3593.000000
50% 4951.000000
75% 6755.000000
max 7675.000000

We can next graph how the number of observations changed over the data-gathering period.

In [5]:
years = mdates.YearLocator()   # every year
months = mdates.MonthLocator()  # every month
yearsFmt = mdates.DateFormatter('%Y')

fig, ax = plt.subplots()
ax.plot(df['date'], df['lines'])
ax.xaxis.set_major_locator(years)
ax.xaxis.set_major_formatter(yearsFmt)
ax.xaxis.set_minor_locator(months)
ax.set_ylabel('Number of observations')
ax.set_xlabel('Date')
ax.grid(color='g', linestyle='--', linewidth=1, alpha=.3)
ax.get_xaxis().tick_bottom()    
ax.get_yaxis().tick_left()    
ax.spines["top"].set_visible(False)    
#ax.spines["bottom"].set_visible(False)    
ax.spines["right"].set_visible(False)    
#ax.spines["left"].set_visible(False)  

We conclude that since the beginning of the experiment the number of offers on the site has been dropping substantially. The biggest drop occurred at the beginning of 2018, when the number of offers fell from about 4000 to less than 2000.

The question remains whether this is because there are fewer offers on the market or because the service is losing ground to some competitor. Unfortunately, given this dataset, we cannot answer that question.

Data analysis

We first need to load the data, which is stored as JSON Lines.

In [6]:
data = []
for idx in df.index:
    # Each file is a JSON Lines dump of one weekly scrape
    read_data = pd.read_json(os.path.join(ddir, df.loc[idx, 'file']), lines=True)
    read_data['date'] = df.loc[idx, 'date']
    data.append(read_data)

data = pd.concat(data, axis=0)
In [7]:
data.tail()
Out[7]:
_type data_lat data_lon description details floor location poi_lat poi_lon pow price rooms sellType town url date
1604 HomespiderItem 53.413120 14.578210 [Mieszkanie, 82,60 m², Szczecin] Polecam na sprzedaż 4 pokojowe mieszkanie o po... 3 Mieszkanie na sprzedaż, Szczecin, Śródmieście 53.4156 14.577 82.60 350000.0 4 Mieszkanie na sprzedaż Szczecin https://www.otodom.pl/oferta/mieszkanie-82-60-... 2018-06-23
1605 HomespiderItem 53.441479 14.538239 [Mieszkanie, 132 m², Szczecin] OFERTA DLA INWESTORA!!! MIESZKANIE JEST WYNAJĘ... 2 Mieszkanie na sprzedaż, Szczecin, ks. Piotra S... 53.4418 14.5354 132.00 555000.0 5 Mieszkanie na sprzedaż Szczecin https://www.otodom.pl/oferta/mieszkanie-132-m-... 2018-06-23
1606 HomespiderItem 53.432408 14.536559 [Mieszkanie, 105 m², Szczecin] Atrakcyjne mieszkanie (aktualnie wykorzystane ... 1 Mieszkanie na sprzedaż, Szczecin, Centrum, JAG... 53.4328 14.5343 105.00 460000.0 4 Mieszkanie na sprzedaż Szczecin https://www.otodom.pl/oferta/mieszkanie-105-m-... 2018-06-23
1607 HomespiderItem 53.413120 14.578210 [Mieszkanie, 132 m², Szczecin] Piękne i przestronne mieszkanie o powierzchni ... 2 Mieszkanie na sprzedaż, Szczecin, Śródmieście 53.4156 14.577 132.00 525000.0 5 Mieszkanie na sprzedaż Szczecin https://www.otodom.pl/oferta/mieszkanie-132-m-... 2018-06-23
1608 HomespiderItem 53.481732 14.541711 [2 pokoje Apartamenty Warszewo - Szczecin] Wyróżniająca się inwestycja realizowana w prze... 1 Mieszkanie na sprzedaż, Szczecin, Warszewo, M... 53.4826 14.5453 42.61 249000.0 2 Mieszkanie na sprzedaż Szczecin https://www.otodom.pl/oferta/2-pokoje-apartame... 2018-06-23

We see that we won't be needing some of the columns, such as _type, sellType or town. The details column includes the whole text of the offer. Let's see what is inside the data. We keep url for checking for duplicates.

In [8]:
data['date'] = pd.to_datetime(data['date'])
data = data.drop(columns=['_type', 'sellType', 'town'])
data.describe()
Out[8]:
data_lat data_lon pow price
count 413704.000000 413704.000000 413704.000000 4.137040e+05
mean 53.196654 14.488744 68.401310 3.016699e+05
std 3.527650 0.976426 65.261838 1.308628e+05
min 0.000000 0.000000 11.000000 3.900000e+04
25% 53.420000 14.521824 47.740000 2.190000e+05
50% 53.432830 14.546396 62.000000 2.790000e+05
75% 53.445569 14.554006 82.000000 3.530000e+05
max 54.524587 53.221296 6330.000000 6.990000e+06

The average size of an offered apartment was 68.4 m$^2$ and the average price was around 300k PLN.

We see that geographical coordinates, area and price are numeric. Apparently the floor variable has some non-numeric values.

In [9]:
data['floor'].value_counts()
Out[9]:
1           89674
4           78874
2           74595
parter      71983
3           64151
5            9602
6            7014
10           3671
8            3333
> 10         3258
7            3164
9            2437
poddasze      510
4             419
2             252
3             232
1             178
11            120
5              92
10             64
8              27
12             23
6              17
suterena        7
28              6
26              1
Name: floor, dtype: int64

We can see that the ground floor is described with the word parter. There are also some apartments below ground level (suterena), but we will consider them as ground floor. We map the attic (poddasze) to floor 5 and treat everything above the 10th floor as floor 10.

In [10]:
data.loc[data['floor'] == 'parter', 'floor'] = '0'
data.loc[data['floor'] == 'suterena', 'floor'] = '0'
data.loc[data['floor'] == 'poddasze', 'floor'] = '5'
data.loc[data['floor'] == '> 10', 'floor'] = '11'

data['floor'] = pd.to_numeric(data['floor'])
data.loc[data['floor'] > 10, 'floor'] = 10

data['floor'].value_counts()
Out[10]:
1     89852
4     79293
2     74847
0     71990
3     64383
5     10204
10     7143
6      7031
8      3360
7      3164
9      2437
Name: floor, dtype: int64
In [11]:
data['rooms'].value_counts()
Out[11]:
3                137132
2                132763
4                 55673
1                 25911
3                 16997
2                 15722
5                 13671
4                  7372
1                  4063
6                  2080
5                  1595
6                   289
7                   233
8                    76
7                    70
9                    32
więcej niż 10        13
8                    12
Name: rooms, dtype: int64
In [12]:
data.loc[data['rooms'] == 'więcej niż 10', 'rooms'] = '10'
data['rooms'] = pd.to_numeric(data['rooms'])
data['rooms'].value_counts()
Out[12]:
3     154129
2     148485
4      63045
1      29974
5      15266
6       2369
7        303
8         88
9         32
10        13
Name: rooms, dtype: int64

SQL database export

The floor and room numbers look fine now. Next we can visualize our data; for visualization we will create a sample without duplicates. Before we do that, let's write the data into a better format for further analysis. I want to be able to load just the apartment data without the descriptions, so I will go with an SQL database; a simple SQLite database will do.

We create a simple schema with just two tables with the following code:

CREATE TABLE adds (
    id integer PRIMARY KEY AUTOINCREMENT,
    description varchar,
    details varchar,
    location varchar,
    url varchar,
    price float,
    pow float,
    rooms integer,
    floor integer,
    data_lat float,
    data_lon float
);

CREATE TABLE dates (
    id integer PRIMARY KEY AUTOINCREMENT,
    add_id integer,
    date datetime
);
In [13]:
import sqlite3 as db

table_clear1 = '''DROP TABLE IF EXISTS adds; '''
table_clear2 = '''DROP TABLE IF EXISTS dates;'''
table_create1 = '''CREATE TABLE adds (
	id integer PRIMARY KEY AUTOINCREMENT,
	description varchar,
	details varchar,
	location varchar,
	url varchar,
	price float,
	pow float,
	rooms integer,
	floor integer,
	data_lat float,
	data_lon float
);'''
table_create2 = '''
CREATE TABLE dates (
	id integer PRIMARY KEY AUTOINCREMENT,
	add_id integer,
	date datetime
);
'''

scn_db = db.connect(ddir + 'szczecin_housing_290718.sqlite')
scn_cur = scn_db.cursor()

scn_cur.execute(table_clear1)
scn_cur.execute(table_clear2)

scn_cur.execute(table_create1)
scn_cur.execute(table_create2)

scn_db.commit()

Now we can populate our table with observations from the dataframe.

In [14]:
# Keep only the first occurrence of each unique advertisement
unique_cols = ['location', 'pow', 'floor', 'rooms', 'data_lat', 'data_lon', 'url']
unique_adds_ids = ~data.duplicated(unique_cols, keep='first')

unique_adds = data[unique_adds_ids].drop(columns=['date', 'poi_lat', 'poi_lon'])
unique_adds = unique_adds.reset_index(drop=True)
unique_adds['id'] = unique_adds.index
In [15]:
unique_adds.tail()
Out[15]:
data_lat data_lon description details floor location pow price rooms url id
48772 53.39368 14.51525 [Centrum 43 m2 po gen remoncie 269tys] Apartament po generalnym remoncie przy Wałach ... 3 Mieszkanie na sprzedaż, Szczecin, Centrum, Hen... 43.00 269000.0 2 https://www.otodom.pl/oferta/centrum-43-m2-po-... 48772
48773 53.39368 14.51525 [Mieszkanie z dużym potencjałem, I p. w centrum.] Na sprzedaż mieszkanie zlokalizowane w centrum... 1 Mieszkanie na sprzedaż, Szczecin, Mazurska 108.68 349000.0 4 https://www.otodom.pl/oferta/mieszkanie-z-duzy... 48773
48774 53.39368 14.51525 [Mieszkanie, 85 m², Szczecin] DUŻE mieszkanie 3 pokojowe o powierzchni 85m2 ... 1 Mieszkanie na sprzedaż, Szczecin, Centrum, Ksi... 85.00 299000.0 3 https://www.otodom.pl/oferta/mieszkanie-85-m-s... 48774
48775 53.39368 14.51525 [Mieszkanie, 54 m², Szczecin] Sprzedaż mieszkania z garażem łącznie za dod... 1 Mieszkanie na sprzedaż, Szczecin, Śródmieście,... 54.00 270000.0 2 https://www.otodom.pl/oferta/mieszkanie-54-m-s... 48775
48776 53.43283 14.54819 [Mieszkanie, 97,38 m², Szczecin] Przestronne mieszkanie rozkładowe w kamienicy ... 2 Mieszkanie na sprzedaż, Szczecin, Centrum 97.38 320000.0 3 https://www.otodom.pl/oferta/mieszkanie-97-38-... 48776

Still, the description column is coded as a list. We need to correct that.

In [16]:
unique_adds['description'] = unique_adds['description'].apply(lambda x: x[0])

We can now upload the data to the SQLite database.

In [17]:
unique_adds.to_sql("adds", scn_db, if_exists='append', index=False)
In [18]:
print(scn_cur.execute('SELECT * FROM adds WHERE details != "" LIMIT 2;').fetchall())
[(8962, 'Malczewskiego, 3pokoje,55m2', 'Na sprzedaż mieszkanie trzypokojowe położone przy ulicy Malczewskiego w Szczecinie. Na powierzchnię niespełna 55m2 składają się trzy pokoje( jeden przejściowy), łazienka z wc, kuchnia, oraz przedpokój. Mieszkanie wymaga nakładów finansowych. Do mieszkania przynależy piwnica. Zapraszam na prezentację oferty.', 'Mieszkanie na sprzedaż, Szczecin, Centrum', 'https://otodom.pl/oferta/malczewskiego-3pokoje-55m2-ID371WG.html', 235000.0, 54.9, 3, 4, 53.4354056, 14.5612614), (8963, 'Mieszkanie w domu z dużym ogrodem. Dąbie - super l', 'Do zaoferowania mam Państwu bezczynszowe mieszkanie, znajdujące się na parterze dwurodzinnego domu w Dąbiu.  Łączna powierzchnia mieszkania to 87 m2, na która składają się: salon o powierzchni 18 m2, sypialnia 17m2, dwa pokoje po 10 m2 oraz kuchnia, łazienka z wc i przedpokój.  Z salonu wyjście na zadaszony taras o powierzchni 7m2, z którego wychodzimy na duży, zadbany ogród (450m2). Poczucie prywatności zapewnia oddzielenie się od sąsiadów żywopłotem oraz tujami. Dodatkowo wydzielono miejsce na stół, krzesła czy pawilon ogrodowy poprzez zrobienie ogrodowego tarasu o powierzchni 12m2.  Mieszkanie jest w bardzo dobrym stanie. w 2008 roku wykonano generalny remont wraz z wymianą wszystkich instalacji, a w 2014 mieszkanie zostało odświeżone, wymieniono podłogi.  Ogrzewanie zapewnia piec gazowy, lecz dzięki ociepleniu budynku, mieszkanie jest bardzo ekonomiczne. W kuchni, łazience i przedpokoju ogrzewanie podłogowe. Dodatkowo do mieszkania przynależy duża piwnica składająca się z 3 pomieszczeń, pomieszczenia na strychu oraz 2 miejsca postojowe dla samochodów na podwórku na podjeździe.  Mimo, że mieszkanie znajduje się w domu dwurodzinnym, wejścia do mieszkań są całkowicie niezależne.  Wielkim atutem nieruchomości jest lokalizacja. Bliskość centrum Dąbia sprawia, że nawet na pieszo mamy dostęp do wielu sklepów, wszelkich urzędów, szkół itp. Mimo to z mieszkania możemy wyjść na ogród i w spokoju odpocząć.  Termin wydania mieszkania to połowa 2017 roku. Więcej informacji chętnie udzielę telefonicznie. Zapraszam na prezentację! - Nasza firma pośredniczy w bezpłatnym doborze najtańszego kredytu spośród 25 banków na zakup wybranej przez Państwa nieruchomości. Jako jedni z nielicznych pracujemy w systemie wymiany ofert z biurami nieruchomości na terenie całej Polski. Posiadamy własne cztery polisy OC. Nr licencji pośrednika: 17866. Nasza firma posiada 5 oddziałów. CENTRALA FIRMY: Nasze oddziały: Oddział GOLENIÓW ul. M. Konopnickiej 76 72-100 Goleniów Tel.  Oddział STARGARD SZCZECIŃSKI ul. B. Chrobrego 8a/7 (CH Starówka) 73-110 Stargard Szczeciński Tel.  Oddział NOWOGARD ul. 15 lutego 2 (I piętro) 72-200 Nowogard Tel.  Oddział SZCZECIN PRAWOBRZEŻE ul. Bagienna 36C, przy Andrzeja Struga 70-772 Szczecin Tel.  Oddział ŚWINOUJŚCIE ul. Gen. Józefa Bema 7/2 (I piętro) 72-600 Świnoujście Tel.  Zapraszamy do naszych oddziałów!  ::oferta eksportowana z programu mediaRent::', 'Mieszkanie na sprzedaż, Szczecin, Dąbie', 'https://otodom.pl/oferta/mieszkanie-w-domu-z-duzym-ogrodem-dabie-super-l-ID3723V.html', 395000.0, 87.0, 4, 0, 53.4285438, 14.5528116)]

It worked! Now we merge the unique_adds DataFrame with the full data and keep just the ids from unique_adds together with the dates.

In [19]:
data_export = data.reset_index()
data_export['index'] = data_export.index
all_dates = data_export.merge(unique_adds, how='left', on=unique_cols)[['index', 'date', 'id']]

print(data_export.shape, unique_adds.shape, all_dates.shape)

all_dates.tail()
(413704, 14) (48777, 11) (413704, 3)
Out[19]:
index date id
413699 413699 2018-06-23 16718
413700 413700 2018-06-23 46878
413701 413701 2018-06-23 16637
413702 413702 2018-06-23 16498
413703 413703 2018-06-23 16221
In [20]:
all_dates = all_dates.rename(columns={'index' : 'id', 'id' : 'add_id'})
all_dates = all_dates.sort_values('id')
all_dates.tail()
Out[20]:
id date add_id
413699 413699 2018-06-23 16718
413700 413700 2018-06-23 46878
413701 413701 2018-06-23 16637
413702 413702 2018-06-23 16498
413703 413703 2018-06-23 16221
In [21]:
all_dates.to_sql("dates", scn_db, if_exists='append', index=False)
print(scn_cur.execute('SELECT * FROM dates LIMIT 5;').fetchall())
scn_db.close()
[(0, 0, '2016-10-01 00:00:00'), (1, 1, '2016-10-01 00:00:00'), (2, 2, '2016-10-01 00:00:00'), (3, 3, '2016-10-01 00:00:00'), (4, 4, '2016-10-01 00:00:00')]

Success! All in all the export worked, and now we have a 92 MB database instead of 680 MB of JSON Lines files.
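
One benefit of this layout is that we can later load just the columns we need. A minimal illustrative sketch (not part of the original pipeline) of reading the ads back from SQLite, skipping the long text columns, and joining them with their observation dates could look like this:

import sqlite3 as db
import pandas as pd

scn_db = db.connect(ddir + 'szczecin_housing_290718.sqlite')
# Load only the numeric ad attributes together with the dates each ad was seen on
query = '''
SELECT a.id, a.price, a.pow, a.rooms, a.floor, a.data_lat, a.data_lon, d.date
FROM adds a
JOIN dates d ON d.add_id = a.id;
'''
adds_with_dates = pd.read_sql_query(query, scn_db, parse_dates=['date'])
scn_db.close()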

Summaries

First we want to look at how the price depends on the area.

In [22]:
duplicate_verify = ['details', 'location', 'pow', 'floor', 'rooms', 'data_lat', 'data_lon', 'url']
sample = ~data.duplicated(duplicate_verify, keep='first')

data[sample].plot.scatter('price', 'pow')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd9b99df128>

We see that there are some outliers with area greater than 5000 m$^2$. We will also consider everything priced above 3M PLN as an outlier.

In [23]:
data = data.loc[(data['pow'] < 5000) & (data['price'] < 3e6)]

sample = ~data.duplicated(duplicate_verify, keep='first')

data[sample].plot.scatter('pow', 'price',  alpha=0.1)
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd9b99fb278>

There seems to be a lot of variance with increasing price, so it is good to check the distribution of price and log price. We will once again plot, but with logs this time.

We will also calculate the log of area.

In [24]:
fig, ax = plt.subplots(ncols=2, nrows=1)
data.loc[sample, 'price'].plot.hist(bins=100, ax=ax[0])
ax[0].set_xlabel('Price, PLN')
data.loc[sample, 'price'].plot.hist(bins=100, logx=True, ax=ax[1])
ax[1].set_xlabel('Price, log PLN')
Out[24]:
Text(0.5,0,'Price, log PLN')
In [25]:
data['lprice'] = np.log(data['price'])
data['lpow'] = np.log(data['pow'])
data[sample].plot.scatter('lpow', 'lprice',  alpha=0.1)
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd9beae1c88>

Now let's see how the price, area and price per m$^2$ changed over time.

In [26]:
data['price_m'] = data['price'] / data['pow']

summ_cols = ['date', 'price', 'pow', 'price_m']
time_averages = data[summ_cols].groupby('date').aggregate(np.mean)
In [27]:
fig , ax= plt.subplots(ncols=1, nrows=3)
time_averages.plot.line(y='price', ax=ax[0])
time_averages.plot.line(y='pow', ax=ax[1])
time_averages.plot.line(y='price_m', ax=ax[2])

for axis in [0, 1, 2]:
    ax[axis].xaxis.set_major_locator(years)
    ax[axis].xaxis.set_major_formatter(yearsFmt)
    ax[axis].xaxis.set_minor_locator(months)
    ax[axis].grid(color='g', linestyle='--', linewidth=1, alpha=.3)
    ax[axis].get_xaxis().tick_bottom()    
    ax[axis].get_yaxis().tick_left()    
    ax[axis].spines["top"].set_visible(False)    
    #ax[axis].spines["bottom"].set_visible(False)    
    ax[axis].spines["right"].set_visible(False)    
    #ax[axis].spines["left"].set_visible(False)   
    if axis < 2:
        ax[axis].set_xticklabels([])
        ax[axis].set_xlabel("")
    else:
        ax[axis].set_xlabel('Date')
In [28]:
time_averages.tail()
Out[28]:
price pow price_m
date
2018-05-26 331198.355851 66.996952 5120.528661
2018-06-02 332626.154605 66.974287 5134.696668
2018-06-09 334994.679686 67.655043 5136.716990
2018-06-16 337102.781377 68.329473 5128.648129
2018-06-23 333983.030733 67.655736 5153.752442

We see that the average area does not change much, while the price rises continuously over the analyzed period.
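
To put a rough number on that rise, a quick illustrative check compares the average price per m$^2$ in the first and last recorded week:

# Relative change of the average price per m2 between the first and last week
first, last = time_averages['price_m'].iloc[0], time_averages['price_m'].iloc[-1]
print("Change over the period: {:.1%}".format(last / first - 1))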

We also look at how long an advertisement stays available on average.

In [29]:
data['group'] = data.groupby(duplicate_verify).grouper.group_info[0]
gr = data.groupby('group')
gr_dur = gr['date'].max() - gr['date'].min()
print(gr_dur.describe())
gr_dur.dt.days.hist(bins=60)
count                      63966
mean     43 days 06:23:22.720195
std      63 days 21:29:32.398003
min              0 days 00:00:00
25%              7 days 00:00:00
50%             22 days 00:00:00
75%             48 days 00:00:00
max            545 days 00:00:00
Name: date, dtype: object
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd9c18f95f8>

We see that on average an ad stays up for 43 days. The median time recorded is 22 days, around 3 weeks. Unfortunately our data was collected only once every 7 days, so these results are likely biased. There are also many advertisements with a survival time of 0 days, indicating that they were already gone before the next run.
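
A quick illustrative check of how many ads vanished before the next weekly run:

# Share of ads that were seen in only one weekly snapshot
print("{:.1%} of ads have a recorded lifetime of 0 days".format((gr_dur.dt.days == 0).mean()))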

Linear regression analysis

Having cleaned our data, we can next analyze prices according to different criteria, like area, number of rooms and the floor the apartment is on. For that we will use the statsmodels package.

We also want to use only the first occurrence of each observation.

In [30]:
import statsmodels.api as sm
In [31]:
sample = ~data.duplicated(duplicate_verify, keep='first')

formula = "price ~ pow"
m0 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m0.summary()
Out[31]:
OLS Regression Results
Dep. Variable: price R-squared: 0.580
Model: OLS Adj. R-squared: 0.580
Method: Least Squares F-statistic: 8.837e+04
Date: Wed, 08 Aug 2018 Prob (F-statistic): 0.00
Time: 06:32:36 Log-Likelihood: -8.1194e+05
No. Observations: 63966 AIC: 1.624e+06
Df Residuals: 63964 BIC: 1.624e+06
Df Model: 1
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 6.937e+04 796.518 87.089 0.000 6.78e+04 7.09e+04
pow 3393.6968 11.416 297.275 0.000 3371.321 3416.072
Omnibus: 50588.275 Durbin-Watson: 1.830
Prob(Omnibus): 0.000 Jarque-Bera (JB): 7365727.648
Skew: 3.042 Prob(JB): 0.00
Kurtosis: 55.217 Cond. No. 178.


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

From this initial regression we see that increasing the area by 1 m$^2$ increases the price by roughly 3400 PLN. Let's see how this changes when we add the floor number and the number of rooms.
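
As a quick sanity check of the fitted line, we can predict the price of a hypothetical 60 m$^2$ apartment, which given the coefficients above comes out to roughly 273k PLN:

# Predicted price for a hypothetical 60 m2 apartment: intercept + 60 * pow coefficient
m0.predict(pd.DataFrame({'pow': [60]}))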

We also add location and the length of the description.

In [32]:
data['desc_len'] = data['details'].apply(len)

formula = "price ~ pow + floor + rooms + data_lat + data_lon + data_lat * data_lon + desc_len"
m1 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m1.summary()
Out[32]:
OLS Regression Results
Dep. Variable: price R-squared: 0.591
Model: OLS Adj. R-squared: 0.591
Method: Least Squares F-statistic: 1.320e+04
Date: Wed, 08 Aug 2018 Prob (F-statistic): 0.00
Time: 06:32:36 Log-Likelihood: -8.1110e+05
No. Observations: 63966 AIC: 1.622e+06
Df Residuals: 63958 BIC: 1.622e+06
Df Model: 7
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 8.191e+04 3936.631 20.807 0.000 7.42e+04 8.96e+04
pow 3047.9401 18.703 162.966 0.000 3011.282 3084.598
floor -2375.9730 147.611 -16.096 0.000 -2665.291 -2086.655
rooms 1.13e+04 532.995 21.205 0.000 1.03e+04 1.23e+04
data_lat 8675.7804 746.630 11.620 0.000 7212.385 1.01e+04
data_lon 4.493e+04 5345.441 8.404 0.000 3.44e+04 5.54e+04
data_lat:data_lon -1470.5759 141.645 -10.382 0.000 -1748.200 -1192.951
desc_len 9.9362 0.364 27.308 0.000 9.223 10.649
Omnibus: 54714.590 Durbin-Watson: 1.840
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8977024.174
Skew: 3.444 Prob(JB): 0.00
Kurtosis: 60.626 Cond. No. 2.73e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.73e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

So each additional room adds about 11k PLN to the price, while each additional floor lowers the price by about 2.4k PLN. Also, the longer the description, the higher the price: each character "costs" around 10 PLN ;)
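
For instance, taking the coefficient above at face value, a 1000-character description is associated with roughly 10k PLN of the price:

# Price "contribution" of a 1000-character description according to the model
print(m1.params['desc_len'] * 1000)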

Let's see what happens after adding quarter dummies and a quarter-area interaction.

In [33]:
data['quarter'] = data['date'].dt.year.apply(str) + \
    "Q" + data['date'].dt.quarter.apply(str)

formula = "price ~ C(quarter) * pow + floor + rooms + data_lat + data_lon + data_lat * data_lon + desc_len + C(quarter)"
m1 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m1.summary()
Out[33]:
OLS Regression Results
Dep. Variable: price R-squared: 0.598
Model: OLS Adj. R-squared: 0.598
Method: Least Squares F-statistic: 5005.
Date: Wed, 08 Aug 2018 Prob (F-statistic): 0.00
Time: 06:32:37 Log-Likelihood: -8.1055e+05
No. Observations: 63966 AIC: 1.621e+06
Df Residuals: 63946 BIC: 1.621e+06
Df Model: 19
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 8.299e+04 3981.936 20.842 0.000 7.52e+04 9.08e+04
C(quarter)[T.2017Q1] -8913.9666 2659.061 -3.352 0.001 -1.41e+04 -3702.205
C(quarter)[T.2017Q2] 6907.2167 2095.344 3.296 0.001 2800.340 1.1e+04
C(quarter)[T.2017Q3] 1.006e+04 3108.769 3.236 0.001 3966.920 1.62e+04
C(quarter)[T.2017Q4] 2.091e+04 3081.835 6.783 0.000 1.49e+04 2.69e+04
C(quarter)[T.2018Q1] 1.04e+04 3507.255 2.964 0.003 3522.359 1.73e+04
C(quarter)[T.2018Q2] 1.127e+04 3943.485 2.857 0.004 3538.005 1.9e+04
pow 3057.5999 21.765 140.485 0.000 3014.941 3100.259
C(quarter)[T.2017Q1]:pow 88.8832 38.815 2.290 0.022 12.805 164.961
C(quarter)[T.2017Q2]:pow -77.3315 29.734 -2.601 0.009 -135.611 -19.052
C(quarter)[T.2017Q3]:pow -62.2346 46.988 -1.324 0.185 -154.330 29.861
C(quarter)[T.2017Q4]:pow -58.8361 45.783 -1.285 0.199 -148.570 30.898
C(quarter)[T.2018Q1]:pow 237.5971 52.472 4.528 0.000 134.752 340.442
C(quarter)[T.2018Q2]:pow 377.9014 57.809 6.537 0.000 264.595 491.208
floor -2368.4159 146.429 -16.175 0.000 -2655.416 -2081.415
rooms 1.119e+04 530.117 21.111 0.000 1.02e+04 1.22e+04
data_lat 8629.3926 740.431 11.655 0.000 7178.146 1.01e+04
data_lon 4.565e+04 5300.945 8.612 0.000 3.53e+04 5.6e+04
data_lat:data_lon -1487.2635 140.469 -10.588 0.000 -1762.584 -1211.943
desc_len 8.5650 0.380 22.518 0.000 7.820 9.311
Omnibus: 54502.764 Durbin-Watson: 1.870
Prob(Omnibus): 0.000 Jarque-Bera (JB): 8945236.463
Skew: 3.420 Prob(JB): 0.00
Kurtosis: 60.528 Cond. No. 2.73e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.73e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

Here we can observe that the price-per-m$^2$ slope increases in the last two quarters compared to the base quarter, 2016Q4.
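
To make that concrete, the implied price-per-m$^2$ slope in a given quarter is the base pow coefficient plus the corresponding interaction term, e.g. for 2018Q2 roughly 3058 + 378 ≈ 3436 PLN per m$^2$:

# Implied PLN-per-m2 slope in 2018Q2 (base slope plus interaction coefficient)
print(m1.params['pow'] + m1.params['C(quarter)[T.2018Q2]:pow'])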

Out of curiosity, we also want to see how the time an advertisement spends online depends on our independent variables.

In [34]:
life = gr_dur.reset_index()
life = life.rename(columns={'date':'life'})
life['life'] = life['life'].dt.days

data = data.merge(life, how='left', left_on='group', right_on='group')

data['price_k'] = data['price'] / 1e3

sample = ~data.duplicated(duplicate_verify, keep='first')
formula = "life ~ price_k + pow + floor + rooms + data_lat + data_lon + data_lat * data_lon + desc_len"
m1 = sm.OLS.from_formula(formula, data=data[sample]).fit()
m1.summary()
Out[34]:
OLS Regression Results
Dep. Variable: life R-squared: 0.014
Model: OLS Adj. R-squared: 0.014
Method: Least Squares F-statistic: 111.5
Date: Wed, 08 Aug 2018 Prob (F-statistic): 7.46e-186
Time: 06:32:39 Log-Likelihood: -3.5624e+05
No. Observations: 63966 AIC: 7.125e+05
Df Residuals: 63957 BIC: 7.126e+05
Df Model: 8
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
Intercept 7.7096 3.224 2.391 0.017 1.391 14.028
price_k 0.0239 0.003 7.413 0.000 0.018 0.030
pow 0.2297 0.018 12.651 0.000 0.194 0.265
floor -0.1355 0.121 -1.123 0.262 -0.372 0.101
rooms -2.3239 0.437 -5.323 0.000 -3.180 -1.468
data_lat -1.1783 0.610 -1.931 0.053 -2.374 0.017
data_lon -9.7080 4.365 -2.224 0.026 -18.264 -1.152
data_lat:data_lon 0.2869 0.116 2.479 0.013 0.060 0.514
desc_len 0.0015 0.000 5.085 0.000 0.001 0.002
Omnibus: 41549.719 Durbin-Watson: 1.234
Prob(Omnibus): 0.000 Jarque-Bera (JB): 475143.906
Skew: 3.062 Prob(JB): 0.00
Kurtosis: 14.865 Cond. No. 2.77e+04


Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.77e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

A quick look at the coefficients suggests that the bigger and more expensive an apartment is, the longer it stayed on the website, while the more rooms it had, the quicker it was gone. The coefficients on geographical position are large in magnitude, but only borderline significant.

Also the longer the description, the longer an ad stayed on the website.

The R$^2$ of this model is really low, indicating that most of what we record is noise and the model does not explain listing lifetime well.

Conclusions

This turns out to be a fairly nice dataset, and this is not everything that can be done with it. I will try to show further what else can be done, given that I have the geographical coordinates of each apartment and the full text of each advertisement. More will come.