While reading through Hands-On Machine Learning with Scikit-Learn and TensorFlow I finished the chapter on Recurrent Neural Networks (RNNs). After the first reading I felt lost, but now I understand much more.
I'm generally a person who learns the most by doing. One of the exercises for this chapter encouraged the reader to go through the official TensorFlow tutorial on RNNs. It seemed easy. I went through it, but I added a twist: I did not want to use the original data. Instead, I had the idea to create a text generator trained on the Polish Parliament minutes, which are freely available on the internet as PDFs.
It seemed entertaining to me to create an artificial Polish Parliament session generator, and it was an idea I wanted to try out.
In this notebook I will show how to download the minutes, convert them into text and then train the model using the TF tutorial code.
from bs4 import BeautifulSoup
import requests
import re
import time
import os
import numpy as np
import tensorflow as tf
import pyprind
# Matplotlib options
import matplotlib
import matplotlib.pyplot as plt
matplotlib.rcParams['axes.labelsize'] = 14
matplotlib.rcParams['xtick.labelsize'] = 12
matplotlib.rcParams['ytick.labelsize'] = 12
matplotlib.rcParams['figure.figsize'] = (20.0, 10.0)
save_folder = './rnn_sejm/data/'
# Enable eager execution
tf.enable_eager_execution()
Scraping Polish Parliament minutes
I wanted to try out my web scraping skills, so I decided to automate downloading the minutes using loops and BeautifulSoup to crawl the main HTML document. Thankfully the Parliament's website is pretty simple, and it was easy to download the page for every year of the current term, from 2015 until this year.
The HTML includes a simple table with links to the PDFs for each sitting, marked with class="pdf". For each such link I simply downloaded the file into the output folder. I also added a 1 s pause so as not to spam the server.
I also added a progress bar using PyPrind.
I ended up with 213 PDFs weighing over 375 MiB.
for year in [2019, 2018, 2017, 2016, 2015]:
    sejm_url = 'http://www.sejm.gov.pl/Sejm8.nsf/stenogramy.xsp?rok={}'
    sejm_url = sejm_url.format(year)
    page_response = requests.get(sejm_url, timeout=5)
    page_content = BeautifulSoup(page_response.content, "html.parser")
    table = page_content.find('table')
    links = table.findAll('a', {'class': 'pdf'})
    dl_bar = pyprind.ProgBar(
        len(links),
        track_time=True,
        title='Downloading PDFs for {}'.format(year)
    )
    for link in links:
        pdf_url = link.get('href')
        dl_name = pdf_url.split('/')[-1:][0]
        if re.match(r'.*\.pdf$', dl_name):
            r = requests.get(pdf_url)
            if r.status_code == 200:
                save_file = os.path.join(save_folder, 'pdf', dl_name)
                with open(save_file, 'wb') as f:
                    f.write(r.content)
                time.sleep(1)
            else:
                print('Could not download: ', pdf_url)
        dl_bar.update()
I decided to use pdftotext from Xpdf to convert the downloaded documents to text.
It turned out it wasn't as easy as I'd expected. I tried using the Python PyPDF2 library, but I couldn't get any reliable results quickly. I also tried Calibre's converter, but the PDFs have a two-column layout and the converter mixed up the lines. Only Xpdf converted the text with the fewest mistakes.
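For reference, the PyPDF2 attempt looked roughly like the sketch below. This is reconstructed from memory rather than the exact code I ran, but it shows the kind of extraction that did not give me reliable results on these PDFs.
import PyPDF2
# Rough sketch of the PyPDF2 approach (a reconstruction, not the exact code I used)
sample_pdf = os.path.join(save_folder, 'pdf',
                          os.listdir(os.path.join(save_folder, 'pdf'))[0])
with open(sample_pdf, 'rb') as f:
    reader = PyPDF2.PdfFileReader(f)
    # Extract the text of every page
    pages = [reader.getPage(i).extractText()
             for i in range(reader.getNumPages())]
print(pages[0][:500])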
import subprocess
import pyprind

pdf_files = os.listdir(os.path.join(save_folder, 'pdf'))
bar = pyprind.ProgBar(
    len(pdf_files),
    track_time=True,
    title='Converting PDF to txt'
)
for f in pdf_files:
    in_pdf = os.path.join(save_folder, 'pdf', f)
    txt_name = f.split('.')[0] + '.txt'
    out_txt = os.path.join(save_folder, 'raw_text', txt_name)
    subprocess.run(["pdftotext", in_pdf, out_txt])
    bar.update()
Unfortunately the conversion wasn't perfect. I got a lot of additional material that I did not need, like the table of contents, headers, footers and page numbers.
Below is a quick and dirty approach to removing this unnecessary text from the content. Some unnecessary text still remains, mostly the MPs' names printed at the beginning of each page to indicate the current speaker. For this short exercise I think it is enough; for a more serious analysis I would have to be more thorough.
def clean_steno(in_file, out_file):
    text = ''
    with open(in_file, 'r') as f:
        text = f.read()
    # Remove form feed (page break) characters
    text = re.sub(r'\f', '', text)
    # Remove the table of contents ('Spis treści')
    toc = re.search(r'\(Na posiedzeniu', text)
    if toc is not None:
        text = text[toc.span(0)[0]:]
    else:
        print('Did not find TOC for ', in_file)
    # Remove the ending
    toc = re.search(r'\nTŁOCZONO ', text)
    if toc is not None:
        text = text[:toc.span(0)[0]]
    else:
        print('Did not find ending for ', in_file)
    # Normalize some characters
    for ch in ['–', '—', '−']:
        text = text.replace(ch, '-')
    for ch in ['«', '»', '”', '„']:
        text = text.replace(ch, '"')
    text = text.replace('§', 'par. ')
    # Remove headers, multiple blank lines, line breaks, page numbers, etc.
    text = re.sub(r'\n\n[0-9]+\n+([0-9]+\n)?', '\n[line]\n', text)
    text = re.sub(r'\nSpis treści\n', '\n', text)
    text = re.sub(r'[0-9]+\. posiedzenie .*\n+', '\n[line]\n', text)
    text = re.sub(r'\[line\]\n(.*\n){2}\n+[A-Z].*\n', '\n[line]\n', text)
    text = re.sub(r'-\n+([A-Z\[].*\n+)+', '', text)
    text = re.sub(r'\n\n(.*\n)\[line\]', '\n[line]', text)
    text = re.sub(r'\[line\]\n[A-Z].*\n\n', '', text)
    text = re.sub(r'\[line\]\n', '\n', text)
    text = re.sub(r'-\n+', '', text)
    text = re.sub(r'…', '...', text)
    text = re.sub(r'([^\.\)\?\!:\n])\n+', r'\1 ', text)
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Save the cleaned file
    with open(out_file, 'w') as f:
        f.write(text)
txt_files = os.listdir(os.path.join(save_folder, 'raw_text'))
bar = pyprind.ProgBar(
    len(txt_files),
    track_time=True,
    title='Cleaning text files'
)
for txt_name in txt_files:
    in_file = os.path.join(save_folder, 'raw_text', txt_name)
    out_file = os.path.join(save_folder, 'clean_text', txt_name)
    clean_steno(in_file, out_file)
    bar.update()
I decided to select just one full sitting (the 23rd) to train my model on. Unfortunately I only have my laptop, with Linux running in a virtual machine, so my computational power is limited. Even for the 2.5M characters selected, each epoch of training takes around an hour to complete.
# Build corpus
files = os.listdir(os.path.join(save_folder, 'raw_text'))
# We read the 23rd sitting
files = [file for file in files if re.match(r'^23', file)]
corpus = ''
for file in files:
    text_file = os.path.join(save_folder, 'clean_text', file)
    text = open(text_file, 'r').read()
    corpus += text
print('Length of text: {} characters'.format(len(corpus)))
Model training
In the next lines I follow the approach from the official TensorFlow tutorial.
I didn't modify the code much; I only changed the loss function to tf.losses.sparse_softmax_cross_entropy. The tutorial used a nightly build of TF while I use the official release, and I got an error when using the loss function from the example.
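From what I remember, the tutorial defined its loss roughly as in the sketch below (not a verbatim quote); presumably it was the from_logits keyword that my stable release did not accept, which is why I define my own loss function further down.
# Roughly the tutorial's loss, from memory (not verbatim); this is the variant
# that raised an error on the stable TF release I had installed
def tutorial_loss(labels, logits):
    return tf.keras.losses.sparse_categorical_crossentropy(
        labels, logits, from_logits=True
    )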
At this point I feel a bit like a script kiddie following someone else's code. I really like to write my own procedures, because this is how I learn. Even when learning Python, ML or NNs from books I refrained from copy-pasting the code from the repositories. Instead I've written everything myself, by hand. For me this is a way of getting to know the most commonly used classes and functions. I see no added value in copying and pasting someone else's code when learning a new library; it would just give me a false feeling of accomplishment instead of understanding, or even familiarity with, each step needed. Of course I've gone through the code line by line, but it is just not the same for me as writing it myself, and that should not be the point of learning. I also see this as an easy trap to fall into when learning how to code. That's why it was my ambition to at least try it out with my own data.
# The unique characters in the file
vocab = sorted(set(corpus))
print('{} unique characters'.format(len(vocab)))
This is quite a lot; the original text had around 60 characters. But that is a peculiarity of the Polish language, which includes characters like ąłśćźżó.
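Out of curiosity (this is not part of the tutorial), I can list the non-ASCII characters that inflate the vocabulary:
# Show the non-ASCII characters found in the corpus (curiosity check only)
non_ascii = [ch for ch in vocab if ord(ch) > 127]
print('{} non-ASCII characters: {}'.format(len(non_ascii), ''.join(non_ascii)))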
# Creating a mapping from unique characters to indices
char2idx = {u: i for i, u in enumerate(vocab)}
idx2char = np.array(vocab)
text_as_int = np.array([char2idx[c] for c in corpus])
tf.enable_eager_execution()
# The maximum length sentence we want for a single input in characters
seq_length = 100
examples_per_epoch = len(corpus) // seq_length
# Create training examples / targets
char_dataset = tf.data.Dataset.from_tensor_slices(text_as_int)
sequences = char_dataset.batch(seq_length + 1, drop_remainder=True)
for item in sequences.take(5):
    print(repr(''.join(idx2char[item.numpy()])))
def split_input_target(chunk):
    input_text = chunk[:-1]
    target_text = chunk[1:]
    return input_text, target_text
dataset = sequences.map(split_input_target)
for input_example, target_example in dataset.take(1):
    print('Input data: ', repr(''.join(idx2char[input_example.numpy()])))
    print('Target data:', repr(''.join(idx2char[target_example.numpy()])))
# Batch size
BATCH_SIZE = 64
steps_per_epoch = examples_per_epoch//BATCH_SIZE
# Buffer size to shuffle the dataset
# (TF data is designed to work with possibly infinite sequences,
# so it doesn't attempt to shuffle the entire sequence in memory. Instead,
# it maintains a buffer in which it shuffles elements).
BUFFER_SIZE = 1000
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
dataset
# Length of the vocabulary in chars
vocab_size = len(vocab)
# The embedding dimension
embedding_dim = 256
# Number of RNN units
rnn_units = 1024
import functools
rnn = functools.partial(
    tf.keras.layers.GRU,
    recurrent_activation='sigmoid'
)
def build_model(vocab_size, embedding_dim, rnn_units, batch_size):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(
            vocab_size,
            embedding_dim,
            batch_input_shape=[batch_size, None]
        ),
        rnn(rnn_units,
            return_sequences=True,
            recurrent_initializer='glorot_uniform',
            stateful=True
        ),
        tf.keras.layers.Dense(vocab_size)
    ])
    return model
model = build_model(
    vocab_size=len(vocab),
    embedding_dim=embedding_dim,
    rnn_units=rnn_units,
    batch_size=BATCH_SIZE
)
model.summary()
def loss(labels, logits):
    return tf.losses.sparse_softmax_cross_entropy(
        labels,
        logits
    )
model.compile(
    optimizer=tf.train.AdamOptimizer(),
    loss=loss
)
# Directory where the checkpoints will be saved
checkpoint_dir = './rnn_sejm/model'
# Name of the checkpoint files
checkpoint_prefix = os.path.join(checkpoint_dir, "ckpt_{epoch}")
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint(
    filepath=checkpoint_prefix,
    save_weights_only=True
)
EPOCHS = 10
history = model.fit(
    dataset.repeat(),
    epochs=EPOCHS,
    steps_per_epoch=steps_per_epoch,
    callbacks=[checkpoint_callback]
)
As I mentioned before, each epoch takes around three quarters of an hour to train. Initially I played with only 1 epoch, but in the end I let it run through 10 epochs while I was away at work. When I returned I had my 10-epoch model trained.
The graph below suggests that there is still much room for improvement, but for practice this should suffice.
epoch = range(EPOCHS)
plt.plot(epoch, history.history['loss'], label='Training set loss')
plt.xlabel('Epoch')
plt.ylabel('loss')
plt.legend(loc='best', prop={'size': 20})
pred_model = build_model(vocab_size, embedding_dim, rnn_units, batch_size=1)
pred_model.load_weights(tf.train.latest_checkpoint(checkpoint_dir))
pred_model.build(tf.TensorShape([1, None]))
pred_model.summary()
def generate_text(model, start_string, num_generate=100):
    # Evaluation step (generating text using the learned model)
    # Converting our start string to numbers (vectorizing)
    input_eval = [char2idx[s] for s in start_string]
    input_eval = tf.expand_dims(input_eval, 0)
    # Empty list to store our results
    text_generated = []
    # Low temperatures result in more predictable text.
    # Higher temperatures result in more surprising text.
    # Experiment to find the best setting.
    temperature = .5
    # Here batch size == 1
    model.reset_states()
    for i in range(num_generate):
        predictions = model(input_eval)
        # Remove the batch dimension
        predictions = tf.squeeze(predictions, 0)
        # Use a multinomial distribution to predict the character returned by the model
        predictions = predictions / temperature
        predicted_id = tf.multinomial(predictions, num_samples=1)[-1, 0].numpy()
        # We pass the predicted character as the next input to the model
        # along with the previous hidden state
        input_eval = tf.expand_dims([predicted_id], 0)
        text_generated.append(idx2char[predicted_id])
    return (start_string + ''.join(text_generated))
Simulation results
After training I generated a sample of text. To my astonishment, the text is even readable! Given that the model generates text letter by letter, I'm impressed. Even with the complexities of Polish grammar, there are fewer mistakes than I expected. Of course the sentences are too long and they switch sense in the middle, but there is some cohesion in them. Moreover, the model catches some peculiarities of the discussion well.
I'm curious to check how the results would look after more training.
I would like to play with the models more myself, but currently I'm handicapped by my laptop's computational power. Still, I really like Keras and how easy it is to build models in it compared to raw TensorFlow.
I'll keep the ideas coming!
print(generate_text(pred_model, start_string=u"Wicemarszałek ", num_generate=1500))