Revolutionizing Text Summarization: A Deep Dive into Abstractive Summarization
In my previous article, we took a look at extractive summarization and how it works. Today, let’s dive into the exciting world of abstractive summarization.
Abstractive summarization
Abstractive summarization is a sophisticated technique that creates a concise and understandable summary of a longer text. Unlike extractive summarization, which simply pulls out key sentences, abstractive summarization generates a new, unique summary that captures the text’s essence and main ideas, often using different language. This technique requires a deep understanding of the text and the ability to generate new language based on that understanding.
So, are you ready to explore the exciting world of abstractive summarization? Let’s do this!
Our Approach to Abstractive Summarization
We will be performing abstractive summarization of text using a seq2seq model. Our approach to the problem can be divided into these broad categories.
Introduction to seq2seq architecture for text generation
- The model uses a sequence-to-sequence (seq2seq) encoder-decoder architecture to generate text.
- The encoder processes the input text and encodes it into a fixed-length vector, which is then passed to the decoder.
- The decoder uses this vector to generate the output text.
Description of the encoder
- The encoder consists of two layers of LSTM (long short-term memory) cells: recurrent neural network (RNN) cells that can remember previous input and use it to inform the current output.
- The first layer processes the input text and produces a hidden state representation of the input.
- The second layer processes the hidden state and produces the final fixed-length vector representation of the input.
Description of the decoder
- The decoder also consists of two layers of LSTM cells, with the first layer receiving the fixed-length vector from the encoder as its initial state.
- The first layer produces a hidden state representation of the vector.
- The second layer processes the hidden state and produces the output text.
Training and use of the model
- The model is trained on a large corpus of document-summary pairs, where the input is the document’s word sequence and the target is the summary’s word sequence.
- During training, the decoder learns to predict each next word of the summary given the words that came before it, a setup known as teacher forcing (see the toy example after this list).
- Once trained, the model can summarize new text: given an input document, the decoder predicts the summary one word at a time, feeding each prediction back in as the next input.
- The seq2seq encoder-decoder architecture allows the model to learn the relationships between input and output sequences, allowing it to generate coherent text.
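To make teacher forcing concrete, here is a toy illustration of how a tokenized summary becomes the decoder’s input and target during training. The token IDs are invented for demonstration and are not from the article’s model:
# a tokenized summary wrapped in start/end tokens: [sostok, w1, w2, w3, eostok]
summary_ids = [1, 45, 67, 23, 2]  # invented IDs, purely for illustration
decoder_input = summary_ids[:-1]   # [1, 45, 67, 23] -> what the decoder sees
decoder_target = summary_ids[1:]   # [45, 67, 23, 2] -> what it must predict
# at each step the decoder receives the previous ground-truth token
# and is trained to predict the next one
This is exactly the shifting you will see again in Step 4, where the model is fit on train_y[:, :-1] against train_y[:, 1:].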
What are seq2seq models?
Seq2seq models are a type of machine learning model used in natural language processing. They’re a cool way of turning input sequences into output sequences! The idea is that the model learns to understand the relationships between the input and the desired output. This allows it to generate new, coherent text based on a given input. The models consist of two parts: an encoder and a decoder. The encoder processes the input sequence and encodes it into a fixed-length vector, which is then passed to the decoder. The decoder uses this vector to generate the output sequence. Basically, it’s a fun way of making a computer write like a human!
Let's delve into the details.
Since abstractive summarization is a long process, I won’t show every part of the code.
Step 0: Preprocess the dataset
Preprocessing the dataset is a very important step not only for abstractive summarization but for any NLP application. However, this article’s primary focus is the seq2seq model, so we won’t discuss the preprocessing steps in detail. In short, we define a text cleaner function that takes a document as input and cleans it up for further processing: it converts the document to lowercase, expands contractions, and strips URLs, HTML tags, punctuation marks, and other noise using regular expressions before returning the cleaned result. Each summary is also wrapped between special sostok and eostok tokens so the decoder can later tell where a summary starts and ends.
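As a rough illustration, a cleaner along those lines might look like the sketch below. It is a minimal version: the real function described above also expands contractions, which I’ve omitted here.
import re

def text_cleaner(document):
    text = document.lower()                     # lowercase everything
    text = re.sub(r'https?://\S+', '', text)    # remove URLs
    text = re.sub(r'<[^>]+>', ' ', text)        # remove HTML tags
    text = re.sub(r'[^a-z\s]', ' ', text)       # drop punctuation and digits
    text = re.sub(r'\s+', ' ', text).strip()    # collapse whitespace
    return text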
Step 1: Tokenizing Text and the Summary Data
First, we split the clean dataset into four variables: train_x, val_x, train_y, and val_y. train_x and train_y will be used to train the model, while val_x and val_y will be used to validate it during training.
from sklearn.model_selection import train_test_split

train_x, val_x, train_y, val_y = train_test_split(New_df['document'], New_df['summary'],  # split into training and validation sets
                                                  test_size=0.1,  # hold out 10% of the data for validation
                                                  random_state=0)  # fixed seed for reproducibility
del New_df  # delete the original dataframe to free memory
Then we fit tokenizers on the documents and the summaries and gather word-frequency statistics: for each word, we check whether its frequency falls below a given threshold, counting the total number of distinct words, their combined frequency, and the number and combined frequency of rare words (a sketch of that counting follows the snippet below). These counts are useful for choosing sensible vocabulary sizes.
from tensorflow.keras.preprocessing.text import Tokenizer

# Initialize one tokenizer for the documents and one for the summaries
xTokenizer = Tokenizer()
yTokenizer = Tokenizer()
# Fit the tokenizers on the training data
xTokenizer.fit_on_texts(list(train_x))
yTokenizer.fit_on_texts(list(train_y))
# Convert the documents and summaries to integer sequences
train_x = xTokenizer.texts_to_sequences(train_x)
val_x = xTokenizer.texts_to_sequences(val_x)
train_y = yTokenizer.texts_to_sequences(train_y)
val_y = yTokenizer.texts_to_sequences(val_y)
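The rare-word counting described a moment ago isn’t shown in the article’s code, so here is a minimal sketch of it. The threshold of 4 is an arbitrary choice for illustration:
threshold = 4  # assumed frequency threshold; tune for your corpus
total_count, total_freq = 0, 0  # all words
rare_count, rare_freq = 0, 0    # words rarer than the threshold
for word, freq in xTokenizer.word_counts.items():
    total_count += 1
    total_freq += freq
    if freq < threshold:
        rare_count += 1
        rare_freq += freq
print('% rare words in vocabulary:', 100 * rare_count / total_count)
print('% corpus occupied by rare words:', 100 * rare_freq / total_freq)
These percentages are a common way to pick the vocabulary sizes (t_max_features for documents, s_max_features for summaries) used when defining the model below: large enough to cover most of the corpus, small enough to drop the long tail of rare words.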
Step 2: Reading the Word Embeddings
You can download the GloVe word embeddings from Kaggle.
Next, we read the file ‘glove.6B.100d.txt’ with utf8 encoding and assign the result to the variable ‘glove_file’.
glove_file = open('<path to your embedding>/glove.6B.100d.txt', encoding = 'utf8')
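The article doesn’t show how this file becomes the t_embed matrix used in the model below, so here is a minimal sketch of one way to build it. It assumes embed_dim = 100 (matching glove.6B.100d) and that t_max_features, the document vocabulary size, is already defined:
import numpy as np

embed_dim = 100  # glove.6B.100d has 100-dimensional vectors
embeddings_index = {}
for line in glove_file:
    values = line.split()
    embeddings_index[values[0]] = np.asarray(values[1:], dtype='float32')
glove_file.close()

# build the embedding matrix for the document vocabulary;
# words missing from GloVe keep all-zero vectors
t_embed = np.zeros((t_max_features, embed_dim))
for word, idx in xTokenizer.word_index.items():
    if idx < t_max_features and word in embeddings_index:
        t_embed[idx] = embeddings_index[word]
The same recipe, run against yTokenizer, would produce the s_embed matrix for the summary vocabulary.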
Step 3: Defining the Model Architecture
Now we define a seq2seq model for encoding and decoding sequences. For my model, the number of dimensions in the latent space is set to 128. The encoder takes a sequence of text as input and processes it using an input layer, an embedding layer, and a bidirectional LSTM layer. The forward and backward hidden and cell states from the bidirectional LSTM are then concatenated to form the encoder’s final hidden and cell states. The decoder takes a sequence as input, processes it with an input layer, an embedding layer, and an LSTM layer, and outputs a time-distributed dense layer with a softmax activation. The final model is defined with the encoder and decoder inputs and the decoder output. A summary of the model is printed, and the model’s diagram is plotted.
from tensorflow.keras.layers import (Input, Embedding, LSTM, Bidirectional,
                                     Concatenate, TimeDistributed, Dense)
from tensorflow.keras.models import Model
from tensorflow.keras.utils import plot_model

latent_dim = 128  # number of dimensions in the latent space

# Encoder
eInput = Input(shape=(maxlen_text,))  # input layer for the encoder
eEmbed = Embedding(t_max_features, embed_dim, input_length=maxlen_text,
                   weights=[t_embed], trainable=False)(eInput)  # frozen GloVe embedding layer
eLstm = Bidirectional(LSTM(latent_dim, return_state=True))  # bidirectional LSTM for the encoder
eOut, eFh, eFc, eBh, eBc = eLstm(eEmbed)  # output, then forward/backward hidden and cell states
eH = Concatenate(axis=-1, name='eH')([eFh, eBh])  # concatenate forward and backward hidden states
eC = Concatenate(axis=-1, name='eC')([eFc, eBc])  # concatenate forward and backward cell states

# Decoder
dInput = Input(shape=(None,))  # input layer for the decoder
dEmbed = Embedding(s_max_features, embed_dim, weights=[s_embed], trainable=False)(dInput)  # frozen embedding layer
dLstm = LSTM(latent_dim*2, return_sequences=True, return_state=True,
             dropout=0.3, recurrent_dropout=0.2)  # LSTM layer for the decoder
dOut, _, _ = dLstm(dEmbed, initial_state=[eH, eC])  # decoder output; the encoder states seed the LSTM
dDense = TimeDistributed(Dense(s_max_features, activation='softmax'))  # per-timestep softmax over the summary vocabulary
dOut = dDense(dOut)  # output of the decoder

model = Model([eInput, dInput], dOut)  # define the model with encoder and decoder inputs and decoder output
model.summary()  # print model summary
plot_model(model, show_shapes=True)  # plot the model diagram (requires pydot and graphviz)
Step 4: Fit the Model
Now we fit the model, minimising the sparse categorical cross-entropy loss. The optimizer I used is rmsprop, which adjusts the model’s parameters to reduce the loss. The code also includes an early stopping mechanism, which halts training if the validation loss hasn’t improved for two consecutive epochs. The model is trained for at most 10 epochs with a batch size of 128, using the specified validation data.
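One detail the earlier snippets gloss over: texts_to_sequences returns variable-length lists, while model.fit below slices train_y and val_y as fixed-size arrays. A minimal padding step, assuming maxlen_text is known and maxlen_summary is a hypothetical name for the maximum summary length, might look like this:
from tensorflow.keras.preprocessing.sequence import pad_sequences

# pad documents and summaries into fixed-size integer arrays
train_x = pad_sequences(train_x, maxlen=maxlen_text, padding='post')
val_x = pad_sequences(val_x, maxlen=maxlen_text, padding='post')
train_y = pad_sequences(train_y, maxlen=maxlen_summary, padding='post')
val_y = pad_sequences(val_y, maxlen=maxlen_summary, padding='post')
With the arrays in place, we can compile and fit: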
model.compile(loss='sparse_categorical_crossentropy', optimizer='rmsprop')  # compile with sparse categorical cross-entropy loss and the rmsprop optimizer
early_stop = keras.callbacks.EarlyStopping(monitor='val_loss', mode='min', verbose=1, patience=2)  # stop when val_loss hasn't improved for 2 epochs

model.fit([train_x, train_y[:, :-1]],  # encoder input and teacher-forced decoder input
          train_y.reshape(train_y.shape[0], train_y.shape[1], 1)[:, 1:],  # decoder target, shifted by one token
          epochs=10, batch_size=128, verbose=2,
          callbacks=[early_stop],
          validation_data=([val_x, val_y[:, :-1]],
                           val_y.reshape(val_y.shape[0], val_y.shape[1], 1)[:, 1:]))
Step 5: Generate Sentences
To generate a summary, we take an input sequence and decode the output one token at a time. The code starts by using an encoder model to predict the hidden and cell states of the input sequence. Then, it sets the next token to the start-of-sequence token and initialises an empty output string. Next, it enters a loop that continues until either the end-of-sequence token is produced or the maximum number of iterations is reached. In each iteration, the decoder model predicts the next token along with updated hidden and cell states. The token with the highest probability is chosen, added to the output sequence (as long as it isn’t a start-of-sequence or end-of-sequence token), and then fed back in as the next token. And so the loop goes on until the stop condition is met.
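The eModel and dModel used below are separate inference-time encoder and decoder models, and the article doesn’t show how they’re built. Here is a minimal sketch of one way to assemble them from the Step 3 layers; dStateH and dStateC are names I’ve introduced for the decoder’s state inputs, and the embedding is simply re-instantiated with the same frozen s_embed weights:
# Inference encoder: maps a padded document to the decoder's initial states
eModel = Model(eInput, [eH, eC])

# Inference decoder: runs one step at a time, given the previous states
dStateH = Input(shape=(latent_dim*2,))  # previous hidden state
dStateC = Input(shape=(latent_dim*2,))  # previous cell state
dEmbed2 = Embedding(s_max_features, embed_dim, weights=[s_embed], trainable=False)(dInput)
dOut2, dH, dC = dLstm(dEmbed2, initial_state=[dStateH, dStateC])  # reuse the trained LSTM layer
dModel = Model([dInput, dStateH, dStateC], [dDense(dOut2), dH, dC])  # reuse the trained dense layer
With those in place, the decoding loop looks like this: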
import numpy as np

def generate_summary(input_seq):
    h, c = eModel.predict(input_seq)  # get hidden and cell states from the inference encoder
    next_token = np.zeros((1, 1))
    next_token[0, 0] = yTokenizer.word_index['sostok']  # start with the start-of-sequence token
    output_seq = ''
    stop = False
    count = 0
    while not stop:
        if count > 100:  # maximum number of iterations
            break
        decoder_out, state_h, state_c = dModel.predict([next_token, h, c])  # one decoding step
        token_idx = np.argmax(decoder_out[0, -1, :])  # index of the most probable token
        if token_idx == yTokenizer.word_index['eostok']:  # end-of-sequence token
            stop = True
        elif token_idx > 0 and token_idx != yTokenizer.word_index['sostok']:  # exclude special tokens
            token = yTokenizer.index_word[token_idx]  # look up the actual word
            output_seq = output_seq + ' ' + token  # append it to the output sequence
        next_token = np.zeros((1, 1))
        next_token[0, 0] = token_idx  # feed the predicted token back in
        h, c = state_h, state_c  # update the hidden and cell states
        count += 1
    return output_seq  # return the generated summary
Step 6: Get the Summary
Finally, we define a function called “get_summary”, which takes in a user input text and returns its summary. The function starts by cleaning the input text with the “text_cleaner” function, then converts the cleaned text into a sequence of integers with “xTokenizer.texts_to_sequences”. Next, the sequence is padded to the maximum document length with “pad_sequences”. Finally, the summary is generated with “generate_summary” and returned.
# get the summary for one custom input
def get_summary(input_text):
    # clean the input document
    input_text = text_cleaner(input_text)
    # convert the cleaned document to a sequence of integers
    input_text = xTokenizer.texts_to_sequences([input_text])
    # pad the sequence to match the maximum document length
    input_text = pad_sequences(input_text, maxlen=maxlen_text, padding='post')
    # generate the summary
    summary = generate_summary(input_text)
    return summary
Final Output
For the following input
Dougie Freedman is on the verge of agreeing a new two-year deal to remain at Nottingham Forest.
Freedman has stabilised Forest since he replaced cult hero Stuart Pearce and the club's owners are pleased with the job he has done at the City Ground.
Dougie Freedman is set to sign a new deal at Nottingham Forest.
Freedman has impressed at the City Ground since replacing Stuart Pearce in February.
They made an audacious attempt on the play-off places when Freedman replaced Pearce but have tailed off in recent weeks.
That has not prevented Forest's ownership making moves to secure Freedman on a contract for the next two seasons.
The output we get is:
Dougie Freedman is likely to sign a new two-year deal at Nottingham Forest after impressing since replacing Stuart Pearce in February and stabilizing the club.
Despite a recent drop in performance, Forest's owners are pleased with Freedman's work and want to secure him for the next two seasons.
Some Applications of Abstractive Summarization in Everyday Life
Abstractive summarization has found applications in a variety of software systems and platforms, particularly in the fields of artificial intelligence and natural language processing. Some popular examples include:
- News aggregators, such as Google News, Apple News, and Flipboard, use abstractive summarization techniques to provide a brief overview of the most important news stories of the day based on a variety of sources.
- Chatbots and virtual assistants, such as Amazon’s Alexa, Apple’s Siri, and Google Assistant, use abstractive summarization to provide users with concise and relevant information in response to their queries.
- Content management systems and knowledge management platforms, such as IBM Watson Discovery and Microsoft SharePoint, use abstractive summarization to automatically generate summaries of large collections of documents, making it easier for users to find the information they need.
- Email clients and messaging apps, such as Gmail and Slack, use abstractive summarization to provide users with a summarized view of their incoming messages and notifications.
Abstractive vs Extractive Summarization: The Debate Continues
Now that we have a better understanding of both techniques, it’s time to compare them. Extractive summarization selects and stitches together sentences that already appear in the source, which keeps it simple, fast, and faithful to the original wording, but the result can read as choppy and redundant. Abstractive summarization paraphrases the source in new words, producing more natural, human-like summaries, but it needs far more training data and compute and can occasionally generate statements the source never made. In practice, extractive methods are a safe choice when fidelity and speed matter most, while abstractive methods shine when readability and brevity are the priority. This gives us a clearer picture of when to use one technique over the other for our specific needs.
In conclusion, abstractive summarization is a powerful technique that creates a concise and understandable summary of a longer text by generating new language based on its understanding of the text. Our approach uses a sequence-to-sequence (seq2seq) encoder-decoder architecture: the encoder processes the input text and encodes it into a fixed-length vector, which the decoder then uses to generate the output text. The seq2seq model is trained on a large corpus of document-summary pairs, learning to predict each next word of the summary from the words before it. By the end of training, the model can generate a coherent summary for a given input. It’s amazing how we can train a computer to write like a human using these techniques!
Thank you for taking the time to read this article. Balancing grad school and writing can be a challenge, but I felt inspired to share my knowledge after taking a Natural Language Processing course last semester. For my final project, I chose to focus on summarization, and I hope that by writing about it, I can help others clarify their understanding of this topic.
It takes a lot of work to research for and write such an article, and a clap or a follow 👏 from you means the entire world 🌍to me. It takes less than 10 seconds for you, and it helps me with reach! You can also ask me any questions, point out anything, or just drop a “Hey” 👇 down there.