Math behind TextRank Algorithm

MLMath.io
4 min read · Feb 11, 2022

TextRank and PageRank in detail, with a NumPy implementation.

Goal: explain the math behind TextRank and PageRank in detail, with a NumPy implementation, and use it to generate a summary.

TextRank is a graph-based ranking model for text processing that can be used to find the most relevant sentences in a text document.

Application

Extractive summarization, and finding keywords/keyphrases.

Topics Covered

  1. PageRank Explanation
  2. TextRank Explanation
  3. Python Implementation of PageRank
  4. Python Implementation of TextRank

PageRank Explanation

PageRank is a very popular algorithm by Google that is used to calculate the weight of webpages; under the hood, a graph-based algorithm ranks the pages. I am explaining PageRank here because the TextRank algorithm is based on it.

Figs 1-3: the example webpage graph and its link matrices

Let me explain PageRank with the example below; it will make things a lot easier.

We have 7 webpages, namely A, B, C, D, E, F, G, represented in the graph shown in Fig 1. Each webpage is a node in the graph, and an arrow signifies a direct link; for example, a chunk of the text of webpage E contains a link to webpage G.

Fig 2 represents the outgoing links between webpages as a matrix: B has outgoing links to E and F, and similarly F has an outgoing link to G. Each outgoing link is mapped to a value of 1 in the matrix, and everything else to 0.

Fig 3 is just a normalisation of the outgoing links: since B has 2 outgoing links, each of its outbound links gets a weight of 0.5.
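Reading the full link structure off the matrix used in the code below, here is a minimal NumPy sketch of these two steps: building Fig 2's 0/1 link matrix, then dividing each column by that page's outgoing-link count to get Fig 3's normalised matrix (pages ordered A through G):

import numpy as np

# Fig 2 as a matrix: adj[i][j] = 1 if page j links to page i.
adj = np.array([[0, 0, 0, 0, 0, 0, 1],   # A gets a link from G
                [0, 0, 0, 0, 0, 0, 0],   # B has no incoming links
                [0, 0, 0, 0, 0, 0, 0],   # C has no incoming links
                [0, 0, 0, 0, 0, 0, 0],   # D has no incoming links
                [1, 1, 0, 0, 0, 0, 0],   # E gets links from A and B
                [1, 1, 1, 1, 0, 0, 0],   # F gets links from A, B, C and D
                [0, 0, 0, 0, 1, 1, 0]])  # G gets links from E and F

# Fig 3: divide each column by that page's number of outgoing links,
# so B's two outbound links each become 0.5.
M = adj / adj.sum(axis=0)
print(M)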

Formula of PageRank

PR(p_i, t+1) = (1 - d)/N + d × Σ_{p_j ∈ In(p_i)} [ PR(p_j, t) / L(p_j) ]

PR(p_i, t) → PageRank of the i-th webpage at the t-th iteration.

d → damping factor (the way teleportation is done: with probability d we follow a link, otherwise we jump to a random page).

In(p_i) → set of webpages with a link to p_i.

L(p_j) → number of outgoing links of webpage p_j.

N → total number of webpages.
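To make the update concrete, here is one hand-computed step using the uniform start from the implementation below (PR = 1/7 ≈ 0.1429 for every page, d = 0.85, N = 7). Page G has two incoming links, from E and F, and each of those pages has exactly one outgoing link:

PR(G, 1) = (1 - 0.85)/7 + 0.85 × (PR(E, 0)/1 + PR(F, 0)/1)
         = 0.0214 + 0.85 × (0.1429 + 0.1429)
         ≈ 0.264

After a single iteration G already holds well over the uniform 0.1429, which is why it ends up ranked first.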

Algorithm:

  1. Initialise a vector “V” whose size equals the number of nodes, with every element equal to 1, and define the number of iterations “Iter”.
  2. Normalise the vector “V” so that its entries sum to one.
  3. Pick a damping factor value such as 0.85.
  4. Compute the PageRank of each node with the formula above.
  5. Repeat step 4 for the given number of iterations “Iter”.
import numpy as np

def pagerank(M, iteration, D):
    # M is the column-stochastic link matrix: M[i][j] is the
    # probability of moving from page j to page i.
    N = len(M)
    # Start from a uniform rank vector normalised to sum to one.
    V = np.ones(N)
    V = V / np.linalg.norm(V, 1)
    # Fold the damping factor into the transition matrix: with
    # probability D follow a link, otherwise teleport uniformly.
    M_T = (D * M + (1 - D) / N)
    for i in range(iteration):
        V = np.dot(M_T, V)
    return V

# Columns correspond to pages A..G; column j holds page j's outgoing links.
M = np.array([[0, 0, 0, 0, 0, 0, 1],
              [0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0],
              [0.5, 0.5, 0, 0, 0, 0, 0],
              [0.5, 0.5, 1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1, 1, 0]])
V = pagerank(M, 100, 0.85)
np.argsort(V)   # page indices sorted from lowest to highest rank
# array([1, 2, 3, 4, 5, 0, 6])
Reading the argsort output from right to left, the ranking from high to low is G, A, F, E, with B, C, D tied at the bottom (they have no incoming links, so they receive only the teleportation mass (1 - d)/N).
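For reference, letting the iteration run to convergence (or solving the fixed point by hand) gives a rank vector of roughly [0.285, 0.021, 0.021, 0.021, 0.152, 0.188, 0.310] for A through G, which matches that ordering.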

TextRank Explanation:

Similar to PageRank, TextRank is also a graph-based ranking algorithm. Instead of webpage links, the TextRank algorithm works on words and sentences (word embeddings or sentence embeddings), and instead of counting incoming links, the similarity between sentences is used as the edge weight.

In Layman's Terms:

In layman's terms, what TextRank does is very simple: it finds how similar each sentence is to all the other sentences, and the one that is most similar to the others is ranked highest.

Algorithm

  1. Take a document with n sentences.
  2. Compute the embedding of each sentence using TF-IDF, BERT, etc. (a TF-IDF sketch follows this list).
  3. Calculate the similarity between every pair of sentences, which gives exactly the kind of matrix M you saw above in the PageRank implementation.
  4. Normalise the similarity matrix M so that each column sums to one, making it column-stochastic like the PageRank matrix above (since the matrix is symmetric, this is just the transpose of normalising each row).
  5. Then use the PageRank algorithm to compute the rank of each sentence.
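As a lighter-weight alternative to a transformer model, steps 2 and 3 can also be done with plain TF-IDF; here is a minimal scikit-learn sketch, borrowing two sentences from the example below:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ['Text processing is easy.',
             'Text processing is necessary and important.']

# Step 2 with TF-IDF instead of BERT: each sentence becomes a sparse vector.
tfidf = TfidfVectorizer().fit_transform(sentences)

# Step 3: the pairwise cosine similarities form the matrix M.
M = cosine_similarity(tfidf)
print(M)   # 2x2 symmetric matrix with ones on the diagonal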

Example and Implementation

import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer('all-MiniLM-L6-v2')
min_threshold = 1e-5

text = ['TextRank algorithm is used for summarization algorithms.',
        'Text processing is necessary and important.',
        'Text processing is easy.',
        'TextRank algorithm is used for keyword extraction.',
        'The data structure is an integral part of any algorithm.',
        'apple banana stores',
        'MI, is a phone brand',
        'Apple mobile are more durabel then MI']

def textrank(M, iteration, D):
    # The same power iteration as pagerank() above, with an early
    # exit once the rank vector stops changing.
    N = len(M)
    V = np.ones(N) / N
    M_T = (D * M + (1 - D) / N)
    for _ in range(iteration):
        prev_V = V
        V = np.dot(M_T, V)
        if np.abs(V - prev_V).sum() < min_threshold:
            break
    return V

# Steps 2 and 3: encode every sentence once and build the pairwise
# cosine-similarity matrix.
embeddings = model.encode(text)
similarity_score = cosine_similarity(embeddings, embeddings)

# Step 4: normalise each column so the matrix is column-stochastic,
# the same convention as the PageRank matrix above.
similarity_score = similarity_score / similarity_score.sum(axis=0)

V = textrank(similarity_score, 400, 0.85)
d = np.argsort(V)
for i in d[::-1]:   # highest-ranked sentence first
    print(text[i])
--------------Sentence ranked from high to low--------------
Text processing is easy.
Text processing is necessary and important.
TextRank algorithm is used for keyword extraction.
TextRank algorithm is used for summarization algorithms.
Apple mobile are more durabel then MI
MI, is a phone brand
apple banana stores
The data structure is an integral part of any algorithm.
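The goal at the top was to use this ranking to generate a summary. Here is a minimal sketch of that final step, with a hypothetical summarize helper (not from the original code): keep the top-k sentences, but emit them in their original document order so the summary reads naturally.

import numpy as np

def summarize(text, V, k=3):
    # Hypothetical helper: indices of the k highest-ranked sentences,
    # re-sorted into document order before joining.
    top_k = sorted(np.argsort(V)[::-1][:k])
    return ' '.join(text[i] for i in top_k)

print(summarize(text, V, k=3))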

For future blogs:

  1. The power method to compute PageRank, as it saves a lot of computation.
  2. Drawbacks of TextRank in the context of summarization.
  3. Maximal Marginal Relevance.
