Skip to content

This code can be used to implement a text search algorithm. Specifically, given an input text it returns a summary of the most relevant paragraph from a given list of paragraphs.

Notifications You must be signed in to change notification settings

marcoramponi/Extract-Relevant-Text

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

14 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Extract Relevant Text

This code can be used to implement a text search algorithm powered by Machine Learning techniques.

Specifically, given an input text, it returns a summary of the most relevant paragraph from a given list of paragraphs.

The logic essentially consists of two components:

1. Topic Extractor

I split the data (list of paragraphs) into a predefined number of topics via a non-negative matrix factorization algorithm. Given an input text, the most 'relevant' topic is then extracted by computing a cosine similarity function.

2. Summarization

An (abstractive) text summarization of the chosen paragraph is then performed via a Hugging Face Transformers pipeline object set for summarization task.

Dataset

In order to test this algorithm I prepared a dataset containing a list of paragraphs from F. Nietzsche. The preprocessing is aimed at extracting and grouping text by paragraphs. I also perform some exploratory data analysis along the way.

The original dataset 'nietzsche.txt' is publicly availabe on Kaggle:

https://www.kaggle.com/datasets/christopherlemke/philosophical-texts

About

This code can be used to implement a text search algorithm. Specifically, given an input text it returns a summary of the most relevant paragraph from a given list of paragraphs.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published