NelsonJQ/termseeker

Project Overview

TermSeeker is a Python application that supports terminology work by searching official UN documents for candidate terms in UN languages. It queries the UN Digital Library and processes the search results in one or more target languages (Arabic, Spanish, French, Russian, Chinese).

Features

  • Convert Word documents to Markdown format.
  • Convert PDF documents (from local files or URLs) to Markdown format.
  • Search the UN Digital Library by term and document symbol.
  • Extract document symbols and metadata from search results.
  • Generate downloadable PDF URLs for UN documents in all official languages.
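
As an illustration of the last feature, the sketch below builds one link per official language for a given document symbol. It assumes the public undocs.org short-link pattern (https://undocs.org/<lang>/<symbol>). The helper name build_undocs_urls is hypothetical and is not part of termseeker, which may construct its URLs differently, and not every symbol is guaranteed to resolve.

```python
from urllib.parse import quote

# ISO 639-1 codes for the six official UN languages.
UN_LANGUAGE_CODES = {
    "Arabic": "ar",
    "Chinese": "zh",
    "English": "en",
    "French": "fr",
    "Russian": "ru",
    "Spanish": "es",
}


def build_undocs_urls(symbol, languages=None):
    """Return {language: url} for a UN document symbol (hypothetical helper)."""
    languages = languages or list(UN_LANGUAGE_CODES)
    # Keep "/" unescaped because it is part of the document symbol.
    safe_symbol = quote(symbol, safe="/")
    return {
        lang: f"https://undocs.org/{UN_LANGUAGE_CODES[lang]}/{safe_symbol}"
        for lang in languages
    }


urls = build_undocs_urls("UNEP/EA.2/RES.8", ["French", "Spanish"])
print(urls["French"])
```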

Google Colab

You can test the installation and execution of the termseeker library using Google Colab. Click the link below to open the notebook:

Open in Colab

Installation

To set up the project, clone the repository and install the required dependencies:

git clone https://github.com/NelsonJQ/termseeker.git
cd termseeker
pip install .

Usage

Using getCandidates()

The getCandidates() function is the core of this application. It searches the UN Digital Library for documents matching the provided search text, languages, and filter symbols, then processes those documents to extract relevant paragraphs and terms.

Function Signature

def getCandidates(input_search_text, input_lang,
                input_filterSymbols, sourcesQuantity,
                paragraphsPerDoc, eraseDrafts,
                localLM=False, groqToken=None
                ):

Parameters

  • input_search_text (str): The search text to find terminology for.
  • input_lang (list or str): Target languages (e.g., ["Spanish", "French"]). Use "ALL" for all supported languages.
  • input_filterSymbols (list): Document symbols to filter the search by (e.g., ["UNEP/CBD", "UNEP/EA"]).
  • sourcesQuantity (int): Number of sources (documents) to retrieve.
  • paragraphsPerDoc (int): Number of paragraphs to extract per document.
  • eraseDrafts (bool): Whether to discard draft documents.
  • localLM (bool, optional): Whether to use a local inference server (LM Studio or Ollama) for term extraction. Defaults to False. Set it to None to skip term extraction by any local or cloud LLM.
  • groqToken (str, optional): API key for the Groq cloud inference server (70b model). Defaults to None.
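
Because input_lang accepts a string, a list, or the "ALL" shortcut, it can help to normalize the value before calling getCandidates(). The following is a minimal sketch of that idea, not termseeker's own logic: normalize_languages and SUPPORTED_LANGUAGES are hypothetical names, and the expansion of "ALL" to the five languages named in the overview is an assumption.

```python
# Target languages listed in the project overview
# (assumed to be what "ALL" expands to).
SUPPORTED_LANGUAGES = ["Arabic", "Spanish", "French", "Russian", "Chinese"]


def normalize_languages(input_lang):
    """Return a list of language names from a str, a list, or "ALL"."""
    if isinstance(input_lang, str):
        if input_lang.upper() == "ALL":
            return list(SUPPORTED_LANGUAGES)
        input_lang = [input_lang]
    # Reject names that are not in the supported list.
    unknown = [lang for lang in input_lang if lang not in SUPPORTED_LANGUAGES]
    if unknown:
        raise ValueError(f"Unsupported language(s): {unknown}")
    return list(input_lang)


print(normalize_languages("ALL"))
print(normalize_languages("French"))
```

Validating early like this surfaces a misspelled language name before any network request is made.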

Example Usage

from termseeker.getcandidates import getCandidates

input_search_text = "10-Year Framework of Programmes on Sustainable Consumption and Production Patterns"
input_lang = ["Spanish", "French"]
input_filterSymbols = ["UNEP/CBD", "UNEP/EA", "FCCC"]
sourcesQuantity = 3
paragraphsPerDoc = 2
eraseDrafts = True

results = getCandidates(input_search_text, input_lang, input_filterSymbols, sourcesQuantity, paragraphsPerDoc, eraseDrafts)
print(results)
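
With six positional arguments, passing them by name can make a call easier to read and lets the optional groqToken be added only when a key is available. The sketch below collects the arguments in a dictionary. build_candidate_kwargs is a hypothetical helper and GROQ_API_KEY is an assumed environment variable name; only the parameter names come from the signature above.

```python
import os


def build_candidate_kwargs(search_text, languages, symbols,
                           sources=3, paragraphs=2, erase_drafts=True):
    """Collect getCandidates() arguments by name (hypothetical helper)."""
    kwargs = {
        "input_search_text": search_text,
        "input_lang": languages,
        "input_filterSymbols": symbols,
        "sourcesQuantity": sources,
        "paragraphsPerDoc": paragraphs,
        "eraseDrafts": erase_drafts,
    }
    # GROQ_API_KEY is an assumed variable name; pass the token only if set.
    token = os.environ.get("GROQ_API_KEY")
    if token:
        kwargs["groqToken"] = token
    return kwargs


kwargs = build_candidate_kwargs(
    "sustainable consumption and production", ["Spanish"], ["UNEP/EA"]
)
print(sorted(kwargs))

# With termseeker installed, the call would then be:
# from termseeker.getcandidates import getCandidates
# results = getCandidates(**kwargs)
```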

Using consolidate_results()

The consolidate_results() function from utils.py consolidates the results obtained from getCandidates() into a compact dataframe and optionally exports it as an Excel file.

Function Signature

def consolidate_results(result, exportExcel=False) -> list:

Parameters

  • result (list): List of dictionaries containing the cleaned metadata (output from getCandidates()).
  • exportExcel (bool): Whether to export the consolidated results as an Excel file.

Example Usage

from termseeker.utils import consolidate_results

# Assuming `results` is the output from getCandidates()
consolidated_results = consolidate_results(results, exportExcel=True)
print(consolidated_results)
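
Once results are available, a common next step is to narrow them to one document series. The sketch below assumes the records are dictionaries with a docSymbol key holding either a string or a list of strings, as the parameter description and the example columns suggest; the actual structure may differ. filter_by_symbol_prefix is a hypothetical helper, and the sample records are stand-in data, not real output.

```python
def filter_by_symbol_prefix(records, prefix):
    """Keep records whose docSymbol (str or list of str) starts with prefix."""
    kept = []
    for record in records:
        symbols = record.get("docSymbol", [])
        if isinstance(symbols, str):
            symbols = [symbols]
        if any(symbol.startswith(prefix) for symbol in symbols):
            kept.append(record)
    return kept


# Stand-in data shaped like the example output; not real results.
sample = [
    {"EnglishTerm": "term A", "docSymbol": ["UNEP/EA.2/RES.8", "UNEP/EA.3/11"]},
    {"EnglishTerm": "term B", "docSymbol": "FCCC/EXAMPLE/1"},
]
print([r["EnglishTerm"] for r in filter_by_symbol_prefix(sample, "UNEP/EA")])
```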

Example of Returned DataFrame

The following is an example of the dataframe returned by consolidate_results() and getCandidates():

  • EnglishTerm: 10-Year Framework of Programmes on Sustainable Consumption and Production Patterns
  • FrenchTerm: Cadre décennal de programmation concernant les modes de consommation et de production durables
  • SpanishTerm: Marco Decenal de Programas sobre Modalidades de Consumo y Producción Sostenibles
  • FrenchSynonyms: []
  • SpanishSynonyms: []
  • EnglishParagraphs:

    Emphasizing the need to [...] such as the 10-Year Framework of Programmes on Sustainable Consumption and Production Patterns, relevant to [...], (Source: UNEP/EA.2/RES.8 on 2016-08-03)

    6. _Also requests the [...] of the 10-Year Framework of Programmes on Sustainable Consumption and Production Patterns, taking into account [...]: (Source: UNEP/EA.2/RES.8 on 2016-08-03)

    6. The 10-Year Framework of Programmes on Sustainable Consumption and Production Patterns reported that some $80 million [...]. (Source: UNEP/EA.3/11 on 2017-09-20)

  • FrenchParagraphs:

    Soulignant qu’il faut [...] que le Cadre décennal de programmation concernant les modes de consommation et de production durables, qui présentent [...], (Source: UNEP/EA.2/RES.8 on 2016-08-03)

    6. _Prie également le [...] du Cadre décennal de programmation concernant les modes de consommation et de production durables, compte tenu des [...] : (Source: UNEP/EA.2/RES.8 on 2016-08-03)

    6. Selon le Cadre décennal de programmation concernant les modes de consommation et de production durables, quelque 80 millions de dollars avaient [...]. (Source: UNEP/EA.3/11 on 2017-09-20)

  • SpanishParagraphs:

    Haciendo hincapié en [...] como el Marco Decenal de Programas sobre Modalidades de Consumo y Producción Sostenibles, que guardan [...], (Source: UNEP/EA.2/RES.8 on 2016-08-03)

    6. _Solicita también al [...] del Marco Decenal de Programas sobre Modalidades de Consumo y Producción Sostenibles, teniendo en cuenta las [...]: (Source: UNEP/EA.2/RES.8 on 2016-08-03)

    6. El Marco Decenal de Programas sobre Modalidades de Consumo y Producción Sostenibles informó de que a [...]. (Source: UNEP/EA.3/11 on 2017-09-20)

  • docSymbol:

    • UNEP/EA.2/RES.8
    • UNEP/EA.1/INF/3
    • UNEP/EA.3/11
  • publicationDate:

    • 2016-08-03
    • 2014-05-30
    • 2017-09-20
  • docType:

    • Resolutions and Decisions
    • Documents and Publications
    • Reports
  • docTitle:

    • 2/8. Sustainable consumption and production : resolution / adopted by the United Nations Environment Assembly
    • Results of the sixty-eighth session of the General Assembly of relevance to the United Nations Environment Assembly : note / by the Executive Director
    • Progress made pursuant to resolution 2/8 on sustainable consumption and production : report of the Executive Director
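
Each paragraph in the example ends with a "(Source: <symbol> on <date>)" marker. If the paragraph text and its provenance are needed separately, the marker can be split off with a regular expression. This is a minimal sketch based only on the format visible in the example above; split_source and SOURCE_PATTERN are hypothetical names, not part of termseeker.

```python
import re

# Matches the trailing "(Source: <symbol> on <YYYY-MM-DD>)" marker
# seen in the example paragraphs.
SOURCE_PATTERN = re.compile(
    r"\(Source:\s*(?P<symbol>\S+)\s+on\s+(?P<date>\d{4}-\d{2}-\d{2})\)\s*$"
)


def split_source(paragraph):
    """Return (text, symbol, date); symbol and date are None without a marker."""
    match = SOURCE_PATTERN.search(paragraph)
    if match is None:
        return paragraph.strip(), None, None
    text = paragraph[:match.start()].strip()
    return text, match.group("symbol"), match.group("date")


example = "reported that some $80 million [...]. (Source: UNEP/EA.3/11 on 2017-09-20)"
print(split_source(example))
```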

Choosing Between Local Inference Server and DDGS

The askLLM_term_equivalents function can extract term equivalents with either a local inference server (run with LM Studio) or the DuckDuckGo Search (DDGS) service. The customInference argument selects the backend: True for the local server, False for DDGS.

  • Local Inference Server:

    • Advantages: No dependency on external services, full data privacy, and more control over the model and its parameters.
    • Disadvantages: Requires setup and maintenance of the local server, which might be resource-intensive. See the server's documentation for how to call its endpoint.
  • DDGS:

    • Advantages: Easy to use without setup, leverages powerful models hosted by DuckDuckGo.
    • Disadvantages: Dependent on external service availability and internet connection, potentially slower response times.
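
One way to act on this trade-off is to prefer the local server when it is running and fall back to DDGS otherwise. The sketch below only checks whether a TCP port accepts connections and uses the answer as the value for customInference. local_server_available is a hypothetical helper, and port 1234 is assumed to be the local server's port (LM Studio's usual default); adjust it to your setup.

```python
import socket


def local_server_available(host="127.0.0.1", port=1234, timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Port 1234 is an assumption; change it if your server listens elsewhere.
use_local = local_server_available()
print("customInference =", use_local)
```

A successful connection only shows that a process is listening, not that a model is loaded, so a failed request should still be handled.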

Example Usage

from termseeker.utils import askLLM_term_equivalents

# Using DDGS
response = askLLM_term_equivalents(
    source_term="climate change",
    source_paragraphs=["..."],
    target_paragraphs=["..."],
    source_language="English",
    target_language="Spanish",
    customInference=False
)
print(response)

# Using Local Inference Server
response = askLLM_term_equivalents(
    source_term="climate change",
    source_paragraphs=["..."],
    target_paragraphs=["..."],
    source_language="English",
    target_language="Spanish",
    customInference=True
)
print(response)

Contributing

Contributions are welcome! Please open an issue or submit a pull request for any enhancements or bug fixes.

License

This project is licensed under the MIT License. Its use also requires permission from the UN Digital Library. See the LICENSE file for more details.

About

UN Terminology checker and corrector for parallel texts
