Document-Classification-Dataset

A dataset containing text from a variety of document classes for classification and demonstration purposes.

Anyone who has had to set up and demo a document classification system knows that generating a dataset of documents in specific classes is time-consuming, and that the result is often of poor quality (for example, copy-pasting the same few documents over and over until the classifier is happy).

This project is an effort to create a high-quality classification dataset containing a variety of document classes for everyone to use and add to. The goal is to reach about 100 different samples of each document class for training and a smaller set for validation or demonstration.

Types

  • Invoices
  • Contracts
  • CVs
  • TBD

Progress

How to contribute

This description, along with the rest of this README, is to be elaborated. In short, contributed documents must be in English and anonymized. Documents can be merged into the repository via a pull request or by sending them directly to Solution Management.

The plan is also to have a script that automatically fills in a set of "Customer" names and up-to-date dates within a range, to make the data more relevant. The exact details of this script, and of how the data should be stored and processed, are yet to be determined.
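
Since the script is not yet specified, the following is only a minimal sketch of one way such a fill-in step could work. It assumes that anonymized documents carry placeholder tokens; the tokens `{{CUSTOMER}}` and `{{DATE}}`, the function names, and the sample customer names are all hypothetical and not part of this project.

```python
import random
from datetime import date, timedelta

# Hypothetical placeholder tokens; the project has not defined a template format.
CUSTOMER_TOKEN = "{{CUSTOMER}}"
DATE_TOKEN = "{{DATE}}"


def random_date(start: date, end: date, rng: random.Random) -> date:
    """Return a uniformly chosen date in the inclusive range [start, end]."""
    span_days = (end - start).days
    return start + timedelta(days=rng.randint(0, span_days))


def fill_template(text, customers, start, end, rng=None):
    """Replace placeholder tokens in one document.

    A single customer name is chosen per document so it stays consistent,
    while every date token receives its own date from the range.
    """
    rng = rng or random.Random()
    filled = text.replace(CUSTOMER_TOKEN, rng.choice(customers))
    while DATE_TOKEN in filled:
        # Replace one occurrence at a time so each date is drawn separately.
        new_date = random_date(start, end, rng).isoformat()
        filled = filled.replace(DATE_TOKEN, new_date, 1)
    return filled


if __name__ == "__main__":
    sample = "Invoice for {{CUSTOMER}} dated {{DATE}}, due {{DATE}}."
    print(fill_template(
        sample,
        customers=["Acme Ltd", "Globex"],  # made-up names for illustration
        start=date(2024, 1, 1),
        end=date(2024, 1, 31),
    ))
```

Passing a seeded `random.Random` as `rng` would make the generated names and dates reproducible, which may be useful when the same demo data needs to be regenerated.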

Contributors
