scraperHTML_TS

A TypeScript-based web scraping project for downloading and processing catalog data. It provides a web scraper class for parsing HTML and downloading files with Node.js, written in an ES style and following SOLID principles.

Features

  • Written in TypeScript.
  • Follows SOLID principles.
  • Fetches catalog data from a website.
  • Serializes each catalog's information into a data.json file.
  • Saves a PDF file for each catalog, checking that its filename is unique (see the sketch after this list).
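
A minimal sketch of the unique-name check, assuming the scraper appends a short random suffix when a file with the same name already exists; the helper name uniqueFilename and the suffix format are illustrative, not taken from the project.

import * as fs from 'fs';
import * as path from 'path';

// Returns a filename that does not collide with an existing file in `dir`.
// If "<name>.pdf" already exists, append a short random suffix, e.g. "<name>_onep9.pdf".
function uniqueFilename(dir: string, name: string, ext = '.pdf'): string {
    let filename = `${name}${ext}`;
    while (fs.existsSync(path.join(dir, filename))) {
        const suffix = Math.random().toString(36).slice(2, 7); // five random alphanumeric characters
        filename = `${name}_${suffix}${ext}`;
    }
    return filename;
}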

Classes

BaseScraper is the base class; it implements the IScraper interface and is generic over the type of content it scrapes.
CatalogScraper implements the ICatalogScraper interface and extends the BaseScraper class, inheriting its shared functionality.
ServiceProvider encapsulates the logic for processing files and directories (reading, writing, downloading) and provides a set of methods for working with catalogs.
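
For orientation, here is a rough sketch of how these classes could fit together. The interface shapes, method names, and constructor signature are assumptions based on the descriptions above (ICatalog is shown later in this README, IServiceProvider is declared in services.interfaces.ts).

// IScraper is generic over the type of content a scraper produces (assumed shape).
interface IScraper<T> {
    scrape(): Promise<T[]>;
}

// Catalog-specific contract (assumed shape; see interfaces.ts for the real one).
interface ICatalogScraper extends IScraper<ICatalog> {
    serialize(catalogs: ICatalog[]): Promise<void>;
    download(catalog: ICatalog): Promise<void>;
}

// Shared behaviour (service access, logging, error handling) lives in the base class.
abstract class BaseScraper<T> implements IScraper<T> {
    constructor(protected readonly services: IServiceProvider) {}
    abstract scrape(): Promise<T[]>;
}

// CatalogScraper inherits the shared behaviour and adds catalog-specific logic.
class CatalogScraper extends BaseScraper<ICatalog> implements ICatalogScraper {
    async scrape(): Promise<ICatalog[]> { /* fetch and parse the catalog page */ return []; }
    async serialize(catalogs: ICatalog[]): Promise<void> { /* write data.json */ }
    async download(catalog: ICatalog): Promise<void> { /* save the catalog PDF */ }
}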

Services

ServiceProvider provides access to the individual services: FileDownloader, FileManager, HttpClient, HtmlParser, and Serializer.
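
As a hedged illustration of the provider idea, the contract could look roughly like this; the member and method names are assumptions, not the project's actual API (the real contract is IServiceProvider in services.interfaces.ts).

// Illustrative sketch only; member and method names are assumed, not taken from the project.
interface IServiceProvider {
    readonly httpClient: { get(url: string): Promise<string> };                       // raw HTTP requests
    readonly htmlParser: { parseCatalogs(html: string): ICatalog[] };                 // extract catalog entries
    readonly serializer: { toJsonFile(data: unknown, file: string): Promise<void> };  // write data.json
    readonly fileManager: { ensureDir(dir: string): Promise<void> };                  // directory handling
    readonly fileDownloader: { download(url: string, dest: string): Promise<void> };  // save PDFs

}

Bundling the services behind one provider keeps the scrapers decoupled from concrete implementations, in line with the SOLID (dependency inversion) goal mentioned above.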

Catalog interface

interface ICatalog {
    name: string;
    link: string;
    validity: string;
    filename?: string;
    lastParsed: Date;
}
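
Note that lastParsed is a Date in memory; JSON.stringify turns it into an ISO 8601 string, which is the format seen in the data.json example below. A minimal illustration:

const catalog: ICatalog = {
    name: 'AKCIJSKI KATALOG',
    link: 'https://www.tus.si/app/uploads/catalogues/20250324085441_13_AKCIJSKI_LETAK26.3.-1.4.2_iWtyoCN.pdf',
    validity: '26. 03. 2025 - 01. 04. 2025',
    lastParsed: new Date(),
    filename: 'AKCIJSKI KATALOG.pdf',
};

// Date fields serialize to ISO strings, e.g. "2025-04-01T11:31:45.114Z".
console.log(JSON.stringify([catalog], null, 2));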

Project structure

The structure is displayed in a tree format for clarity.

├── engine/
│   ├── classes/                # Core classes implementing business logic
│   │   ├── BaseScraper.ts
│   │   ├── CatalogScraper.ts
│   ├── interfaces/             # TypeScript interfaces for contracts
│   │   ├── interfaces.ts       # Core interfaces (ICatalog, IScraper, ICatalogScraper)
│   │   ├── services.interfaces.ts # Service-specific interfaces (ISerializer, IHttpClient, IHtmlParser, IFileManager, IServiceProvider)
│   ├── services/               # Service layer for reusable utilities
│   │   ├── ServiceProvider.ts  # Service provider to manage dependencies
│   │   ├── Serializer.ts       # Handles serialization
│   │   ├── HtmlParser.ts       # Parses HTML content
│   │   ├── FileManager.ts      # Manages file operations
│   │   ├── HttpClient.ts       # Handles basic HTTP requests
│   ├── utils/                  # Utility functions and helpers
│   │   ├── logger.ts           # Logging utility
│   └── index.ts                # Application Entry Point
├── package.json                # Project metadata and dependencies
├── tsconfig.json               # TypeScript configuration
└── README.md                   # Project documentation
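
The entry point engine/index.ts wires these pieces together; the sketch below shows what that wiring could look like, assuming a no-argument ServiceProvider constructor and the method names from the earlier sketches (both are assumptions).

// Illustrative sketch of engine/index.ts, not the actual file.
import { ServiceProvider } from './services/ServiceProvider';
import { CatalogScraper } from './classes/CatalogScraper';

async function main(): Promise<void> {
    const services = new ServiceProvider();
    const scraper = new CatalogScraper(services);

    const catalogs = await scraper.scrape();   // parse the catalog listing page
    await scraper.serialize(catalogs);         // write data.json
    for (const catalog of catalogs) {
        await scraper.download(catalog);       // save each PDF under a unique name
    }
}

main().catch((error) => console.error(error));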

Example data.json

[
  {
    "name": "AKCIJSKI KATALOG",
    "link": "https://www.tus.si/app/uploads/catalogues/20250324085441_13_AKCIJSKI_LETAK26.3.-1.4.2_iWtyoCN.pdf",
    "validity": "26. 03. 2025 ‐ 01. 04. 2025",
    "lastParsed": "2025-04-01T11:31:45.114Z",
    "filename": "AKCIJSKI KATALOG.pdf"
  },
  {
    "name": "AKCIJSKI KATALOG",
    "link": "https://www.tus.si/app/uploads/catalogues/20250331082428_14_AKCIJSKI_LETAK_2.4.-8.4.2_WKd8GTI.pdf",
    "validity": "02. 04. 2025 ‐ 08. 04. 2025",
    "lastParsed": "2025-04-01T11:31:51.005Z",
    "filename": "AKCIJSKI KATALOG_onep9.pdf"
  }
]

How to run

git clone https://github.com/lsthisloss/scraperHTML_TS.git
cd scraperHTML_TS
npm install
npm run dev

Demo

(demo screenshots)
