A microservice scraping GitHub repositories based on a specific topic

pcolt/playwright-scraper

NodeJS TypeScript Playwright MongoDB Redis ESLint Docker

GitHub's topics scraper with Playwright

A microservice that crawls and scrapes GitHub repositories based on a specific topic (e.g. climatechange). It is part of my final project for the University of Helsinki's Full Stack Open course.

The service is subscribed to a Redis pub/sub message channel and starts a new scraping process whenever a message is received.
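The message-driven flow described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual code: it assumes the message payload is the topic name, and `parseTopic`, `createTopicListener` and the `Scrape` type are hypothetical names. The Redis client is left out so the sketch stays self-contained.

```typescript
// Hypothetical signature: the real scraper's entry point is not documented here.
type Scrape = (topic: string) => Promise<void>;

// Assumption: the pub/sub message body is simply the topic to scrape.
// Returns null for blank messages so the caller can ignore them.
function parseTopic(message: string): string | null {
  const topic = message.trim();
  return topic === "" ? null : topic;
}

// Builds a listener that starts one scrape per non-empty message.
// Resolves to true when a scrape was run, false when the message was ignored.
function createTopicListener(scrape: Scrape): (message: string) => Promise<boolean> {
  return async (message: string): Promise<boolean> => {
    const topic = parseTopic(message);
    if (topic === null) {
      return false;
    }
    await scrape(topic);
    return true;
  };
}
```

A listener built this way could then be passed as the callback when subscribing to the channel with whichever Redis client the service uses.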

The microservice stores the results in a MongoDB Atlas database. The complete result is also saved to local .json and .csv files.

For each repository found, the scraping process returns the following data:

  • owner
  • name
  • URL
  • number of stars
  • description
  • list of repository topics
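As a rough illustration of the data listed above, one scraped repository could be modelled and serialized to CSV as in the sketch below. The field names, the `RepoRecord` type, the `toCsv` helper and the `;` separator for topics are assumptions made for this example; the project's actual schema and file layout may differ.

```typescript
// Illustrative shape of one scraped repository (field names are assumptions).
interface RepoRecord {
  owner: string;
  name: string;
  url: string;
  stars: number;
  description: string;
  topics: string[];
}

// Quote a field when it contains a comma, a double quote, or a line break,
// doubling any embedded double quotes (RFC 4180 style).
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Serialize the records to CSV with a header row; topics are joined with ';'.
function toCsv(repos: RepoRecord[]): string {
  const header = "owner,name,url,stars,description,topics";
  const rows = repos.map((r) =>
    [r.owner, r.name, r.url, String(r.stars), r.description, r.topics.join(";")]
      .map(csvField)
      .join(",")
  );
  return [header, ...rows].join("\n");
}
```

The same array of records can be written to the .json file with `JSON.stringify(repos, null, 2)`.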

Work hours

The approximate work hours spent developing the project are listed in workhours.md.

Installation

Run npm install

Configure secret/environment variables

  • In the root folder, create a .env file with the following keys:
    MONGO_URL = 'mongodb+srv://fullstack:MONGODB_FULLSTACK_USER_PASSWORD@cluster0.ck2n2.mongodb.net/repos?retryWrites=true&w=majority'
    REDIS_URL = 'redis://default:REDIS_DEFAULTUSER_PASSWORD@redis-12236.c300.eu-central-1-1.ec2.cloud.redislabs.com:12236'
  • Set the sensitive data as Fly.io secrets with the commands:
    fly secrets set MONGO_URL='mongodb+srv://fullstack:MONGODB_FULLSTACK_USER_PASSWORD@cluster0.ck2n2.mongodb.net/repos?retryWrites=true&w=majority'
    fly secrets set REDIS_URL='redis://default:REDIS_DEFAULTUSER_PASSWORD@redis-12236.c300.eu-central-1-1.ec2.cloud.redislabs.com:12236'
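Since the service depends on both variables, it can help to fail fast at startup when one of them is missing. The following is a minimal sketch of such a check, not the project's actual code: `loadConfig` and `ServiceConfig` are hypothetical names, and the environment is passed in as a plain object so the function is easy to test.

```typescript
// Hypothetical config shape holding the two connection strings.
interface ServiceConfig {
  mongoUrl: string;
  redisUrl: string;
}

// Reads MONGO_URL and REDIS_URL from the given environment object and
// throws a single error naming every variable that is missing or empty.
function loadConfig(env: Record<string, string | undefined>): ServiceConfig {
  const required = ["MONGO_URL", "REDIS_URL"];
  const missing = required.filter((key) => !env[key]);
  if (missing.length > 0) {
    throw new Error(`Missing environment variables: ${missing.join(", ")}`);
  }
  return {
    mongoUrl: env["MONGO_URL"] as string,
    redisUrl: env["REDIS_URL"] as string,
  };
}
```

In a Node.js process this would typically be called as `loadConfig(process.env)` once the .env file has been loaded, or directly on Fly.io, where secrets are exposed as environment variables.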

Usage

npm run build to compile the TypeScript .ts files located in /src
npm start to run, in dev mode, the compiled files located in the ./build folder
npm run dev to run the TypeScript files on the fly, reloading whenever something changes

Deploy to Fly.io

Check secrets: fly secrets list

Deploy to Fly: fly deploy or npm run deploy

Scale Fly app to 0 machines (stopped): fly scale count 0

Scale Fly app back to 1 machine: fly scale count 1

Show list of Fly apps currently deployed: fly apps list

Show logs from all machines (or filter by id with the -i flag): fly logs

Restart machine: fly machine restart

Docker

Fly.io uses the Docker image to deploy this microservice.
The image can also be built and run directly for debugging.

Build Docker image: docker build . -t scraper

Run Docker image: docker run --env MONGO_URL='MONGO_URL_in_.ENV_FILE' --env REDIS_URL='REDIS_URL_in_.ENV_FILE' scraper

List all containers: docker ps -a
Restart a container: docker restart [container-id]
Follow container logs: docker logs --follow [container-id]

See also: Docker best practices

Git

Print the list of all commits to a .txt file (Docs)

git log --reverse --pretty=format:'| %as | 1 | %s |' > log.txt

Dependencies

MongoDB Atlas

Connect via web app

https://account.mongodb.com/

Redis Cloud

Connect via web app

https://app.redislabs.com/

Connect via terminal

Use the Connect button in the web app, which provides a command like this: redis-cli -u redis://default:REDIS_DEFAULTUSER_PASSWORD@redis-12236.c300.eu-central-1-1.ec2.cloud.redislabs.com:12236

Once you are connected, list the active pub/sub channels with: PUBSUB CHANNELS
