
DataTask

A comprehensive data pipeline leveraging Kafka, Confluent, and MySQL to ingest, process, and store data. This project demonstrates the implementation of a robust data pipeline using Kafka producers and consumers, schema enforcement with Avro, and seamless integration with MySQL.


Table of Contents

  • Overview
  • Features
  • Technologies Used
  • Setup
  • Usage
  • Contributing
  • License

Overview

This project was a take-home task assigned to me during a hiring process for a startup. The goal was to build a complete data pipeline that:

  1. Ingests data using Kafka producers.
  2. Enforces schemas using Confluent's Schema Registry with Avro.
  3. Processes the data using Kafka consumers.
  4. Stores the processed data in a MySQL database for further analysis and querying.

Features

  1. Data Ingestion (Kafka): The system reads payment transaction data from an Apache Kafka stream in real time. Kafka serves as a robust, scalable solution for handling high-throughput, real-time data feeds.
  2. Transaction Scoring (Machine Learning Model): A pre-trained machine learning model scores each transaction based on various features. The model can be designed for fraud detection, credit scoring, or other relevant evaluations (see the sketch after this list).
  3. Data Storage (MySQL): The scored transactions are written into a MySQL database. This integration ensures that scoring results are immediately available for reporting, analysis, and further action.
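
As a rough illustration of the scoring step, the sketch below loads a pre-trained model with joblib and scores a single transaction. The model file name and feature names are assumptions for illustration only; they are not necessarily the ones used in this repository.

    import joblib

    # Assumed pre-trained scikit-learn model saved with joblib (illustrative path).
    model = joblib.load("fraud_model.pkl")

    def score_transaction(txn: dict) -> float:
        # Order the features the way the model expects (assumed feature set).
        features = [[txn["amount"], txn["merchant_id"], txn["hour_of_day"]]]
        # predict_proba returns class probabilities; take the positive-class score.
        return float(model.predict_proba(features)[0][1])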

Technologies Used

  • Kafka: Distributed event streaming platform for real-time data processing.
  • Confluent Platform: Kafka management tools and Schema Registry.
  • Avro: Schema definition and enforcement.
  • MySQL: Relational database for data storage.
  • Python: Programming language used for the admin, producer, and consumer scripts.

Setup

Prerequisites

  • Kafka and Confluent Platform installed.
  • MySQL database setup and running.
  • Required libraries and dependencies installed.

Steps

  1. Kafka Setup:

    • Start ZooKeeper and the Kafka broker.
     $KAFKA_HOME/bin/zookeeper-server-start.sh $KAFKA_HOME/config/zookeeper.properties
     $KAFKA_HOME/bin/kafka-server-start.sh $KAFKA_HOME/config/server.properties
    • Configure Kafka topics for data ingestion and processing. For example, create a topic called Transactions, since that is the name used throughout this project:
    kafka-topics.sh --bootstrap-server localhost:9092 --create --topic Transactions --partitions 1 --replication-factor 1
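    • Alternatively, the topic can be created programmatically with the confluent-kafka AdminClient, roughly what admin.py does later. A minimal sketch, assuming the broker runs on localhost:9092:
     from confluent_kafka.admin import AdminClient, NewTopic

     # Connect to the local broker (address is an assumption; adjust to your cluster).
     admin = AdminClient({"bootstrap.servers": "localhost:9092"})
     futures = admin.create_topics([NewTopic("Transactions", num_partitions=1, replication_factor=1)])
     for topic, future in futures.items():
         try:
             future.result()  # block until the topic is actually created
             print(f"Created topic {topic}")
         except Exception as exc:
             print(f"Topic {topic} not created: {exc}")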
  2. Confluent Setup:

    • Configure and start the Confluent Schema Registry. After downloading the platform tarball, run:
     tar -xzf confluent-7.7.1.tar.gz
     cd confluent-7.7.1
     ./bin/schema-registry-start ./etc/schema-registry/schema-registry.properties
    • Install the Confluent Kafka Python client, which is used for topic management and for the producer and consumer scripts:
     pip install confluent-kafka
  3. Avro Schema:

    • Define and register Avro schemas for data validation. In this project the Avro schema is defined in a separate file; a registration sketch follows below.
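    • A minimal sketch of registering a schema with the Schema Registry via the Python client, assuming the registry runs on localhost:8081. The subject name and record fields are illustrative; the actual schema is the one in the repository's schema file:
     from confluent_kafka.schema_registry import Schema, SchemaRegistryClient

     # Illustrative schema; the real one lives in the repository's schema file.
     schema_str = """
     {
       "type": "record",
       "name": "Transaction",
       "fields": [
         {"name": "transaction_id", "type": "string"},
         {"name": "amount", "type": "double"}
       ]
     }
     """

     client = SchemaRegistryClient({"url": "http://localhost:8081"})  # assumed registry URL
     schema_id = client.register_schema("Transactions-value", Schema(schema_str, "AVRO"))
     print(f"Registered schema with id {schema_id}")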
  4. MySQL Database:

    • Create the necessary database and tables for storing processed data. Open MySQL Workbench or the CLI; I used the CLI:
     mysql -u root -p -h 127.0.0.1

     I used the loopback address instead of localhost because access kept being denied for some reason. Enter your password. If the database does not exist yet (first-time setup), run the SQL script provided in the repository.
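    • If you prefer to set things up from Python instead of the SQL script, a minimal sketch with mysql-connector-python follows. The database, table, and column names here are assumptions; match them to the provided script:
     import mysql.connector  # assumes the mysql-connector-python package is installed

     conn = mysql.connector.connect(host="127.0.0.1", user="root", password="your_password")
     cursor = conn.cursor()
     # Names below are illustrative; use the ones from the repository's SQL script.
     cursor.execute("CREATE DATABASE IF NOT EXISTS transactions_db")
     cursor.execute("""
         CREATE TABLE IF NOT EXISTS transactions_db.scored_transactions (
             transaction_id VARCHAR(64) PRIMARY KEY,
             amount DOUBLE,
             score DOUBLE
         )
     """)
     conn.commit()
     cursor.close()
     conn.close()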

  5. Run the Pipeline:

    • Start the Kafka producer to publish data. You will need to edit the code to set the user, host, password, and bootstrap servers to match your environment.
    • Start Kafka consumers to process data and load it into MySQL.
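    • A minimal sketch of the producer side, assuming the broker and Schema Registry run locally and using illustrative record fields (the real producer.py has its own schema and configuration):
     from confluent_kafka import Producer
     from confluent_kafka.schema_registry import SchemaRegistryClient
     from confluent_kafka.schema_registry.avro import AvroSerializer
     from confluent_kafka.serialization import MessageField, SerializationContext

     # Illustrative schema; keep it in sync with the registered one.
     schema_str = """
     {"type": "record", "name": "Transaction",
      "fields": [{"name": "transaction_id", "type": "string"},
                 {"name": "amount", "type": "double"}]}
     """

     registry = SchemaRegistryClient({"url": "http://localhost:8081"})  # assumed URL
     serializer = AvroSerializer(registry, schema_str)
     producer = Producer({"bootstrap.servers": "localhost:9092"})  # assumed broker

     record = {"transaction_id": "tx-001", "amount": 42.5}
     producer.produce(
         topic="Transactions",
         value=serializer(record, SerializationContext("Transactions", MessageField.VALUE)),
     )
     producer.flush()  # wait for delivery before exiting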

Usage

  1. Run the Admin:
    • This script creates the database table and the Kafka topic if they do not already exist:
      python admin.py
  2. Run the Producer:
    • Publish data using the Kafka producer:
      python producer.py
  3. Run the Consumer:
    • Consume and process data using the Kafka consumer:
      python consumer.py
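    • As a rough sketch of what consumer.py is expected to do (the group id, connection details, and table/column names are assumptions), it reads Avro-encoded transactions, scores them, and inserts the results into MySQL:
      import mysql.connector
      from confluent_kafka import Consumer
      from confluent_kafka.schema_registry import SchemaRegistryClient
      from confluent_kafka.schema_registry.avro import AvroDeserializer
      from confluent_kafka.serialization import MessageField, SerializationContext

      registry = SchemaRegistryClient({"url": "http://localhost:8081"})  # assumed URL
      deserializer = AvroDeserializer(registry)

      consumer = Consumer({
          "bootstrap.servers": "localhost:9092",   # assumed broker
          "group.id": "transaction-scorers",       # illustrative group id
          "auto.offset.reset": "earliest",
      })
      consumer.subscribe(["Transactions"])

      db = mysql.connector.connect(host="127.0.0.1", user="root",
                                   password="your_password", database="transactions_db")
      cursor = db.cursor()

      while True:
          msg = consumer.poll(1.0)
          if msg is None or msg.error():
              continue
          txn = deserializer(msg.value(), SerializationContext("Transactions", MessageField.VALUE))
          score = 0.0  # placeholder for the pre-trained model's score
          cursor.execute(
              "INSERT INTO scored_transactions (transaction_id, amount, score) VALUES (%s, %s, %s)",
              (txn["transaction_id"], txn["amount"], score),
          )
          db.commit()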

Contributing

Contributions are welcome! Please create a pull request or open an issue for feedback and suggestions.


License

Use whatever you want from here. Note, however, that this project uses the Confluent Platform, so make sure you have the proper permissions and licensing if you plan to use it for commercial purposes.

