- Kapline is a platform that uses machine learning to detect and classify malicious applications (Android APKs only)
- This is a project for the subjects *Technologies for Advanced Programming* and *Social Media Management* at UniCT
- The core of the project is built on top of `quark-engine` and `quark-rules`
- Based on the dataset, the model is trained to distinguish 5 classes:
Benign, Riskware, Adware, SMS, Banking.
The pipeline is structured as follows:
| Data | Service |
|---|---|
| Source | User via Telegram Bot |
| Ingestion | Fluentd |
| Transport | Apache Kafka |
| Storage (input) | httpd |
| Processing | Apache Spark |
| Storage (output) | Elasticsearch |
| Visualization | Grafana |
- The frontend is provided by a Telegram bot (for simplicity)
- The Telegram bot container and the `httpd` container share a volume where the files are stored
The bot sends a message to Fluentd in this format:
{
  "userid": long,
  "filename": string,
  "md5": string
}
The field `filename` will be used later to retrieve the file from `httpd`.
- Ingestion is provided by `fluentd`
- Fluentd exposes an HTTP route where it awaits an input event
- In this step, the field `"timestamp": date` is added
- This component writes the message to a Kafka topic named `apk_pointers`
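A minimal Fluentd configuration sketch for this step. The port, tag, and broker address are assumptions, and the `kafka2` output requires the `fluent-plugin-kafka` plugin:

```
# Accept events over HTTP (the bot POSTs here)
<source>
  @type http
  port 9880
</source>

# Add the timestamp field to each record
<filter apk.pointers>
  @type record_transformer
  <record>
    timestamp ${time}
  </record>
</filter>

# Forward enriched records to the Kafka topic consumed by Spark
<match apk.pointers>
  @type kafka2
  brokers kafka:9092
  default_topic apk_pointers
  <format>
    @type json
  </format>
</match>
```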
Data processing is powered by Apache Spark. The workflow is:
- The file is retrieved from `http://httpd/<filename>`
- Then `quark-engine` is run on the retrieved file and all crimes are scored
- The malware family is predicted through machine learning
- The predicted label is sent to the Telegram user who requested the analysis
- A new message is written to a Kafka topic called `analyzed`
The structure of the message is the following:
{
  "timestamp": date,
  "md5": string,
  "features": list[double],
  "size": long,
  "predictedLabel": string
}
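Assembling that message can be sketched as follows; `build_analyzed_message` is a hypothetical helper, with the per-rule scores from quark-engine passed in as the feature vector:

```python
import datetime

def build_analyzed_message(md5: str, size: int,
                           rule_scores: list[float],
                           predicted_label: str) -> dict:
    """Assemble the record written to the `analyzed` Kafka topic (sketch).

    rule_scores holds one score per quark rule, in a fixed rule order
    (204 rules at training time), and doubles as the feature vector.
    """
    return {
        "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "md5": md5,
        "features": rule_scores,
        "size": size,
        "predictedLabel": predicted_label,
    }
```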
Now the message will be enriched with some statistics:
- The rules are grouped by label (see `quark-rules/label_desc.csv` and `utils/extract_labels.py`)
- Some partial scores are calculated (only for labels with at least 4 rules)
- The data is written into Elasticsearch
The structure of a record in Elasticsearch is:
{
"@timestamp": date,
"calendar_score": double,
"calllog_score": double,
"network_score": double,
...
"max_score": double,
"md5": string,
"size": long,
"predictedLabel": string
}
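The enrichment step above can be sketched in Python. The label-to-rule mapping comes from `label_desc.csv`; using the mean as the per-label aggregate is an assumption of this sketch:

```python
def partial_scores(rule_scores: dict[str, float],
                   rule_labels: dict[str, str],
                   min_rules: int = 4) -> dict[str, float]:
    """Compute one aggregate score per label (sketch).

    rule_scores: rule id -> score from quark-engine
    rule_labels: rule id -> label (from label_desc.csv)
    Only labels covering at least `min_rules` rules get a score,
    matching the "at least 4 rules" condition above.
    """
    by_label: dict[str, list[float]] = {}
    for rule, score in rule_scores.items():
        by_label.setdefault(rule_labels.get(rule, "unknown"), []).append(score)
    out = {
        f"{label}_score": sum(scores) / len(scores)
        for label, scores in by_label.items()
        if len(scores) >= min_rules
    }
    out["max_score"] = max(rule_scores.values())
    return out
```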
The dataset was generated with the script `/utils/extractor.py` on the MalDroid dataset. A model was then trained with logistic regression, using the score of each rule as a feature.
The Jupyter notebook used for training is in `spark/model_training.ipynb`.
N.B.: at the time the model was trained there were 204 rules, hence 204 features.
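The training setup can be sketched as follows. This uses scikit-learn's `LogisticRegression` as a stand-in for the Spark ML pipeline in the notebook, with random data in place of the real feature matrix:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

N_RULES = 204  # number of quark rules (and features) at training time
LABELS = ["Benign", "Riskware", "Adware", "SMS", "Banking"]

# Random stand-in data: each row is the per-rule score vector of one APK.
rng = np.random.default_rng(0)
X = rng.random((500, N_RULES))
y = rng.integers(0, len(LABELS), size=500)

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

def predict_family(rule_scores: list[float]) -> str:
    """Map a 204-score feature vector to a predicted family."""
    return LABELS[int(model.predict([rule_scores])[0])]
```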
| Service | URL |
|---|---|
| Bot | @nameofthebot |
| httpd | http://httpd |
| Elasticsearch | https://elasticsearch:9200 |
| Grafana | https://grafana:3000 |
All environment variables in `.env` must be set before running docker-compose:
`cp .env.dist .env`

Run with:
`docker-compose up`

Then just contact the bot and send the APK(s) you want to analyze!
N.B.: There is a limit on the maximum file size that the bot can download (20 MB)



