NWDAF ML

This repository presents a proof of concept (PoC) demonstrating Machine Learning (ML) functionality, based on the Network Data Analytics Function (NWDAF) specifications, for the classification of 5G devices solely from their observed traffic.

Tested Environment Configuration

Software

  • python: 3.13.1 (also tested on 3.13.5)
  • pip: 25.1.1
  • python packages from requirements.txt
  • tshark: 4.4.3
  • perl: 5.40.1 (also tested on 5.42.0)
  • glibc: 2.42
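
To quickly check a local setup against the versions above, a small convenience snippet like the one below can help; it only assumes the tools are on PATH and is not part of the repository:

import subprocess

# Print the first meaningful line of each tool's --version output and
# compare it by eye against the tested configuration listed above.
for cmd in (["python3", "--version"], ["pip", "--version"],
            ["tshark", "--version"], ["perl", "--version"]):
    try:
        out = subprocess.run(cmd, capture_output=True, text=True, check=True)
        print(next(line for line in out.stdout.splitlines() if line.strip()))
    except (FileNotFoundError, subprocess.CalledProcessError, StopIteration):
        print(f"{cmd[0]}: not found or no version reported")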

Hardware

The hardware specifications below concern the ML experiment pipeline (e.g., processing the dataset, training models, and running inference). For the hardware and software requirements for generating a new 5G simulated traffic dataset, please refer to the traffic-gen README file.

Minimum

  • 4x 3.0GHz CPU cores
  • 16GB of RAM
  • A GB of HDD to store the input dataset (where A is the input dataset size)
  • ~11.3xA GB of HDD available to store the temporary files of preprocessed data
  • ~2xA GB of HDD to store the preprocessed data

TIP: The dataset used to obtain these specifications was a subset of the full dataset, containing 19.6GB of training data + 2.4GB of inference data (i.e. A = 22GB); see the values in the Recommended list below for a concrete example.
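
As a concrete back-of-the-envelope check of the multipliers above (with A as the input dataset size in GB):

# Disk-space estimate using the multipliers from the Minimum list,
# where A is the input dataset size in GB (A = 22 in the TIP above).
def disk_estimate_gb(a: float) -> dict:
    return {
        "input dataset": a,
        "temporary preprocessed files": 11.3 * a,
        "final preprocessed data": 2 * a,
    }

for item, size in disk_estimate_gb(22).items():
    print(f"{item}: ~{size:.1f}GB")  # 22.0, 248.6, 44.0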

Recommended

The specifications below were taken from the machine used during the experiments. If you don't have this amount of resources, consider downloading the preprocessed files or splitting the data to be processed into blocks (especially in the pcap_extract.sh and export_JSON.py steps, which require the largest amounts of HDD and RAM); see the splitting sketch at the end of this section.

  • 16x 3.8GHz CPU cores
  • 4x 32GB of RAM 3200MT/s CL16
  • 128GB SWAP (not needed if 256GB of RAM is available)
  • 25GB SSD 5000MB/s write / 3200MB/s read* to store the input dataset
  • 400GB SSD available to store the temporary files of preprocessed data
  • 50GB SSD to store the preprocessed data

* A 450MB/s SATA interface might be enough

NOTE: These specifications are tailored to the datasets we tested on our implementation. The amount of RAM required increases linearly with the size of the dataset, since the dataset is loaded into RAM during training. The number of available CPU cores directly influences the parallel tasks (such as in pcap_extract.sh and export_JSON.py): the more input files there are, the more CPU cores are required.
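
If storage or RAM is tight, one option is to split large captures into fixed-size blocks before running the pipeline, for instance with editcap (shipped with Wireshark/tshark). The block size and directory names below are illustrative assumptions, not part of the repository's scripts:

import subprocess
from pathlib import Path

# Split every capture in ./pcap/input into 1,000,000-packet blocks using
# editcap; "editcap -c N" writes a sequence of numbered output files.
# Paths and block size here are placeholders -- adjust to your storage.
BLOCK_PACKETS = 1_000_000
input_dir = Path("pcap/input")
block_dir = Path("pcap/input_blocks")
block_dir.mkdir(parents=True, exist_ok=True)

for pcap in sorted(input_dir.glob("*.pcap")):
    subprocess.run(
        ["editcap", "-c", str(BLOCK_PACKETS), str(pcap),
         str(block_dir / pcap.name)],
        check=True,
    )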

Quick comparison between [Kim et al. 2022], our previous work, and our current work

The authors of [Kim et al. 2022] implemented the NWDAF module and its submodules (MTLF and AnLF) integrated with free5GC; however, their ML functionality used an image dataset.

Previously, a reproduction of [Kim et al. 2022]'s work was made in [de Oliveira et al. 2024] (that README details the environment used in this process). After that, another ML functionality, closely related to the Computer Networks field, was implemented: instead of using an image dataset, a packet capture dataset containing 6 captures of 1,000 packets each was created. This dataset was used to test the new ML functionality, and the instructions to reproduce the environment used for this second phase are located in that other file. The integration between [Kim et al. 2022]'s NWDAF and [de Oliveira et al. 2024]'s ML functionality wasn't finished.

Our current work focused on two main points: (i) creating a larger, public, 5G simulated PCAP dataset; and (ii) enhancing the classification results obtained in [de Oliveira et al. 2024].

The dataset was created with 1 million packets for each capture. Building upon the implementation by [de Oliveira et al. 2024], the current work completely reimplemented the ML pipeline to include 11 models and 33 features extracted from the PCAP files (previously there were only 3 models and 7 features). The data used for training and testing the models didn't overlap with the data used for inference (more details in the section below). Although it was possible to greatly enhance model performance, the models suffered from overfitting and were unable to correctly classify the packets of the mMTC class.
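
To illustrate the multi-model idea (not the repository's exact 11 models or 33 features), a minimal sketch assuming scikit-learn-style classifiers could look like this:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Illustrative stand-ins: synthetic data with 33 features and 3 classes
# (eMBB/URLLC/mMTC-like), and a small subset of candidate models.
X, y = make_classification(n_samples=1000, n_features=33, n_informative=10,
                           n_classes=3, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = {
    "random_forest": RandomForestClassifier(random_state=42),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    print(f"{name}: test accuracy {model.score(X_test, y_test):.3f}")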

Dataset description

The dataset was divided into two parts: training data and inference data.

In general, classification ML models require a large amount of data to work well. Our previous work contained only 3,000 packets for training the models and another 3,000 to run inferences. To improve on that, we aimed for around 10 million packets per class in the training phase.

Considering an application of our work (5G user equipment traffic classification), we expect real-world systems designed like our implementation to be trained on one dataset and then to use the trained models for inference on another. For example, a mobile network operator might train a classification model in a test environment, then deploy that model in a production environment and use its output to gain insight or make decisions. Because of that, it was decided that the training and inference datasets would not overlap, as sketched below.
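
A hedged sketch of that operator workflow, assuming scikit-learn-style models and joblib for persistence (the data and file names are placeholders):

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# "Test environment": train on one dataset and persist the model.
X_train, y_train = make_classification(n_samples=500, n_features=33,
                                       n_informative=10, n_classes=3,
                                       random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
joblib.dump(model, "traffic_classifier.joblib")

# "Production environment": load the model and classify unseen traffic
# that never overlapped with the training data.
deployed = joblib.load("traffic_classifier.joblib")
X_unseen, _ = make_classification(n_samples=5, n_features=33,
                                  n_informative=10, n_classes=3,
                                  random_state=1)
print(deployed.predict(X_unseen))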

With those objectives in mind, two public 5G datasets ("5G Traffic Datasets" and "5G Campus Networks: Measurement Traces") were found that contained data we could use to train models. For the inference set, the steps described for creating the training datasets were taken into account and reproduced, as closely as possible, in our simulated 5G testing environment.

NOTE: The files listed below are publicly available on this Zenodo record.

Training set

Each entry below is formatted as file_name.pcap: disk_size; number_of_packets; class**; comments.

  • Youtube_cellular.pcap: 12.4GB; 10,774,692; eMBB; obtained from "5G Traffic Datasets"
  • naver5g3-10M.pcap: 11.8GB; 10,248,958; URLLC; a 10M-packet cut of the file naver5g3.pcap, obtained from "5G Traffic Datasets"
  • *.100.pcap (10 files in total): 169.6MB; 1,000,000***; mMTC; obtained from "5G Campus Networks: Measurement Traces"

Inference set

Each entry below is formatted as file_name.pcap: disk_size; number_of_packets; class**; comments.

  • youtube-1M-1080p.pcap: 1.3GB; 1,004,464; eMBB; captured during the playback of this playlist
  • naver-tv-1M.pcap: 1.1GB; 1,042,918; URLLC; captured during the video live streaming of this channel
  • udp-100pps.pcap: 97.4MB; 1,069,973; mMTC; captured using the UDP client at 100 packets/sec (as described in [Rischke et al. 2021], by the same authors as the "5G Campus Networks: Measurement Traces" dataset)
  • udp-nc-traffic-1k.pcap: 88kB; 1,007; mMTC; captured using the UDP client with the probabilistic approach described in [Sivanathan et al. 2017]

** Classes based on ITU-R Recommendation M.2083-0

*** To the best of our knowledge, there isn't any real-world PCAP dataset containing 10 million mMTC packets in a single capture; the closest we could find at this time were the captures from the "5G Campus Networks: Measurement Traces" dataset, which account for 1 million packets in total

NOTE: The scripts used to create this data and details of the environment are available in the traffic-gen folder.
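
For reference, a minimal stand-in for the 100 packets/sec UDP client behind udp-100pps.pcap might look like the sketch below. This is NOT the repository's traffic-gen script; the destination address, payload size, and duration are placeholder assumptions:

import socket
import time

# Illustrative 100 packets/sec UDP sender. 192.0.2.1 is a TEST-NET-1
# documentation address; replace the destination, payload and duration.
DEST = ("192.0.2.1", 5005)
PPS = 100
PAYLOAD = b"\x00" * 64
DURATION_S = 10

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
try:
    for _ in range(PPS * DURATION_S):
        sock.sendto(PAYLOAD, DEST)
        time.sleep(1.0 / PPS)  # coarse pacing; real tools pace more precisely
finally:
    sock.close()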

Install the prerequisites

NOTE: The commands in this section are intended for a Bash shell

  1. Clone the repo
git clone https://github.com/netlabufjf/nwdaf_ml.git
  2. Install Python 3, pip, and a virtual environment
sudo apt install python3 python3-pip python3-venv
cd nwdaf_ml # enter the repository's root folder
python -m venv pyvenv
source pyvenv/bin/activate
  3. Install the required Python packages
pip install -r requirements.txt
  4. Install Perl and the Perl JSON module

Example:

sudo apt install perl
cpan install JSON

Usage

  1. Go to the root folder and activate the virtual environment
cd nwdaf_ml # enter the repository's root folder
source pyvenv/bin/activate
  2. Move the input PCAP files to ./pcap/input or edit the path in the scripts

NOTE: For more information on the directory structure, check the pcap-folder-dir-tree files on this page

  3. Execute the steps to extract the PCAP data and obtain some statistics
bash pcap_extract.sh
python dataset_CSV_characterization.py
  4. Prepare the dataset for model training
python export_JSON.py
bash add_label_to_name.sh
  5. Execute the Machine Learning script to preprocess the data and train the models
python ml.py
  6. Execute the inference using the trained models
python inference.py

NOTE: In its current implementation, the inference script checks for data that still needs to be preprocessed before running inference. This was designed to handle the use case where the inference data is added to the input folder only after the training pipeline has been run on the training data.
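
A sketch of that check (the directory layout and file extensions are assumptions, not the exact logic of inference.py):

from pathlib import Path

# Before inference, find input captures that have no preprocessed
# counterpart yet; hypothetical paths and .json extension for illustration.
input_dir = Path("pcap/input")
preprocessed_dir = Path("pcap/preprocessed")

pending = [p for p in sorted(input_dir.glob("*.pcap"))
           if not (preprocessed_dir / f"{p.stem}.json").exists()]

if pending:
    print("Captures still needing preprocessing:")
    for p in pending:
        print(f"  {p.name}")
else:
    print("All captures preprocessed; ready for inference.")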

Statistics

  1. To obtain some statistics, run steps 1-3 from the previous section, then:
python stat-plotter.py
  2. To obtain more statistics (box plots), run up to the first command of step 4 from the previous section, then:
python box-plotter.py

Execution order

The diagram below outlines the execution order of the scripts after the data is loaded into the input folder.

[Workflow diagram]

Using the traffic generator scripts

Please refer to the traffic-gen README file

Automated pipeline execution

The script run-all.sh was designed to execute all the pipeline steps automatically.

Load the PCAP data into the input folder, install the prerequisites, then run the run-all.sh script.

NOTE: For more information on the directory structure, check the pcap-folder-dir-tree files on this page

Citing this work

Please cite it as:

L. A. de Oliveira, "User equipment traffic classification in the 5G core", UFJF, June 2025.

Or use the BibTeX below:

@mastersthesis{de_Oliveira_2025,
    title={User equipment traffic classification in the 5G core},
    author={{de Oliveira}, Leonardo Azalim},
    month={06},
    year={2025},
    url={https://repositorio.ufjf.br/jspui/handle/ufjf/19385},
    doi={10.5281/zenodo.17635905},
    school={UFJF},
    publisher={Universidade Federal de Juiz de Fora (UFJF)},
    copyright={Creative Commons Attribution 3.0 Unported},
    language={en}
}

Acknowledgements

I'd like to acknowledge Mr. Rodrigo Oliveira for all the comments and tips he gave during the coding phase of this work. I'd also like to thank the anonymous reviewers and conference participants who provided valuable feedback on our previous work, contributing to the development of this research.

License

The original code from upstream did not explicitly specify any license terms. However, the work contained in this repository is licensed under the GPLv3, as indicated in the LICENSE file, which is reflected in the notice provided below:

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License version 3 as published by the Free Software Foundation.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

For the json2csv submodule license, check its own notice.
