Coflex: Enhancing HW-NAS with Sparse Gaussian Processes for Efficient and Scalable Software-Hardware Co-Design
Coflex is a hardware-aware neural architecture search (HW-NAS) optimizer that jointly considers key parameters from the software-side neural network architecture and corresponding hardware design configurations. It operates through an iterative co-optimization framework consisting of a multi-objective Bayesian optimizer (front-end) and a performance evaluator (back-end).
In each optimization iteration, Coflex takes candidate configurations as input, evaluates their actual performance trade-offs between software accuracy (e.g., error rate) and hardware efficiency (e.g., energy-delay product), and updates the surrogate models in the Bayesian optimizer accordingly. This process enables Coflex to progressively refine the Pareto front toward a designated reference point (e.g., (0,0)) in the objective space, effectively navigating the inherent conflict between software and hardware objectives.
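The notion of "refining the Pareto front toward the reference point" can be illustrated by tracking the distance from a candidate front to the ideal point (0, 0) across iterations. The sketch below is a generic illustration over (error rate, EDP) pairs, not Coflex's actual convergence metric:

```python
import math

def distance_to_ideal(front, ref=(0.0, 0.0)):
    """Smallest Euclidean distance from any front point to the ideal
    reference point; shrinking values indicate the front is advancing."""
    return min(math.dist(p, ref) for p in front)

# Hypothetical fronts from two successive iterations, as (error rate, EDP):
early = [(0.30, 4.0), (0.25, 5.0)]
late  = [(0.20, 3.0), (0.15, 3.5)]
assert distance_to_ideal(late) < distance_to_ideal(early)  # front moved toward (0, 0)
```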
After multiple iterations, Coflex converges to a near-globally optimal Pareto front, where each point represents a non-dominated configuration offering an optimal trade-off between software performance and hardware cost. The final output provides interpretable architectural design recommendations for both neural network developers and hardware architects, along with the expected performance metrics of each configuration. As a result, Coflex delivers an automated, end-to-end software-hardware co-design pipeline.
The search space of HW-NAS encompasses a high-dimensional hyperparameter space composed of both software-wise parameters (e.g., neural network architectural choices) and hardware-wise parameters (e.g., hardware resource configurations). To initialize the optimization process, Coflex performs uniform sampling across all dimensions of this joint search space. These sampled configurations are then used to construct the initial Gaussian surrogate models within the multi-objective Bayesian optimization front-end.
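A minimal sketch of this initialization step, assuming a toy joint space — the parameter names and value ranges below are illustrative, not Coflex's actual search dimensions:

```python
import numpy as np

# Illustrative joint search space: software-wise and hardware-wise parameters.
search_space = {
    "num_layers":    [4, 8, 12, 16],   # software-wise (architecture choice)
    "channel_width": [16, 32, 64],     # software-wise (architecture choice)
    "pe_array_dim":  [8, 16, 32],      # hardware-wise (resource configuration)
    "sram_kb":       [128, 256, 512],  # hardware-wise (resource configuration)
}

def uniform_init(space, n_init, seed=0):
    """Sample n_init configurations uniformly across every dimension of the
    joint space; these samples seed the initial Gaussian surrogate models."""
    rng = np.random.default_rng(seed)
    return [{k: v[rng.integers(len(v))] for k, v in space.items()}
            for _ in range(n_init)]

init_configs = uniform_init(search_space, n_init=100)
```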
This work leverages multiple standardized NAS benchmark suites to provide consistent neural architecture input representations for the Coflex optimizer. These benchmarks serve as the input source for both software and hardware configuration spaces.
If you wish to run Coflex on a specific NAS benchmark, please refer to the table below for the corresponding repository links. Make sure to download and store the datasets according to the instructions provided in the How to Run section.
Coflex is designed with high extensibility, supporting diverse NAS benchmarks across various tasks. If you intend to apply Coflex to a new benchmark not covered in this work, you may edit the internal data mapping logic in the Software Performance Evaluator and Hardware Performance Evaluator (DeFiNES) modules to ensure compatibility with the new input/output format.
Note:
- Hw space = Hardware search space size
- Sw space = Software search space size
- Total Parameters = Joint search space size = Hw × Sw
| Suite | NATS-Bench-SSS | TransNAS-Bench-101 | NAS-Bench-201 | NAS-Bench-NLP |
|---|---|---|---|---|
| ⚙️ Hw space | 2.81×10¹⁴ | 2.81×10¹⁴ | 2.81×10¹⁴ | 2.02×10¹⁵ |
| 🧠 Sw space | 3.20×10⁴ | 4.10×10³ | 6.50×10³ | 1.43×10⁴ |
| 📈 Total Parameters | 9.22×10¹⁸ | 1.15×10¹⁸ | 1.83×10¹⁸ | 2.89×10¹⁹ |
Coflex tackles the scalability bottlenecks in hardware-aware NAS by introducing a two-level sparse Gaussian process (SGP) framework:
🔹 Per-objective SGPs reduce complexity by modeling each optimization objective separately.
🔹 Pareto-based fusion combines these models using non-dominance filtering to preserve multi-objective structure.
This design enables Coflex to efficiently explore massive software-hardware search spaces (10¹⁹+ configs) while maintaining high-fidelity trade-off modeling.
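The non-dominance filtering step can be sketched as follows for two minimization objectives (e.g. error rate and EDP). This is a generic Pareto filter illustrating the idea, not Coflex's exact fusion code:

```python
def dominates(q, p):
    """q dominates p if q is no worse in every objective and strictly
    better in at least one (all objectives are minimized)."""
    return all(a <= b for a, b in zip(q, p)) and any(a < b for a, b in zip(q, p))

def non_dominated(points):
    """Keep only the points not dominated by any other candidate."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

candidates = [(0.10, 5.0), (0.20, 3.0), (0.30, 4.0), (0.15, 6.0)]
front = non_dominated(candidates)  # → [(0.10, 5.0), (0.20, 3.0)]
```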
To handle the scalability bottlenecks of standard Gaussian Processes in large-scale HW-NAS tasks, Coflex adopts sparse GP modeling with inducing points. Instead of maintaining a full covariance matrix, Coflex approximates it using a low-rank structure derived from a small set of representative inducing inputs. This significantly reduces computational cost and improves stability, enabling fast and reliable optimization over high-dimensional software-hardware design spaces.
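The low-rank idea can be illustrated with a Nyström-style approximation built from a small set of inducing inputs. This is a generic sketch assuming an RBF kernel, not Coflex's exact SGP implementation:

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """Squared-exponential kernel matrix between row sets A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 6))                     # n training inputs
Z = X[rng.choice(len(X), size=50, replace=False)]  # m << n inducing inputs

Kzz = rbf(Z, Z) + 1e-6 * np.eye(len(Z))  # jitter for numerical stability
Kxz = rbf(X, Z)
# Nyström low-rank surrogate for the full n-by-n covariance matrix:
# K ≈ Kxz @ Kzz^{-1} @ Kxz.T, costing O(n m^2) instead of O(n^3).
K_lowrank = Kxz @ np.linalg.solve(Kzz, Kxz.T)
```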
🔹Download Link: FRCN_Simulator
🔹Download Link: RBFleX-NAS
This project supports two types of hardware deployment evaluators: DeFiNES and Scale-Sim, each offering distinct trade-offs between evaluation speed and accuracy:
# Scale-Sim is employed as a fast yet lower-accuracy evaluator.
# Average evaluation time: 3–5 seconds per query
# Output: Estimated cycle count
# Use case: Suitable for quick, large-scale architecture assessments during the early-stage search or pruning processes.
# DeFiNES serves as a high-accuracy, hardware-faithful evaluator, albeit with slower evaluation speed.
# Average evaluation time: ~200 seconds per query
# Accuracy:
# Average latency prediction error: ~3%
# Worst-case latency error (e.g., FSRCNN): up to 10%
# Energy prediction error: within 6%
# Use case: Ideal for precise, end-stage performance estimation and final candidate ranking.
Please download the hardware deployment evaluators from the following links and follow the instructions in the Preprocessing for Reproduction section to correctly install them for reproducing the results presented in the paper.
🔹Download Link: DeFiNES
🔹Download Link: Scale-Sim
pip install -r requirements.txt
Please follow the steps below to correctly set up the working environment for reproducing the experimental results of COFleX:
🔹Set the Working Directory
cd COFleX/
Choose COFleX/ as the root working directory.
🔹Unpack Required Archives
unzip COFleX_Analysis.zip -d COFleX/
unzip design_space.zip -d COFleX/
🔹Download & Unzip NAS-Benchmark
The Coflex framework supports multiple NAS benchmarks. Please use the corresponding download links as needed.
For NATS-Bench-SSS, Download Link:
NATS-sss-v1_0-50262-simple
unzip NATS-sss-v1_0-50262-simple.zip -d COFleX/
🔹Prepare Dataset
Download the ImageNet/val dataset and place it into the following directory:
The CIFAR-10 and CIFAR-100 datasets will be automatically downloaded by the program into COFleX/dataset/.
The ImageNet/val subset must be manually downloaded or obtained via the command line if a valid URL is available:
wget "https://your-server.com/path-to/imagenet_val.zip" -O imagenet_val.zip
mkdir -p COFleX/dataset/
unzip imagenet_val.zip -d COFleX/dataset/val/
🔹Install Required Simulators
Download and place DeFiNES & Scale-Sim into the specified directory:
unzip DeFiNES.zip -d COFleX/Simulator/
unzip ScaleSim.zip -d COFleX/Simulator/
Please ensure all environment variables and simulator dependencies are properly configured as described in each simulator's official documentation.
This work supports diverse workload inputs. Please refer to the following section for parameter redefinitions to adapt the implementation to your local execution environment:
# run_sss.py
# * Line 5
acc_code_path = "your-path-to/COFleX/COFleX_Analysis/RBFleX/imageNet_SSS"
# * Line 108 ~ 113
for N_HYPER in [10]:  # 5, 10, 30
    for ACQU in ["Coflex", "qNParEGO", "qNEHVI", "qEHVI", "random", "nsga", "pabo"]:
        for ITERS in [30]:  # 5, 15, 30, 45
            for N_INIT in [100]:  # 10, 50, 100, 300
                for BS in [10]:  # 1, 4, 10
                    for H_ARCH in ["DeFiNES"]:  # "ScaleSim", "DeFiNES"
# * Line 182 & 183
parser.add_argument('-ih', '--IN-H', default='your-image-H_size', type=int, help='Height of input image for faster RCNN (default: 224)')  # 224, 32
parser.add_argument('-iw', '--IN-W', default='your-image-W_size', type=int, help='Width of input image for faster RCNN (default: 224)')  # 224, 32
# Simulator/FRCN_Simulator.py
# * Line 113
benchmark_root="your-path-to/COFleX/NATS-sss-v1_0-50262-simple",
# * Line 144
img_root="your-path-to/COFleX/COFleX/dataset"
# COFleX_Analysis/RBFleX/imageNet_SSS/Check_acc.py
# * Line 16
api_loc = 'your-path-to/COFleX/NATS-sss-v1_0-50262-simple'
# * Line 20
accuracy, latency, time_cost, current_total_time_cost = searchspace.simulate_train_eval(uid, dataset='select-dataset-as-you-want!', hp='90')  # "cifar10", "cifar100", "ImageNet16-120"
To reproduce the Figs/Tabs results, simply start with
run_sss.py
# Global Search in NATS Benchmark
# Supported Datasets: CIFAR10, CIFAR100, ImageNet
# Executed task: Image Classification
python run_sss.py
Output Results Storage Location & Figs Reproduction
When the program completes execution successfully, the results will be stored under COFleX\COFleX_result\, which will include:
# train_input.py, representing the final software and hardware parameters generated through the HW-NAS optimization process
# train_output.py, representing the results obtained in each objective dimension during multi-objective optimization, which form the Pareto front
# hv.py, containing the Dominated Hypervolume progression of all solution sets searched by each HW-NAS method in every iteration
# opt_vs_time_analys.py, recording the solutions retained by each HW-NAS method during each iteration, demonstrating the optimization efficiency and convergence ability over time
# opt_efficiency_analys.py, recording the maximized software performance and minimized hardware consumption achieved in each dimension during iterative optimization
To easily reproduce the figures presented in the paper, you may optionally download from
Results Saving
The saving folder contains five figure plotting scripts:
# 1_run_ploting_pareto_fronts.py, used to plot the Pareto front formed by multi-objective optimization, illustrating the trade-off relationships
# 2_run_inverted_generational_dis.py, used to show the Pareto front closest to the reference point (0, 0). The hyper-space enclosed by this front and the reference point is called the Pareto Optimal Region, demonstrating the algorithm’s contraction and advancement capability. The smaller the value, the better the final optimized solution set
# 3_run_hypervolume.py, used to show the Dominated Hypervolume of all solution sets searched by the HW-NAS algorithm over multiple iterations, reflecting the algorithm’s exploration ability in the search space. A larger value indicates a more comprehensive exploration, avoiding local optima
# 4_run_opt_efficiency_analysis.py, records the solutions retained by each HW-NAS method during each iteration, demonstrating optimization efficiency and convergence ability across iterations
# 5_run_opt_vs_time_analysis.py, records the maximized software performance and minimized hardware consumption achieved in each dimension during iterative optimization
Please refer to the following section for your-path-to redefinitions to adapt the implementation to your local execution environment, then you may run:
python 1_run_ploting_pareto_fronts.py
This reproduces the results presented in Figure 4(a) of the paper. The expected output is illustrated as follows.
python 2_run_inverted_generational_dis.py
This reproduces the results presented in Figure 4(b) of the paper. The expected output is illustrated as follows.
python 3_run_hypervolume.py
This reproduces the results presented in Figure 4(c) of the paper. The expected output is illustrated as follows.
python 4_run_opt_efficiency_analysis.py
This reproduces the results presented in Figure 4(d), (e) & (f) of the paper. The expected output is illustrated as follows.
python 5_run_opt_vs_time_analysis.py
This reproduces the results presented in Figure 4(f) of the paper. The expected output is illustrated as follows. Coflex's optimization process demonstrates better stability, maintaining a lower Err-vs-EDP relationship in both the early and later stages, with clear convergence within the limited number of iterations, indicating that Coflex may possess global optimal search capability. Compared with other methods, Coflex holds a clear optimization advantage.
If you wish to retrain all HW-NAS algorithms on different workloads, please copy the result package from COFleX\COFleX_result\ into the Results Saving directory, and update the path configs in all scripts under the saving folder to match your local deployment environment.







