Parmesan is an efficient design flow that maps the training of a large DNN onto a system with a general device topology to maximize training throughput. It works end to end and solves the overall optimization problem in two phases: the first phase produces well-balanced partitions, and the second phase places the DNN on devices connected by an arbitrary-topology network, accounting for the heterogeneity of the interconnection bandwidth.
The repository is organized as follows:

- `comp_graph`: PyTorch computational graph extractor and profiler.
- `device_topo`: device topology data structures; currently used to compute communication time.
- `experiments`: experiment scripts.
- `optimizer`: optimization algorithms (incl. model partitioning and device mapping).
- `runtime`: pipeline scheduler.
- `simulator`: simulator.
- `test_networks`: PyTorch module definitions for development use.
- `tools`: example code for communication profiling and a tool script for checking GPU utilization.
- `misc`: directory for saving generated data files (e.g., TensorBoard visualizations, images, model checkpoints).
Entry scripts:

- `run_dry_optimization.py`: example code for performing partitioning and mapping with dry-run optimization. Sample input and output are given in `./misc/text_graph_in/ResNet` and `./misc/text_graph_out/ResNet`, respectively.
- `run_experiments.py`: script for generating model definition files.
- `run_new_runtime.py`: script for pipeline training. Note that `load_partition` must always be given. Alternatively, you can run pipeline training in batch mode using `python3 experiments/batch_runtime_latency.py --work_dir experiments/batch_xxx`.
`Optimizer.optimization(opt_method, **kwargs)`

- `opt_method` (str): type of optimization algorithm. For our algorithm, use `"parmesan"`.
For `opt_method="parmesan"`, the following keyword arguments can be specified (a usage sketch follows the list):

- `k` (int): number of partitions after the coarsening stage.
- `comm_coef` (float): communication coefficient in coarsening.
- `new_flow` (bool): if True, the optimization flow is coarsening + DP + uncoarsening; otherwise, it is coarsening + uncoarsening + DP.
- `search_k`, `search_comm_coef` (bool): if True, search the corresponding parameter and determine the best setting according to simulation time. When search is enabled, the manually set value for that parameter is ignored. See `optimizer/search_params.py` for more details.
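Below is a minimal sketch of a `parmesan` call. The import path and the `Optimizer` constructor arguments (`graph`, `topo`) are assumptions for illustration; only the keyword arguments follow the list above.

```python
# Hedged sketch: the import path and constructor arguments are assumptions,
# not part of the documented interface.
from optimizer import Optimizer

opt = Optimizer(graph, topo)  # hypothetical: a computational graph and a device topology
opt.optimization(
    opt_method="parmesan",
    k=16,                    # number of partitions after coarsening
    comm_coef=0.1,           # communication coefficient in coarsening
    new_flow=True,           # flow: coarsening + DP + uncoarsening
    search_k=True,           # search k; the manual k above is then ignored
    search_comm_coef=False,  # keep the manually set comm_coef
)
```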
For `opt_method="ensemble"`, the following keyword arguments must be specified (sketched below):

- `sim` (Simulator): the simulator object.
- `opt_method_to_params` (dict): for each item, the key is an `opt_method` string and the value is the dict of the corresponding `kwargs`.
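The shape of these arguments might look as follows; how the `Simulator` is constructed is not documented here, so that part is only a placeholder.

```python
# Hedged sketch: Simulator construction is a placeholder; each value dict
# holds the kwargs of its opt_method.
sim = Simulator(...)  # hypothetical construction
opt.optimization(
    opt_method="ensemble",
    sim=sim,
    opt_method_to_params={
        "parmesan": {"k": 16, "comm_coef": 0.1, "new_flow": True},
        # further opt_method entries can be added here
    },
)
```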
The `sim_params` (dict) can be specified to configure the simulation (example below):

- `concurrent_copy_comp` (bool): if True, concurrent copy and computation will be adopted in the simulation.
- `debug` (bool): if True, a simulation tracing file will be written to `./misc/coarsen_graph/sim_tracing.json`. You can open it with `chrome://tracing`.
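For illustration, a `sim_params` dict enabling both options could look like this; how it is passed to the flow (presumably through `Optimizer.optimization`) is an assumption.

```python
# Hedged sketch of sim_params; keys follow the list above.
sim_params = {
    "concurrent_copy_comp": True,  # overlap copy and computation in simulation
    "debug": True,                 # write ./misc/coarsen_graph/sim_tracing.json
}
```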
The `mapping_params` (dict) can be specified to configure the device mapping (example below):

- `incl_allr` (bool): if True, mapping will consider all-reduce.
- `incl_p2p` (bool): if True, mapping will consider p2p communication.
- `parallel_same_t` (int): the number of threads for the same target `t`.
- `parallel_degree` (int): the number of threads for all targets. The number of different targets is determined by `parallel_degree / parallel_same_t`.
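A `mapping_params` sketch with illustrative values: with `parallel_degree=8` and `parallel_same_t=2`, 8 / 2 = 4 different targets are explored in parallel.

```python
# Hedged sketch of mapping_params; the values are illustrative.
mapping_params = {
    "incl_allr": True,     # consider all-reduce in mapping
    "incl_p2p": True,      # consider p2p communication in mapping
    "parallel_same_t": 2,  # threads per target t
    "parallel_degree": 8,  # total threads; 8 / 2 = 4 distinct targets
}
```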
- PyTorch == 1.10.1
- CUDA == 11.3
- Numba == 0.54.1
@article{liu2024parmesan,
  author={Liu, Lixin and Liu, Tianji and Jiang, Bentian and Young, Evangeline F.Y.},
  journal={IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (TCAD)},
  title={Parmesan: Efficient Partitioning and Mapping Flow for DNN Training on General Device Topology},
  year={2024},
}

Parmesan is an open-source project licensed under the BSD 3-Clause License, which can be found in the LICENSE file.

