### Describe the bug
When calculating the communication bandwidth with TP in `get_latency_fwd_per_tp_comm` and `get_latency_fwd_per_layer_shared_dp_comm`, the former always defaults to intra-node bandwidth, and the latter depends on a magic number `8`, which I assume refers to `NUM_GPUS_PER_NODE`.
llm-analysis/llm_analysis/analysis.py, lines 1221 to 1223 in d841e40:

```python
# assuming tp_size <= number of GPUs per node, thus using intra-node bandwidth
latency_per_all_reduce = (elems_per_all_reduce * dtype_bytes /
                          (self.get_intra_node_bandwidth() * 10**9))
```
llm-analysis/llm_analysis/analysis.py, lines 1247 to 1250 in d841e40:

```python
# assuming tp and dp are preferred when sharding intra node, pp is only applied across nodes
# when (dp_size * tp_size) <= 8, the data parallel processes are within a node
bandwidth = self.get_intra_node_bandwidth() if (
    dp_size * tp_size) <= 8 else self.get_inter_node_bandwidth()
```
llm-analysis/llm_analysis/constant.py, line 37 in d841e40:

```python
NUM_GPUS_PER_NODE = 8  # number of GPUs per node
```
### Expected behavior
For `get_latency_fwd_per_tp_comm`, it should use `get_intra_node_bandwidth` when `tp_size <= NUM_GPUS_PER_NODE` and `get_inter_node_bandwidth` otherwise.
For `get_latency_fwd_per_layer_shared_dp_comm`, the magic number `8` should be replaced with `NUM_GPUS_PER_NODE`.
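
A minimal sketch of both changes, reusing the variable names from `analysis.py` and assuming `NUM_GPUS_PER_NODE` is imported from `llm_analysis.constant` (not tested):

```python
from llm_analysis.constant import NUM_GPUS_PER_NODE

# in get_latency_fwd_per_tp_comm: pick the bandwidth based on tp_size
# instead of always assuming the all-reduce stays within a node
bandwidth = (self.get_intra_node_bandwidth() if tp_size <= NUM_GPUS_PER_NODE
             else self.get_inter_node_bandwidth())
latency_per_all_reduce = (elems_per_all_reduce * dtype_bytes /
                          (bandwidth * 10**9))

# in get_latency_fwd_per_layer_shared_dp_comm: replace the magic number 8
bandwidth = (self.get_intra_node_bandwidth()
             if (dp_size * tp_size) <= NUM_GPUS_PER_NODE
             else self.get_inter_node_bandwidth())
```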
Looking at training, `tp_size <= NUM_GPUS_PER_NODE` seems to be enforced rather than merely suggested; should the same check be applied to `infer` as well?
llm-analysis/llm_analysis/analysis.py, lines 2695 to 2699 in d841e40:

```python
assert tp_size <= num_gpus_per_node, (
    f"tp_size must be <= {num_gpus_per_node}(num_gpus_per_node), tensor"
    " parallelism requires high communication bandwidth to be efficient"
    " and is best kept within a single node where high bandwidth NVLink"
    " is available.")
```
### Additional context
I'd be more than happy to provide a PR if this report is valid.
### Minor Issue
A default for `mlp_gated_linear_units` is never set when execution enters the first `if` but misses the second one. This can be reproduced with:

```
python3 -m llm_analysis.analysis infer -m meta-llama/Llama-3.1-405b
```
llm-analysis/llm_analysis/config.py, lines 216 to 221 in d841e40:

```python
if ffn_embed_dim:
    expansion_ratio = ffn_embed_dim / hidden_dim
    if expansion_ratio == 3.5:
        mlp_gated_linear_units = True
else:
    mlp_gated_linear_units = False
```
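
One possible fix is to initialize the default before branching so that every path assigns a value; a sketch, keeping the existing names:

```python
# Set the default up front so a value is assigned even when
# ffn_embed_dim is given but expansion_ratio != 3.5.
mlp_gated_linear_units = False
if ffn_embed_dim:
    expansion_ratio = ffn_embed_dim / hidden_dim
    if expansion_ratio == 3.5:
        mlp_gated_linear_units = True
```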