### Describe the bug
When calculating the communication bandwidth with TP in `get_latency_fwd_per_tp_comm` and `get_latency_fwd_per_layer_shared_dp_comm`, the former always defaults to intra-node bandwidth, and the latter depends on a magic number `8`, which I assume refers to `NUM_GPUS_PER_NODE`.
llm-analysis/llm_analysis/analysis.py, lines 1221 to 1223 in d841e40:

```python
# assuming tp_size <= number of GPUs per node, thus using intra-node bandwidth
latency_per_all_reduce = (elems_per_all_reduce * dtype_bytes /
                          (self.get_intra_node_bandwidth() * 10**9))
```
llm-analysis/llm_analysis/analysis.py, lines 1247 to 1250 in d841e40:

```python
# assuming tp and dp are preferred when sharding intra node, pp is only applied across nodes
# when (dp_size * tp_size) <= 8, the data parallel processes are within a node
bandwidth = self.get_intra_node_bandwidth() if (
    dp_size * tp_size) <= 8 else self.get_inter_node_bandwidth()
```
llm-analysis/llm_analysis/constant.py, line 37 in d841e40:

```python
NUM_GPUS_PER_NODE = 8  # number of GPUs per node
```
### Expected behavior
For `get_latency_fwd_per_tp_comm`, it should use `get_intra_node_bandwidth` when `tp_size <= NUM_GPUS_PER_NODE` and `get_inter_node_bandwidth` otherwise.
For `get_latency_fwd_per_layer_shared_dp_comm`, the magic number `8` should be replaced with `NUM_GPUS_PER_NODE`.
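
A minimal sketch of both changes, reusing the variable names from `analysis.py` and assuming `NUM_GPUS_PER_NODE` is imported from `llm_analysis.constant` (not tested):

```python
from llm_analysis.constant import NUM_GPUS_PER_NODE

# in get_latency_fwd_per_tp_comm: pick the bandwidth based on tp_size
# instead of always assuming the all-reduce stays within a node
bandwidth = (self.get_intra_node_bandwidth() if tp_size <= NUM_GPUS_PER_NODE
             else self.get_inter_node_bandwidth())
latency_per_all_reduce = (elems_per_all_reduce * dtype_bytes /
                          (bandwidth * 10**9))

# in get_latency_fwd_per_layer_shared_dp_comm: replace the magic number 8
bandwidth = (self.get_intra_node_bandwidth()
             if (dp_size * tp_size) <= NUM_GPUS_PER_NODE
             else self.get_inter_node_bandwidth())
```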
Looking at training, `tp_size <= NUM_GPUS_PER_NODE` seems to be enforced rather than merely suggested; should the same check be applied to `infer` as well?
llm-analysis/llm_analysis/analysis.py, lines 2695 to 2699 in d841e40:

```python
assert tp_size <= num_gpus_per_node, (
    f"tp_size must be <= {num_gpus_per_node}(num_gpus_per_node), tensor"
    " parallelism requires high communication bandwidth to be efficient"
    " and is best kept within a single node where high bandwidth NVLink"
    " is available.")
```
### Additional context
I'd be more than happy to provide a PR if this report is valid.
### Minor Issue
A default for `mlp_gated_linear_units` is never set when execution enters the first `if` but misses the second one. This can be reproduced with:

```
python3 -m llm_analysis.analysis infer -m meta-llama/Llama-3.1-405b
```
llm-analysis/llm_analysis/config.py, lines 216 to 221 in d841e40:

```python
if ffn_embed_dim:
    expansion_ratio = ffn_embed_dim / hidden_dim
    if expansion_ratio == 3.5:
        mlp_gated_linear_units = True
else:
    mlp_gated_linear_units = False
```
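
One possible fix is to initialize the default before branching so that every path assigns a value; a sketch, keeping the existing names:

```python
# Set the default up front so a value is assigned even when
# ffn_embed_dim is given but expansion_ratio != 3.5.
mlp_gated_linear_units = False
if ffn_embed_dim:
    expansion_ratio = ffn_embed_dim / hidden_dim
    if expansion_ratio == 3.5:
        mlp_gated_linear_units = True
```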