@raviskolli raviskolli commented Apr 21, 2021

Print throughput in samples/sec for each training step.
Print average samples/sec excluding the initial setup at the end of training.
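
A minimal sketch of the arithmetic behind both numbers, for illustration only. The helper names `step_speed_metrics` and `average_samples_per_second`, and the `batch_size` and `skip_steps` parameters, are hypothetical and are not taken from this PR's diff. The sketch assumes throughput is the number of samples processed divided by elapsed wall-clock time, and that "initial setup" means the first training step.

```python
def step_speed_metrics(step_runtime, batch_size):
    """Throughput for a single training step (hypothetical helper).

    step_runtime: wall-clock seconds spent on the step.
    batch_size: number of samples processed in the step.
    """
    return {
        "train_step_runtime": round(step_runtime, 4),
        "train_step_samples_per_second": round(batch_size / step_runtime, 3),
    }


def average_samples_per_second(step_runtimes, batch_size, skip_steps=1):
    """Average throughput over the steps that follow the first `skip_steps`.

    Skipping the first step(s) keeps one-time setup cost out of the average.
    """
    steady = step_runtimes[skip_steps:]
    if not steady:
        raise ValueError("no steps left after skipping the initial steps")
    return round(batch_size * len(steady) / sum(steady), 3)


if __name__ == "__main__":
    # Illustrative numbers only: a slow first step followed by two fast ones.
    runtimes = [10.0, 0.5, 0.5]
    print(step_speed_metrics(runtimes[1], batch_size=36))
    print(average_samples_per_second(runtimes, batch_size=36))
```

With the illustrative runtimes above, the 10-second first step is left out of the average, so the result reflects only the two steady-state steps.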

Sample prints for each train step:
{'train_step_runtime': 0.6309, 'train_step_samples_per_second': 57.063, 'epoch': 0.98}

98%|█████████▊| 2086/2120 [26:01<00:22, 1.49it/s]
98%|█████████▊| 2087/2120 [26:02<00:22, 1.48it/s]

{'train_step_runtime': 0.641, 'train_step_samples_per_second': 56.162, 'epoch': 0.98}

98%|█████████▊| 2087/2120 [26:02<00:22, 1.48it/s]
98%|█████████▊| 2088/2120 [26:03<00:21, 1.47it/s]

{'train_step_runtime': 0.6355, 'train_step_samples_per_second': 56.652, 'epoch': 0.98}

98%|█████████▊| 2088/2120 [26:03<00:21, 1.47it/s]
99%|█████████▊| 2089/2120 [26:03<00:21, 1.45it/s]

{'train_step_runtime': 0.6715, 'train_step_samples_per_second': 53.608, 'epoch': 0.99}

99%|█████████▊| 2089/2120 [26:03<00:21, 1.45it/s]
99%|█████████▊| 2090/2120 [26:04<00:20, 1.43it/s]

{'train_step_runtime': 0.6717, 'train_step_samples_per_second': 53.592, 'epoch': 0.99}

Sample prints at the end of training:
Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 1586.5542, 'train_samples_per_second': 1.336, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]

{'train_runtime': 1426.6043, 'train_samples_per_second': 427.611, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]

@raviskolli raviskolli requested a review from ashbhandare April 21, 2021 18:49
@ashbhandare

Can you put an example output in the description?

@raviskolli
Author

> Can you put an example output in the description?

Good idea! I updated the description with a few sample prints.

@raviskolli raviskolli closed this Apr 22, 2021
@raviskolli raviskolli reopened this Apr 22, 2021
)

metrics = speed_metrics("train", start_time, self.state.max_steps)
ort_end_train_metrics = speed_metrics("train",

It looks like these will be calculated for non-ORT runs too. Can we remove the `ort_` prefix, or compute them under an `if`?

ashbhandare commented Apr 22, 2021

Thank you for adding the sample output. From the logs:
`Training completed. Do not forget to share your model on huggingface.co/models =)

{'train_runtime': 1586.5542, 'train_samples_per_second': 1.336, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]

{'train_runtime': 1426.6043, 'train_samples_per_second': 427.611, 'epoch': 1.0}

100%|██████████| 2120/2120 [26:26<00:00, 1.22s/it]`

It is confusing for the user to see the same train_runtime, train_samples_per_second, etc. printed twice with different values. Can we make it clear that the second set of speed metrics excludes the first step, so that setup overhead is left out?

Also, can we add the new metrics to the final train metrics summary? Most users will expect to look there and find the relevant info:
***** train metrics *****
epoch = 1.0
init_mem_cpu_alloc_delta = 1MB
init_mem_cpu_peaked_delta = 0MB
init_mem_gpu_alloc_delta = 0MB
init_mem_gpu_peaked_delta = 0MB
train_mem_cpu_alloc_delta = 46MB
train_mem_cpu_peaked_delta = 230MB
train_mem_gpu_alloc_delta = 808MB
train_mem_gpu_peaked_delta = 5793MB
train_runtime = 46.9366
train_samples = 4096
train_samples_per_second = 1.364
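
A rough sketch of what both suggestions could look like together: the second set of metrics gets its own key names and is merged into the one dictionary that feeds the final summary. The helpers `merge_train_metrics` and `format_metrics` and the `_excl_first_step` suffix are hypothetical names chosen for illustration. They are not part of the Trainer API or of this PR.

```python
def merge_train_metrics(metrics, steady_metrics, suffix="_excl_first_step"):
    """Combine full-run metrics with metrics that exclude the first step.

    Keys from `steady_metrics` get `suffix` appended so the two sets of
    values cannot be mistaken for each other. `epoch` is not duplicated.
    """
    merged = dict(metrics)
    for key, value in steady_metrics.items():
        if key == "epoch":
            continue
        merged[key + suffix] = value
    return merged


def format_metrics(split, metrics):
    """Render metrics as one aligned block, similar to the summary above."""
    width = max(len(key) for key in metrics)
    lines = [f"***** {split} metrics *****"]
    for key in sorted(metrics):
        lines.append(f"  {key:<{width}} = {metrics[key]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Values copied from the two end-of-training prints in the description.
    full_run = {"train_runtime": 1586.5542,
                "train_samples_per_second": 1.336, "epoch": 1.0}
    steady = {"train_runtime": 1426.6043,
              "train_samples_per_second": 427.611, "epoch": 1.0}
    print(format_metrics("train", merge_train_metrics(full_run, steady)))
```

With distinct key names, both runtimes and both throughput values can appear in one summary block without looking like a duplicated or contradictory print.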
