DGX-1V settings:
- Tesla V100
- CUDA 9.2
- r396.14
- libgdsync sm_70
gds_kernel_loopback_latency executes without errors.
gds_kernel_latency reports errors:
STDOUT:
pre-posting took 2272.00 usec
batch info: rx+kernel+tx 20 per batch
pre-posted 60 sequences in 2 batches
GPU kernel calc buf size: 131072
iters=1000 tx/rx_depth=1024
testing....
[1] batch 1: posted 20 sequences
[1] batch 2: posted 20 sequences
pre-posting took 2757.00 usec
gpu_wait_tracking_event nothing to do (12)
gpu_wait_tracking_event nothing to do (12)
gpu_wait_tracking_event nothing to do (12)
....
STDERR:
[3] unexpected rx ev 12, batch len 20
[4] unexpected rx ev 11, batch len 20
[5] unexpected rx ev 13, batch len 20
[6] unexpected rx ev 11, batch len 20
[7] unexpected rx ev 13, batch len 20
[8] unexpected tx ev 18, batch len 20
[8] unexpected rx ev 14, batch len 20
[9] unexpected rx ev 16, batch len 20
....
The run sometimes hangs and sometimes completes.
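To compare runs, the "unexpected rx/tx ev" messages can be tallied from a captured stderr file. The following is a minimal sketch, not part of libgdsync: the function name `summarize_unexpected` is hypothetical, the regular expression is inferred only from the lines shown above, and the meaning of the bracketed number (called `tag` here) is not stated in the output, so it is recorded without interpretation.

```python
import re
from collections import Counter

# Pattern inferred from the STDERR lines above, e.g.
#   [8] unexpected tx ev 18, batch len 20
UNEXPECTED = re.compile(r"\[(\d+)\] unexpected (rx|tx) ev (\d+), batch len (\d+)")

def summarize_unexpected(stderr_text):
    """Parse 'unexpected rx/tx ev' lines from captured stderr.

    Returns (records, counts):
      records: list of (tag, direction, ev, batch_len) tuples, in log order
      counts:  Counter of messages per direction ('rx' / 'tx')
    """
    records = [
        (int(tag), direction, int(ev), int(batch_len))
        for tag, direction, ev, batch_len in UNEXPECTED.findall(stderr_text)
    ]
    counts = Counter(direction for _, direction, _, _ in records)
    return records, counts
```

Calling `summarize_unexpected(open("stderr.log").read())` (the file name is a placeholder) gives a per-direction count that can be compared between the V100 and P100 systems.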
HPGMG doesn't report any errors, but its results are incorrect (with both CUDA 9.0 and 9.2). For instance, a run with 2 processes, the SA model, and input parameters 5 and 8 produces:
===== Warming up by running 10 solves ==========================================
FMGSolve... f-cycle norm=1.308041533821267e-02 rel=1.352815237149764e-02 done (0.014266 seconds)
FMGSolve... f-cycle norm=6.932829732453349e-05 rel=7.170137534722465e-05 done (0.008652 seconds)
FMGSolve... f-cycle norm=2.088214618404847e-04 rel=2.159693313379883e-04 done (0.008700 seconds)
FMGSolve... f-cycle norm=1.278232634365474e-03 rel=1.321985991790363e-03 done (0.008623 seconds)
FMGSolve... f-cycle norm=1.514943477709386e-03 rel=1.566799346255277e-03 done (0.008639 seconds)
FMGSolve... f-cycle norm=6.932835203155019e-05 rel=7.170143192683918e-05 done (0.008656 seconds)
FMGSolve... f-cycle norm=8.649945592323429e-04 rel=8.946029537476847e-04 done (0.009008 seconds)
FMGSolve... f-cycle norm=1.532839596755178e-02 rel=1.585308041816546e-02 done (0.008753 seconds)
FMGSolve... f-cycle norm=6.936241119703812e-05 rel=7.173665692302264e-05 done (0.008598 seconds)
FMGSolve... f-cycle norm=6.932763278943987e-05 rel=7.170068806537846e-05 done (0.008599 seconds)
WARMUP TIME: 0.093335
===== Running 100 solves ========================================================
FMGSolve... f-cycle norm=6.932763278943987e-05 rel=7.170068806537846e-05 done (0.009424 seconds)
The correct result would be:
FMGSolve... f-cycle norm=6.934041112871547e-05 rel=7.171390380175266e-05 done (0.010723 seconds)
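The deviation can be checked automatically by comparing every `FMGSolve` norm in a log against the correct value quoted above. This is a rough sketch under stated assumptions: the helper names `parse_norms` and `flag_bad_solves` are hypothetical, and the relative tolerance of 1e-6 is an arbitrary choice for illustration, not a threshold defined by HPGMG.

```python
import re

# Matches HPGMG output lines such as:
#   FMGSolve... f-cycle norm=6.93e-05 rel=7.17e-05 done (0.008652 seconds)
FMG_LINE = re.compile(r"FMGSolve\.\.\.\s+f-cycle\s+norm=(\S+)\s+rel=(\S+)\s+done")

# Reference norm taken from the "correct result" line in this report.
REFERENCE_NORM = 6.934041112871547e-05

def parse_norms(log_text):
    """Return the f-cycle norm of every FMGSolve line, in log order."""
    return [float(m.group(1)) for m in FMG_LINE.finditer(log_text)]

def flag_bad_solves(norms, reference=REFERENCE_NORM, rel_tol=1e-6):
    """Return (index, norm) for solves deviating from the reference.

    rel_tol is an assumed tolerance; adjust it to whatever
    run-to-run variation is considered acceptable.
    """
    return [
        (i, n)
        for i, n in enumerate(norms)
        if abs(n - reference) > rel_tol * abs(reference)
    ]
```

With this tolerance, every warm-up solve listed above would be flagged, including those near 6.9328e-05, which differ from the reference by roughly 2e-4 in relative terms.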
These errors don't appear on brdw0/1 (P100).