hpgmg and gds_kernel_latency errors on DGX TeslaV100 #61

@e-ago

DGX-1V settings:

  • Tesla V100
  • CUDA 9.2
  • NVIDIA driver r396.14
  • libgdsync built for sm_70

gds_kernel_loopback_latency executes without errors.

gds_kernel_latency reports errors:

STDOUT:

pre-posting took 2272.00 usec

batch info: rx+kernel+tx 20 per batch
pre-posted 60 sequences in 2 batches
GPU kernel calc buf size: 131072
iters=1000 tx/rx_depth=1024

testing....
[1] batch 1: posted 20 sequences
[1] batch 2: posted 20 sequences
pre-posting took 2757.00 usec
gpu_wait_tracking_event nothing to do (12)
gpu_wait_tracking_event nothing to do (12)
gpu_wait_tracking_event nothing to do (12)
....

STDERR:

[3] unexpected rx ev 12, batch len 20
[4] unexpected rx ev 11, batch len 20
[5] unexpected rx ev 13, batch len 20
[6] unexpected rx ev 11, batch len 20
[7] unexpected rx ev 13, batch len 20
[8] unexpected tx ev 18, batch len 20
[8] unexpected rx ev 14, batch len 20
[9] unexpected rx ev 16, batch len 20
....

The run sometimes hangs and sometimes completes.
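To make stderr output like the above easier to compare between runs or machines, here is a minimal Python sketch that tallies the "unexpected rx/tx ev" lines. It is not part of libgdsync; the function name and field names are hypothetical, and the reading of "ev" as the number of events observed and "batch len" as the number expected is an assumption based only on the message text.

```python
import re
from collections import Counter

# Matches lines such as "[8] unexpected tx ev 18, batch len 20".
EVENT_RE = re.compile(r"\[(\d+)\] unexpected (rx|tx) ev (\d+), batch len (\d+)")

def parse_unexpected_events(stderr_text):
    """Return one dict per 'unexpected ... ev' line found in the text."""
    events = []
    for m in EVENT_RE.finditer(stderr_text):
        events.append({
            "index": int(m.group(1)),      # bracketed number (assumed: iteration)
            "kind": m.group(2),            # "rx" or "tx"
            "events": int(m.group(3)),     # value printed after "ev"
            "batch_len": int(m.group(4)),  # value printed after "batch len"
        })
    return events

# Lines copied from the STDERR excerpt in this report.
SAMPLE_STDERR = """\
[3] unexpected rx ev 12, batch len 20
[4] unexpected rx ev 11, batch len 20
[5] unexpected rx ev 13, batch len 20
[6] unexpected rx ev 11, batch len 20
[7] unexpected rx ev 13, batch len 20
[8] unexpected tx ev 18, batch len 20
[8] unexpected rx ev 14, batch len 20
[9] unexpected rx ev 16, batch len 20
"""

events = parse_unexpected_events(SAMPLE_STDERR)
per_kind = Counter(e["kind"] for e in events)

# Assumption: the gap between "batch len" and "ev" is the number of
# events that were missing when the check ran.
shortfalls = [e["batch_len"] - e["events"] for e in events]

print("lines per kind:", dict(per_kind))
print("shortfall range:", min(shortfalls), "to", max(shortfalls))
```

On the excerpt above this counts 7 rx lines and 1 tx line, with between 2 and 9 events missing per batch under the stated assumption.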

HPGMG reports no errors, but its results are incorrect (with both CUDA 9.0 and 9.2). For instance, a run with 2 processes, the SA model, and input parameters 5 and 8 gives:

===== Warming up by running 10 solves ==========================================
FMGSolve... f-cycle     norm=1.308041533821267e-02  rel=1.352815237149764e-02  done (0.014266 seconds)
FMGSolve... f-cycle     norm=6.932829732453349e-05  rel=7.170137534722465e-05  done (0.008652 seconds)
FMGSolve... f-cycle     norm=2.088214618404847e-04  rel=2.159693313379883e-04  done (0.008700 seconds)
FMGSolve... f-cycle     norm=1.278232634365474e-03  rel=1.321985991790363e-03  done (0.008623 seconds)
FMGSolve... f-cycle     norm=1.514943477709386e-03  rel=1.566799346255277e-03  done (0.008639 seconds)
FMGSolve... f-cycle     norm=6.932835203155019e-05  rel=7.170143192683918e-05  done (0.008656 seconds)
FMGSolve... f-cycle     norm=8.649945592323429e-04  rel=8.946029537476847e-04  done (0.009008 seconds)
FMGSolve... f-cycle     norm=1.532839596755178e-02  rel=1.585308041816546e-02  done (0.008753 seconds)
FMGSolve... f-cycle     norm=6.936241119703812e-05  rel=7.173665692302264e-05  done (0.008598 seconds)
FMGSolve... f-cycle     norm=6.932763278943987e-05  rel=7.170068806537846e-05  done (0.008599 seconds)

WARMUP TIME: 0.093335


===== Running 100 solves ========================================================
FMGSolve... f-cycle     norm=6.932763278943987e-05  rel=7.170068806537846e-05  done (0.009424 seconds)

The correct result would be:
FMGSolve... f-cycle norm=6.934041112871547e-05 rel=7.171390380175266e-05 done (0.010723 seconds)
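To quantify how far each solve is from the expected value, the following Python sketch parses the FMGSolve lines and computes the relative deviation of each norm from the reference norm quoted above. The helper names are hypothetical and the script is not part of HPGMG; it only assumes the log line format shown in this report.

```python
import re

# Captures the norm from lines such as
# "FMGSolve... f-cycle     norm=6.93e-05  rel=7.17e-05  done (0.008652 seconds)".
NORM_RE = re.compile(r"FMGSolve\.\.\. f-cycle\s+norm=([0-9.eE+-]+)")

def parse_norms(log_text):
    """Return the norm of every FMGSolve line, in log order."""
    return [float(m.group(1)) for m in NORM_RE.finditer(log_text)]

def relative_deviations(norms, reference):
    """Return |norm - reference| / reference for each norm."""
    return [abs(n - reference) / reference for n in norms]

# Reference norm taken from the "correct result" line in this report.
REFERENCE_NORM = 6.934041112871547e-05

# Warm-up lines copied from the V100 run in this report.
WARMUP_LOG = """\
FMGSolve... f-cycle     norm=1.308041533821267e-02  rel=1.352815237149764e-02  done (0.014266 seconds)
FMGSolve... f-cycle     norm=6.932829732453349e-05  rel=7.170137534722465e-05  done (0.008652 seconds)
FMGSolve... f-cycle     norm=2.088214618404847e-04  rel=2.159693313379883e-04  done (0.008700 seconds)
FMGSolve... f-cycle     norm=1.278232634365474e-03  rel=1.321985991790363e-03  done (0.008623 seconds)
FMGSolve... f-cycle     norm=1.514943477709386e-03  rel=1.566799346255277e-03  done (0.008639 seconds)
FMGSolve... f-cycle     norm=6.932835203155019e-05  rel=7.170143192683918e-05  done (0.008656 seconds)
FMGSolve... f-cycle     norm=8.649945592323429e-04  rel=8.946029537476847e-04  done (0.009008 seconds)
FMGSolve... f-cycle     norm=1.532839596755178e-02  rel=1.585308041816546e-02  done (0.008753 seconds)
FMGSolve... f-cycle     norm=6.936241119703812e-05  rel=7.173665692302264e-05  done (0.008598 seconds)
FMGSolve... f-cycle     norm=6.932763278943987e-05  rel=7.170068806537846e-05  done (0.008599 seconds)
"""

norms = parse_norms(WARMUP_LOG)
deviations = relative_deviations(norms, REFERENCE_NORM)

for i, (n, d) in enumerate(zip(norms, deviations)):
    print(f"solve {i}: norm={n:.6e}  relative deviation={d:.3e}")

# Index of the solve that is furthest from the reference.
worst = max(range(len(deviations)), key=deviations.__getitem__)
print("worst solve:", worst)
```

On the warm-up log above, none of the ten solves matches the reference, and the norms also differ from one solve to the next, which is the run-to-run variation this report describes.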

These errors do not appear on brdw0/1 (P100).
