Description
I experimented with GauXC to compute semi-numerical exchange using NVIDIA GPUs and found that, for large systems and large batch sizes, the computed exchange matrix is incorrect. Results obtained with the host integrator were correct. Interestingly, the CUDA results were also correct when using extremely tight screening thresholds, which suggests that something may be wrong with the Exx screening on CUDA.
Tracing the issue indicates that the problem lies in how batches of batches are generated in the exx_ek_screening function.
GauXC/src/xc_integrator/integrator_util/exx_screening.cxx
Lines 256 to 304 in fe45b3b
```cpp
const size_t task_batch_size = 10000;

// Setup EXX EK Screening memory on the device
device_data.reset_allocations();
device_data.allocate_static_data_exx_ek_screening( ntasks, nbf, nshells,
  shpairs.npairs(), basis_map.max_l() );
device_data.send_static_data_density_basis( P_abs, ldp, nullptr, 0, nullptr, 0, nullptr, 0, basis );
device_data.send_static_data_exx_ek_screening( V_shell_max, ldv, basis_map,
  shpairs );

integrator_term_tracker enabled_terms;
enabled_terms.exx_ek_screening = true;

auto task_batch_begin = task_begin;
while(task_batch_begin != task_end) {

  size_t nleft = std::distance(task_batch_begin, task_end);
  exx_detail::host_task_iterator task_batch_end;
  if(nleft > task_batch_size)
    task_batch_end = task_batch_begin + task_batch_size;
  else
    task_batch_end = task_end;

  device_data.zero_exx_ek_screening_intermediates();

  // Loop over tasks and form basis-related buffers
  auto task_it = task_batch_begin;
  while( task_it != task_batch_end ) {

    // Determine next task patch, send relevant data (EXX_EK only)
    task_it = device_data.generate_buffers( enabled_terms, basis_map, task_it,
      task_batch_end );

    // Evaluate collocation
    lwd->eval_collocation( &device_data );

    // Evaluate EXX EK Screening Basis Statistics
    lwd->eval_exx_ek_screening_bfn_stats( &device_data );

  }

  lwd->exx_ek_shellpair_collision( eps_E, eps_K, &device_data, task_batch_begin,
    task_batch_end, shpairs );

  task_batch_begin = task_batch_end;

}
```
This function contains a nested loop over batches: the outer loop takes up to task_batch_size batches (hardcoded to 10000), and the inner loop calls generate_buffers repeatedly until that range has been consumed. If the number of batches that fit in one inner pass is smaller than task_batch_size, some internal data buffers (bfn_max_device) are overwritten in the next iteration of the inner loop, before exx_ek_shellpair_collision is called on the whole range. This can occur when there are many large batches and the GPU memory cannot hold 10000 of them at once.
After removing the double loop structure and using the same batching strategy as in other functions, the results became correct.
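
To make the failure mode concrete, here is a minimal, self-contained toy model of it. It is not GauXC code: `task_stat`, `nested_scheme`, `flat_scheme` and the `capacity` parameter are hypothetical names, and the assumption that each inner pass writes its results starting at offset 0 of the buffer is my reading of the behaviour, not something verified against how bfn_max_device is actually indexed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-task statistic standing in for the values kept in bfn_max_device.
static int task_stat(std::size_t task_id) { return 10 * static_cast<int>(task_id); }

// Nested scheme: outer ranges of `outer` tasks, inner passes of at most
// `capacity` tasks. Each pass writes from offset 0 (assumption), and the
// buffer is only consumed after the inner loop has finished.
std::vector<int> nested_scheme(std::size_t ntasks, std::size_t outer, std::size_t capacity) {
  assert(outer > 0 && capacity > 0);
  std::vector<int> consumed;
  std::vector<int> buffer(outer, 0);
  for (std::size_t begin = 0; begin < ntasks; begin += outer) {
    const std::size_t end = std::min(begin + outer, ntasks);
    std::fill(buffer.begin(), buffer.end(), 0);  // zero the intermediates
    for (std::size_t it = begin; it < end;) {
      const std::size_t pass_end = std::min(it + capacity, end);
      for (std::size_t k = it; k < pass_end; ++k)
        buffer[k - it] = task_stat(k);  // a later pass overwrites an earlier one
      it = pass_end;
    }
    // Consumer step: expects one valid entry per task of the outer range.
    for (std::size_t k = begin; k < end; ++k) consumed.push_back(buffer[k - begin]);
  }
  return consumed;
}

// Flat scheme: every pass is consumed before the next pass reuses the buffer.
std::vector<int> flat_scheme(std::size_t ntasks, std::size_t capacity) {
  assert(capacity > 0);
  std::vector<int> consumed;
  std::vector<int> buffer(capacity, 0);
  for (std::size_t it = 0; it < ntasks;) {
    const std::size_t pass_end = std::min(it + capacity, ntasks);
    for (std::size_t k = it; k < pass_end; ++k) buffer[k - it] = task_stat(k);
    for (std::size_t k = it; k < pass_end; ++k) consumed.push_back(buffer[k - it]);
    it = pass_end;
  }
  return consumed;
}
```

With 6 tasks, `flat_scheme(6, 2)` returns `{0, 10, 20, 30, 40, 50}`, and so does `nested_scheme(6, 4, 4)`, where one pass covers the whole outer range. `nested_scheme(6, 4, 2)` returns `{20, 30, 0, 0, 40, 50}`: the second pass has overwritten the first. This matches the observation that the results are only wrong when a pass cannot hold a full outer range.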