
Bug in CUDA exx ek screening #154

@vmitq

Description

I experimented with GauXC to compute semi-numerical exchange using NVIDIA GPUs and found that, for large systems and large batch sizes, the computed exchange matrix is incorrect. Results obtained with the host integrator were correct. Interestingly, the CUDA results were also correct when using extremely tight screening thresholds, which suggests that something may be wrong with the Exx screening on CUDA.

Tracing the issue indicates that the problem lies in how batches of tasks are generated in the exx_ek_screening function:

```cpp
const size_t task_batch_size = 10000;

// Setup EXX EK Screening memory on the device
device_data.reset_allocations();
device_data.allocate_static_data_exx_ek_screening( ntasks, nbf, nshells,
  shpairs.npairs(), basis_map.max_l() );
device_data.send_static_data_density_basis( P_abs, ldp, nullptr, 0, nullptr, 0,
  nullptr, 0, basis );
device_data.send_static_data_exx_ek_screening( V_shell_max, ldv, basis_map,
  shpairs );

integrator_term_tracker enabled_terms;
enabled_terms.exx_ek_screening = true;

auto task_batch_begin = task_begin;
while( task_batch_begin != task_end ) {

  size_t nleft = std::distance( task_batch_begin, task_end );
  exx_detail::host_task_iterator task_batch_end;
  if( nleft > task_batch_size )
    task_batch_end = task_batch_begin + task_batch_size;
  else
    task_batch_end = task_end;

  device_data.zero_exx_ek_screening_intermediates();

  // Loop over tasks and form basis-related buffers
  auto task_it = task_batch_begin;
  while( task_it != task_batch_end ) {

    // Determine next task patch, send relevant data (EXX_EK only)
    task_it = device_data.generate_buffers( enabled_terms, basis_map, task_it,
      task_batch_end );

    // Evaluate collocation
    lwd->eval_collocation( &device_data );

    // Evaluate EXX EK Screening Basis Statistics
    lwd->eval_exx_ek_screening_bfn_stats( &device_data );
  }

  lwd->exx_ek_shellpair_collision( eps_E, eps_K, &device_data, task_batch_begin,
    task_batch_end, shpairs );

  task_batch_begin = task_batch_end;
}
```

This function contains a nested loop over task batches structured in such a way that, if the number of tasks that fit on the device in one pass of the inner loop is smaller than task_batch_size (hardcoded to 10000), some internal data buffers (bfn_max_device) are overwritten by the next iteration of the inner loop before they are consumed. This can occur when there are many large tasks and the GPU memory cannot fit a batch of 10000 tasks at once.

After removing the double-loop structure and using the same batching strategy as the other device functions, the results are correct.
