Incorrect dpotrf_ result caused by loongarch la464 kernels #5602

@shankerwangmiao

I found this bug when compiling xtb on loongarch64. The test c_api_example.c complained that the calculated result is not correct. However, all the unit tests of OpenBLAS passed. By further choosing different kernels using OPENBLAS_CORETYPE, I found that the issue happened when using the la464 kernel, and did not happen when using the la264 kernel and the generic kernel.

I further discovered that this error only happened on 3C6000. When all the binaries were copied to 3A6000, the result became correct, even if the la464 kernel was used. @jiegec helped me to narrow down the problem to the openblas routines called by xtb by hooking the library invocations and comparing the results between different kernels. He found that dpotrf_ was giving incorrect results.

The attached run_dpotrf.c is a minimal program that reproduces the problem. The example input is dpotrf.in.txt and the expected output is dpotrf.ans.txt.
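For readers without the attachments, a kernel-independent way to judge a dpotrf_ result is to multiply the returned factor back and compare it with the input. The sketch below is not the attached run_dpotrf.c; the helper name `cholesky_residual_lower` is hypothetical, and it assumes a column-major n-by-n matrix with leading dimension n and the lower-triangular factor (uplo = 'L').

```c
#include <stddef.h>

/* Hypothetical helper (not part of OpenBLAS or the attached reproducer).
 * a: original symmetric matrix, column-major, leading dimension n.
 * l: factor returned by a Cholesky routine, lower triangle valid.
 * Returns max |A(i,j) - (L * L^T)(i,j)| over the lower triangle. */
double cholesky_residual_lower(int n, const double *a, const double *l)
{
    double worst = 0.0;
    for (int j = 0; j < n; j++) {
        for (int i = j; i < n; i++) {
            /* (L * L^T)(i,j) = sum over k <= j of L(i,k) * L(j,k) */
            double s = 0.0;
            for (int k = 0; k <= j; k++)
                s += l[i + (size_t)k * n] * l[j + (size_t)k * n];
            double d = a[i + (size_t)j * n] - s;
            if (d < 0.0)
                d = -d;
            if (d > worst)
                worst = d;
        }
    }
    return worst;
}
```

A correct factorization should leave a residual near rounding error relative to the size of the entries of A; a noticeably larger value points at the factorization rather than at the caller.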

He soon discovered that in OpenBLAS/kernel/setparam-ref.c (lines 1147 to 1223 at e07bea1):

#if defined(LA464)
  int L3_size = get_L3_size();
#ifdef SMP
  if(blas_num_threads == 1){
#endif
    //single thread
    if (L3_size == 32){ // 3C5000 and 3D5000
      TABLE_NAME.sgemm_p = 256;
      TABLE_NAME.sgemm_q = 384;
      TABLE_NAME.sgemm_r = 8192;
      TABLE_NAME.dgemm_p = 112;
      TABLE_NAME.dgemm_q = 289;
      TABLE_NAME.dgemm_r = 4096;
      TABLE_NAME.cgemm_p = 128;
      TABLE_NAME.cgemm_q = 256;
      TABLE_NAME.cgemm_r = 4096;
      TABLE_NAME.zgemm_p = 128;
      TABLE_NAME.zgemm_q = 128;
      TABLE_NAME.zgemm_r = 2048;
    } else { // 3A5000 and 3C5000L
      TABLE_NAME.sgemm_p = 256;
      TABLE_NAME.sgemm_q = 384;
      TABLE_NAME.sgemm_r = 4096;
      TABLE_NAME.dgemm_p = 112;
      TABLE_NAME.dgemm_q = 300;
      TABLE_NAME.dgemm_r = 3024;
      TABLE_NAME.cgemm_p = 128;
      TABLE_NAME.cgemm_q = 256;
      TABLE_NAME.cgemm_r = 2048;
      TABLE_NAME.zgemm_p = 128;
      TABLE_NAME.zgemm_q = 128;
      TABLE_NAME.zgemm_r = 1024;
    }
#ifdef SMP
  }else{
    //multi thread
    if (L3_size == 32){ // 3C5000 and 3D5000
      TABLE_NAME.sgemm_p = 256;
      TABLE_NAME.sgemm_q = 384;
      TABLE_NAME.sgemm_r = 1024;
      TABLE_NAME.dgemm_p = 112;
      TABLE_NAME.dgemm_q = 289;
      TABLE_NAME.dgemm_r = 342;
      TABLE_NAME.cgemm_p = 128;
      TABLE_NAME.cgemm_q = 256;
      TABLE_NAME.cgemm_r = 512;
      TABLE_NAME.zgemm_p = 128;
      TABLE_NAME.zgemm_q = 128;
      TABLE_NAME.zgemm_r = 512;
    } else { // 3A5000 and 3C5000L
      TABLE_NAME.sgemm_p = 256;
      TABLE_NAME.sgemm_q = 384;
      TABLE_NAME.sgemm_r = 2048;
      TABLE_NAME.dgemm_p = 112;
      TABLE_NAME.dgemm_q = 300;
      TABLE_NAME.dgemm_r = 738;
      TABLE_NAME.cgemm_p = 128;
      TABLE_NAME.cgemm_q = 256;
      TABLE_NAME.cgemm_r = 1024;
      TABLE_NAME.zgemm_p = 128;
      TABLE_NAME.zgemm_q = 128;
      TABLE_NAME.zgemm_r = 1024;
    }
  }
#endif

GEMM_{P, Q, R} are given four sets of values for the la464 kernel, depending on whether the calculation is single-threaded or multi-threaded and whether the running CPU has a 32 MiB L3 cache. I tested these parameters and found that only the dgemm values {112, 289, 342}, i.e. the parameters used in the multi-threaded, 32 MiB L3 cache case, give incorrect results. If I switch to any of the other sets of values, the results are correct, regardless of whether the code runs on 3A6000 or 3C6000 and regardless of whether OpenBLAS is compiled single-threaded or multi-threaded. Conversely, if I force the use of this set of values, the results are always incorrect, regardless of CPU type and threading mode.
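To make the four cases easier to see at a glance, the dgemm branch of the quoted snippet can be condensed as follows. This is only an illustrative sketch: `gemm_params` and `select_dgemm_params` are hypothetical names, not OpenBLAS symbols, and the real code writes into TABLE_NAME instead of returning a struct.

```c
/* Hypothetical condensed view of the dgemm values in the quoted snippet. */
typedef struct {
    int p, q, r;
} gemm_params;

/* num_threads: 1 selects the single-thread branch (the only branch that
 * exists when OpenBLAS is built without SMP).
 * l3_size_mib: L3 cache size in MiB, as compared against 32 in the snippet. */
gemm_params select_dgemm_params(int num_threads, int l3_size_mib)
{
    gemm_params g;
    g.p = 112;                       /* identical in all four cases */
    if (l3_size_mib == 32) {         /* 3C5000 and 3D5000 per the comments */
        g.q = 289;
        g.r = (num_threads == 1) ? 4096 : 342;
    } else {                         /* 3A5000 and 3C5000L per the comments */
        g.q = 300;
        g.r = (num_threads == 1) ? 3024 : 738;
    }
    return g;
}
```

With this view, the failing combination is the one returned for more than one thread and a 32 MiB L3 cache, which matches the observation that 3C6000 was affected while 3A6000 was not.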

For comparison with other platforms, I chose the HASWELL kernel for another test, since both are 256 bits wide. I applied the set of parameters in question to the HASWELL kernel and found that the results became incorrect there as well. By randomly tuning the parameters, I found that the results become correct once R is large enough.

As the FAQ suggests, the values chosen for GEMM_{P, Q, R} should affect performance. The findings above suggest that they also affect correctness. From the implementation of dpotrf_ I can easily conclude that R should be larger than P and Q. I wonder whether there are additional constraints that the parameters are expected to satisfy. If not, it seems that there are bugs in the platform-generic code.
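As a quick arithmetic check of the "R larger than P and Q" condition, the margin R - max(P, Q) can be computed for the four dgemm sets quoted above. The helper names below are hypothetical, and the margin is just one possible way to compare the sets; it is not claimed to be the constraint that the generic code actually requires.

```c
/* Hypothetical helper: how far R exceeds the larger of P and Q. */
int gemm_r_margin(int p, int q, int r)
{
    int pq = (p > q) ? p : q;
    return r - pq;
}

/* The four dgemm {P, Q, R} sets from the quoted setparam-ref.c snippet. */
static const int dgemm_sets[4][3] = {
    {112, 289, 4096}, /* single thread, 32 MiB L3 */
    {112, 300, 3024}, /* single thread, other L3 sizes */
    {112, 289, 342},  /* multi thread, 32 MiB L3 (the failing set) */
    {112, 300, 738},  /* multi thread, other L3 sizes */
};

/* Returns the index of the set with the smallest margin. */
int smallest_margin_index(void)
{
    int best = 0;
    for (int i = 1; i < 4; i++) {
        int mi = gemm_r_margin(dgemm_sets[i][0], dgemm_sets[i][1], dgemm_sets[i][2]);
        int mb = gemm_r_margin(dgemm_sets[best][0], dgemm_sets[best][1], dgemm_sets[best][2]);
        if (mi < mb)
            best = i;
    }
    return best;
}
```

All four sets satisfy R > P and R > Q, including the failing one, whose margin of 53 is the smallest of the four. That is consistent with the question of whether a stricter constraint on R applies.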
