I wanted to use the integer gemm code from 0430cf0, and realized that there is currently no way of performing the operation on transposed matrices, while I wanted to compute A^t A. In the BLAS context, the transpose or complex conjugate of a matrix is usually expressed as Op(A), where Op is selected through the parameter `TRANSA`, given as the character `'N'`, `'T'`, or `'C'`.
I realize that since we actually have dimensions and strides as part of our matrix `ArrayBase` structures, we can sidestep the issue by transposing the matrix view via `fn t(mut self)`. The questions are:
- Performance: does code specific to a transposed matrix with unchanged memory layout achieve the same performance as generic code that handles arbitrary stride information?
- To what extent does the gemm kernel for transposed matrices differ from the one for non-transposed matrices?
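To make the stride question concrete, here is a minimal sketch (plain Rust, not ndarray's actual implementation): if a matrix view carries stride metadata, transposition is just a swap of dimensions and strides, and a stride-generic gemm can compute A^t A without ever copying memory. The `View` type and its methods are hypothetical illustrations of the idea.

```rust
// Hypothetical matrix view: a flat buffer plus stride metadata, mimicking
// the way ndarray's ArrayBase describes layout. Not ndarray's real API.
struct View<'a> {
    data: &'a [i32],
    rows: usize,
    cols: usize,
    row_stride: usize,
    col_stride: usize,
}

impl<'a> View<'a> {
    // Row-major view over a contiguous buffer.
    fn new(data: &'a [i32], rows: usize, cols: usize) -> Self {
        View { data, rows, cols, row_stride: cols, col_stride: 1 }
    }

    // Transpose without touching memory: swap dimensions and strides.
    fn t(&self) -> View<'a> {
        View {
            data: self.data,
            rows: self.cols,
            cols: self.rows,
            row_stride: self.col_stride,
            col_stride: self.row_stride,
        }
    }

    fn at(&self, i: usize, j: usize) -> i32 {
        self.data[i * self.row_stride + j * self.col_stride]
    }
}

// Naive stride-generic gemm: C = A * B. It is oblivious to whether A or B
// is a transposed view, because all access goes through the strides.
fn gemm(a: &View, b: &View) -> Vec<i32> {
    assert_eq!(a.cols, b.rows);
    let mut c = vec![0; a.rows * b.cols];
    for i in 0..a.rows {
        for j in 0..b.cols {
            for k in 0..a.cols {
                c[i * b.cols + j] += a.at(i, k) * b.at(k, j);
            }
        }
    }
    c
}

fn main() {
    // A is 3x2; A^t A is 2x2, computed via the transposed view.
    let data = [1, 2, 3, 4, 5, 6];
    let a = View::new(&data, 3, 2);
    let ata = gemm(&a.t(), &a);
    println!("{:?}", ata); // [35, 44, 44, 56]
}
```

The performance question is exactly about this design: the inner loop above walks `a` with a column stride of `cols` after transposition, so the memory access pattern changes even though the code does not, which is why tuned BLAS kernels pack panels into a contiguous layout first.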
While addressing this issue, it's probably also worth investigating how `DSYRK` is implemented differently in the BLIS library for the specific case of A^t A.
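The reason syrk is interesting here: A^t A is always symmetric, so a dedicated routine only needs to compute one triangle and mirror it, roughly halving the multiply count compared to a general gemm. A minimal sketch of that idea (plain Rust, hypothetical names, nothing like BLIS's actual blocked kernels):

```rust
// Sketch of the syrk idea for C = A^t A, where `a` is a k x n row-major
// matrix. Only the upper triangle is computed; the lower triangle is
// filled in by symmetry. Hypothetical helper, for illustration only.
fn syrk_upper(a: &[i32], n: usize, k: usize) -> Vec<i32> {
    assert_eq!(a.len(), k * n);
    let mut c = vec![0; n * n];
    for i in 0..n {
        for j in i..n {
            // Dot product of columns i and j of A.
            let mut s = 0;
            for l in 0..k {
                s += a[l * n + i] * a[l * n + j];
            }
            c[i * n + j] = s;
            c[j * n + i] = s; // mirror into the lower triangle
        }
    }
    c
}

fn main() {
    // Same 3x2 example: A^t A = [[35, 44], [44, 56]].
    let a = [1, 2, 3, 4, 5, 6];
    println!("{:?}", syrk_upper(&a, 2, 3)); // [35, 44, 44, 56]
}
```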
I have a hard time understanding how BLIS defines its kernels, specifically how the different cases of Op(A) Op(B) are implemented. I am happy to dig in and write a benchmark comparing the current ndarray approach to a specialized kernel. Can you point me to the right spot to look at?