diff --git a/Changelog.txt b/Changelog.txt
index e4ba72986e..bc4f23535c 100644
--- a/Changelog.txt
+++ b/Changelog.txt
@@ -1,4 +1,120 @@
 OpenBLAS ChangeLog
+====================================================================
+Version 0.3.31
+15-Jan-2025
+
+general:
+  - reverted a matrix partitioning optimization from 0.3.30 that could lead to
+    race conditions and subsequent invalid results in GEMM
+  - added the bfloat16 extensions BGEMM and BGEMV
+  - added a BLAS interface for the ?GEMM_BATCH extensions
+  - added the BLAS extensions ?GEMM_BATCH_STRIDED and their CBLAS interface
+  - added the basic infrastructure for half-precision float (FP16) format
+    using SH prefix
+  - reimplemented the LAPACK SLAED3/DLAED3 function using multithreading, thereby
+    improving the performance of the SSYEVD/DSYEVD eigensolver for symmetric matrices
+    on all platforms
+  - limited the number of retries for initial memory allocation to avoid infinite
+    hanging on low-memory systems
+  - fixed a thread lockup situation encountered with python 3.9 or older and numpy
+  - introduced a problem size threshold for multithreading in STRMV/DTRMV
+  - introduced a problem size threshold for multithreading in CHER/CHER2/CHPR/CHPR2
+    and ZHER/ZHER2/ZHPR/ZHPR2
+  - improved the problem size thresholds for multithreading in SGER/DGER
+  - improved autodetection of the Fortran compiler
+  - fixed passing of the INTERFACE64=1 option to the flang-new compiler
+  - fixed a potential deadlock in multithreaded code after calling fork()
+  - fixed builds using CMake on FreeBSD
+  - fixed builds using CMake from within Cygwin on Windows
+  - fixed builds using CMake and the NVHPC compiler on ARM64
+  - fixed CMake build error from misdetecting compiler or OpenMP versions
+  - improved contents of the CMake-generated OpenBLASConfig.cmake file
+  - added support for cross-compilation to RISCV targets via CMake
+  - fixed cross-compilation to x86 targets from non-x86 architectures
+  - fixed failure to install cblas.h if NO_CBLAS=0 was specified
+  - fixed missing user-defined pre- and postfixes on functions in lapack.h,lapacke.h
+  - included fixes from the Reference-LAPACK project:
+      - fix ordering bug in ?LAED/?LASD (Reference-LAPACK PR 1140)
+      - revert changes in ?GEEV from PR 1129 (Reference-LAPACK PR 1142)
+      - fix workspace allocation in LAPACKE_?TRSEN (Reference-LAPACK PR 1144)
+
+riscv:
+  - added optimized SBGEMM kernels for ZVL128B and ZVL256B targets
+  - added optimized SHGEMM kernels for ZVL128B and ZVL256B targets
+  - added optimized SBGEMV and SHGEMV kernels for ZVL128B/ZVL256B
+  - improved performance of the GEMV kernel for ZVL256B
+  - improved the performance of the CROT and ZROT kernels for ZVL128B and x280
+  - improved the detection of RVV1.0 capability
+  - improved performance of the matrix packing helper functions for ZVL128B and ZVL256B
+  - improved performance of OMATCOPY for ZVL128B and ZVL256B
+
+arm:
+  - fixed spurious executable stack in the getarch utility
+
+arm64:
+  - fixed spurious executable stack in the getarch utility
+  - fixed compiler warnings arising from the timer macro RPCC
+  - fixed cache size detection for Qualcomm Oryon under Windows on Arm
+  - fixed argument handling in the default SVE kernel for SDOT/DDOT
+  - building the BFLOAT16 kernels is now enabled by default
+  - improved the overall performance of GEMM,SYMM and HEMM on A64FX
+  - improved the performance of SDOT/DDOT on A64FX
+  - improved the multithreading performance of SDOT/DDOT on A64FX by
+    introduction of a throttling table matching thread count to problem size
+  - improved the performance of SGER/DGER on A64FX and NEOVERSEV1
+  - improved the multithreading performance of GEMM on A64FX and NEOVERSEV1
+  - improved the performance of the GEMV kernel for SVE-capable targets
+  - improved the multithreading performance of SGEMM on NEOVERSEV1 and V2
+  - added optimized SAXPY/DAXPY SVE kernels for A64FX and NEOVERSEV1
+  - added optimized BGEMM and BGEMV kernels for NEOVERSEV1
+  - added an optimized BGEMM kernel for NEOVERSEN2
+  - added support for the NEOVERSEV2 cpu
+  - added dedicated support for the Apple M4 cpu as VORTEXM4
+  - added optimized SGEMM/SSYMM/STRMM/SSYRK/SSYR2K for SME-capable targets
+    (ARMV9SME and VORTEXM4)
+  - improved the precision of the SNRM2 kernel
+  - added cpu autodetection and compiler settings for Ampere One processors
+  - fixed cpu autodetection for Apple M systems running Linux
+  - fixed building on MacOS with AppleClang,gfortran and xcode v16 or newer
+  - fixed several errors in the C code replacements for the complex and double
+    precision complex LAPACK functions that get used (only) when compiling with
+    Microsoft C and NOFORTRAN=1 under MS Windows
+
+power:
+  - added initial support for the POWER11 architecture
+  - improved performance of DGEMM and DGEMV on POWER10
+  - fixed the default compiler flags to use "-O3" instead of the possibly unsafe
+    "-Ofast"
+  - fixed building under MacOS (for old G4 Macs) with CMake
+  - fixed potential miscompilation of DGEMV and other assembly kernels by gcc15.1
+  - fixed compilation with recent versions of flang
+
+loongarch64:
+  - fixed warnings and potential inaccuracies arising from incorrect saving of registers
+  - fixed enumeration of logical cores on big NUMA servers
+  - fixed building with LLVM and the INTERFACE64=1 option
+
+x86:
+  - fixed building the GEMM3M kernels for the GENERIC target
+  - fixed several errors in the C code replacements for the complex and double
+    precision complex LAPACK functions that get used (only) when compiling with
+    Microsoft C and NOFORTRAN=1 under MS Windows
+
+x86_64:
+  - added cpu autodetection for Intel Lunar Lake (Core Ultra 200V)
+  - changed all ?MIN and ?MAX assembly kernels to use unaligned operations
+  - fixed several errors in the C code replacements for the complex and double
+    precision complex LAPACK functions that get used (only) when compiling with
+    Microsoft C and NOFORTRAN=1 under MS Windows
+  - fixed potential crashes in builds for Cooper Lake, Sapphire Rapids or Zen5 cpus
+    under MS Windows
+
+zarch:
+  - added support for building with CMake
+
+sparc:
+  - fixed a potential crash in the DNRM2 kernel
+
 ====================================================================
 Version 0.3.30
 19-Jun-2025