cpu: rv64: share RVV eltwise emitters by Ga1axy0 · Pull Request #9 · spacemit-com/oneDNN

Ga1axy0 · 2026-06-12T09:22:44Z

Description

This PR adds RVV JIT support for additional RV64 eltwise forward algorithms and factors shared eltwise code generation into a reusable emitter helper.

The new emitter is used by the RV64 eltwise JIT kernels and by the f16 softmax exp-sub-sum kernel through elt.exp(). This avoids duplicating the RVV exp polynomial sequence in softmax while keeping softmax kernels as regular jit_generator_t users.

The emitter is intended to cover the regular finite-input fast path. It does not classify or preserve special NaN/Inf values on its own. Some of the newly added algorithms, including exp, tanh, and gelu_tanh, apply explicit lower/upper bounds with RVV min/max instructions. Existing clamp-based eltwise algorithms such as hardsigmoid and clip also use min/max-style clamping without explicit special-value fixups.

If special NaN/Inf preservation is required for these paths, it likely needs to be discussed as a unified RVV JIT eltwise policy rather than handled only for the newly added bounded emitters. Adding per-lane special-value detection and fixup in the kernel would introduce extra comparisons, masks, and merges on the eltwise hot path, which may affect the overall performance of these kernels.
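
The trade-off above can be illustrated with a scalar model. This is a minimal sketch, not the PR's RVV code: the bounds `kLo` and `kHi` and both function names are hypothetical, and `std::fmin`/`std::fmax` stand in for the RVV min/max instructions (like them, they return the non-NaN operand when exactly one input is NaN).

```cpp
#include <cmath>
#include <limits>

// Hypothetical bounds chosen only for illustration; not taken from the PR.
constexpr float kLo = -87.0f;
constexpr float kHi = 88.0f;

// Scalar model of a min/max-bounded exp: clamp the input, then evaluate.
// NaN and +/-Inf inputs are silently mapped into [kLo, kHi] by the clamp.
inline float bounded_exp(float x) {
    const float clamped = std::fmax(std::fmin(x, kHi), kLo);
    return std::exp(clamped);
}

// Same computation with explicit special-value fixups. In a vector kernel
// each branch below would become a compare, a mask, and a merge per lane.
inline float bounded_exp_with_fixup(float x) {
    if (std::isnan(x)) return x;                     // propagate NaN
    if (std::isinf(x)) return x > 0.0f ? x : 0.0f;   // exp(+Inf)=+Inf, exp(-Inf)=0
    return bounded_exp(x);
}
```

With these assumed bounds, `bounded_exp` returns a finite value for both +Inf and NaN inputs, while the fixup variant preserves them at the cost of the extra per-element checks.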

The added/updated eltwise coverage includes:

tanh
logistic
swish
elu
gelu_tanh
gelu_erf
exp
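
For cross-checking JIT output on finite inputs, scalar reference formulas for the algorithms listed above can be sketched as follows. The function names are hypothetical, and the formulas are the commonly documented definitions of these activations; this sketch makes no claim about the polynomial approximations or accuracy of the RVV kernels themselves.

```cpp
#include <cmath>

// Scalar reference formulas; names are illustrative, not oneDNN API.
inline float ref_exp(float x) { return std::exp(x); }
inline float ref_tanh(float x) { return std::tanh(x); }
inline float ref_logistic(float x) { return 1.0f / (1.0f + std::exp(-x)); }

// swish: x * logistic(alpha * x)
inline float ref_swish(float x, float alpha) {
    return x * ref_logistic(alpha * x);
}

// elu: identity for positive inputs, alpha * (exp(x) - 1) otherwise
inline float ref_elu(float x, float alpha) {
    return x > 0.0f ? x : alpha * std::expm1(x);
}

// gelu_tanh: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 * x^3)))
inline float ref_gelu_tanh(float x) {
    const float k = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}

// gelu_erf: 0.5 * x * (1 + erf(x / sqrt(2)))
inline float ref_gelu_erf(float x) {
    const float inv_sqrt2 = 0.70710678f; // 1 / sqrt(2)
    return 0.5f * x * (1.0f + std::erf(x * inv_sqrt2));
}
```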

Validation

benchdnn validation was run on a local RV64 environment using the option_set_all_algs batch for f32 and f16 eltwise, plus the shapes_2d batch for f16 softmax and logsoftmax:

./benchdnn --eltwise --mode=C --dir=FWD_D --dt=f32 --tag=abx,axb --batch=inputs/eltwise/option_set_all_algs
./benchdnn --eltwise --mode=C --dir=FWD_D --dt=f16 --tag=abx,axb --batch=inputs/eltwise/option_set_all_algs
./benchdnn --softmax -v2 --mode=C --dir=FWD_D --sdt=f16 --ddt=f16 --stag=abx --alg=SOFTMAX,LOGSOFTMAX --axis=1,3 --batch=inputs/softmax/shapes_2d

Result notes:

f32: tests:1274 passed:1120 skipped:154 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
f16: tests:1274 passed:1106 skipped:168 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
All skipped cases are benchdnn Invalid case entries, such as invalid alpha/beta combinations for elu_dst, relu_dst, clip, clip_v2, and clip_v2_dst; f16 also skips round.
f16 softmax/logsoftmax correctness also passed after reusing the eltwise emitter for the softmax exp path:
tests:32 passed:32 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0

Checklist

General

Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
Have you formatted the code using clang-format?

Notes

To reuse the emitter from another RV64 JIT kernel:

Include cpu/rv64/jit_rvv_eltwise_emitter.hpp and create jit_rvv_eltwise_fwd_emitter_t elt(this) inside a jit_generator_t subclass.
Allocate temporary vector registers, scalar constants (alpha, beta, zero, one), floating-point temporaries, and integer temporaries, then pass them through eltwise_aux_regs_t. The actual source and destination vectors are passed explicitly to each helper.
Set the current RVV VL/SEW/LMUL and load or produce the input vector before calling the emitter. The f16 eltwise and softmax users widen f16 input to f32, call the emitter on f32 vectors, then narrow or store as needed.
Call the required helper, for example elt.exp(regs, v_dst, v_src) or elt.gelu_tanh(regs, v_dst, v_src). Helpers write the selected destination vector and may freely use the scratch registers listed in eltwise_aux_regs_t.
Store the destination vector or feed it into the caller JIT sequence. If the caller needs NaN/Inf behavior, handle it outside the emitter as noted above.

Softmax f16 example:

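// Bind the shared emitter to this jit_generator_t subclass.
jit_rvv_eltwise_fwd_emitter_t elt(this);
// Scratch vectors, scalar constants, and temporaries the helpers may use.
const eltwise_aux_regs_t regs {v_bias, v_tmpv, v_poly, v_red, f_sub,
        f_zero, f_zero, f_zero, f_tmp0, f_tmp1, t4, t5};

vle16_v(v_in16, reg_src);      // load the f16 input
vfwcvt_f_f_v(v_x, v_in16);     // widen f16 to f32
vfsub_vf(v_x, v_x, f_sub);     // subtract the row maximum
elt.exp(regs, v_x, v_x);       // exp in place on the f32 vector
vfadd_vv(v_acc, v_acc, v_x);   // accumulate the exponentials
vse32_v(v_x, reg_tmp);         // store f32 temporaries for the normalization pass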

Here softmax provides the surrounding algorithm semantics: it subtracts the row maximum before calling elt.exp(), accumulates the exponentials, and stores the temporary f32 values for the later normalization pass.
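
The surrounding softmax semantics can be sketched as a scalar two-pass model. This is an illustrative reference only: `softmax_row` is a hypothetical helper name, and `std::exp` stands in for the shared emitter's exp sequence.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar model of one softmax row (hypothetical helper, not oneDNN API).
inline std::vector<float> softmax_row(const std::vector<float> &src) {
    assert(!src.empty());

    // Find the row maximum so the exp argument is never positive.
    float row_max = src[0];
    for (float v : src) row_max = std::fmax(row_max, v);

    // Pass 1: subtract the maximum, exponentiate, accumulate the sum,
    // and keep the f32 temporaries for the second pass.
    std::vector<float> tmp(src.size());
    float acc = 0.0f;
    for (std::size_t i = 0; i < src.size(); ++i) {
        tmp[i] = std::exp(src[i] - row_max);
        acc += tmp[i];
    }

    // Pass 2: normalize the stored temporaries by the accumulated sum.
    for (float &v : tmp) v /= acc;
    return tmp;
}
```

Because the maximum is subtracted first, every exp argument is at most zero, so large inputs such as 1000.0f do not overflow in this model.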

Ga1axy0 force-pushed the upload branch from e4c9d3e to e7a9198 on June 15, 2026 03:54

cpu: rv64: share RVV eltwise emitters

f1ebc42

Ga1axy0 force-pushed the upload branch from e7a9198 to f1ebc42 on June 15, 2026 05:41

Ga1axy0 closed this Jun 16, 2026

Ga1axy0 wants to merge 1 commit into spacemit-com:upstream-spacemit-ops from Ga1axy0:upload
