cpu: rv64: share RVV eltwise emitters #9

Closed
Ga1axy0 wants to merge 1 commit into spacemit-com:upstream-spacemit-ops from Ga1axy0:upload

Ga1axy0 commented Jun 12, 2026

Description

This PR adds RVV JIT support for additional RV64 eltwise forward algorithms and factors shared eltwise code generation into a reusable emitter helper.

The new emitter is used by the RV64 eltwise JIT kernels and by the f16 softmax exp-sub-sum kernel through elt.exp(). This avoids duplicating the RVV exp polynomial sequence in softmax while keeping softmax kernels as regular jit_generator_t users.

The emitter is intended to cover the regular finite-input fast path. It does not classify or preserve special NaN/Inf values on its own. Some of the newly added algorithms, including exp, tanh, and gelu_tanh, apply explicit lower/upper bounds with RVV min/max instructions. Existing clamp-based eltwise algorithms such as hardsigmoid and clip also use min/max-style clamping without explicit special-value fixups.
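As a rough scalar illustration of why a min/max-bounded fast path does not preserve special values, the sketch below models one lane in plain C++. The bounds `kExpLo` and `kExpHi` and the name `bounded_exp` are assumptions for illustration, not values or identifiers from this PR, and `std::exp` stands in for the RVV polynomial sequence. `std::fmin`/`std::fmax` return the non-NaN operand when one input is NaN, which matches the minimumNumber/maximumNumber behavior of RVV `vfmin`/`vfmax`.

```cpp
#include <cmath>
#include <limits>

// Hypothetical f32 bounds, chosen so that exp(x) stays finite and normal.
constexpr float kExpLo = -87.0f;
constexpr float kExpHi = 88.0f;

// Scalar model of one lane: clamp the input, then evaluate exp.
// A NaN input is replaced by a bound in the clamp, so it is not propagated.
inline float bounded_exp(float x) {
    const float clamped = std::fmax(std::fmin(x, kExpHi), kExpLo);
    return std::exp(clamped); // stand-in for the polynomial approximation
}
```

Under these assumptions a NaN or +Inf input yields the finite value at the upper bound, and -Inf yields a small positive value rather than zero.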

If special NaN/Inf preservation is required for these paths, it likely needs to be discussed as a unified RVV JIT eltwise policy rather than handled only for the newly added bounded emitters. Adding per-lane special-value detection and fixup in the kernel would introduce extra comparisons, masks, and merges on the eltwise hot path, which may affect the overall performance of these kernels.
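To make the extra hot-path work concrete, here is a minimal scalar sketch of what a per-lane fixup could look like on top of a bounded exp. Everything in it is hypothetical: the function name `exp_with_fixup`, the bounds, and the choice of which special values to restore are assumptions, not part of this PR. Each boolean corresponds to a vector comparison producing a mask, and each conditional assignment corresponds to a masked merge.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar model of a bounded exp followed by a special-value fixup pass.
std::vector<float> exp_with_fixup(const std::vector<float> &src) {
    constexpr float lo = -87.0f, hi = 88.0f; // hypothetical bounds
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        const float x = src[i];
        // Fast path: two clamps, then the exp approximation.
        float y = std::exp(std::fmax(std::fmin(x, hi), lo));
        // Extra work: three comparisons, each a mask in the vector kernel.
        const bool is_nan = x != x;
        const bool is_pinf = std::isinf(x) && x > 0.0f;
        const bool is_ninf = std::isinf(x) && x < 0.0f;
        // Extra work: three merges that overwrite the fast-path result.
        if (is_nan) y = x;     // propagate NaN
        if (is_pinf) y = x;    // exp(+Inf) = +Inf
        if (is_ninf) y = 0.0f; // exp(-Inf) = 0
        dst[i] = y;
    }
    return dst;
}
```

The comparisons and merges run for every lane regardless of whether any special value is present, which is the cost the paragraph above refers to.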

The added/updated eltwise coverage includes:

  • tanh
  • logistic
  • swish
  • elu
  • gelu_tanh
  • gelu_erf
  • exp
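For orientation, the listed algorithms can be written as scalar reference functions. This is a sketch based on the usual oneDNN eltwise definitions as I understand them; it is not the RVV polynomial code from this PR, and the `ref_*` names are made up for illustration.

```cpp
#include <cmath>

// Scalar reference forms of the listed eltwise forward algorithms.
inline float ref_tanh(float x) { return std::tanh(x); }
inline float ref_logistic(float x) { return 1.0f / (1.0f + std::exp(-x)); }
inline float ref_swish(float x, float alpha) {
    return x / (1.0f + std::exp(-alpha * x)); // x * logistic(alpha * x)
}
inline float ref_elu(float x, float alpha) {
    return x > 0.0f ? x : alpha * (std::exp(x) - 1.0f);
}
inline float ref_gelu_tanh(float x) {
    const float k = 0.7978845608f; // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(k * (x + 0.044715f * x * x * x)));
}
inline float ref_gelu_erf(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f)); // x / sqrt(2)
}
inline float ref_exp(float x) { return std::exp(x); }
```

Several of these forms are built on `exp` or `tanh`, which is presumably why sharing those sequences in one emitter helper pays off.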

Validation

Benchdnn validation was run on a local RV64 environment with the option_set_all_algs case list for f32 and f16:

./benchdnn --eltwise --mode=C --dir=FWD_D --dt=f32 --tag=abx,axb --batch=inputs/eltwise/option_set_all_algs
./benchdnn --eltwise --mode=C --dir=FWD_D --dt=f16 --tag=abx,axb --batch=inputs/eltwise/option_set_all_algs
./benchdnn --softmax -v2 --mode=C --dir=FWD_D --sdt=f16 --ddt=f16 --stag=abx --alg=SOFTMAX,LOGSOFTMAX --axis=1,3 --batch=inputs/softmax/shapes_2d

Result notes:

  • f32: tests:1274 passed:1120 skipped:154 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
  • f16: tests:1274 passed:1106 skipped:168 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0
  • All skipped cases are benchdnn Invalid case entries, such as invalid alpha/beta combinations for elu_dst, relu_dst, clip, clip_v2, and clip_v2_dst; f16 also skips round.
  • f16 softmax/logsoftmax correctness also passed after reusing the eltwise emitter for the softmax exp path:
    tests:32 passed:32 skipped:0 mistrusted:0 unimplemented:0 invalid_arguments:0 failed:0 listed:0

Checklist

General

  • Do all unit and benchdnn tests (make test and make test_benchdnn_*) pass locally for each commit?
  • Have you formatted the code using clang-format?

Notes

To reuse the emitter from another RV64 JIT kernel:

  • Include cpu/rv64/jit_rvv_eltwise_emitter.hpp and create jit_rvv_eltwise_fwd_emitter_t elt(this) inside a jit_generator_t subclass.
  • Allocate temporary vector registers, scalar constants (alpha, beta, zero, one), floating temporaries, and integer temporaries, then pass them through eltwise_aux_regs_t. The actual source and destination vectors are passed explicitly to each helper.
  • Set the current RVV VL/SEW/LMUL and load or produce the input vector before calling the emitter. The f16 eltwise and softmax users widen f16 input to f32, call the emitter on f32 vectors, then narrow or store as needed.
  • Call the required helper, for example elt.exp(regs, v_dst, v_src) or elt.gelu_tanh(regs, v_dst, v_src). Helpers write the selected destination vector and may freely use the scratch registers listed in eltwise_aux_regs_t.
  • Store the destination vector or feed it into the caller's JIT sequence. If the caller needs specific NaN/Inf behavior, handle it outside the emitter, as noted in the description.

Softmax f16 example:

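jit_rvv_eltwise_fwd_emitter_t elt(this);
const eltwise_aux_regs_t regs {v_bias, v_tmpv, v_poly, v_red, f_sub,
        f_zero, f_zero, f_zero, f_tmp0, f_tmp1, t4, t5};

vle16_v(v_in16, reg_src); // load f16 inputs
vfwcvt_f_f_v(v_x, v_in16); // widen f16 to f32
vfsub_vf(v_x, v_x, f_sub); // subtract the row maximum
elt.exp(regs, v_x, v_x); // exp in place through the shared emitter
vfadd_vv(v_acc, v_acc, v_x); // accumulate the exponentials
vse32_v(v_x, reg_tmp); // store f32 temporaries for normalization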

Here softmax provides the surrounding algorithm semantics: it subtracts the row maximum before calling elt.exp(), accumulates the exponentials, and stores the temporary f32 values for the later normalization pass.
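The same exp-sub-sum structure can be sketched as a scalar f32 reference for one row. This is only an illustration: `softmax_row` is a hypothetical name, `std::exp` stands in for `elt.exp()`, and the f16 widen/narrow steps of the real kernel are omitted.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Scalar model of one softmax row: max-subtract, exp, accumulate, normalize.
std::vector<float> softmax_row(const std::vector<float> &src) {
    if (src.empty()) return {};
    const float row_max = *std::max_element(src.begin(), src.end());
    std::vector<float> tmp(src.size()); // f32 temporaries, as in the kernel
    float sum = 0.0f;
    for (std::size_t i = 0; i < src.size(); ++i) {
        tmp[i] = std::exp(src[i] - row_max); // exp input is always <= 0
        sum += tmp[i];
    }
    // Later normalization pass over the stored temporaries.
    for (float &v : tmp) v /= sum;
    return tmp;
}
```

Because the row maximum is subtracted first, the exp input never exceeds zero for finite rows, so only the lower bound of a clamped exp can come into play here.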
