
Conversation

@tmiw (Collaborator) commented Nov 6, 2025

This PR contains various performance improvements to ensure that the hybrid C/Python version of RADEV1 can run acceptably on a Raspberry Pi 4 (the hardware inside the Flex 8000 and Aurora series of radios):

  • Disables the Python garbage collector on rade_open() (ensures that each run through rade_tx() and rade_rx() takes a deterministic amount of time).
  • Sync/SNR are retrieved during rade_rx() to avoid taking the Python lock multiple times per block of audio (also improves determinism).
  • radae/dsp.py:
    • Reuse blocks of memory during computations where possible (reduces calls to malloc(), etc.)
    • Use np.lib.stride_tricks.as_strided() to build sliding-window views where possible (e.g. [[1,2,3],[2,3,4],...]) and then perform a single NumPy call (e.g. np.matmul) on the result. This reduces the overhead of crossing between Python and C for every small NumPy operation (see the sketch below this list).
    • Similarly, check_pilots now builds a single array from the randomly-selected rx samples and makes one NumPy call to compute the result (written back into Dt1 and Dt2).
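
As a rough illustration of the GC and sliding-window points above (hypothetical names and shapes; this is not the actual radae/dsp.py code), the sketch below disables the garbage collector at open time and uses as_strided() to build a no-copy view of overlapping windows, so a single np.matmul call replaces a Python-level loop of small products:

    # Illustrative sketch only; function and variable names are made up.
    import gc
    import numpy as np

    def open_receiver():
        # rade_open()-style setup: disable the GC so each rx/tx call takes a
        # deterministic amount of time (no collection pauses mid-block).
        gc.disable()

    def sliding_windows(x, w):
        # No-copy view of shape (len(x) - w + 1, w): [[x0,x1,...],[x1,x2,...],...]
        n = x.shape[0] - w + 1
        s = x.strides[0]
        return np.lib.stride_tricks.as_strided(x, shape=(n, w), strides=(s, s))

    open_receiver()
    rx = np.arange(16).astype(np.complex64)    # stand-in for received samples
    p = np.ones(4, dtype=np.complex64)         # stand-in for a pilot sequence
    win = sliding_windows(rx, len(p))          # (13, 4) view, no per-window copies
    corr = np.matmul(win, np.conj(p))          # one NumPy call instead of 13 small ones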

Performance comparison in the GitHub environment using ctest -V -R radae_rx_profile:

main (f4254de):

36:          3742067 function calls (3660181 primitive calls) in 16.440 seconds
36: 
36:    Ordered by: internal time
36: 
36:    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
36:       448    1.514    0.003    1.589    0.004 dsp.py:57(bpf)
36:       409    1.359    0.003    1.573    0.004 dsp.py:381(est_pilots)
36:       411    1.354    0.003    1.411    0.003 dsp.py:196(refine)
36:      2863    1.145    0.000    1.145    0.000 {built-in method torch._C._nn.linear}
36:      2045    0.991    0.000    0.991    0.000 {built-in method torch.gru}
36:       448    0.970    0.002   14.053    0.031 radae_rxe.py:171(do_radae_rx)
36:      2045    0.921    0.000    0.921    0.000 {built-in method torch.conv1d}
36:      2055    0.891    0.000    0.891    0.000 {built-in method torch._weight_norm}
36:       409    0.869    0.002    2.814    0.007 dsp.py:422(do_pilot_eq_one)
36:       410    0.822    0.002    0.886    0.002 dsp.py:236(check_pilots)
36:     25012    0.690    0.000    0.690    0.000 {built-in method torch.matmul}
36:        38    0.544    0.014    0.778    0.020 dsp.py:141(detect_pilots)
36:     53/50    0.350    0.007    0.354    0.007 {built-in method _imp.create_dynamic}
36:      6544    0.180    0.000    0.281    0.000 radae_base.py:80(n)
36:    2234/1    0.129    0.000   16.444   16.444 {built-in method builtins.exec}

This PR:

36:          2984008 function calls (2902122 primitive calls) in 13.412 seconds
36: 
36:    Ordered by: internal time
36: 
36:    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
36:      2863    1.064    0.000    1.064    0.000 {built-in method torch._C._nn.linear}
36:       448    0.973    0.002   11.118    0.025 radae_rxe.py:171(do_radae_rx)
36:       448    0.947    0.002    0.963    0.002 dsp.py:59(bpf)
36:        38    0.900    0.024    0.916    0.024 dsp.py:153(detect_pilots)
36:       409    0.889    0.002    2.141    0.005 dsp.py:452(do_pilot_eq_one)
36:      2045    0.813    0.000    0.813    0.000 {built-in method torch.conv1d}
36:      2055    0.809    0.000    0.809    0.000 {built-in method torch._weight_norm}
36:       409    0.805    0.002    0.915    0.002 dsp.py:409(est_pilots)
36:      2045    0.804    0.000    0.804    0.000 {built-in method torch.gru}
36:     25012    0.559    0.000    0.559    0.000 {built-in method torch.matmul}
36:       410    0.495    0.001    0.568    0.001 dsp.py:252(check_pilots)
36:       411    0.337    0.001    0.583    0.001 dsp.py:208(refine)
36:     53/50    0.331    0.006    0.336    0.007 {built-in method _imp.create_dynamic}
36:      6544    0.161    0.000    0.256    0.000 radae_base.py:80(n)
36:    2234/1    0.120    0.000   13.416   13.416 {built-in method builtins.exec}

(roughly an 18% reduction in total time, based on the first line of each result: 16.440 s down to 13.412 s)
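
For reference, tables in this format ("Ordered by: internal time") come from Python's built-in cProfile/pstats; the snippet below is only a sketch of how such numbers could be produced, and the actual radae_rx_profile ctest wiring may differ:

    # Sketch only; the real radae_rx_profile test may set this up differently.
    import cProfile
    import pstats

    def run_rx():
        pass  # stand-in for the receive loop being profiled

    prof = cProfile.Profile()
    prof.enable()
    run_rx()
    prof.disable()
    pstats.Stats(prof).sort_stats("tottime").print_stats(15)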

Real-world CPU usage testing with the Flex waveform in freedv-gui:

  • Idle, no sync (peak, approximate; measured using top): 95% CPU using main, 30% using this PR (!)
  • Synced and actively decoding: 83% CPU using main, 60% CPU using this PR

@drowe67 (Owner) commented Nov 13, 2025

Sorry I just saw this one now, as it was in my junk folder. Will take a look when I get some time.

Would suggest doing just the minimum optimisation needed to get the Flex port running. As per PLT policy with RADE V1 we don't want to make RADE V1 optimisation/maintenance a regular activity - it will all be deleted soon.

@tmiw (Collaborator, Author) commented Nov 14, 2025

Would suggest doing just the minimum optimisation needed to get the Flex port running. As per PLT policy with RADE V1 we don't want to make RADE V1 optimisation/maintenance a regular activity - it will all be deleted soon.

Agreed. Walter and I have been testing with the PR version of RADE and the waveform seems to be holding up well. I did test the Flex waveform with main and the RX audio was nowhere near as smooth, even though in theory we still have remaining idle CPU.

@drowe67 (Owner) commented Nov 21, 2025

@tmiw:

  1. Is this still WIP and likely to have more changes? Or have you hit the real time perf requirements for Flex support and feel this can be merged as is?

  2. Is this PR being used for freedv-gui 2.1.0? Or is that still running off the main branch?

@drowe67 (Owner) commented Nov 21, 2025

@tmiw - do the ctests pass on the Pi 4 platform? (Assuming they can be run)

@drowe67 (Owner) commented Nov 21, 2025

This PR has some pretty extensive mods to several really important DSP functions. It's really hard to tell from simply looking at the source mods if the changes are OK. Really, really easy for subtle issues to cause problems here, and they may not be picked up by the ctests. Nervous about this code being pushed out to non-Flex users at this time.

I think we need some evidence that each function has identical performance to the main version, for example run this version and main, dumping output values from each function over 60s (say on a worst case MP channel run), comparing MSE. Sort of thing we do for C or stm32 ports. Some of the ctests may cover this already. However, this is a lot of work.
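
(A minimal sketch of the kind of per-function comparison being described, assuming each build dumps the same intermediate output, e.g. from est_pilots, to a .npy file over a long worst-case multipath run; the file names and threshold below are placeholders, not part of the existing test suite:)

    # Placeholder file names and threshold; assumes both builds dump matching arrays.
    import numpy as np

    ref = np.load("est_pilots_main.npy")   # dumped from the main build
    new = np.load("est_pilots_pr.npy")     # dumped from this PR's build
    mse = np.mean(np.abs(ref - new) ** 2)
    print(f"MSE: {mse:.3e}")
    assert mse < 1e-6, "outputs from the two builds have diverged"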

Alternatively, use this branch just for the Flex port, and consider it experimental.

@tmiw (Collaborator, Author) commented Nov 22, 2025

  1. Is this still WIP and likely to have more changes? Or have you hit the real time perf requirements for Flex support and feel this can be merged as is?

No further changes planned unless bugs are uncovered.

  2. Is this PR being used for freedv-gui 2.1.0? Or is that still running off the main branch?

Yep, it's being used for freedv-gui too since the waveform does share a fair amount of code with it.

do the ctests pass on the Pi 4 platform? (Assuming they can be run)

They do run but take quite a while (for example, ~2600 seconds on the CM4 board I'm using for waveform testing). No issues as far as I can tell.

Also, GitHub has Linux ARM runners now so I added ARM to the automated ctests in this repo.

Note: main doesn't currently pass on either x86_64 or ARM due to some changes made in Opus. This PR updates the default OPUS_URL to the same one freedv-gui's using (as tests were failing there too due to those changes).

I think we need some evidence that each function has identical performance to the main version, for example run this version and main, dumping output values from each function over 60s (say on a worst case MP channel run), comparing MSE. Sort of thing we do for C or stm32 ports. Some of the ctests may cover this already. However, this is a lot of work.

I'll have to think about this some more.

@drowe67 (Owner) commented Nov 23, 2025

I'll have to think about this some more.

If you decide you are still keen to merge this code or use it for non-Flex targets pls let me know and I'll design adequate tests for you to implement. Please do not start coding any more tests until I have approved a test plan.

Alternatively, just using the code in this PR for experimental use on Flex only is acceptable to me.

Perhaps the appropriate use of this PR is something we need to discuss at PLT level:

  1. It implies further resources being diverted to RADE V1 which we have previously decided not to do.
  2. It concerns me that highly modified, experimental RADE V1 code has found its way into freedv-gui based on the decision of a single team member and prior to adequate review. This is at odds with the policy decisions we made at PLT level after the unfortunate premature release of RADE V1 12 months ago.

@tmiw (Collaborator, Author) commented Nov 24, 2025

I'll have to think about this some more.

If you decide you are still keen to merge this code or use it for non-Flex targets pls let me know and I'll design adequate tests for you to implement. Please do not start coding any more tests until I have approved a test plan.

Alternatively, just using the code in this PR for experimental use on Flex only is acceptable to me.

Perhaps the appropriate use of this PR is something we need to discuss at PLT level:

  1. It implies further resources being diverted to RADE V1 which we have previously decided not to do.
  2. It concerns me that highly modified, experimental RADE V1 code has found its way into freedv-gui based on the decision of a single team member and prior to adequate review. This is at odds with the policy decisions we made at PLT level after the unfortunate premature release of RADE V1 12 months ago.

I went ahead and updated CMake in freedv-gui to only use this branch for the Flex and KA9Q/web SDR integrations for now. Something we can maybe consider too is limiting the changes to only the C code and bpf / detect_pilots as that seems to provide the biggest benefit (and if needed for Flex to work properly, we can have a separate PR for the rest of the DSP changes). This is up to PLT, though.

@drowe67 (Owner) commented Dec 1, 2025

@tmiw - further to our PLT discussion - when convenient could you pls break out just the BPF optimisation into a separate PR:

  1. This is used by RADE V1 and V2 so it makes sense to put some effort into optimisation.
  2. There is a pretty good test framework already so review should be straight fwd.
  3. A good test will set us up nicely for a C port of the BPF down the track, and allow delegation to other developers.

@tmiw mentioned this pull request Dec 1, 2025
@tmiw (Collaborator, Author) commented Dec 1, 2025

@tmiw - further to our PLT discussion - when convenient could you pls break out just the BPF optimisation into a separate PR:

  1. This is used by RADE V1 and V2 so it makes sense to put some effort into optimisation.
  2. There is a pretty good test framework already so review should be straight fwd.
  3. A good test will set us up nicely for a C port of the BPF down the track, and allow delegation to other developers.

Done: #60. I also added some additional comments to hopefully explain what the new code is doing.

drowe67 added a commit that referenced this pull request Dec 6, 2025
@drowe67 (Owner) commented Dec 17, 2025

Closing so these proposed changes can be broken out into smaller PRs such as #60

@drowe67 closed this Dec 17, 2025