
Conversation

@tmiw (Collaborator) commented Nov 6, 2025

This PR contains various performance improvements to ensure that the hybrid C/Python version of RADEV1 can run acceptably on a Raspberry Pi 4 (the hardware inside the Flex 8000 and Aurora series of radios):

  • Disables the Python garbage collector on rade_open() (ensures that each run through rade_tx() and rade_rx() takes a deterministic amount of time).
  • Sync/SNR are retrieved during rade_rx() to avoid taking the Python lock multiple times per block of audio (also improves determinism).
  • radae/dsp.py:
    • Reuse blocks of memory during computations where possible (reduces calls to malloc(), etc.)
    • Use np.lib.stride_tricks.as_strided() to build sliding-window views where possible (e.g. [[1,2,3],[2,3,4],...]) and then perform a single NumPy call (e.g. np.matmul) on the result. This reduces the overhead of crossing between Python and C for every small NumPy operation (see the sketch below this list).
    • Similarly, check_pilots now builds a single array from the randomly-selected rx samples and makes one NumPy call to compute the result (written back into Dt1 and Dt2).
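
As a rough illustration of the GC and sliding-window points above (hypothetical names and shapes; this is not the actual radae/dsp.py code), the sketch below disables the garbage collector at open time and uses as_strided() to build a no-copy view of overlapping windows, so a single np.matmul call replaces a Python-level loop of small products:

    # Illustrative sketch only; function and variable names are made up.
    import gc
    import numpy as np

    def open_receiver():
        # rade_open()-style setup: disable the GC so each rx/tx call takes a
        # deterministic amount of time (no collection pauses mid-block).
        gc.disable()

    def sliding_windows(x, w):
        # No-copy view of shape (len(x) - w + 1, w): [[x0,x1,...],[x1,x2,...],...]
        n = x.shape[0] - w + 1
        s = x.strides[0]
        return np.lib.stride_tricks.as_strided(x, shape=(n, w), strides=(s, s))

    open_receiver()
    rx = np.arange(16).astype(np.complex64)    # stand-in for received samples
    p = np.ones(4, dtype=np.complex64)         # stand-in for a pilot sequence
    win = sliding_windows(rx, len(p))          # (13, 4) view, no per-window copies
    corr = np.matmul(win, np.conj(p))          # one NumPy call instead of 13 small ones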

Performance comparison in the GitHub environment using ctest -V -R radae_rx_profile:

main (f4254de):

36:          3742067 function calls (3660181 primitive calls) in 16.440 seconds
36: 
36:    Ordered by: internal time
36: 
36:    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
36:       448    1.514    0.003    1.589    0.004 dsp.py:57(bpf)
36:       409    1.359    0.003    1.573    0.004 dsp.py:381(est_pilots)
36:       411    1.354    0.003    1.411    0.003 dsp.py:196(refine)
36:      2863    1.145    0.000    1.145    0.000 {built-in method torch._C._nn.linear}
36:      2045    0.991    0.000    0.991    0.000 {built-in method torch.gru}
36:       448    0.970    0.002   14.053    0.031 radae_rxe.py:171(do_radae_rx)
36:      2045    0.921    0.000    0.921    0.000 {built-in method torch.conv1d}
36:      2055    0.891    0.000    0.891    0.000 {built-in method torch._weight_norm}
36:       409    0.869    0.002    2.814    0.007 dsp.py:422(do_pilot_eq_one)
36:       410    0.822    0.002    0.886    0.002 dsp.py:236(check_pilots)
36:     25012    0.690    0.000    0.690    0.000 {built-in method torch.matmul}
36:        38    0.544    0.014    0.778    0.020 dsp.py:141(detect_pilots)
36:     53/50    0.350    0.007    0.354    0.007 {built-in method _imp.create_dynamic}
36:      6544    0.180    0.000    0.281    0.000 radae_base.py:80(n)
36:    2234/1    0.129    0.000   16.444   16.444 {built-in method builtins.exec}

This PR:

36:          2984008 function calls (2902122 primitive calls) in 13.412 seconds
36: 
36:    Ordered by: internal time
36: 
36:    ncalls  tottime  percall  cumtime  percall filename:lineno(function)
36:      2863    1.064    0.000    1.064    0.000 {built-in method torch._C._nn.linear}
36:       448    0.973    0.002   11.118    0.025 radae_rxe.py:171(do_radae_rx)
36:       448    0.947    0.002    0.963    0.002 dsp.py:59(bpf)
36:        38    0.900    0.024    0.916    0.024 dsp.py:153(detect_pilots)
36:       409    0.889    0.002    2.141    0.005 dsp.py:452(do_pilot_eq_one)
36:      2045    0.813    0.000    0.813    0.000 {built-in method torch.conv1d}
36:      2055    0.809    0.000    0.809    0.000 {built-in method torch._weight_norm}
36:       409    0.805    0.002    0.915    0.002 dsp.py:409(est_pilots)
36:      2045    0.804    0.000    0.804    0.000 {built-in method torch.gru}
36:     25012    0.559    0.000    0.559    0.000 {built-in method torch.matmul}
36:       410    0.495    0.001    0.568    0.001 dsp.py:252(check_pilots)
36:       411    0.337    0.001    0.583    0.001 dsp.py:208(refine)
36:     53/50    0.331    0.006    0.336    0.007 {built-in method _imp.create_dynamic}
36:      6544    0.161    0.000    0.256    0.000 radae_base.py:80(n)
36:    2234/1    0.120    0.000   13.416   13.416 {built-in method builtins.exec}

(roughly an 18% reduction in total time, based on the first line of each result: 16.440 s down to 13.412 s)
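
For reference, tables in this format ("Ordered by: internal time") come from Python's built-in cProfile/pstats; the snippet below is only a sketch of how such numbers could be produced, and the actual radae_rx_profile ctest wiring may differ:

    # Sketch only; the real radae_rx_profile test may set this up differently.
    import cProfile
    import pstats

    def run_rx():
        pass  # stand-in for the receive loop being profiled

    prof = cProfile.Profile()
    prof.enable()
    run_rx()
    prof.disable()
    pstats.Stats(prof).sort_stats("tottime").print_stats(15)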

Real-world CPU usage testing with the Flex waveform in freedv-gui:

  • Idle, no sync (peak, approximate; measured using top): 95% CPU using main, 30% using this PR (!)
  • Synced and actively decoding: 83% CPU using main, 60% CPU using this PR

@drowe67 (Owner) commented Nov 13, 2025

Sorry I just saw this one now, as it was in my junk folder. Will take a look when I get some time.

Would suggest doing just the minimum optimisation needed to get the Flex port running. As per PLT policy with RADE V1 we don't want to make RADE V1 optimisation/maintenance a regular activity - it will all be deleted soon.

@tmiw (Collaborator, Author) commented Nov 14, 2025

Would suggest doing just the minimum optimisation needed to get the Flex port running. As per PLT policy with RADE V1 we don't want to make RADE V1 optimisation/maintenance a regular activity - it will all be deleted soon.

Agreed. Walter and I have been testing with the PR version of RADE and the waveform seems to be holding up well. I did test the Flex waveform with main and the RX audio was nowhere near as smooth, even though in theory we still have remaining idle CPU.

@drowe67 (Owner) commented Nov 21, 2025

@tmiw:

  1. Is this still WIP and likely to have more changes? Or have you hit the real time perf requirements for Flex support and feel this can be merged as is?

  2. Is this PR being used for freedv-gui 2.1.0? Or is that still running off the main branch?

@drowe67 (Owner) commented Nov 21, 2025

@tmiw - do the ctests pass on the Pi 4 platform? (Assuming they can be run)

@drowe67 (Owner) commented Nov 21, 2025

This PR has some pretty extensive mods to several really important DSP functions. It's really hard to tell from simply looking at the source mods if the changes are OK. Really, really easy for subtle issues to cause problems here, and they may not be picked up by the ctests. Nervous about this code being pushed out to non-Flex users at this time.

I think we need some evidence that each function has identical performance to the main version, for example run this version and main, dumping output values from each function over 60s (say on a worst case MP channel run), comparing MSE. Sort of thing we do for C or stm32 ports. Some of the ctests may cover this already. However, this is a lot of work.
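
(A minimal sketch of the kind of per-function comparison being described, assuming each build dumps the same intermediate output, e.g. from est_pilots, to a .npy file over a long worst-case multipath run; the file names and threshold below are placeholders, not part of the existing test suite:)

    # Placeholder file names and threshold; assumes both builds dump matching arrays.
    import numpy as np

    ref = np.load("est_pilots_main.npy")   # dumped from the main build
    new = np.load("est_pilots_pr.npy")     # dumped from this PR's build
    mse = np.mean(np.abs(ref - new) ** 2)
    print(f"MSE: {mse:.3e}")
    assert mse < 1e-6, "outputs from the two builds have diverged"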

Alternatively, use this branch just for the Flex port, and consider it experimental.

@tmiw (Collaborator, Author) commented Nov 22, 2025

  1. Is this still WIP and likely to have more changes? Or have you hit the real time perf requirements for Flex support and feel this can be merged as is?

No further changes planned unless bugs are uncovered.

  2. Is this PR being used for freedv-gui 2.1.0? Or is that still running off the main branch?

Yep, it's being used for freedv-gui too since the waveform does share a fair amount of code with it.

do the ctests pass on the Pi 4 platform? (Assuming they can be run)

They do run but take quite a while (for example, ~2600 seconds on the CM4 board I'm using for waveform testing). No issues as far as I can tell.

Also, GitHub has Linux ARM runners now so I added ARM to the automated ctests in this repo.

Note: main doesn't currently pass on either x86_64 or ARM due to some changes made in Opus. This PR updates the default OPUS_URL to the same one freedv-gui's using (as tests were failing there too due to those changes).

I think we need some evidence that each function has identical performance to the main version, for example run this version and main, dumping output values from each function over 60s (say on a worst case MP channel run), comparing MSE. Sort of thing we do for C or stm32 ports. Some of the ctests may cover this already. However, this is a lot of work.

I'll have to think about this some more.

@drowe67 (Owner) commented Nov 23, 2025

I'll have to think about this some more.

If you decide you are still keen to merge this code or use it for non-Flex targets pls let me know and I'll design adequate tests for you to implement. Please do not start coding any more tests until I have approved a test plan.

Alternatively, just using the code in this PR for experimental use on Flex only is acceptable to me.

Perhaps the appropriate use of this PR is something we need to discuss at PLT level:

  1. It implies further resources being diverted to RADE V1 which we have previously decided not to do.
  2. It concerns me that highly modified, experimental RADE V1 code has found its way into freedv-gui based on the decision of a single team member and prior to adequate review. This is at odds with the policy decisions we made at PLT level after the unfortunate premature release of RADE V1 12 months ago.

@tmiw (Collaborator, Author) commented Nov 24, 2025

I'll have to think about this some more.

If you decide you are still keen to merge this code or use it for non-Flex targets pls let me know and I'll design adequate tests for you to implement. Please do not start coding any more tests until I have approved a test plan.

Alternatively, just using the code in this PR for experimental use on Flex only is acceptable to me.

Perhaps the appropriate use of this PR is something we need to discuss at PLT level:

  1. It implies further resources being diverted to RADE V1 which we have previously decided not to do.
  2. It concerns me that highly modified, experimental RADE V1 code has found its way into freedv-gui based on the decision of a single team member and prior to adequate review. This is at odds with the policy decisions we made at PLT level after the unfortunate premature release of RADE V1 12 months ago.

I went ahead and updated CMake in freedv-gui to only use this branch for the Flex and KA9Q/web SDR integrations for now. Something we can maybe consider too is limiting the changes to only the C code and bpf / detect_pilots as that seems to provide the biggest benefit (and if needed for Flex to work properly, we can have a separate PR for the rest of the DSP changes). This is up to PLT, though.

@drowe67 (Owner) commented Dec 1, 2025

@tmiw - further to our PLT discussion - when convenient could you pls break out just the BPF optimisation into a separate PR:

  1. This is used by RADE V1 and V2 so it makes sense to put some effort into optimisation.
  2. There is a pretty good test framework already so review should be straight fwd.
  3. A good test will set us up nicely for a C port of the BPF down the track, and allow delegation to other developers.

@tmiw mentioned this pull request Dec 1, 2025
@tmiw (Collaborator, Author) commented Dec 1, 2025

@tmiw - further to our PLT discussion - when convenient could you pls break out just the BPF optimisation into a separate PR:

  1. This is used by RADE V1 and V2 so it makes sense to put some effort into optimisation.
  2. There is a pretty good test framework already so review should be straight fwd.
  3. A good test will set us up nicely for a C port of the BPF down the track, and allow delegation to other developers.

Done: #60. I also added some additional comments to hopefully explain what the new code is doing.

drowe67 added a commit that referenced this pull request Dec 6, 2025
@drowe67 (Owner) commented Dec 17, 2025

Closing so these proposed changes can be broken out into smaller PRs such as #60

@drowe67 closed this Dec 17, 2025