Potential Memory Leak Introduced in Snapshot (August - Sept 2023?)

Can't help you with the memory issues, but you can solve your IRQ balancing issues with this script:
(credits to @fif, the actual author here)

/etc/config/init.d/irqbalance-manual

#!/bin/sh /etc/rc.common

# Version 2023-05-29

START=13
USE_PROCD=1
AFFINITY_MIN=2
AFFINITY_MAX=8
AFFINITY_ALL="$(printf %x $(( AFFINITY_MAX * 2 - 1 )))"

set_affinities() {
        local callback="$1" irq desc ret=0
        sed -nre 's!^[[:space:]]*([0-9]+):[[:space:]]+.*[[:space:]]GIC-0[[:space:]]+[0-9]+[[:space:]]+(Level|Edge)[[:space:]]+(.+)$!\1 \3! p' /proc/interrupts | \
        while read irq desc
        do
                case "$desc" in
                        arch*)  ;; # Properly balanced
                        ce*)    ;; # Wifi firmware crashes
                        edma*)  ;; # Hangs wifi on high throughput https://forum.openwrt.org/t/dynalink-dl-wrx36-askey-rt5010w-ipq8072a-technical-discussion/110454/1743
                        xhci*)  ;; # Crashes with USB drive https://forum.openwrt.org/t/dynalink-dl-wrx36-askey-rt5010w-ipq8072a-technical-discussion/110454/1736
                        *)      "$callback" "/proc/irq/$irq/smp_affinity" || ret=1 ;;
                esac
        done
        return $ret
}

set_affinity_per_cpu() {
        local procfile="$1" ret=0
        echo "$AFFINITY" > "$procfile" || ret=$?
        if [ $AFFINITY -ge $AFFINITY_MAX ]
        then
                AFFINITY=$AFFINITY_MIN
        else
                AFFINITY=$(( AFFINITY * 2 ))
        fi
        return $ret
}

set_affinity_shared() {
        local procfile="$1" ret=0
        echo "$AFFINITY_ALL" > "$procfile" || ret=$?
        return $ret
}

start_service() {
        reload_service
}

reload_service() {
        AFFINITY=$AFFINITY_MIN
        set_affinities set_affinity_per_cpu
}

stop_service() {
        set_affinities set_affinity_shared
}

Then, as a final touch, add the following under System > Startup > Local Startup (i.e., /etc/rc.local):

if [ ! -f /etc/init.d/irqbalance-manual ]; then
    ln -s /etc/config/init.d/irqbalance-manual /etc/init.d/irqbalance-manual
    service irqbalance-manual start
    service irqbalance-manual enable
fi

Thanks!
I actually don’t manually adjust… I run irqbalance.
It was just something I noticed poking around based on things I remembered from dev cycle…


Irqbalance does not work as expected on IPQ807x, because it doesn't recognize all the interrupts.

See this post for more information.


Thank you @Spacebar and @vit0r
I was not aware of the irqbalance issue on IPQ807x. I checked, and indeed CPU0 was taking all of the load.

With that, I have implemented a script based on bitthief's init.d script above. The load is now spread across the CPUs. I updated to 9536446965 and, interestingly, the memory has been stable, even after a few fast.com and bufferbloat tests.

I'll keep an eye on it and post back.

Have to say that it has not looked this good for this many minutes with load in quite a while.
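
For anyone who wants to check whether CPU0 really is taking all of the interrupt load, here is a minimal sketch that sums the per-CPU counters in /proc/interrupts. The function name per_cpu_irq_totals is made up for illustration, and the totals are only approximate because summary rows such as "Err:" are counted too.

```sh
#!/bin/sh
# Sum the interrupt counters per CPU from a /proc/interrupts-style file.
# per_cpu_irq_totals is a hypothetical helper name, not part of any package.
# Usage: per_cpu_irq_totals [FILE]   (FILE defaults to /proc/interrupts)
per_cpu_irq_totals() {
    awk '
        NR == 1 { ncpu = NF; next }   # header row: one column per CPU
        {
            # Fields 2..ncpu+1 hold the per-CPU counts for this IRQ.
            for (i = 1; i <= ncpu; i++)
                if ($(i + 1) ~ /^[0-9]+$/)
                    total[i] += $(i + 1)
        }
        END {
            for (i = 1; i <= ncpu; i++)
                printf "CPU%d %d\n", i - 1, total[i]
        }
    ' "${1:-/proc/interrupts}"
}

# Only run against the live system when the file is readable.
if [ -r /proc/interrupts ]; then
    per_cpu_irq_totals /proc/interrupts
fi
```

Running it before and after a speed test and comparing the numbers shows which cores the extra interrupts landed on.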

Ok, so this is promising. But strange.

After ~6 hours of uptime, the memory was rock solid. Dips under load, but recovered. All 4 CPUs bearing equal load. I passed a good 5GB WAN->Wired LAN and it did not OOM -- It stayed up around 90MB free.

I kept the download going (I think it was around 15GB in total, WAN->Wired LAN), and the router did freeze (both LEDs solid blue). I pulled the plug and restarted.

@hnyman As your name is all over the interrupt ticket :slight_smile: (and you spoke about freezes), in your experience did any of the interrupt assignments cause more consistent freezes?

@robimarko Want to pick your brain... Does this result (not running out of memory after turning off irqbalance and assigning the IRQs manually) make sense to you at all?

Below is the script I modified based on bitthief's repo and run once at startup:

#!/bin/sh

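set_affinity() {
# Look up the IRQ number by name. \$1 is escaped so that awk, not the shell, expands it;
# the first field of /proc/interrupts is "NN:", so the trailing colon is stripped.
irq=$(awk "/$1/{ print substr(\$1, 1, length(\$1)-1); exit }" /proc/interrupts)
[ -n "$irq" ] && echo "$2" > "/proc/irq/$irq/smp_affinity"
logger -t /tmp/set_smp_affinity.sh "Setting Affinity: $1 to $2"
}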

#assign 4 rx interrupts to each core
set_affinity 'reo2host-destination-ring1' 1
set_affinity 'reo2host-destination-ring2' 2
set_affinity 'reo2host-destination-ring3' 4
set_affinity 'reo2host-destination-ring4' 8

#assign 3 tcl completions to last 3 CPUs
set_affinity 'wbm2host-tx-completions-ring1' 2
set_affinity 'wbm2host-tx-completions-ring2' 4
set_affinity 'wbm2host-tx-completions-ring3' 8

#assign lan/wan to CPU 4
set_affinity 'edma_txcmpl' 8
set_affinity 'edma_rxfill' 8
set_affinity 'edma_rxdesc' 8
set_affinity 'edma_misc' 8

exit 0
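
In case the values 1, 2, 4 and 8 in the script look arbitrary: smp_affinity takes a hexadecimal bitmask with one bit per CPU, counted from CPU0, so "CPU 4" in the comment above is CPU3 with mask 8. A small sketch of the arithmetic (cpu_mask and all_cpus_mask are names I made up for illustration):

```sh
#!/bin/sh
# Print the hexadecimal smp_affinity mask that selects a single CPU.
# CPUs are numbered from 0, so the 4th core is CPU3.
cpu_mask() {
    printf '%x\n' $(( 1 << $1 ))
}

# Print the mask that selects the first N CPUs
# (all cores when N is the number of cores).
all_cpus_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

for cpu in 0 1 2 3; do
    echo "CPU$cpu -> $(cpu_mask "$cpu")"
done
echo "all 4 CPUs -> $(all_cpus_mask 4)"
```

On a 4-core device this prints the masks 1, 2, 4 and 8 for CPU0 to CPU3, and f for all cores, which is the same value the init script earlier in the thread computes as AFFINITY_ALL.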

Ugh, honestly that is rather weird.
IRQ numbers are not stable and you cannot rely on them but have to parse the IRQ by name.

Agreed, and based on that I ran some tests on version 9536446965.
It looks like regardless of irqbalance enabled/not, manual irq assignment/not, I cannot reproduce the memory exhaustion.

Looking at commits, I can't see anything relevant at all. I don't know what to say.

I'll keep an eye on things for a while, then close the ticket, and note here if anything changes.

Having read some of the other posts, I have also updated my script to put all edma on the 4th core. (Updated above as well).

With IRQ assignment (no irqbalance):

Uptime 17h 34m 56s
Total Available 111/406
Used 287/406
Cached 27/406

12 Wifi Clients attached (7 to IPQ8074 5GHz, 5 to IPQ8074 2GHz)

WAN:
RX: 113.82 GB (83826117 Pkts.)
TX: 3.60 GB (35512879 Pkts.)

Reboot. NO IRQ assignment (no irqbalance):
No significant memory drop

Reboot, With IRQ Balance enabled.
No significant memory drop

I'm experiencing this problem too, Xiaomi AX3600: ath11k firmware crash - qcom-q6v5-wcss-pil cd00000.q6v5_wcss: fatal error received: - #20 by Catfriend1

I would suggest trying the latest snapshot because it seems to have magically fixed it for me. Crossing fingers.

If running irqbalance try turning off/uninstalling.

Then try the interrupt script I posted and check.

If that doesn’t help either, I suggest adding your comment to the other thread (and the git ticket) as well so that the information is in one place.


Do you mean the recent R23911?

WLAN.HK.2.9.0.1-01890-QCAHKSWPL_SILICONZ-1

The OpenWRT build I’ve got on right now is:

(r24096-9536446965)

So basically grab the latest snapshot.
I don’t touch the wifi firmware so whatever comes currently with build…. I’m away so can’t check…


I also didn't touch the WiFi fw, as it's included.

Hmm still a problem :frowning:

:frowning:
Did you try disabling irqbalance (if you have it on) and running the script here?

[EDIT] Please make sure you have awk installed for the script to work.
If you don't, you can manually run at startup:

echo 1 > /proc/irq/66/smp_affinity
echo 2 > /proc/irq/67/smp_affinity
echo 4 > /proc/irq/68/smp_affinity
echo 8 > /proc/irq/69/smp_affinity

echo 2 > /proc/irq/48/smp_affinity
echo 4 > /proc/irq/53/smp_affinity
echo 8 > /proc/irq/56/smp_affinity

echo 8 > /proc/irq/32/smp_affinity
echo 8 > /proc/irq/33/smp_affinity
echo 8 > /proc/irq/35/smp_affinity
echo 8 > /proc/irq/36/smp_affinity

I do strongly recommend getting awk etc and making the script run successfully because the irqs can and do change...
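
If awk really is not available, the lookup by name can also be done with shell builtins only, which avoids hardcoding IRQ numbers. This is just a sketch under that assumption: irq_by_name and set_affinity_by_name are names invented here, and the match is a plain substring match on the /proc/interrupts line.

```sh
#!/bin/sh
# Find an IRQ number by (partial) name without awk, using only shell builtins.
# irq_by_name is a hypothetical helper name.
# Usage: irq_by_name NAME [FILE]   (FILE defaults to /proc/interrupts)
irq_by_name() {
    while read -r num rest; do
        case "$rest" in
            # First match wins; "${num%:}" strips the trailing colon from "NN:".
            *"$1"*) echo "${num%:}"; return 0 ;;
        esac
    done < "${2:-/proc/interrupts}"
    return 1
}

# Write the mask only when the name was actually found.
set_affinity_by_name() {
    irq=$(irq_by_name "$1") || { echo "IRQ not found: $1" >&2; return 1; }
    echo "$2" > "/proc/irq/$irq/smp_affinity"
}
```

With this in place, a line such as set_affinity_by_name 'edma_txcmpl' 8 would replace the corresponding hardcoded echo, and it keeps working when the IRQ numbers change between builds.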


Just to understand it better, can this save memory?
Thanks for the script.

Try anyways and let’s see ..

Hi!

It seems to me that the problem is caused by the mechanism for obtaining information about the wifi status.
What I mean is that when you open a tab where wifi information is displayed (the Status or Wireless tab), it starts processes that ultimately lead to memory leaks.

How to check.

Reboot the router (preferably by power-cycling it) and do not open LuCI at all.
Monitor the amount of free memory from the console with the “free” or “htop” command (whichever you prefer).

If you want to provoke the memory leak and speed it up, open LuCI's "Wireless" tab and, in another tab, run "Channel Analysis".
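
To make the check described above easier to compare between runs, the available memory can be logged with timestamps over SSH instead of watching "free" by hand. A minimal sketch, assuming the kernel exposes MemAvailable in /proc/meminfo; mem_available_kb and log_memory are names made up for this example.

```sh
#!/bin/sh
# Print MemAvailable in kB from a meminfo-style file (default: /proc/meminfo).
# mem_available_kb is a hypothetical helper name.
mem_available_kb() {
    while read -r key value unit; do
        if [ "$key" = "MemAvailable:" ]; then
            echo "$value"
            return 0
        fi
    done < "${1:-/proc/meminfo}"
    return 1
}

# Print one timestamped sample every INTERVAL seconds, COUNT times.
# Usage: log_memory INTERVAL COUNT
log_memory() {
    interval=$1 count=$2
    while [ "$count" -gt 0 ]; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') MemAvailable=$(mem_available_kb) kB"
        count=$(( count - 1 ))
        if [ "$count" -gt 0 ]; then
            sleep "$interval"
        fi
    done
}
```

For example, log_memory 60 360 >> /tmp/mem.log would take one sample per minute for six hours; doing one run without touching LuCI and one with the Wireless tab open should show whether the leak follows the tab.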

My current configuration: Redmi AX6, OpenWrt 23.05.
But the problem also reproduces on snapshot.

Additional packages - Stubby, HTTPS-DNS, SQM and many others.

The main thing is not to open the previously mentioned tabs.

Even earlier, I drew robimarko’s attention to the peculiar increase in the temperature of the radio module right after performing a “Scan”.

An increase in WIPHY temperature while you are scanning is not weird, as you are actively using the radio; as long as the temperature stays within normal limits, that is expected.

Yes, sure.
But there is a characteristic change in temperature.

Let me remind you that, before a scan, the temperature stays 10-15 degrees lower for any length of time, and it does not decrease after the scan has completed.

But the values are certainly not critical. I remembered this as an indication of a global change in the operating mode of the chip.
I remembered this in the hope that it might prompt a rethink.

I've been suffering from this for the last few releases. Enabling/disabling packet steering and QoS didn't help at all. And today I flashed r24328-91169898ce and got 3 reboots within 3 hours :confused:

I have several devices connected to wifi and 2 devices wired. And whenever I do a speedtest etc, I see memory usage increases and for some reason it's not getting released back.

Device: ax3600
