Potential Memory Leak Introduced in Snapshot (August - Sept 2023?)

It does appear that others are also having this issue based on the git ticket.

I also use openssl so I tried changing to mbedtls and updated to r24096-9536446965.

libmbedtls12
libustream-mbedtls20201210
openvpn-mbedtls
wpad-mbedtls

However, it did not help whatsoever:

Uptime: 2h 31m 59s
Total Available 33 / 406
Used 367 / 406
Cached 22 / 406

With r23900-7f54d9ba1a (openssl) it never drops below 90MB. Even days later...

Hi, this got me curious so I had to check if my wax-family has the same issue.
This is 1 week and uptime today is 13 days with version OpenWrt SNAPSHOT r24024-6585498372.

I ran an iperf test on both, but memory usage was unchanged.
I've installed wpad-openssl 2023-09-08-e5ccbfc6-3 on both btw. They share the same build...


@Edrikk Sorry if I'm asking a question you have already answered.
But have you tried a clean install with no extra packages and going from there? If memory stays the same, add one package you are using; if that one doesn't affect memory usage, go to the next package, and repeat until you see memory usage going up.

Thanks and all good :slight_smile:

I'm afraid that I can't do a full reinstall due to this being a main router.
I did disable all services and turn things on one by one, and the behaviour didn't change past the good build.
Given that others are seeing it now as well, I'm quite confident that it's not unique to me.

But here's the thing that I just discovered. As I noted above, my memory was down to ~30MB free.
Then I had to do a "git update / upgrade" on a wireless client that I have, and boom, the memory was freed up from the ~30 to 122MB!

Total Available 122 / 406
Used 274 / 406
Cached 23 / 406

I then remembered an issue from the "original" AX3600 thread, so tried this test. From the above 122MB free, I started a fast.com speed test on wired lan, which brought the free memory down to about 30MB. I then did the same fast.com test on a wireless device, and boom the RAM was freed.

I decided to run both in parallel. Again, Wired caused a drop down to about 70-80MB (from 122MB). When that happened I triggered a wireless fast.com speed test, and again, the memory was freed!

@robimarko this seems to line up with posts such as this from the old AX3600 thread. I remember there was a lot of discussion about it, and then the issue appeared to be fixed, but maybe somehow it's back?

As an FYI btw, I also noticed that at some point (quite possibly with Kernel 6.1?) the IRQs changed. dchard used to have a script on his AX6 which set IRQs 73, 74, and 75. They no longer exist...

#assign 3 tcl completions to 3 CPUs
echo 4 > /proc/irq/73/smp_affinity
echo 2 > /proc/irq/74/smp_affinity
echo 1 > /proc/irq/75/smp_affinity

Below is from K6.1 (including my last good version without this memory issue); I can't really go back further without issue given the target changes etc.

root@RM-AX3600:~# cat /proc/interrupts
           CPU0       CPU1       CPU2       CPU3
  9:          0          0          0          0 GIC-0  39 Level     arch_mem_timer
 13:    1476694    4503624    2316503    2717233 GIC-0  20 Level     arch_timer
 16:          0          0          0          0       MSI   0 Edge      PCIe PME, aerdrv
 17:          0          0          0          0 GIC-0 239 Level     bam_dma
 18:          0          0          0          0 GIC-0 270 Level     bam_dma
 19:     182052          0          0          0 GIC-0 178 Level     bam_dma
 20:          2          0          0          0 GIC-0 354 Edge      smp2p
 21:         10          0          0          0 GIC-0 340 Level     msm_serial0
 23:          0          0          0          0 GIC-0 216 Level     4a9000.thermal-sensor
 24:          0          0          0          0 GIC-0  35 Edge      wdt_bark
 25:          0          0          0          0 GIC-0 357 Edge      q6v5 wdog
 26:          0          0          0          0  pmic_arb 51380237 Edge      pm-adc5
 27:          5          0          0          0 GIC-0  47 Edge      cpr3
 28:          0          0          0          0     smp2p   0 Edge      q6v5 fatal
 29:          1          0          0          0     smp2p   1 Edge      q6v5 ready
 30:          0          0          0          0     smp2p   2 Edge      q6v5 handover
 31:          0          0          0          0     smp2p   3 Edge      q6v5 stop
 32:   15701314          0          0          0 GIC-0 377 Level     edma_txcmpl
 33:          0          0          0          0 GIC-0 385 Level     edma_rxfill
 34:          0          0          0          0   msmgpio  34 Edge      keys
 35:   15497062          0          0          0 GIC-0 393 Level     edma_rxdesc
 36:          0          0          0          0 GIC-0 376 Level     edma_misc
 37:         32          0          0          0       MSI 524288 Edge      ath10k_pci
 38:         64          0          0          0 GIC-0 353 Edge      glink-native
 39:          5          0          0          0 GIC-0 348 Edge      ce0
 40:    2562308          0          0          0 GIC-0 347 Edge      ce1
 41:    1622770          0          0          0 GIC-0 346 Edge      ce2
 42:      65747          0          0          0 GIC-0 343 Edge      ce3
 43:          1          0          0          0 GIC-0 443 Edge      ce5
 44:      58524          0          0          0 GIC-0  72 Edge      ce7
 45:          0          0          0          0 GIC-0 334 Edge      ce9
 46:          1          0          0          0 GIC-0 333 Edge      ce10
 47:          0          0          0          0 GIC-0  69 Edge      ce11
 48:     589101          0          0          0 GIC-0 189 Edge      wbm2host-tx-completions-ring1
 49:          0          0          0          0 GIC-0 323 Edge      reo2ost-exception
 50:      16800          0          0          0 GIC-0 322 Edge      wbm2host-rx-release
 51:          0          0          0          0 GIC-0 209 Edge      rxdma2host-destination-ring-mac1
 52:          0          0          0          0 GIC-0 212 Edge      host2rxdma-host-buf-ring-mac1
 53:     265646          0          0          0 GIC-0 190 Edge      wbm2host-tx-completions-ring2
 54:          0          0          0          0 GIC-0 211 Edge      rxdma2host-destination-ring-mac3
 55:          1          0          0          0 GIC-0 235 Edge      host2rxdma-host-buf-ring-mac3
 56:     126901          0          0          0 GIC-0 191 Edge      wbm2host-tx-completions-ring3
 57:          0          0          0          0 GIC-0 210 Edge      rxdma2host-destination-ring-mac2
 58:          0          0          0          0 GIC-0 215 Edge      host2rxdma-host-buf-ring-mac2
 59:        672          0          0          0 GIC-0 321 Edge      reo2host-status
 60:     380979          0          0          0 GIC-0 261 Edge      ppdu-end-interrupts-mac1
 61:          1          0          0          0 GIC-0 255 Edge      rxdma2host-monitor-status-ring-mac1
 62:    1630888          0          0          0 GIC-0 263 Edge      ppdu-end-interrupts-mac3
 63:          1          0          0          0 GIC-0 260 Edge      rxdma2host-monitor-status-ring-mac3
 64:          0          0          0          0 GIC-0 262 Edge      ppdu-end-interrupts-mac2
 65:          0          0          0          0 GIC-0 256 Edge      rxdma2host-monitor-status-ring-mac2
 66:      43559          0          0          0 GIC-0 267 Edge      reo2host-destination-ring1
 67:      55062          0          0          0 GIC-0 268 Edge      reo2host-destination-ring2
 68:      72891          0          0          0 GIC-0 271 Edge      reo2host-destination-ring3
 69:      33098          0          0          0 GIC-0 320 Edge      reo2host-destination-ring4
IPI0:      8085       8067       8194       7734       Rescheduling interrupts
IPI1:    552938    4878960    2380807    2379183       Function call interrupts
IPI2:         0          0          0          0       CPU stop interrupts
IPI3:         0          0          0          0       CPU stop (for crash dump) interrupts
IPI4:         0          0          0          0       Timer broadcast interrupts
IPI5:      1432       1087       1166       1067       IRQ work interrupts
IPI6:         0          0          0          0       CPU wake-up interrupts
Err:          0

Can't help you with the memory issues, but you can solve your IRQ balancing issues with this script:
(credits to @fif, the actual author here)

/etc/config/init.d/irqbalance-manual

#!/bin/sh /etc/rc.common

# Version 2023-05-29

START=13
USE_PROCD=1
AFFINITY_MIN=2
AFFINITY_MAX=8
AFFINITY_ALL="$(printf %x $(( AFFINITY_MAX * 2 - 1 )))"

set_affinities() {
        local callback="$1" irq desc ret=0
        sed -nre 's!^[[:space:]]*([0-9]+):[[:space:]]+.*[[:space:]]GIC-0[[:space:]]+[0-9]+[[:space:]]+(Level|Edge)[[:space:]]+(.+)$!\1 \3! p' /proc/interrupts | \
        while read irq desc
        do
                case "$desc" in
                        arch*)  ;; # Properly balanced
                        ce*)    ;; # Wifi firmware crashes
                        edma*)  ;; # Hangs wifi on high throughput https://forum.openwrt.org/t/dynalink-dl-wrx36-askey-rt5010w-ipq8072a-technical-discussion/110454/1743
                        xhci*)  ;; # Crashes with USB drive https://forum.openwrt.org/t/dynalink-dl-wrx36-askey-rt5010w-ipq8072a-technical-discussion/110454/1736
                        *)      "$callback" "/proc/irq/$irq/smp_affinity" || ret=1 ;;
                esac
        done
        return $ret
}

set_affinity_per_cpu() {
        local procfile="$1" ret=0
        echo "$AFFINITY" > "$procfile" || ret=$?
        if [ $AFFINITY -ge $AFFINITY_MAX ]
        then
                AFFINITY=$AFFINITY_MIN
        else
                AFFINITY=$(( AFFINITY * 2 ))
        fi
        return $ret
}

set_affinity_shared() {
        local procfile="$1" ret=0
        echo "$AFFINITY_ALL" > "$procfile" || ret=$?
        return $ret
}

start_service() {
        reload_service
}

reload_service() {
        AFFINITY=$AFFINITY_MIN
        set_affinities set_affinity_per_cpu
}

stop_service() {
        set_affinities set_affinity_shared
}

Then, for the final touches, add the following in System > Startup > Local Startup (i.e., /etc/rc.local):

if [ ! -f /etc/init.d/irqbalance-manual ]; then
    ln -s /etc/config/init.d/irqbalance-manual /etc/init.d/irqbalance-manual
    service irqbalance-manual start
    service irqbalance-manual enable
fi

Thanks!
I actually don’t manually adjust… I run irqbalance.
It was just something I noticed poking around based on things I remembered from dev cycle…

Irqbalance does not work as expected on IPQ807x, because it doesn't recognize all the interrupts.

See this post for more information.

Thank you @Spacebar and @vit0r
I was not aware of the irqbalance issue with ipq807. I checked and indeed CPU0 was taking all of the load.
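For anyone wanting to check the same thing, here is a minimal sketch that sums the per-CPU columns of /proc/interrupts. The function name irq_totals_per_cpu is made up for illustration, and it only counts the numbered IRQ lines (the IPI and Err rows are skipped), so treat the result as a rough indicator rather than an exact load figure.

```shell
#!/bin/sh

# Hypothetical helper: print the total interrupt count handled by each CPU.
# $1 = interrupts file (defaults to /proc/interrupts)
irq_totals_per_cpu() {
    awk '
        NR == 1 { ncpu = NF; next }      # header row: CPU0 CPU1 ...
        $1 ~ /^[0-9]+:$/ {               # numbered IRQ lines only
            for (i = 1; i <= ncpu; i++) total[i] += $(i + 1)
        }
        END { for (i = 1; i <= ncpu; i++) printf "CPU%d %d\n", i - 1, total[i] }
    ' "${1:-/proc/interrupts}"
}
```

Running irq_totals_per_cpu with no argument reads the live /proc/interrupts. If one core's total dwarfs the others, the interrupts are probably not being spread.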

With that, I have implemented a script based on bitthief's init.d script above. The load is now spread across the CPUs. I updated to 9536446965 and, interestingly, the memory has been stable, even after a few fast.com and bufferbloat tests.

I'll keep an eye on it and post back.

Have to say that it has not looked this good for this many minutes with load in quite a while.

Ok, so this is promising. But strange.

After ~6 hours of uptime, the memory was rock solid. Dips under load, but recovered. All 4 CPUs bearing equal load. I passed a good 5GB WAN->Wired LAN and it did not OOM -- It stayed up around 90MB free.

I kept the download going (I think it was around 15GB in total, WAN->Wired LAN), and the router did freeze (both LEDs solid blue). Pulled the plug and restarted.

@hnyman As your name is all over the interrupt ticket :slight_smile: (and you spoke about freezes), in your experience did any of the interrupt assignments cause more consistent freezes?

@robimarko Want to pick your brain... Does this result (the not running out of memory with turning off irqbalance and assigning manually) make sense to you at all?

Below is the script I modified based on bitthief's repo and run once at startup:

#!/bin/sh

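set_affinity() {
    # Look up the IRQ number by interrupt name ($1), then write the CPU mask ($2).
    # The name is passed with -v so the shell does not expand awk's own $1.
    irq=$(awk -v name="$1" 'index($0, name) { print substr($1, 1, length($1)-1); exit }' /proc/interrupts)
    [ -n "$irq" ] && echo "$2" > "/proc/irq/$irq/smp_affinity"
    logger -t /tmp/set_smp_affinity.sh "Setting Affinity: $1 to $2"
}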

#assign 4 rx interrupts to each core
set_affinity 'reo2host-destination-ring1' 1
set_affinity 'reo2host-destination-ring2' 2
set_affinity 'reo2host-destination-ring3' 4
set_affinity 'reo2host-destination-ring4' 8

#assign 3 tcl completions to last 3 CPUs
set_affinity 'wbm2host-tx-completions-ring1' 2
set_affinity 'wbm2host-tx-completions-ring2' 4
set_affinity 'wbm2host-tx-completions-ring3' 8

#assign lan/wan to CPU 4
set_affinity 'edma_txcmpl' 8
set_affinity 'edma_rxfill' 8
set_affinity 'edma_rxdesc' 8
set_affinity 'edma_misc' 8

exit 0
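A note on the numbers used above: the value written to /proc/irq/N/smp_affinity is a hexadecimal CPU bitmask, where bit 0 is CPU0, bit 1 is CPU1, and so on. That is why 1, 2, 4 and 8 select the four cores one at a time. A small sketch of that mapping follows; the helper name cpu_mask is hypothetical and not part of any OpenWrt package.

```shell
#!/bin/sh

# Hypothetical helper: print the hex smp_affinity mask for a single CPU.
# $1 = zero-based CPU index as shown in /proc/interrupts (CPU0, CPU1, ...)
cpu_mask() {
    printf '%x\n' "$((1 << $1))"
}
```

For example, cpu_mask 3 prints 8, so the "CPU 4" comment in the script refers to the fourth core, which /proc/interrupts labels CPU3.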

Ugh, honestly that is rather weird.
IRQ numbers are not stable and you cannot rely on them but have to parse the IRQ by name.

Agreed, and based on that I ran some tests on version 9536446965.
It looks like regardless of irqbalance enabled/not, manual irq assignment/not, I cannot reproduce the memory exhaustion.

Looking at commits, I can't see anything relevant at all. I don't know what to say.

I'll keep an eye on things for a while, then close the ticket and note here if anything changes.

Having read some of the other posts, I have also updated my script to put all edma on the 4th core. (Updated above as well).

With IRQ assignment (no irqbalance):

Uptime 17h 34m 56s
Total Available 111/406
Used 287/406
Cached 27/406

12 Wifi Clients attached (7 to IPQ8074 5GHz, 5 to IPQ8074 2GHz)

WAN:
RX: 113.82 GB (83826117 Pkts.)
TX: 3.60 GB (35512879 Pkts.)

Reboot. NO IRQ assignment (no irqbalance):
No significant memory drop

Reboot, With IRQ Balance enabled.
No significant memory drop

I'm experiencing this problem too, Xiaomi AX3600: ath11k firmware crash - qcom-q6v5-wcss-pil cd00000.q6v5_wcss: fatal error received: - #20 by Catfriend1

I would suggest trying the latest snapshot because it seems to have magically fixed it for me. Crossing fingers.

If running irqbalance, try turning it off or uninstalling it.

Then try the interrupt script I posted and check.

If that doesn't help either, I suggest adding your comment to the other thread (and the git ticket) so that the information is in one place.

Do you mean the recent R23911?

WLAN.HK.2.9.0.1-01890-QCAHKSWPL_SILICONZ-1

The OpenWRT build I’ve got on right now is:

(r24096-9536446965)

So basically grab the latest snapshot.
I don’t touch the wifi firmware, so it's whatever currently comes with the build. I’m away so can’t check…

I also didn't touch the WiFi fw as it's included.

Hmm still a problem :frowning:

:frowning:
Did you try disabling irqbalance (if you have it on) and running the script here?

[EDIT] Please make sure you have awk installed for the script to work.
If you don't, you can manually run at startup:

echo 1 > /proc/irq/66/smp_affinity
echo 2 > /proc/irq/67/smp_affinity
echo 4 > /proc/irq/68/smp_affinity
echo 8 > /proc/irq/69/smp_affinity

echo 2 > /proc/irq/48/smp_affinity
echo 4 > /proc/irq/53/smp_affinity
echo 8 > /proc/irq/56/smp_affinity

echo 8 > /proc/irq/32/smp_affinity
echo 8 > /proc/irq/33/smp_affinity
echo 8 > /proc/irq/35/smp_affinity
echo 8 > /proc/irq/36/smp_affinity

I do strongly recommend getting awk etc and making the script run successfully because the irqs can and do change...
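If awk really is not available, a name-based lookup can probably still be done with sed alone. The sketch below is an assumption-laden example, not something tested on every build: irq_by_name is a made-up helper name, and it assumes the interrupt name is the last field on its line in /proc/interrupts, as in the listing earlier in the thread.

```shell
#!/bin/sh

# Hypothetical helper: print the IRQ number whose line ends with the given name.
# $1 = interrupt name, $2 = interrupts file (defaults to /proc/interrupts)
irq_by_name() {
    sed -n "s/^ *\([0-9][0-9]*\):.*[[:space:]]${1}[[:space:]]*\$/\1/p" \
        "${2:-/proc/interrupts}" | head -n 1
}
```

It could then replace a hardcoded number, for example: irq=$(irq_by_name edma_txcmpl); [ -n "$irq" ] && echo 8 > "/proc/irq/$irq/smp_affinity"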

Just to understand it better, can this save memory?
Thanks for the script.

Try anyways and let’s see ..

Hi!

It seems to me that the problem is caused by the mechanism that collects wifi status information.
What I mean is that when you open a tab that displays wifi information (the Status or Wireless tab), it starts processes that ultimately lead to memory leaks.

How to check.

Reboot the router (preferably by cutting the power) and do not open LuCI at all.
Monitor the amount of free memory from the console using the “free” or “htop” command (whichever you prefer).

If you want to provoke the memory leak and speed it up, open LuCI's "Wireless" tab and, in another tab, run "Channel Analysis".
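To make the comparison less dependent on watching the console, the available memory could be logged at intervals instead. This is only a sketch: the function names mem_available_mb and log_mem and the "memwatch" log tag are invented for this example, and it assumes the kernel exposes a MemAvailable line in /proc/meminfo.

```shell
#!/bin/sh

# Hypothetical helper: print MemAvailable in whole MB.
# $1 = meminfo file (defaults to /proc/meminfo)
mem_available_mb() {
    awk '/^MemAvailable:/ { print int($2 / 1024); exit }' "${1:-/proc/meminfo}"
}

# Hypothetical helper: write a sample to the system log a fixed number of times.
# $1 = number of samples, $2 = seconds between samples
log_mem() {
    n=$1
    while [ "$n" -gt 0 ]; do
        logger -t memwatch "MemAvailable: $(mem_available_mb) MB"
        n=$((n - 1))
        if [ "$n" -gt 0 ]; then
            sleep "$2"
        fi
    done
}
```

For example, log_mem 60 60 records one sample a minute for an hour, and the entries can then be read back with logread -e memwatch. Running it once without opening LuCI and once with the Wireless tab open should show whether the drop only happens in the second case.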

My current configuration Redmi AX6, openwrt 23.05
But the problem is also reproduced on snapshot.

Additional packages - Stubby, HTTPS-DNS, SQM and many others.

The main thing is not to open the previously mentioned tabs.

Even earlier, I drew robimarko’s attention to the way the radio module's temperature rises right after performing a “Scan”.