[PSA] "master" builds and complex wireless seem to cause memory exhaustion

:mega: This is a "public service announcement" not a demand for resolution.

On an Archer C7, an ar71xx, 16/128 device, recent builds off "master" fail to run reliably due to memory exhaustion. External symptoms include "slow" response to ssh, APs not coming up, and the like. logread can show if you're impacted (assuming that the logger wasn't what was killed)

Those building off "master", which is not considered stable, production-ready code, might want to check their logs for evidence of this issue.

One example of what I've been seeing (and the same build off "lede-17.01" is stable) is in the expandable area below. It typically happens in the first minute after boot. Memory consumption under 17.01 is typically around 55-65 MB of 122 MB available.

View the log details

Thu Mar 15 13:46:02 2018 daemon.notice netifd: Network device 'wlan1-2' link is up
Thu Mar 15 13:46:02 2018 daemon.notice netifd: Network device 'wlan1-3' link is up
Thu Mar 15 13:46:02 2018 daemon.notice netifd: Network device 'wlan1-4' link is up
Thu Mar 15 13:46:02 2018 daemon.notice netifd: Network device 'wlan1-5' link is up
Thu Mar 15 13:46:02 2018 kern.info kernel: [   38.449126] br-VLAN_84: port 4(gre4t-gt97.84) entered forwarding state
Thu Mar 15 13:46:02 2018 kern.info kernel: [   38.455770] br-VLAN_84: topology change detected, propagating
Thu Mar 15 13:46:02 2018 kern.info kernel: [   38.848866] br-VLAN_1010: port 5(wlan1-1) entered forwarding state
Thu Mar 15 13:46:02 2018 kern.info kernel: [   38.855150] br-VLAN_1010: topology change detected, propagating
Thu Mar 15 13:46:02 2018 kern.info kernel: [   39.088714] br-VLAN_84: port 5(wlan1-4) entered learning state
Thu Mar 15 13:46:03 2018 kern.info kernel: [   39.248605] br-VLAN_1010: port 8(wlan1-5) entered learning state
Thu Mar 15 13:46:03 2018 kern.info kernel: [   39.968149] br-VLAN_1010: port 6(wlan1-2) entered forwarding state
Thu Mar 15 13:46:03 2018 kern.info kernel: [   39.974430] br-VLAN_1010: topology change detected, propagating
Thu Mar 15 13:46:03 2018 kern.info kernel: [   39.980506] br-VLAN_1010: port 7(wlan1-3) entered forwarding state
Thu Mar 15 13:46:03 2018 kern.info kernel: [   39.986776] br-VLAN_1010: topology change detected, propagating
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.660466] netifd invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), nodemask=0, order=0, oom_score_adj=0
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.672230] CPU: 0 PID: 942 Comm: netifd Not tainted 4.9.86 #0
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.678161] Stack : 804b765a 00000032 00000000 00000001 87cc8cd4 80443247 803f8508 000003ae
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.686700]         804b37a0 ffffffff 00000001 00200000 00000001 800a90b0 00000000 00000004
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.695244]         00000006 80455234 803fbef8 87159ba4 00000000 800d6d8c 804b765a 00000074
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.703799]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.712343]         00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.720887]         ...
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.723379] Call Trace:
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.725868] [<8006b0f0>] show_stack+0x70/0x8c
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.730319] [<80112490>] dump_header.isra.5+0x80/0x174
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.735543] [<800dc854>] oom_kill_process+0xac/0x404
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.740602] [<800dd0f4>] out_of_memory+0x3c4/0x3e8
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.745472] [<800e0698>] __alloc_pages_nodemask+0x92c/0x9a8
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.751145] [<800db450>] filemap_fault+0x404/0x5cc
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.756013] [<800f8eec>] __do_fault+0x68/0x108
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.760540] [<800fc454>] handle_mm_fault+0x448/0xb2c
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.765589] [<800713fc>] __do_page_fault+0x23c/0x468
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.770648] [<80065f40>] ret_from_exception+0x0/0x10
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.775685] Mem-Info:
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.778015] active_anon:296 inactive_anon:3 isolated_anon:0
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.778015]  active_file:18 inactive_file:24 isolated_file:0
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.778015]  unevictable:0 dirty:0 writeback:0 unstable:0
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.778015]  slab_reclaimable:181 slab_unreclaimable:1551
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.778015]  mapped:19 shmem:13 pagetables:40 bounce:0
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.778015]  free:3767 free_pcp:41 free_cma:0
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.809933] Node 0 active_anon:1184kB inactive_anon:12kB active_file:72kB inactive_file:96kB unevictable:0kB isolated(anon):0kB isolated(file):0kB mapped:76kB dirty:0kB writeback:0kB shmem:52kB writeback_tmp:0kB unstable:0kB pages_scanned:0 all_unreclaimable? yes
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.833578] Normal free:15068kB min:16384kB low:20480kB high:24576kB active_anon:1184kB inactive_anon:12kB active_file:72kB inactive_file:96kB unevictable:0kB writepending:0kB present:131072kB managed:125164kB mlocked:0kB slab_reclaimable:724kB slab_unreclaimable:6204kB kernel_stack:328kB pagetables:160kB bounce:0kB free_pcp:164kB local_pcp:164kB free_cma:0kB
Thu Mar 15 13:46:05 2018 kern.emerg kernel: lowmem_reserve[]: 0 0
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.869213] Normal: 387*4kB (U) 254*8kB (U) 144*16kB (UM) 53*32kB (U) 13*64kB (U) 12*128kB (U) 4*256kB (U) 6*512kB (U) 1*1024kB (U) 0*2048kB 0*4096kB = 15068kB
Thu Mar 15 13:46:05 2018 kern.emerg kernel: 55 total pagecache pages
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.886148] 0 pages in swap cache
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.889520] Swap cache stats: add 0, delete 0, find 0/0
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.894813] Free swap  = 0kB
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.897743] Total swap = 0kB
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.900665] 32768 pages RAM
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.903497] 0 pages HighMem/MovableOnly
Thu Mar 15 13:46:05 2018 kern.warn kernel: [   40.907378] 1477 pages reserved
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.910566] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.919235] [  505]     0   505      298       19       4       0        0             0 ubusd
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.927982] [  506]     0   506      225       12       4       0        0             0 askfirst
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.936982] [  868]     0   868      306       32       4       0        0             0 logd
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.945643] [  885]     0   885      381       27       3       0        0             0 rpcd
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.954302] [  942]     0   942      435       60       4       0        0             0 netifd
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.963137] [ 1012]     0  1012      266       13       4       0        0             0 dropbear
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.972148] [ 1405]     0  1405      299       15       4       0        0             0 ntpd
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.980806] [ 1493]     0  1493      883       58       5       0        0             0 uhttpd
Thu Mar 15 13:46:05 2018 kern.info kernel: [   40.989643] [ 2593]     0  2593      938       61       5       0        0             0 hostapd
Thu Mar 15 13:46:05 2018 kern.err kernel: [   40.998561] Out of memory: Kill process 2593 (hostapd) score 2 or sacrifice child
Thu Mar 15 13:46:05 2018 kern.err kernel: [   41.006155] Killed process 2593 (hostapd) total-vm:3752kB, anon-rss:236kB, file-rss:8kB, shmem-rss:0kB
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.017325] br-VLAN_1010: port 8(wlan1-5) entered disabled state
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.039605] device wlan1-5 left promiscuous mode
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.044299] br-VLAN_1010: port 8(wlan1-5) entered disabled state
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.059688] br-VLAN_84: port 5(wlan1-4) entered disabled state
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.079635] device wlan1-4 left promiscuous mode
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.084330] br-VLAN_84: port 5(wlan1-4) entered disabled state
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.099536] br-VLAN_1010: port 7(wlan1-3) entered disabled state
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.117789] device wlan1-3 left promiscuous mode
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.122487] br-VLAN_1010: port 7(wlan1-3) entered disabled state
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.140563] br-VLAN_1010: port 6(wlan1-2) entered disabled state
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.158408] device wlan1-2 left promiscuous mode
Thu Mar 15 13:46:05 2018 kern.info kernel: [   41.163108] br-VLAN_1010: port 6(wlan1-2) entered disabled state
Thu Mar 15 13:46:05 2018 daemon.notice netifd: Network device 'wlan1-5' link is down
Thu Mar 15 13:46:08 2018 daemon.notice netifd: radio0 (2690): Segmentation fault
Thu Mar 15 13:46:08 2018 daemon.notice netifd: radio0 (2690): ./mac80211.sh: eval: line 1: can't fork
Thu Mar 15 13:46:08 2018 kern.info kernel: [   44.780790] device wlan1-1 left promiscuous mode
Thu Mar 15 13:46:08 2018 kern.info kernel: [   44.785583] br-VLAN_1010: port 5(wlan1-1) entered disabled state
Thu Mar 15 13:46:41 2018 kern.notice kernel: [   77.357090] random: crng init done
Thu Mar 15 13:57:33 2018 authpriv.info dropbear[2721]: Child connection from <redacted>
Thu Mar 15 13:57:40 2018 authpriv.notice dropbear[2721]: Password auth succeeded for 'root' from <redacted>

Must be target specific, mvebu/rango:

 OpenWrt SNAPSHOT, r6468-35e01cf
 -----------------------------------------------------
root@bsaedgy:~# uptime
 17:24:09 up  1:31,  load average: 0.00, 0.00, 0.00
root@bsaedgy:~# uname -a
Linux bsaedgy 4.14.27 #0 SMP Thu Mar 15 20:00:39 2018 armv7l GNU/Linux

I am not experiencing any such things, so I wouldn't be so quick to assume it's a general thing (even for your architecture). Ramips/mt7261:

OpenWrt SNAPSHOT, r6467+2-d0988235dd
 -----------------------------------------------------
root@lede:~# uname -a
Linux lede 4.14.27 #0 SMP Thu Mar 15 20:28:52 2018 mips GNU/Linux
root@lede:~# uptime 
 19:35:15 up 21:50,  load average: 0.00, 0.00, 0.00

Dmesg is squeaky clean here. Have a few ar71xx devices running here and there, haven't heard any complaints (there's a bit of a lag there sometimes :smile:)

May be config, as I run several VLANs. Looking into it now, going to do a git bisect once I get going. Even with a relatively recent i3-7100T CPU @ 3.40GHz and ccache, clean builds take a while...

I have a few Ubiquiti UniFi AP AC Pros and just checked one, nothing similar. Have it running with one extra VLAN.

Seems hostapd is getting killed on your device?

Whatever the low-memory "panic" decides -- it's rough when it picks the logger or uci. Thankfully it hasn't picked dropbear...

Fresh "master" build, seems OK until I add an AP to the mesh on the 5 GHz radio. At the last boot with the AP included:

root@office:/etc# logread | fgrep 'Kill process'
Fri Mar 16 12:35:32 2018 kern.err kernel: [   60.603550] Out of memory: Kill process 2671 (hostapd) score 7 or sacrifice child
Fri Mar 16 12:35:32 2018 kern.err kernel: [   61.996528] Out of memory: Kill process 1545 (uhttpd) score 2 or sacrifice child
Fri Mar 16 12:35:32 2018 kern.err kernel: [   62.938938] Out of memory: Kill process 2369 (hostapd) score 2 or sacrifice child
Fri Mar 16 12:35:32 2018 kern.err kernel: [   63.729451] Out of memory: Kill process 966 (netifd) score 1 or sacrifice child
Fri Mar 16 12:35:32 2018 kern.err kernel: [   64.500244] Out of memory: Kill process 966 (netifd) score 1 or sacrifice child
Fri Mar 16 12:35:32 2018 kern.err kernel: [   65.231871] Out of memory: Kill process 892 (logd) score 1 or sacrifice child

(The first minute of logged data doesn't appear in logread)

Edit: Changing to the AP-included configuration after a successful boot in the simpler configuration causes failure. With logread -f

root@office:/etc/config# reload_config 
root@office:/etc/config# Fri Mar 16 13:07:06 2018 daemon.notice netifd: Network device 'wlan0' link is down
Fri Mar 16 13:07:06 2018 kern.info kernel: [  293.701304] br-mesh: port 1(wlan0) entered disabled state
Fri Mar 16 13:07:06 2018 daemon.notice netifd: bridge 'br-mesh' link is down
Fri Mar 16 13:07:06 2018 daemon.notice netifd: Interface 'mesh' has link connectivity loss
[...]
Fri Mar 16 13:07:21 2018 kern.info kernel: [  308.369236] br-mesh: port 1(wlan0) entered learning state
Failed to find log object: Not found
Failed to find log object: Not found
Shared connection to <host> closed.

Device comes back on line a couple minutes later, logs start at [324.695090]

root@office:/etc/config# logread | fgrep 'Kill process'
Fri Mar 16 13:07:44 2018 kern.err kernel: [  324.943139] Out of memory: Kill process 2668 (dropbear) score 1 or sacrifice child
Fri Mar 16 13:07:44 2018 kern.err kernel: [  326.084375] Out of memory: Kill process 2668 (dropbear) score 1 or sacrifice child
Fri Mar 16 13:07:44 2018 kern.err kernel: [  327.111300] Out of memory: Kill process 910 (rpcd) score 0 or sacrifice child
Fri Mar 16 13:07:44 2018 kern.err kernel: [  327.821928] Out of memory: Kill process 2726 (logread) score 0 or sacrifice child
Fri Mar 16 13:07:44 2018 kern.err kernel: [  328.537878] Out of memory: Kill process 506 (ubusd) score 0 or sacrifice child
Fri Mar 16 13:07:44 2018 kern.err kernel: [  329.424216] Out of memory: Kill process 1420 (ntpd) score 0 or sacrifice child```

It's "stable" once it's up and running

root@office:~# uptime
 13:52:54 up 50 min,  load average: 0.00, 0.01, 0.04

Edit: Setting kernel CONFIG_COMPACTION does not help

diff --git a/target/linux/ar71xx/config-4.9 b/target/linux/ar71xx/config-4.9
index 026d5bb..af40ae5 100644
--- a/target/linux/ar71xx/config-4.9
+++ b/target/linux/ar71xx/config-4.9
@@ -278,6 +278,7 @@ CONFIG_CMDLINE="rootfstype=squashfs noinitrd"
 CONFIG_CMDLINE_BOOL=y
 # CONFIG_CMDLINE_OVERRIDE is not set
 CONFIG_COMMON_CLK=y
+CONFIG_COMPACTION=y
 CONFIG_CPU_BIG_ENDIAN=y
 CONFIG_CPU_GENERIC_DUMP_TLB=y
 CONFIG_CPU_HAS_PREFETCH=y
@@ -380,6 +381,7 @@ CONFIG_MDIO_BITBANG=y
 CONFIG_MDIO_BOARDINFO=y
 CONFIG_MDIO_GPIO=y
 CONFIG_MICREL_PHY=y
+CONFIG_MIGRATION=y
 CONFIG_MIPS=y
 CONFIG_MIPS_ASID_BITS=8
 CONFIG_MIPS_ASID_SHIFT=0

I believe I've got the same issue, but on OpenWRT 18.06-rc1 and a Linksys Audi (EA3500).

After a few hours of usage, oom-killer starts rampaging, taking down the network connection.

NAME="OpenWrt"
VERSION="18.06.0-rc1"
ID="openwrt"
ID_LIKE="lede openwrt"
PRETTY_NAME="OpenWrt 18.06.0-rc1"
VERSION_ID="18.06.0-rc1"
HOME_URL="http://lede-project.org/"
BUG_URL="http://bugs.lede-project.org/"
SUPPORT_URL="http://forum.openwrt.org/"
BUILD_ID="r7090-d2aa3a1b62"
LEDE_BOARD="kirkwood/generic"
LEDE_ARCH="arm_xscale"
LEDE_TAINTS=""
LEDE_DEVICE_MANUFACTURER="OpenWrt"
LEDE_DEVICE_MANUFACTURER_URL="http://lede-project.org/"
LEDE_DEVICE_PRODUCT="Generic"
LEDE_DEVICE_REVISION="v0"
LEDE_RELEASE="OpenWrt 18.06.0-rc1 r7090-d2aa3a1b62"

I'm running with both the 2.4 GHz and 5 GHz radios on. (I didn't get this issue until I enabled wireless.)

I'll try turning off the 5 GHz one and report back.