MT7530 switch init fails under memory pressure

Hi!
I recently bought a TP-Link Archer C6U router. It is MT7621-based with an MT7530 switch.
I manage this device remotely. Once I ran /etc/init.d/network restart and lost access from the internet.
I started investigating as soon as I had local access to the router.
It was also inaccessible from the LAN. Only Wi-Fi worked.
After some testing I found the conditions under which it happens and how to reproduce it.
It reproduces reliably under high memory usage, but not only then. It can also happen after the memory has been released, again and again on each network restart.
Sometimes only some of the ports die, sometimes all of them.
This is a very painful bug because it means losing access to the device.

OpenWrt 21.02 release, self-built version

swapon
/dev/zram0 partition   60M   6M  101
/dev/sda3  partition 56.5M 104K   -2

free -h
              total        used        free      shared  buff/cache   available
Mem:          118Mi        46Mi        28Mi       1.0Mi        43Mi        57Mi
Swap:         116Mi       6.0Mi       110Mi

screen nice -n 10 stress --vm 1 --vm-bytes 71000000

/etc/init.d/network restart

ALL PORTS DIED HERE

[10389.945893] netifd: page allocation failure: order:8, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
[10389.958689] CPU: 1 PID: 20444 Comm: netifd Not tainted 5.4.143 #0
[10389.964763] Stack : 00000008 80082090 00000000 00000000 80730000 80738c9c 80737960 86815b7c
[10389.973104]         808f0000 80784da3 806b7a74 806b7a74 00000001 00000001 86815b20 00000007
[10389.981438]         00000000 00000000 80930000 00000000 30232033 000018a1 2e352064 34312e34
[10389.989771]         00000000 00000204 00000000 000ea0e1 80000000 807a0000 00000000 00040dc0
[10389.998104]         807a0000 00000201 00000240 00040dc0 00000000 80381e08 00000004 808f0004
[10390.006439]         ...
[10390.008879] Call Trace:
[10390.011344] [<8000b68c>] show_stack+0x30/0x100
[10390.015800] [<805f1254>] dump_stack+0xa4/0xdc
[10390.020167] [<80170c90>] warn_alloc+0xc0/0x138
[10390.024602] [<80171af4>] __alloc_pages_nodemask+0xdec/0xeb8
[10390.030161] [<8014bfb8>] kmalloc_order+0x2c/0x70
[10390.034778] [<8040807c>] mtk_open+0x158/0x804
[10390.039127] [<8045d5e4>] __dev_open+0xf4/0x188
[10390.043559] [<8045da44>] __dev_change_flags+0x18c/0x1e4
[10390.048768] [<8045dac4>] dev_change_flags+0x28/0x70
[10390.053637] [<8048a364>] dev_ifsioc+0x2ac/0x34c
[10390.058155] [<8048a5f0>] dev_ioctl+0xd4/0x3f8
[10390.062510] [<804304ec>] sock_ioctl+0x354/0x4bc
[10390.067040] [<801adbb4>] do_vfs_ioctl+0xb8/0x7c0
[10390.071645] [<801ae30c>] ksys_ioctl+0x50/0xb4
[10390.076000] [<80014598>] syscall_common+0x34/0x58
[10390.080969] Mem-Info:
[10390.083314] active_anon:6858 inactive_anon:6892 isolated_anon:32
[10390.083314]  active_file:733 inactive_file:741 isolated_file:1
[10390.083314]  unevictable:2 dirty:0 writeback:0 unstable:0
[10390.083314]  slab_reclaimable:921 slab_unreclaimable:5251
[10390.083314]  mapped:1082 shmem:0 pagetables:215 bounce:0
[10390.083314]  free:3817 free_pcp:32 free_cma:0
[10390.115576] Node 0 active_anon:27432kB inactive_anon:27848kB active_file:2988kB inactive_file:3468kB unevictable:8kB isolated(anon):128kB isolated(file):4kB mapped:4776kB dirty:0kB writeback:0kB shmem:0kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[10390.138417] Normal free:13876kB min:13312kB low:14336kB high:15360kB active_anon:27432kB inactive_anon:27760kB active_file:2808kB inactive_file:3508kB unevictable:8kB writepending:0kB present:131072kB managed:121444kB mlocked:8kB kernel_stack:1064kB pagetables:860kB bounce:0kB free_pcp:76kB local_pcp:0kB free_cma:0kB
[10390.166446] lowmem_reserve[]: 0 0 0
[10390.170027] Normal: 179*4kB (UMEH) 432*8kB (UMEH) 295*16kB (UMEH) 82*32kB (UMEH) 32*64kB (UEH) 1*128kB (E) 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 13692kB
[10390.184391] 1886 total pagecache pages
[10390.188228] 8 pages in swap cache
[10390.191621] Swap cache stats: add 401854, delete 401849, find 843/245547
[10390.198360] Free swap  = 92764kB
[10390.201609] Total swap = 119280kB
[10390.204977] 32768 pages RAM
[10390.207817] 0 pages HighMem/MovableOnly
[10390.211689] 2407 pages reserved
[10390.223045] mt7530 mdio-bus:1f lan1: failed to open master eth0
[10390.234450] br-lan: port 1(lan1) entered blocking state
[10390.239720] br-lan: port 1(lan1) entered disabled state
[10390.245563] device lan1 entered promiscuous mode
[10390.323712] mt7530 mdio-bus:1f lan2: failed to open master eth0
[10390.342913] br-lan: port 2(lan2) entered blocking state
[10390.348196] br-lan: port 2(lan2) entered disabled state
[10390.354181] device lan2 entered promiscuous mode
[10390.369054] mt7530 mdio-bus:1f lan3: failed to open master eth0
[10390.375953] br-lan: port 3(lan3) entered blocking state
[10390.381289] br-lan: port 3(lan3) entered disabled state
[10390.387290] device lan3 entered promiscuous mode
[10390.403983] mt7530 mdio-bus:1f lan4: failed to open master eth0
[10390.410879] br-lan: port 4(lan4) entered blocking state
[10390.416235] br-lan: port 4(lan4) entered disabled state
[10390.422333] device lan4 entered promiscuous mode
[10390.460254] mt7530 mdio-bus:1f wan: failed to open master eth0

Network restart after the memory was released. ONLY SOME PORTS DIED

[11055.700602] netifd: page allocation failure: order:8, mode:0x40dc0(GFP_KERNEL|__GFP_COMP|__GFP_ZERO), nodemask=(null),cpuset=/,mems_allowed=0
[11055.717937] CPU: 2 PID: 18423 Comm: netifd Not tainted 5.4.143 #0
[11055.724034] Stack : 00000008 80082090 00000000 00000000 80730000 80738c9c 80737960 850b3b7c
[11055.732393]         808f0000 80784da3 806b7a74 806b7a74 00000002 00000001 850b3b20 00000007
[11055.740725]         00000000 00000000 80930000 00000000 30232033 00001a90 2e352064 34312e34
[11055.749056]         00000000 00000226 00000000 000af471 80000000 807a0000 00000000 00040dc0
[11055.757388]         807a0000 00000201 00000240 00040dc0 00000000 80381e08 00000008 808f0008
[11055.765721]         ...
[11055.768159] Call Trace:
[11055.770624] [<8000b68c>] show_stack+0x30/0x100
[11055.775079] [<805f1254>] dump_stack+0xa4/0xdc
[11055.779443] [<80170c90>] warn_alloc+0xc0/0x138
[11055.783876] [<80171af4>] __alloc_pages_nodemask+0xdec/0xeb8
[11055.789433] [<8014bfb8>] kmalloc_order+0x2c/0x70
[11055.794048] [<8040807c>] mtk_open+0x158/0x804
[11055.798395] [<8045d5e4>] __dev_open+0xf4/0x188
[11055.802826] [<8045da44>] __dev_change_flags+0x18c/0x1e4
[11055.808033] [<8045dac4>] dev_change_flags+0x28/0x70
[11055.812900] [<8048a364>] dev_ifsioc+0x2ac/0x34c
[11055.817417] [<8048a5f0>] dev_ioctl+0xd4/0x3f8
[11055.821770] [<804304ec>] sock_ioctl+0x354/0x4bc
[11055.826300] [<801adbb4>] do_vfs_ioctl+0xb8/0x7c0
[11055.830904] [<801ae30c>] ksys_ioctl+0x50/0xb4
[11055.835256] [<80014598>] syscall_common+0x34/0x58
[11055.840214] Mem-Info:
[11055.842619] active_anon:963 inactive_anon:1427 isolated_anon:0
[11055.842619]  active_file:5477 inactive_file:2508 isolated_file:32
[11055.842619]  unevictable:2 dirty:0 writeback:0 unstable:0
[11055.842619]  slab_reclaimable:1039 slab_unreclaimable:5204
[11055.842619]  mapped:1396 shmem:830 pagetables:168 bounce:0
[11055.842619]  free:9529 free_pcp:73 free_cma:0
[11055.875224] Node 0 active_anon:3852kB inactive_anon:5708kB active_file:21908kB inactive_file:10032kB unevictable:8kB isolated(anon):0kB isolated(file):128kB mapped:5640kB dirty:0kB writeback:0kB shmem:3320kB writeback_tmp:0kB unstable:0kB all_unreclaimable? no
[11055.898285] Normal free:38116kB min:4096kB low:5120kB high:6144kB active_anon:3600kB inactive_anon:5708kB active_file:21908kB inactive_file:9872kB unevictable:8kB writepending:0kB present:131072kB managed:121444kB mlocked:8kB kernel_stack:1088kB pagetables:672kB bounce:0kB free_pcp:204kB local_pcp:0kB free_cma:0kB
[11055.926090] lowmem_reserve[]: 0 0 0
[11055.929655] Normal: 600*4kB (UMEH) 141*8kB (UMEH) 244*16kB (UMH) 119*32kB (UMEH) 203*64kB (UMEH) 80*128kB (UMEH) 10*256kB (UMH) 2*512kB (UH) 0*1024kB 0*2048kB 0*4096kB = 38056kB
[11055.945552] 8845 total pagecache pages
[11055.949377] 0 pages in swap cache
[11055.952803] Swap cache stats: add 1869830, delete 1869832, find 6125/1461324
[11055.959923] Free swap  = 0kB
[11055.962928] Total swap = 0kB
[11055.965872] 32768 pages RAM
[11055.968720] 0 pages HighMem/MovableOnly
[11055.972599] 2407 pages reserved
[11055.980960] mt7530 mdio-bus:1f lan1: failed to open master eth0
[11055.987977] br-lan: port 1(lan1) entered blocking state
[11055.993306] br-lan: port 1(lan1) entered disabled state
[11055.999333] device lan1 entered promiscuous mode
[11056.025781] mt7530 mdio-bus:1f lan2: failed to open master eth0
[11056.032530] br-lan: port 2(lan2) entered blocking state
[11056.037873] br-lan: port 2(lan2) entered disabled state
[11056.044122] device lan2 entered promiscuous mode
[11056.158492] mtk_soc_eth 1e100000.ethernet eth0: configuring for fixed/rgmii link mode
[11056.166548] device eth0 left promiscuous mode
[11056.171503] mtk_soc_eth 1e100000.ethernet eth0: Link is Up - 1Gbps/Full - flow control rx/tx
[11056.180650] mt7530 mdio-bus:1f lan3: configuring for phy/gmii link mode
[11056.188085] 8021q: adding VLAN 0 to HW filter on device lan3
[11056.197026] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[11056.205512] br-lan: port 3(lan3) entered blocking state
[11056.210834] br-lan: port 3(lan3) entered disabled state
[11056.217906] device lan3 entered promiscuous mode
[11056.222658] device eth0 entered promiscuous mode
[11056.242081] mt7530 mdio-bus:1f lan4: configuring for phy/gmii link mode
[11056.249376] 8021q: adding VLAN 0 to HW filter on device lan4
[11056.258672] br-lan: port 4(lan4) entered blocking state
[11056.264070] br-lan: port 4(lan4) entered disabled state
[11056.270785] device lan4 entered promiscuous mode
[11056.297909] mt7530 mdio-bus:1f wan: configuring for phy/gmii link mode
[11056.305185] 8021q: adding VLAN 0 to HW filter on device wan
[11059.219573] br-lan: port 5(wlan0) entered blocking state
[11059.225075] br-lan: port 5(wlan0) entered disabled state
[11059.231226] device wlan0 entered promiscuous mode
[11059.383786] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[11059.385006] mt7530 mdio-bus:1f wan: Link is Up - 1Gbps/Full - flow control off
[11059.390748] br-lan: port 5(wlan0) entered blocking state
[11059.403045] br-lan: port 5(wlan0) entered forwarding state
[11059.410583] IPv6: ADDRCONF(NETDEV_CHANGE): wan: link becomes ready
[11059.418043] IPv6: ADDRCONF(NETDEV_CHANGE): br-lan: link becomes ready
[11059.499731] br-lan: port 6(wlan1) entered blocking state
[11059.505163] br-lan: port 6(wlan1) entered disabled state
[11059.511270] device wlan1 entered promiscuous mode
[11059.516546] br-lan: port 6(wlan1) entered blocking state
[11059.521937] br-lan: port 6(wlan1) entered forwarding state
[11059.672751] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[11059.698046] br-lan: port 7(wlan1-1) entered blocking state
[11059.703650] br-lan: port 7(wlan1-1) entered disabled state
[11059.710079] device wlan1-1 entered promiscuous mode
[11059.758590] br-lan: port 7(wlan1-1) entered blocking state
[11059.764196] br-lan: port 7(wlan1-1) entered forwarding state
[11059.912745] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1-1: link becomes ready
[11059.947388] br-lan: port 8(wlan0-1) entered blocking state
[11059.952988] br-lan: port 8(wlan0-1) entered disabled state
[11059.959397] device wlan0-1 entered promiscuous mode
[11059.970806] br-lan: port 8(wlan0-1) entered blocking state
[11059.976374] br-lan: port 8(wlan0-1) entered forwarding state

Dead ports show "Link detected: no" in ethtool, and the attached device also reports no link:

        Supported ports: [ TP MII ]
        Supported link modes:   10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: Yes
        Supported FEC modes: Not reported
        Advertised link modes:  10baseT/Half 10baseT/Full 
                                100baseT/Half 100baseT/Full 
                                1000baseT/Full 
        Advertised pause frame use: Symmetric Receive-only
        Advertised auto-negotiation: Yes
        Advertised FEC modes: Not reported
        Link partner advertised link modes:  10baseT/Half 10baseT/Full 
                                             100baseT/Half 100baseT/Full 
                                             1000baseT/Half 1000baseT/Full 
        Link partner advertised pause frame use: Symmetric Receive-only
        Link partner advertised auto-negotiation: Yes
        Link partner advertised FEC modes: Not reported
        Speed: 1000Mb/s
        Duplex: Full
        Port: MII
        PHYAD: 2
        Transceiver: external
        Auto-negotiation: on
        Supports Wake-on: d
        Wake-on: d
        Link detected: no

I traced the kernel execution and found the exact location where the allocation fails:

mtk_eth_soc.c  mtk_open()
err = mtk_start_dma(eth); // err=-ENOMEM
err = mtk_dma_init(eth); // err=-ENOMEM
err = mtk_init_fq_dma(eth); // err=-ENOMEM
eth->scratch_head = kcalloc(cnt, MTK_QDMA_PAGE_SIZE,
				    GFP_KERNEL);
cnt=512, MTK_QDMA_PAGE_SIZE=2048, so this allocates 512 * 2048 bytes = 1 MB
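
The "order:8" in the page allocation failure messages matches this size. A minimal sketch of the arithmetic, assuming 4 kB pages (consistent with "32768 pages RAM" for 131072 kB in the logs); the helper name alloc_order is made up for illustration and is not a kernel function:

```python
PAGE_SIZE = 4096  # assumption: 4 kB pages, as the Mem-Info output suggests


def alloc_order(nbytes, page_size=PAGE_SIZE):
    """Smallest order such that (page_size << order) >= nbytes."""
    pages = -(-nbytes // page_size)  # ceiling division
    return (pages - 1).bit_length()


cnt = 512                  # value reported in the post
MTK_QDMA_PAGE_SIZE = 2048  # bytes per entry, value reported in the post

nbytes = cnt * MTK_QDMA_PAGE_SIZE
order = alloc_order(nbytes)
print(nbytes, "bytes -> order", order,
      "=", (PAGE_SIZE << order) // 1024, "kB physically contiguous")
```

An order-8 request means 256 physically contiguous pages, which is what the allocator could not find.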

Yes, allocating 1 MB in the kernel can be a big deal on a 128 MB system!
There may be enough free RAM in total, but no physically contiguous 1 MB block.
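
The free-list lines in both logs above appear to show exactly this. A small sketch that parses the "Normal: ..." lines copied from the two dumps and reports the largest free block; the function names are hypothetical, and this only looks at the snapshot in the log, not at what reclaim or compaction might have achieved:

```python
import re


def parse_buddy_line(line):
    """Return {block_size_kB: free_count} from a 'Normal: N*SkB ...' line."""
    return {int(size): int(count)
            for count, size in re.findall(r"(\d+)\*(\d+)kB", line)}


def largest_free_block_kb(line):
    """Size in kB of the largest block that has at least one free entry."""
    free = parse_buddy_line(line)
    return max((size for size, count in free.items() if count > 0), default=0)


# Copied from the two kernel logs in this post (flags kept, text unchanged).
under_pressure = ("Normal: 179*4kB (UMEH) 432*8kB (UMEH) 295*16kB (UMEH) "
                  "82*32kB (UMEH) 32*64kB (UEH) 1*128kB (E) 0*256kB 0*512kB "
                  "0*1024kB 0*2048kB 0*4096kB = 13692kB")
after_release = ("Normal: 600*4kB (UMEH) 141*8kB (UMEH) 244*16kB (UMH) "
                 "119*32kB (UMEH) 203*64kB (UMEH) 80*128kB (UMEH) "
                 "10*256kB (UMH) 2*512kB (UH) 0*1024kB 0*2048kB 0*4096kB "
                 "= 38056kB")

NEEDED_KB = 1024  # the order-8 request from mtk_open()

for name, line in (("under pressure", under_pressure),
                   ("after release", after_release)):
    largest = largest_free_block_kb(line)
    print(f"{name}: largest free block {largest} kB, "
          f"enough for {NEEDED_KB} kB: {largest >= NEEDED_KB}")
```

In the first dump the largest free block is 128 kB, in the second 512 kB, even though about 38 MB is free in total there. Neither snapshot contains a free 1024 kB block.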

I tried decreasing MTK_DMA_SIZE:

128 (256 KB alloc) - network works stably, maybe a bit slower than normal
96 (192 KB alloc) - not working at all
64 (128 KB alloc) - network is unstable, with short hangs

The lowest usable value is 128. It significantly decreases the probability of the init error, but it still happens.
I haven't observed init errors with 64, but it is unstable.
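
For reference, a sketch of what each MTK_DMA_SIZE value means for the scratch allocation, assuming 2048-byte entries as in the kcalloc() call, 4 kB pages, and that large kmalloc requests are rounded up to a power-of-two number of pages (the kmalloc_order frame in the trace suggests this path). scratch_alloc is a made-up helper name:

```python
PAGE_SIZE = 4096           # assumption: 4 kB pages
MTK_QDMA_PAGE_SIZE = 2048  # bytes per scratch entry, from the post


def scratch_alloc(dma_size):
    """Return (kB requested, page order) for a given MTK_DMA_SIZE."""
    nbytes = dma_size * MTK_QDMA_PAGE_SIZE
    pages = -(-nbytes // PAGE_SIZE)   # ceiling division
    order = (pages - 1).bit_length()  # round up to a power of two
    return nbytes // 1024, order


for dma_size in (512, 128, 96, 64):
    kb, order = scratch_alloc(dma_size)
    print(f"MTK_DMA_SIZE={dma_size:3d}: {kb:4d} kB requested, order {order}")
```

Going from 512 to 128 lowers the request from order 8 to order 6. The second log above still lists free 256 kB and 512 kB blocks, which fits the observation that smaller values fail less often but can still fail on a fragmented system.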

I've rewritten the code a bit.
The solution is to preallocate the full-size DMA buffer in the module init function, when the number of adapters is known, instead of on every interface initialization. During boot, memory is not yet fragmented. This works fine.
I don't claim to be a developer and won't submit any patches. I'm just suggesting the solution that worked for me.
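
The preallocation idea can be modeled roughly as follows. This is only a userspace illustration of "allocate once at init, reuse on every open"; it is not the modified driver code, and the class and method names are hypothetical:

```python
class ScratchPool:
    """Hypothetical userspace model; not code from mtk_eth_soc.c."""

    def __init__(self, cnt=512, entry_size=2048):
        # One-time allocation, analogous to allocating at module init,
        # when memory is not yet fragmented.
        self._buf = bytearray(cnt * entry_size)
        self.allocations = 1

    def open(self):
        # Analogous to mtk_open(): reuse the existing buffer and zero it
        # instead of calling the allocator again. The temporary bytes object
        # is a Python artifact; kernel code would clear the buffer in place.
        self._buf[:] = bytes(len(self._buf))
        return memoryview(self._buf)

    def stop(self, view):
        # Analogous to stopping the interface: drop the reference only.
        # The buffer is intentionally kept for the next open().
        view.release()


pool = ScratchPool()
for _ in range(3):  # three simulated network restarts
    scratch = pool.open()
    scratch[0] = 0xFF
    pool.stop(scratch)
print("allocations after 3 restarts:", pool.allocations)
```

The trade-off is that the buffer stays reserved even while the interface is down, which on a 128 MB system costs 1 MB permanently.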

Is this something that happens only on this device, or also on other MediaTek-based devices?

On another MediaTek device, I often see the LAN ports going down and up, and I haven't figured out why that happens.

I have only one. It's the first time I've dealt with MediaTek, and I regret buying it. The Wi-Fi driver has painful bugs.
The code involved in the Ethernet part is in the official Linux kernel.
The bug happens only during link init. In OpenWrt that is when the network restarts. If it happens randomly, it's not the same bug.

I guess stability depends on which chipset is used. The mt7915 is very stable. Maybe that's because MediaTek employees contribute heavily to it, as opposed to mt76xx.

I don't think so. I found the exact cause and fixed it for myself. As long as the MediaTek Ethernet driver is used and the system doesn't have much RAM, it is subject to this bug.