RT3352 Ethernet woes after 22.03 release

Recently I dug out the MF283+ and decided to try the 23.05 release, only to be greeted by a boot process stuck at the lzma-loader stage. After fixing that, thanks to @mpratt14, with https://github.com/openwrt/openwrt/pull/14151, the unit booted successfully, but then I hit a kernel panic right after the switch was configured. I built the kernel with debug symbols and reproduced it:

[   79.086872] skbuff: skb_over_panic: text:80281a6c len:7038 put:7038 head:81b7e9e0 data:81b7ea22 tail:0x81b805a0 end:0x81b7f040 dev:eth0
[   79.111165] Kernel bug detected[#1]:
[   79.118253] CPU: 0 PID: 1128 Comm: netifd Not tainted 5.15.141 #0
[   79.130316] $ 0   : 00000000 00000001 0000007b 00000000
[   79.140690] $ 4   : 8052ee30 8052ee30 8059de50 8080bcb0
[   79.151061] $ 8   : 00000001 8080bcc8 00000000 00000001
[   79.161428] $12   : 20323261 80881100 00000002 6c696174
[   79.171797] $16   : 80a1d4e0 817de3c0 80a1d000 00000002
[   79.182168] $20   : a1bac020 00000002 82584c60 00000000
[   79.192537] $24   : 00000000 8022ef64                  
[   79.202904] $28   : 818f6000 8080be50 00001b7e 8029d9d8
[   79.213274] Hi    : 00000084
[   79.218961] Lo    : b61f0000
[   79.224647] epc   : 8029d9d8 skb_panic+0x58/0x5c
[   79.233822] ra    : 8029d9d8 skb_panic+0x58/0x5c
[   79.242969] Status: 1100a403 KERNEL EXL IE 
[   79.251261] Cause : 10800024 (ExcCode 09)
[   79.259184] PrId  : 0001964c (MIPS 24KEc)
[   79.267108] Modules linked in: rt2800soc rt2800mmio rt2800lib pppoe ppp_async option nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet usb_wwan rt2x00soc rt2x00mmio rt2x00lib qmi_wwan pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mac80211 cfg80211 usbserial usbnet slhc nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc_ccitt compat cdc_wdm sha512_generic seqiv jitterentropy_rng drbg hmac cmac leds_gpio ohci_platform ohci_hcd fsl_mph_dr_of ehci_platform ehci_fsl ehci_hcd gpio_button_hotplug usbcore nls_base usb_common mii crc32c_generic
[   79.415271] Process netifd (pid: 1128, threadinfo=49841ac7, task=18b01252, tls=77e1adfc)
[   79.431297] Stack : 00000760 804d9ecc 80281a6c 00001b7e 00001b7e 81b7e9e0 81b7ea22 81b805a0
[   79.447897]         81b7f040 80a1d000 7fffffff 8029f658 00000002 00000000 80a1d4e0 80a1d4e0
[   79.464495]         81b7e9e0 80281a6c 80530000 8052b540 00000000 00000012 00000000 8006c25c
[   79.481093]         7fffffff 00c00004 00000f00 00000004 00000000 00c00004 0000000c 20000000
[   79.497689]         02584ca2 40000000 00000000 80883280 00000042 01b7ea22 00006c00 80a1d4bc
[   79.514288]         ...
[   79.519129] Call Trace:
[   79.523955] [<8029d9d8>] skb_panic+0x58/0x5c
[   79.532434] [<8029f658>] skb_put+0x54/0x5c
[   79.540566] [<80281a6c>] fe_poll+0x450/0x54c
[   79.549046] [<802b89cc>] __napi_poll.constprop.0+0x7c/0x17c
[   79.560130] [<802b8bd0>] net_rx_action+0x104/0x1e8
[   79.569634] [<803f6b90>] __do_softirq+0x218/0x268
[   79.578993] [<800024f8>] handle_int+0x138/0x144
[   79.587977] [<8027f8dc>] rmb+0x4/0xc
[   79.595064] [<80280460>] fe_r32+0x24/0x30
[   79.603014] [<80280db4>] fe_open+0x224/0x354
[   79.611478] [<802b9828>] __dev_open+0x148/0x160
[   79.620485] [<802b9be8>] __dev_change_flags+0x1b0/0x1b8
[   79.630832] [<802b9c18>] dev_change_flags+0x28/0x70
[   79.640492] [<802dcc80>] dev_ioctl+0x1c0/0x4c8
[   79.649306] [<80295814>] sock_ioctl+0x74/0x3ec
[   79.658133] [<80110bf8>] vfs_ioctl+0x28/0x40
[   79.666622] 
[   79.669560] Code: afa30010  0c012de9  24849c38 <000c000d> 8c8200a4  8c880054  8c8300a0  00451023  01053821 
[   79.688938] 
[   79.692055] ---[ end trace 2d162392f91344c8 ]---
[   79.701283] Kernel panic - not syncing: Fatal exception in interrupt

This doesn't happen if no external link is present at the Ethernet switch, and if the cable is plugged in later, the network works most of the time. For some reason, it looks like the buffer underlying the skb has too little space to store the packet. The packet itself is also way too big: the log reports len:7038, far more than a normal Ethernet frame.
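To put numbers on that, here is a small sketch that decodes the skb_over_panic line from the log above. It only does pointer arithmetic on the printed fields; the function name decode_skb_panic is made up for illustration, and the script is not part of any OpenWrt tooling.

```python
import re

# The skb_over_panic line copied from the log above.
LINE = ("skbuff: skb_over_panic: text:80281a6c len:7038 put:7038 "
        "head:81b7e9e0 data:81b7ea22 tail:0x81b805a0 end:0x81b7f040 dev:eth0")


def decode_skb_panic(line):
    """Return buffer geometry (in bytes) derived from the printed pointers."""
    fields = dict(re.findall(r"(\w+):(\S+)", line))
    head = int(fields["head"], 16)
    data = int(fields["data"], 16)
    tail = int(fields["tail"], 16)
    end = int(fields["end"], 16)
    return {
        "buffer_size": end - head,   # whole linear buffer
        "headroom": data - head,     # bytes reserved in front of the data
        "room": end - data,          # most that could legally be put
        "len": int(fields["len"]),   # what the driver tried to put
        "overrun": tail - end,       # how far past the end it would land
    }


if __name__ == "__main__":
    for name, value in decode_skb_panic(LINE).items():
        print(f"{name:12s} {value:6d}  (0x{value:x})")
```

By this arithmetic the buffer is 1632 bytes with 66 bytes of headroom, so at most 1566 bytes fit, and the 7038-byte put would overrun the end by 5472 bytes. My reading (an assumption, not something the log proves) is that the length taken from the RX descriptor is bogus rather than that a real 7 KB frame arrived.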

But then, the old issue found around the 22.03 release still manifests itself, with around 50% probability (I haven't counted precisely): the Ethernet interface gets stuck, and the famous "eth0 (mtk_soc_eth): transmit queue 0 timed out" message appears.

[  172.138571] ------------[ cut here ]------------
[  172.147759] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:477 dev_watchdog+0x104/0x1a0
[  172.164239] NETDEV WATCHDOG: eth0 (mtk_soc_eth): transmit queue 0 timed out
[  172.178079] Modules linked in: rt2800soc rt2800mmio rt2800lib pppoe ppp_async option nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet usb_wwan rt2x00soc rt2x00mmio rt2x00lib qmi_wwan pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mac80211 cfg80211 usbserial usbnet slhc nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c crc_ccitt compat cdc_wdm sha512_generic seqiv jitterentropy_rng drbg hmac cmac leds_gpio ohci_platform ohci_hcd fsl_mph_dr_of ehci_platform ehci_fsl ehci_hcd gpio_button_hotplug usbcore nls_base usb_common mii crc32c_generic
[  172.326369] CPU: 0 PID: 0 Comm: swapper Not tainted 5.15.141 #0
[  172.338165] Stack : 00000000 00000000 8080be0c 806e0000 80530000 8048f334 8052a09c 80529c43
[  172.354843]         806e32b0 00000000 00200000 8004ebdc 804898b8 00000001 8080bdc8 95b41ce7
[  172.371520]         00000000 00000000 8048f334 8080bc68 00000001 8080bc80 00000000 00000004
[  172.388211]         203a6d6d 00000000 80700000 70617773 00000009 80480000 00000000 00000009
[  172.404896]         804e0060 00200000 805a0000 805a0000 00000000 8022ef64 00000000 806e0000
[  172.421573]         ...
[  172.426436] Call Trace:
[  172.431342] [<80006d38>] show_stack+0x64/0xf4
[  172.440094] [<80022868>] __warn+0x80/0xdc
[  172.448091] [<80022924>] warn_slowpath_fmt+0x60/0x94
[  172.457995] [<802edd18>] dev_watchdog+0x104/0x1a0
[  172.467406] [<8005c780>] call_timer_fn.constprop.0+0x1c/0x80
[  172.478709] [<8005cb2c>] run_timer_softirq+0x314/0x324
[  172.488973] [<803f6b90>] __do_softirq+0x218/0x268
[  172.498354] [<800024f8>] handle_int+0x138/0x144
[  172.507397] [<80002380>] __r4k_wait+0x20/0x40
[  172.516105] [<803f68e8>] default_idle_call+0x28/0x38
[  172.526042] [<800440dc>] do_idle+0x90/0xd0
[  172.534266] [<80044340>] cpu_startup_entry+0x18/0x20
[  172.544202] [<805b3e60>] start_kernel+0x6b0/0x6e4
[  172.553634] 
[  172.556592] ---[ end trace db4bfa66de86dd16 ]---
[  172.565795] mtk_soc_eth 10100000.ethernet eth0: transmit timed out
[  172.578095] mtk_soc_eth 10100000.ethernet eth0: dma_cfg:0000005d
[  172.590057] mtk_soc_eth 10100000.ethernet eth0: tx_ring=0, base=0204c000, max=1024, ctx=27, dtx=25, fdx=25, next=27
[  172.610802] mtk_soc_eth 10100000.ethernet eth0: rx_ring=0, base=02230000, max=1024, calc=1023, drx=0

Just as in this topic: Stuck after flashing ZTE MF283+.
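For what it's worth, the tx_ring line of the timeout dump can be decoded the same way. The sketch below assumes that ctx is the index the CPU last handed to the hardware and dtx is the index the DMA engine has reached; that is my reading of the field names, not something I have confirmed against documentation, and tx_ring_state is a made-up helper name.

```python
import re

# The tx_ring line copied from the timeout dump above.
TX_LINE = ("mtk_soc_eth 10100000.ethernet eth0: tx_ring=0, base=0204c000, "
           "max=1024, ctx=27, dtx=25, fdx=25, next=27")


def tx_ring_state(line):
    """Parse the decimal ring indices and count descriptors still pending."""
    found = re.findall(r"(max|ctx|dtx|fdx|next)=(\d+)", line)
    state = {key: int(value) for key, value in found}
    # ASSUMPTION: ctx = CPU (producer) index, dtx = DMA (consumer) index.
    # The modulo handles the case where ctx has wrapped around the ring.
    state["pending"] = (state["ctx"] - state["dtx"]) % state["max"]
    return state


if __name__ == "__main__":
    print(tx_ring_state(TX_LINE))
```

Under that assumption only 2 of the 1024 descriptors are outstanding, so the ring is nowhere near full; it looks more like the DMA engine stopped picking up descriptors than like the queue was overloaded.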

I started digging around GitHub and found that for mt7621, flow control on the CPU port was disabled (https://patchwork.ozlabs.org/project/openwrt/patch/20200211101741.17350-1-ynezz@true.cz/); previously it was enabled unconditionally for mt7621_gsw. So my initial idea was to disable it for rt305x_esw as well and check whether that helps, but I got nowhere: I couldn't find documentation of the switch's FPA2 register anywhere, only the comment on its initial value, which specifies GbE, full duplex, flow control enabled.

I'd be glad for any pointers to documentation that could help me change that, or for any other ideas about what to check to get the device working again, before I fall back to bisecting changes between v21.02.0 and v22.03.0. I'll probably have to do the same between v22.03.0 and v23.05.0 as well, since the two issues look distinct at first glance.

Edit: I found some information on a potential root cause here: https://github.com/openwrt/openwrt/issues/9284 and some patches courtesy of @lynxis, but with them rebased on top of the current tree, I get a hang at boot too :confused:

Needless to say, this driver needs a lot of help... I would say a complete rewrite someday...

Interestingly enough, the sister driver of this one (for mt7628 and similar) is upstreamed into Linux and in a lot better shape, and the code around the skb_put() calls is written similarly. Comparing the two, there are quite a few small changes to try.

I'm going to be busy for the next couple of days, but you can start experimenting...

comparing this in Linux (starting at the else block)

to this in OpenWrt

examples of what I'm talking about...

Maybe there is a solution after all: https://github.com/openwrt/openwrt/issues/9284#issuecomment-1846577397
I cherry-picked it and it works perfectly on top of current main. Of course, it works on 23.05 too.