I think it's just a very rare bug; in my case it only showed up after I reflashed OpenWRT because I wanted a fresh copy. The router then kept restarting every 5 minutes or so, which made me reflash it again, and everything has been working fine since last night.
If anyone is serious about getting to the bottom of this, please share what media and client NIC / OS you are using.
-need to rule out that this is gigabit / autonegotiation related.
-if you know how to kill power scaling in userspace, please do so (and let me know how to as well)
-disable auto-negotiation ( on the next day )
-static macs all round on the third day
-no stp on the fourth
Still happening?
Assumptions:
switch driver ... basically, by hard-setting as many parameters as possible switch-side on the affected router, it should narrow things down, I think.
Now if you could only capture the last 30 (unique) protocol packets (incoming) prior to the error...
Code-wise:
-This sounded strange
/* Forward all unknown frames to CPU port for Linux processing */
-This pricked some interest too
vi drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c
/* Inter frame gap is set to 12 */
val = 12 << NSS_COMMON_GMAC_CTL_IFG_OFFSET |
12 << NSS_COMMON_GMAC_CTL_IFG_LIMIT_OFFSET;
/* We also initiate an AXI low power exit request */
If only I had the skills - and an affected router.
Some serial / debug = 1000 action wouldn't hurt either, if whoever attacks this knows what / where to force verbosity on, because it sure isn't coming by itself..... 8k.o -vvvv etc. etc.
@hingbong
I have an NBG6817 which normally runs the OEM software, but I can rapidly switch to the latest OpenWRT snapshot given the dual boot capabilities of this router. I can confirm the constant reboots using the latest snapshot r9057-8c6f00e. After making the logs persistent on an external flash drive, one sees a whole bunch of "len xyz larger than size (1536)" messages that don't cause a kernel panic, as the code simply drops the frame. But at some point it goes wrong, specifically when the frames start to get much bigger than 1536, e.g. 19xx. Clearly, the panic is related to this error...
Thu Jan 17 14:22:16 2019 daemon.warn dnsmasq-dhcp[2289]: VUSoloSE is a CNAME, not giving it to the DHCP lease of 192.168.1.8
Thu Jan 17 14:24:57 2019 kern.err kernel: [16862.011375] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1555 larger than size (1536)
Thu Jan 17 14:24:57 2019 kern.err kernel: [16862.011614] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1555 larger than size (1536)
Thu Jan 17 14:24:57 2019 kern.err kernel: [16862.057590] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1555 larger than size (1536)
Thu Jan 17 14:24:58 2019 daemon.info hostapd: wlan1: STA xxxxx WPA: group key handshake completed (RSN)
Thu Jan 17 14:24:58 2019 daemon.info hostapd: wlan1: STA xxxxx WPA: group key handshake completed (RSN)
Thu Jan 17 14:24:58 2019 daemon.info hostapd: wlan1: STA xxxxx WPA: group key handshake completed (RSN)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.515955] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.562652] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.794549] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.893907] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.094066] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.201133] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.794684] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.893968] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:11 2019 kern.alert kernel: [16875.285892] Unable to handle kernel paging request at virtual address 2131d6e2
Thu Jan 17 14:25:11 2019 kern.alert kernel: [16875.285920] pgd = d954c000
Thu Jan 17 14:25:11 2019 kern.alert kernel: [16875.292091] [2131d6e2] *pgd=00000000
Thu Jan 17 14:25:12 2019 kern.emerg kernel: [16875.294697] Internal error: Oops: 5 [#1] SMP ARM
I have been able to reproduce the crash using my spouse's PC, which has an Intel 82577LM ethernet card. If I turn on jumbo packets, set the MTU to 4088 and run an internet speedtest, it will reliably crash the kernel. With my newer Intel I218-LM it takes quite a lot more effort to make it crash, but I managed to find a pattern that causes it to panic as well. We normally enable jumbo packets for increased performance to our Western Digital NAS. Our PCs are connected to the NAS via a low cost TP-Link gigabit 8-port switch which supports jumbo packets. If we disable jumbo packets everywhere, I can see quite a number of errors and warnings, but no crash, and solid WiFi and LAN performance.
In case the NBG6817 does support jumbo packets, I could try to set an MTU of 4088 and see how it goes. Alternatively, there is the question: why does Windows (10, 1809) send packets >1500 to the NBG6817 if its MTU is set to 1500? Could TCP window negotiation go wrong somehow? I could also try to see if I can reproduce this from Ubuntu or Debian Linux.
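For a Linux client, one rough way to push frames of a known oversized length at the router is a ping with fragmentation prohibited. The sketch below only computes the payload size; the actual commands are left commented out because they need root and a real interface. The interface name eth0 and the router address 192.168.1.1 are placeholders, and whether plain ICMP traffic triggers the same failure as the speedtest is untested.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in a single IPv4 packet of the given
# MTU: MTU minus 20 bytes IPv4 header (no options) minus 8 bytes ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Hypothetical reproduction steps ("eth0" and 192.168.1.1 are placeholders):
# ip link set dev eth0 mtu 4088
# ping -M do -c 5 -s "$(icmp_payload_for_mtu 4088)" 192.168.1.1
```

With "-M do" the ping refuses to fragment, so each echo request leaves the client as one frame of the full MTU size.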
I am a Linux kernel developer, so debugging and/or changing kernel code is not an issue. Lack of time is a tougher problem...
Will report back on further tests.
Ewald
It's a good question, but ultimately the kernel driver needs to not crash regardless of what other people send. Since you have experience in kernel development, is there any chance you could look in the code and see if there's an obvious bug near where this message is triggered? One assumes it's probably some buffer overwrite when a packet is too big, or some such thing.
@dlakelan,
No discussion, it should not crash. I was merely trying to understand where in the kernel driver it could go wrong.
Interestingly, and as expected, the problem goes away if we enable jumbo frames. The router survived a 90-minute stress test with packets of all sizes up to 4088 bytes...
Unfortunately due to a bug in OpenWRT, you can't set the MTU in /etc/config/network, neither in the interface, nor device section.
And regrettably, the driver does not allow setting the MTU in active state.
So you need to issue:
ifconfig eth1 down; ifconfig eth1 mtu 4088; ifconfig eth1 up
I know this is just a workaround, but at least my router is not (continuously) crashing anymore and, if time allows, I know where to look for a defect now and where to add some debug code.
Need to find a clean way to set the mtu at router boot...
EDIT: Ok, this won't win any prize but it works.
Set MTU in Luci for your ethernet port (interfaces - lan - advanced settings). This will add the mtu option to /etc/config/network
Add to /etc/rc.local (before the final exit 0):
MTU=$(uci get network.lan.mtu)
ifconfig eth1 down; ifconfig eth1 mtu $MTU; ifconfig eth1 up
Then reboot the router.
You can change the MTU size using Luci (or edit /etc/config/network), the changes will be picked up by rc.local
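A slightly more defensive variant of the rc.local lines above might look like the sketch below. It skips the down/up cycle when the mtu option is missing or not a sane number, and always brings the interface back up. The helper names valid_mtu and apply_mtu and the 1500..9000 bounds are assumptions of mine, not anything OpenWRT provides; eth1 and network.lan.mtu are taken from the post.

```shell
# Hypothetical helper: accept only plain integers in 1500..9000
# (the bounds are an assumption; adjust to what your hardware supports).
valid_mtu() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 1500 ] && [ "$1" -le 9000 ]
}

# Hypothetical helper: $1 = interface, $2 = MTU.
# The driver only accepts an MTU change while the interface is down,
# so take it down, set the MTU, and bring it back up in any case.
apply_mtu() {
    valid_mtu "$2" || return 1
    ifconfig "$1" down
    ifconfig "$1" mtu "$2"
    rc=$?
    ifconfig "$1" up
    return $rc
}

# In /etc/rc.local one might then call (left commented out here):
# apply_mtu eth1 "$(uci -q get network.lan.mtu)" || logger "rc.local: eth1 MTU left unchanged"
```

If the option is unset, uci prints nothing, valid_mtu rejects the empty string, and the interface is never touched.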
I was rapidly scanning the code. The notion of simply dropping a frame larger than the allocated skb/mtu is a bit unusual to me. Normally, one would expect a new skb reservation of bigger size, cacheable memcpy/dma copy followed by recycle of receiver buffer slot. Also the use of buffer size of exactly 1536 bytes with zero headroom when MTU=1500 is not something I am used to. A minimum of 2 (or 4 for word alignment) extra bytes "buffer" is generally implemented on top of some "SKB_HEADROOM" type kernel variable. Interestingly, that is exactly what happens for GMAC4 frames (line 3416). Will take some more analysis to understand if the code has to work around possible HW faults.
Also, for MTU 4088 the code uses a 4K buffer (multiple of 2K), so there is plenty of headroom there.
Ewald
Thanks for the pointers! In kernel 4.14.93 it looks like ring_mode.c is no longer used, although it's compiled in.
Testing a few modifications now (e.g. adding dev_kfree_skb_any).
EDIT: Giving up for now. With code changes to not drop oversized frames, the kernel survives almost 4 hours of stress test, but ultimately crashes. So more changes and debugging are needed for MTU=1500 with other systems sending it a bunch of oversized packets. Additionally, I would need to open up the router and hook up a console to interact with u-boot and see kernel messages in real time (they don't always make it to persistent storage), or enable netconsole in OpenWRT (which requires more patches and work). When setting the MTU to 9k or 4K (4088), the system remains stable, so there is a proper workaround.
I’m encountering the same issue with my R7800. It is spontaneously rebooting, but without any logs shown on the serial console. It just freezes solid and, after some seconds, power cycles.
I’m trying to get the NSS drivers working and managed to get the NSS cores working for packet acceleration, but the spontaneous reboots make it unusable.
I thought the issue was caused by the NSS drivers but this thread seems to suggest otherwise.
I’m currently using the lede-17.01 branch as well as the v17.01.6 tagged release. Both exhibit the same spontaneous reboot issue.
Does anyone know if there’s code in the kernel that causes such reboots?
It's very likely that all ipq8065 (and probably also ipq8064) devices are affected alike, when encountering packets with a specific MTU >1500 bytes (all of my stations use the default MTU sizes, so I haven't seen this behaviour myself) - but given that enabling the NSS cores and their driver is a very invasive change, it would be important to rule out those changes before investigating further into this direction (from a NSS specific angle, at least).
Does this concern the same error ?
Never got the stack dump myself, always Oops is my last line.
This stack trace has ath10k calls. The Oops problem seems related to the stmmac ethernet core of the ipq806x SoC.
What I'd encountered is probably different from what others are reporting here. My builds are not using the STMMAC drivers. Instead it is using the qca-nss-gmac drivers.
In any case, I tried using an MTU of 4000 and am able to download a 16MB file without issue, so my R7800 rebooting spontaneously is probably not related to jumbo frames.
I have been trying to fix stmmac_main.c by adding code to properly deal with these unexpectedly large packets, but it requires a lot of code changes due to the way the rx ring buffer sizes are (pre)allocated based on MTU size.
The kernel panic that I hit was due to a missing call to set the proper DMA size (basically, DMA would overrun the buffer based on the real size of the packet), after which an skb buffer free would free illegal memory. There is now a patch for this problem: here. I managed to fix this and 2 other issues, but kept hitting new bugs like starvation (basically the ethernet port hangs).
I believe the correct fix for this is now posted here. Most hangs went away, but some remained...
When you configure a larger MTU, the driver allocates a 2K, 4K or 16K buffer and in this way manages to bypass a number of defects due to the extra headroom.
That is why, with MTU=4088, my router has been performing flawlessly for the last 18 days, despite all the stress tests thrown at it.
I think it's worth back-porting the master branch patches (a long list...) to 4.14.93 (or whatever is the current release). There are 20 or so major code changes/patches submitted since the 4.14 version. If I have some time I will try to generate a patch set and build a kernel.
It might be worth looking at forward porting ipq806x to 4.19 as well. Yes, the next OpenWrt release will still ship with kernel 4.14, but master can switch immediately afterwards (and 4.19 support patches that don't toggle the default are already accepted).