I’m encountering the same issue with my R7800. It reboots spontaneously, with nothing logged on the serial console: it freezes solid and, after a few seconds, power cycles.
I’m trying to get the NSS drivers working and have managed to get the NSS cores working for packet acceleration, but the spontaneous reboots make it unusable.
I thought the issue was caused by the NSS drivers but this thread seems to suggest otherwise.
I’m currently using the lede-17.01 branch as well as the v17.01.6 tagged release. Both exhibit the same spontaneous reboot issue.
Does anyone know if there is code in the kernel that causes such reboots?
It's very likely that all ipq8065 (and probably also ipq8064) devices are affected alike, when encountering packets with a specific MTU >1500 bytes (all of my stations use the default MTU sizes, so I haven't seen this behaviour myself) - but given that enabling the NSS cores and their driver is a very invasive change, it would be important to rule out those changes before investigating further into this direction (from a NSS specific angle, at least).
Does this concern the same error?
I never got the stack dump myself; Oops is always my last line.
This stack trace has ath10k calls. The Oops problem seems related to the stmmac ethernet core of the ipq806x SoC.
What I encountered is probably different from what others are reporting here. My builds do not use the STMMAC drivers; they use the qca-nss-gmac drivers instead.
In any case, I tried using an MTU of 4000 and am able to download a 16MB file without issue, so my R7800 rebooting spontaneously is probably not related to jumbo frames.
I have been trying to fix stmmac_main.c by adding code to properly deal with these unexpectedly large packets, but it requires a lot of code changes due to the way the rx ring buffer sizes are (pre)allocated based on MTU size.
The kernel panic that I hit was due to a missing call to set the proper DMA size (basically, DMA would overrun the buffer based on the real size of the packet), after which an skb buffer free would free illegal memory. There is now a patch for this problem: here. I managed to fix this and two other issues, but kept hitting new bugs like starvation (basically, the Ethernet port hangs).
I believe the correct fix for this is now posted here. Most hangs went away, but some remained...
When you configure a larger MTU, the driver allocates a 2K, 4K or 16K buffer and in this way manages to bypass a number of defects due to the extra headroom.
That is why, with MTU=4088, my router has been performing flawlessly for the last 18 days, despite all the stress tests thrown at it.
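To illustrate why the larger MTU hides the problem, here is a rough model of the buffer sizing. This is a sketch only: the function names and the exact thresholds (including the 8K tier) are my assumptions for illustration, not the driver's actual code. The 1536 default matches the "size (1536)" value in the log further down this thread.

```python
# Hypothetical model of how an MTU could map to an RX buffer size.
# Thresholds and names are assumptions, not taken from stmmac_main.c.
DEFAULT_BUFSIZE = 1536
BUF_SIZE_2K = 2048
BUF_SIZE_4K = 4096
BUF_SIZE_8K = 8192
BUF_SIZE_16K = 16384


def rx_buffer_size(mtu):
    """Return the assumed per-descriptor RX buffer size for a given MTU."""
    if mtu >= BUF_SIZE_8K:
        return BUF_SIZE_16K
    if mtu >= BUF_SIZE_4K:
        return BUF_SIZE_8K
    if mtu >= BUF_SIZE_2K:
        return BUF_SIZE_4K
    if mtu > DEFAULT_BUFSIZE:
        return BUF_SIZE_2K
    return DEFAULT_BUFSIZE


def headroom(mtu, frame_len):
    """Bytes left in the buffer after the frame; negative means overrun."""
    return rx_buffer_size(mtu) - frame_len


if __name__ == "__main__":
    for mtu in (1500, 4088):
        print(mtu, rx_buffer_size(mtu), headroom(mtu, 1541))
```

Under these assumptions, a 1541-byte frame overruns the default 1536-byte buffer by 5 bytes at MTU=1500, while at MTU=4088 it fits with plenty of room to spare.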
I think it's worth back-porting the master branch patches (a long list...) to 4.14.93 (or whatever the current release is). There have been 20 or so major code changes/patches submitted since the 4.14 version. If I have some time I will try to generate a patch set and build a kernel.
It might be worth looking at forward-porting ipq806x to 4.19 as well. Yes, the next OpenWrt release will still ship with kernel 4.14, but master can switch immediately afterwards (and 4.19 support patches that don't toggle the default are already accepted).
I've just done a very quick test of the kernel 4.19 forward port (excluding the dsa changes) on my nbg6817, and it seems to be working fine (it might need some further refinements for USB3 on ipq8065, at least manual enabling of kmod-usb-dwc3 && kmod-usb-dwc3-qcom, but I'm not using USB on my router).
The netgear-r7800 and tplink-c2600 routers have been placed behind different cable modems that do not emit jumbo frames (the UBEE modem did), and for that reason these routers now work reliably.
It's a pity that the work-around (setting the MTU to a large value and thereby causing stmmac to allocate a larger DMA buffer) came too late for me; the production environment of a client of mine needed a quick solution (i.e. a change of modem).
I still have one r7800 in semi-production. To reproduce the error, a host connected to one of its LAN phys should emit jumbo frames. I still need to figure out how to force that on the Apple computers that are used; just setting the MTU to a large value on the Ethernet interface of the Mac did not do the trick. Any suggestions on how to do that?
Continuing the discussion from NBG6817: OpenWrt rebooting constantly:
Same problem here. My setup: Huawei B715 as modem (DMZ mode) and an R7800 with OpenWrt 18.06.1 (OpenWrt 18.06.1 r7258-5eb055306f / LuCI openwrt-18.06 branch (git-18.228.31946-f64b152)).
[46158.104334] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1541 larger than size (1536)
[46158.137186] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1541 larger than size (1536)
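If you want to check your own dmesg or logread output for these messages, something like the following might help. It is only a sketch: the regex is written against the two lines above, and the function name and field names are made up for illustration.

```python
import re

# Pattern written against the two log lines quoted above; other kernels
# or drivers may format the message differently (assumption).
OVERSIZE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<driver>\S+)\s+(?P<dev>\S+)\s+"
    r"(?P<iface>\w+):\s+len (?P<len>\d+) larger than size \((?P<size>\d+)\)"
)


def find_oversize_frames(lines):
    """Return one dict per 'len N larger than size (M)' message found."""
    hits = []
    for line in lines:
        m = OVERSIZE_RE.search(line)
        if m is None:
            continue
        frame_len = int(m.group("len"))
        buf_size = int(m.group("size"))
        hits.append({
            "timestamp": float(m.group("ts")),
            "iface": m.group("iface"),
            "len": frame_len,
            "size": buf_size,
            "overrun": frame_len - buf_size,  # bytes past the buffer end
        })
    return hits


if __name__ == "__main__":
    sample = [
        "[46158.104334] ipq806x-gmac-dwmac 37200000.ethernet eth0: "
        "len 1541 larger than size (1536)",
    ]
    for hit in find_oversize_frames(sample):
        print(hit)
```

Feeding it the saved output of dmesg (one line per list element) gives the interface and the number of bytes by which each frame exceeded the buffer.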
@por, I did not find a 100% repeatable pattern that would cause the NBG6817 to crash on any given device. On my spouse's 8-year-old PC, a simple internet speed test (speedtest.net) does the job. On my PC, I ran the Samba stress test against a NAS (MyBookLive) and that caused it to crash after just 5 minutes. On Ubuntu 18.x not much was needed, just booting the PC. The ipq806x-gmac-dwmac error messages appear all the time, but if you are a little (un)lucky the freeing of overflow data does not cause a panic right away, due to the way the kernel aligns DMA buffers. Normally devices should not send jumbo packets to a device that has jumbo packets disabled, so that adds another bit of luck to the mix.