NBG6817: OpenWrt rebooting constantly

Sadly, I didn't solve my problem with the reboots. Does the R7800 have the same chipset, or why are you using this thread? :)

I think the cause was my Sonos bridge sending jumbo frames, which made the router reboot for some reason I never identified.

ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1629 larger than size (1536)
I also got this error, "len xxxx larger than size". I searched for it on Google and this page was the only result.

The R7800 and NBG6817 are very similar. Apart from a few device-specific differences (NAND vs. eMMC, no eSATA on the NBG6817), issues present in one of them are very likely to plague the other as well.

As there are no more reports about that error, it is likely something pretty rare caused by the combination of chipset, network config and the surrounding network devices.

Just for reference, the error comes from Linux sources:
https://elixir.bootlin.com/linux/v4.14.83/source/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c#L3390

And I am not sure if that is the ultimate reason for your crashes, as the kernel is supposed to drop those oversized packets.

Did you find a solution for this?

No, I did not. After two afternoons of debugging my wife got mad, so I replaced the router with a FritzBox 7590.

What kind of equipment is connected to your Ethernet switch, hingbong? Have you located the unit sending jumbo frames?

We get the same error every 30 minutes on eth0 on the R7800 (18.06.1):

kern.err kernel: [ 584.144343] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1994 larger than size

After about ten or so of these messages the router crashes, and sometimes it becomes totally unresponsive.
For now a cron job runs every 10 minutes; when it greps the offending message in the log, it reboots the router so that access stays possible.
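
For anyone who wants to copy that workaround, here is a minimal sketch of such a watchdog. It is not the poster's actual script: the script path, the threshold of 5 and the logger tag are assumptions, and it relies on logread and logger as shipped with OpenWrt.

```shell
#!/bin/sh
# Hypothetical watchdog: reboot once the kernel log contains enough
# "larger than size" errors. Example crontab entry (path is an assumption):
#   */10 * * * * /root/oversize-watchdog.sh run
PATTERN='larger than size'
THRESHOLD=5   # assumed value, tune to taste

# Count matching lines on stdin (grep -c exits 1 on zero matches, hence || true).
count_oversize() {
    grep -c "$PATTERN" || true
}

# Succeeds when the given count has reached the threshold.
should_reboot() {
    [ "$1" -ge "$THRESHOLD" ]
}

# Only touch the system when explicitly called with "run".
if [ "${1:-}" = "run" ]; then
    n=$(logread | count_oversize)
    if should_reboot "$n"; then
        logger -t oversize-watchdog "saw $n oversized-frame errors, rebooting"
        reboot
    fi
fi
```

Since the log buffer starts empty after a reboot, the counter resets by itself and the script should not end up in a reboot loop.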

As the R7800 is connected on eth0 (WAN) to a cable modem (running in bridge mode, no DHCP) and the modem does not offer a way to disable jumbo frames, we do have a serious problem here.

The same thing is happening to me. I'm using Netgear R7800.
My latest error is

ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1926 larger than size (1536)

and it's the same one; it keeps happening every 5 minutes or so.

EDIT: Restoring the router seems to stop it

Restoring what? The stock firmware or something else?

It would be interesting to know what your environments have in common (modem, jumbo frames from the modem, connection type, PPPoE (?), ...).
There is no general flood of that error, so it is possibly triggered by some pretty rare combination of hardware environment and settings.

I suspect this may have something to do with the thermal characteristics of these devices.....

I posted an OEM dts in another thread.... from what I saw in there..... half of it related to thermal / cpu scaling etc.....

Perhaps the time factor is a "bug" which resides in a lot of things not speaking politely to the board?

https://www.dropbox.com/s/go2draj8q4yn95r/Installed%20Programs.htm?dl=0

https://www.dropbox.com/s/1mcpk4qta7ot8fy/Report%20of%20%28HINGBONG-PC%29.html?dl=0
After my PC has been powered off for a long time, booting it triggers this bug. These two files list my hardware and my installed programs. I don't know what causes this.

While ipq806x, like all high-end ARM routers, runs hot, that is quite unlikely to be the reason here; mine has survived quite some thermal stress this (rather hot) summer. At least the quoted messages don't suggest thermal issues. But the Ethernet and switch chip/drivers are shared among all ipq806x devices, which makes them an unlikely culprit as well.

The only source of instability I've observed so far is when enabling flow-offloading (and I've seen similar spurious reboots on ath79 and lantiq with flow-offloading enabled as well).

Reflashed OpenWRT on the router

I think it's just a bug that is very rare, and in my case it only showed up after I reflashed OpenWrt because I wanted a fresh copy. But then the router kept restarting every 5 minutes or so, which made me reflash it again, and everything has been working fine since last night.

qca8337 ... rgmii ... clocks... cond_resched....

if anyone is serious about getting to the bottom of this..... please share what media, client nic / os....

-need to rule out that this is gigabit / auto-negotiation related.

-if you know how to kill power scaling in userspace..... please do so... ( and let me know how to as well :) )

-disable auto-negotiation ( on the next day )

-static macs all round on the third day

-no stp on the fourth

Still happening?

Assumptions......

  1. switch driver ... basically by hard setting as many parameters switch side on the affected router..... it will narrow down things me think......

Now if you could only capture the last 30 ( unique ) proto packets ( incoming ) prior to the error..... :wink:

Code wise;

-This sounded strange
/* Forward all unknown frames to CPU port for Linux processing */

-This pricked some interest too
vi drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c

/* Inter frame gap is set to 12 */
val = 12 << NSS_COMMON_GMAC_CTL_IFG_OFFSET |
      12 << NSS_COMMON_GMAC_CTL_IFG_LIMIT_OFFSET;
/* We also initiate an AXI low power exit request */

if only i had skills - and an affected router :wink:

oh, some serial / debug = 1000 action wouldn't hurt either..... if whoever attacks this knows what / where to force verbosity on..... cause it sure ain't comin..... 8k.o -vvvv etc. etc.

https://github.com/coolsnowwolf/lede/blob/master/feeds.conf.default
I use coolsnowwolf's modified feeds, and this bug disappeared.
His feeds are based on old OpenWrt feeds.

EDIT: after some days, the bug came back......

@hingbong
I have a NBG6817 which is normally running the OEM software, but I can rapidly switch to the latest OpenWRT snapshot given the dual boot capabilities of this router. I can confirm the constant reboots using the latest snapshot r9057-8c6f00e. After having made the logs persistent on an external flash drive, one sees a whole bunch of messages "len xyz is larger than size (1536)" that don't cause a kernel panic as the code simply drops the frame. But at some point it goes wrong, specifically when the frames are starting to get much bigger than 1536 e.g. 19xx. Clearly, the panic is related to this error...

Thu Jan 17 14:22:16 2019 daemon.warn dnsmasq-dhcp[2289]: VUSoloSE is a CNAME, not giving it to the DHCP lease of 192.168.1.8
Thu Jan 17 14:24:57 2019 kern.err kernel: [16862.011375] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1555 larger than size (1536)
Thu Jan 17 14:24:57 2019 kern.err kernel: [16862.011614] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1555 larger than size (1536)
Thu Jan 17 14:24:57 2019 kern.err kernel: [16862.057590] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1555 larger than size (1536)
Thu Jan 17 14:24:58 2019 daemon.info hostapd: wlan1: STA xxxxx WPA: group key handshake completed (RSN)
Thu Jan 17 14:24:58 2019 daemon.info hostapd: wlan1: STA xxxxx WPA: group key handshake completed (RSN)
Thu Jan 17 14:24:58 2019 daemon.info hostapd: wlan1: STA xxxxx WPA: group key handshake completed (RSN)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.515955] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.562652] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.794549] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.893907] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.094066] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.201133] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.794684] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.893968] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:11 2019 kern.alert kernel: [16875.285892] Unable to handle kernel paging request at virtual address 2131d6e2
Thu Jan 17 14:25:11 2019 kern.alert kernel: [16875.285920] pgd = d954c000
Thu Jan 17 14:25:11 2019 kern.alert kernel: [16875.292091] [2131d6e2] *pgd=00000000
Thu Jan 17 14:25:12 2019 kern.emerg kernel: [16875.294697] Internal error: Oops: 5 [#1] SMP ARM
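
To see how the frame lengths develop before a panic, a log like the one above can be condensed per interface. The following is only a sketch that assumes the message format shown in this thread ("ethX: len N larger than size"); the function name is made up.

```shell
#!/bin/sh
# Summarise "len N larger than size" kernel messages read from stdin.
# Prints one line per interface: number of messages and largest length seen.
summarize_oversize() {
    awk '
        /larger than size/ {
            for (i = 2; i < NF; i++) {
                if ($i == "len") {
                    iface = $(i - 1)          # field before "len", e.g. "eth1:"
                    sub(/:$/, "", iface)      # drop the trailing colon
                    len = $(i + 1) + 0        # field after "len" is the frame length
                    count[iface]++
                    if (len > max[iface]) max[iface] = len
                }
            }
        }
        END {
            for (iface in count)
                printf "%s count=%d max_len=%d\n", iface, count[iface], max[iface]
        }
    '
}
```

On the router this could be fed from the live log with "logread | summarize_oversize", or from the persistent log file on the flash drive.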

I have been able to reproduce the crash using my spouse's PC, which has an Intel 82577LM Ethernet card. If I turn on jumbo packets, set the MTU to 4088 and run an internet speed test, it reliably crashes the kernel. With my newer Intel I218-LM it takes quite a lot more effort, but I managed to find a pattern that makes it panic as well. We normally enable jumbo packets for better performance to our Western Digital NAS. Our PCs are connected to the NAS via a low-cost TP-Link 8-port gigabit switch which supports jumbo packets. If we disable jumbo packets everywhere, I still see quite a number of errors and warnings, but no crash, and WiFi and LAN performance are solid.

In case the NBG6817 does support jumbo packets, I could try setting an MTU of 4088 and see how it goes. Alternatively, there is the question: why does Windows (10, 1809) send packets larger than 1500 bytes to the NBG6817 if its MTU is set to 1500? Could TCP window negotiation go wrong somehow? I could also try to reproduce this from Ubuntu or Debian Linux.
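
For the Linux test, one possible starting point is to raise the client MTU and send full-size, non-fragmented pings at the router. This is a sketch only: the interface name eth0 and the router address 192.168.1.1 are assumptions, it needs root, and whether plain ICMP triggers the same path as the speed test has not been established here. The payload is the MTU minus the 20-byte IPv4 header and the 8-byte ICMP header.

```shell
#!/bin/sh
# ICMP payload size that fills an IPv4 packet of exactly MTU bytes.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Only touch the network when explicitly called with "run".
if [ "${1:-}" = "run" ]; then
    IFACE=${IFACE:-eth0}            # assumed name of the client NIC
    ROUTER=${ROUTER:-192.168.1.1}   # assumed LAN address of the router
    MTU=${MTU:-4088}
    ip link set dev "$IFACE" mtu "$MTU"
    # -M do forbids fragmentation, so each echo request leaves as one large frame.
    ping -M do -c 100 -s "$(icmp_payload_for_mtu "$MTU")" "$ROUTER"
fi
```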

I am a Linux kernel developer, so debugging and/or changing kernel code is not an issue. Lack of time is a tougher problem...
Will report back on further tests.
Ewald

It's a good question, but ultimately the kernel driver needs to not crash regardless of what other people send :wink: Since you have experience in kernel development, is there any chance you could look in the code and see if there's an obvious bug near where this message is triggered? One assumes it's probably some buffer overwrite when a packet is too big, or some such thing.

@dlakelan,
No discussion, it should not crash. I was merely trying to understand where in the kernel driver it could go wrong.
Interestingly, and as expected, the problem goes away if we enable jumbo frames on the router. It survived a 90-minute stress test with packets of all sizes up to 4088 bytes...
Unfortunately, due to a bug in OpenWrt, you can't set the MTU in /etc/config/network, neither in the interface nor in the device section.
And regrettably, the driver does not allow changing the MTU while the interface is up.
So you need to issue:

ifconfig eth1 down; ifconfig eth1 mtu 4088; ifconfig eth1 up

I know this is just a workaround, but at least my router is not (continuously) crashing anymore and, if time allows, I know where to look for a defect now and where to add some debug code.
Need to find a clean way to set the mtu at router boot...

EDIT: Ok, this won't win any prize but it works.

  1. Set the MTU in LuCI for your Ethernet port (Interfaces - LAN - Advanced Settings). This adds the mtu option to /etc/config/network.
  2. Add the following to /etc/rc.local (before the final "exit 0"):

MTU=$(uci -q get network.lan.mtu)
[ -n "$MTU" ] && { ifconfig eth1 down; ifconfig eth1 mtu "$MTU"; ifconfig eth1 up; }

  3. Reboot.

You can change the MTU size using LuCI (or by editing /etc/config/network); the change will be picked up by rc.local at the next boot.

Ewald

You could put this command into the custom firewall script and it'd get run when the net comes up.
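
A rough sketch of what that could look like, assuming the custom firewall script is /etc/firewall.user and the affected port is eth1. Because the firewall script can run more than once, this version only bounces the interface when the configured MTU actually differs from the current one; the function names are made up.

```shell
# Sketch for the custom firewall script (assumed to be /etc/firewall.user).
IFACE=eth1   # assumed: the port showing the "larger than size" errors

# Succeeds when the wanted MTU is set, purely numeric, and differs from the current one.
needs_mtu_change() {
    current=$1; wanted=$2
    case "$wanted" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$current" != "$wanted" ]
}

apply_mtu() {
    wanted=$(uci -q get network.lan.mtu) || wanted=
    current=$(cat "/sys/class/net/$IFACE/mtu" 2>/dev/null) || current=
    if needs_mtu_change "$current" "$wanted"; then
        # The driver only accepts a new MTU while the interface is down.
        ifconfig "$IFACE" down; ifconfig "$IFACE" mtu "$wanted"; ifconfig "$IFACE" up
    fi
}

# Only act on a system that actually has uci (i.e. OpenWrt).
if command -v uci >/dev/null 2>&1; then
    apply_mtu
fi
```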