NBG6817: OpenWrt rebooting constantly

slh · January 20, 2019, 11:00pm

It's very likely that all ipq8065 (and probably also ipq8064) devices are affected alike when encountering packets with a specific MTU >1500 bytes (all of my stations use the default MTU sizes, so I haven't seen this behaviour myself). However, given that enabling the NSS cores and their driver is a very invasive change, it would be important to rule out those changes before investigating further in this direction (from an NSS-specific angle, at least).

quarky · January 21, 2019, 12:03am

I’ll try with jumbo frames and see if I can reliably reproduce the reboots.

anon50098793 · January 21, 2019, 12:42am

Yup, triggered......

Mon Jan 21 00:34:43 2019 kern.warn kernel: [187728.111885] ------------[ cut here ]------------                          
Mon Jan 21 00:34:43 2019 kern.warn kernel: [187728.112021] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:320 dev_wat0
Mon Jan 21 00:34:43 2019 kern.info kernel: [187728.115610] NETDEV WATCHDOG: eth0 (ipq806x-gmac-dwmac): transmit queue 0 t
           
[187760.323976] pgd = dba00000                                                                                           
[187760.330049] [e7e6e5f8] *pgd=00000000                                                                                 
[187760.332840] Internal error: Oops: 5 [#1] SMP ARM                                                                     
[187760.336649] Modules linked in: cp210x pl2303 ch341 ftdi_sio usbserial pppoe ppp_async ath10k_pci ath10k_core ath pppe
[187760.390804]  iptable_filter ipt_ECN ip_tables crc_ccitt compat sch_cake nf_conntrack act_skbedit act_mirred em_u32 cg
[187760.454441] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.14.63 #0                                       
[187760.476668] Hardware name: Generic DT based system                                                                   
[187760.483701] task: c0c06f80 task.stack: c0c00000                                                                      
[187760.488576] PC is at skb_release_data+0x68/0x174                                                                     
[187760.493425] LR is at consume_skb+0x58/0x78                                                                           
[187760.498020] pc : [<c0732680>]    lr : [<c0732348>]    psr: 20000113                                                  
[187760.502017] sp : c0c01ca0  ip : c0c00000  fp : 00000000                                                              
[187760.508612] r10: c0c03c08  r9 : 00000000  r8 : dcb19580                                                              
[187760.513909] r7 : d7ef0a80  r6 : db9da140  r5 : db9da140  r4 : 00000000                                               
[187760.519205] r3 : 000000ba  r2 : 00000000  r1 : 00000002  r0 : e7e6e5e4                                               
[187760.525544] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none                                        
[187760.532141] Control: 10c5787d  Table: 5da0006a  DAC: 00000051                                                        
[187760.539431] Process swapper/0 (pid: 0, stack limit = 0xc0c00210)                                                     
[187760.545248] Stack: (0xc0c01ca0 to 0xc0c02000)                                                                                                
[187760.770185] [<c0732680>] (skb_release_data) from [<c0732348>] (consume_skb+0x58/0x78)                                
[187760.778548] [<c0732348>] (consume_skb) from [<bf47212c>] (ieee80211_rx_napi+0x7a4/0xa14 [mac80211])                  
[187760.786547] [<bf47212c>] (ieee80211_rx_napi [mac80211]) from [<bf5115cc>] (ath10k_htt_t2h_msg_handler+0x17e0/0x1854 )
[187760.795744] [<bf5115cc>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf512414>] (ath10k_htt_txrx_compl_task+0x)
[187760.807223] [<bf512414>] (ath10k_htt_txrx_compl_task [ath10k_core]) from [<bf54db48>] (ath10k_pci_napi_poll+0x78/0x1)
[187760.820068] [<bf54db48>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c074768c>] (net_rx_action+0x144/0x31c)           
[187760.832075] [<c074768c>] (net_rx_action) from [<c03015c8>] (__do_softirq+0xf0/0x264)                                 
[187760.841625] [<c03015c8>] (__do_softirq) from [<c031d280>] (irq_exit+0xdc/0x148)                                      
[187760.849604] [<c031d280>] (irq_exit) from [<c0359b38>] (__handle_domain_irq+0xa8/0xc8)                                
[187760.857065] [<c0359b38>] (__handle_domain_irq) from [<c0301488>] (gic_handle_irq+0x6c/0xb8)                          
[187760.864789] [<c0301488>] (gic_handle_irq) from [<c030c70c>] (__irq_svc+0x6c/0x90)                                                                                                                         
[187760.902383] [<c030c70c>] (__irq_svc) from [<c0308928>]                                
[187760.938220] ---[ end trace 272e43a1859da539 ]---

por · January 21, 2019, 8:19am

Does this concern the same error?
I never got the stack dump myself; the Oops is always my last line.
This stack trace has ath10k calls. The Oops problem seems related to the stmmac ethernet core of the ipq806x SoC.

quarky · January 21, 2019, 1:39pm

What I encountered is probably different from what others are reporting here. My builds do not use the STMMAC drivers; they use the qca-nss-gmac drivers instead.

In any case, I tried using an MTU of 4000 and am able to download a 16MB file without issue, so my R7800 rebooting spontaneously is probably not related to jumbo frames.

Guess, it's more detective work for me ...

ewald · February 5, 2019, 5:22pm

I have been trying to fix stmmac_main.c by adding code to properly deal with these unexpectedly large packets, but it requires a lot of code changes due to the way the rx ring buffer sizes are (pre)allocated based on MTU size.
The kernel panic that I hit was due to a missing call to set the proper DMA size (basically, DMA would overrun the buffer based on the real size of the packet), after which an skb buffer free would free illegal memory. There is now a patch for this problem: here. I managed to fix this and 2 other issues, but kept hitting new bugs like starvation (basically, the ethernet port hangs).
I believe the correct fix for this is now posted here. Most hangs went away, but some remained...
When you configure a larger MTU, the driver allocates a 2K, 4K or 16K buffer and in this way manages to bypass a number of defects due to the extra headroom.
That is why, with MTU=4088, my router has been performing flawlessly for the last 18 days, despite all the stress tests thrown at it.
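
As a rough illustration of the point above, here is a small Python model of MTU-based buffer sizing. It is only a sketch and not the stmmac code: the bucket list combines the 1536-byte default seen in the driver's "larger than size (1536)" messages with the 2K, 4K and 16K sizes mentioned above, and the selection rule (smallest bucket that holds the MTU) as well as the names `rx_buffer_size` and `frame_overruns` are assumptions made for the example.

```python
# Hypothetical model of MTU-based RX buffer sizing; NOT taken from stmmac_main.c.
DEFAULT_BUF = 1536  # default size, as reported in "larger than size (1536)"
BUCKETS = (DEFAULT_BUF, 2048, 4096, 16384)  # 2K/4K/16K as mentioned in the post


def rx_buffer_size(mtu):
    """Return the smallest assumed bucket that can hold `mtu` bytes."""
    for size in BUCKETS:
        if mtu <= size:
            return size
    raise ValueError("MTU %d exceeds the largest modelled buffer" % mtu)


def frame_overruns(frame_len, mtu):
    """True if a received frame is longer than the buffer modelled for `mtu`."""
    return frame_len > rx_buffer_size(mtu)
```

Under these assumptions, a 1541-byte frame overruns the buffer chosen for MTU 1500 but fits comfortably in the one chosen for MTU 4088, which is the extra headroom described above.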

I think it's worth back-porting the master branch patches (a long list...) to 4.14.93 (or whatever the current release is). There are 20 or so major code changes/patches submitted since the 4.14 version. If I have some time I will try to generate a patch set and build a kernel.

slh · February 5, 2019, 8:57pm

It might be worth looking at forward-porting ipq806x to 4.19 as well. Yes, the next OpenWrt release will still ship with kernel 4.14, but master can switch immediately afterwards (and 4.19 support patches that don't toggle the default are already accepted).

hnyman · February 5, 2019, 9:35pm

Have you noticed the ongoing work at https://git.openwrt.org/?p=openwrt/staging/chunkeey.git;a=shortlog ?

slh · February 5, 2019, 9:48pm

Thanks, no, I didn't so far.

slh · February 6, 2019, 12:18am

I've just done a very quick test of the kernel 4.19 forward port (excluding the dsa changes) on my nbg6817, seems to be working fine (it might need some further refinements for USB3 on ipq8065, at least manual enabling of kmod-usb-dwc3 && kmod-usb-dwc3-qcom, but I'm not using USB on my router).

hingbong · February 6, 2019, 2:08am

When compiling, do you get this error?

Applying /home/hingbong/openwrt-r7800/target/linux/generic/hack-4.19/302-powerpc-Enable-kernel-XZ-compression-option-on-BOOK3.patch using plaintext: 
patching file arch/powerpc/Kconfig
Hunk #1 FAILED at 199.
1 out of 1 hunk FAILED -- saving rejects to file arch/powerpc/Kconfig.rej
patching file arch/powerpc/platforms/Kconfig.cputype
Hunk #1 FAILED at 75.
1 out of 1 hunk FAILED -- saving rejects to file arch/powerpc/platforms/Kconfig.cputype.rej
Patch failed!  Please fix /home/hingbong/openwrt-r7800/target/linux/generic/hack-4.19/302-powerpc-Enable-kernel-XZ-compression-option-on-BOOK3.patch!

slh · February 6, 2019, 2:29am

I just applied https://git.openwrt.org/?p=openwrt/staging/chunkeey.git;a=commitdiff;h=5789287f4437624989166c60452f5bc9ce06fd82 to current master. As it doesn't touch anything outside of target/linux/ipq806x/ (aside from package/kernel/linux/modules/usb.mk and target/linux/generic/config-4.19), the patch in question can't affect powerpc-specific code. And yes, it did compile (and run) for me.

hingbong · February 6, 2019, 5:51am

Do you see the PPPoE service restarting by itself?

por · February 6, 2019, 9:18am

Interesting.

The netgear-r7800 and tplink-c2600 routers have been placed behind different cable modems that do not emit jumbo frames (the UBEE modem did), and for that reason these routers now work reliably.

It's a pity that the work-around (setting the MTU to a large value and thereby causing stmmac to allocate a larger DMA buffer) came too late for me; the production environment of a client of mine needed a quick solution (i.e. a change of modem).

I still have one r7800 in semi-production. To reproduce the error, a host connected to one of its LAN PHYs should emit jumbo frames. I still need to figure out how to force that on the Apple computers that are used; just setting the MTU to a large value on the Ethernet interface of the Mac did not do the trick. Any suggestions on how to do that?

zkzkzk2015 · February 6, 2019, 9:49am

Same problem here. My setup: Huawei B715 as modem (DMZ mode) and R7800 with OpenWrt 18.06.1 (OpenWrt 18.06.1 r7258-5eb055306f / LuCI openwrt-18.06 branch (git-18.228.31946-f64b152)).

[46158.104334] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1541 larger than size (1536)
[46158.137186] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1541 larger than size (1536)
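
For anyone wanting to check a saved kernel log for the same symptom, a minimal Python sketch along these lines could pick out the messages. The regular expression is written against the two lines quoted above only; the function name `find_oversized_frames` and the assumption that every such message has exactly this shape are mine, not something taken from the driver.

```python
import re

# Matches the driver message quoted above, e.g.
# "[46158.104334] ... eth0: len 1541 larger than size (1536)"
OVERSIZE_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+.*?\b(?P<dev>\w+): "
    r"len (?P<len>\d+) larger than size \((?P<size>\d+)\)"
)


def find_oversized_frames(log_text):
    """Return (timestamp, device, frame_len, buf_size) for each matching line."""
    hits = []
    for line in log_text.splitlines():
        m = OVERSIZE_RE.search(line)
        if m:
            hits.append(
                (float(m.group("ts")), m.group("dev"),
                 int(m.group("len")), int(m.group("size")))
            )
    return hits
```

Feeding it the text of a saved dmesg or logread dump returns one tuple per oversized-frame message, which makes it easy to see how often the messages occur and how far the reported length exceeds the buffer size.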

ewald · February 9, 2019, 6:28pm

@por, I did not find a 100% repeatable pattern that would cause the NBG6817 to crash on any given device. On my spouse's 8-year-old PC, a simple internet speed test (speedtest.net) does the job. On my PC, I ran a Samba stress test against a NAS (MyBookLive) and that caused it to crash after just 5 minutes. On Ubuntu 18.x not much was needed, just booting the PC. The ipq806x-gmac-dwmac error messages come all the time, but if you are a little (un)lucky, the freeing of overflow data does not cause a panic right away, due to the way the kernel aligns DMA buffers. Normally devices should not send jumbo packets to a device that has jumbo packets disabled, so that adds another bit of luck to the mix.

hingbong · February 10, 2019, 1:44am

I use Linux 4.19 with the DSA changes (https://git.openwrt.org/?p=openwrt/staging/chunkeey.git;a=commit;h=0ebf2d98c1a6debf035055b7d006a66e8024336b) and have enabled jumbo packets; it doesn't reboot.

vanyaindigo · February 26, 2019, 10:06am

Same problem here: https://forum.openwrt.org/t/ethernet-eth0-len-1571-larger-than-size-1536/32098
on an NBG6817 with OpenWrt 18.06.2

Evgeniy1 · March 9, 2019, 5:26pm

Is this bug only in the 4.14.xx kernel (18.06.xx)?
Is the 4.4.xx kernel in 17.01.xx free from this bug?

ewald · March 10, 2019, 3:56pm

The 4.4.x driver is very different. I quickly browsed the latest 4.4.176 and it has neither the oversized frame reception bug nor the incorrect dma_free code. That said, a similar bug related to the pre-allocation of SKB buffers was also present in 4.4.x, but was fixed in late 2015 here. Unfortunately, the 4.4 version of the driver also had its fair share of issues. Just take a look here. So unless the 17.01.xx version is based on a very recent version of 4.4.x, you have a good chance of experiencing some kind of issue, but as far as I can tell not the ones discussed above with 18.06.xx.