NBG6817: OpenWrt rebooting constantly

I was rapidly scanning the code. The notion of simply dropping a frame larger than the allocated skb/mtu is a bit unusual to me. Normally, one would expect a new skb reservation of bigger size, cacheable memcpy/dma copy followed by recycle of receiver buffer slot. Also the use of buffer size of exactly 1536 bytes with zero headroom when MTU=1500 is not something I am used to. A minimum of 2 (or 4 for word alignment) extra bytes "buffer" is generally implemented on top of some "SKB_HEADROOM" type kernel variable. Interestingly, that is exactly what happens for GMAC4 frames (line 3416). Will take some more analysis to understand if the code has to work around possible HW faults.
Also for MTU 4088, the code uses a 4K buffer (multiple of 2K), so plenty of headroom there.
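To make the buffer arithmetic above concrete, here is a minimal sketch of how an RX buffer size might be picked from the MTU, and how much room that leaves for a given incoming frame. The function names, constants and thresholds are hypothetical and only chosen to match the numbers mentioned in this thread (1536 bytes for MTU 1500, 4K for MTU 4088); this is not the stmmac code itself.

```c
#include <stddef.h>

/* Hypothetical sizes for illustration, not taken from the driver source. */
#define SKETCH_DEFAULT_BUFSIZE 1536u
#define SKETCH_BUF_2K          2048u
#define SKETCH_BUF_4K          4096u
#define SKETCH_BUF_8K          8192u

/* Pick an RX buffer size for a given MTU (thresholds are assumptions). */
unsigned int sketch_rx_bufsize(unsigned int mtu)
{
    if (mtu >= SKETCH_BUF_4K)
        return SKETCH_BUF_8K;
    if (mtu >= SKETCH_BUF_2K)
        return SKETCH_BUF_4K;
    if (mtu > SKETCH_DEFAULT_BUFSIZE)
        return SKETCH_BUF_2K;
    return SKETCH_DEFAULT_BUFSIZE;
}

/* Bytes left over in the buffer after storing a frame of frame_len bytes.
 * A negative result means the frame does not fit. */
int sketch_rx_slack(unsigned int mtu, unsigned int frame_len)
{
    return (int)sketch_rx_bufsize(mtu) - (int)frame_len;
}
```

Under these assumptions, a 1541-byte frame (the length reported later in this thread) overshoots the 1536-byte buffer by 5 bytes at MTU 1500, while at MTU 4088 the same frame leaves 2555 bytes spare.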


/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c rx_watchdog

Thanks for the pointers! In kernel 4.14.93 it looks like ring_mode.c is no longer used, although it's compiled in.
Testing a few modifications now (e.g. adding dev_kfree_skb_any).
EDIT: Giving up for now. With code changes to stop dropping oversized frames, the kernel survives almost 4 hours of stress testing, but ultimately crashes. So more changes and debugging are needed for MTU=1500 when other systems send it a bunch of oversized packets. Additionally, I would need to either open up the router and hook up a console to interact with u-boot and see kernel messages in real time (they don't always make it to persistent storage), or enable netconsole in OpenWrt (which requires more patches and work). When setting the MTU to 9k or 4K (4088), the system remains stable, so there is a proper workaround.


A bug report has been opened: FS#2026.


I’m encountering the same issue with my R7800. It reboots spontaneously, without any logs shown on the serial console. It just freezes solid and, after a few seconds, power cycles.

I’m trying to get the NSS drivers working and managed to get the NSS cores working for packet acceleration, but the spontaneous reboots make it unusable.

I thought the issue was caused by the NSS drivers but this thread seems to suggest otherwise.

I’m currently using the lede-17.01 branch as well as the v17.01.6 tagged release. Both exhibit the same spontaneous reboot issue.

Does anyone know if there’s code in the kernel that causes such reboots?

It's very likely that all ipq8065 (and probably also ipq8064) devices are affected alike when encountering packets with a specific MTU >1500 bytes (all of my stations use the default MTU sizes, so I haven't seen this behaviour myself). But given that enabling the NSS cores and their driver is a very invasive change, it would be important to rule out those changes before investigating further in this direction (from an NSS specific angle, at least).

I’ll try with jumbo frames and see if I can reliably reproduce the reboots.

Yup, triggered......

Mon Jan 21 00:34:43 2019 kern.warn kernel: [187728.111885] ------------[ cut here ]------------                          
Mon Jan 21 00:34:43 2019 kern.warn kernel: [187728.112021] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:320 dev_wat0
Mon Jan 21 00:34:43 2019 kern.info kernel: [187728.115610] NETDEV WATCHDOG: eth0 (ipq806x-gmac-dwmac): transmit queue 0 t
[187760.323976] pgd = dba00000                                                                                           
[187760.330049] [e7e6e5f8] *pgd=00000000                                                                                 
[187760.332840] Internal error: Oops: 5 [#1] SMP ARM                                                                     
[187760.336649] Modules linked in: cp210x pl2303 ch341 ftdi_sio usbserial pppoe ppp_async ath10k_pci ath10k_core ath pppe
[187760.390804]  iptable_filter ipt_ECN ip_tables crc_ccitt compat sch_cake nf_conntrack act_skbedit act_mirred em_u32 cg
[187760.454441] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.14.63 #0                                       
[187760.476668] Hardware name: Generic DT based system                                                                   
[187760.483701] task: c0c06f80 task.stack: c0c00000                                                                      
[187760.488576] PC is at skb_release_data+0x68/0x174                                                                     
[187760.493425] LR is at consume_skb+0x58/0x78                                                                           
[187760.498020] pc : [<c0732680>]    lr : [<c0732348>]    psr: 20000113                                                  
[187760.502017] sp : c0c01ca0  ip : c0c00000  fp : 00000000                                                              
[187760.508612] r10: c0c03c08  r9 : 00000000  r8 : dcb19580                                                              
[187760.513909] r7 : d7ef0a80  r6 : db9da140  r5 : db9da140  r4 : 00000000                                               
[187760.519205] r3 : 000000ba  r2 : 00000000  r1 : 00000002  r0 : e7e6e5e4                                               
[187760.525544] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none                                        
[187760.532141] Control: 10c5787d  Table: 5da0006a  DAC: 00000051                                                        
[187760.539431] Process swapper/0 (pid: 0, stack limit = 0xc0c00210)                                                     
[187760.545248] Stack: (0xc0c01ca0 to 0xc0c02000)                                                                                                
[187760.770185] [<c0732680>] (skb_release_data) from [<c0732348>] (consume_skb+0x58/0x78)                                
[187760.778548] [<c0732348>] (consume_skb) from [<bf47212c>] (ieee80211_rx_napi+0x7a4/0xa14 [mac80211])                  
[187760.786547] [<bf47212c>] (ieee80211_rx_napi [mac80211]) from [<bf5115cc>] (ath10k_htt_t2h_msg_handler+0x17e0/0x1854 )
[187760.795744] [<bf5115cc>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf512414>] (ath10k_htt_txrx_compl_task+0x)
[187760.807223] [<bf512414>] (ath10k_htt_txrx_compl_task [ath10k_core]) from [<bf54db48>] (ath10k_pci_napi_poll+0x78/0x1)
[187760.820068] [<bf54db48>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c074768c>] (net_rx_action+0x144/0x31c)           
[187760.832075] [<c074768c>] (net_rx_action) from [<c03015c8>] (__do_softirq+0xf0/0x264)                                 
[187760.841625] [<c03015c8>] (__do_softirq) from [<c031d280>] (irq_exit+0xdc/0x148)                                      
[187760.849604] [<c031d280>] (irq_exit) from [<c0359b38>] (__handle_domain_irq+0xa8/0xc8)                                
[187760.857065] [<c0359b38>] (__handle_domain_irq) from [<c0301488>] (gic_handle_irq+0x6c/0xb8)                          
[187760.864789] [<c0301488>] (gic_handle_irq) from [<c030c70c>] (__irq_svc+0x6c/0x90)                                                                                                                         
[187760.902383] [<c030c70c>] (__irq_svc) from [<c0308928>]                                
[187760.938220] ---[ end trace 272e43a1859da539 ]---   

Is this the same error?
I never got the stack dump myself; the Oops is always my last line.
This stack trace has ath10k calls. The Oops problem seems related to the stmmac ethernet core of the ipq806x SoC.


What I've encountered is probably different from what others are reporting here. My builds are not using the STMMAC drivers. Instead they are using the qca-nss-gmac drivers.

In any case, I tried using an MTU of 4000 and am able to download a 16MB file without issue, so my R7800 rebooting spontaneously is probably not related to jumbo frames.

Guess, it's more detective work for me ...

I have been trying to fix stmmac_main.c by adding code to properly deal with these unexpected oversized packets, but it requires a lot of code changes due to the way the rx ring buffer sizes are (pre)allocated based on MTU size.
The kernel panic that I hit was due to a missing call to set the proper DMA size (basically, DMA would overrun the buffer based on the real size of the packet), after which an skb buffer free would free illegal memory. There is now a patch for this problem: here. I managed to fix this and 2 other issues, but kept hitting new bugs like starvation (basically, the ethernet port hangs).
I believe the correct fix for this is now posted here. Most hangs went away, but some remained...
When you configure a larger MTU, the driver allocates a 2K, 4K or 16K buffer and in this way manages to bypass a number of defects due to the extra headroom.
That is why with MTU=4088 my router has been performing flawlessly for the last 18 days, despite all the stress tests thrown at it.
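As a rough illustration of the failure mode described above, the following sketch checks a received frame length against the size of its pre-allocated RX buffer. All names are hypothetical and this is not the stmmac code or the posted patch. It only shows two ideas: a frame longer than the buffer must not be read as if it fit, and the length used to release the DMA mapping should be the size the buffer was mapped with, not the length the hardware reported.

```c
#include <stdbool.h>
#include <stddef.h>

/* Outcome of validating one received frame against its RX buffer.
 * Hypothetical structure for illustration only. */
struct sketch_rx_verdict {
    bool   drop;       /* true when the frame does not fit the buffer */
    size_t copy_len;   /* bytes that may safely be read from the buffer */
    size_t unmap_len;  /* length to use when releasing the DMA mapping */
};

/* Decide how to treat a frame of frame_len bytes that arrived in a
 * buffer of buf_size bytes. */
struct sketch_rx_verdict sketch_check_rx_frame(size_t frame_len, size_t buf_size)
{
    struct sketch_rx_verdict v;

    v.drop = frame_len > buf_size;
    /* Never read past the end of the buffer. */
    v.copy_len = v.drop ? 0 : frame_len;
    /* Unmap with the allocated size, regardless of the reported length. */
    v.unmap_len = buf_size;
    return v;
}
```

With a 1536-byte buffer, a 1541-byte frame (as in the "len 1541 larger than size (1536)" message reported in this thread) is flagged for dropping, while a 1514-byte frame passes through with its full length.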

I think it's worth back-porting the master branch patches (a long list...) to 4.14.93 (or whatever the current release is). There have been 20 or so major code changes/patches submitted since the 4.14 version. If I have some time I will try to generate a patch set and build a kernel.


It might be worth looking at forward porting ipq806x to 4.19 as well. Yes, the next OpenWrt release will still ship with kernel 4.14, but master can switch immediately afterwards (and 4.19 support patches that don't toggle the default are already accepted).

Have you noticed the ongoing work at https://git.openwrt.org/?p=openwrt/staging/chunkeey.git;a=shortlog


Thanks, no, I didn't so far.

I've just done a very quick test of the kernel 4.19 forward port (excluding the dsa changes) on my nbg6817, seems to be working fine (it might need some further refinements for USB3 on ipq8065, at least manual enabling of kmod-usb-dwc3 && kmod-usb-dwc3-qcom, but I'm not using USB on my router).

When compiling, do you get this error?

Applying /home/hingbong/openwrt-r7800/target/linux/generic/hack-4.19/302-powerpc-Enable-kernel-XZ-compression-option-on-BOOK3.patch using plaintext: 
patching file arch/powerpc/Kconfig
Hunk #1 FAILED at 199.
1 out of 1 hunk FAILED -- saving rejects to file arch/powerpc/Kconfig.rej
patching file arch/powerpc/platforms/Kconfig.cputype
Hunk #1 FAILED at 75.
1 out of 1 hunk FAILED -- saving rejects to file arch/powerpc/platforms/Kconfig.cputype.rej
Patch failed!  Please fix /home/hingbong/openwrt-r7800/target/linux/generic/hack-4.19/302-powerpc-Enable-kernel-XZ-compression-option-on-BOOK3.patch!

I just applied https://git.openwrt.org/?p=openwrt/staging/chunkeey.git;a=commitdiff;h=5789287f4437624989166c60452f5bc9ce06fd82 to current master, as it doesn't touch anything outside of target/linux/ipq806x/ (aside from package/kernel/linux/modules/usb.mk and target/linux/generic/config-4.19), the patch in question can't affect powerpc specific code - and yes, it did compile (and run) for me.


Do you see the PPPoE service restarting itself?


The netgear-r7800 and tplink-c2600 routers have been placed behind different cable modems that do not emit jumbo frames (the UBEE modem did), and for that reason these routers now work reliably.

It's a pity that the work-around (setting the MTU to a large value and thereby causing stmmac to allocate a larger DMA buffer) came too late for me; the production environment of a client of mine needed a quick solution (i.e. a change of modem).

I still have one r7800 in semi-production. To reproduce the error, a host connected to one of its LAN phys should emit jumbo frames. I still need to figure out how to force that on the Apple computers that are used; just setting the MTU to a large value on the Ethernet interface of the Mac did not do the trick. Any suggestions on how to do that?

Continuing the discussion from NBG6817: OpenWrt rebooting constantly:
Same problem here. My setup: Huawei B715 as modem (DMZ mode) and R7800 with OpenWrt 18.06.1 (OpenWrt 18.06.1 r7258-5eb055306f / LuCI openwrt-18.06 branch (git-18.228.31946-f64b152)).

[46158.104334] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1541 larger than size (1536)
[46158.137186] ipq806x-gmac-dwmac 37200000.ethernet eth0: len 1541 larger than size (1536)