NBG6817: OpenWrt rebooting constantly

Reflashed OpenWRT on the router

I think it's just a very rare bug; in my case it only showed up after I reflashed OpenWRT because I wanted a fresh copy. The router then kept restarting every 5 minutes or so, so I reflashed it again, and everything has been working fine since last night.

qca8337 ... rgmii ... clocks... cond_resched....

if anyone is serious about getting to the bottom of this..... please share what media, client nic / os....

-need to rule out that this is gigabit / autonegotiation related.

-if you know how to kill power scaling in userspace..... please do so... ( and let me know how to as well :slight_smile: )

-disable auto-negotiation ( on the next day )

-static macs all round on the third day

-no stp on the fourth

Still happening?

Assumptions......

  1. switch driver ... basically, hard-setting as many switch-side parameters as possible on the affected router..... should narrow things down, methinks......

Now if you could only capture the last 30 ( unique ) proto packets ( incoming ) prior to the error..... :wink:

Code wise;

-This sounded strange
/* Forward all unknown frames to CPU port for Linux processing */

-This pricked some interest too
vi drivers/net/ethernet/stmicro/stmmac/dwmac-ipq806x.c

/* Inter frame gap is set to 12 */
val = 12 << NSS_COMMON_GMAC_CTL_IFG_OFFSET |
      12 << NSS_COMMON_GMAC_CTL_IFG_LIMIT_OFFSET;
/* We also initiate an AXI low power exit request */

if only i had skills - and an affected router :wink:

oh, some serial / debug = 1000 action wouldn't hurt either..... if whoever attacks this knows what / where to force verbosity on..... cause it sure ain't comin..... 8k.o -vvvv etc. etc.

https://github.com/coolsnowwolf/lede/blob/master/feeds.conf.default
I use coolsnowwolf's modified feeds, and this bug disappeared.
His feeds are based on old OpenWrt feeds.

edit: after some days, the bug came back......

@hingbong
I have an NBG6817 which normally runs the OEM software, but I can rapidly switch to the latest OpenWRT snapshot thanks to the dual boot capabilities of this router. I can confirm the constant reboots using the latest snapshot r9057-8c6f00e. After making the logs persistent on an external flash drive, one sees a whole bunch of messages "len xyz larger than size (1536)" that don't cause a kernel panic, as the code simply drops the frame. But at some point it goes wrong, specifically when the frames start to get much bigger than 1536, e.g. 19xx. Clearly, the panic is related to this error...

Thu Jan 17 14:22:16 2019 daemon.warn dnsmasq-dhcp[2289]: VUSoloSE is a CNAME, not giving it to the DHCP lease of 192.168.1.8
Thu Jan 17 14:24:57 2019 kern.err kernel: [16862.011375] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1555 larger than size (1536)
Thu Jan 17 14:24:57 2019 kern.err kernel: [16862.011614] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1555 larger than size (1536)
Thu Jan 17 14:24:57 2019 kern.err kernel: [16862.057590] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1555 larger than size (1536)
Thu Jan 17 14:24:58 2019 daemon.info hostapd: wlan1: STA xxxxx WPA: group key handshake completed (RSN)
Thu Jan 17 14:24:58 2019 daemon.info hostapd: wlan1: STA xxxxx WPA: group key handshake completed (RSN)
Thu Jan 17 14:24:58 2019 daemon.info hostapd: wlan1: STA xxxxx WPA: group key handshake completed (RSN)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.515955] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.562652] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.794549] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:08 2019 kern.err kernel: [16872.893907] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.094066] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.201133] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.794684] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1931 larger than size (1536)
Thu Jan 17 14:25:09 2019 kern.err kernel: [16873.893968] ipq806x-gmac-dwmac 37400000.ethernet eth1: len 1822 larger than size (1536)
Thu Jan 17 14:25:11 2019 kern.alert kernel: [16875.285892] Unable to handle kernel paging request at virtual address 2131d6e2
Thu Jan 17 14:25:11 2019 kern.alert kernel: [16875.285920] pgd = d954c000
Thu Jan 17 14:25:11 2019 kern.alert kernel: [16875.292091] [2131d6e2] *pgd=00000000
Thu Jan 17 14:25:12 2019 kern.emerg kernel: [16875.294697] Internal error: Oops: 5 [#1] SMP ARM

I have been able to reproduce the crash using my spouse's PC, which has an Intel 82577LM ethernet card. If I turn on jumbo packets, set the MTU to 4088 and run an internet speedtest, it reliably crashes the kernel. With my newer Intel I218-LM it takes quite a lot more effort to make it crash, but I managed to find a pattern that causes it to panic as well. We normally enable jumbo packets for increased performance to our Western Digital NAS. Our PCs are connected to the NAS via a low-cost TP-Link gigabit 8-port switch which supports jumbo packets. If we disable jumbo packets everywhere, I can see quite a number of errors and warnings, but no crash, and WiFi and LAN performance is solid.

If the NBG6817 does support jumbo packets, I could try to set an MTU of 4088 and see how it goes. Alternatively, there is the question: why does Windows (10, 1809) send packets >1500 to the NBG6817 if its MTU is set to 1500? Could TCP window negotiation go wrong somehow? I could also try to see if I can reproduce this from Ubuntu or Debian Linux.
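
For a Linux client, one possible way to push full-size 4088-byte packets at the router is a ping with fragmentation prohibited. This is a sketch, not a confirmed reproducer: the helper name `icmp_payload_for_mtu`, the client interface `eth0` and the router address `192.168.1.1` are assumptions, and `-M do` is the iputils option that sets the don't-fragment bit.

```shell
#!/bin/sh
# ICMP echo payload that yields an IPv4 packet of exactly $1 bytes:
# subtract the 20-byte IPv4 header (no options) and the 8-byte ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 20 - 8 ))
}

# Usage on the client (interface and address are placeholders):
#   ip link set dev eth0 mtu 4088
#   ping -M do -c 100 -s "$(icmp_payload_for_mtu 4088)" 192.168.1.1
```

Watching the router log for "larger than size" messages while this runs shows whether the oversized frames actually arrive.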

I am a Linux kernel developer, so debugging and/or changing kernel code is not an issue. Lack of time is a tougher problem...
Will report back on further tests.
Ewald


It's a good question, but ultimately the kernel driver needs to not crash regardless of what other people send :wink: Since you have experience in kernel development, is there any chance you could look in the code and see if there's an obvious bug near where this message is triggered? One assumes it's probably some buffer overwrite when a packet is too big, or some such thing.

@dlakelan,
No argument there, it should not crash. I was merely trying to understand where in the kernel driver it could go wrong.
Interestingly, as expected, the problem goes away if we enable jumbo frames. The router survived a 90-minute stress test with packets of all sizes up to 4088 bytes...
Unfortunately, due to a bug in OpenWRT, you can't set the MTU in /etc/config/network, neither in the interface section nor in the device section.
And regrettably, the driver does not allow setting the MTU while the interface is up.
So you need to issue:

ifconfig eth1 down; ifconfig eth1 mtu 4088; ifconfig eth1 up

I know this is just a workaround, but at least my router is not (continuously) crashing anymore and, if time allows, I know where to look for a defect now and where to add some debug code.
Need to find a clean way to set the mtu at router boot...

EDIT: Ok, this won't win any prize but it works.

  1. Set MTU in Luci for your ethernet port (interfaces - lan - advanced settings). This will add the mtu option to /etc/config/network
  2. Add to the /etc/rc.local

MTU=$(uci get network.lan.mtu)
ifconfig eth1 down; ifconfig eth1 mtu $MTU; ifconfig eth1 up

  3. Reboot

You can then change the MTU size using Luci (or by editing /etc/config/network); the change will be picked up by rc.local on the next boot.
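
A slightly more defensive variant of the rc.local lines is sketched below. It skips the down/up cycle when the UCI value is missing or not a sane number. The function names and the 1500..9000 range are my own assumptions, not something taken from the driver; `eth1` is the device used above.

```shell
#!/bin/sh
# Succeed only if $1 is a whole number within [$2, $3].
mtu_in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Apply the MTU stored in UCI to device $1. The link is cycled because
# the driver refuses MTU changes while the interface is up.
apply_lan_mtu() {
    dev="$1"
    mtu="$(uci -q get network.lan.mtu)"
    mtu_in_range "$mtu" 1500 9000 || return 1
    ifconfig "$dev" down
    ifconfig "$dev" mtu "$mtu"
    ifconfig "$dev" up
}

# In /etc/rc.local, before "exit 0":
#   apply_lan_mtu eth1
```

If the option is absent, `uci -q get` prints nothing, the range check fails, and the interface is left untouched.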

Ewald


You could put this command into the custom firewall script and it'd get run when the net comes up.

I was rapidly scanning the code. The notion of simply dropping a frame larger than the allocated skb/mtu is a bit unusual to me. Normally, one would expect a new skb reservation of bigger size, cacheable memcpy/dma copy followed by recycle of receiver buffer slot. Also the use of buffer size of exactly 1536 bytes with zero headroom when MTU=1500 is not something I am used to. A minimum of 2 (or 4 for word alignment) extra bytes "buffer" is generally implemented on top of some "SKB_HEADROOM" type kernel variable. Interestingly, that is exactly what happens for GMAC4 frames (line 3416). Will take some more analysis to understand if the code has to work around possible HW faults.
Also, for MTU 4088, the code uses a 4K buffer (a multiple of 2K), so plenty of headroom there.
Ewald


stmmac/ring_mode.c
/drivers/net/ethernet/stmicro/stmmac/dwmac1000_dma.c rx_watchdog
lib/dma-debug.c

Thanks for the pointers! In kernel 4.14.93 it looks like ring_mode.c is no longer used, although it's compiled in.
Testing a few modifications now (e.g. adding dev_kfree_skb_any).
EDIT: giving up for now. With code changes that stop the driver from dropping oversized frames, the kernel survives almost 4 hours of stress testing but ultimately still crashes. So more changes and debugging are needed for the case of MTU=1500 with other systems sending a bunch of oversized packets. Additionally, I would need to either open up the router and hook up a console to interact with u-boot and see kernel messages in real time (they don't always make it to persistent storage), or enable netconsole in OpenWRT (which requires more patches and work). When setting the MTU to 9k or 4K (4088), the system remains stable, so there is a proper workaround.


A bug report has been opened: FS#2026.


I’m encountering the same issue with my R7800. It is spontaneously rebooting, but without any logs shown on the serial console. It just freezes solid and, after a few seconds, power cycles.

I’m trying to get the NSS drivers working and managed to get the NSS cores working for packet acceleration, but the spontaneous reboots make it unusable.

I thought the issue was caused by the NSS drivers but this thread seems to suggest otherwise.

I’m currently using the lede-17.01 branch as well as the v17.01.6 tagged release. Both exhibit the same spontaneous reboot issue.

Does anyone know if there’s code in the kernel that could cause such reboots?

It's very likely that all ipq8065 (and probably also ipq8064) devices are affected alike, when encountering packets with a specific MTU >1500 bytes (all of my stations use the default MTU sizes, so I haven't seen this behaviour myself) - but given that enabling the NSS cores and their driver is a very invasive change, it would be important to rule out those changes before investigating further into this direction (from a NSS specific angle, at least).

I’ll try with jumbo frames and see if I can reliably re-produce the reboots.

Yup, triggered......

Mon Jan 21 00:34:43 2019 kern.warn kernel: [187728.111885] ------------[ cut here ]------------                          
Mon Jan 21 00:34:43 2019 kern.warn kernel: [187728.112021] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:320 dev_wat0
Mon Jan 21 00:34:43 2019 kern.info kernel: [187728.115610] NETDEV WATCHDOG: eth0 (ipq806x-gmac-dwmac): transmit queue 0 t
           
[187760.323976] pgd = dba00000                                                                                           
[187760.330049] [e7e6e5f8] *pgd=00000000                                                                                 
[187760.332840] Internal error: Oops: 5 [#1] SMP ARM                                                                     
[187760.336649] Modules linked in: cp210x pl2303 ch341 ftdi_sio usbserial pppoe ppp_async ath10k_pci ath10k_core ath pppe
[187760.390804]  iptable_filter ipt_ECN ip_tables crc_ccitt compat sch_cake nf_conntrack act_skbedit act_mirred em_u32 cg
[187760.454441] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W       4.14.63 #0                                       
[187760.476668] Hardware name: Generic DT based system                                                                   
[187760.483701] task: c0c06f80 task.stack: c0c00000                                                                      
[187760.488576] PC is at skb_release_data+0x68/0x174                                                                     
[187760.493425] LR is at consume_skb+0x58/0x78                                                                           
[187760.498020] pc : [<c0732680>]    lr : [<c0732348>]    psr: 20000113                                                  
[187760.502017] sp : c0c01ca0  ip : c0c00000  fp : 00000000                                                              
[187760.508612] r10: c0c03c08  r9 : 00000000  r8 : dcb19580                                                              
[187760.513909] r7 : d7ef0a80  r6 : db9da140  r5 : db9da140  r4 : 00000000                                               
[187760.519205] r3 : 000000ba  r2 : 00000000  r1 : 00000002  r0 : e7e6e5e4                                               
[187760.525544] Flags: nzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none                                        
[187760.532141] Control: 10c5787d  Table: 5da0006a  DAC: 00000051                                                        
[187760.539431] Process swapper/0 (pid: 0, stack limit = 0xc0c00210)                                                     
[187760.545248] Stack: (0xc0c01ca0 to 0xc0c02000)                                                                                                
[187760.770185] [<c0732680>] (skb_release_data) from [<c0732348>] (consume_skb+0x58/0x78)                                
[187760.778548] [<c0732348>] (consume_skb) from [<bf47212c>] (ieee80211_rx_napi+0x7a4/0xa14 [mac80211])                  
[187760.786547] [<bf47212c>] (ieee80211_rx_napi [mac80211]) from [<bf5115cc>] (ath10k_htt_t2h_msg_handler+0x17e0/0x1854 )
[187760.795744] [<bf5115cc>] (ath10k_htt_t2h_msg_handler [ath10k_core]) from [<bf512414>] (ath10k_htt_txrx_compl_task+0x)
[187760.807223] [<bf512414>] (ath10k_htt_txrx_compl_task [ath10k_core]) from [<bf54db48>] (ath10k_pci_napi_poll+0x78/0x1)
[187760.820068] [<bf54db48>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c074768c>] (net_rx_action+0x144/0x31c)           
[187760.832075] [<c074768c>] (net_rx_action) from [<c03015c8>] (__do_softirq+0xf0/0x264)                                 
[187760.841625] [<c03015c8>] (__do_softirq) from [<c031d280>] (irq_exit+0xdc/0x148)                                      
[187760.849604] [<c031d280>] (irq_exit) from [<c0359b38>] (__handle_domain_irq+0xa8/0xc8)                                
[187760.857065] [<c0359b38>] (__handle_domain_irq) from [<c0301488>] (gic_handle_irq+0x6c/0xb8)                          
[187760.864789] [<c0301488>] (gic_handle_irq) from [<c030c70c>] (__irq_svc+0x6c/0x90)                                                                                                                         
[187760.902383] [<c030c70c>] (__irq_svc) from [<c0308928>]                                
[187760.938220] ---[ end trace 272e43a1859da539 ]---   

Is this the same error?
I never got the stack dump myself; the Oops line is always the last one in my log.
This stack trace has ath10k calls. The Oops problem seems related to the stmmac ethernet core of the ipq806x SoC.
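
A quick way to check which loadable modules a backtrace passes through is to pull the bracketed module names out of the call-trace lines. The sketch below is a hypothetical helper (`modules_in_trace` is not an existing tool) and relies on the `(function+0x.. [module])` notation seen in the dump above; timestamps and addresses are skipped because they start with a digit or `<` inside the brackets.

```shell
#!/bin/sh
# List the distinct module names mentioned in a kernel backtrace on stdin.
modules_in_trace() {
    # Match "[name]" where name starts with a letter or underscore,
    # then strip the surrounding brackets and de-duplicate.
    grep -o '\[[A-Za-z_][A-Za-z0-9_]*\]' |
        sed 's/^.//; s/.$//' |
        sort -u
}

# Example use (the file name is a placeholder):
#   modules_in_trace < oops.txt
```

Functions built into the kernel image carry no module tag, so an empty result does not by itself rule out a driver.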


What I've encountered is probably different from what others are reporting here. My builds are not using the STMMAC drivers; they use the qca-nss-gmac drivers instead.

In any case, I tried using an MTU of 4000 and was able to download a 16MB file without issue, so my R7800 rebooting spontaneously is probably not related to jumbo frames.

Guess it's more detective work for me ...

I have been trying to fix stmmac_main.c by adding code to properly deal with these unexpectedly large packets, but it requires a lot of code changes due to the way the rx ring buffer sizes are (pre)allocated based on MTU size.
The kernel panic that I hit was due to a missing call to set the proper DMA size (basically, DMA would overrun the buffer based on the real size of the packet), after which an skb buffer free would free illegal memory. There is now a patch for this problem: here. I managed to fix this and 2 other issues, but kept hitting new bugs like starvation (basically, the ethernet port hangs).
I believe the correct fix for this is now posted here. Most hangs went away, but some remained...
When you configure a larger MTU, the driver allocates a 2K, 4K or 16K buffer and in this way manages to bypass a number of defects thanks to the extra headroom.
That is why, with MTU=4088, my router has been performing flawlessly for the last 18 days, despite all the stress tests thrown at it.

I think it's worth back-porting the master branch patches (a long list...) to 4.14.93 (or whatever the current release is). There have been 20 or so major code changes/patches submitted since the 4.14 version. If I have some time I will try to generate a patch set and build a kernel.


It might be worth looking at forward-porting ipq806x to 4.19 as well. Yes, the next OpenWrt release will still ship with kernel 4.14, but master can switch immediately afterwards (and 4.19 support patches that don't toggle the default are already accepted).

Have you noticed the ongoing work at https://git.openwrt.org/?p=openwrt/staging/chunkeey.git;a=shortlog ?
