Belkin RT3200/Linksys E8450 WiFi AX discussion

benciphered · October 3, 2021, 5:15pm

I'm encountering a problem that results in the same error posted here:

Oct  3 12:08:59 ROUTER1A kernel: [12276.321915] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000053

It happens approximately every 4-10 hours. It's an issue with SNAPSHOT r17652-21c7a8593d and also happened on r17443 from the UBI installer repo.

This morning I attempted to get pstore data, but I was unable to contact the router at its assigned IP address, 10.10.1.1. I also didn't have any luck pinging 192.168.1.1. I will try to access it through either SSH or the web interface on the default IPs/ports when it next fails, and hopefully I'll have more data to provide.

One thing that I do note is that the router continued to access the Internet and route traffic through Ethernet. It's the WiFi (both 2.4 and 5.0 radios) that wasn't functional. The fact that it routed traffic as a gateway makes me think that it was still operating as 10.10.1.1, though I couldn't get a ping response and didn't note any traffic from that IP with Wireshark.

I've been following this thread for the past couple days as I tried to diagnose the issue. If I am able to get pstore logs/data, I'll post them here. This router is not part of a mesh or anything like that. Normal gateway router/AP only.

themeanfarmer · October 3, 2021, 6:46pm

Hello. What channel does everyone use on 5GHz? I live in an apartment with about 50 networks.

Lynx · October 3, 2021, 7:14pm

I think in that case I would consider DFS. You ought to be able to scan and monitor and see which channel segment is the least occupied.

BTW for anyone wanting more SQM / VPN performance, just enable irqbalance. I noticed significantly decreased loadavg with that enabled. And I saw on another thread that this router can manage 1Gbit SQM with it enabled. Actually, is there any reason not to have this enabled?

Finally, has anyone got 160MHz mesh or WDS working? Or 160MHz working well in general? As in greater throughput than 80MHz.

benciphered · October 4, 2021, 1:35am

I haven't really tried 160MHz. Originally tried to get it working back in June/July before LuCI had full ax support and couldn't get a stable connection. So I just set it to 80MHz and kept it there until earlier today experimenting a little bit. From my quick tests it did connect at 160MHz and report as such on the status page / station list. Have no idea about performance / stability, this was just a quick test before I decided I should probably keep the router on 80MHz until this crash problem is sorted so I don't introduce any additional variables in the troubleshooting process.

Is there any documentation on channel selection in this router, especially at 160MHz? I know with some routers only certain channels work with wider bandwidth, and it doesn't seem that all routers refer to the block of channels in a standard way. I seem to recall that when specifying the channel number, some routers want you to select the low numbered channel, some want the high numbered channel, and some want you to choose the center channel. Not sure how this works with this model, going to see what I can find out about this.

benciphered · October 4, 2021, 1:37am

My router just had the oops occur again and I was finally able to access it.

The router is running r17652-21c7a8593d, and after the oops it booted into recovery SNAPSHOT r17443-90e167abaa. Accessing the router at 192.168.1.1 and default ports was successful.

I was able to obtain dmesg-ramoops-0 and dmesg-ramoops-1 which appear to have identical data about the trace, as follows:

<1>[17357.656699] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000053
<1>[17357.665498] Mem abort info:
<1>[17357.668285]   ESR = 0x96000005
<1>[17357.671331]   EC = 0x25: DABT (current EL), IL = 32 bits
<1>[17357.676646]   SET = 0, FnV = 0
<1>[17357.679690]   EA = 0, S1PTW = 0
<1>[17357.682820] Data abort info:
<1>[17357.685695]   ISV = 0, ISS = 0x00000005
<1>[17357.689521]   CM = 0, WnR = 0
<1>[17357.692480] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000042bf2000
<1>[17357.698915] [0000000000000053] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
<0>[17357.707615] Internal error: Oops: 96000005 [#1] SMP
<7>[17357.712483] Modules linked in: xt_connlimit pppoe ppp_async nf_conncount iptable_nat xt_state xt_nat xt_helper xt_conntrack xt_connmark xt_connbytes xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD xt_CT pppox ppp_generic nf_nat nf_flow_table nf_conntrack_netlink nf_conntrack mt7915e mt7615e mt7615_common mt76_connac_lib mt76 mac80211 ipt_REJECT cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_recent xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY slhc sch_cake nf_reject_ipv4 nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 iptable_raw iptable_mangle iptable_filter ipt_ECN ip_tables hwmon crc_ccitt compat sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip
<7>[17357.712647]  ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ifb vfat fat autofs4 nls_utf8 nls_iso8859_1 nls_cp437 seqiv uas usb_storage leds_gpio xhci_plat_hcd gpio_button_hotplug
<7>[17357.831984] CPU: 1 PID: 1197 Comm: napi/phy1-9 Tainted: G S                5.10.70 #0
<7>[17357.839802] Hardware name: Linksys E8450 (UBI) (DT)
<7>[17357.844671] pstate: 60000005 (nZCv daif -PAN -UAO -TCO BTYPE=--)
<7>[17357.850677] pc : mt76_tx_status_skb_done+0x0/0x80 [mt76]
<7>[17357.855984] lr : mt7915_queue_rx_skb+0xa04/0x154c [mt7915e]
<7>[17357.861545] sp : ffffffc010ea3c20
<7>[17357.864850] x29: ffffffc010ea3c20 x28: ffffff8002375e00 
<7>[17357.870154] x27: 0000000040000000 x26: ffffff80023738d0 
<7>[17357.875458] x25: ffffff8002375ec0 x24: 0000000000000000 
<7>[17357.880762] x23: 0000000000000052 x22: ffffff800248f848 
<7>[17357.886066] x21: ffffff8000aacbb8 x20: ffffff800248f828 
<7>[17357.891370] x19: ffffff8002372800 x18: 0000000000000000 
<7>[17357.896674] x17: 0000000000000000 x16: 0000000000000000 
<7>[17357.901978] x15: 0000000000000000 x14: 0000000000000004 
<7>[17357.907282] x13: 0000000000000000 x12: 0000000000000006 
<7>[17357.912586] x11: ffffff8002374800 x10: ffffffc010a96000 
<7>[17357.917890] x9 : 0000000000000064 x8 : ffffff8002372800 
<7>[17357.923194] x7 : ffffff8002374950 x6 : 00000001001a0728 
<7>[17357.928498] x5 : fffffffeffe5f939 x4 : ffffff8002bff300 
<7>[17357.933801] x3 : ffffffc010ea3d18 x2 : ffffffc010ea3d18 
<7>[17357.939105] x1 : 0000000000000000 x0 : ffffff8002372800 
<7>[17357.944409] Call trace:
<7>[17357.946850]  mt76_tx_status_skb_done+0x0/0x80 [mt76]
<7>[17357.951808]  mt76_dma_rx_poll+0x284/0x764 [mt76]
<7>[17357.956420]  __napi_poll+0x34/0x140
<7>[17357.959900]  napi_threaded_poll+0x84/0xf0
<7>[17357.963903]  kthread+0x120/0x124
<7>[17357.967123]  ret_from_fork+0x10/0x18
<0>[17357.970694] Code: 54fffa41 d2800000 d65f03c0 d503201f (39414c24) 
<4>[17357.976778] ---[ end trace 0d2d5ed54117714c ]---

I hope this is a help. If there's anything else I could gather or try I'd be glad to.
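For anyone who wants to read the dump rather than just match it against other reports, here is a minimal Python sketch that picks apart two values from the trace above: the `ESR = 0x96000005` syndrome word and the instruction word in parentheses on the `Code:` line. The helper names `decode_esr` and `decode_ldrb_uimm` are made up for illustration. The bit positions follow the AArch64 ESR_EL1 layout (EC in bits 31:26, IL in bit 25, ISS in bits 24:0) and the LDRB (immediate, unsigned offset) encoding, and nothing here touches the driver source.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_EL1 value into the fields the oops prints."""
    iss = esr & 0x1FFFFFF           # bits 24:0, instruction specific syndrome
    return {
        "ec": (esr >> 26) & 0x3F,   # exception class, 0x25 = data abort at current EL
        "il": (esr >> 25) & 0x1,    # 1 = 32-bit instruction
        "iss": iss,
        "wnr": (iss >> 6) & 0x1,    # 0 = the faulting access was a read
        "dfsc": iss & 0x3F,         # data fault status code
    }


def decode_ldrb_uimm(insn: int):
    """Decode an LDRB (immediate, unsigned offset) instruction word.

    Returns (rt, rn, offset), or None if the word is some other encoding.
    """
    if (insn >> 22) != 0xE5:        # bits 31:22 must be 0b0011100101
        return None
    offset = (insn >> 10) & 0xFFF   # imm12, a byte offset for LDRB
    rn = (insn >> 5) & 0x1F         # base register number
    rt = insn & 0x1F                # destination register number
    return rt, rn, offset


fields = decode_esr(0x96000005)
print({name: hex(value) for name, value in fields.items()})

rt, rn, offset = decode_ldrb_uimm(0x39414C24)
print(f"ldrb w{rt}, [x{rn}, #{offset:#x}]")
```

If I'm reading it right, DFSC 0x05 is a level 1 translation fault, which fits the `pud=0000000000000000` line, and the faulting instruction is a byte load from x1 + 0x53. The register dump shows x1 = 0, which would explain the 0000000000000053 address and suggests the second argument to mt76_tx_status_skb_done was NULL on entry. Treat that as a reading of the dump, not a confirmed root cause.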

drikus · October 4, 2021, 4:59am

These NULL pointer dereferences at virtual address 0000000000000053 have been seen and reported in this thread since the end of July. I had multiple occurrences as well, then reverted to a version from before the end of July, OpenWrt SNAPSHOT r17114-349e2b7e65.

It is reported here
and supposedly fixed 10 days ago.
Not sure when it will end up in OpenWrt snapshot builds though. Does anyone know?

hnyman · October 4, 2021, 5:09am

It pretty much depends on when @nbd updates the mt76 driver in OpenWrt to reflect the newest stuff. He last bumped the version in July, as shown here:

https://git.openwrt.org/?p=openwrt/openwrt.git;a=history;f=package/kernel/mt76;hb=HEAD

(Note that he committed that fix in mt76, so he knows about it.)

benciphered · October 4, 2021, 2:10pm

Thank you for the clarification.

I saw this reported earlier in the thread, along with the request for pstore logs. I also followed the link to the issue on GitHub, saw the commit, and thought that meant it'd be in snapshot builds from then on. Since I was still seeing what looked like the same issue, I assumed it must not have been a full fix. I didn't realize until you pointed it out that the commit doesn't get incorporated into subsequent OpenWrt snapshots without further steps. My mistake.

I think I'll revert later today to r17114 on my main router.

bobbythomas · October 4, 2021, 2:48pm

I am also waiting for this fix to be incorporated in the nightly build as my router goes into recovery mode every 2-3 days. Thinking of writing a script to clear the pstore content and to reboot the router whenever it goes into recovery for the time being.
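A rough sketch of what such a stopgap could look like, written in Python only so the logic is easy to follow and test off the router. A stock image doesn't ship Python, so on the device itself the same steps would more likely be a few lines of shell. Everything named here is an assumption: the helper `collect_and_clear`, the `/root/oops` archive location, and the idea that the records show up as `dmesg-ramoops-*` files under `/sys/fs/pstore` (the usual pstore mount point). It copies the records before deleting them so the crash data isn't lost.

```python
import shutil
import time
from pathlib import Path


def collect_and_clear(pstore_dir, archive_dir, stamp=None):
    """Copy pstore crash records into a per-run archive folder, then delete them.

    Returns the names of the records that were handled (empty if none exist).
    """
    pstore = Path(pstore_dir)
    records = sorted(pstore.glob("dmesg-ramoops-*"))
    if not records:
        return []

    # One sub-folder per run so a later crash doesn't overwrite an earlier one.
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    target = Path(archive_dir) / stamp
    target.mkdir(parents=True, exist_ok=True)

    handled = []
    for record in records:
        shutil.copy(record, target / record.name)
        record.unlink()  # removing the file clears the record from pstore
        handled.append(record.name)
    return handled


# Hypothetical use on the router (not executed here):
#   import subprocess
#   if collect_and_clear("/sys/fs/pstore", "/root/oops"):
#       subprocess.run(["reboot"], check=False)
```

Bear in mind this only hides the symptom: the router still crashes, it just stops waiting in recovery afterwards.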

neheb · October 4, 2021, 8:13pm

Anyone have any luck enabling 160MHz for the 5GHz interface?

Lynx · October 4, 2021, 8:31pm

Yes. Read the thread! Set channel to 36 and it works. But throughput seems no better and mesh doesn't work.

neheb · October 4, 2021, 9:14pm

meh that's disappointing. 160 on unused spectrum would be lovely.

quarky · October 4, 2021, 10:23pm

You could just change the u-boot bootcmd parameter to not boot into recovery, i.e. don’t check for pstore contents.

Lynx · October 5, 2021, 4:54pm

Dear @daniel @hnyman,

My replacement arrived. I have rebooted many times with no issues. So it really does seem like the issue was some form of hardware failure, although I wonder whether it could still have been worked around if the chip could be powered down and back up in software.

hlew · October 6, 2021, 6:46am

For those of you asking about using 160MHz on the low interference channels --

I'm running 160MHz on the DFS channel block (100-166) with the following configuration and it's working great with two RT3200s (one as a router and the other as a wired AP), both on the same channel with DAWN. Stability is good, both the 2.4GHz and 5GHz radios are using the same SSID, and my devices seem to roam OK. I've only had it up for 2 days though, and I have yet to migrate my IoT devices over from my old network, so I can't say how it will hold up in the long run.

config wifi-device 'radio0'
        option type 'mac80211'
        option path 'platform/18000000.wmac'
        option channel '1'
        option band '2g'
        option htmode 'HT20'
        option cell_density '0'
        option country 'US'
        option txpower '13'
        option disabled '0'

config wifi-device 'radio1'
        option type 'mac80211'
        option path '1a143000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0'
        option band '5g'
        option hwmode '11a'
        option channels '36-52, 100-166'
        option ht_capab 'ieee80211h'
        option cell_density '0'
        option htmode 'HE160'
        option channel '112'
        option country 'US'
        option disabled '0'

config wifi-iface 'default_radio1'
        option device 'radio1'
        option network 'lan'
        option mode 'ap'
        option encryption 'sae-mixed'
        option ieee80211k '1'
        option ieee80211v '1'
        option skip_inactivity_poll '1'
        option ieee80211r '1'
        option ft_over_ds '1'
        option ft_psk_generate_local '1'
        option ieee80211w '1'
        option key 'xxxxxxxx'
        option ssid 'xxxxxxxx'
        option disabled '0'

config wifi-iface 'wifinet2'
        option device 'radio0'
        option mode 'ap'
        option network 'lan'
        option encryption 'sae-mixed'
        option ieee80211k '1'
        option ieee80211v '1'
        option ieee80211r '1'
        option ft_over_ds '1'
        option ft_psk_generate_local '1'
        option ieee80211w '1'
        option key 'xxxxxxx'
        option ssid 'xxxxxxxx'
        option disabled '0'
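On the channel-numbering question raised earlier in the thread: as far as I can tell, OpenWrt takes the primary 20MHz channel in `option channel` and works out the 160MHz center itself, so the '112' above would sit in the 100-128 block. Here is a minimal Python sketch of that mapping. The `BLOCKS_160` table and the `block_160` helper are made up for illustration and list only the two blocks I'm sure of. Whether a block is actually usable depends on your regulatory domain and DFS rules.

```python
# 160MHz blocks in the 5GHz band, as (first, last) 20MHz channel numbers.
# Assumption: only these two blocks are listed; other regions may allow more.
BLOCKS_160 = [(36, 64), (100, 128)]


def block_160(primary: int):
    """Map a primary 20MHz channel to its 160MHz block.

    Returns (first, last, center_channel, center_mhz), or None if the
    channel is not a valid primary inside one of the listed blocks.
    """
    for first, last in BLOCKS_160:
        # 20MHz channels are numbered in steps of 4 (36, 40, 44, ...).
        if first <= primary <= last and (primary - first) % 4 == 0:
            center = (first + last) // 2
            # In the 5GHz band a channel number maps to 5000 + 5 * n MHz.
            return first, last, center, 5000 + 5 * center
    return None


for channel in (36, 112, 149):
    print(channel, block_160(channel))
```

So both the channel 36 suggestion and the channel 112 config fall inside a full 160MHz block, while 149 does not appear in this table.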

hlew · October 6, 2021, 6:54am

The only test that I did was to connect one RT3200 as a client to the other on a 160MHz wide channel, since I don't have any AX-enabled devices. Throughput was around 500-600Mbps, which seems higher than what I can usually achieve with 802.11ac at 80MHz (~400-500Mbps).

emontes · October 6, 2021, 2:45pm

One of my routers started playing up, just saw the following in the logs:

Oct  6 15:39:49 router-mesh kernel: [ 1476.190032] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000053
Oct  6 15:39:49 router-mesh kernel: [ 1476.198853] Mem abort info:
Oct  6 15:39:49 router-mesh kernel: [ 1476.201655]   ESR = 0x96000005
Oct  6 15:39:49 router-mesh kernel: [ 1476.204700]   EC = 0x25: DABT (current EL), IL = 32 bits
Oct  6 15:39:49 router-mesh kernel: [ 1476.210002]   SET = 0, FnV = 0
Oct  6 15:39:49 router-mesh kernel: [ 1476.213054]   EA = 0, S1PTW = 0
Oct  6 15:39:49 router-mesh kernel: [ 1476.216188] Data abort info:
Oct  6 15:39:49 router-mesh kernel: [ 1476.219060]   ISV = 0, ISS = 0x00000005
Oct  6 15:39:49 router-mesh kernel: [ 1476.222895]   CM = 0, WnR = 0
Oct  6 15:39:49 router-mesh kernel: [ 1476.225865] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000041722000
Oct  6 15:39:49 router-mesh kernel: [ 1476.232322] [0000000000000053] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000

Only remedy is a reboot.
I'm running OpenWrt SNAPSHOT r17677-f82c93b93c. I was running one or two builds before this one and decided to upgrade to see if the problem would stop.
Managed to get a week without any issue but looks like there's something else going on.
Anything I can grab that could help in troubleshooting?

Edit: I have 2 routers and just realised both are doing this.
Edit2: Just saw these have been reported before (I searched for "mem abort info" and that's why I hadn't seen them).
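Since the "Mem abort info" line is shared by many unrelated crashes, the first line of the oops is the more useful thing to search for. A small Python sketch for pulling it out of a saved log follows. The `find_null_derefs` helper and the regex are my own, and they assume the log keeps the `[uptime]` prefix and the 16-digit address format shown in the excerpts in this thread.

```python
import re

# Matches the first line of the oops, with or without a syslog prefix in front.
OOPS_RE = re.compile(
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+"
    r"Unable to handle kernel NULL pointer dereference "
    r"at virtual address (?P<addr>[0-9a-f]{16})"
)


def find_null_derefs(log_text: str):
    """Return a list of (uptime_seconds, fault_address) for each oops found."""
    return [
        (float(m["uptime"]), int(m["addr"], 16))
        for m in OOPS_RE.finditer(log_text)
    ]


sample = (
    "Oct  6 15:39:49 router-mesh kernel: [ 1476.190032] Unable to handle "
    "kernel NULL pointer dereference at virtual address 0000000000000053\n"
    "Oct  6 15:39:49 router-mesh kernel: [ 1476.198853] Mem abort info:\n"
)
for uptime, addr in find_null_derefs(sample):
    print(f"oops after {uptime:.0f}s uptime, fault address {addr:#x}")
```

Comparing the fault address (0x53 in every report so far) is a quick way to tell whether a new crash is likely the same bug.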

mike · October 6, 2021, 4:22pm

I'm seeing the same thing. I don't have any logs unfortunately, but with hardware flow offloading enabled there are one or two crashes a day, while with hardware flow offloading disabled (software flow offloading still enabled) there aren't any crashes.

dasher · October 6, 2021, 4:55pm

Is there any reason not to modify the logic of u-boot's bootcmd to ignore the content of pstore?

My RT3200 also boots into recovery about every three days and I’m not always available to fix it.

hnyman · October 6, 2021, 5:33pm

Well, the current behaviour (of booting into recovery if pstore present) is @daniel 's explicit choice for this router. It may well be suited for development, debugging & error reporting in the development phase, but it may be strange for end-users who merely wish that the router recovers automatically. Personally I am not sure if that should remain the default when the router gets into the next release...

To my knowledge there is no real reason not to simplify the bootcmd, and daniel himself mentions the possibility in