Belkin RT3200/Linksys E8450 WiFi AX discussion

I have 48 devices and never see a crash on the same chipset, same radio, though not on the same device model.

... and probably a different configuration.

Thanks to /sys/fs/pstore you can read the crash logs even without any serial port access. Please report this bug on github.com/openwrt/mt76
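For example, after the unit has come back up, something like this will list and dump the stored logs (the exact file names depend on the pstore backend; dmesg-ramoops-0 is just a typical example):

ls /sys/fs/pstore
cat /sys/fs/pstore/dmesg-ramoops-0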

For me the crashing happens when using EAP with 802.1Q under heavy traffic. I'm not sure of the cause yet. I have the same model at home hooked up to about 30 devices, though most of them are low-bandwidth IoT, and it has been up for almost 50 days. I've been stumped by this for almost a month now.

Same here; in my case I only need 5 devices.

I run snapshot builds on my RT3200s, configured as dumb APs + bridger, with WED enabled. However, I'm noticing a trend where after some time of being up (sorry, I don't have a more precise measurement at the moment) and seeing offloaded wireless flows, the offloading just stops.

I confirm this by watching /sys/kernel/debug/ppe0/bind: after a reboot I see offloaded flows, but after several hours of uptime some of my RT3200s (I have three in this configuration) stop showing any flow output from /sys/kernel/debug/ppe0/bind.
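For anyone who wants to check the same thing on their device, this is all I'm doing (the second command just counts entries):

cat /sys/kernel/debug/ppe0/bind
cat /sys/kernel/debug/ppe0/bind | wc -l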

Is anyone else seeing the same? I don't know exactly when the offloading stops, so I haven't been able to catch anything in my logs yet.

1 Like

Hi, today I was attaching an additional USB drive next to my HDD, and during the mounting process I saw a partition mounted as /dev/ubi0_3. Is that the recovery partition, and should it be mounted?
Thanks

Check your process list and see if the "bridger" process is still there; there is a good chance it has crashed. Mine crashed within a few seconds with a segmentation fault, so I haven't been able to get it to work at all...
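Assuming the usual BusyBox applets are available, a quick check looks like this:

pgrep bridger || echo "bridger is not running"
logread | grep -i bridger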

It's weird. I also have a similar setup (802.1Q + EAP), but mine has been stable for weeks. 5+ devices, though no heavy traffic.

I do notice one difference: I'm using the legacy way to set up VLANs (one bridge per VLAN), not the bridge's VLAN filtering. A rough sketch of what I mean is below.
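Roughly, via uci (the names here are just placeholders from my setup, e.g. VLAN 10 on port lan1):

uci set network.br_vlan10=device
uci set network.br_vlan10.name='br-vlan10'
uci set network.br_vlan10.type='bridge'
uci add_list network.br_vlan10.ports='lan1.10'
uci set network.vlan10=interface
uci set network.vlan10.device='br-vlan10'
uci set network.vlan10.proto='none'
uci commit network

So, one bridge per VLAN with tagged sub-interfaces as ports, instead of a single bridge with bridge-vlan sections.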

Usually, if the device crashes and subsequently reboots, you should find files in /sys/fs/pstore containing the dmesg log of the crash. If you didn't change the bootloader settings, the presence of anything in pstore will also trigger booting the recovery image (to call the attention of developers rather than silently rebooting, which may go unnoticed, and also to prevent boot loops caused by an early kernel crash).

Hi Daniel,

I haven't been able to get /sys/fs/pstore to work. I always have to pull the power supply for at least 10 seconds to get the unit back. Perhaps you could tell me what I'm doing wrong so I can access the pstore.

I have a lot of C/C++ experience, but know very little about Linux drivers. But I'm a quick study so I'm hoping that I might even be able to track down and fix this issue myself... possibly...

You might be onto something. Can you post your config? I would switch mine over and see if it makes it more stable. It would be a huge clue as to what is going on.

Note that in case of a crash the unit boots the recovery image, which means it will be available at 192.168.1.1/24 untagged (the default). If you have configured VLANs or use an address other than 192.168.1.1/24 for the router in production, this can be the cause of the confusion...

You can try the pstore feature by issuing

echo c > /proc/sysrq-trigger

That will create an artificial kernel crash, which will be stored in pstore and hence trigger a subsequent boot of the recovery image.
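After you have copied the logs off the device, deleting the files clears the backing store, so the next boot will be a normal one again:

rm /sys/fs/pstore/*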

1 Like

In my case, right now I am only seeing offloaded flows on one of my three RT3200s. On the two that are not showing any flows, the bridger process is running. For grins, I tried restarting bridger and the network to see if the flows would start again. No luck.

I'm thinking there must be something more to this, but I'm not sure what else to check at the moment. Still hoping maybe @daniel has some clues for me. :slight_smile:

I wish it were. It isn't reachable; even the switch shows zero MAC addresses learned. I believe I tried pinging 192.168.1.1, but I don't think there was an answer. I can try it again the next time it crashes.

This has probably been asked and answered a million times, so I apologize, but if I accidentally choose a non-UBI build when I am running UBI, will sysupgrade in LuCI prevent me from flashing it?

If not - what happens if I flash the non-UBI on UBI?

It should warn you about an incompatible image and prevent accidents. (You can still overrule the big red warning and flash anyway, but that would brick the device; see below.)

If you flash a non-UBI image to a UBI-enabled device, the device will not boot.
You would need to use the reset button during the boot process to revert to the initramfs recovery instance, and then use that to flash the normal UBI sysupgrade image.
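Once you are in the recovery instance, flashing the correct image is the usual sysupgrade call; the file name below is just the current naming scheme for this device, so adjust it to whatever you actually downloaded:

sysupgrade /tmp/openwrt-mediatek-mt7622-linksys_e8450-ubi-squashfs-sysupgrade.itb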

1 Like

Hi Daniel. I'll try the /proc/sysrq-trigger to see what happens. The issue I ran into is that the unit will not talk at all on the LAN; the Cisco switch shows 0 MAC addresses (i.e. "show mac add int gi0/1"). When in recovery mode, I can see that the MAC address is learned. This is some other bizarre state.

Could I have your help to take a quick look at something? I may have found a race condition in the mt76 code.

I believe the call to ieee80211_rx_list() is missing certain protections outlined in the mac80211.h comments, such as disabling BH, taking the RCU read lock, and spinlock protection to ensure that ieee80211_tx_status_ext() can never be called at the same time.

Also, it appears that ieee80211_tx_status_ext() can be called in multiple places simultaneously without a spinlock, while the documentation on that function states that calls must be synchronized with each other, and also with any ieee80211_rx() call.

I'm adding in these protections to see if they resolve the crash, but wanted your opinion on whether this looks like the smoking gun.

I outlined some of it here --> https://github.com/openwrt/openwrt/issues/11995#issuecomment-1447420166
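For reference, the locking requirements I'm reading are documented in the kernel's mac80211 header; in a kernel source tree they can be pulled up with:

grep -B 2 -A 10 'ieee80211_rx_list' include/net/mac80211.h
grep -B 2 -A 10 'ieee80211_tx_status_ext' include/net/mac80211.h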

1 Like

I suspect this is actually related to https://github.com/openwrt/mt76/issues/754 (and potentially this as well: https://github.com/openwrt/mt76/issues/749). Something is causing my mt7915e driver to crash. Given these offloaded flows only apply to the 5 GHz radio, this likely explains why I stop seeing offloaded flows.

I'll try to pay better attention to this next time it occurs and confirm if this theory holds.

UPDATE: My theory does NOT hold true. At over 24 hours of uptime, I am no longer seeing offloaded flows on any of my three APs, and yet I have not seen the mt7915e driver crash.

UPDATE 2: @nbd has several new mt76 commits in master, so I'm going to grab them and see if any of them address what I'm seeing. Eyes on this one especially: https://github.com/openwrt/openwrt/commit/4dd0eaffc142e9782681353a53748e68cd731d49

1 Like

Just a quick follow-up on my last tests from:

now on:
SNAPSHOT r22190-12a3c863d2
12:03:36 up 1 day, 11:46, load average: 0.29, 0.41, 0.27
CH100, Tx-Power: 24 dBm
MacBook Pro M1
bluetooth disabled (same as last time)

2m away:
80 MHz set in config
-44 dBm
1200.9 Mbit/s, 80 MHz, HE-MCS 11, HE-NSS 2
1134.2 Mbit/s, 80 MHz, HE-MCS 11, HE-NSS 2, HE-GI 1

~ iperf3 -c ethernet_connected_host.local -P 4
[SUM]   0.00-10.00  sec   979 MBytes   821 Mbits/sec                  sender
[SUM]   0.00-10.01  sec   973 MBytes   816 Mbits/sec                  receiver
~ iperf3 -c ethernet_connected_host.local -P 4 -R
[SUM]   0.00-10.01  sec  1005 MBytes   842 Mbits/sec                  sender
[SUM]   0.00-10.00  sec   997 MBytes   837 Mbits/sec                  receiver

160 MHz set in config

~ iperf3 -c ethernet_connected_host.local -P 4
[SUM]   0.00-10.00  sec   969 MBytes   813 Mbits/sec                  sender
[SUM]   0.00-10.01  sec   963 MBytes   807 Mbits/sec                  receiver
~ iperf3 -c ethernet_connected_host.local -P 4 -R
[SUM]   0.00-10.01  sec   917 MBytes   768 Mbits/sec                  sender
[SUM]   0.00-10.00  sec   909 MBytes   763 Mbits/sec                  receiver

160 MHz
10 m + 1 thin brick wall away:
-64 dBm
576.4 Mbit/s, 80 MHz, HE-MCS 5, HE-NSS 2
680.6 Mbit/s, 80 MHz, HE-MCS 7, HE-NSS 2, HE-GI 1

~ iperf3 -c ethernet_connected_host.local -P 4
[SUM]   0.00-10.00  sec   579 MBytes   486 Mbits/sec                  sender
[SUM]   0.00-10.02  sec   576 MBytes   482 Mbits/sec                  receiver
~ iperf3 -c ethernet_connected_host.local -P 4 -R
[SUM]   0.00-10.02  sec   564 MBytes   472 Mbits/sec                  sender
[SUM]   0.00-10.00  sec   560 MBytes   469 Mbits/sec                  receiver

So from my perspective everything is working correctly on the 5 GHz radio.
However, I started getting abysmal performance on 2.4 GHz:

144.4 Mbit/s, 20 MHz, MCS 15, Short GI
144.4 Mbit/s, 20 MHz, MCS 15, Short GI

This is expected:

~ iperf3 -c ethernet_connected_host.local -P 4
[SUM]   0.00-10.00  sec   102 MBytes  85.4 Mbits/sec                  sender
[SUM]   0.00-10.13  sec  99.8 MBytes  82.7 Mbits/sec                  receiver

This doesn't seem to be right:

~ iperf3 -c ethernet_connected_host.local -P 4 -R
[SUM]   0.00-10.00  sec  7.58 MBytes  6.36 Mbits/sec                  sender
[SUM]   0.00-10.00  sec  7.09 MBytes  5.94 Mbits/sec                  receiver

I observe it on an iPhone 12 too. I think it was introduced several builds ago; the degradation appeared not in the build that restored 160 MHz, but in the one that preceded it.

2 Likes