Mt76 wireless driver debugging

_FailSafe · March 17, 2023, 1:18am

If I'm not mistaken, @timothyjward, the Archer C6 v3 has mt76 based WLAN cards, right? Do you happen to know exactly which mt76 chipsets they are?

Edit: Found it here: MediaTek MT7603EN, MediaTek MT7613BEN

timothyjward · March 17, 2023, 6:22am

That’s right. I also know that the MT7613 uses the same kernel modules as the MT7615e, so I’m pretty sure that I’m running code being changed in these commits.

Brain2000 · March 17, 2023, 9:51am

I would set the following to see if it fixes it:

option mesh_fwding '1'
option encryption 'sae'
option key 'SomeEncryptionKey'
option mesh_rssi_threshold '0'
option disassoc_low_ack '0'
option macaddr '99:98:97:96:95:94'

I manually set a MAC address because some IoT devices had trouble connecting unless it differed from the default. Keep in mind that every device in the mesh needs a unique MAC address, so don't copy the same one across all your routers.
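When picking a manual address, it may help to check the two flag bits in the first octet: bit 0 marks a group (multicast) address and bit 1 marks a locally administered one. The sketch below does that check in plain C; the function name is made up for illustration and is not part of OpenWrt. Note that the sample address above starts with 0x99, which has the group bit set, so it is best read as a placeholder rather than a value to copy.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical helper: true if `s` is a unicast, locally administered MAC. */
static bool mac_is_local_unicast(const char *s)
{
    unsigned int o[6];
    char extra;

    /* Expect exactly six hex octets separated by colons, nothing after. */
    if (sscanf(s, "%2x:%2x:%2x:%2x:%2x:%2x%c",
               &o[0], &o[1], &o[2], &o[3], &o[4], &o[5], &extra) != 6)
        return false;

    bool group = (o[0] & 0x01) != 0;  /* I/G bit: 1 = group/multicast */
    bool local = (o[0] & 0x02) != 0;  /* U/L bit: 1 = locally administered */
    return !group && local;
}
```

For example, `mac_is_local_unicast("02:98:97:96:95:94")` is true, while the same call with a first octet of `99` is false.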

Unless you're running batman or equivalent, you probably want mesh_fwding '1'.

The rssi threshold and disassoc_low_ack settings also keep it from disconnecting in low-signal conditions.

This is the only difference in configuration that I'm using and it has been working great. The only other thing that might be different is I have offload hardware/software disabled.

It also might be easier to try pinging ff02::2 instead of ff02::1 to get just the routers to respond instead of all nodes. My two home routers have been up for over 30 days and both still respond to ff02::2

_FailSafe · March 17, 2023, 10:14am

@Brain2000 No overnight crashes! All three RT3200s are still cranking along just fine!

WED flows finally stopped reporting in /sys/kernel/debug/ppe0/bind on all my APs, so WED is where I’m going to start turning my focus again. But, undoubtedly I’ll be basing my debug upon your current commit.

Brain2000 · March 17, 2023, 2:32pm

I was looking up WED today. It appears to be a MediaTek invention, but I can't find any docs or specs anywhere. I wonder if they sent them to Felix (the original author) but never published them online. I may contact MediaTek to see if I can get a copy.

I looked again at that code that adds one to the wed token counter, and it's basically doing this:

Pick a number between 1 and 4095
If the number is > 4095, then add one to the WED token counter.

So unless I'm misinterpreting something, that line seems unreachable. I need the hardware docs to know which way to go...
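To make the "unreachable" reading concrete, here is a minimal userspace sketch of the pattern as described above. It is not the mt76 code: `alloc_token`, `take_token`, `TOKEN_MAX` and the counter are hypothetical stand-ins, and the only assumption carried over from the post is that the allocator can never hand out a value above 4095.

```c
#include <stdint.h>

#define TOKEN_MAX 4095  /* upper bound taken from the description above */

static unsigned int wed_token_count;  /* hypothetical WED token counter */

/* Hypothetical allocator: maps any seed to an ID in 1..TOKEN_MAX. */
static int alloc_token(uint32_t seed)
{
    return 1 + (int)(seed % TOKEN_MAX);
}

static int take_token(uint32_t seed)
{
    int token = alloc_token(seed);

    /* The allocator never returns more than TOKEN_MAX, so this
     * condition cannot be true and the counter never moves. */
    if (token > TOKEN_MAX)
        wed_token_count++;

    return token;
}
```

If the real threshold differs from the allocator's upper bound, the branch is reachable after all, which is why the hardware documentation matters here.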

_FailSafe · March 17, 2023, 3:55pm

Not sure exactly what I'm looking for, or what I will find, but I'm running a build now where I've added a ~~ridiculous~~ large amount of debug output to get a big picture view of which WED functions are in play and in what order.

Now I just need to wait for the flows to stop and see what happened in the kernel log. More to come...

neheb · March 17, 2023, 4:35pm

He gets documents under NDA to write drivers AFAIK.

timothyjward · March 17, 2023, 5:24pm

What is the goal with this set of changes? I think it is possible that what you have done already would fix issues on a number of devices, even without WED.

Are you hoping/able to get what you have into a state where it could be submitted as a PR? I’m sure that a number of Archer C6 v3.2 owners would be keen to test it out.

_FailSafe · March 17, 2023, 5:26pm

@Brain2000 Nothing conclusive yet, but some observations thus far...

While I haven't seen the flows stop yet, I am seeing a pattern that is intriguing to me:

root@AP-2:~# dmesg | grep WED
[    7.507859] mt7915e 0000:01:00.0: WED start at IRQ mask
[   43.886992] mt7915e 0000:01:00.0: WED offload enable
[ 1544.207861] mt7915e 0000:01:00.0: WED offload disable
[ 1571.365862] mt7915e 0000:01:00.0: WED offload enable
[ 1603.418618] mt7915e 0000:01:00.0: WED offload disable
[ 1612.465212] mt7915e 0000:01:00.0: WED offload enable
[ 1689.582934] mt7915e 0000:01:00.0: WED offload disable
[ 1695.633972] mt7915e 0000:01:00.0: WED offload enable
[ 1724.676944] mt7915e 0000:01:00.0: WED offload disable
[ 1756.023433] mt7915e 0000:01:00.0: WED offload enable
[ 1861.177012] mt7915e 0000:01:00.0: WED offload disable
[ 1871.222477] mt7915e 0000:01:00.0: WED offload enable
[ 1905.279921] mt7915e 0000:01:00.0: WED offload disable
[ 1937.451876] mt7915e 0000:01:00.0: WED offload enable
[ 1966.498024] mt7915e 0000:01:00.0: WED offload disable
[ 1997.681381] mt7915e 0000:01:00.0: WED offload enable
[ 2052.760359] mt7915e 0000:01:00.0: WED offload disable
[ 2053.811145] mt7915e 0000:01:00.0: WED offload enable
[ 2093.870167] mt7915e 0000:01:00.0: WED offload disable
[ 2113.950274] mt7915e 0000:01:00.0: WED offload enable
[ 2147.999245] mt7915e 0000:01:00.0: WED offload disable
[ 2175.239666] mt7915e 0000:01:00.0: WED offload enable
[ 2208.295078] mt7915e 0000:01:00.0: WED offload disable
[ 2235.459152] mt7915e 0000:01:00.0: WED offload enable
[ 2273.360873] mt7915e 0000:01:00.0: WED offload disable
[ 2300.688621] mt7915e 0000:01:00.0: WED offload enable
[ 2573.076400] mt7915e 0000:01:00.0: WED offload disable
[ 2577.116001] mt7915e 0000:01:00.0: WED offload enable

These are from basic dev_info() calls I am making within mt7915_mmio_wed_offload_enable() and mt7915_mmio_wed_offload_disable() in mmio.c. This behavior may be perfectly normal, but I'm trying to do more digging into it now.
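To judge whether the enable/disable cadence is normal, it can help to turn the log into numbers. The sketch below is a small userspace parser, assuming the exact message strings shown in the log above ("WED offload enable" and "WED offload disable"); the function and enum names are made up for illustration.

```c
#include <stdio.h>
#include <string.h>

enum wed_event { WED_NONE, WED_ENABLE, WED_DISABLE };

/* Parse one dmesg line; store its timestamp in *ts and return the event. */
static enum wed_event parse_wed_line(const char *line, double *ts)
{
    if (sscanf(line, "[%lf]", ts) != 1)
        return WED_NONE;
    if (strstr(line, "WED offload enable"))
        return WED_ENABLE;
    if (strstr(line, "WED offload disable"))
        return WED_DISABLE;
    return WED_NONE;
}

/* Sum the seconds spent with offload enabled across `n` log lines. */
static double wed_enabled_seconds(const char *const *lines, size_t n)
{
    double total = 0.0, since = 0.0, ts = 0.0;
    int enabled = 0;

    for (size_t i = 0; i < n; i++) {
        switch (parse_wed_line(lines[i], &ts)) {
        case WED_ENABLE:
            since = ts;
            enabled = 1;
            break;
        case WED_DISABLE:
            if (enabled)
                total += ts - since;
            enabled = 0;
            break;
        default:
            break;
        }
    }
    return total;
}
```

Feeding it the lines from `dmesg | grep WED` gives the total time offload was active, which can be compared between APs.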

_FailSafe · March 17, 2023, 5:30pm

In full transparency, @Brain2000 is driving the current changes to address this:

github.com/openwrt/openwrt

Intermittent kernel dump (5.15.92) when adding a new STA to a dynamic VLAN (Belkin RT3200)

opened 06:58PM - 12 Feb 23 UTC

Brain2000

bug

### Describe the bug

When adding a new wifi station to a dynamic VLAN, a kernel dump will occur anywhere from once every couple of hours to a couple of days. The more devices that are roaming around the wifi, the more likely it is to crash.

Currently the snapshot (OpenWrt SNAPSHOT r22020-1c31ca5da9) is installed but this also happens on the current stable release as well.

Here is the log. I hope it is enough to discern some ideas, but if not I may need to learn to compile OpenWRT if we need a full stacktrace:

```
2023-02-10 18:08:20 Daemon.Notice WIFI-Dev hostapd: wl1-ap0: CTRL-EVENT-EAP-STARTED 7c:b0:c2:81:42:23
2023-02-10 18:08:20 Daemon.Notice WIFI-Dev hostapd: wl1-ap0: CTRL-EVENT-EAP-PROPOSED-METHOD vendor=0 method=1
2023-02-10 18:08:21 Daemon.Info WIFI-Dev hostapd: wl1-ap0: STA 7c:b0:c2:81:42:23 RADIUS: VLAN ID 4099
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.268763] br-vlan15: port 6(wl1-ap0.4110) entered disabled state
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.308176] device wl1-ap0.4110 left promiscuous mode
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.308208] br-vlan15: port 6(wl1-ap0.4110) entered disabled state
2023-02-10 18:08:21 Daemon.Error WIFI-Dev hostapd: VLAN: br_delif: Failure determining interface index for 'wl1-ap0.4110'
2023-02-10 18:08:21 Daemon.Notice WIFI-Dev hostapd: wl1-ap0: CTRL-EVENT-EAP-SUCCESS2 7c:b0:c2:81:42:23
2023-02-10 18:08:21 Daemon.Info WIFI-Dev hostapd: wl1-ap0: STA 7c:b0:c2:81:42:23 RADIUS: starting accounting session 4AE36C9075097886
2023-02-10 18:08:21 Daemon.Info WIFI-Dev hostapd: wl1-ap0: STA 7c:b0:c2:81:42:23 IEEE 802.1X: authenticated - EAP type: 25 (PEAP)
2023-02-10 18:08:21 Daemon.Notice WIFI-Dev netifd: bridge 'br-vlan11' link is up
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.436574] br-vlan11: port 1(br-lan.11) entered blocking state
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.436599] br-vlan11: port 1(br-lan.11) entered disabled state
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.436889] device br-lan.11 entered promiscuous mode
2023-02-10 18:08:21 Daemon.Notice WIFI-Dev netifd: bridge 'br-vlan11' link is down
2023-02-10 18:08:21 Daemon.Notice WIFI-Dev netifd: VLAN 'br-lan.11' link is up
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.439792] br-vlan11: port 1(br-lan.11) entered blocking state
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.439820] br-vlan11: port 1(br-lan.11) entered forwarding state
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.440987] br-vlan11: port 2(wl1-ap0.4099) entered blocking state
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.441012] br-vlan11: port 2(wl1-ap0.4099) entered disabled state
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.441246] device wl1-ap0.4099 entered promiscuous mode
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.441354] br-vlan11: port 2(wl1-ap0.4099) entered blocking state
2023-02-10 18:08:21 Kernel.Info WIFI-Dev kernel: [22128.441366] br-vlan11: port 2(wl1-ap0.4099) entered forwarding state
2023-02-10 18:08:21 Daemon.Info WIFI-Dev hostapd: wl1-ap0: STA 7c:b0:c2:81:42:23 WPA: pairwise key handshake completed (RSN)
2023-02-10 18:08:21 Daemon.Notice WIFI-Dev hostapd: wl1-ap0: EAPOL-4WAY-HS-COMPLETED 7c:b0:c2:81:42:23
2023-02-10 18:08:21 Kernel.Warning WIFI-Dev kernel: [22128.527831] Rekeying PTK for STA 7c:b0:c2:81:42:23 but driver can't safely do that.
2023-02-10 18:08:21 Kernel.Warning WIFI-Dev kernel: [22128.642338] wl1-ap0.4110 selects TX queue 0, but real number of TX queues is 0
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.657308] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.666135] Mem abort info:
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.668991] ESR = 0x0000000096000005
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.672749] EC = 0x25: DABT (current EL), IL = 32 bits
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.678095] SET = 0, FnV = 0
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.681170] EA = 0, S1PTW = 0
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.684316] FSC = 0x05: level 1 translation fault
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.689222] Data abort info:
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.692120] ISV = 0, ISS = 0x00000005
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.695958] CM = 0, WnR = 0
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.698955] user pgtable: 4k pages, 39-bit VAs, pgdp=0000000041c53000
2023-02-10 18:08:21 Kernel.Alert WIFI-Dev kernel: [22128.705419] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
2023-02-10 18:08:21 Kernel.Emerg WIFI-Dev kernel: [22128.714170] Internal error: Oops: 96000005 [#1] SMP
```

Things I've tried to rule out various things:

Using a different Belkin RT3200 (can reproduce on three different physical devices)
Running without dynamic VLANs, I have a Belkin RT3200 that has 31 days of uptime.
Running on only one band (2.4Ghz and 5Ghz both crash)
Setting WPA2 or WPA3
Setting every combination of disassoc_low_ack and skip_inactivity_poll
Setting different wifi channels
Setting different min/max CPU speeds
Setting different CPU scheduler
Setting per_sta_vif to '1' to change how to stations connect to the vlan bridge
Setting vlan bridges promisc to '1' since the crash always happen right after the vlan bridge enters promiscuous mode
Setting vlan bridge bridge_empty '1' since the crash always happens right after the vlan bridge is brought back up

Here's the config, pretty vanilla, used as an AP with tagged VLANs 8-15:

/etc/init.d/firewall disable
/etc/init.d/firewall stop
/etc/init.d/dnsmasq disable
/etc/init.d/dnsmasq stop
/etc/init.d/odhcpd disable
/etc/init.d/odhcpd stop

vi /etc/config/network

```
config interface 'loopback'
	option proto 'static'
	option ipaddr '127.0.0.1'
	option netmask '255.0.0.0'
	option device 'lo'

config device
	option name 'br-lan'
	option type 'bridge'
	list ports 'lan1'
	list ports 'lan2'
	list ports 'lan3'
	list ports 'lan4'

config bridge-vlan
	option device 'br-lan'
	option vlan '8'
	list ports 'lan1:t'
	list ports 'lan2:t'
	list ports 'lan3:t'
	list ports 'lan4:t'

config bridge-vlan
	option device 'br-lan'
	option vlan '9'
	list ports 'lan1:t'
	list ports 'lan2:t'
	list ports 'lan3:t'
	list ports 'lan4:t'

config bridge-vlan
	option device 'br-lan'
	option vlan '10'
	list ports 'lan1:t'
	list ports 'lan2:t'
	list ports 'lan3:t'
	list ports 'lan4:t'

config bridge-vlan
	option device 'br-lan'
	option vlan '11'
	list ports 'lan1:t'
	list ports 'lan2:t'
	list ports 'lan3:t'
	list ports 'lan4:t'

config bridge-vlan
	option device 'br-lan'
	option vlan '12'
	list ports 'lan1:t'
	list ports 'lan2:t'
	list ports 'lan3:t'
	list ports 'lan4:t'

config bridge-vlan
	option device 'br-lan'
	option vlan '13'
	list ports 'lan1:t'
	list ports 'lan2:t'
	list ports 'lan3:t'
	list ports 'lan4:t'

config bridge-vlan
	option device 'br-lan'
	option vlan '14'
	list ports 'lan1:t'
	list ports 'lan2:t'
	list ports 'lan3:t'
	list ports 'lan4:t'

config bridge-vlan
	option device 'br-lan'
	option vlan '15'
	list ports 'lan1:t'
	list ports 'lan2:t'
	list ports 'lan3:t'
	list ports 'lan4:t'

config device
	option name 'br-vlan8'
	option type 'bridge'
	list ports 'br-lan.8'

config device
	option name 'br-vlan9'
	option type 'bridge'
	list ports 'br-lan.9'

config device
	option name 'br-vlan10'
	option type 'bridge'
	list ports 'br-lan.10'

config device
	option name 'br-vlan11'
	option type 'bridge'
	list ports 'br-lan.11'

config device
	option name 'br-vlan12'
	option type 'bridge'
	list ports 'br-lan.12'

config device
	option name 'br-vlan13'
	option type 'bridge'
	list ports 'br-lan.13'

config device
	option name 'br-vlan14'
	option type 'bridge'
	list ports 'br-lan.14'

config device
	option name 'br-vlan15'
	option type 'bridge'
	list ports 'br-lan.15'

config interface 'vlan10'
	option device 'br-vlan10'
	option proto 'static'
	option ipaddr '192.168.X.XXX' <--- this IP address is used to validate users with the radius server
	option netmask '255.255.255.0'
	option gateway '192.168.X.1'
	option ip6assign '60'
	list dns '192.168.X.X'
```

vi /etc/config/wireless

```
config wifi-device 'radio0'
	option type 'mac80211'
	option path 'platform/18000000.wmac'
	option band '2g'
	option htmode 'HT20'
	option channel '1'
	option txpower '14'
	option country 'US'
	option disabled '1'

config wifi-device 'radio1'
	option type 'mac80211'
	option path '1a143000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0'
	option channel '36'
	option txpower '14'
	option band '5g'
	option htmode 'HE80'
	option he_su_beamformee '1'
	option he_bss_color '8'
	option country 'US'
	option cell_density '2'

config wifi-iface 'default_radio1'
	option device 'radio1'
	option network 'lan'
	option mode 'ap'
	option ssid 'WIFITest'
	option dtim_period '3'
	option encryption 'wpa2+ccmp'
	option auth_server '192.168.X.X'
	option auth_secret 'xxxxxxxxxxxxx'
	option dynamic_vlan '2'
	option vlan_tagged_interface 'br-lan'
	option vlan_bridge 'br-vlan'
	option vlan_naming '1'
	option per_sta_vif '1'
	option ieee80211r '0' <--- this must be 0 so the STA's reauthenticate fully, as we have layer2 protections on the switch
	option ieee80211w '1'
	option bss_transition '1'
	option wnm_sleep_mode '1'
	option time_advertisement '2'
	option time_zone 'GMT0'
	option ieee80211k '1'
	option rrm_neighbor_report '1'
	option rrm_beacon_report '1'
```

### OpenWrt version

r22020-1c31ca5da9

### OpenWrt target/subtarget

mediatek/mt7622

### Device

Linksys E8450 (UBI)

### Image kind

Official downloaded image

### Steps to reproduce

This is difficult as the issue is intermittent and only occurs every few hours to few days, depending on how many devices are present around it.

### Actual behaviour

When a high number of devices are connecting/disconnecting in a VLAN environment, a kernel dump occurs and the unit stop functioning.

### Expected behaviour

The unit to not kernel dump

### Additional info

I put a list of things I have done to troubleshoot this in the description

### Diffconfig

_No response_

### Terms

- [X] I am reporting an issue for OpenWrt, not an unsupported fork.

I was really just offering my RT3200s up to help do additional testing with him. I'll let him weigh in on where he feels his work sits currently, but my further investigation into WED is by no means meant to hold up any PR he wishes to move forward for the above issue I linked.

That said, I'm keenly interested in making WED work more consistently because I see the positive impact it has on CPU load during high bandwidth TX (from AP perspective).

Brain2000 · March 17, 2023, 6:43pm

I have a PR ready to submit. I'm opening an issue on the mt76 repo and offering up the PR.
The crash fix was the main fix. Having a working WED will also be nice.

Brain2000 · March 17, 2023, 7:02pm

Here -> https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/mediatek/mtk_wed.c#L1675
WED is enabled when the first flow starts, and it is disabled when the last flow stops.
Given that, I think that was the original purpose of the tokens that we are puzzled over. It looks like the flows are handled internally in mtk_wed.c.

I think I see the issue. My prediction -> two threads will come along at the same time, one adding, one removing, as the num_flows is at 1. Because they aren't using atomic reads/writes, they will end up crossing each other's path and it will end up calling mtk_flow_remove and leave num_flows at 1 forever more. Thus it will never call mtk_flow_add again.

If that is the case, simply moving the mutex acquire up a line will fix it. We'll have to fix it in the mediatek ethernet driver (I think that's what mtk_wed.c is part of?).
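The predicted interleaving can be replayed deterministically in userspace. The sketch below is only a model: `struct wed_state` and `replay_race` are made-up names, and the two "threads" are written out as sequential steps. It assumes, as described above, that the `num_flows` check in the add path and the decrement in the remove path both happen before the mutex is taken.

```c
#include <stdbool.h>

/* Simplified stand-in for the driver state; names are illustrative. */
struct wed_state {
    int  num_flows;        /* active flow count */
    bool offload_enabled;  /* what the WLAN driver was last told */
};

/* Replays one possible interleaving of an unlocked add and remove. */
static struct wed_state replay_race(void)
{
    struct wed_state s = { .num_flows = 1, .offload_enabled = true };

    /* Thread A (add): evaluates `if (num_flows)` and sees 1, so it
     * commits to the fast path but has not incremented yet. */
    bool a_fast_path = s.num_flows != 0;

    /* Thread B (remove): `--num_flows` reaches 0, so it disables offload. */
    if (--s.num_flows == 0)
        s.offload_enabled = false;

    /* Thread A resumes: the fast path increments without enabling offload. */
    if (a_fast_path)
        s.num_flows++;

    return s;
}
```

The replay ends with `num_flows == 1` and offload disabled, so every later add takes the fast path and offload is not re-enabled while the count stays above zero. Taking the lock before the check rules out this ordering.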

_FailSafe · March 17, 2023, 7:24pm

Man you work fast! I appreciate that you found that link--I'm going to give it a deep look here in a moment.

I think your prediction sounds pretty reasonable, but this seems odd to me in light of this:

"AP-1" when I noticed flows stop offloading:

root@AP-1:~# dmesg | grep WED
[    7.583905] mt7915e 0000:01:00.0: WED start at IRQ mask
[   33.557084] mt7915e 0000:01:00.0: WED offload enable
[12597.160106] mt7915e 0000:01:00.0: WED offload disable
...
[12671.469181] mt7622-wmac 18000000.wmac: done sta_remove=0x00000000835e7868

It has stopped indicating offloaded flows, though they were visible for the first ~3.5 hours of operation. I literally only had the single pair of offload enable/disable messages on that AP. My other two APs have dozens of pairs of enable/disable messages and as far as I can tell, they're still offloading even now.

Brain2000 · March 17, 2023, 8:09pm

So that "WED offload disable" was the last WED message?

Brain2000 · March 17, 2023, 8:28pm

Let's give it a quick try on the OpenWrt build server:

cd openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/linux-5.15.98/drivers/net/ethernet/mediatek

gedit mtk_wed.c, line 1712

	if (hw->num_flows) {
		hw->num_flows++;
		return 0;
	}

	mutex_lock(&hw_lock);   <----- move this above the if (hw->num_flows) statement

so it looks like this:

	mutex_lock(&hw_lock);

	if (hw->num_flows) {
		hw->num_flows++;
		return 0;
	}

then on line 1741:

	if (--hw->num_flows)
		return;

	mutex_lock(&hw_lock);  <--- same thing, move this up

so it looks like this:

	mutex_lock(&hw_lock);

	if (--hw->num_flows)
		return;

cd back to the openwrt folder and run:
make world

Don't try and run any of the clean/downloads etc... or it will overwrite your temporary change.

_FailSafe · March 17, 2023, 8:49pm

I'm going to give this a shot, but shouldn't we goto out: to unlock the mutex instead of returning 0?

e.g.

	mutex_lock(&hw_lock);
	
	if (hw->num_flows) {
		hw->num_flows++;
		ret = 0;
		goto out;
	}
	...
out:
	mutex_unlock(&hw_lock);

	return ret;
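The same single-exit rule applies to the remove path, where an early `return` after the moved `mutex_lock()` would also leave the lock held. A minimal userspace model of both paths, using a pthread mutex in place of the kernel mutex, might look like this. The struct and function names are hypothetical, and the `offload_enabled` flag stands in for the driver's offload enable/disable callbacks.

```c
#include <pthread.h>
#include <stdbool.h>

struct wed_hw {
    pthread_mutex_t lock;
    int  num_flows;
    bool offload_enabled;
};

static void wed_hw_init(struct wed_hw *hw)
{
    pthread_mutex_init(&hw->lock, NULL);
    hw->num_flows = 0;
    hw->offload_enabled = false;
}

static int flow_add(struct wed_hw *hw)
{
    int ret = 0;

    pthread_mutex_lock(&hw->lock);
    if (hw->num_flows) {
        hw->num_flows++;
        goto out;                 /* fast path still releases the lock */
    }
    hw->offload_enabled = true;   /* stands in for the enable callback */
    hw->num_flows++;
out:
    pthread_mutex_unlock(&hw->lock);
    return ret;
}

static void flow_remove(struct wed_hw *hw)
{
    pthread_mutex_lock(&hw->lock);
    if (--hw->num_flows)
        goto out;                 /* other flows remain; keep offload on */
    hw->offload_enabled = false;  /* stands in for the disable callback */
out:
    pthread_mutex_unlock(&hw->lock);
}
```

After any sequence of calls, `pthread_mutex_trylock(&hw->lock)` should succeed, which is a quick way to confirm that no path returns with the lock held.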

Brain2000 · March 17, 2023, 9:47pm

Oh yes! How'd I miss that. Good eye.

umayer · March 18, 2023, 12:02pm

@Brain2000 @_FailSafe I'm watching your debug session in awe
I very much look forward to running snapshot with these fixes.
Thanks a bunch for spending the time and getting to the bottom of it.

PolynomialDivision · March 18, 2023, 2:19pm

I applied the settings one at a time, starting with encryption and ending with mesh_fwding, and checked after each one whether the mesh still crashed. Since I enabled mesh_fwding, the mesh has not crashed for more than an hour. I will wait a bit longer to be sure, but it looks promising. However, I don't actually want mesh_fwding enabled, because I use olsr and babeld for routing. At least I now have some clue about what's breaking.

Brain2000 · March 18, 2023, 7:59pm

I found a different crash this morning. This time when the mt76 device is unregistering, which luckily does not happen very often:

<7>[79555.951003] Call trace:
<7>[79555.953443]  eth_type_trans+0x44/0x19c
<7>[79555.957189]  ieee80211_rx_list+0x1cc/0xbcc [mac80211]
<7>[79555.962273]  mt76_rx_complete+0x20c/0x40c [mt76]
<7>[79555.966896]  mt76_rx_poll_complete+0x2c8/0x4f0 [mt76]
<7>[79555.971949]  mt76_dma_rx_poll+0x2a4/0x4f0 [mt76]
<7>[79555.976568]  mt7615_unregister_device+0x404/0x560 [mt7615e]
<7>[79555.982140]  __napi_poll+0x54/0x1b0
<7>[79555.985627]  napi_threaded_poll+0x84/0xe4
<7>[79555.989633]  kthread+0x11c/0x130
<7>[79555.992858]  ret_from_fork+0x10/0x20
<0>[79555.996434] Code: 91003880 f9006440 f9419020 f9400083 (f9400000)
<4>[79556.002522] ---[ end trace 2d1d109d5542bc4f ]---

I remember seeing this crash last month and I have some ideas what it might be, but I need to add some logging around it to find a way to reproduce it on demand.