MT7981 wifi firmware crash caused by TX of small unicast packets (482 bytes or less)

I have an issue that was already mentioned here on the forum, but I did some further testing and think that it is chip-specific rather than device-specific. I already opened an issue on GitHub, but there is no activity there, so I want to try here.

I'm experiencing crashes with multi-psk on MT7981.

  • When connecting to the network using the main WPA2 PSK, everything is stable.
  • When connecting using a secondary WPA2 PSK, if the station doesn't move to another ap_vlan, everything is stable.
  • When connecting using any of the secondary PSKs, if the station moves to a secondary ap_vlan, everything works until the station starts sending traffic to the AP.
  • Then the chip hangs, and a combination of 00005aed and 000026ed message timeouts happens.
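
The numbers in those timeouts are the MCU command words the driver did not get a reply for. As a minimal sketch, assuming the bit layout used by the mt76 connac MCU headers (command ID in bits 0-7, extended command ID in bits 8-15, query flag in bit 16), they can be split into fields like this. The struct and function names are made up for illustration and are not part of mt76.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper type, not an mt76 structure. */
struct mcu_cmd_fields {
	uint8_t id;      /* bits 0-7: base command ID */
	uint8_t ext_id;  /* bits 8-15: extended command ID */
	bool query;      /* bit 16: set for query-type commands */
};

/* Split a command word as printed in the "message ... timeout" logs.
 * Assumption: field layout as in mt76's connac MCU definitions. */
struct mcu_cmd_fields mcu_cmd_decode(uint32_t cmd)
{
	struct mcu_cmd_fields f = {
		.id = (uint8_t)(cmd & 0xff),
		.ext_id = (uint8_t)((cmd >> 8) & 0xff),
		.query = ((cmd >> 16) & 0x1) != 0,
	};
	return f;
}

/* Self-check against the command words quoted in this thread. */
void mcu_cmd_decode_selfcheck(void)
{
	assert(mcu_cmd_decode(0x00005aed).ext_id == 0x5a);
	assert(mcu_cmd_decode(0x000026ed).ext_id == 0x26);
	assert(mcu_cmd_decode(0x000130ed).ext_id == 0x30);
	assert(mcu_cmd_decode(0x000130ed).query);
}
```

With this layout, 00005aed and 000026ed share the base ID 0xed and differ only in the extended ID (0x5a vs 0x26), which is why they are referred to as the 0x5a and 0x26 messages.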

When using an OpenWrt snapshot without any patches to the mt76 driver, the chip completely restarts on its own and the wifi network reappears within a couple of seconds. All clients, including ones connected via the main PSK, get disconnected.

Then I tried the rany2/openwrt@18cc739 patch: the 0x5a messages stop appearing, but the chip still hangs, and the driver shows a 0x26 timeout and restarts.

I then compiled the rany2/openwrt fork. Since it applies a bunch of patches, when the chip hangs it manages to recover without disconnecting clients, but it logs the following:

[  447.275349] mt798x-wmac 18000000.wifi: send message 000130ed timeout, try again(1).
[  447.283349] mt798x-wmac 18000000.wifi: 
[  447.283349] phy0 L1 SER recovery completed.
[  447.821897] mt798x-wmac 18000000.wifi: phy0 SER recovery state: 0x00000004
[  447.828811] mt798x-wmac 18000000.wifi: 
[  447.828811] phy0 L1 SER recovery start.
[  447.837695] mt798x-wmac 18000000.wifi: phy0 SER recovery state: 0x00000008
[  447.854270] mt798x-wmac 18000000.wifi: phy0 SER recovery state: 0x00000010
[  447.861219] mt798x-wmac 18000000.wifi: phy0 SER recovery state: 0x00000020
[  447.868360] mt798x-wmac 18000000.wifi: 
[  447.868360] phy0 L1 SER recovery completed.

The same setup works on MT7613, MT7612, MT7615, MT7603, MT7921k (n, ac, ax), and appears to not crash on MT7975 (Asus RT-AX53U) even though it uses the same mt7915e module.

I assumed that this is a firmware bug, so I tried all five firmware versions published on mtk-feeds. The behavior is similar with all of them, but the crashes don't happen as often with the latest firmware.

Additionally, I assumed that this is related to the GTK (as it is different), so I applied a patch similar to this mt7615 workaround from a few years back, but it didn't do anything.

If possible, can someone explain to me what the difference is between stations connected to the main AP interface and ones connected to an AP_VLAN interface? The GTK is different, but why would that crash the firmware?


Even if there is a common driver, that does not mean this is a firmware bug. The chips are quite different: they have different firmware and may also require special handling for some corner cases.

Please test this workaround: mt79xx: attempt fixing message timeout errors · blocktrron/mt76@7447213 (github.com)

I'll test this today and let you know.

Thank you for suggesting this, but it looks like it's not enough. I applied the patch to the rany2/openwrt fork, and it didn't help: it only took a bit longer to trigger the crash (1 minute instead of 10-20 seconds of constant upload). I'll now try applying it to OpenWrt main.

Unfortunately, it doesn't help with the main branch of OpenWrt either.

Could you follow the instructions in MP2.3 release.md and try it again to identify if this is an upstream or vendor driver issue?

https://git01.mediatek.com/plugins/gitiles/openwrt/feeds/mtk-openwrt-feeds/+/refs/heads/master/autobuild_mac80211_release/Release.md

I'm trying to build MP2.3, but the kernel compile fails (running on Ubuntu 22.04):

make[5]: Entering directory '/mnt/mtk-openwrt/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7981/linux-5.4.271'
scripts/Makefile.build:42: /scripts/basic/Makefile: No such file or directory
make[5]: *** No rule to make target '/scripts/basic/Makefile'.  Stop.

I even tried hardcoding the kbuild-dir in build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7981/linux-5.4.271/scripts/Makefile.build:

kbuild-dir := /mnt/mtk-openwrt/openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7981/linux-5.4.271

But it's not working, it fails with:

make[5]: *** No rule to make target 'scripts/basic/include/generated/bounds.h', needed by '__build'.  Stop.

Is the MediaTek build only compatible with Ubuntu 18.04?

I managed to build an image on an Ubuntu 18.04 VM, but it doesn't boot (it reboots after 1-2 seconds) on one of my MT7981 devices, the Cudy WR3000 v1.

I had to backport the DTS and make it compile with Mediatek's mt7981.dtsi but apparently I made a mistake somewhere, and I currently don't have a TTL serial adapter to debug.

I have another device that has exactly the same issue: Unifi U6 Plus, but that's a more expensive device and I don't want to brick it.

Back to my original question:
Can someone explain to me what the difference is between stations connected to the main AP interface and ones connected to an AP_VLAN interface? The GTK is different, but why would that crash the firmware?

I have done further digging into this, and found the following:

mt76 (as it is a softmac driver) uses the mac80211 framework. In it, there is this snippet in key.c:

	if (sdata->vif.type == NL80211_IFTYPE_AP_VLAN) {
		/*
		 * The driver doesn't know anything about VLAN interfaces.
		 * Hence, don't send GTKs for VLAN interfaces to the driver.
		 */
		if (!(key->conf.flags & IEEE80211_KEY_FLAG_PAIRWISE)) {
			ret = 1;
			goto out_unsupported;
		}
	}

This means that the VLAN-specific GTK is never sent to the driver and never gets KEY_FLAG_UPLOADED_TO_HARDWARE set. In turn, all multicast/broadcast packets for VLAN stations are encrypted in software before being handed to the hardware.
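
To make that consequence concrete, here is a small userspace model of the decision. It is only a sketch: the enum, struct, and function names are hypothetical, and the real mac80211 code checks more conditions (for example, whether the driver accepts the key at all) before a key counts as uploaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the mac80211 types involved. */
enum model_iftype { MODEL_IFTYPE_AP, MODEL_IFTYPE_AP_VLAN };

struct model_key {
	enum model_iftype iftype; /* interface the key belongs to */
	bool pairwise;            /* true for a PTK, false for a GTK */
};

/* Mirrors only the AP_VLAN check quoted above: group keys on an
 * AP_VLAN interface are never offered to the driver. */
bool model_key_offered_to_driver(const struct model_key *key)
{
	assert(key != NULL);
	if (key->iftype == MODEL_IFTYPE_AP_VLAN && !key->pairwise)
		return false;
	return true;
}

/* A frame is encrypted in software when its key was not offloaded.
 * Assumption: the driver accepts every key it is offered. */
bool model_frame_sw_encrypted(const struct model_key *key)
{
	return !model_key_offered_to_driver(key);
}
```

In this model a GTK on the main AP interface can be offloaded, while a GTK on an AP_VLAN interface always falls back to software encryption. Pairwise keys are offered to the driver in both cases.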

So it looks like the firmware crashes when sending software-encrypted packets and the receive buffer is full.

Update:

I have narrowed this (mt76 GitHub issues #881 and #866) down to a very specific set of conditions that cause the chip firmware to crash:

The issue is completely unrelated to rx packets. Instead, it has everything to do with small unicast packets sent to stations in ap_vlan or multicast_to_unicast converted packets.

Packets with an IP packet length of 482 bytes or less that are sent to ap_vlan stations or converted by multicast_to_unicast crash the firmware frequently.
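
For anyone trying to reproduce this, the condition can be written down as a small predicate, together with the UDP payload size that produces a given IP length. This is a sketch under assumptions: "IP packet length" is taken to mean the IPv4 total-length field, the IPv4 header carries no options (20 bytes), and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define CRASH_MAX_IP_LEN 482u /* threshold observed in this thread */
#define IPV4_HDR_LEN      20u /* assumption: no IPv4 options */
#define UDP_HDR_LEN        8u

/* True when a transmitted frame matches the reported trigger: a small
 * packet sent to an ap_vlan station or produced by multicast_to_unicast. */
bool frame_matches_crash_pattern(size_t ip_total_len, bool to_ap_vlan_sta,
				 bool multicast_to_unicast)
{
	if (!to_ap_vlan_sta && !multicast_to_unicast)
		return false;
	return ip_total_len <= CRASH_MAX_IP_LEN;
}

/* UDP payload size that yields the requested IPv4 total length. */
size_t udp_payload_for_ip_len(size_t ip_total_len)
{
	assert(ip_total_len >= IPV4_HDR_LEN + UDP_HDR_LEN);
	return ip_total_len - IPV4_HDR_LEN - UDP_HDR_LEN;
}
```

Under these assumptions, a UDP datagram with a 454-byte payload sits exactly at the 482-byte boundary, and a 455-byte payload is the first size just above it.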

Can someone help me understand why this packet length is magic? What's different with smaller packets: block ack, something to do with aggregation, or something else?

I have found a workaround that can't ever be sent upstream.

Changing the ap_vlan interface's ndo_start_xmit from ieee80211_subif_start_xmit to ieee80211_subif_start_xmit_8023 fixes this firmware crash.

The easiest way to work around this is described in my comment on GitHub.
