Mt76 wireless driver debugging

If I'm not mistaken, @timothyjward, the Archer C6 v3 has mt76 based WLAN cards, right? Do you happen to know exactly which mt76 chipsets they are?

Edit: Found it here: MediaTek MT7603EN, MediaTek MT7613BEN

That’s right. I also know that the MT7613 uses the same kernel modules as the MT7615e, so I’m pretty sure that I’m running code being changed in these commits.

I would set the following to see if it fixes it:

option mesh_fwding '1'
option encryption 'sae'
option key 'SomeEncryptionKey'
option mesh_rssi_threshold '0'
option disassoc_low_ack '0'
option macaddr '99:98:97:96:95:94'

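I set the MAC address manually because some IoT devices had trouble connecting unless it differed from the default. Keep in mind that every device in the mesh needs a unique MAC address, so don't copy the same one across all your routers. Note that the example address above is just a placeholder: its first octet is odd, which marks a multicast address, so pick a unicast address (even first octet) for real use.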

Unless you're running batman or equivalent, you probably want mesh_fwding '1'.

The mesh_rssi_threshold and disassoc_low_ack settings also keep it from disconnecting in low-signal conditions.

This is the only difference in the configuration I'm using, and it has been working great. The only other thing that might be different is that I have hardware and software offloading disabled.

It also might be easier to try pinging ff02::2 instead of ff02::1, so that only the routers respond instead of all nodes. My two home routers have been up for over 30 days and both still respond to ff02::2.

@Brain2000 No overnight crashes! All three RT3200s are still cranking along just fine!

WED flows finally stopped reporting in /sys/kernel/debug/ppe0/bind on all my APs, so WED is where I’m going to start turning my focus again. But, undoubtedly I’ll be basing my debug upon your current commit.

I was looking up WED today; it appears to be a MediaTek invention, but I can't find any docs or specs anywhere. I wonder if they sent them over to Felix (the original author) but maybe they are not published online anywhere? I may contact MediaTek to see if I can get a copy.

I looked again at that code that adds one to the wed token counter, and it's basically doing this:

Pick a number between 1 and 4095
If the number is > 4095, then we'll add one to the WED token counter.

So unless I'm misinterpreting something, that line seems unreachable. I need the hardware docs to know which way to go...
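
The reasoning above can be checked with a small C model. The names and the 1..4095 range come from the description in this post, not from the driver source, so treat this as a sketch of the argument rather than the actual mt76 code.

```c
#include <assert.h>
#include <stdbool.h>

/* Range described in the post; hypothetical constants, not driver symbols. */
#define TOKEN_MIN 1
#define TOKEN_MAX 4095

/* Returns true if the "add one to the WED token counter" branch would run. */
static bool wed_branch_taken(int token)
{
    /* The token is assumed to have been picked inside the range. */
    assert(token >= TOKEN_MIN && token <= TOKEN_MAX);
    return token > TOKEN_MAX;
}

/* Walks every possible token and counts how often the branch is taken. */
static int count_wed_branch_hits(void)
{
    int hits = 0;

    for (int token = TOKEN_MIN; token <= TOKEN_MAX; token++) {
        if (wed_branch_taken(token))
            hits++;
    }
    return hits;
}
```

If the description is accurate, count_wed_branch_hits() stays at zero, which is why the line looks unreachable. If the real range or threshold is different (which only the hardware docs would confirm), the conclusion changes with it.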

Not sure exactly what I'm looking for, or what I will find, but I'm running a build now where I've added a ridiculously large amount of debug output to get a big-picture view of which WED functions are in play and in what order.

Now I just need to wait for the flows to stop and see what happened in the kernel log. More to come...

He gets documents under NDA to write drivers AFAIK.

What is the goal with this set of changes? I think it is possible that what you have done already would fix issues on a number of devices, even without WED.

Are you hoping/able to get what you have into a state where it could be submitted as a PR? I’m sure that a number of Archer C6 v3.2 owners would be keen to test it out.

@Brain2000 Nothing conclusive yet, but some observations thus far...

While I haven't seen the flows stop yet, I am seeing a pattern that is intriguing to me:

root@AP-2:~# dmesg | grep WED
[    7.507859] mt7915e 0000:01:00.0: WED start at IRQ mask
[   43.886992] mt7915e 0000:01:00.0: WED offload enable
[ 1544.207861] mt7915e 0000:01:00.0: WED offload disable
[ 1571.365862] mt7915e 0000:01:00.0: WED offload enable
[ 1603.418618] mt7915e 0000:01:00.0: WED offload disable
[ 1612.465212] mt7915e 0000:01:00.0: WED offload enable
[ 1689.582934] mt7915e 0000:01:00.0: WED offload disable
[ 1695.633972] mt7915e 0000:01:00.0: WED offload enable
[ 1724.676944] mt7915e 0000:01:00.0: WED offload disable
[ 1756.023433] mt7915e 0000:01:00.0: WED offload enable
[ 1861.177012] mt7915e 0000:01:00.0: WED offload disable
[ 1871.222477] mt7915e 0000:01:00.0: WED offload enable
[ 1905.279921] mt7915e 0000:01:00.0: WED offload disable
[ 1937.451876] mt7915e 0000:01:00.0: WED offload enable
[ 1966.498024] mt7915e 0000:01:00.0: WED offload disable
[ 1997.681381] mt7915e 0000:01:00.0: WED offload enable
[ 2052.760359] mt7915e 0000:01:00.0: WED offload disable
[ 2053.811145] mt7915e 0000:01:00.0: WED offload enable
[ 2093.870167] mt7915e 0000:01:00.0: WED offload disable
[ 2113.950274] mt7915e 0000:01:00.0: WED offload enable
[ 2147.999245] mt7915e 0000:01:00.0: WED offload disable
[ 2175.239666] mt7915e 0000:01:00.0: WED offload enable
[ 2208.295078] mt7915e 0000:01:00.0: WED offload disable
[ 2235.459152] mt7915e 0000:01:00.0: WED offload enable
[ 2273.360873] mt7915e 0000:01:00.0: WED offload disable
[ 2300.688621] mt7915e 0000:01:00.0: WED offload enable
[ 2573.076400] mt7915e 0000:01:00.0: WED offload disable
[ 2577.116001] mt7915e 0000:01:00.0: WED offload enable

These are from basic dev_info() calls I am making within mt7915_mmio_wed_offload_enable() and mt7915_mmio_wed_offload_disable() in mmio.c. This behavior may be perfectly normal, but I'm trying to do more digging into it now.
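
To check whether the enable and disable messages really come in matched pairs, a log like the one above can be tallied with a few lines of C. This is a minimal userspace sketch: tally_wed_events() and struct wed_tally are made-up names, and it only matches the two message strings shown in the dmesg output.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct wed_tally {
    int enables;
    int disables;
    bool alternating; /* false if two events of the same kind occur in a row */
};

/* Scans a newline-separated log and counts the WED offload messages. */
static struct wed_tally tally_wed_events(const char *log)
{
    struct wed_tally t = { 0, 0, true };
    int last = 0; /* 0 = none yet, 1 = enable, 2 = disable */

    assert(log != NULL);

    while (*log) {
        const char *eol = strchr(log, '\n');
        size_t len = eol ? (size_t)(eol - log) : strlen(log);
        char line[256];
        size_t n = len < sizeof(line) - 1 ? len : sizeof(line) - 1;
        int kind = 0;

        /* Copy one line (truncated to 255 bytes) so strstr stays inside it. */
        memcpy(line, log, n);
        line[n] = '\0';

        if (strstr(line, "WED offload enable"))
            kind = 1;
        else if (strstr(line, "WED offload disable"))
            kind = 2;

        if (kind) {
            if (kind == last)
                t.alternating = false;
            if (kind == 1)
                t.enables++;
            else
                t.disables++;
            last = kind;
        }

        log += len;
        if (*log == '\n')
            log++;
    }
    return t;
}
```

A healthy log should report alternating events with the enable count equal to the disable count, or one higher while offload is currently on.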

In full transparency, @Brain2000 is driving the current changes to address this:

I was really just offering my RT3200s up to help do additional testing with him. I'll let him weigh in on where he feels his work sits currently, but my further investigation into WED is by no means meant to hold up any PR he wishes to move forward for the above issue I linked.

That said, I'm keenly interested in making WED work more consistently because I see the positive impact it has on CPU load during high bandwidth TX (from AP perspective).

I have a PR ready to submit. I'm opening an issue on the mt76 repo and offering up the PR.
The crash fix was the main fix. Having a working WED will also be nice.

Here -> https://elixir.bootlin.com/linux/latest/source/drivers/net/ethernet/mediatek/mtk_wed.c#L1675
WED is enabled when the first flow starts, and it is disabled when the last flow stops.
Given that, I think that was the original purpose of the tokens that we are puzzled over. It looks like the flows are handled internally in mtk_wed.c.

I think I see the issue. My prediction: two threads will come along at the same time, one adding and one removing, while num_flows is at 1. Because they aren't using atomic reads/writes, and the counter is touched before the mutex is taken, they will cross each other's path: it will end up calling mtk_flow_remove and leaving num_flows at 1 forever more. Thus it will never call mtk_flow_add again.

If that is the case, simply moving the mutex acquire up a line will fix it. We'll have to fix it in the mediatek ethernet driver (I think that's what mtk_wed.c is part of?).
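
The predicted interleaving can be replayed step by step in a single-threaded userspace model. Everything here (struct hw_model, the helper names, the offload_on flag) is hypothetical and only mirrors the shape of the counter check in mtk_wed.c; it is not the kernel code and it does not prove the race happens on real hardware.

```c
#include <assert.h>
#include <stdbool.h>

struct hw_model {
    int num_flows;
    bool offload_on;
};

/* The add path split into its two steps, so they can be interleaved by hand. */
static bool add_sees_flows(const struct hw_model *hw)
{
    assert(hw->num_flows >= 0);
    return hw->num_flows != 0;
}

static void add_finish(struct hw_model *hw, bool saw_flows)
{
    if (!saw_flows)
        hw->offload_on = true; /* first flow: turn offload on */
    hw->num_flows++;
}

static void remove_flow(struct hw_model *hw)
{
    if (--hw->num_flows)
        return;
    hw->offload_on = false; /* last flow gone: turn offload off */
}

/* Unlocked check: the remove runs between add's check and add's increment. */
static struct hw_model racy_interleaving(void)
{
    struct hw_model hw = { .num_flows = 1, .offload_on = true };
    bool saw_flows = add_sees_flows(&hw); /* A: sees 1, picks the fast path */

    remove_flow(&hw);                     /* B: 1 -> 0, offload turned off */
    add_finish(&hw, saw_flows);           /* A: 0 -> 1, offload stays off */
    return hw;
}

/* Lock taken before the check: each operation runs as one unit. */
static struct hw_model serialized(bool add_first)
{
    struct hw_model hw = { .num_flows = 1, .offload_on = true };

    if (add_first) {
        add_finish(&hw, add_sees_flows(&hw));
        remove_flow(&hw);
    } else {
        remove_flow(&hw);
        add_finish(&hw, add_sees_flows(&hw));
    }
    return hw;
}
```

In the racy order the model ends with num_flows at 1 and offload off, so every later add takes the fast path and never turns offload back on. Either serialized order ends with num_flows at 1 and offload on, which is the state the lock-first change is meant to guarantee.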

Man you work fast! I appreciate that you found that link--I'm going to give it a deep look here in a moment.

I think your prediction sounds pretty reasonable, but this seems odd to me in light of this:

"AP-1" when I noticed flows stop offloading:

root@AP-1:~# dmesg | grep WED
[    7.583905] mt7915e 0000:01:00.0: WED start at IRQ mask
[   33.557084] mt7915e 0000:01:00.0: WED offload enable
[12597.160106] mt7915e 0000:01:00.0: WED offload disable
...
[12671.469181] mt7622-wmac 18000000.wmac: done sta_remove=0x00000000835e7868

It has stopped indicating offloaded flows, though they were visible for the first ~3.5 hours of operation. I literally only had the single pair of offload enable/disable messages on that AP. My other two APs have dozens of pairs of enable/disable messages and as far as I can tell, they're still offloading even now. :face_with_monocle:

So that "WED offload disable" was the last WED message?

Let's give it a quick try on the OpenWrt build server:

cd openwrt/build_dir/target-aarch64_cortex-a53_musl/linux-mediatek_mt7622/linux-5.15.98/drivers/net/ethernet/mediatek

gedit mtk_wed.c, line 1712

	if (hw->num_flows) {
		hw->num_flows++;
		return 0;
	}

	mutex_lock(&hw_lock);   <----- move this above the if (hw->num_flows) statement

so it looks like this:

	mutex_lock(&hw_lock);

	if (hw->num_flows) {
		hw->num_flows++;
		return 0;
	}

then on line 1741:

	if (--hw->num_flows)
		return;

	mutex_lock(&hw_lock);  <--- same thing, move this up

so it looks like this:

	mutex_lock(&hw_lock);

	if (--hw->num_flows)
		return;

cd back to the openwrt folder and run:
make world

Don't run any of the clean or download targets, or they will overwrite your temporary change.

I'm going to give this a shot, but shouldn't we goto out to unlock the mutex instead of returning 0?

e.g.

	mutex_lock(&hw_lock);
	
	if (hw->num_flows) {
		hw->num_flows++;
		ret = 0;
		goto out;
	}
	...
out:
	mutex_unlock(&hw_lock);

	return ret;
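
For anyone following along, here is a userspace sketch of why every exit has to pass through out: once the lock is taken first. The lock is a stand-in that only tracks whether it is held, and struct hw_model, enable_ret and the function names are hypothetical; the elided part of the real function is not reproduced here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the kernel mutex: taking it twice models a deadlock. */
struct model_lock {
    bool held;
};

static void model_lock_acquire(struct model_lock *l)
{
    assert(!l->held); /* would block forever if an earlier call leaked it */
    l->held = true;
}

static void model_lock_release(struct model_lock *l)
{
    assert(l->held);
    l->held = false;
}

struct hw_model {
    struct model_lock lock;
    int num_flows;
    bool offload_on;
    int enable_ret; /* hypothetical: result of turning offload on */
};

static int flow_add(struct hw_model *hw)
{
    int ret = 0;

    model_lock_acquire(&hw->lock);

    if (hw->num_flows) { /* offload already on, just count the flow */
        hw->num_flows++;
        goto out;
    }

    ret = hw->enable_ret;
    if (!ret) {
        hw->offload_on = true;
        hw->num_flows++;
    }

out:
    model_lock_release(&hw->lock); /* single exit: lock is always dropped */
    return ret;
}

static void flow_remove(struct hw_model *hw)
{
    model_lock_acquire(&hw->lock);

    if (--hw->num_flows)
        goto out;

    hw->offload_on = false;

out:
    model_lock_release(&hw->lock);
}
```

With a plain return 0 in the fast path, the second flow_add() would leave the lock held and the next call would trip the assert in model_lock_acquire(), which is the userspace equivalent of the hang the goto avoids.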

Oh yes! How'd I miss that. Good eye.

@Brain2000 @_FailSafe I'm watching your debug session in awe :slight_smile:
I very much look forward to running snapshot with these fixes.
Thanks a bunch for spending the time and getting to the bottom of it.

I applied everything sequentially to see whether the mesh would still crash, starting with encryption and finishing with mesh_fwding. Now that I've enabled mesh_fwding, the mesh has not crashed for more than an hour. :open_mouth: I will wait a bit to be sure, but it looks promising. However, I don't actually want mesh_fwding enabled because I use olsr and babeld for routing. But at least I now have some clue about what's breaking.

I found a different crash this morning. This time when the mt76 device is unregistering, which luckily does not happen very often:

<7>[79555.951003] Call trace:
<7>[79555.953443]  eth_type_trans+0x44/0x19c
<7>[79555.957189]  ieee80211_rx_list+0x1cc/0xbcc [mac80211]
<7>[79555.962273]  mt76_rx_complete+0x20c/0x40c [mt76]
<7>[79555.966896]  mt76_rx_poll_complete+0x2c8/0x4f0 [mt76]
<7>[79555.971949]  mt76_dma_rx_poll+0x2a4/0x4f0 [mt76]
<7>[79555.976568]  mt7615_unregister_device+0x404/0x560 [mt7615e]
<7>[79555.982140]  __napi_poll+0x54/0x1b0
<7>[79555.985627]  napi_threaded_poll+0x84/0xe4
<7>[79555.989633]  kthread+0x11c/0x130
<7>[79555.992858]  ret_from_fork+0x10/0x20
<0>[79555.996434] Code: 91003880 f9006440 f9419020 f9400083 (f9400000)
<4>[79556.002522] ---[ end trace 2d1d109d5542bc4f ]---

I remember seeing this crash last month and I have some ideas about what it might be, but I need to add some logging around it to find a way to reproduce it on demand.