Users needed to test Wi-Fi stability on Linksys WRT3200ACM & WRT32X on OpenWrt 21.02

I've looked at the change. The commit adds a static function ieee80211_send_addba_with_timeout() from code that was already part of ieee80211_tx_ba_session_handle_start().
There are two changes:
tid was given as a paramter of ieee80211_tx_ba_session_handle_start(), and after the change, it takes the value from tid_tx->tid. This one appears to be harmless, at first sight.

The other change looks awfully suspicious:
The 4th parameter of the call to ieee80211_send_addba_request changed from params.ssn to sta->tid_seq[tid] >> 4. This is where I would bet the problem is. I'll build a package and will post it soon.

Edit: It might be that the call to drv_ampdu_action() changes the value of params.ssn. In fact, this call in [mwlwifi]/mac80211.c passes a pointer to it (indication that it may change its value):
rc = mwl_fwcmd_get_seqno(hw, stream, &params->ssn);

I'll build a module with both parameters included in ieee80211_send_addba_with_timeout().

2 Likes

We can modify timer in milliseconds in the driver code that has to do with BA traffic check:

There is also a setting for MAX TID on same page but this is all above my understanding.

@WildByDesign or anybody, please try these modules:
https://drive.google.com/drive/folders/1oH_alLp7qQ4dcFn0EhM9DoYypNXd-iSa?usp=sharing
:crossed_fingers:

Edit: I've changed the packages at 17:10 GMT to print a message if the affected variables don't match.

1 Like

I will definitely test it but I won’t be able to test until around 9:30pm EST when the rest of the family is off the network.

If you don’t mind, could you please explain the changes that are included (or excluded) from these modules?

These are the changes I'm making:

--- a/net/mac80211/agg-tx.c
+++ b/net/mac80211/agg-tx.c
@@ -448,12 +448,12 @@ static void sta_addba_resp_timer_expired(struct timer_list *t)
        ieee80211_stop_tx_ba_session(&sta->sta, tid);
 }

-static void ieee80211_send_addba_with_timeout(struct sta_info *sta,
-                                             struct tid_ampdu_tx *tid_tx)
+static void ieee80211_send_addba_with_timeout(struct sta_info *sta, int tid,
+                                             struct tid_ampdu_tx *tid_tx,
+                                             u16 ssn)
 {
        struct ieee80211_sub_if_data *sdata = sta->sdata;
        struct ieee80211_local *local = sta->local;
-       u8 tid = tid_tx->tid;
        u16 buf_size;

        /* activate the timer for the recipient's addBA response */
@@ -478,10 +478,15 @@ static void ieee80211_send_addba_with_timeout(struct sta_info *sta,
                buf_size = IEEE80211_MAX_AMPDU_BUF_HT;
        }

+       if (tid != tid_tx->tid)
+               pr_warn_ratelimited("ieee80211_send_addba_with_timeout: mismatch tid=%dr, tid_tx->tid=%d\n",
+                                   tid, tid_tx->tid);
+       if (ssn != sta->tid_seq[tid])
+               pr_warn_ratelimited("ieee80211_send_addba_with_timeout: mismatch ssn=%dr, sta->tid_seq[tid] >> 4=%d\n",
+                                   ssn, sta->tid_seq[tid] >> 4);
        /* send AddBA request */
        ieee80211_send_addba_request(sdata, sta->sta.addr, tid,
-                                    tid_tx->dialog_token,
-                                    sta->tid_seq[tid] >> 4,
+                                    tid_tx->dialog_token, ssn,
                                     buf_size, tid_tx->timeout);
 }

@@ -544,7 +549,7 @@ void ieee80211_tx_ba_session_handle_start(struct sta_info *sta, int tid)
                return;
        }

-       ieee80211_send_addba_with_timeout(sta, tid_tx);
+       ieee80211_send_addba_with_timeout(sta, tid, tid_tx, params.ssn);
 }

 /*

There are two changes in the offending commit that caught my attention: two parameters to the ieee80211_send_addba_request() call:
the ssn (session number?) was changed from params.ssn to sta->tid_seq[tid] >> 4. While the initial value of params.ssn was indeed sta->tid_seq[tid] >> 4, the very next line makes a call to
ret = drv_ampdu_action(local, sdata, &params) that may have changed params.ssn. I haven't studied the way this works at all, but found a candidate for change in mac80211.c in mwlwifi that appears to do it.

The tid change is less likely to have caused it, but the changes assumes that the tid parameter passed to ieee80211_tx_ba_session_handle_start() has the same value of tid_tx->tid. tid_tx is taken from rcu_dereference_protected_tid_tx(sta, tid). I have almost zero knowledge of these parts of the kernel, so I may be shooting at the wrong target.

Anyway, I've added code to log a warning if the values of the parameters don't match. If the warning show up, then we know where the problem with this change is.

3 Likes

Thank you for the details and for everything that you have done so far.

Is this a warning that I should be looking out for during opkg install, or a warning that would potentially show up in the syslog or kernel log?

I can look out for the warning and share it here later on tonight when I get a chance to test it.

It's a klog warning. I've just realized I made a mistake in the comparison code. I'm missing a >> 4. I'll try again.

OK. I've rebuilt the modules in my drive.
Now I'm building them for master r17328, so I can test this myself :slight_smile: .

1 Like

It looks like we've found our bug!!!

ieee80211_send_addba_with_timeout: mismatch ssn=23, sta->tid_seq[tid] >> 4=66

:partying_face: :fireworks:

Even though I don't yet know to properly fix it :thinking:, I've already found a workaround for it!
Let's just hope it is enough to stop the dropouts.

8 Likes

Nice.
The interesting part is why mwlwifi reacts badly for that change, while other wifi drivers survived it just fine.

I don't know who is at fault here. If it is mac80211 for counting for the ssn to not change, or mwlwifi for changing it. If the code responsible for the change is in actual mwlwifi C code, at least we can look at it.

2 Likes

Have you tested building the latest mac80211 with just that patch reverted (as practically as possible anyway)?

That would be a much more maintainable patch burden. Even better if the patch can live on the mwlwifi side.

Good hunting.

2 Likes

I have tested it with backports-5.10.42-1, which was what master had in August, and that's what the routers I have here are running in production. So far so good. No threats, at least.

I don't see why this would not work with the latest version. The part that I don't know how to fix, is not about just getting it to work, but knowing who is at fault here--mwlwifi or linux kernel. From my pov, the fault may be with mwlwifi, since there have been no complaints about this from any other source--but for now it is easier for me to fix--or work around--it in the kernel, so that's what I have done.

I just looked at where and how the behavior changed, and made changes so that it worked like it did before. If you want to blame the kernel, then you can argue that the offending commit did not intend to change the behavior, but did; therefore it is a kernel bug.

This part of the code has not been changed from 5.8 till now. I've looked at linux master, and it is still the same way.

Edit
I'll explain more about why I speculate this is a mwlwifi bug, besides nothing else getting broken. The data that mwlwifi changed was local to the ieee80211_tx_ba_session_handle_start() function, without static. It doesn't appear to be wise for the driver to change that data and count on it later. However, I don't know anything about that interface, and its documentation. I'll look it up next.

5 Likes

I have built a 21.02.1 mac80211 package here:
https://drive.google.com/drive/folders/1bF_N41aQ3w58dRLn9sE28954McDh7pC9?usp=sharing

This should be all that's needed to make it work with the official release, although we still need more testing to be sure.

3 Likes

The good thing is that this work-around should also benefit Divested-WRT users since it follows master branch.

2 Likes

Would be interesting to see if there is a difference in the mwlwifi driver for the WRT1900ACS or WRT1200 since those are not affected and probably share most of the C code.

1 Like

@cotequeiroz Would you mind posting your official patch for this?

I see you posted a patch in Users needed to test Wi-Fi stability on Linksys WRT3200ACM & WRT32X on OpenWrt 21.02 - #569 by cotequeiroz, but then later mentioned missing >> 4, and it doesn't appear your patch was updated after that post.

You'll find the patch in the same drive folder as I saved the package. The patch is meant to be saved under package/kernel/mac80211/patches/subsys/.

The patch does not print the warning, like my first one did.

1 Like

It may be worth posting a summary of findings to the OpenWrt devel mailing list to get some opinions from people who are closer to the wifi part of the system and may have a comment (thinking nbd).

2 Likes

Fabulous thank you! I’ll try it out in my personal builds in the next few days.

1 Like