tl; dr; mt7996 currently rejects some hardware configurations which exist in the wild and I don't know how to fix it
Part of the problem is described in this proposed patch from @aiamadeus :
With discussions in https://github.com/openwrt/openwrt/pull/20430
Although it fixes the issue for that particular device, and for another one I've got, I believe the fix is incomplete. The main reason is that it breaks another device I have.
If I knew how the hardware is supposed to work, and had documentation like RF frontend, aka ADIE, datasheets to back it up, then I would have discussed this in the more appropriate patch form. But since I don't, I'll just try to explain my assumptions in the hope that someone with the knowledge will correct it and fill in missing pieces.
The mt7996 code classifies RF frontends into "iFEM" and "eFEM", which IIUC refers to internal or external LNAs and PAs. Not sure about this definition at all... Actually not sure they have to be related either. There seems to be 2 bits deciding this for each frontend, where we only consider 0 and 3 as valid. Are those bits really LNA and PA enable bits? And could they be different?
The ADIE classification happens in mt7996_variant_fem_init() in https://github.com/nbd168/wireless/blob/mt76/drivers/net/wireless/mediatek/mt76/mt7996/init.c#L1145
Closely related to this is also the number of bands and streams in detected by mt7996_variant_type_init() in the same file. Although that is mostly based on the wifi chip variant (mt7996, mt7992 or mt7990), both functions use bits from the MT_PAD_GPIO register for their decisions.
Then later on, in mt7996_eeprom_variant_valid() in https://github.com/nbd168/wireless/blob/mt76/drivers/net/wireless/mediatek/mt76/mt7996/eeprom.c#L96 these classifcations are used to validate the eeprom data (from nvmem) and to select the correct default eeprom file for cases where no data is provided or the validation fails.
Such validation can obviously only make sense if the classification is 100% correct. If it is wrong, then using it to enforce eeprom settings can actually be harmful. It makes the driver reject valid eeprom data and replace it with mismatched defaults.
This is what was happened for the W1701k discussed in PR20430. It was classified as "iFEM" while actually having a mix of iFEM and eFEM. It had correct eeprom data, but that was rejected due to the classifiaction bug. The driver used default iFEM eeprom instead, causing the LNA and PA to be disabled for 5 and 6GHz. With terrible performance as a result. The proposed patch fixed this by detecting this specific combination as mixed iFEM+eFEM.
I believe it still leaves many questions unanswered:
- even with the mixed alternative supported, we only support 3 combinations: iFEM+iFEM, iFEM+eFEM and eFEM+eFEM. Why is eFEM+iFEM impossible? Maybe it is for some hardware reason I don't understand?
- and shouldn't this have been iFEM+iFEM+IFEM, iFEM+eFEM+eFEM etc? There are 3 bands here, making it possible to imagine 9 combinations. Why is the driver limiting this to 2 (or 3 with the patch)? The existence of the 3rd variant in the wild proves that the driver is wrong. The question is how wrong...
- the actual classification of the ADIE chips is based on magic id and version numbers read from the MT_ADIE_CHIP_ID registers. It is very convoluted and hard to follow. And I have reason to believe it is incomplete, not detecting the iFEM vs eFEM difference between 0x7977 ADIEs with different versions. Or maybe I am guessing wrong - I have no datasheets, only the routers
- The "adie_comb" value from the MT_PAD_GPIO_ADIE_COMB is used in the decision process. But only as "1" vs "all other values". And it is a 2bit field, so there are 3 other values. Do they all mean the same? What is the purpose of these bits? "COMB" = "combined"? or "combination"?
And that's only the problems with the detection logic. The eeprom handling is also lacking. There is no default file for any mixed iFEM/eFEM mt7996 configuration. And the parsing logic trying to validate the eeprom data seems to still confuse phy and band numbers.
What can we do about this? The driver will currently break existing hardware configurations with no possible workaround. It needs fixing. But from my point of view, the only alternative I can see is to remove/relax all these tests and simply trust the eeprom data we feed it. To do something better we would need to know about all the possible hardware combinations and have default eeprom files for all of them.