Why the switch to unstable ath10k-ct?

I tested this commit before. The ct driver crashed after 3 days, so I reverted to the original driver and have had no crashes since. I'm using a Linksys EA8500; in my experience the ct driver has better connectivity. Honestly, I don't know if this is related, but I suspect the ct driver damaged my 802.11bgn hardware: I had to disable PCI0 in the dts file in order to use 802.11ac, otherwise my router hangs and reboots endlessly.

This is what I get before it hangs and reboots; it is from the serial log.
Note: I had been using the ct driver for 3 weeks when this started. I then tried the original driver and got the same error message, "wmi unified ready event not received", followed by a freeze.

[   14.113365] ath10k_pci 0000:01:00.0: enabling device (0140 -> 0142)
[   14.114037] ath10k_pci 0000:01:00.0: pci irq msi oper_irq_mode 2 irq_mode 0 reset_mode 0
[   14.283484] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/pre-cal-pci-0000:01:00.0.bin failed with error -2
[   14.283520] ath10k_pci 0000:01:00.0: Falling back to user helper
[   14.789420] ath10k_pci 0000:01:00.0: Direct firmware load for ath10k/QCA99X0/hw2.0/firmware-6.bin failed with error -2
[   14.789456] ath10k_pci 0000:01:00.0: Falling back to user helper
[   14.837979] firmware ath10k!QCA99X0!hw2.0!firmware-6.bin: firmware_loading_store: map pages failed
[   15.063189] ath10k_pci 0000:01:00.0: qca99x0 hw2.0 target 0x01000000 chip_id 0x003b01ff sub 168c:0002
[   15.063248] ath10k_pci 0000:01:00.0: kconfig debug 0 debugfs 0 tracing 0 dfs 0 testmode 1
[   15.075895] ath10k_pci 0000:01:00.0: firmware ver 10.4.1.00030-1 api 5 features no-p2p crc32 d2901e01
[   15.136299] ath10k_pci 0000:01:00.0: board_file api 2 bmi_id 1:2 crc32 08fa09f2
[   21.402881] ath10k_pci 0000:01:00.0: wmi unified ready event not received

Try the current master; the CT firmware in it is fixed and does not crash anymore.

It is very unlikely that the ath10k-ct firmware somehow damaged unrelated hardware. And gjallerigudis comment above shows logs from stock firmware, not ath10k-ct firmware. When the WMI ready event is not returned like that, it means the firmware and/or NIC hardware failed very early. With no crash to debug, it is hard to make progress, but if you reproduce these problems with ath10k-ct firmware, please open a bug and provide as many details as possible on how it happened.
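
To check which firmware a serial log came from, the `firmware ver` line is the place to look. Below is a minimal sketch, assuming that CT firmware version strings contain `-ct` (as in the `10.4b-ct-9980-fH-012-81e1edd` string quoted elsewhere in this thread). The function name `summarize_ath10k_log` is made up for illustration, and the sample lines are copied from the log above.

```python
import re

# Matches the version token in lines such as
# "ath10k_pci 0000:01:00.0: firmware ver 10.4.1.00030-1 api 5 ..."
FW_VER = re.compile(r"firmware ver (\S+)")
WMI_TIMEOUT = "wmi unified ready event not received"


def summarize_ath10k_log(text):
    """Report firmware versions seen in a kernel log and whether WMI ready timed out."""
    versions = FW_VER.findall(text)
    return {
        "firmware_versions": versions,
        # Assumption: CT firmware builds carry "-ct" in their version string.
        "is_ct": ["-ct" in v for v in versions],
        "wmi_ready_timeout": WMI_TIMEOUT in text,
    }


sample = """\
[   15.075895] ath10k_pci 0000:01:00.0: firmware ver 10.4.1.00030-1 api 5 features no-p2p crc32 d2901e01
[   21.402881] ath10k_pci 0000:01:00.0: wmi unified ready event not received
"""

print(summarize_ath10k_log(sample))
```

For the sample, the version string has no `-ct` marker, which is consistent with the log being from stock firmware.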

@greearb
I get the same error with both ath10k-ct and ath10k, as mentioned in my previous post. To be clear: I had been using the ath10k-ct driver for about 3 weeks when one day my router hung and became unresponsive. I rebooted several times and it still hung. I then checked the serial log: it hangs after "wmi unified ready event not received". I re-flashed my router with the original ath10k driver and got the same message. You may not believe it, but I also built the firmware with debugfs and kernel debugging enabled, and I still got no crash dump; my serial connection became unresponsive and I had to reboot the router again. I tried everything, and the only solution I have found is to remove the pcie0: pci@1b500000 definition, which is the 802.11bgn chip, from the dts file.

Here's my previous topic about it.

Just had multiple crashes on my r7800 here, less than 2 hours uptime. Opened an issue on the github.

I get this in my log using the CT driver, not sure what it means but sometimes mobile devices can't find services/sites. Not sure if it's related.

[507337.897825] ath10k_pci 0001:01:00.0: Invalid peer id 451 or peer stats buffer, peer:   (null)  sta:   (null)

Uptime of 6 days here now, no crashes or errors.

ath10k-ct-htt driver and fw 10.4b-ct-9980-fH-012-81e1edd on OpenWrt SNAPSHOT r8858-852fc0ba0a up 5 days no issues other than infrequent (say once a day) sta drops that I can't track down. This may not be ct driver/fw related.

It should be 7 days up, but there was one router crash after 2 days up (and no apparent issues with ct driver/fw). The router had to be power cycled. I have no log from the event so it is indeterminate if this crash is ct driver/fw related or not.

Well, still crashes on my R7800, github issue created.

I rather heavily pushed for making ath10k-ct the default, mainly because Ben is interested in fixing bugs whereas QCA doesn't seem to care at all.

My main issue with ath10k:
https://bugs.lede-project.org/index.php?do=details&task_id=333
https://bugzilla.kernel.org/show_bug.cgi?id=188201

As you can see in the upstream bug report, none of the Qualcomm employees bothered even commenting on the issue. Since QCA employees are the only ones who have access to the "upstream" firmware, and they don't respond at all, I consider ath10k dead and ath10k-ct is the only option forward.

And while I totally understand, and kinda agree, it needs to be taken into account that we're substituting 0 people working on it for 1 person. And ath10k works not only on qca988x, but on qca99xx and more, which depending on the chip have their own firmware and bugs associated with them. -ct up until recently worked worse than stock, it's still common for it to crash, and bugs still pop up that were not present in stock. This is because Ben caters his firmware less to the common use case and more to the hardware his company is producing. This is not a criticism, mind you, but an observation. IMHO there is no good answer on what to do really, since QCA dropped the ball here pretty badly.

Here's a question I would humbly propose: what would it take for a team to reverse-engineer the firmware like in the ath5k days? We have monetary tools now that we didn't have in the ath5k days, like Patreon. I could imagine we could set up a bounty for something like this.

I don't think anyone really wants to do reverse engineering unless you want to fork out a lot of cash, even then it would be quite a task. Just look at the Allwinner video acceleration campaign for instance.

That said, I'm a bit surprised by the persistence in sticking with -ct, since it does seem to cause more issues than it fixes overall(?), and OpenWrt is usually very quick at backing out commits that break things. Ben is trying to fix issues, but I would be a bit concerned about how long he can keep up, since it's a one-man-band project to my knowledge.

Regarding the bugs, realistically I'm not sure how much time they want to spend on troubleshooting a specific device that's been discontinued for more than 2 years (as of writing). I'm sure there are underlying issues, but I would guess that they're trying to cover as much ground as possible with as little effort as possible. I'm sure you'd get much more attention if, let's say, the iPhone 8 crashed the firmware, as it's a much larger userbase. In that regard I'd guess the same goes for mwlwifi: there are edge cases etc., and it's overly naive to demand that they reproduce all scenarios reported.

Worth keeping in mind is that 11ax is starting to pop up so focus is most likely shifting on a corp level as sales > *.

As I am the author of the linked kernel bug report, I'd like to chime in. The issue has been gone for me for quite a while now. I am not exactly sure what solved it, but I have suspected it might have been the ieee80211w setting (management frame protection), which I disabled at some point. At least that was the only configuration change I made between the bug report and when the issue stopped occurring. I did however also update my OpenWrt builds from time to time, so who knows. Another reason I didn't try to track this down further is that one of the two Nexus 5X devices died in the meantime, and with only one device it took more time to trigger the bug.

And one word of fairness regarding QCA: I did take this bug to the ath10k mailing list (if I recall correctly) and I did get a response from a QCA employee asking for further info. I tried to collect as much as I could, but after that I never heard back anymore. So, I can't say nobody cared at all. But maybe they didn't care enough or they could simply not reproduce it, I don't know.

Uptime 11 days, new crash (not resulting in a reboot):

http://sprunge.us/DOlZ7s

Happy New Year, Ben!

So, I took the time to test the ath10k-ct firmware and driver against their upstream counterparts. The platform I used is qca988x-based (TP-Link Archer C7 v2).

First things first, the gap that I saw in throughput measurements before (last time I did this was more than a year ago) is gone. The performance is very similar now. Both drivers/firmwares allow for more than 400 Mbit/s throughput between the access point and a 2x2 ac wireless client (a 2014 or 2015 MacBook Air). I tested throughput using iperf with repeated measurements and averaged their results. The server was connected by wire to the same switch as the Archer C7 v2 (which runs in access point mode).

I did notice a few things though:

  1. With the upstream firmware and driver, there's not much difference between the client being the sender or receiver of data. The results are all between about 410-430 Mbit/s (with UDP transfers more towards the upper boundary and TCP more towards the lower). This is not the case with the ct driver and firmware. Here performance is slightly lower when the client is receiving data (380-400 Mbit/s). This is nothing I'd be worried about at all, but the difference seems to be consistent in both TCP and UDP measurements.

  2. I did test the combinations ath10k-ct driver with upstream firmware and ath10k-upstream driver with ct firmware as well, albeit not as long. While the combination ath10k-ct driver and upstream firmware worked fine, the upstream driver didn't seem to like the ct firmware. In fact, it was horrible. Packet loss was extremely high and the throughput only 8-10 Mbit/s!

A few more notes on the testing I did: Testing was done at night, so there wouldn't be as much wifi traffic in the neighborhood. The MacBook was placed about 2-2.5 m away from the access point (without obstructions) and was not moved during testing. The commands used for testing were:

iperf3 -c <server-ip>   # TCP client to server
iperf3 -R -c <server-ip>   # TCP server to client
iperf3 -u -b 867M -c <server-ip>   # UDP client to server
iperf3 -R -u -b 867M -c <server-ip>   # UDP server to client

I'm not an iperf expert, but I chose 867 Mbit/s here as it is the theoretical maximum bandwidth of the 2x2 client and then looked at the actual throughput taking the packet loss into account.
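
The "actual throughput taking the packet loss into account" step can be sketched roughly as below. This is only an illustration of the arithmetic as I understand it (offered rate times the fraction of datagrams that arrived, averaged over repeated runs). The function names are hypothetical, and the numbers in `runs` are placeholders chosen to make the arithmetic easy to follow, not measurements from these tests.

```python
from statistics import mean


def udp_goodput_mbps(sent_mbps, lost_percent):
    """Effective UDP throughput: the sent rate scaled by the share of packets delivered."""
    return sent_mbps * (1 - lost_percent / 100)


def average_goodput_mbps(runs):
    """Average the effective throughput over repeated (sent_mbps, lost_percent) runs."""
    return mean(udp_goodput_mbps(sent, lost) for sent, lost in runs)


# Placeholder values only: (rate offered by the sender in Mbit/s, reported loss in %).
runs = [(867.0, 50.0), (800.0, 50.0)]

print(average_goodput_mbps(runs))
```

Offering more than the link can carry (here the 867 Mbit/s theoretical maximum) simply shows up as loss, so the product is what the link actually delivered.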

I used recent builds from the master branch (ar71xx) for testing.

escalade: I found your crash, but in general, please open a bug at ath10k-ct instead of just posting to forums; I don't reliably read forums.

silentcreek: For your performance results, do you mean that the download speed is less than upload when using ct firmware?

For now, I plan to ignore the issue of stock driver not liking the ct firmware. I can debug it later in case there is ever a real need to run that test case.

@hnyman

How do you avoid them being selected by default again by a defconfig? Which then results in this:

Collected errors:
 * check_data_file_clashes: Package ath10k-firmware-qca99x0-ct wants to install file /media/MyBook/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_glibc_eabi/linux-ipq806x/target-dir-f7e20ea3/lib/firmware/ath10k/QCA99X0/hw2.0/board-2.bin
	But that file is already provided by package  * ath10k-firmware-qca99x0

Yes, downloading (iperf with the -R switch on the client) was slightly slower than uploading on the client. Not as much as I would care about it, but I found it interesting as it showed a pattern both with TCP and UDP. However, it didn't occur when I coupled the upstream firmware with the ct driver, so I assume it's related to the firmware itself.

If you run make menuconfig, deselect the ct driver and firmware packages and select the original ath10k driver and firmware packages, you will end up with these lines somewhere in your configuration:

CONFIG_PACKAGE_ath10k-firmware-qca988x=y
# CONFIG_PACKAGE_ath10k-firmware-qca988x-ct is not set
[...]
CONFIG_PACKAGE_kmod-ath10k=y
# CONFIG_PACKAGE_kmod-ath10k-ct is not set
# CONFIG_PACKAGE_kmod-hwmon-core is not set

If you run ./scripts/diffconfig.sh > my_diffconfig, these will be preserved.
Note: At least on my platform the ath10k-ct driver depends on the kmod-hwmon-core package, while the default driver does not (nor do any of my other packages), so it's deselected as well. Depending on your hardware/configuration, you may still need the kmod-hwmon-core package.
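
To spot the clash before building, a small check over the .config can help. This is a rough sketch, not part of the OpenWrt tooling: it assumes the ct counterpart of a package is named by appending `-ct`, which holds for the packages named in this thread, and the function names are invented for the example. Only `=y` entries are counted, since those are the ones that end up in the image together.

```python
PREFIX = "CONFIG_PACKAGE_"


def enabled_packages(config_text):
    """Return the names of packages set to =y (built into the image)."""
    enabled = set()
    for line in config_text.splitlines():
        line = line.strip()
        # "# CONFIG_PACKAGE_x is not set" lines start with "#" and are skipped.
        if line.startswith(PREFIX) and line.endswith("=y"):
            enabled.add(line[len(PREFIX):-len("=y")])
    return enabled


def ct_conflicts(config_text):
    """List packages that are enabled together with their '-ct' counterpart."""
    pkgs = enabled_packages(config_text)
    return sorted(p for p in pkgs if p + "-ct" in pkgs)


# Fragment as shown above: only the non-ct packages are selected.
good = """\
CONFIG_PACKAGE_ath10k-firmware-qca988x=y
# CONFIG_PACKAGE_ath10k-firmware-qca988x-ct is not set
CONFIG_PACKAGE_kmod-ath10k=y
# CONFIG_PACKAGE_kmod-ath10k-ct is not set
"""

# Hypothetical fragment reproducing the board-2.bin clash from the error above.
bad = """\
CONFIG_PACKAGE_ath10k-firmware-qca99x0=y
CONFIG_PACKAGE_ath10k-firmware-qca99x0-ct=y
"""

print(ct_conflicts(good))
print(ct_conflicts(bad))
```

An empty list for your own .config means neither firmware nor driver is selected twice.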

In addition, you can also choose to build one of the firmware packages as a module that does not get included in the firmware image. That should avoid the error as well and gives you the option to manually install the other firmware package (using the opkg --force-overwrite option), so you can test the two firmwares against each other. In this case I would make sure to make a backup of the file /lib/firmware/ath10k/QCA99X0/hw2.0/board-2.bin though, as it will get deleted once you uninstall either of the two firmware packages.

Escalade: I opened a bug for your crash file, see link below. It has a binary you can test: