Xiaomi AX3600 / IPQ8071A ethernet "crashes" sporadically

It's time to open a bug report:
The Ethernet interface on my AX3600 has now crashed for the third time in 3 days. I am running the current snapshot without any additional NSS patches or similar.

This time, I had:
a) another AP (let's call it AP2) connected to WAN port of AX3600 (all 4 ports bridged as "br-lan", no VLAN or anything, no actual WAN)
b) Client connected to that AP2
c) Client running iperf3 -c against iperf3 -s running on the AX3600 (actually, I wanted to benchmark the wifi of AP2...)

Interestingly, I still got a few bytes of traffic after it stopped working. Following is the log of iperf3 -s via SSH console that was open all the time:

Accepted connection from 192.168.0.169, port 35924
[  5] local 192.168.0.202 port 5201 connected to 192.168.0.169 port 35934
[ ID] Interval           Transfer     Bitrate
[  5]   0.00-1.00   sec  5.50 MBytes  46.1 Mbits/sec                  
[  5]   1.00-2.00   sec  5.61 MBytes  47.0 Mbits/sec                  
[  5]   2.00-3.00   sec  5.64 MBytes  47.3 Mbits/sec                  
[  5]   3.00-4.00   sec  5.91 MBytes  49.6 Mbits/sec                  
[  5]   4.00-5.00   sec  5.80 MBytes  48.6 Mbits/sec                  
[  5]   5.00-6.00   sec  4.17 MBytes  35.0 Mbits/sec                  
[  5]   6.00-7.00   sec  0.00 Bytes  0.00 bits/sec                  
[  5]   7.00-8.00   sec  0.00 Bytes  0.00 bits/sec                  
[  5]   8.00-9.00   sec  0.00 Bytes  0.00 bits/sec                  
[  5]   9.00-10.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  10.00-11.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  11.00-12.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  12.00-13.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  13.00-14.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  14.00-15.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  15.00-16.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  16.00-17.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  17.00-18.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  18.00-19.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  19.00-20.00  sec  0.00 Bytes  0.00 bits/sec                  
[  5]  20.00-21.00  sec  0.00 Bytes  0.00 bits/sec           

while the lines further down took some time to appear.
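
For anyone who wants to pin down the exact second the link died from a saved log, here is a minimal sketch that parses the interval lines of an `iperf3 -s` log like the one above and reports the first interval with zero throughput. The function names `parse_intervals` and `first_stall` are made up for this example, and it assumes the log uses the default human-readable iperf3 format shown above.

```python
import re

# Matches iperf3 interval lines such as:
#   [  5]   5.00-6.00   sec  4.17 MBytes  35.0 Mbits/sec
INTERVAL_RE = re.compile(
    r"\[\s*\d+\]\s+(\d+\.\d+)-(\d+\.\d+)\s+sec\s+"
    r"([\d.]+)\s+([KMG]?)Bytes\s+([\d.]+)\s+([KMG]?)bits/sec"
)

# iperf3 prints bitrates with decimal prefixes.
UNIT = {"": 1.0, "K": 1e3, "M": 1e6, "G": 1e9}


def parse_intervals(log_text):
    """Return a list of (start_s, end_s, bits_per_second) tuples."""
    rows = []
    for line in log_text.splitlines():
        m = INTERVAL_RE.search(line)
        if m is None:
            continue  # header, "Accepted connection" and similar lines
        start, end = float(m.group(1)), float(m.group(2))
        rate = float(m.group(5)) * UNIT[m.group(6)]
        rows.append((start, end, rate))
    return rows


def first_stall(rows):
    """Start time of the first interval with zero throughput, or None."""
    for start, _end, rate in rows:
        if rate == 0.0:
            return start
    return None
```

Fed the log above, `first_stall(parse_intervals(text))` points at the interval starting at 6.00 s, right after the rate had already dropped to 35.0 Mbits/sec.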

After no more traffic got through, I connected via wifi directly to the AX3600, which still worked. There was nothing in dmesg or logread, just the 2 VPN clients complaining about missing internet.
rmmod qca_nss_dp immediately rebooted the AX3600.

How can I help debug this?

Hm, unfortunately I cannot reproduce the issue.
What I was "hiding" in the post above:

  • the AX3600 also had two OpenVPN client sessions running, but neither had much traffic
  • There were 4 SSIDs on both main radios, the 3rd ath10k radio was disabled.

For the following tests, I removed all SSIDs and created a single new test SSID. I also removed all VPN configurations and all other network interfaces (the wan and lan sides for the VPNs, but nothing VLAN related, i.e. I am still not running any VLAN).

The AX3600 got replaced by a Netgear WAX206 in production (that one is doing great, wifi performance slightly worse than AX3600, but that's fine), so I have it on my table now.
I pushed iperf3 traffic to and through the device via ethernet and via wifi as well as "switched" (e.g. from one ethernet port to iperf3 server running on another device connected via ethernet) and could not reproduce the issue.

A few notes though:

  • Pushing a full 1 Gbit/s of ethernet traffic through the switch loads one core to 90% SIRQ (governor "schedutil", staying at the default of about 1 GHz) or 65% (governor "performance" at 1.38 GHz)
  • Additionally pushing 400 Mbit/s via wifi onto iperf running on the system loads the first core fully and 2 more cores to 60% (mostly napi)
  • At the same time, the 1 Gbit/s ethernet destination iperf server running on a "slow" Fritzbox 7530 has less than 10% CPU utilization in total (on performance governor/700 MHz)

-> The default ethernet/bridge/switch implementation on the IPQ8071A seems to use a lot of CPU resources
-> The wireless implementation with ath11k also seems to use a lot of CPU resources in comparison to ath10k/mtk devices
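
To put numbers like the SIRQ percentages above on a comparable footing across devices, one option is to sample /proc/stat twice and compute the softirq share per core. The following is only a sketch: the helper names are hypothetical, and it assumes a Python 3 interpreter is available on the device (or that the two samples are copied to another machine). The field order of the cpu lines follows the Linux /proc/stat layout (user, nice, system, idle, iowait, irq, softirq, steal, guest, guest_nice).

```python
def parse_cpu_line(line):
    """Split one 'cpuN ...' line from /proc/stat into (name, [jiffies, ...])."""
    parts = line.split()
    return parts[0], [int(x) for x in parts[1:]]


def softirq_share(before, after):
    """Percent of elapsed CPU time spent in softirq between two samples.

    'before' and 'after' are the same cpu line read at two points in time.
    """
    _, b = parse_cpu_line(before)
    _, a = parse_cpu_line(after)
    delta = [y - x for x, y in zip(b, a)]
    # Only the first 8 fields are summed: guest time is already
    # accounted for inside user/nice.
    total = sum(delta[:8])
    if total <= 0:
        return 0.0
    return 100.0 * delta[6] / total  # index 6 is the softirq column


def read_cpu_lines(path="/proc/stat"):
    """Return {cpu_name: raw_line} for the per-core lines (cpu0, cpu1, ...)."""
    lines = {}
    with open(path) as f:
        for line in f:
            name = line.split(" ", 1)[0]
            if name.startswith("cpu") and name != "cpu":
                lines[name] = line
    return lines
```

Calling `read_cpu_lines()` once, waiting a few seconds while iperf3 runs, calling it again and passing matching lines to `softirq_share` should give per-core values comparable to what top shows in its sirq column.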

Anyway, performance-wise I don't have a problem. But I need help reproducing and fixing the ethernet hangups as that's the showstopper :frowning:

I did not update the firmware between the tests, i.e. everything is as it was when the crashes in the opening post happened.

Do you still have the wan port in br-lan, and are you using that port?

It is in there, but unused/disconnected (and also disconnected during the past crashes, but always in there)

In your previous post I read that it was in use.

Oops, thanks, I might investigate that. My documentation here was maybe better than my memory. But as I understand it, all ports of the AX3600 should be equal, i.e. all of them are just PHYs connected to the IPQ8071A. Anyway, I'll run more tests later. Thanks!

Hm, still no success in reproducing the problem :frowning: If no more ideas come up, I guess I'll keep it on the shelf for now and maybe find a use case for it again later, or some environment to reproduce the issue.

I seem to be having the same issue. Sometimes it can be 'reset' by unplugging the ethernet cable and plugging it back in. Other times, however, it can only be reset by using the reset button or by power-cycling the device.

Just replying to see if the cause has been found yet, and whether others are having the same issue.

I have also seen this problem sporadically, but I don't know what the actual cause is. A couple of times, my wife told me in the morning that the internet was not working, after a whole night with no traffic or only light traffic.

Same here. It has been happening since last week, but the frequency seems to be increasing now. I have zerotier running, other than that nothing fancy. I'll try disabling zerotier and see how it goes.

Another note: I monitored the router using node exporter, but there seems to be nothing extraordinary when this happens.
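
In case general metrics keep showing nothing, one thing that might help catch the moment of the hang is comparing two samples of /proc/net/dev per interface. This is a rough sketch with made-up helper names. The heuristic (TX packets still increasing while RX packets stay flat) is purely an assumption about what the hang might look like, not a confirmed symptom, and a quiet port can trigger it as well.

```python
def parse_net_dev(text):
    """Parse the contents of /proc/net/dev into {iface: {rx_packets, tx_packets}}."""
    counters = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # the two header lines contain no colon
        name, data = line.split(":", 1)
        fields = data.split()
        if len(fields) < 16:
            continue  # not a regular counter line
        counters[name.strip()] = {
            "rx_packets": int(fields[1]),  # 2nd receive column
            "tx_packets": int(fields[9]),  # 2nd transmit column
        }
    return counters


def suspect_interfaces(before, after):
    """Interfaces that kept sending but received nothing between two samples.

    Heuristic only (an assumption, see above).
    """
    suspects = []
    for name, new in after.items():
        old = before.get(name)
        if old is None:
            continue  # interface appeared between the samples
        rx_delta = new["rx_packets"] - old["rx_packets"]
        tx_delta = new["tx_packets"] - old["tx_packets"]
        if tx_delta > 0 and rx_delta == 0:
            suspects.append(name)
    return sorted(suspects)
```

Running this from cron every minute and writing the result to a file would at least give a timestamp to correlate with other logs.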