Cake QOS R7800, expectations

I seem to be having issues with the 5 GHz wireless on the R7800. I’m not sure why, but it doesn’t seem very stable. I had a look in the kernel log and I’m seeing this:

[300883.153798] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[344241.648315] ath10k_warn: 6 callbacks suppressed
[344241.648329] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[344241.652033] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[344241.660218] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[344241.668452] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[344241.676669] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode
[344241.684977] ath10k_pci 0001:01:00.0: received unexpected tx_fetch_ind event: in push mode

I don’t lose the WiFi connection, but the router becomes unresponsive, as if the internet had gone down, which it hasn’t. 2.4 GHz seems unaffected.

I have gone to the latest master now to see if that stabilises my 5ghz.

I had a thought that I hoped someone could clarify. I’ve been impressed so far with OpenWRT + cake, so much so that I returned the AC86U.

Scenario
Overall WAN bandwidth 64Mbps, Cake on;

1 x Netflix 4K Stream = 25Mbps

1 x Netflix 4K stream = 25Mbps

1 x Netflix 1080p stream = 10Mbps

That is a total of 60Mbps being downloaded. Say some more traffic appears on the network: with Cake, are the two 4K Netflix streams going to feel it first, while the lower 10Mbps stream would probably be OK? I know the point is not to starve devices, but would that still happen?

Without Cake, would this additional traffic make all three videos buffer? I’ve also dabbled with GeForce Now game streaming in the past, but that wants 50Mbps, which I imagine would get throttled real fast.

I agree with this prediction; the 25 Mbps hosts are closer to their fair share already.

I would propose you simply test that :wink: Theoretically it is not fully clear what happens, as TCP flows do not necessarily share a link fairly, but I think that in that situation any of the streams is prone to buffering, though not necessarily all three at the same time.

On theoretical grounds I agree, but again it should be easy to test this hypothesis.
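
The fair-share argument for the three streams can be sketched with an idealized max-min ("water-filling") allocation. This is only a toy model, not how cake is implemented: it assumes one demand per device, perfectly fair sharing of the full 64 Mbps, and a hypothetical fourth demand of 50 Mbps standing in for the extra traffic. The function and key names are made up for illustration.

```python
def max_min_fair_share(capacity, demands):
    """Idealized max-min fair allocation of `capacity` across `demands`.

    `demands` maps a (hypothetical) device name to the rate it wants.
    Demands below the current equal share are fully satisfied; the
    leftover capacity is then split equally among the rest.
    """
    allocation = {name: 0.0 for name in demands}
    remaining = dict(demands)
    left = float(capacity)
    while remaining and left > 1e-9:
        share = left / len(remaining)
        satisfied = {n: d for n, d in remaining.items() if d <= share}
        if not satisfied:
            # Everyone still wants more than an equal share: split evenly.
            for name in remaining:
                allocation[name] = share
            break
        for name, demand in satisfied.items():
            allocation[name] = float(demand)
            left -= demand
            del remaining[name]
    return allocation


if __name__ == "__main__":
    # Rates in Mbps; "extra" is an assumed additional 50 Mbps demand.
    demands = {"4k_a": 25, "4k_b": 25, "hd_1080p": 10, "extra": 50}
    for name, rate in max_min_fair_share(64, demands).items():
        print(f"{name}: {rate:.1f} Mbps")
```

In this toy model the 1080p stream keeps its 10 Mbps, while the two 4K streams and the extra traffic each end up with 18 Mbps, so the 4K streams are the ones pushed below what they want. Real TCP behaviour will be messier, which is why testing is still the better answer.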

After my factory reset and change to the master branch, my old DHCP entries for all connected devices remain. Does this come from the devices themselves? I have been having issues with Amazon Echos and Philips Hue lights not responding correctly since the factory reset.

Can anyone help with this? Whereas my 5 GHz was going down before, now it seems to be my 2.4 GHz. I can’t get WiFi stability when using OpenWRT:

[15340.622578] IPv6: ADDRCONF(NETDEV_UP): eth1: link is not ready
[15340.644416] br-lan: port 1(eth1.1) entered blocking state
[15340.644451] br-lan: port 1(eth1.1) entered disabled state
[15340.649056] device eth1.1 entered promiscuous mode
[15340.654298] device eth1 entered promiscuous mode
[15340.686326] br-lan: port 2(wlan0) entered blocking state
[15340.686356] br-lan: port 2(wlan0) entered disabled state
[15340.691134] device wlan0 entered promiscuous mode
[15340.714992] br-lan: port 3(wlan1) entered blocking state
[15340.715027] br-lan: port 3(wlan1) entered disabled state
[15340.719785] device wlan1 entered promiscuous mode
[15340.728070] br-lan: port 3(wlan1) entered blocking state
[15340.729278] br-lan: port 3(wlan1) entered forwarding state
[15340.734875] br-lan: port 2(wlan0) entered blocking state
[15340.739973] br-lan: port 2(wlan0) entered forwarding state
[15341.040640] device wlan0 left promiscuous mode
[15341.040780] br-lan: port 2(wlan0) entered disabled state
[15341.097590] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 0
[15341.097819] ath10k_pci 0000:01:00.0: peer-unmap-event: unknown peer id 0
[15341.352503] device wlan1 left promiscuous mode
[15341.352606] br-lan: port 3(wlan1) entered disabled state
[15341.442450] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 0
[15341.442562] ath10k_pci 0001:01:00.0: peer-unmap-event: unknown peer id 0
[15341.640414] ipq806x-gmac-dwmac 37400000.ethernet eth1: Link is Up - 1Gbps/Full - flow control off
[15341.640489] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[15341.653427] br-lan: port 1(eth1.1) entered blocking state
[15341.654362] br-lan: port 1(eth1.1) entered forwarding state
[15348.264772] ath10k_pci 0000:01:00.0: 10.4 wmi init: vdevs: 16  peers: 48  tid: 96
[15348.264805] ath10k_pci 0000:01:00.0: msdu-desc: 2500  skid: 32
[15348.346639] ath10k_pci 0000:01:00.0: wmi print 'P 48/48 V 16 K 144 PH 176 T 186  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[15348.347455] ath10k_pci 0000:01:00.0: wmi print 'free: 87020 iram: 26788 sram: 18240'
[15348.711163] ath10k_pci 0000:01:00.0: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[15348.711383] IPv6: ADDRCONF(NETDEV_UP): wlan0: link is not ready
[15348.728213] br-lan: port 2(wlan0) entered blocking state
[15348.728259] br-lan: port 2(wlan0) entered disabled state
[15348.733025] device wlan0 entered promiscuous mode
[15354.903829] ath10k_pci 0001:01:00.0: 10.4 wmi init: vdevs: 16  peers: 48  tid: 96
[15354.903862] ath10k_pci 0001:01:00.0: msdu-desc: 2500  skid: 32
[15354.987832] ath10k_pci 0001:01:00.0: wmi print 'P 48/48 V 16 K 144 PH 176 T 186  msdu-desc: 2500  sw-crypt: 0 ct-sta: 0'
[15354.988676] ath10k_pci 0001:01:00.0: wmi print 'free: 87020 iram: 26788 sram: 18240'
[15355.394297] ath10k_pci 0001:01:00.0: Firmware lacks feature flag indicating a retry limit of > 2 is OK, requested limit: 4
[15355.394478] IPv6: ADDRCONF(NETDEV_UP): wlan1: link is not ready
[15355.675036] br-lan: port 3(wlan1) entered blocking state
[15355.675098] br-lan: port 3(wlan1) entered disabled state
[15355.679961] device wlan1 entered promiscuous mode
[15355.685103] br-lan: port 3(wlan1) entered blocking state
[15355.689336] br-lan: port 3(wlan1) entered forwarding state
[15355.695282] br-lan: port 3(wlan1) entered disabled state
[15356.052435] IPv6: ADDRCONF(NETDEV_CHANGE): wlan1: link becomes ready
[15356.052741] br-lan: port 3(wlan1) entered blocking state
[15356.057912] br-lan: port 3(wlan1) entered forwarding state
[15419.464482] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[15419.464615] br-lan: port 2(wlan0) entered blocking state
[15419.469911] br-lan: port 2(wlan0) entered forwarding state

I have moved to hnyman OpenWrt 18.06-SNAPSHOT r7653-939fa07b04

Are there known stability issues? I find that my 5 GHz doesn't want to come up half of the time.

Using DFS channels requires a survey for radar events before the interface can be brought up; this takes at least one minute.

Appreciate the reply, but that isn’t my issue. Both the 2.4 GHz and 5 GHz are crapping out, and once the 5 GHz goes it takes a reboot of the router to get it to come back up. I moved over to 18.06 because, from looking around, there are some WiFi driver issues on the CT version. It’s hard to remember which ones I have tried, so I can’t say whether this latest change will work. So far I’ve not managed 3 days without one of the bands failing.

Even choosing a non-DFS channel will not bring back the 5 GHz when it craps out.

Client dependent?

Today it was 2.4 GHz and my Galaxy S8 could no longer get internet; the same thing happens on 5 GHz with iPads. WiFi is there and it looks as though the internet is down, but the problem appears to be WiFi related.

I suggest opening a new thread for the WLAN issues.

Getting back onto the cake topic,

My modem syncs at 67Mbps and without Cake enabled I get 62Mbps. I set Cake to use 63.65Mbps, so 95% of the sync rate. I noticed that on some DSLReports tests I’m dropping out of an A+ rating to an A or B. At the start of the test I sometimes get a big ping spike, and occasionally at idle as well for some reason. If I set my ingress to 60Mbps, the spike at the start seems to stay around 40ms, so A+ every time.

Is the ping spike indicative of how the connection will perform when the link is first being saturated? Is it worth losing 4Mbps for?

Okay, so at a measured goodput of 62 Mbps you will need at least:
62 * 65/64 * (1526/(1500-8-20-20)) = 66.1779011708 Mbps Sync
Your 67 Mbps is close, but the difference might indicate that your ISP has some traffic shaping at its BRAS/BNG level. Anyway, I would take this 66 Mbps as 100% for the gross rate. For ingress shaping one typically needs a larger bandwidth sacrifice to keep control over the queueing than for egress (in the limit you can set the egress shaper gross rate to 100% of the true link gross rate, as all you need to do is avoid overfilling the modem's buffers). 95% seems awfully tight for decent ingress/download shaping; we typically recommend using 85-90% of the ingress gross rate. BUT this is a trade-off between latency-under-load and bandwidth utilization which all users/networks will need to figure out for themselves, or to put it differently, this is a policy question.
That said, 100*63.65/66.18 = 96.18 % of the gross link rate seems tight enough to expect "back-spill" into the ISP's DSLAM buffers when data is rushing in, and hence occasional ingress latency spikes. Mind you, these spikes can also occur when the shaper is set to lower percentages of the true link speed, as they depend on the difference between the accumulated incoming rate to the DSLAM and the bandwidth from the DSLAM to the modem, so one cannot fully rule these out (e.g. in a DoS situation our post-bottleneck ingress shaping is basically useless).
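
For anyone who wants to redo this arithmetic with their own numbers, here is a minimal sketch of the two calculations above. The function and parameter names are made up for illustration, and splitting 1526 into a 1500-byte MTU plus 26 bytes of on-the-wire overhead is my assumption about how that number is composed; the 8+20+20 bytes are the per-packet headers subtracted in the formula.

```python
def required_sync_mbps(goodput_mbps, mtu=1500, wire_overhead=26,
                       header_bytes=8 + 20 + 20):
    """Estimate the sync rate needed to carry a measured goodput.

    Mirrors: goodput * 65/64 * (1526 / (1500 - 8 - 20 - 20)).
    `wire_overhead` (26) is an assumed split of 1526 = 1500 + 26.
    """
    payload_bytes = mtu - header_bytes   # user data per full-size packet
    wire_bytes = mtu + wire_overhead     # bytes each packet occupies on the link
    ptm_factor = 65 / 64                 # PTM: 1 extra byte per 64 payload bytes
    return goodput_mbps * ptm_factor * wire_bytes / payload_bytes


def shaper_percent(shaper_mbps, gross_mbps):
    """Express a shaper setting as a percentage of the gross rate."""
    return 100 * shaper_mbps / gross_mbps


if __name__ == "__main__":
    gross = required_sync_mbps(62)
    print(f"required sync for 62 Mbps goodput: {gross:.2f} Mbps")
    print(f"63.65 Mbps shaper = {shaper_percent(63.65, gross):.2f} % of gross")
    # The 85-90 % range mentioned above, applied to this gross rate.
    print(f"85-90 % of gross: {0.85 * gross:.2f} to {0.90 * gross:.2f} Mbps")
```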

This most likely is the effect of all measurement flows sending something like 10 packets towards the end host all at once, and while these are queued at the DSLAM everything else experiences delays. Unfortunately that cannot be fully avoided when instantiating a shaper downstream of the true bottleneck.

This tends to be something else, potentially transient CPU overload on your router (caused by, say, WiFi processing), which means that the shaper is not serviced for too long a period, and that immediately causes delay as well.

Which just means that this shaper bandwidth is well matched with the load generated by the number of flows from the test. Cake actually offers the "ingress" keyword to better deal with shaping ingress links (for example, it will not scale its packet dropping so that the shaper's egress rate matches the set rate, but rather so that the ingress rate matches the set rate, which works better for post-bottleneck shaping). IIRC, "ingress" will also deal better with the number of concurrent flows on the ingress link (more flows typically require a higher bandwidth sacrifice so that all flows get a sufficiently strong signal to slow down), and it does this transiently depending on the number of active flows. In short, if you use the ingress keyword you might get away with a smaller permanent bandwidth sacrifice, but that is still policy and you need to balance this for your own network according to your preferences.

Sort of. My take on this is that the pretty much synchronized nature of the flows in a speedtest results in, say, 16 flows each dumping 10 packets (assuming an initial window of 10) into the link simultaneously, resulting in packets worth up to
(16 * 10 * 1526 * 8) / (63.65 * 1000 * 1000) * 1000 = 30.6878240377 ms
of transfer time on the bottleneck link piling up at the upstream end of said bottleneck link. These will be queued by the DSLAM and any other packets, like the packets used to probe the latency will be stuck behind these and experience an additional 30 ms of delay. Does this make sense to you?
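
The burst estimate above is easy to replay with other flow counts or rates. A minimal sketch, with function and parameter names made up for illustration and the same simplification as above (all flows send their full initial window at exactly the same moment):

```python
def burst_delay_ms(flows, initial_window, wire_bytes, rate_mbps):
    """Time in ms to drain one synchronized burst over the bottleneck.

    Mirrors: (flows * initial_window * wire_bytes * 8)
             / (rate_mbps * 1000 * 1000) * 1000
    """
    burst_bits = flows * initial_window * wire_bytes * 8
    return burst_bits / (rate_mbps * 1000 * 1000) * 1000


if __name__ == "__main__":
    # 16 flows, initial window of 10 packets, 1526 bytes per packet on the wire.
    delay = burst_delay_ms(flows=16, initial_window=10,
                           wire_bytes=1526, rate_mbps=63.65)
    print(f"worst-case extra queueing delay: {delay:.1f} ms")
```

Fewer flows or a smaller initial window shrink the spike proportionally, which fits the observation that the spike size varies from test to test.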

You tell me :wink: This is a policy decision every network operator needs to make individually; there is no true answer to this. If you ask me, sure, I would sacrifice this 4 Mbps happily, but I would also not be ashamed to completely disable my shaper if I needed maximum bandwidth (temporarily). :wink:

To make this explicit, the "true link gross-rate" I am talking about here is not the sync rate, as VDSL2 uses PTM which will add one extra overhead byte for every 64 payload bytes, and hence the "100%" here equals upload-sync-rate * 64/65 or 100*64/65 = 98.46% of the sync rate.
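
As a rough sketch of what that means for the egress setting: the ceiling is the upload sync rate scaled by 64/65. The function name is made up for illustration, and the 20 Mbps upload sync rate below is purely a hypothetical placeholder, since the actual upload sync rate has not been posted in this thread.

```python
def ptm_gross_rate_mbps(sync_mbps):
    """True link gross rate after PTM's 64/65 encoding (1 extra byte per 64)."""
    return sync_mbps * 64 / 65


if __name__ == "__main__":
    upload_sync = 20.0  # hypothetical value; read the real one from the modem
    print(f"egress ceiling: {ptm_gross_rate_mbps(upload_sync):.2f} Mbps")
    print(f"as share of sync: {100 * 64 / 65:.2f} %")
```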

There can also be an issue at the start of speed tests if you have CPU frequency scaling: it can take time for the CPU to wake up and start processing packets.

Thanks @moeller0 and @dlakelan for the very detailed response, I am going to take some more time to read through as there is a lot of information to digest for me there. I do not really understand what I need to put in for the egress setting.

Here are a few of my speedtests to look at; they are with the bandwidth set to 64, 60, 59 and 55Mbps:

As you can see from the 64Mbps setting, I have the lag spike at idle; however, on the 60Mbps test I have a bigger lag spike than on the 64Mbps test.

From the testing that I have done, dropping down to 59Mbps seems to be the sweet spot: no big idle spikes, and neither download nor upload goes much above 40ms. Thanks again for the detailed reply.

Sometimes when I test I wonder if other network traffic may be causing some anomalies.

Speed test and lag results are dependent not just on your SQM settings but also on the behavior of your link at the ISP level. For example, I have GPON, which is more or less a shared fiber-optic link, where a single fiber is actually optically split. The shared nature of the link means that the actual speed I can expect depends on what the other people who share the link are doing at any given time. The result is that I never get a consistent "guaranteed" speed. It also seems to be the case that if I run a speed test and get a bad result, immediately running the speed test again will give a better result. This is particularly true on uplink. I personally think what's going on is that my ONT negotiates a set of time slices, and it can take several hundred ms to decide how many time slices are needed for the bandwidth I'm trying to use during the "ramp up" phase. The result is that I essentially always have some kind of transients in my speedtests.

The good news is that my QoS settings prioritize packets using DSCP and the sensitive flows don't actually experience these transients.

For example I ran ping -i 0.2 google.com during this speed test: http://www.dslreports.com/speedtest/45265094

My QoS prioritizes ICMP automatically at a pretty high level. And even though you can see a massive transient in the download results above, the ping response time never went above 11ms (about +4 ms from idle).

Unfortunately DSL Reports seems to use TCP to measure delays, and with a proxy in the way, and various things, it's not a totally reliable way to measure delay for sensitive flows like UDP game or voice RTP packets.

But you know what you are doing. In general I do not recommend doing that, as otherwise ICMP RTT becomes a bad proxy for normal latency-under-load performance :wink:

Yes, right. In my case I want just regular old pings to give me an approximation to the sensitive flow behavior rather than the general flow behavior so I'm doing this explicitly.

But the basic issue is that DSLReports uses TCP to measure delays and this isn't necessarily a great measurement of delay (better than nothing, but certainly not the final word).

TCP delay is a good measure of how DASH-type video streaming will work, and of some games that use TCP for their controls (in-browser games like slither.io do this, for example). VoIP packets, though, will be different to some extent, particularly if you use, say, diffserv4 or diffserv8 and have DSCP set by your VoIP phone or adapter, as will many games that use UDP, such as many first-person shooters or action games.

EDIT: in the end, what I think I'm trying to say is that the occasional isolated spike on a dslreports speed test is not necessarily indicative of something that requires changing SQM settings, it could be indicative of situations outside your control, and particularly with the diffserv SQM settings may not reflect behavior of VOIP or games either.

When pinging BBC.co.uk from my PC my ping is 9ms. If I then saturate the line, the highest I recorded was 30ms, but most of the time it was within a couple of ms of 9. I think that I could probably get away with 60Mbps, but just to be safe I will stay at 59Mbps.
