Add support for MikroTik RB5009UG

Yep, that's another patch that slipped out somehow :person_facepalming: Sorry, I'll check what happened there.

Edit: Well, I obviously missed the first patch adding the CONFIG_ARM_SBSA_WATCHDOG symbol; that should be fixed now. I shouldn't really be diffing patches at 1 AM. Curious whether that fixes your issue.

Yes, that does indeed fix the issue :slight_smile: Thanks a lot for these patches

Is PoE functional on OpenWrt? Really curious.

Haven't tried it yet (nor have I tried it on RouterOS). From a quick search, looks like it won't be straightforward :grin:

Does anyone know what functionality was broken on 5.15 on OpenWrt that was working on 5.10? Something routing-related. I can't find it anymore.

I think, but am not sure, that this board uses an mvpp2 interface?

If so, a new BQL support patch is here: https://github.com/semihalf-wojtas-marcin/Linux-Kernel/commit/44dacb14698b0e7756d427acd3591980057a70fb

The author claims nice latency reductions in the commit message.

Thanks! It uses mvpp2 indeed. I'll see if I can backport to 5.15 and will keep you posted.
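
Roughly what I have in mind, for the record (the patch filename and the 700- prefix are just placeholders; mvebu is the RB5009's OpenWrt target):

cp 0001-net-mvpp2-add-BQL-support.patch target/linux/mvebu/patches-5.15/700-net-mvpp2-add-BQL-support.patch
make target/linux/clean target/linux/compile V=s   # rebuild the kernel with the patch in place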

@dtaht I just built and flashed 5.15 with the patch (which applied cleanly on top of 5.15.72), but all connectivity is gone. SFP+ died, 2.5G doesn't come up, and the switch the router is attached to drops the link to 10 Mbps (on a Gbps link), with no ping there either.

I have no serial set up on this hardware, so I'm afraid you'll need someone else to find out what's going on here. Maybe @sumo can help out?

Wow. As first tries go, that must have hurt!!!! I'll talk to Marcin....

Hi @Borromini. This is extremely strange - the BQL change is totally unrelated to link handling. 2x 10G SFP+ and 1G work the same way in my setup with and without this patch. Just to double-check: is exactly the same baseline (just without the patch applied) behaving as expected?

@mwojtas Sorry for the noise, looks like a typical PEBKAC. Builds with the BQL patch boot just fine on both RB5009UGs I got. So looks like some late evening thing on my side. Thank you for the work.

@dtaht I'm seeing counters now; I suppose eth0 is what I should be looking at here?

root@RB5009UG+S+IN:~# cat /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
1000
0
66596
1879048192
0
root@RB5009UG+S+IN:~# ls /sys/class/net/
bonding_masters  br-lan.1         fiber            lo               p2               p4               p6               p8               sfp
br-lan           eth0             guest            p1               p3               p5               p7               pppoe-wan        wg0
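
For my own notes: if I read the sysfs layout right, those five values are the byte_queue_limits files in the order the shell glob expands them - hold_time, inflight, limit, limit_max, limit_min - so the limit has settled at 66596 bytes here. grep with an empty pattern labels each one:

grep '' /sys/class/net/eth0/queues/tx-0/byte_queue_limits/*
# prints filename:value pairs, e.g. .../byte_queue_limits/limit:66596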

@Borromini Thank you for the update. Please let me know how the testing goes.

Will do. Anything specific I should be looking out for?

yay! I wish bql had made it into even more drivers like virtio-net... so glad you persisted!

It's useful to monitor it with bqlmon while under stress. See: https://github.com/ffainelli/bqlmon
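
Building it from source is trivial if you don't want to wait on a package (the -i interface flag here is from memory; the README has the exact usage):

git clone https://github.com/ffainelli/bqlmon
cd bqlmon && make
./bqlmon -i eth0   # interface flag as I recall it; double-check against the README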

How many tx queues are exposed?

Even more useful, to me, would be to hit it hard with multiple iperf or flent flows and see at what level it stabilizes using (a tc sketch follows this list):

fq_codel # which doesn't split GSO packets and will be 64-132k
cake gso-split # which does, but it's higher overhead
on the mq interface - also cake gso-split #
cake bandwidth someprettylargenumber # As high as it can go. I am hoping we can shape 2GBit per core on this arch
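
Roughly, for the first, second, and last of those (eth0 and the 2gbit figure are placeholders; note that modern tc-cake spells the GSO option "split-gso", and enables it by default):

tc qdisc replace dev eth0 root fq_codel             # fq_codel, no GSO splitting
tc qdisc replace dev eth0 root cake split-gso       # cake, GSO splitting forced on
tc qdisc replace dev eth0 root cake bandwidth 2gbit # shaped as high as it will go
tc -s qdisc show dev eth0                           # check backlog/drops after each run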

In all these cases it interacts with the BQL subsystem (which just cut the latency and jitter on this chipset to a tenth or less. yay! HALF a ms, down from 12!! from 8 lines of code! what more could you want? :slight_smile: )

My goal (elsewhere on many threads) was to be able to observe and reduce interrupt latency to where things like BQL were running at below 32k at a gbit. I don't know what the right number would be at 10gbit or 2.5.

A related hope was to reduce the coalescing interval and, by lowering NAPI_POLL_WEIGHT on arm architectures to 8 or less, to start seeing major reductions in the microseconds between rx and tx, bringing OpenWrt's networking closer to a fluid model and, for all I know, speeding things up by using the cache more efficiently.
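
For the coalescing half of that, ethtool is the knob to poke (whether mvpp2 honours these requests is something to verify; the 30 below is purely illustrative):

ethtool -c eth0               # show the current interrupt coalescing parameters
ethtool -C eth0 rx-usecs 30   # request a shorter rx coalescing interval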

I don't have any numbers for how fast the a72s can context switch, but the a53s could do it WAY faster than x86 could. The kinds of numbers we've inherited from the x86 world seem to be way too large and optimized for server workloads, not routing ones.

Another long-deferred piece of work is to make BQL multi-core aware, which would cut the amount of data outstanding across all the rings to just what is needed.

thx for doing this! Please add my joyfully-Acked-By?

My (now slightly less) long-term (formerly seemingly) crazy idea was to attempt to make this platform into an ISP fiber tower -> cake-based shaper, leveraging either libreqos.io or (more likely) bracketqos, which uses Rust, XDP, and eBPF: BracketQos - Rust, EBPF, XDP, & cake, oh my!

I suppose this is the answer to that question? Looks like 8 Tx queues? (Where does the imbalance with the Rx ones come from BTW?)

root@RB5009UG+S+IN:~# ls /sys/class/net/eth0/queues/
rx-0  rx-1  rx-2  rx-3  tx-0  tx-1  tx-2  tx-3  tx-4  tx-5  tx-6  tx-7

I will check bqlmon; looks like I'd need to package it for OpenWrt. That might take some time to get done, as I'm a bit rusty in that department.
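
In the meantime I can poll the sysfs values by hand, I suppose (assuming BusyBox watch is built in, as it usually is on OpenWrt):

watch -n1 'grep "" /sys/class/net/eth0/queues/tx-*/byte_queue_limits/inflight'
# prints the in-flight byte count for every tx queue, refreshed every second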

Any recommendations on how to set up flent/iperf testing? I suppose the RB5009UG should be sitting in the middle, right? I've got a 2.5G-capable desktop, but no 10G clients. I do have a switch that can do 10G and that I can run OpenWrt on; not sure how well it will perform as an endpoint though?
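
Something like this is what I'd run through the router, I think (192.168.1.2 standing in for the far-side box, which would need netserver running):

flent rrul -p all_scaled -l 60 -H 192.168.1.2 -t rb5009-bql -o rb5009-bql.png
# 60-second rrul test against the endpoint, plotted to a PNG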

It mostly sounds way over my head, but I believe you when you say it's crazy :wink:.

I've run a few flent tests on a 500/40 connection and I don't think that's achievable at the moment.

Cake CPU usage (layer-cake, bandwidth 500mbit, nat, dual-dsthost/srchost, docsis framing):

[CPU usage graph]

FQ Codel CPU usage (simple, 500mbit):

[CPU usage graph]
thank you for so rapidly and beautifully dashing my fantasies. :frowning: That's a LOT of softirqs. What tool are you using to produce these graphs? Which flent test(s) were these?

I take it you are using an HTB shaper + fq_codel (simple) for the second series?

I'm always sensitive about this: it's the shaper that is 90% of the workload with fq_codel, and the BQL work Marcin just did improves latency at line rate (so I was hoping to crack 2.5gbit easily there and scale down properly to 10mbit as well).

Are you running irqbalance? The performance governor?
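
Quick ways to check both (standard opkg and cpufreq sysfs paths):

opkg list-installed | grep irqbalance                      # is irqbalance even installed?
cat /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor  # current governor per core
echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor  # set one core; repeat per cpu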

Out of curiosity, what does cake look like unshaped on this connection, using sch_mq so it goes on the 8 tx queues? (I know you want to shape to 500, but...) The fastest way to try that is:

sysctl -w net.core.default_qdisc=cake   # make cake the default qdisc for newly created queues
tc qdisc replace dev eth0 root cake     # put something at the root so the next del forces a reset
tc qdisc del dev eth0 root              # deleting the root recreates the default: mq with cake per tx queue
tc -s qdisc show                        # verify, and watch the per-queue stats

I am not sure if this is an mqprio or mq device. Ideally what you would see here is sch_mq + 8 instances of cake.

It's silly to have more than one rx queue per core. The 8 tx queues either come from the idea that they are strict priority queues (usually mapping CS0-CS7), which is one of those ideas I'm not fond of but which is popular in some circles, or are an artifact of that (if they are more sanely configured as equal-priority queues). It makes the most sense to have one tx queue per core.
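
ethtool can show how the queues are actually carved up, and sometimes lets you shrink them (whether mvpp2 accepts -L here is an assumption worth testing):

ethtool -l eth0       # current and maximum rx/tx channel counts
ethtool -L eth0 tx 4  # try one tx queue per core (the RB5009's SoC has 4 cores)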