EcoNet Hardware NAT+QoS

I’m working on the EcoNet ETH and I’m trying to figure out exactly how the hardware NAT and QoS infrastructure works, and then make a decision about how we ought to leverage it.

My goal is to leverage hardware NAT through the Linux flowtable (offload) system, and then make hardware QoS available to that. For any packet that is being sent/forwarded in software, hardware QoS support is a non-goal because “just use CAKE”.

My understanding of the QoS system is as follows:

  1. There is register space for 32 channels, though (for some reason) the code says that only 8 channels can be used on the LAN side.
  2. For each channel, there are 8 “queues”; when sending a packet, you specify the channel and queue number. For some unknown reason, the packet descriptor gives 8 bits for specifying the channel number.
  3. Packets are sent from each queue using Weighted Round Robin; weights can be set for each channel/queue. I think this weights the queues against one another within a channel, rather than weighting channel+queue pairs against each other.
  4. In the xPON context, it is implied that channel maps to T-CONT / GEM port or LLID, i.e. a virtual link that is made in GPON or EPON. It is unknown what effect using different channels has on the LAN side.
  5. For this reason, you are allowed to convert some of your channels into virtual channels, which act as queues within a channel. When you do this, you can then configure Weighted Round Robin weights between the virtual channels. You can opt for either 2 or 4 virtual channels per physical channel. I think that when you opt for 4 (for example), channels numbered 0,1,2,3 all become physical channel 0. This allows you to have 32 queues on one channel (and if the other physical channels would send the packet to the wrong place, that means the device effectively has 32 queues).
  6. You can set a per-queue static threshold so that when more than this amount of queue buffer is used, packets in this queue begin to drop (there is a secondary number for DEI-marked packets). Queue N must be configured the same across all channels, but each channel does have its own queue N, so when one fills up it doesn’t affect the others. I imagine the parameter is a proportion of total queue space. This can also be set dynamically by the hardware.
  7. You can set a rate limit based on the channel number. Again it is my expectation that virtual channels qualify.
  8. You can also enable per-channel bandwidth metering and then read the meters.
  9. There is also a debug counter which can count packets or bytes on a channel (all queues), a queue (all channels), or a channel+queue. The code claims there are 40 counters but IO memory dumps imply there might be 64 in total.
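To make the virtual-channel arithmetic from points 4 and 5 concrete, here is a small sketch. The helper names, the fixed 4:1 mode, and the exact channel-number encoding are my assumptions for illustration, not anything taken from the vendor code:

```c
#include <assert.h>

/* Assumed 4:1 virtual-channel mode: descriptor channels 0..3 all land on
 * physical channel 0, 4..7 on physical channel 1, and so on. */
#define VCH_PER_PCH   4
#define QUEUES_PER_CH 8

/* Descriptor channel number for a (physical, virtual) channel pair. */
static inline unsigned int desc_channel(unsigned int pch, unsigned int vch)
{
    return pch * VCH_PER_PCH + vch;
}

/* Physical channel the hardware actually sends on. */
static inline unsigned int phys_channel(unsigned int desc_ch)
{
    return desc_ch / VCH_PER_PCH;
}

/* Flat queue index within a physical channel:
 * 4 virtual channels * 8 queues = 32 effective queues per physical channel. */
static inline unsigned int flat_queue(unsigned int desc_ch, unsigned int q)
{
    return (desc_ch % VCH_PER_PCH) * QUEUES_PER_CH + q;
}
```

So under this assumed encoding, descriptor channel 3 queue 7 would be the 32nd effective queue on physical channel 0.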

Now the part that I don’t know much about:

There is a mysterious field (in the xPON send message) called tsid (Traffic Shaper ID), which along with its sibling tse (Traffic Shaper Enable) seems to control some kind of traffic shaping / marking capability. tsid is used in the code; in one instance it is set from xpon_info->tsid, which is in turn set by CAR_QUEUE_NUM. This symbol CAR_QUEUE_NUM is used in xmcs_set_qos_policer_creat(), a function which explicitly does not run on EN751221 (TCSUPPORT_CPU_EN7521 is always set on EN751221 xPON chips). The code path from xmcs_set_qos_policer_creat() goes dead as parts were removed from that repo, but qdma_set_txqueue_trtcm_params() is present in another repo, and it sets register QDMA_CSR_TRTCM_CFG, which is absent in the EN751221 QDMA (the address lines up with a static queue threshold configuration reg). However, tsid is still (very intentionally) present in the descriptor, right below the TCSUPPORT_CPU_EN7521 guard which should be excluding its usefulness.

A couple of comments seen in the code:

/* in EN7580, tsid=0x7F means no sharping, tsid=0~126 measn sharping */

/* In MT7510/20, bit5~1 and bit0 of PPE foe's tsid field stand for tsid and tse respectively */

My questions at this point:

  1. Is 8 LAN channels an actual hard limit? If so, is it caused by QDMA1 being different from QDMA2, or is it caused by the LAN peer (switch) being unable to handle so many channels? If I enable 4:1 virtual channel mapping, does it really go down to 2 physical channels, or can I have all 32 channels as 8 physical channels with 4 virtual channels each?
  2. Is there any problem with sending on more than one (physical) channel? Assuming we’re using either the LAN, or the ethernet WAN port on an EN7513 DSL modem.
  3. Is tsid usable for getting some additional traffic control that I didn’t know about?

Part 2: Hardware NAT (Packet Processing Engine)

This is largely the same as the Mediatek PPE, and if our goal were non-QoS forwarding, we could probably get away with just copying it. But if we’re not doing QoS on hardware-forwarded traffic, then there’s no point in doing QoS at all; for software-forwarded traffic, “just use CAKE”.

So we need to adapt the hardware forwarding rules to also tag the packets with QoS channel+queue information.

We don’t have the source code to the actual EcoNet PPE controller, but it seems to be based on ra_hwnat_7510. The way the vendor code works, there is a userland data structure called struct hwnat_tuple; that structure is passed through a procfs call into a kernel module which transforms it into a struct FoeEntry, which is what the hardware is able to understand. Fortunately we have the code for struct hwnat_tuple, and it is the same as ra_hwnat_7510’s struct hwnat_tuple with the following additions:

    unsigned short in_vlan;
    unsigned int hsk_l2;
    unsigned char channel;
    unsigned char new_l2b;
    unsigned int hsk_mc;
    unsigned short in_gre_call_id;
    unsigned short out_gre_call_id;

One of those is obviously pretty important to us because it’s the channel. The ra_hwnat_7510 AddFoeEntry function does set the queue ID:

    if (opt->dst_port == FP_QDMA_HW) {
        entry->ipv4_hnapt.iblk2.fqos = 1;
        entry->ipv4_hnapt.iblk2.qid = opt->qid;
    }

and importantly only if the destination port is set to QDMA_HW, something that matches with EcoNet. However there is no channel being set here.
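Purely to illustrate what an AddFoeEntry-style helper might do if it also honored a channel field (as the hwnat_tuple addition suggests), here is a sketch that packs the QoS bits into a single value. The bit placement, widths, and the function name are all guesses, not the real info_block2 layout:

```c
#include <assert.h>

/* Hypothetical packing of fqos | qid | channel into the low bits of an
 * info_block2-style word. Bit positions are assumptions for illustration. */
static unsigned encode_ib2_qos(int dst_is_qdma_hw, unsigned qid,
                               unsigned channel)
{
    if (!dst_is_qdma_hw)
        return 0;                       /* no hardware QoS for other ports */
    return 1u                           /* fqos: enable hardware QoS */
         | ((qid & 0x7u) << 1)          /* 8 queues per channel */
         | ((channel & 0x1fu) << 4);    /* 32 channels; placement is a guess */
}
```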

Airoha AN7581’s PPE driver has the channel and queue ID placed just below the (famous) tsid, which in the case of ra_hwnat_7510 would put it right in a reserved field, but ra_hwnat_7510’s info_block2 differs rather violently from that of Airoha, so it’s unclear which structure EcoNet is using if either.

In order to get to the actual correct struct FoeEntry, I think we may need to reverse engineer the binary from EcoNet, but fortunately the AddFoeEntry function is fairly self-contained, so reverse engineering it should be reasonably achievable.

Part 3: What to do about it

There are two PPE drivers in mainline Linux, those are Mediatek (MT7621) and Airoha (AN7581).

Mediatek

Mediatek does not really implement QoS. They expose all of the queues to the kernel through alloc_etherdev_mqs(), so that the kernel can schedule different traffic on different queues, but there is no prioritization between queues beyond the relatively clumsy mtk_set_queue_speed(), which allows setting a queue to 10Mb/s, 100Mb/s or 1000Mb/s using ethtool.

Mediatek does implement flowtable offloading by handling TC_SETUP_BLOCK/TC_SETUP_FT. When a rule is inserted, the queue is selected statically in mtk_flow_set_output_device() based on the physical port (Distributed Switch Architecture port, or upstream port) that will send the traffic. The queue is set in the lower 3 bits of info_block_2 (MTK_FOE_IB2_QID), or the lower 7 bits (MTK_FOE_IB2_QID_V2) if mtk_is_netsys_v2_or_greater().

Airoha

Airoha does implement QoS, in particular TC_SETUP_QDISC_ETS and TC_SETUP_QDISC_HTB. Airoha advertises 32+4 queues to the kernel; 32 are immediately usable while 4 are reserved for HTB rules. Airoha allows a maximum of 4 HTB rules, and they must be directly under the root (no hierarchy is allowed), therefore they can be easily implemented by imposing rate limits.

Outside of HTB channels, the 32 main queues are organized into 4 channels of 8 queues. The default queue selection (airoha_dev_select_queue()) uses the physical port to select from among the 4 channels, then the skb->priority field to select from among the 8 queues. If the packet matches an HTB rule then of course it does not follow this code path; instead it is assigned the proper queue ID from among the four “extra” queues.

Airoha also allows setting ETS for the 32 “main” queues; however, you can only set weights on the 8 queues within a channel, setting weights between the channels is not allowed.

Like Mediatek, Airoha does flowtable offloading by handling TC_SETUP_BLOCK/TC_SETUP_FT, and airoha_ppe_foe_entry_prepare() selects a queue based on the (destination) DSA port. As with Mediatek, the queue is assigned in the bottom bits of info_block2, in this case 5 bits (AIROHA_FOE_IB2_NBQ). Interestingly, AIROHA_FOE_CHANNEL and AIROHA_FOE_QID are never configured.

Discussion

Allocating a channel for each DSA port only really makes sense if you have a downstream chokepoint (i.e. a 10/100mb link) on the switch and you can do what Mediatek does and choke that channel (mtk_set_queue_speed()) so that traffic can be prioritized for it, rather than letting the switch drop traffic more-or-less at random. Otherwise, channels and queues should be allocated to users and applications. My opinion is that this isn’t really useful, the switch will do something, and in 2025, “don’t plug in a 100mb cable” is a reasonable thing to say.

Also, I don’t see the utility in HTB offload, especially if hierarchy is forbidden and there are no queues below an HTB channel. If I’m understanding the Airoha code correctly, the four HTB queues are incidentally protected on account of being a different channel than the other four physical channels, though it would probably be better to set WRR weights so that they are selected first. In any case, there is no way to select an HTB channel when doing flowtable offloading, so the whole system does not really meet my goals.

Generally speaking, I don’t think the rate limiting capabilities of the EN751221 are that useful because they are only per-channel and in any case, rate limiting is a big hammer which is rarely what you want.

Proposal

First I think we need to figure out just how many channels/queues we have available to us.

  • If each “physical” channel is indeed bound to a unique LLID / GEM, then we will only have 4 virtual channels * 8 queues = 32 queues available to us, but all have WRR weighting.
  • If we are able to use as many channels as we want, then we have 8 physical channels, each with 4 virtual channels, each of which has 8 queues, for a total of 256 queues; but there is no way to weight traffic between the 8 physical channels, we can only assume they are round robin.
  • If, as the code implies, we only have 8 channels available on the LAN side, then the smartest configuration would be 2 physical channels, each having 4 virtual channels, each of which has 8 queues, total: 64 queues. Again, we would have WRR weighting except between the 2 physical channels.

We probably need to support multiple different configurations, since upstream and downstream probably have different limits.

A simple MVP implementation would be to allocate a channel per user (NAT-local IP) and a queue per application (source port number), this could be done directly in the flowtable offload handler since we know the IP and port numbers already. This is easy, it will provide some fairness, and it doesn’t require implementing TC_SETUP_QDISC_*.
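As a concrete sketch of that MVP policy: channel from the NAT-local source IP, queue from the source port. The function names, the channel/queue counts, and the trivial modulo hashing are my assumptions for illustration, not driver API:

```c
#include <assert.h>
#include <stdint.h>

#define NUM_CHANNELS 8   /* assumed usable LAN-side channels */
#define NUM_QUEUES   8   /* queues per channel */

/* One channel per user: derived from the NAT-local IPv4 address.
 * On a typical /24 LAN the low byte is the distinguishing part. */
static unsigned int channel_for_user(uint32_t local_ip)
{
    return (local_ip & 0xff) % NUM_CHANNELS;
}

/* One queue per application: derived from the source port. */
static unsigned int queue_for_app(uint16_t src_port)
{
    return src_port % NUM_QUEUES;
}
```

Both inputs are already in hand inside the flowtable offload handler, so this needs no TC_SETUP_QDISC_* support at all.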

We could take this a little further by reading the DSCP off the packet header to decide which queue to place it in; you will get at least one packet during the setup of the session and you can read it during the tx phase. This is not the cleanest solution, but there is no support for setting the QoS info in the flow: there is FLOW_ACTION_PRIORITY, but this is only set by tc … skbedit, not by nft flowtable. With this change, we can then group traffic into “latency sensitive” (high priority, small buffer, drops in case of a burst) and “bulk” (low priority, big buffer, keep the pipe full). This would integrate nicely with OpenWrt’s qosify. This would require a background task to update PPE rules, because we may receive the TC_SETUP_FT call prior to having actually transmitted the first packet.
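The DSCP grouping could look like the sketch below. The specific queue numbers and DSCP-to-group assignments are my assumptions (roughly qosify-style defaults), not anything from the vendor code:

```c
#include <assert.h>
#include <stdint.h>

#define QUEUE_LATENCY 7  /* high priority, small buffer, drops on burst */
#define QUEUE_DEFAULT 4
#define QUEUE_BULK    0  /* low priority, big buffer, keep the pipe full */

/* Map a DSCP value (read off the first transmitted packet of the flow)
 * into a hardware queue. Groupings here are illustrative assumptions. */
static unsigned int queue_for_dscp(uint8_t dscp)
{
    switch (dscp) {
    case 46: /* EF: voice */
    case 40: /* CS5 */
    case 34: /* AF41: interactive video */
        return QUEUE_LATENCY;
    case 8:  /* CS1: lower effort / bulk */
        return QUEUE_BULK;
    default:
        return QUEUE_DEFAULT;
    }
}
```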

Beyond that, we might consider using the QDMA bandwidth meters (per-channel), counters (40 in total), and PPE counters (64 total) to watch for greedy flows and isolate them in their own low-priority queue so that they don’t harm the user, but this is getting to be fairly advanced and I’m not sure we’re going to make it that far.

WDYT?

Per my understanding, the SoC design philosophy is to have a QDMA on the WAN side and a QDMA on the LAN side. Thus the two sides can be asymmetric in features. Indeed it looks like one has 8 channels and the other has 32 channels.

QDMA can also only perform QoS in the Tx path. On the Airoha platform each port is mapped to a channel, which makes me think we should use the same model on EcoNet.

Regarding your questions about physical channels and channel mappings, I don’t know the answers, but what I do know is that the acceleration rules only have one channel setting and one for port and sp_tag. So the channel and port are not tied together, and you should be free to map flows to a channel and apply QoS rules on the channel.


So there is a high degree of freedom; you can do a lot, but I think that the only worthwhile QoS model is to map each port to a channel and only perform QoS in the US direction.

All upstream traffic should be mapped to one channel with 8 queues with SP among them. The rules mapping into the queues should be based on VLAN PCP (L2) or DSCP (L3).

Anything else, and especially downstream QoS, I would see as a waste of time. The work just to be able to generate the traffic needed to test the features is enormous.

One more thing: it is unclear how much of the hardware actually works. Going off the tracks from what the vendor code does might end up hitting hardware bugs. IIRC the MT7621 QDMA had hardware bugs in the QoS engine.


Indeed it looks like one has 8 channels and the other has 32 channels.

Yeah, the reason I was suspicious of this limit is because it seems to be the only difference between the QDMAs. The GDMs have a fair bit of difference, but the QDMAs are (AFAICT) identical - with the exception of this one thing.

QDMA can also only perform QoS in the Tx-path

Well yes, but since we’re always forwarding, we always do TX eventually.

On the Airoha platform each port is mapped to a channel which makes me think we should use the same model on Econet.

It seems Airoha copied it from Mediatek, and (if I’m reading this right) it’s downstream QoS based on the destination DSA port - in which case the bottleneck is, in most cases, behind you.

only worthwhile QoS model is to map each port to a channel and only perform QoS in the US direction

Right, shaping with the bottleneck behind you is a really special case, you need to impose a global rate limit below what you know the line can support, and then pray that nobody sends bursty uncooperative traffic. IMO there is no excuse whatsoever for an xPON upstream to not be running through a box with CAKE.

BTW: It seems there is actually a global rate limit in the GDM, so that’s neat.

That said, there is a case where downstream QoS is actually very important, and that’s when you have WOE, because the bottleneck is overwhelmingly likely to be the wifi - but EN751221 doesn’t support WOE so this is a Future Thought.

All upstream traffic should be mapped to one channel with 8 queues with SP among them. The rules mapping into the queues should be based on VLAN PCP (L2) or DSCP (L3).

It’s really best practice to segregate by user, otherwise one person running rsync over ssh that is mistakenly classed as high priority is going to ruin it for everyone else.

The work just to be able to generate the traffic needed to test the features is enormous.

As long as we’re not trying to prove behavior in exact real world scenarios, generating traffic is easy - and it’s pretty easy to decide if one strategy is better than another. For example, less impact on ping latency from a big upload is a good thing (as long as you’re not faking it by special-casing ICMP). IMO the biggest challenge is actually configuring the modem in a scenario where it will have congestion - but I think this should be possible by using the WAN port on an EN7513 DSL modem and forcing it down to 10Mb/s.


Going off the tracks from what the vendor code does might end up hitting hardware bugs

That’s a good point, but it’s a bit tricky to figure out what the vendor code actually does and doesn’t do, because the HWNAT module exposes all features and then expects userland apps to make the rules. That said, I don’t think what I’m proposing goes that far off what the vendor fw does; mostly what I’m talking about is a more advanced queue selection strategy which takes source IP and port into account.

I know that I will push your rules a bit here, but having a hardware traffic shaper would be helpful for the "just run CAKE/fq-codel" approach... as traffic shaping in software is quite expensive CPU-wise and being able to offload this would be sweet, even if this only really works in upload direction...


"just run CAKE/fq-codel" approach... as traffic shaping in software is quite expensive CPU-wise

I know that asking someone to run CAKE on a 900 MHz 34Kc is kind of a crap proposition, but the world is floating in devices that can very well handle 1Gb/s of CAKE - and it’s incredibly hard to get hardware QoS anywhere near as good as what CAKE does, so there’s a big risk of this project just making “another device that’s worse than CAKE”.

But combining hardware QoS with flowtable offload is something nobody has really done - except Mellanox, and their tech is extremely complex and the exact opposite of set-and-forget. So the way I see it, HW QoS + NAT flowtable offload is something actually new, and if we get something working really nicely, we can imagine eventually porting it to the 10Gb EN7580 and Airoha 7581/7583 which starts to be quite interesting because machines that can do 10Gb of CAKE start to be more costly and power demanding.

even if this only really works in upload direction

Not entirely true, it only really works when you’re behind the bottleneck, which if you have wifi then you probably are on the downstream…


Well, just configuring the ethernet WAN to 10Mbit/s will get you there.

So a 32 user maximum? Each user mapped to a channel based on some easily resolvable metric? I guess that would work, but I don’t think QoS is reliable. IMO it is best effort, and if you sit behind a GW I would hope that video/voice conferencing and games set the DSCP field in the L3 packets and other tools do not. I agree you could do better classification of the traffic, but then you need more compute power than what the EcoNet family of gateways can provide.

Anyway I think a “basic” QoS implementation (as I described before) would result in a good compromise with respect to what the hardware can deliver and what people would like to have. And as we are going for faster and faster connections I would argue that there is less need for QoS. If you always have available bandwidth then there is no need to do any QoS.

Anyway I think we can revisit this discussion when there is some QoS support implemented.

that is unrelated to hardware or software implementation

we might consider ... to watch for greedy flows and isolate them in their own low-priority queue so that they don’t harm the user

that sounds like an interesting approach. Also can the queues drop from head instead of tail?

Wondering what one could do with hardware on Realtek chips for IPv6 routing.

Depends on where the hardware QoS engine is tucked into the stack, it might insist on talking to a phy which on ingress we do not have available, but you are right, I am speculating here.

Once you run out of channels, you have to start pairing up users in the same channel - but 2 users is better than “all users”.

Each user mapped to a channel based on some easily resolvable metric?

Source IP

hope that video/voice conferencing and games sets the DSCP field

This is where it gets tricky because you don’t see the packet header at the time you receive the flowtable insert request. But at some point you will transmit the packet and get it, it just requires some book-keeping to match the packet as-sent to the rule which emerged because of it. IMO worth it though.

you need more compute power than what the EcoNet family of gateways can provide

The beautiful thing about offloading is you only have to do the computation once, and then setup the channel/queue priorities, tell the PPE to put that flow in that channel, and then it’s hands-off from there.

Anyway I think a “basic” QoS implementation (as I described before)

If you’re talking about flowtable offloading, even “basic” is not that basic because as I said, the flowtable insert command does not come with any DSCP info, so it’s actually easier to assign channel/queue by source IP + port, because this is something you definitely know. I do want to include DSCP, but doing so is just tricky.

If you’re talking about QoS on non-offloaded traffic, I don’t see the point - I mean it certainly makes no sense to implement QoS and also flowtable, when they can’t be used together.

Anyway I think we can revisit this discussion when there is some QoS support implemented.

Yeah, this thread has been useful because now I know pretty much what levers I have available in the hardware and what is possible on the kernel integration side. A lot of the decision making such as per-user, per-port, per-DSCP is something we can play with at the end because it’s basically going to be a “select channel” function.

Also can the queues drop from head instead of tail?

I was looking for that, and it seems the queues can only tail-drop. There is RED available, but it’s global, not per-queue.

Aren’t the flows built dynamically based on generated traffic? Anyway, there might be a disconnect between the kernel APIs and what is possible in hardware.

One more thing to add if TSO is enabled hardware QoS will not be performed on the packets that are split.

I’ve heard that there was some vendor code that did this in the past, but it was considered unreliable and problematic. The way the kernel guys decided this should work is you write an nft file like this:

table inet x {
        flowtable f {
                hook ingress priority 0; devices = { eth0, eth1 };
                flags offload;
        }
        chain y {
                type filter hook forward priority 0; policy accept;
                ip protocol tcp flow add @f
                counter packets 0 bytes 0
        }
}

And when flowtable f is created, the driver gets a call to ndo_setup_tc with type = TC_SETUP_FT, and it registers its callback. Then when flow add @f fires, the callback is called (airoha_dev_setup_tc_block_cb in Airoha) and the kernel uses TC_SETUP_CLSFLOWER; AFAICT this is because tc-flower does everything they need and there’s no reason to write a new classifier. Then it’ll push a couple of actions, which are just going to be FLOW_ACTION_MANGLE and FLOW_ACTION_CSUM and possibly FLOW_ACTION_REDIRECT. See: nf_flow_rule_route_ipv4 and nf_flowtable_inet.c.

The thing is, we have no knowledge of DSCP or skb->priority, and in fact nftables doesn’t really know either, because the packet has not yet visited the class/qdisc chain; type filter hook forward happens before that.

So the solution I propose is to work with flowtable, but also when we transmit a packet from software we check whether it’s related to a flow that we have inserted and then read QoS info off of it and update the rule as necessary.
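The bookkeeping for that could be as simple as the sketch below: mark a flow as pending when the flowtable rule is inserted, then resolve its DSCP the first time we transmit a matching packet from software, at which point the PPE entry can be updated. The table size, the key type, and all names here are illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_PENDING 64

struct pending_flow {
    uint64_t key;      /* e.g. a 5-tuple hash of the flow */
    uint8_t  dscp;     /* DSCP observed on the first tx packet */
    int      resolved; /* set once DSCP is known; PPE rule needs update */
    int      in_use;
};

static struct pending_flow pending[MAX_PENDING];

/* Called from the flowtable offload handler: DSCP not known yet. */
static void flow_inserted(uint64_t key)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (!pending[i].in_use) {
            pending[i] = (struct pending_flow){ .key = key, .in_use = 1 };
            return;
        }
    }
}

/* Called from the tx path; returns 1 if this packet resolved a
 * pending flow (i.e. the PPE rule should now be rewritten). */
static int flow_tx_seen(uint64_t key, uint8_t dscp)
{
    for (int i = 0; i < MAX_PENDING; i++) {
        if (pending[i].in_use && pending[i].key == key &&
            !pending[i].resolved) {
            pending[i].dscp = dscp;
            pending[i].resolved = 1;
            return 1;
        }
    }
    return 0;
}
```

A real implementation would want a hash table and locking, but the shape of the state machine is the point here: insert pending, resolve on first tx, update rule once.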

EDIT:

One more thing to add if TSO is enabled hardware QoS will not be performed on the packets that are split.

I have no interest in implementing TSO because it only matters for software received packets and the goal is to forward everything in hardware.

Well, wifi traffic is “software” traffic. But yeah, it is of limited use, though the more spare CPU cycles the kernel gets, the better it can forward traffic.

The EN751627 (the one which supports TSO) has WOE so wifi traffic would then be in hardware.

Something in the back of my head: What to do about wifi traffic on EN751221, it would be convenient to bridge the wlan with the LAN GDM so the software does minimal effort and the packets immediately enter the HW NAT / QoS flow. Not quite sure how that would be implemented though.

Does it matter? The traffic must pass through the CPU either way. wlan0 and lan0 will be in the br-lan bridge. Traffic going out through the WAN interface will be NATed through the PPE; bridged traffic will just be relayed via TDMA(PDMA) or QDMA.

Trying to make sure I understand the workflow so I know it will be right:

  1. Incoming packet from wlan on bridge
  2. Destination MAC is set to the MAC of the bridge (because we are the gateway)
  3. Somehow we need the packet to be bridged into the GDM with fport=PPE, not processed by the kernel
  4. Packet gets to the PPE and then it’s the normal hardware flow after that.

3 is the challenge

EDIT: FWIW, ChatGPT thinks that we can write this:

flowtable ft {
    hook ingress priority 0;
    devices = { eth0, eth1, wlan0, wlan1 };
    flags offload;
}

and when they insert a wlan → eth flow, Linux will be smart enough to make a software flowtable entry, so we just need to be smart enough to do the right thing when we get a transmit call on the WAN interface with an skb from a wlan device.