A while back I started this thread on the lack of accounting for the MT7621 hardware flow offloads.
After reading through the MediaTek SDK, I have a reasonable handle on how the accounting works in the packet processing engine, and I think it should be possible to keep accurate per-flow metrics with (typically) minimal performance impact.
The deal is that the MT7621 has 64 accounting groups, each of which would have to be dedicated to a single flow at a time. This would require treating the HW offload table as a 64-entry cache for all offloaded flows, with cache evictions/spills going to the SW offload path. For most uses, most of the time, this should be fine, as per-flow traffic is likely to follow a power law, i.e. the busiest flows contribute the bulk of the traffic, so most of the traffic would still be HW offloaded. On my home network there are O(200) flows at any one time, and assuming a power-law distribution, only a few of those will contribute the bulk of the traffic, while most will be pretty quiet.
The outlier case is where there are a bazillion (16k HW entries in the limit) flows, all with equal traffic - in which case the cache will either thrash, or only a tiny fraction of the flows will ever be HW offloaded.
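To make the cache idea concrete, here's a rough sketch (plain C, all names invented for illustration - this is not kernel code and not the mtk_ppe/mtk_hnat API) of handing out the 64 accounting groups LRU-fashion:

```c
#include <stdint.h>

#define NUM_ACCT_GROUPS 64

struct acct_slot {
	uint64_t flow_id;   /* whatever uniquely identifies the SW flow */
	uint64_t last_used; /* monotonically increasing "clock" */
	int in_use;
};

static struct acct_slot slots[NUM_ACCT_GROUPS];
static uint64_t lru_clock;

/* Return the accounting group to use for flow_id.  If the flow is already
 * resident its slot is refreshed; otherwise the least recently used slot is
 * evicted, and its previous occupant falls back to the SW offload path. */
static int acct_group_acquire(uint64_t flow_id)
{
	int free_idx = -1, lru_idx = 0;

	for (int i = 0; i < NUM_ACCT_GROUPS; i++) {
		if (slots[i].in_use && slots[i].flow_id == flow_id) {
			slots[i].last_used = ++lru_clock;
			return i;
		}
		if (!slots[i].in_use && free_idx < 0)
			free_idx = i;
		if (slots[i].last_used < slots[lru_idx].last_used)
			lru_idx = i;
	}

	int idx = free_idx >= 0 ? free_idx : lru_idx;
	/* Eviction point: the previous occupant of idx (if any) would need to
	 * be torn down from the HW table so its traffic is counted in SW again. */
	slots[idx].flow_id = flow_id;
	slots[idx].last_used = ++lru_clock;
	slots[idx].in_use = 1;
	return idx;
}
```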
Now, I've been reading through the Linux netfilter docs and kernel sources to try to get a handle on how it hangs together.
It's pretty neat & modular code, but what I'm struggling with ATM is the question of how to do cache management. To do this right, I need to be able to keep the (approximately) busiest flows in the HW offload entries. Evicting quiet flows from the HW offload table is no big deal, but identifying the busiest evicted (SW) flows might be less straightforward. This would either require call-downs from the SW path, or some way to iterate the flows and assign the busy ones to HW entries.
As-is, it looks to me that there's only a one-time calldown to the HW offload path when a flow is initialized, and after that the HW flow offload is expected to either exist or not.
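For what it's worth, the "iterate the flows" option from a couple of paragraphs up would look something like this - purely illustrative plain C, nothing here is an existing kernel interface: periodically rank SW flows by bytes moved since the last scan and hand the 64 slots to the busiest ones.

```c
#include <stdint.h>
#include <stdlib.h>

struct flow_sample {
	uint64_t flow_id;
	uint64_t bytes;      /* cumulative byte count seen on the SW path */
	uint64_t bytes_prev; /* value at the previous scan */
};

static int cmp_delta_desc(const void *a, const void *b)
{
	const struct flow_sample *fa = a, *fb = b;
	uint64_t da = fa->bytes - fa->bytes_prev;
	uint64_t db = fb->bytes - fb->bytes_prev;

	return (db > da) - (db < da);
}

/* Rank flows by recent traffic; the first 'slots' entries are the candidates
 * for HW offload, everything else stays on (or is demoted to) the SW path. */
static void pick_hw_candidates(struct flow_sample *flows, size_t n, size_t slots)
{
	qsort(flows, n, sizeof(*flows), cmp_delta_desc);

	for (size_t i = 0; i < n; i++) {
		if (i < slots) {
			/* promote/keep flows[i].flow_id in the HW table */
		} else {
			/* demote flows[i].flow_id to the SW path */
		}
		flows[i].bytes_prev = flows[i].bytes;
	}
}
```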
So questions for anyone who knows of such things:
Is there existing support for hardware with a limited number of HW flow entries that I can reference?
Do I have this upside down somehow - is there an easier way?
The offload table is an LRU, and HW offload is another LRU on top of that. Counters update when flows retire to the slowpath (established-accept), every 30 seconds at worst.
Thanks, that's useful to know. I looked at the code a bit, but couldn't offhand figure out the eviction paths. My plan was - and is - to set up JTAG debugging to get a better idea of how the code actually works, but due to other hobby work I've been stalled for a while.
I've been looking at this again for a couple of days, and I have a better^Wless confused understanding of how the offload machinery works. I still think it'd be helpful to see the machinery working, so I'm going to wire up JTAG on my ER-X.
Now it looks to me that the HW offloads are strictly layered below the SW offloads, which makes sense. I can see how the state machine could be driven by packets that hit the SW path, plus timeouts.
It also looks to me that the HW offloads always contain two flows for each SW flow, one for each direction.
What I'm having trouble with is how to manage a small (64-entry) cache of HW offloads for all SW offloads. I guess simply maintaining this at the SoC level is one way to go, e.g. just implement a simple LRU of the last 64 HW flows requested. I don't know that this will be reasonable, though, as high-traffic flows won't have a chance to migrate back into the HW cache until after they've been retired by the SW flow layer?
It should be easy enough to implement, though, so perhaps worth trying.
So ultimately I think either exposing the cache size to the "upstream" SW flow level, or else implementing some kind of upcall (HW flow evicted) would be required to manage the HW flows in a reasonable manner?
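To be clear about what I mean by an upcall, something along these lines - entirely hypothetical, no such hook exists in mainline as far as I can tell, and none of these names are real:

```c
#include <stdint.h>

/* Hypothetical: a callback the SoC driver would invoke when it drops a flow
 * from its 64-entry HW table, so the SW flowtable layer knows the flow is
 * back on the SW path and can consider promoting a busier one in its place. */
struct hw_flow_evict_ops {
	void (*flow_evicted)(void *flowtable_priv, uint64_t flow_id,
			     uint64_t hw_packets, uint64_t hw_bytes);
};
```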
Your initial post was about the MTK SDK - we're no experts here, just sharing how the open driver works and how the closed driver should be working. ImmortalWrt packages the closed drivers too, which may give you some foothold in your research.
I tried to wire up the GPIO pins of one of my Pis to the ER-X JTAG without any useful results. After hooking a logic analyzer to the pins in question, I found that the Argon One case I'm using actually hijacks some of the pins the GPIO adapter expects to use.
So I wired up the FT232R adapter I had been using for the console as a JTAG adapter, which works OK, albeit painfully slowly. I'm going to get an FT232H adapter, which should work orders of magnitude faster.
You should discuss your issues with the proprietary SDK with MediaTek.
It appears you are using firmware that is not from the official OpenWrt project.
When using forks/offshoots/vendor-specific builds that are "based on OpenWrt", there may be many differences compared to the official versions (hosted by OpenWrt.org). Some of these customizations may fundamentally change the way that OpenWrt works. You might need help from people with specific/specialized knowledge about the firmware you are using, so it is possible that advice you get here may not be useful.
Ask for help from the maintainer(s) or user community of the specific firmware that you are using.
Provide the source code for the firmware so that users on this forum can understand how your firmware works (OpenWrt forum users are volunteers, so somebody might look at the code if they have time and are interested in your issue).
If you believe that this specific issue is common to generic/official OpenWrt and/or the maintainers of your build have indicated as such, please feel free to clarify.
No, I'm running mainline OpenWrt. The MT7621 SoC has a very limited per-flow accounting capability. So unlike later SoCs, per-flow packet/octet counts are suspended for HW-offloaded flows (or, I guess more accurately, for the packets that take the HW path).
I'm proposing that it would be possible to fix this by never having more than 64 HW offloads live at a time, likely without noticeable performance effects (for most people). This would allow maintaining per-flow accounting by assigning each HW-offloaded flow one of the MT7621's 64 accounting groups.
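Roughly what I have in mind (sketch only, all names invented - on the real hardware the counter read would be MMIO accesses into the PPE): when a flow gives up its accounting group, the group's packet/byte counters get read back and credited to the flow, so nothing that went through the HW path vanishes.

```c
#include <stdint.h>

struct flow_counters {
	uint64_t packets;
	uint64_t bytes;
};

/* Hypothetical helper: read-and-clear the PPE counters for accounting
 * group 'group' (0..63). */
extern void ppe_acct_group_read_clear(int group, uint64_t *pkts, uint64_t *bytes);

/* Fold the HW counters back into the flow's totals when the flow is evicted
 * from the HW table or torn down. */
static void flow_release_acct_group(struct flow_counters *fc, int group)
{
	uint64_t pkts, bytes;

	ppe_acct_group_read_clear(group, &pkts, &bytes);
	fc->packets += pkts;
	fc->bytes   += bytes;
}
```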
My "problem" is that I'm a Linux network code n00b, and I'm having real trouble understanding how conntrack, SW and HW flow offloads hang together. Hence JTAG debugging, which is a whole another thang to learn.
Yes, it sure is. This, however, doesn't help where the hardware doesn't maintain accounting for HW offloads, which is the point of this thread and the thread I linked at the top. The per-flow metrics are TOTALLY accurate on the MT7621 until the point where a flow goes to HW offload, at which point every packet and octet that takes the HW offload path "vanishes" with respect to per-flow accounting.
Because I need to see the code running to figure out how it works: I can't for the life of me figure out how the various work queues that take conntrack to SW to HW offloads hang together, nor how I'd modify them to maintain no more than 64 HW offloads at a time.
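The mechanical part of "no more than 64 at a time" seems simple enough in isolation - something like the sketch below (plain C11, invented names, not how the kernel actually gates offloads); the hard part is figuring out where to hang it in the conntrack/SW/HW offload machinery.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MAX_HW_FLOWS 64

static atomic_int live_hw_flows;

/* Try to claim one of the 64 HW slots; if they're all taken the caller keeps
 * the flow on the SW offload path instead. */
static bool hw_flow_try_claim(void)
{
	int cur = atomic_load(&live_hw_flows);

	while (cur < MAX_HW_FLOWS) {
		if (atomic_compare_exchange_weak(&live_hw_flows, &cur, cur + 1))
			return true;
	}
	return false;
}

/* Release a slot when the HW flow is torn down or evicted. */
static void hw_flow_release(void)
{
	atomic_fetch_sub(&live_hw_flows, 1);
}
```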