First rule in chain INPUT & OUTPUT for firewall4

I tried inserting the two rules below before 'flow add':

ip protocol udp ip saddr 192.168.2.2 accept
ip protocol udp ip daddr 192.168.2.2 accept
meta l4proto { tcp, udp } flow add @ft

This works pretty well: all UDP packets originating from or destined for 192.168.2.2 stay on the slow path, and I saw no out-of-order packets even at rates as high as 900 Mbps.

So it seems this could be a workaround if necessary.

Now I recall you intended to create a 'flow_handler' chain and jump to it to process 'flow add' and 'accept', with the precondition 'ct state { established, related }'. That seems like a good idea. It would look something like this:

In chain 'forward'

ct state { established, related } jump flow_handler

In the flow_handler

<flowtable exemption to be inserted by users here>
flow add @ft
accept
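
Put together, the idea above might look like the sketch below. It is only an illustration under assumptions: the table name fw4_sketch and the devices eth0 and eth1 are made up, and the two exemption rules are simply the ones from my test earlier in this post.

```nft
table inet fw4_sketch {
	flowtable ft {
		hook ingress priority 0;
		# Hypothetical interface names, replace with your own.
		devices = { eth0, eth1 };
	}

	chain flow_handler {
		# User exemptions come first: matching traffic is accepted
		# here and never reaches 'flow add', so it stays on the slow path.
		ip protocol udp ip saddr 192.168.2.2 accept
		ip protocol udp ip daddr 192.168.2.2 accept

		# Everything else that is TCP or UDP gets offloaded.
		meta l4proto { tcp, udp } flow add @ft
		accept
	}

	chain forward {
		type filter hook forward priority filter; policy drop;

		# Only established/related traffic is handed to the flow handler.
		ct state { established, related } jump flow_handler
	}
}
```

Because the exemptions end in 'accept', they have to sit above 'flow add'; below it, the flow would already have been added to the flowtable.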

It is easy:

Files in /etc/nftables.d/, such as gamer.nft, are included right after the flowtable definition and before any other rules; see templates/ruleset.uc.

chain handle_offload {
   iif !="lo" udp sport 9010 accept
}

If the other chain is emitted, this one is prepended to it; otherwise it just sits idle.
Evaluate with: nft -c -d netlink -f file_with_table_inet_around.nft
Some constructs are hellishly slow, for example meta vs. packet access, or iif vs. iifname. More or less, you need to permute a particular construct until you reach the fastest variant. You can repeat it 1000 times so that the difference becomes visible in a LAN ping, then add counters between the conditions and reorder them so that the most selective and fastest ones come first.

	meta l4proto udp ip saddr . ip daddr { 192.168.2.2 . 0/0, 0/0 . 192.168.2.2 } accept
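
The "counter between conditions" trick mentioned above might look like the sketch below. It is only an illustration: the table name offload_test is made up, port 9010 is borrowed from the gamer.nft example, and the order of the matches is just a starting point to permute.

```nft
# Wrapper table so the file can be checked on its own with
# nft -c -d netlink -f <file>; drop the wrapper for /etc/nftables.d/.
table inet offload_test {
	chain handle_offload {
		# Each counter records the packets that passed every match
		# to its left, so the differences between neighbouring
		# counters show how selective each condition is.
		counter iif != "lo" counter udp sport 9010 counter accept
	}
}
```

Once the chain is loaded and actually being jumped to, listing it (for example with nft list chain inet offload_test handle_offload) shows the packet and byte count of each counter. A condition that barely reduces the count is a candidate for moving further right.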

So it's '/usr/share/firewall4/templates/ruleset.uc' line 104

Had I known about this feature, I wouldn't have created my own 'table-prepend' user includes. LOL

This is a better version than my quick and dirty test cases. So rules like this could be inserted into 'flow_handler' / 'handle_offload' by user includes, with an infrastructure like this (changed slightly to better align with your pull-request code):

From checking your pull request, I think you still mostly intend to do so:

My only comment now is that you could merge the two 'ct state' statements into a single one, like above.

I think the end result will be pretty cool.

Yep. Still, if you find any discussion of non-FIFO reordering, I'm interested in test cases.
Giving users a chance to fix things up is nice; cf. synflood, which has that.

No new ones at the moment. If I hear some new cases, I'll let you know.

An interesting idea for a heuristic (I hope):
Add this as the first rule in the handle_offload chain:

meta length lt 128 accept

I tried thresholds from 64 to 1024; the jitter is lowest at around 100 to 300.
I'd speculate it's some CPU load added per packet, which becomes insignificant for bigger ones.
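
To make such a sweep easier to repeat, the threshold could be pulled into a variable. A rough sketch only: small_pkt and the table name offload_test are made-up names, and 128 is just the value from the rule above, not a recommendation.

```nft
# Wrapper table for checking the file on its own; the include files in
# /etc/nftables.d/ are already read inside the fw4 table.
table inet offload_test {
	# Hypothetical variable name; change the value to try 64...1024.
	define small_pkt = 128

	chain handle_offload {
		# Packets shorter than the threshold are accepted here and
		# therefore skip the 'flow add' that follows the exemptions.
		meta length lt $small_pkt accept

		# Port-based exemption from the gamer.nft example.
		iif != "lo" udp sport 9010 accept
	}
}
```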

What do you mean by 'jitter'?
And what's your base case for comparison?

The one that netperf shows: min, max, avg, stddev...

Sync packets should ideally be small, but that may not always be the case. The first UDP packets might not be small either.

So in some cases the flow may not be offloaded. No?

It's only about a more constant forwarding time,
as an alternative to not offloading the game port.