Flow_offloading=1 is broken on latest snapshot (4.19 issue)

Archer C7 v2, OpenWrt SNAPSHOT, r10451-653e05d27f

Flow offload seems to be broken on 4.19.

root@router:~# lsmod | grep flow
nf_conntrack           71817 18 ipt_MASQUERADE,xt_state,xt_nat,xt_conntrack,xt_REDIRECT,xt_CT,nft_redir_ipv4,nft_redir,nft_nat,nft_masq_ipv4,nft_masq,nft_flow_offload,nft_ct,nf_nat_ipv4,nf_nat,nf_flow_table,nf_conntrack_rtcache,nf_conntrack_netlink
nf_flow_table          14399  6 xt_FLOWOFFLOAD,nft_flow_offload,nf_flow_table_ipv6,nf_flow_table_ipv4,nf_flow_table_inet,nf_flow_table_hw
nf_flow_table_hw        2192  1 
nf_flow_table_inet       560  0 
nf_flow_table_ipv4       496  0 
nf_flow_table_ipv6       496  0 
nf_tables              88236 22 nft_redir_ipv4,nft_redir,nft_nat,nft_masq_ipv4,nft_masq,nft_flow_offload,nft_ct,nft_chain_nat_ipv4,nf_flow_table_ipv6,nf_flow_table_ipv4,nf_flow_table_inet,nft_reject_ipv6,nft_reject_ipv4,nft_reject_inet,nft_reject,nft_quota,nft_numgen,nft_log,nft_limit,nft_counter,nft_chain_route_ipv6,nft_chain_route_ipv4
nft_flow_offload        1648  0 

Actual speed with flow_offloading=1 is ~100 Mbps, while the expected offload speed should be closer to ~1 Gbps (I can easily tell whether it's >= 500 Mbps with my current setup). It's even worse than with flow_offloading=0, which gives around 350 Mbps.
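For reference, throughput numbers like these can be reproduced with iperf3 between two hosts on opposite sides of the router, so the traffic crosses the forwarding path. This is just a sketch; the hostnames/IPs are placeholders and iperf3 needs to be installed on both endpoints (not on the router itself):

```shell
# On a LAN-side host, start the server:
iperf3 -s

# On a WAN-side host, run the client against the LAN host
# (192.168.1.100 is a placeholder; -P 4 uses 4 parallel streams,
# which better saturates the link than a single TCP flow):
iperf3 -c 192.168.1.100 -t 30 -P 4
```

Comparing the reported bitrate with flow_offloading toggled on and off isolates the offload path from everything else in the setup.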

By mistake, I initially thought it was the hardware offload that was broken, but the router had been using software offload all along, and in fact it is the software offload that is malfunctioning.

What's going on there? Is there a fix pending?

UPD: changed some details to reflect the fact that it's the software offload that's broken; HW offload was never supported on the Archer C7 v2.

@nbd you might be interested in tackling this issue, and I can help with testing! :slight_smile:

Software offloading (flow_offloading) is supported in all targets, but hardware offloading (flow_offloading_hw) is only supported in ramips/mt7621 target.
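For anyone following along, these options live in the defaults section of /etc/config/firewall and can be toggled with uci. A minimal sketch, assuming the stock fw3 firewall configuration (the option names are the ones fw3 reads; the values here are just for illustration):

```shell
# Enable software flow offload, keep hardware offload off
# (flow_offloading_hw only has an effect on ramips/mt7621):
uci set firewall.@defaults[0].flow_offloading='1'
uci set firewall.@defaults[0].flow_offloading_hw='0'
uci commit firewall

# Reload the firewall so the FLOWOFFLOAD rule is (re)installed:
/etc/init.d/firewall restart
```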

Why is that? It seemed to work just fine for me on the previous snapshot I was on, which was based on 4.14.

The HW offload plugin is currently only available on MT7621 target.

Actually, with flow_offloading '1' and flow_offloading_hw '0' the speed is ~100 Mbps, and with flow_offloading '0' and flow_offloading_hw '0' it is ~350 Mbps (the typical full-software routing limit; that would also be the ceiling for WiFi even with HW flow offload on, since WiFi traffic still has to go through the CPU).

This makes me think something is badly broken. And the fact that it worked with a ~2-month-old snapshot and now doesn't just makes me very sad.

Is it because of the transition to 4.19? What broke other targets?

SW offload was available on all 4.14.x targets, but appears to be broken (on some or all targets?) with the 4.19 kernel push. HW offload was only ever available on the one target mentioned above.

It might have been a good idea to restrict LuCI to showing those options only for said target, to avoid confusion and the potential pitfalls of enabling them on other targets.


Oh, I understand now. I'll change the title to be more specific.

So, how do I diagnose the problem further?

root@router:~# lsmod | grep -i flow
nf_conntrack           71817 12 xt_NETMAP,ipt_MASQUERADE,xt_state,xt_nat,xt_conntrack,xt_REDIRECT,xt_CT,nf_nat_ipv4,nf_nat,nf_flow_table,nf_conntrack_rtcache,nf_conntrack_netlink
nf_flow_table          14399  2 xt_FLOWOFFLOAD,nf_flow_table_hw
nf_flow_table_hw        2192  1
x_tables               15391 26 xt_NETMAP,ipt_MASQUERADE,xt_state,xt_nat,xt_conntrack,xt_REDIRECT,xt_FLOWOFFLOAD,xt_CT,ipt_REJECT,xt_time,xt_tcpudp,xt_multiport,xt_mark,xt_mac,xt_limit,xt_comment,xt_TCPMSS,xt_LOG,iptable_mangle,iptable_filter,ip_tables,xt_set,ip6table_mangle,ip6table_filter,ip6_tables,ip6t_REJECT
xt_FLOWOFFLOAD          2832  2
root@router:~# iptables -L | grep -i off
FLOWOFFLOAD  all  --  anywhere             anywhere             /* !fw3: Traffic offloading */ ctstate RELATED,ESTABLISHED FLOWOFFLOAD
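One further diagnostic step (a sketch, not something from the thread) would be to confirm whether the FLOWOFFLOAD rule is actually matching traffic and whether any conntrack entries get flagged as offloaded. Chain and file names below are the standard ones, but worth double-checking on the live system:

```shell
# Show per-rule packet/byte counters in the FORWARD chain;
# a FLOWOFFLOAD rule with non-zero counters means established
# flows are being handed to the offload path:
iptables -t filter -L FORWARD -v -n | grep FLOWOFFLOAD

# Count conntrack entries currently marked as offloaded
# (offloaded flows carry an [OFFLOAD] flag in this listing):
grep -c OFFLOAD /proc/net/nf_conntrack
```

If the rule matches but speeds are still ~100 Mbps, the problem is likely inside the offload fast path itself rather than in the rule setup.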

I would suggest that you start by reconciling the 4.14 and 4.19 offload-specific patches under target/linux/generic: what made it, what did not, the differences, what was upstreamed...

I should also mention that the ath79 4.14 patches are still in place, so that option is available if you build your own image.

I've had a similar plan in mind as a way to get to the root cause, though I wonder what diagnostics from the live system would be helpful. Probably only the few people who actually worked on the implementation know what I should check, but there's a good chance they'll see this thread :slight_smile:

Building 4.14 image is my way to go for now, I've been working on this today: https://github.com/MOZGIII/archer-c7-v2-builder

A thread in the forum is probably not the best avenue for getting attention directed at the issue. Someone experiencing the issue should probably open an FS report on the bug tracker, with links to the various forum threads extant.


True, and I've been there today, but I got too bored filling in the bug report. Maybe tomorrow :slight_smile:


I've built a 4.14 image for my own use, and thought maybe other people might be interested. You can grab it from here: https://github.com/MOZGIII/archer-c7-v2-builds
The manual-build-2019-07-15-1 is what I'm using.

I created FS#2389.


I’ve had a similar issue in EdgeRouter X.

It turned out that, for whatever reason, the WAN port eth0 was negotiating at 100 Mb instead of 1G.

I thought it was flow offload related as well, until I decided to finally check. Changing eth4 to WAN resolved it for me.
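To rule this out on other devices, one can check the negotiated link speed directly. A sketch (interface and switch names are placeholders; switch-based targets like the EdgeRouter X or Archer C7 expose port state through swconfig rather than ethtool, depending on the driver):

```shell
# On targets where the WAN NIC is a regular Ethernet device:
ethtool eth0 | grep -i speed

# On targets with an integrated switch, check per-port link state:
swconfig dev switch0 show | grep -i link
```

A port stuck at "100baseT" instead of "1000baseT" would cap throughput at ~100 Mbps regardless of offload settings.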

I'll test this right now! I doubt it's the reason though, because I get speeds over 300 Mbps with offload off.

I noticed the same on my Archer C7 v2, top speeds at around 300 Mbps. I tried using the old ar71xx snapshots as a workaround, but it appears that rpcd is broken on those, so LuCI does not work. (Offloading works, though, so for anyone not in need of LuCI this might be a temporary solution.)

As I kind of like LuCI for its ease of use, I also switched to an ath79 4.14 image. This also gave me an excuse to try and build my own image :wink:

Should I create a Flyspray account and add to the bug, so it is clear that this is not an isolated problem affecting just one user?