I've used pure dnsmasq in the past, but added unbound, as it's the only one capable of recursive DNS. I don't believe unbound is the issue, though: even with the application disabled/stopped, I'm unable to make DNS requests directly from the router to an external resolver, or to dnsmasq for local queries.
Pretty much any communication requiring a UDP socket breaks at the kernel level.
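For anyone who wants to check the same symptom without going through dnsmasq or unbound at all, here is a minimal sketch that sends one hand-built DNS query over a raw UDP socket. It assumes Python 3 is available on whichever box you run it from. The function names and the default query name `example.com` are illustrative, not anything from the build.

```python
import random
import socket
import struct


def build_dns_query(name: str, query_id: int) -> bytes:
    """Build a minimal DNS query for an A record (RFC 1035 wire format)."""
    # Header: ID, flags (0x0100 = recursion desired), QDCOUNT=1, then three zero counts
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels, terminated by a zero byte
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.strip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN)
    return header + qname + struct.pack("!HH", 1, 1)


def probe_udp_dns(server: str, name: str = "example.com", timeout: float = 2.0) -> bool:
    """Return True if the server answers over UDP, False on timeout or socket error."""
    query_id = random.randint(0, 0xFFFF)
    packet = build_dns_query(name, query_id)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(packet, (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts as well as send/receive failures
            return False
    # A genuine reply echoes our query ID in its first two bytes
    return len(reply) >= 2 and struct.unpack("!H", reply[:2])[0] == query_id
```

Calling `probe_udp_dns("127.0.0.1")` would exercise the local resolver, and passing the address of an external resolver exercises the WAN path. If both return False while TCP traffic still works, that points at UDP sockets rather than at either DNS daemon.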
i've had a little bit of free time with work lately so i decided to try another target for fun.
i bought 2 x mt6000. they are based on the mediatek filogic soc. pretty close to the same class as these ipq807x devices.
after using them for 2 weeks i am sending them back to amazon.
this nss build + the 301w are better. they are fairly close, but strictly speaking about wifi:
the ipq807x has a more stable connection. when i do transfers that are wifi speed limited, i typically see 180-190MB/s fairly flat (0-50GB transfers) with the ipq807x.
the mediatek filogic has a higher peak, the transfers frequently hit 215+MB/s, but they are much choppier. frequently dropping below 150MB/s, if not lower.
i know its hard to compare since there is more to a router than just the soc, but after quite a bit of tinkering... the ipq807x + this nss build in my opinion is still the best that non proprietary (...software wise) wifi routers have to offer. and yes i know the 301w is 10gbe and the mt6000 is 2.5gbe etc etc etc... but my nas cant really push too much past 250-300MB/s anyways so its sort of a moot point in my use case. my wan connection is 500/100 so both devices can handle that without any problems.
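To put the port speeds next to the MB/s figures above, here is a quick conversion sketch. It only divides the nominal line rate by 8 and ignores protocol overhead (an assumption on my part), so real-world ceilings will be somewhat lower.

```python
def mbit_to_mbyte_per_s(mbit_per_s: float) -> float:
    """Convert a line rate in Mbit/s to MB/s (8 bits per byte, no protocol overhead)."""
    return mbit_per_s / 8


# Nominal rates taken from the post above
links = {
    "10GbE (301w)": 10_000,
    "2.5GbE (mt6000)": 2_500,
    "WAN down (500/100)": 500,
}

for name, mbit in links.items():
    print(f"{name}: {mbit_to_mbyte_per_s(mbit):.1f} MB/s theoretical ceiling")
```

By this rough math 2.5GbE tops out around 312.5 MB/s, so a NAS that peaks at 250-300 MB/s sits close to that ceiling either way, and a 500 Mbit/s WAN (62.5 MB/s) is well within reach of both devices.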
just thought i'd post my insight in case anyone else was thinking about giving the other 'big' soc a shot.
edit : i suppose its worth mentioning that the mt6000 plus all of the other filogic based devices have quite a bit larger community around them. larger than the ipq807x one by the looks of it. so progress with issues will probably be faster than with the ipq807x. also mediatek looks to be much more involved in the various oss bits around their products. more so than qca by the looks of it anyways. so maybe in a year or so the story will be different. and of course theres the new filogic soc that came out fairly recently (bpi r4) which looks to be quite the beast. but its early days there so all the usual caveats apply.
My lottery guess is that your error has something in common with this error I'm experiencing.
Namely RTNETLINK, which looks suspicious (not working).
So maybe we have an issue with RTNETLINK in the builds. And @qosmio, just another blind guess: could it somehow be connected with your recent DNS issue?
I first experienced that error after changing some advanced configuration settings, which I picked up by looking here.
I changed the settings below. Before that I used only the default settings and didn't experience that error.
CONFIG_USE_MOLD=y
CONFIG_USE_LTO=y
CONFIG_USE_GC_SECTIONS=y
CONFIG_TARGET_OPTIONS=y
CONFIG_TARGET_OPTIMIZATION="-O3 -pipe -mcpu=cortex-a53+crypto+crc"
# CONFIG_SECCOMP is not set
CONFIG_PKG_RELRO_PARTIAL=y
# CONFIG_PKG_RELRO_FULL is not set
# CONFIG_KERNEL_WERROR is not set
# CONFIG_KERNEL_SECCOMP is not set
# CONFIG_KERNEL_NAMESPACES is not set
# CONFIG_KERNEL_KEYS is not set
# CONFIG_KERNEL_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_KERNEL_ELF_CORE is not set
# CONFIG_KERNEL_CGROUPS is not set
CONFIG_COLLECT_KERNEL_DEBUG=y
CONFIG_KERNEL_PERF_EVENTS=y
CONFIG_KERNEL_DYNAMIC_DEBUG=y
CONFIG_KERNEL_ARM_PMU=y
CONFIG_GCC_DEFAULT_PIE=y
CONFIG_EXPERIMENTAL=y
CONFIG_DEVEL=y
CONFIG_CCACHE=y
CONFIG_BUILD_PATENTED=y
And I've recently found out that if KERNEL_CGROUPS isn't enabled on ipq806x (R7800), the compiled firmware will not include cpuset, so it can't be used in a script that pins different services to different CPU cores (irq affinity). There is, however, a script that can set them without using cpuset.
Another error on ipq806x, for example, from messing with those advanced settings, was that zerotier started to spit out segmentation fault errors. This error appeared once I used GCC 13.1 (the default for 23.05 is GCC 12.3) with CGROUPS enabled. Trial and error showed that zerotier works properly if I don't touch the default advanced settings, or if I use GCC 13.1 without CGROUPS. In the latter case I lose cpuset.
So obviously, if any advanced setting is changed (better not to do it blindly), a good understanding of its function is needed. But it's a good feeling to tackle all these settings.
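As a rough illustration of the non-cpuset route for IRQs: Linux exposes a per-IRQ hex bitmask at `/proc/irq/<n>/smp_affinity`, where bit N set means CPU N may handle that interrupt. The sketch below is not the script mentioned above, just the underlying idea, and the IRQ number in the usage note is hypothetical.

```python
from pathlib import Path


def cpu_mask(cpus: list[int]) -> str:
    """Return the hex bitmask Linux expects in smp_affinity (bit N set = CPU N allowed)."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return format(mask, "x")


def set_irq_affinity(irq: int, cpus: list[int], proc_root: str = "/proc") -> None:
    """Pin one IRQ to the given CPUs by writing the mask to <proc_root>/irq/<irq>/smp_affinity."""
    Path(proc_root, "irq", str(irq), "smp_affinity").write_text(cpu_mask(cpus))
```

On a dual-core target, `set_irq_affinity(42, [1])` would write `2` for a made-up IRQ 42, steering it to the second core. The real IRQ numbers have to be looked up in `/proc/interrupts` on the device, and writing the file needs root.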
# CONFIG_KERNEL_CGROUPS is not set
I suspect that any of these advanced configuration settings might be the reason for these RTNETLINK errors. But it's simply a guess based on which settings I changed in my config between the builds before and after that error appeared.
Last but not least, there are many other settings in menuconfig that should be accounted for too.
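Since the suspicion rests on what changed between two builds, one way to narrow it down is to diff the two `.config` files symbol by symbol and then revert the changes one at a time. A minimal sketch, assuming both files use the usual `CONFIG_X=value` and `# CONFIG_X is not set` line formats (the helper names are mine):

```python
import re

_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text: str) -> dict:
    """Map each CONFIG_ symbol to its value; 'n' stands for '# ... is not set'."""
    symbols = {}
    for line in text.splitlines():
        line = line.strip()
        if m := _SET.match(line):
            symbols[m.group(1)] = m.group(2)
        elif m := _UNSET.match(line):
            symbols[m.group(1)] = "n"
    return symbols


def diff_configs(old: str, new: str) -> dict:
    """Return {symbol: (old_value, new_value)} for every symbol that differs."""
    a, b = parse_config(old), parse_config(new)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Tiny illustrative inputs, not a real build config
before = "# CONFIG_KERNEL_CGROUPS is not set\nCONFIG_USE_LTO=y\n"
after = "CONFIG_KERNEL_CGROUPS=y\nCONFIG_USE_LTO=y\n"
for sym, (old, new) in diff_configs(before, after).items():
    print(f"{sym}: {old} -> {new}")
```

With real files you would pass in the contents of the known-good and the failing `.config`. A value of `None` on either side means the symbol is absent from that file altogether.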
Getting the wifi offload + memory usage were the last few pieces of the "full offload" puzzle. And as a whole it's been a pretty stable and performant platform. Especially in regards to wifi performance when compared to Mediatek, and even Broadcom (I had but returned the RT-AX86U on Merlin FW).
QCA's codebase is all over the place, the GIT commits may as well be "did some stuff", "update", "blah". The QSDK release schemes have no rhyme or reason to discern between features, bugfixes, backports, etc, and the documentation is non-existent.
The majority of my NSS patches have been a complete shot in the dark... But by some miracle it hasn't blown up in my face and is chugging along.
cgroups, cpuset, namespaces are primarily for containers and virtualization, which we wouldn't be doing on these platforms, and they just add to kernel size.
These are still experimental but could be the reason for most of your segmentation errors.
Not sure which specific RTNETLINK error you mean. The one from @TomCruise's output? I believe there is a race condition somewhere in the way I'm handling which interfaces are available during initialization.
My specific issue with the UDP socket bug only started after b5c5384 and 7116d2f, on 6.6.28 and 6.6.29.
The advanced menuconfig settings, though, have been pretty much the same since I posted them for 6.1, and through 6.6 when it was introduced in 6.6.22.
all good with UDP traffic as well.
Uptime is 22hrs currently. On @qosmio's latest 6.6.29 repo build.
faultless from what i can see. but i do have a simplified usage as AP only.
I meant both the @TomCruise error and the error I got from wg-bench, linked in my previous post. Just because RTNETLINK showed up in both.
I believe you know better than me, but my error appeared right when I changed the advanced settings above, though that may just be coincidence.
And of course the real reason is probably completely different and not connected to any of those advanced settings.
But still I wonder whether RTNETLINK is the main culprit for both issues, or only a side effect of some other issue?
Those look interesting. I'll take a look at those today. Thanks!
Also, it looks like we may have to revisit re-enabling the 'disable_offloads' options of ECM. We're technically supposed to keep GRO and GSO off. But I've found disabling GSO actually negatively impacted wifi performance.
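For anyone wanting to see what GRO/GSO are actually set to on an interface before and after changing anything, here is a small sketch that parses `ethtool -k <dev>` output. It assumes `ethtool` is installed on the device, and it does not touch ECM's own option at all.

```python
import subprocess


def parse_offloads(ethtool_output: str) -> dict[str, bool]:
    """Parse `ethtool -k <dev>` output into {feature: enabled}."""
    features = {}
    for line in ethtool_output.splitlines():
        name, sep, state = line.strip().partition(": ")
        # Skip the "Features for <dev>:" header and anything not reporting on/off
        if sep and state.split()[0] in ("on", "off"):
            features[name] = state.split()[0] == "on"
    return features


def gro_gso_state(dev: str) -> dict[str, bool]:
    """Report GRO/GSO state for one interface by running ethtool."""
    out = subprocess.run(
        ["ethtool", "-k", dev], capture_output=True, text=True, check=True
    ).stdout
    features = parse_offloads(out)
    return {
        "gro": features.get("generic-receive-offload", False),
        "gso": features.get("generic-segmentation-offload", False),
    }
```

The matching manual toggle is `ethtool -K <dev> gso on` (capital `-K`), which makes it easy to A/B the wifi throughput with GSO on and off. The interface name is whatever your device uses.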
In my IPQ8074a device the Wi-Fi is provided by two external transceivers and 12 front end modules, the Ethernet is provided by two external PHY chips, and I'm guessing the SoC provides the switching?
I'm guessing all devices are of similar modular setup.
@qosmio just a question.. is your build working for all IPQ807x targets? I want to test it on a ZTE MF269.
Is it able to run with only 512MB of RAM? I see the build from @AgustinLorenzo, where ath11k seems to be compiled with a limit of 512MB, so maybe these builds are suitable only for devices with 1GB of RAM?
If there's support in mainline for your device, you just need the NSS patches. You'll have to test and see.
512MB is the baseline for many of these routers. The ideal is 1GB, but with the "mindset" of a constrained 512MB device.
ATH11K is a memory hog, so effort was made to minimize its footprint as much as possible, since many folks are running more than a couple of apps on these routers.
Yup, it's on purpose. You don't gain any extra performance boost from more memory available to the driver than necessary. Because once it's assigned at boot that's it.