I didn't touch any kernel configs/settings, only the nlbwmon config. But, as in my previous post, I got issues again.
After you restart the service (nlbwmon), check your system log and see if you get the same error I am getting. Perhaps the changes you are making are being blocked like mine. That would explain why you are still getting the errors even after increasing the buffer: the kernel isn't allowing it.
Tue Feb 1 22:05:16 2022 daemon.err nlbwmon: The netlink receive buffer size of 2097152 bytes will be capped to 180224 bytes
Tue Feb 1 22:05:16 2022 daemon.err nlbwmon: by the kernel. The net.core.rmem_max sysctl limit needs to be raised to
Tue Feb 1 22:05:16 2022 daemon.err nlbwmon: at least 2097152 in order to sucessfully set the desired receive buffer size!
Yes, it is capped. I will change rmem_max to 524288 and set nlbwmon back to its default.
Thanks, I will check if this works without errors.
root@OpenWrt:/etc/config# sysctl -w net.core.rmem_max=524288
net.core.rmem_max = 524288
root@OpenWrt:/etc/config# sysctl net.core.rmem_max
net.core.rmem_max = 524288
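To double-check a setting like this before restarting nlbwmon, a minimal sketch along these lines could compare the requested buffer size with the kernel limit. The function names `buffer_fits` and `check_buffer` are made up for illustration and are not part of nlbwmon or OpenWrt.

```shell
#!/bin/sh
# Hypothetical helpers, not part of nlbwmon.

# Return 0 when the requested netlink buffer fits within the kernel limit.
# $1 = requested buffer size in bytes, $2 = net.core.rmem_max in bytes
buffer_fits() {
    [ "$1" -le "$2" ]
}

# Print a short verdict for a requested size and a limit.
check_buffer() {
    if buffer_fits "$1" "$2"; then
        echo "ok: $1 <= rmem_max $2"
    else
        echo "capped: $1 > rmem_max $2"
    fi
}
```

On the router it could be called as `check_buffer 524288 "$(cat /proc/sys/net/core/rmem_max)"`. Keep in mind that `sysctl -w` only changes the running kernel. To survive a reboot the value would also have to go into a sysctl config file such as /etc/sysctl.conf (I am assuming the default OpenWrt setup here).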
Awesome - let us know if that fixes the errors!
Seems like this is something that everyone probably just ignored in the past... lol
Indeed, there is no point in setting it to 5xxxx or higher if the kernel doesn't allow it.
I know that Kong's build had 1048576 in the nlbwmon conf, but I don't know if the Kong kernel allows that.
Will report back if problems come back !
Well, the error message is rather new. Jow added the warning to nlbwmon in September, in response to discussion in
I have increased the value in my own sysctl for some time now, but looks like I haven't pushed it to the build defaults.
I am new to that, but looked into it:
It got everything I need, but for me it's a bit overkill I think, but I might give it a try if SQM keeps giving me troubles. I will post my experiences when I do.
I enabled SQM again on the 18683 to see if it's stable...
I haven't tested Qosify extensively either, but I'm very skeptical of it based on this comment:
Currently I have an SQM Cake setup that works well for me, but I'm open to experimenting with Qosify again if a benefit to DSCP marking is found.
Speaking of long ignored errors, is anyone else seeing errors like this in the system log or is it just me? I've had them for a long time.
Tue Feb 1 18:16:37 2022 daemon.notice netifd: wan6 (1553): cat: write error: Broken pipe
Tue Feb 1 18:16:37 2022 daemon.notice netifd: wan6 (1553): cat: write error: Broken pipe
Tue Feb 1 18:16:40 2022 daemon.notice netifd: wan6 (1553): cat: write error: Broken pipe
Another crash / reboot with the 18683 firewall build and SQM (cake/layer_cake) enabled. I do think it logged the crash at /sys/fs/pstore/dmesg-ramoops-1:
For me it's hard to read what the cause is, as I am not deep enough into the OpenWrt system, but it might be readable to some of you?
Maybe useful information as I run quite some services behind the router:
cat /proc/sys/net/netfilter/nf_conntrack_count
6271
<1>[183083.086172] 8<--- cut here ---
<1>[183083.086212] Unable to handle kernel paging request at virtual address c170d758
<1>[183083.088135] pgd = db195541
<1>[183083.095415] [c170d758] *pgd=4361141e(bad)
<0>[183083.098201] Internal error: Oops: 8000000d [#1] SMP ARM
<4>[183083.102363] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet ath10k_pci ath10k_core ath wireguard pptp pppox ppp_mppe ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_compat nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack_netlink nf_conntrack mac80211 libchacha20poly1305 libblake2s curve25519_neon cfg80211 x_tables slhc sch_cake poly1305_arm nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libcrc32c libblake2s_generic crc_ccitt compat chaoskey chacha_neon fuse cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact ledtrig_usbport msdos ip_gre gre ifb ip6_udp_tunnel udp_tunnel sit tunnel4 ip_tunnel
<4>[183083.103115] tun vfat fat hfsplus cdrom cifs nls_utf8 nls_iso8859_15 nls_iso8859_1 nls_cp850 nls_cp437 nls_cp1250 sha512_generic sha1_generic seqiv md5 md4 ecb des_generic libdes cmac usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom ohci_platform ohci_hcd phy_qcom_ipq806x_usb ahci fsl_mph_dr_of ehci_platform ehci_fsl sd_mod ahci_platform libahci_platform libahci libata scsi_mod ehci_hcd gpio_button_hotplug ext4 mbcache jbd2 exfat crc32c_generic
<4>[183083.214079] CPU: 0 PID: 30449 Comm: kworker/0:0 Not tainted 5.10.92 #0
<4>[183083.236304] Hardware name: Generic DT based system
<4>[183083.242751] Workqueue: events dbs_work_handler
<4>[183083.247591] PC is at 0xc170d758
<4>[183083.252112] LR is at __krait_mux_set_sel+0x70/0x9c
<4>[183083.255572] pc : [<c170d758>] lr : [<c0695270>] psr: 60000013
<4>[183083.260178] sp : c4edbd84 ip : 00000000 fp : c1ed4500
<4>[183083.266772] r10: c1702718 r9 : 00000000 r8 : c4edbdd4
<4>[183083.272067] r7 : 00000002 r6 : ffffffff r5 : 00000001 r4 : c170d758
<4>[183083.277365] r3 : c0b12350 r2 : c0d9a310 r1 : 20000013 r0 : 000346dc
<4>[183083.283706] Flags: nZCv IRQs on FIQs on Mode SVC_32 ISA ARM Segment none
<4>[183083.290300] Control: 10c5787d Table: 456d806a DAC: 00000051
<0>[183083.297590] Process kworker/0:0 (pid: 30449, stack limit = 0xb292179c)
<0>[183083.303407] Stack: (0xc4edbd84 to 0xc4edc000)
<0>[183083.309934] bd80: 00000001 ffffffff c06952fc c170d764 c0d7fc10 ffffffff c0696a50
<0>[183083.314454] bda0: 00000000 c0d7fc10 ffffffff c03413cc c1702700 c0d7fc10 c170e240 00000002
<0>[183083.322702] bdc0: 2faf0800 c1ed4680 00000000 c0685714 c15a2e00 c170bb80 2faf0800 23c34600
<0>[183083.330948] bde0: c15a2e00 23c34600 c170e240 00000000 c1590180 c0689b78 c170e128 c1590180
<0>[183083.339194] be00: c06957a8 c14c83c0 2faf0800 c1ed4680 00000000 c0689bbc c170e240 00000000
<0>[183083.347441] be20: 23c34600 c1590180 c1ed4700 c1ed4680 00000000 c0689de8 23c34600 23c34600
<0>[183083.355688] be40: 00000000 ffffffff 23c34600 c0d81e54 c170e240 c1ed5040 23c34600 dd98b010
<0>[183083.363933] be60: 2faf0800 c1ed4700 c1ed4680 c0689f68 c1edb000 23c34600 dd98b010 2faf0800
<0>[183083.372181] be80: c1ed4700 c07a9884 00000000 c0dd04a8 c1ed46b8 c1ed4738 00000000 23c34600
<0>[183083.380427] bea0: c4eda000 c1edb800 00000000 c0dd0470 00000001 000927c0 00000000 00000000
<0>[183083.388674] bec0: c4eda000 c07aeadc c1edb800 000c3500 000927c0 000000a1 c1edb800 c1ed4980
<0>[183083.396921] bee0: c1ed4a00 c1ed4980 c1ed56c0 c1ed4a00 00000000 c07b2074 c1ed49b8 00000000
<0>[183083.405167] bf00: c1ed4984 c0d8fa84 00000000 00000000 00000000 c07b2d5c c1ed49b8 c5d8b600
<0>[183083.413413] bf20: dd9918c0 dd994a00 00000000 c033820c 00000008 dd9918d8 c5d8b600 c5d8b614
<0>[183083.421660] bf40: dd9918c0 00000008 dd9918d8 c0d03d00 dd991a80 c03384f4 c0d9aaa8 c0d0c07c
<0>[183083.429907] bf60: c5d8b600 c607c080 c6dfa0c0 00000000 c4eda000 c0338480 c5d8b600 c5751ec4
<0>[183083.438152] bf80: c607c0a4 c033e3d8 00000000 c6dfa0c0 c033e27c 00000000 00000000 00000000
<0>[183083.446398] bfa0: 00000000 00000000 00000000 c0300148 00000000 00000000 00000000 00000000
<0>[183083.454644] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[183083.462891] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
<0>[183083.471135] [<c0695270>] (__krait_mux_set_sel) from [<c170d764>] (0xc170d764)
<0>[183083.479377] Code: 00000003 00000000 00000001 00000201 (c170e240)
<4>[183083.486572] ---[ end trace b6a761c5b562e7b2 ]---
<0>[183083.511088] Kernel panic - not syncing: Fatal exception
<2>[183083.511132] CPU1: stopping
<4>[183083.515470] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G D 5.10.92 #0
<4>[183083.517983] Hardware name: Generic DT based system
<4>[183083.525643] [<c030e32c>] (unwind_backtrace) from [<c030a1ac>] (show_stack+0x14/0x20)
<4>[183083.530326] [<c030a1ac>] (show_stack) from [<c062eae8>] (dump_stack+0x94/0xa8)
<4>[183083.538310] [<c062eae8>] (dump_stack) from [<c030d050>] (do_handle_IPI+0x140/0x184)
<4>[183083.545426] [<c030d050>] (do_handle_IPI) from [<c030d0b0>] (ipi_handler+0x1c/0x2c)
<4>[183083.553411] [<c030d0b0>] (ipi_handler) from [<c0370f7c>] (__handle_domain_irq+0x90/0xf4)
<4>[183083.560793] [<c0370f7c>] (__handle_domain_irq) from [<c0648e40>] (gic_handle_irq+0x90/0xb8)
<4>[183083.569127] [<c0648e40>] (gic_handle_irq) from [<c0300b0c>] (__irq_svc+0x6c/0x90)
<4>[183083.577620] Exception stack(0xc146df18 to 0xc146df60)
<4>[183083.585003] df00: 00000000 0000a683
<4>[183083.590144] df20: 1cd5a000 dd9a0cc0 00000000 7358a820 c174c040 00000000 dd99ffb0 0000a683
<4>[183083.598390] df40: 00000000 0000a683 1ceb1880 c146df68 c07b5d94 c07b5db4 60000013 ffffffff
<4>[183083.606634] [<c0300b0c>] (__irq_svc) from [<c07b5db4>] (cpuidle_enter_state+0x180/0x380)
<4>[183083.614873] [<c07b5db4>] (cpuidle_enter_state) from [<c07b6004>] (cpuidle_enter+0x3c/0x5c)
<4>[183083.623120] [<c07b6004>] (cpuidle_enter) from [<c034df10>] (do_idle+0x208/0x2a4)
<4>[183083.631275] [<c034df10>] (do_idle) from [<c034e268>] (cpu_startup_entry+0x1c/0x20)
<4>[183083.638912] [<c034e268>] (cpu_startup_entry) from [<4230152c>] (0x4230152c)
From the panic log, CPU0's LR is at __krait_mux_set_sel. Looks very similar to the panic I encountered recently. I posted the issue here:
Seems like for 5.4 and 5.10, the ipq806x CPU does not like changing CPU frequency. Maybe this issue has been there for the ipq806x all along. Probably just more pronounced with more recent kernels.
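For anyone who wants to compare their own crash with this one, here is a rough sketch for pulling just the "PC is at" and "LR is at" lines out of a saved dump. The function name `oops_pc_lr` is made up, and it assumes the dump has one log record per line, each starting with a `<level>[timestamp]` prefix like the one posted above.

```shell
#!/bin/sh
# Hypothetical helper: print the "PC is at" / "LR is at" lines of an oops,
# with the "<level>[timestamp] " prefix stripped. Reads the dump from stdin.
oops_pc_lr() {
    grep -E '(PC|LR) is at ' | sed 's/^<[0-9]*>\[[^]]*] //'
}
```

On the router this could be run as `oops_pc_lr < /sys/fs/pstore/dmesg-ramoops-1`, using the pstore file mentioned earlier in the thread.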
So a workaround for the crashing could be using performance governor instead?
Maybe it's a work-around. I have no way to confirm at the moment tho.
I am trying it now and will report. I did this:
echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo performance > /sys/devices/system/cpu/cpufreq/policy1/scaling_governor
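The same thing can be written as a loop over every cpufreq policy instead of one line per core. This is only a sketch: `set_governor` is a made-up name, and the optional second argument exists purely so the function can be tried against a scratch directory instead of the real sysfs tree.

```shell
#!/bin/sh
# Hypothetical helper: write a governor name to every cpufreq policy.
# $1 = governor name
# $2 = cpufreq directory (optional, defaults to the real sysfs path)
set_governor() {
    gov="$1"
    root="${2:-/sys/devices/system/cpu/cpufreq}"
    for policy in "$root"/policy*; do
        # Skip anything without a governor file (also covers an unmatched glob).
        [ -e "$policy/scaling_governor" ] || continue
        echo "$gov" > "$policy/scaling_governor"
    done
}
```

Calling `set_governor performance` on the router should have the same effect as the two echo lines. Like those, it does not persist across reboots unless it is run from a start-up script.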
Things I notice: the web interface feels a lot snappier, but as a downside it increased the thermals by a few degrees on all zones (which makes sense, of course).
Edit: I checked my CPU load when using the full bandwidth and decided to decrease the clock speed (and temperature) a bit by setting the core freq to 1.4 GHz instead of 1.725 GHz.
echo 1400000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
echo 1400000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq
I can confirm in the graphs that the clock speed is now fixed at 1.4 GHz.
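The cpufreq sysfs files hold values in kHz, so 1400000 means 1.4 GHz. A tiny sketch for converting what the files report; `khz_to_ghz` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: convert a cpufreq value in kHz to GHz, three decimals.
# $1 = frequency in kHz, as found in scaling_max_freq or scaling_cur_freq
khz_to_ghz() {
    awk -v khz="$1" 'BEGIN { printf "%.3f\n", khz / 1000000 }'
}
```

To check the current clock without the graphs, something like `khz_to_ghz "$(cat /sys/devices/system/cpu/cpufreq/policy0/scaling_cur_freq)"` should work.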
It also makes sense to do some tests: packet steering enabled, software offloading & IRQ balance disabled:
100/100 loaded connection core 0 load SQM disabled: 40%
100/100 loaded connection core 0 load with SQM layer_cake enabled: 81%
100/100 loaded connection core 0 load with SQM piece_of_cake enabled: 84%
Enabling software flow offloading in the UI and IRQ balance at /etc/config/irqbalance didn't change the results, probably because SQM is enabled?
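For anyone who wants to repeat this kind of per-core measurement without the graphs, a rough sketch is to take two samples of the core's line in /proc/stat and compare them. `cpu_busy_pct` is a made-up name, and counting idle plus iowait as "not busy" is my assumption about how to define load.

```shell
#!/bin/sh
# Hypothetical helper: busy percentage between two samples of one
# /proc/stat cpu line. $1 = earlier sample, $2 = later sample.
cpu_busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9 are user, nice, system, idle, iowait, irq, softirq, steal.
            for (i = 2; i <= NF && i <= 9; i++) total += $i
            idle = $5 + $6              # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else {
                dt = total - t0; di = idle - i0
                if (dt > 0) printf "%d\n", (dt - di) * 100 / dt
                else print 0
            }
        }'
}
```

Usage would be along the lines of `a="$(grep '^cpu0 ' /proc/stat)"; sleep 10; b="$(grep '^cpu0 ' /proc/stat)"; cpu_busy_pct "$a" "$b"` while the connection is loaded.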
I have been using these in my local start-up tab for over a year, and haven't had a crash/reboot since. The main culprit for me was the frequency switching to lowest freq. All of the values were gathered from around this forum.
echo schedutil > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo schedutil > /sys/devices/system/cpu/cpufreq/policy1/scaling_governor
echo 1725000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
echo 1725000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq
echo 800000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
echo 800000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq
echo 3 > /sys/devices/virtual/net/br-lan/queues/rx-0/rps_cpus
echo 3 > /sys/devices/virtual/net/eth0.2/queues/rx-0/rps_cpus
echo 3 > /sys/devices/virtual/net/eth1.1/queues/rx-0/rps_cpus
echo 3 > /sys/devices/virtual/net/ifb4pppoe-wan/queues/rx-0/rps_cpus
echo 3 > /sys/devices/virtual/net/lo/queues/rx-0/rps_cpus
echo min_power > /sys/devices/platform/soc/29000000.sata/ata1/host0/scsi_host/host0/link_power_management_policy
exit 0
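In case the `3` written to rps_cpus looks arbitrary: it is a hexadecimal CPU bitmask, with bit 0 for CPU 0 and bit 1 for CPU 1, so 3 means "both cores". A small sketch that builds such a mask from a list of CPU numbers; `cpu_mask` is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: build the hex rps_cpus bitmask for the given CPU numbers.
# Example: cpu_mask 0 1  ->  bits 0 and 1 set
cpu_mask() {
    mask=0
    for cpu in "$@"; do
        mask=$(( mask | (1 << cpu) ))
    done
    printf '%x\n' "$mask"
}
```

On a dual-core ipq806x, `cpu_mask 0 1` gives the 3 used above, while `cpu_mask 1` would steer received packets to the second core only.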
I use SQM with layer_cake.
Thanks! The one I posted above seems to work ok for me as well, but I prefer it to scale up to 1.7 GHz when needed, so I will give your config a try.
See! There is definitely some problem with the scaling, and it always crashes right after the mux code (which I'm 100% sure is called by the krait notifier for the safe parent), with a random error like this virtual page fault.
Forcing the system to max freq should remove any instability caused by this problem; the system would then crash only because of a defect in the chip or a bad power supply.
The fact is that the regulators are not set up to work at max voltage 100% of the time, and in the long run this causes crashes due to overheating or power supply spikes (or even grid problems).
Back in the old 5.4 days everything worked well because we didn't have a cpufreq driver and the CPU and cache were set to the lower value of 800 MHz, so the CPU wasn't that sensitive to voltage changes or to problems caused by the regulator overheating.
That would also explain why the system is more stable with the NSS cores: the CPU is used less, so there are fewer CPU frequency changes and less load on the regulators.
Yes that makes a lot of sense.
I think a higher minimum scaling freq might help, or a fixed one at 1.4 GHz if you don't have extreme bandwidth and want to use SQM. I am going to test both for a few days.
If you do have a high-bandwidth connection, you probably want NSS, or to fix the cores at 1.7 GHz and hope for the best.
I've always kind of wondered what a big fat heat sink would do to improve things here.
Not much, if the system operates outside its spec due to software bugs.