Build for Netgear R7800

I haven't tested Qosify extensively either, but I'm very skeptical of it based on this comment:

Currently I have an SQM Cake setup that works well for me, but I'm open to experimenting with Qosify again if a benefit to DSCP marking is found.

Speaking of long ignored errors, is anyone else seeing errors like this in the system log or is it just me? I've had them for a long time.

Tue Feb  1 18:16:37 2022 daemon.notice netifd: wan6 (1553): cat: write error: Broken pipe
Tue Feb  1 18:16:37 2022 daemon.notice netifd: wan6 (1553): cat: write error: Broken pipe
Tue Feb  1 18:16:40 2022 daemon.notice netifd: wan6 (1553): cat: write error: Broken pipe

Another crash / reboot with the 18683 firewall build and SQM (cake/layer_cake) enabled. I do think it logged the crash at /sys/fs/pstore/dmesg-ramoops-1:
For me it's hard to tell what the cause is, as I am not deep enough into the OpenWrt system, but it might be readable to some of you? :slight_smile:
Maybe useful information as I run quite some services behind the router:
cat /proc/sys/net/netfilter/nf_conntrack_count returns 6271
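For anyone wanting to collect the same kind of record, here is a minimal sketch that gathers the ramoops files from pstore into one directory so they are easier to copy off the router. The function name and the default destination /tmp/crashlogs are just placeholders, and /tmp on OpenWrt is a RAM disk, so the copies still need to be moved elsewhere before the next reboot.

```shell
# save_pstore [SRC_DIR] [DEST_DIR]
# Copies dmesg-ramoops-* records and prints how many were copied.
# Both default paths are assumptions; adjust to taste.
save_pstore() {
    src="${1:-/sys/fs/pstore}"
    dest="${2:-/tmp/crashlogs}"
    mkdir -p "$dest" || return 1
    count=0
    for f in "$src"/dmesg-ramoops-*; do
        [ -f "$f" ] || continue      # glob did not match: nothing recorded
        cp "$f" "$dest/" && count=$((count + 1))
    done
    echo "$count"
}
```

Calling `save_pstore` with no arguments uses the defaults; a result of 0 simply means no crash record was found.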

The crashlog:

<1>[183083.086172] 8<--- cut here ---
<1>[183083.086212] Unable to handle kernel paging request at virtual address c170d758
<1>[183083.088135] pgd = db195541
<1>[183083.095415] [c170d758] *pgd=4361141e(bad)
<0>[183083.098201] Internal error: Oops: 8000000d [#1] SMP ARM
<4>[183083.102363] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet ath10k_pci ath10k_core ath wireguard pptp pppox ppp_mppe ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_compat nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack_netlink nf_conntrack mac80211 libchacha20poly1305 libblake2s curve25519_neon cfg80211 x_tables slhc sch_cake poly1305_arm nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libcrc32c libblake2s_generic crc_ccitt compat chaoskey chacha_neon fuse cls_bpf act_bpf sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact ledtrig_usbport msdos ip_gre gre ifb ip6_udp_tunnel udp_tunnel sit tunnel4 ip_tunnel
<4>[183083.103115]  tun vfat fat hfsplus cdrom cifs nls_utf8 nls_iso8859_15 nls_iso8859_1 nls_cp850 nls_cp437 nls_cp1250 sha512_generic sha1_generic seqiv md5 md4 ecb des_generic libdes cmac usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom ohci_platform ohci_hcd phy_qcom_ipq806x_usb ahci fsl_mph_dr_of ehci_platform ehci_fsl sd_mod ahci_platform libahci_platform libahci libata scsi_mod ehci_hcd gpio_button_hotplug ext4 mbcache jbd2 exfat crc32c_generic
<4>[183083.214079] CPU: 0 PID: 30449 Comm: kworker/0:0 Not tainted 5.10.92 #0
<4>[183083.236304] Hardware name: Generic DT based system
<4>[183083.242751] Workqueue: events dbs_work_handler
<4>[183083.247591] PC is at 0xc170d758
<4>[183083.252112] LR is at __krait_mux_set_sel+0x70/0x9c
<4>[183083.255572] pc : [<c170d758>]    lr : [<c0695270>]    psr: 60000013
<4>[183083.260178] sp : c4edbd84  ip : 00000000  fp : c1ed4500
<4>[183083.266772] r10: c1702718  r9 : 00000000  r8 : c4edbdd4
<4>[183083.272067] r7 : 00000002  r6 : ffffffff  r5 : 00000001  r4 : c170d758
<4>[183083.277365] r3 : c0b12350  r2 : c0d9a310  r1 : 20000013  r0 : 000346dc
<4>[183083.283706] Flags: nZCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
<4>[183083.290300] Control: 10c5787d  Table: 456d806a  DAC: 00000051
<0>[183083.297590] Process kworker/0:0 (pid: 30449, stack limit = 0xb292179c)
<0>[183083.303407] Stack: (0xc4edbd84 to 0xc4edc000)
<0>[183083.309934] bd80:          00000001 ffffffff c06952fc c170d764 c0d7fc10 ffffffff c0696a50
<0>[183083.314454] bda0: 00000000 c0d7fc10 ffffffff c03413cc c1702700 c0d7fc10 c170e240 00000002
<0>[183083.322702] bdc0: 2faf0800 c1ed4680 00000000 c0685714 c15a2e00 c170bb80 2faf0800 23c34600
<0>[183083.330948] bde0: c15a2e00 23c34600 c170e240 00000000 c1590180 c0689b78 c170e128 c1590180
<0>[183083.339194] be00: c06957a8 c14c83c0 2faf0800 c1ed4680 00000000 c0689bbc c170e240 00000000
<0>[183083.347441] be20: 23c34600 c1590180 c1ed4700 c1ed4680 00000000 c0689de8 23c34600 23c34600
<0>[183083.355688] be40: 00000000 ffffffff 23c34600 c0d81e54 c170e240 c1ed5040 23c34600 dd98b010
<0>[183083.363933] be60: 2faf0800 c1ed4700 c1ed4680 c0689f68 c1edb000 23c34600 dd98b010 2faf0800
<0>[183083.372181] be80: c1ed4700 c07a9884 00000000 c0dd04a8 c1ed46b8 c1ed4738 00000000 23c34600
<0>[183083.380427] bea0: c4eda000 c1edb800 00000000 c0dd0470 00000001 000927c0 00000000 00000000
<0>[183083.388674] bec0: c4eda000 c07aeadc c1edb800 000c3500 000927c0 000000a1 c1edb800 c1ed4980
<0>[183083.396921] bee0: c1ed4a00 c1ed4980 c1ed56c0 c1ed4a00 00000000 c07b2074 c1ed49b8 00000000
<0>[183083.405167] bf00: c1ed4984 c0d8fa84 00000000 00000000 00000000 c07b2d5c c1ed49b8 c5d8b600
<0>[183083.413413] bf20: dd9918c0 dd994a00 00000000 c033820c 00000008 dd9918d8 c5d8b600 c5d8b614
<0>[183083.421660] bf40: dd9918c0 00000008 dd9918d8 c0d03d00 dd991a80 c03384f4 c0d9aaa8 c0d0c07c
<0>[183083.429907] bf60: c5d8b600 c607c080 c6dfa0c0 00000000 c4eda000 c0338480 c5d8b600 c5751ec4
<0>[183083.438152] bf80: c607c0a4 c033e3d8 00000000 c6dfa0c0 c033e27c 00000000 00000000 00000000
<0>[183083.446398] bfa0: 00000000 00000000 00000000 c0300148 00000000 00000000 00000000 00000000
<0>[183083.454644] bfc0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<0>[183083.462891] bfe0: 00000000 00000000 00000000 00000000 00000013 00000000 00000000 00000000
<0>[183083.471135] [<c0695270>] (__krait_mux_set_sel) from [<c170d764>] (0xc170d764)
<0>[183083.479377] Code: 00000003 00000000 00000001 00000201 (c170e240) 
<4>[183083.486572] ---[ end trace b6a761c5b562e7b2 ]---
<0>[183083.511088] Kernel panic - not syncing: Fatal exception
<2>[183083.511132] CPU1: stopping
<4>[183083.515470] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D           5.10.92 #0
<4>[183083.517983] Hardware name: Generic DT based system
<4>[183083.525643] [<c030e32c>] (unwind_backtrace) from [<c030a1ac>] (show_stack+0x14/0x20)
<4>[183083.530326] [<c030a1ac>] (show_stack) from [<c062eae8>] (dump_stack+0x94/0xa8)
<4>[183083.538310] [<c062eae8>] (dump_stack) from [<c030d050>] (do_handle_IPI+0x140/0x184)
<4>[183083.545426] [<c030d050>] (do_handle_IPI) from [<c030d0b0>] (ipi_handler+0x1c/0x2c)
<4>[183083.553411] [<c030d0b0>] (ipi_handler) from [<c0370f7c>] (__handle_domain_irq+0x90/0xf4)
<4>[183083.560793] [<c0370f7c>] (__handle_domain_irq) from [<c0648e40>] (gic_handle_irq+0x90/0xb8)
<4>[183083.569127] [<c0648e40>] (gic_handle_irq) from [<c0300b0c>] (__irq_svc+0x6c/0x90)
<4>[183083.577620] Exception stack(0xc146df18 to 0xc146df60)
<4>[183083.585003] df00:                                                       00000000 0000a683
<4>[183083.590144] df20: 1cd5a000 dd9a0cc0 00000000 7358a820 c174c040 00000000 dd99ffb0 0000a683
<4>[183083.598390] df40: 00000000 0000a683 1ceb1880 c146df68 c07b5d94 c07b5db4 60000013 ffffffff
<4>[183083.606634] [<c0300b0c>] (__irq_svc) from [<c07b5db4>] (cpuidle_enter_state+0x180/0x380)
<4>[183083.614873] [<c07b5db4>] (cpuidle_enter_state) from [<c07b6004>] (cpuidle_enter+0x3c/0x5c)
<4>[183083.623120] [<c07b6004>] (cpuidle_enter) from [<c034df10>] (do_idle+0x208/0x2a4)
<4>[183083.631275] [<c034df10>] (do_idle) from [<c034e268>] (cpu_startup_entry+0x1c/0x20)
<4>[183083.638912] [<c034e268>] (cpu_startup_entry) from [<4230152c>] (0x4230152c)

From the panic log, CPU0's LR is at __krait_mux_set_sel. Looks very similar to the panic I encountered recently. I posted the issue here:

Seems like for 5.4 and 5.10, the ipq806x CPU does not like changing CPU frequency. Maybe this issue has been there for the ipq806x all along. Probably just more pronounced with more recent kernels.
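The diagnosis rests on only a few lines of the oops: the Workqueue line (dbs_work_handler belongs to the kernel's cpufreq governor code) and the PC/LR lines. A small sketch for pulling those lines out of a saved crash log, so two dumps can be compared quickly; the function name is made up for illustration.

```shell
# oops_summary FILE
# Prints the lines of a kernel oops that usually identify where it happened.
oops_summary() {
    grep -E 'Internal error:|Workqueue:|PC is at|LR is at' "$1"
}
```

For example, `oops_summary /sys/fs/pstore/dmesg-ramoops-1` on a router that has just rebooted after a panic.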

So a workaround for the crashing could be using performance governor instead?

Maybe it's a work-around. I have no way to confirm at the moment tho.

I am trying it now and will report. I did this:

echo performance > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo performance > /sys/devices/system/cpu/cpufreq/policy1/scaling_governor

Things I notice: the web interface feels a lot snappier, but as a downside it increased the thermals a few degrees on all zones (which makes sense of course).

Edit: I checked my CPU load when using the full bandwidth and decided to decrease the clock speed (and temperature) a bit by capping the core frequency at 1.4 GHz instead of 1.725 GHz.

echo 1400000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq 
echo 1400000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq 

I can confirm in the graphs that the clock speed is now fixed at 1.4 GHz.
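The same check can be done from the shell without graphs. This is a rough sketch that prints the governor and the current/min/max frequency for every cpufreq policy; the sysfs values are in kHz, so 1400000 means 1.4 GHz. The function name is illustrative, and the sysfs root is a parameter only so the function can be tried against a copy of the directory.

```shell
# show_cpufreq [SYSFS_ROOT]
# One line per policy: governor plus current/min/max frequency in kHz.
show_cpufreq() {
    root="${1:-/sys/devices/system/cpu/cpufreq}"
    for p in "$root"/policy*; do
        [ -d "$p" ] || continue
        printf '%s: %s cur=%s min=%s max=%s\n' \
            "${p##*/}" \
            "$(cat "$p/scaling_governor")" \
            "$(cat "$p/scaling_cur_freq")" \
            "$(cat "$p/scaling_min_freq")" \
            "$(cat "$p/scaling_max_freq")"
    done
}
```

Run `show_cpufreq` a few times under load; with the performance governor and the cap above, cur should stay at the max value.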
It also makes sense to do some tests: packet steering enabled, software offloading & IRQ balance disabled:
100/100 loaded connection core 0 load SQM disabled: 40%
100/100 loaded connection core 0 load with SQM layer_cake enabled: 81%
100/100 loaded connection core 0 load with SQM piece_of_cake enabled: 84%
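For anyone who wants to reproduce per-core load figures like these without a graphing package, here is one possible way: take two samples of the cpu0 line from /proc/stat and compute the non-idle share of the time between them. This is only a sketch (the function name is made up, and the numbers above may have been measured differently). Fields 2 to 9 of a /proc/stat cpu line are user, nice, system, idle, iowait, irq, softirq and steal; idle and iowait are treated as idle time.

```shell
# busy_pct LINE1 LINE2
# LINE1 and LINE2 are two samples of the same "cpuN ..." line from /proc/stat.
# Prints the integer percentage of non-idle time between the samples.
busy_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            idle = $5 + $6                      # idle + iowait
            if (NR == 1) { t0 = total; i0 = idle }
            else         { t1 = total; i1 = idle }
        }
        END {
            dt = t1 - t0
            if (dt > 0) printf "%d\n", 100 * (dt - (i1 - i0)) / dt
        }'
}
```

Usage while a speed test is running: `a=$(grep '^cpu0 ' /proc/stat); sleep 5; b=$(grep '^cpu0 ' /proc/stat); busy_pct "$a" "$b"`.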

Enabling software flow offloading in the UI and IRQ balance at /etc/config/irqbalance didn't change the results: probably because SQM is enabled?

I have been using these in my local start-up tab for over a year, and haven't had a crash/reboot since. The main culprit for me was the frequency switching to lowest freq. All of the values were gathered from around this forum.

echo schedutil > /sys/devices/system/cpu/cpufreq/policy0/scaling_governor
echo schedutil > /sys/devices/system/cpu/cpufreq/policy1/scaling_governor
echo 1725000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_max_freq
echo 1725000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_max_freq
echo 800000 > /sys/devices/system/cpu/cpufreq/policy0/scaling_min_freq
echo 800000 > /sys/devices/system/cpu/cpufreq/policy1/scaling_min_freq

echo 3 > /sys/devices/virtual/net/br-lan/queues/rx-0/rps_cpus
echo 3 > /sys/devices/virtual/net/eth0.2/queues/rx-0/rps_cpus
echo 3 > /sys/devices/virtual/net/eth1.1/queues/rx-0/rps_cpus
echo 3 > /sys/devices/virtual/net/ifb4pppoe-wan/queues/rx-0/rps_cpus
echo 3 > /sys/devices/virtual/net/lo/queues/rx-0/rps_cpus

echo min_power > /sys/devices/platform/soc/29000000.sata/ata1/host0/scsi_host/host0/link_power_management_policy

exit 0

I use SQM with layer_cake.
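The cpufreq part of the start-up script above repeats the same three writes per policy, so it could also be written as a loop. A sketch, assuming the same sysfs layout; the function names are made up, and the optional last argument exists only so the function can be tried against a scratch directory. The rps_cpus value is a hexadecimal CPU bitmask, so 3 (binary 11) means CPU0 and CPU1, and the second helper computes that mask for the first N CPUs.

```shell
# apply_cpufreq GOVERNOR MIN_KHZ MAX_KHZ [SYSFS_ROOT]
# Writes the governor and frequency limits to every cpufreq policy.
apply_cpufreq() {
    gov="$1"; min_khz="$2"; max_khz="$3"
    root="${4:-/sys/devices/system/cpu/cpufreq}"
    for p in "$root"/policy*; do
        [ -d "$p" ] || continue
        echo "$gov"     > "$p/scaling_governor"
        echo "$max_khz" > "$p/scaling_max_freq"
        echo "$min_khz" > "$p/scaling_min_freq"
    done
}

# rps_mask N
# Prints the hex bitmask selecting the first N CPUs (N=2 gives 3).
rps_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}
```

With these defined, `apply_cpufreq schedutil 800000 1725000` would replace the six cpufreq echo lines, and `rps_mask 2` prints the value written to rps_cpus on this dual-core device.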

Thanks! The one I posted above seems to work ok for me as well, but I prefer it to scale up to 1.7 GHz when needed so I will give your config a try.

See! There is definitely some problem with the scaling, and it always crashes right after the mux code (which I'm 100% sure is called by the krait notifier for the safe parent), with a random error like this virtual page fault...

Forcing the system to max freq should remove any instability caused by this problem, and the system would then crash only because of a defect in the chip or a bad power supply.
The fact is that the regulators are not meant to work at max voltage 100% of the time, and in the long run this causes crashes due to overheating or power supply spikes... (or even grid problems)

Back in the old 5.4 days everything worked well because we didn't have a cpu freq driver, and the CPU and cache were set to the lower value of 800 MHz... so the CPU wasn't that sensitive to voltage changes or to problems from the regulator overheating...

That would also explain why the system is more stable with the NSS cores... the CPU is used less... fewer cpu freq changes... less load on the regulators...

Yes that makes a lot of sense.
I think a higher minimum scaling freq might help, or a fixed one at 1.4 GHz if you don't have extreme bandwidth and want to use SQM. I am going to test both for a few days.
If you do have a high-bandwidth connection you probably want NSS, or set the core fixed at 1.7 GHz and hope for the best :stuck_out_tongue:

I've always kind of wondered what a big fat heat sink would do to improve things here.

not much if the system works outside its spec due to sw bugs

not to be that guy, but we've been on 5.4 (and then 5.10) for well over a year now? knock on wood, i haven't experienced this instability we're talking about. my current uptime is 93 days.

i also pin the min frequency to 800 MHz, i seem to recall there was a comment in the qsdk code (or in the patch series at some point) that talked about there being a bug with scaling below 800 MHz.
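When pinning a minimum like this, the value needs to be one of the steps the driver actually offers. A small sketch that picks the lowest offered frequency at or above a chosen floor; the function name is made up, and it assumes the driver exposes scaling_available_frequencies (not every cpufreq driver does).

```shell
# lowest_at_least FLOOR_KHZ FREQ_KHZ...
# Prints the smallest listed frequency that is >= FLOOR_KHZ.
# Returns non-zero and prints nothing if no listed frequency qualifies.
lowest_at_least() {
    floor="$1"; shift
    best=""
    for f in "$@"; do
        [ "$f" -ge "$floor" ] || continue
        if [ -z "$best" ] || [ "$f" -lt "$best" ]; then
            best="$f"
        fi
    done
    [ -n "$best" ] || return 1
    echo "$best"
}
```

Possible usage: `lowest_at_least 800000 $(cat /sys/devices/system/cpu/cpufreq/policy0/scaling_available_frequencies)`, then write the result to scaling_min_freq for each policy.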

seeing as ya'll bout to be chastised by the OP for posting in this thread on something not specific to this build,

could the discussion about the random crashes be kept on the Netgear R7800 exploration (IPQ8065, QCA9984) topic? It would be easier to follow.

master-firewall4-r18722-e045e40671-20220203

Fixes to dependencies in iptables/nftables were merged today ( https://github.com/openwrt/openwrt/pull/5004 ), so the firewall4 starts to stabilise regarding the iptables/nftables availability for other packages. Also SQM dependencies were fixed. Banip and bcp38 are still missing from the build, as they require ipset. The build also included the nftables LuCI status page, although jow has not yet merged the PR officially.

Ps. For those who have personal iptables rules/commands in scripts, "nft" is the firewall rule manipulation command instead of "iptables" in master.

"iptables" is currently actually a translator from the iptables-nft package, translating the iptables syntax to nftables syntax and applying them. Possibly goes ok if vanilla commands, possibly wrong if more exotic commands. "iptables-legacy" would be the old iptables.

"nft list ruleset" shows the firewall.

root@router1:~# iptables-legacy -V
iptables v1.8.7 (legacy)

root@router1:~# iptables -V
iptables v1.8.7 (nf_tables)

root@router1:~# nft list ruleset
table inet fw4 {
        chain input {
                type filter hook input priority filter; policy accept;
                iifname "lo" accept comment "!fw4: Accept traffic from loopback"
                ct state established,related accept comment "!fw4: Allow inbound established and related flows"
                tcp flags syn / fin,syn,rst,ack jump syn_flood comment "!fw4: Rate limit TCP syn packets"
                iifname "br-lan" jump input_lan comment "!fw4: Handle lan IPv4/IPv6 input traffic"
                iifname "eth0.2" jump input_wan comment "!fw4: Handle wan IPv4/IPv6 input traffic"
        }

        chain forward {
                type filter hook forward priority filter; policy drop;
                ct state established,related accept comment "!fw4: Allow forwarded established and related flows"
...
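Scripts that still call iptables may want to know which variant they are talking to. Going by the two version strings shown above, the backend is named in parentheses, so a script could classify it like this. A sketch only: the function name is made up, and it relies on nothing more than the output format seen in this post.

```shell
# iptables_backend VERSION_STRING
# Classifies the output of `iptables -V` as nft, legacy, or unknown.
iptables_backend() {
    case "$1" in
        *"(nf_tables)"*) echo nft ;;
        *"(legacy)"*)    echo legacy ;;
        *)               echo unknown ;;
    esac
}
```

Typical call: `iptables_backend "$(iptables -V)"`. If it prints nft, the rules are being translated to nftables and are worth checking with nft list ruleset.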

The Netgear devices (r7500/ r7500v2/ r7800/ d7800/ xr500) do have a quite sane thermal design, including a big heat sink covering most of the board, although ipq806x is still running rather hot. The ZyXEL nbg6817 less so, running even hotter. ipq8064 also seems to run cooler in general (smaller heat sinks), not quite surprising given the lower clock rate (so the additional performance in ipq8065 seems to mostly stem from running the silicon closer to its maximum, rather than process improvements).

btw i run my r7800 at 1.9 GHz and 1.7 GHz with cache for more than 40 days with no problem... thermal is really not the problem

Currently running the r18722 build. Since the last three? builds I have a small issue that I can't access my (port forwarded) services using my public IP when accessing them from my own LAN. From the internet everything works. Probably a setting in the new firewall?

One thing to consider for @hnyman : setting the min core to 800000 solves the reboot issues I had earlier...

@jow is probably currently fixing the nat reflection...
See

That fix (firewall4 version bump) will likely be merged into OpenWrt soon.
