Ipq806x NSS build (Netgear R7800 / TP-Link C2600 / Linksys EA8500)

sppmaster · October 19, 2022, 9:52pm

Is there a tool, command or utility that can be used to read/monitor the Krait and NSS CPU load and frequency transitions?

qosmio · October 19, 2022, 10:35pm

[ 4.372392] bf0ab3c0: Warn: trap[645]: Trapped: Thread: 2, reason: 00001000, PC: 400501A8, previous PC: 40050154

This is the exact core dump I've been getting as well. No clue how to tackle this one. I've spoken with @Ansuel about it too. NSS FW errors are very hard to decipher: the wording is pretty cryptic and there is little to no documentation.

I had suspected it had something to do with the memory reservation. I hit the same error over 2 years ago, in a discussion I had with @Gram (under Core dump 1). Disabling pstore/ramoops in this instance didn't help.

Here is a list of all the crashes I got while trying different techniques to load it.

Core dump 1

crash8.txt:[    6.391559] bf1ade00: Warn: trap[813]: Trap on CHIP ID 00050000
crash8.txt:[    6.391574] bf1ade00: Warn: trap[620]: Trapped: TRAP_TD(00000004) DCAPT(3C000080)
crash8.txt:[    6.391587] bf1ade00: Warn: trap[645]: Trapped: Thread: 2, reason: 00001000, PC: 400501A8, previous PC: 40050154
crash8.txt:[    6.403687] bf1ade00: Warn: trap[594]: A0_3: 3F02FAAC 471AF740 3F02F8AC 40096800
crash8.txt:[    6.403700] bf1ade00: Warn: trap[594]: A4_7: 40096800 4004E5E0 3F00C0A8 3F00AF30
crash8.txt:[    6.403714] bf1ade00: Warn: trap[599]: D0_3: 00000000 27FFFFFC FFFFFFFF 00000000
crash8.txt:[    6.403728] bf1ade00: Warn: trap[599]: D4_7: 00000001 00000000 00000000 00000000
crash8.txt:[    6.403740] bf1ade00: Warn: trap[599]: D8_11: 00000000 00000000 00000000 00000000
crash8.txt:[    6.403753] bf1ade00: Warn: trap[599]: D12_15: 00000000 00000000 00000000 00000000
crash8.txt:[    6.415340] bf1ade00: Warn: trap[649]: Thread_2 has non-recoverable trap

Core dump 2

crash9.txt:[    5.885525] bf1a9e00: Warn: trap[813]: Trap on CHIP ID 00050000
crash9.txt:[    5.891227] bf1a9e00: Warn: trap[620]: Trapped: TRAP_TD(00000004) DCAPT(3C000080)
crash9.txt:[    5.897071] bf1a9e00: Warn: trap[645]: Trapped: Thread: 2, reason: 00001000, PC: 400501A8, previous PC: 40050154
crash9.txt:[    5.904785] bf1a9e00: Warn: trap[594]: A0_3: 3F02FAAC 4722F740 3F02F8AC 40096800
crash9.txt:[    5.904791] bf1a9e00: Warn: trap[594]: A4_7: 40096800 4004E5E0 3F00C0A8 3F00AF30
crash9.txt:[    5.904795] bf1a9e00: Warn: trap[599]: D0_3: 00000000 27FFFFFC FFFFFFFF 00000000
crash9.txt:[    5.904799] bf1a9e00: Warn: trap[599]: D4_7: 00000001 00000000 00000000 00000000
crash9.txt:[    5.904802] bf1a9e00: Warn: trap[599]: D8_11: 00000000 00000000 00000000 00000000
crash9.txt:[    5.922028] bf1a9e00: Warn: trap[599]: D12_15: 00000000 00000000 00000000 00000000
crash9.txt:[    5.936729] bf1a9e00: Warn: trap[649]: Thread_2 has non-recoverable trap

Core dump 3

crash.txt:[    8.009139] bf231100: Warn: trap[813]: Trap on CHIP ID 00050000
crash.txt:[    8.019733] bf231100: Warn: trap[620]: Trapped: TRAP_TD(00000004) DCAPT(3C000080)
crash.txt:[    8.032921] bf231100: Warn: trap[645]: Trapped: Thread: 2, reason: 00001000, PC: 400501A8, previous PC: 40050154
crash.txt:[    8.040291] bf231100: Warn: trap[594]: A0_3: 3F02FAAC 451B7740 3F02F8AC 40096800
crash.txt:[    8.050538] bf231100: Warn: trap[594]: A4_7: 40096800 4004E5E0 3F00C0A8 3F00AF30
crash.txt:[    8.057837] bf231100: Warn: trap[599]: D0_3: 00000000 27FFFFFC FFFFFFFF 00000000
crash.txt:[    8.065284] bf231100: Warn: trap[599]: D4_7: 00000001 00000000 00000000 00000000
crash.txt:[    8.072656] bf231100: Warn: trap[599]: D8_11: 00000000 00000000 00000000 00000000
crash.txt:[    8.080060] bf231100: Warn: trap[599]: D12_15: 00000000 00000000 00000000 00000000
crash.txt:[    8.087352] bf231100: Warn: trap[649]: Thread_2 has non-recoverable trap
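
To see what actually changes between dumps like these, the register lines can be parsed and diffed. This is only a minimal sketch, not something posted in the thread: the regex and the helper names `parse_registers` and `differing_registers` are illustrative assumptions based purely on the line format shown above.

```python
import re
from collections import defaultdict

# Matches register lines such as
# "... Warn: trap[594]: A0_3: 3F02FAAC 471AF740 3F02F8AC 40096800"
REG_LINE = re.compile(r"trap\[\d+\]: ([AD])(\d+)_(\d+): ((?:[0-9A-Fa-f]{8} ?)+)")


def parse_registers(dump_text):
    """Return {register_name: value} for one trap dump, e.g. {"A1": 0x471AF740}."""
    regs = {}
    for match in REG_LINE.finditer(dump_text):
        bank, first = match.group(1), int(match.group(2))
        for offset, word in enumerate(match.group(4).split()):
            regs[f"{bank}{first + offset}"] = int(word, 16)
    return regs


def differing_registers(dumps):
    """Given {dump_name: text}, return the registers whose value differs between dumps."""
    seen = defaultdict(dict)
    for name, text in dumps.items():
        for reg, value in parse_registers(text).items():
            seen[reg][name] = value
    return {reg: vals for reg, vals in seen.items() if len(set(vals.values())) > 1}


# Two A0_3 lines copied from crash8.txt and crash9.txt above.
sample = {
    "crash8": "[    6.403687] bf1ade00: Warn: trap[594]: A0_3: 3F02FAAC 471AF740 3F02F8AC 40096800",
    "crash9": "[    5.904785] bf1a9e00: Warn: trap[594]: A0_3: 3F02FAAC 4722F740 3F02F8AC 40096800",
}
print(sorted(differing_registers(sample)))  # ['A1']
```

Across the three dumps listed above, the PCs and every register are identical except A1 (471AF740, 4722F740, 451B7740), so that is the one value that changes from run to run.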

Mpilon · October 20, 2022, 1:21am

if you can catch it on the way down, try to grab the memory around the PC and previous PC -- I'm assuming this is NSS code space.

Be interesting to disassemble that stuff.

also maybe cross check the address regs and the fail PCs to see if they match / are within / close-to areas reserved by the kernel, as logged in the kernel log.

-- could give a hint to who's stepping on NSS, if that's happening.

quarky · October 20, 2022, 2:47am

We actually know where in physical memory the firmware is loaded from the logs printed by the qca-nss-drv. The log below shows the 11.2 firmware being loaded:

[ 15.966863] nss_driver - fw of size 544712 bytes copied to load addr: 40000000, nss_id : 0
[ 15.967581] nss_driver - Turbo Support 1
[ 15.974085] Supported Frequencies -
[ 15.974088] 110Mhz
[ 15.978175] 600Mhz
[ 15.981733] 800Mhz
[ 15.983553]
[ 15.987976] 2b3e489f: meminfo init succeed
[ 16.012210] nss_driver - fw of size 218860 bytes copied to load addr: 40800000, nss_id : 1

The NSS core-dumps are likely crashes encountered by the NSS firmware, so it's unlikely for us to do anything about it without the NSS firmware source code.
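
As a rough illustration of the cross-check suggested earlier, the load addresses and sizes from this log can be compared against the PC and address registers from the trap dumps. This is a sketch under assumptions: it assumes the crashing firmware was loaded at the same addresses and with a similar size as the 11.2 images logged here, and `locate` is a made-up helper name.

```python
# Load addresses and sizes taken from the nss_driver log lines above.
FW_IMAGES = {
    "nss core 0": (0x40000000, 544712),
    "nss core 1": (0x40800000, 218860),
}


def locate(address, images=FW_IMAGES):
    """Return (image_name, offset) if address lies inside a copied image, else None."""
    for name, (base, size) in images.items():
        if base <= address < base + size:
            return name, address - base
    return None


# Values from the trap dumps: PC, previous PC and address register A3.
for label, addr in [("PC", 0x400501A8), ("previous PC", 0x40050154), ("A3", 0x40096800)]:
    hit = locate(addr)
    where = f"{hit[0]} +0x{hit[1]:X}" if hit else "outside the copied images"
    print(f"{label} 0x{addr:08X}: {where}")
```

With these numbers both PCs land inside the core 0 image (offsets 0x501A8 and 0x50154), while A3 points past the 544712 bytes that were copied. That only says it is outside the image file contents, not that it is outside memory used by the NSS firmware.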

It would be cool tho. if we could build our own firmware to make use of the two UBI32 cores. Sadly it is beyond my competence to take on such a project.

D43m0n · October 20, 2022, 6:53am

I assume that most of us running the 5.15 kernel are not clamping down the cpu frequency, am I right? I’ve been too busy with other things so my R7800’s run with a 5.10 build that used to reboot every few days. I have set them to a fixed cpu frequency and uptime so far is 26+ days. I’m simply mentioning this as a thought that the frequency scaling issues in 5.15 may be less than in 5.10, but not yet 100% resolved. Just a thought though, what if some 5.15 R7800’s would clamp cpu frequency down to one fixed rate and a few others don’t and then compare in a few days?

sppmaster · October 20, 2022, 9:28am

I do this at the moment.
I had crashes every few days with frequency transitions allowed.
Now I have been running 5.15 with the performance governor for almost 7 days. No crashes have occurred yet.
You can see my related posts from a few days ago above.
With kernel 5.10 (no irqbalancer nor packet steering) and the ondemand governor I have more than 78 days uptime on four other R7800s.
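
For anyone wanting to log the governor, the current frequency and the number of frequency transitions of the Krait cores, the standard Linux cpufreq sysfs files can be read directly. This is a minimal sketch, assuming Python 3 is installed on the router and that the kernel exposes `stats/total_trans` (it only exists when cpufreq statistics are enabled); `read_cpufreq` is just an illustrative name, and this does not cover the NSS cores.

```python
from pathlib import Path


def _to_int(text):
    """Convert a sysfs string to int, passing through None for missing files."""
    return int(text) if text is not None else None


def read_cpufreq(sysfs_root="/sys/devices/system/cpu"):
    """Return {cpu_name: {"governor", "cur_khz", "transitions"}} for each CPU found."""
    result = {}
    for policy_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*/cpufreq")):

        def read(name, base=policy_dir):
            path = base / name
            return path.read_text().strip() if path.exists() else None

        result[policy_dir.parent.name] = {
            "governor": read("scaling_governor"),
            "cur_khz": _to_int(read("scaling_cur_freq")),
            # Only present when the kernel is built with cpufreq statistics.
            "transitions": _to_int(read("stats/total_trans")),
        }
    return result


if __name__ == "__main__":
    for cpu, info in read_cpufreq().items():
        print(cpu, info)
```

Run periodically (for example from cron), the `transitions` counter shows whether the governor is actually changing frequency; with a fixed frequency it should stay constant.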

asvio · October 20, 2022, 3:41pm

Today I installed the new @qosmio (2022-10-20) update 5.15qsdk11-new-krait-cc and in less than an hour I had a spontaneous reboot. Luckily the restart left me these logs.
I hope it helps.
I usually don't have spontaneous reboots.

<4>[   64.808875] ath10k_pci 0001:01:00.0: Invalid peer id 1 or peer stats buffer, peer: e403f226  sta: 00000000
<6>[  125.918243] ath10k_pci 0000:01:00.0: mac flush null vif, drop 0 queues 0xffff
<6>[  125.946228] IPv6: ADDRCONF(NETDEV_CHANGE): phy0-ap0: link becomes ready
<6>[  125.946376] br-lan: port 3(phy0-ap0) entered blocking state
<6>[  125.951708] br-lan: port 3(phy0-ap0) entered forwarding state
<1>[ 3997.479619] 8<--- cut here ---
<1>[ 3997.479668] Unable to handle kernel paging request at virtual address 9043f45c
<1>[ 3997.481601] pgd = b3c74188
<1>[ 3997.488864] [9043f45c] *pgd=00000000
<0>[ 3997.491505] Internal error: Oops: 80000005 [#1] SMP ARM
<4>[ 3997.495216] Modules linked in: nss_ifb ecm nft_fib_inet nf_flow_table_ipv6 nf_flow_table_ipv4 nf_flow_table_inet ath10k_pci ath10k_core ath wireguard nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_objref nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_counter nft_compat nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack mac80211 libchacha20poly1305 iptable_mangle iptable_filter ipt_REJECT ipt_ECN ip_tables curve25519_neon cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_ecn xt_dscp xt_comment xt_TCPMSS xt_LOG xt_HL xt_DSCP xt_CLASSIFY x_tables ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda sch_cake ppp_async poly1305_arm nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 libcurve25519_generic libcrc32c crc_ccitt compat
<4>[ 3997.495900]  chacha_neon fuse sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32 cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred act_gact qca_nss_qdisc qca_nss_crypto qca_nss_pptp pptp qca_nss_pppoe pppoe pppox ppp_generic slhc ledtrig_usbport cryptodev qca_mcs msdos ip_gre gre ifb ip6_udp_tunnel udp_tunnel sit tunnel4 ip_tunnel tun autofs4 dns_resolver nls_utf8 nls_iso8859_15 nls_iso8859_1 nls_cp850 nls_cp437 nls_cp1250 wp512 twofish_generic twofish_common tea serpent_generic khazad cast6_generic cast5_generic cast_common camellia_generic blowfish_generic blowfish_common anubis xts crypto_user algif_skcipher algif_rng algif_hash algif_aead af_alg sha1_generic seqiv ecb cmac authencesn authenc uas usb_storage leds_gpio xhci_plat_hcd xhci_pci xhci_hcd dwc3 dwc3_qcom ohci_platform ohci_hcd phy_qcom_ipq806x_usb ahci fsl_mph_dr_of ehci_platform ehci_fsl sd_mod ahci_platform libahci_platform libahci libata scsi_mod scsi_common ehci_hcd qca_nss_drv
<4>[ 3997.565827]  qca_nss_gmac ramoops reed_solomon pstore gpio_button_hotplug vfat fat ext4 mbcache jbd2 exfat dm_mirror dm_region_hash dm_log dm_crypt dm_mod dax crc32c_generic cbc encrypted_keys trusted tpm oid_registry asn1_encoder asn1_decoder
<4>[ 3997.673885] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.15.74 #0
<4>[ 3997.695215] Hardware name: Generic DT based system
<4>[ 3997.701462] PC is at 0x9043f45c
<4>[ 3997.706061] LR is at _ath10k_ce_completed_send_next_nolock+0x178/0x19c [ath10k_core]
<4>[ 3997.709105] pc : [<9043f45c>]    lr : [<bf8c2638>]    psr: a0000113
<4>[ 3997.717090] sp : c0d01db8  ip : 00200000  fp : c0d03d00
<4>[ 3997.723078] r10: 0004ac00  r9 : c5f52f80  r8 : c5f51f80
<4>[ 3997.728286] r7 : 0000001f  r6 : c0d01dec  r5 : c26f56c0  r4 : 00000017
<4>[ 3997.733495] r3 : 9043f45c  r2 : bf8cb77c  r1 : 0004ac04  r0 : c5f51f80
<4>[ 3997.740093] Flags: NzCv  IRQs on  FIQs on  Mode SVC_32  ISA ARM  Segment none
<4>[ 3997.746608] Control: 10c5787d  Table: 4bdb806a  DAC: 00000051
<1>[ 3997.753808] Register r0 information: non-slab/vmalloc memory
<1>[ 3997.759537] Register r1 information: non-paged memory
<1>[ 3997.765264] Register r2 information: 113-page vmalloc region starting at 0xbf87b000 allocated at load_module+0x9a8/0x27a8
<1>[ 3997.770221] Register r3 information: non-paged memory
<1>[ 3997.781149] Register r4 information: non-paged memory
<1>[ 3997.786184] Register r5 information: slab kmalloc-192 start c26f56c0 pointer offset 0 size 192
<1>[ 3997.791226] Register r6 information: non-slab/vmalloc memory
<1>[ 3997.799728] Register r7 information: non-paged memory
<1>[ 3997.805543] Register r8 information: non-slab/vmalloc memory
<1>[ 3997.810492] Register r9 information: non-slab/vmalloc memory
<1>[ 3997.816220] Register r10 information: non-paged memory
<1>[ 3997.821862] Register r11 information: non-slab/vmalloc memory
<1>[ 3997.826811] Register r12 information: non-paged memory
<0>[ 3997.832626] Process swapper/0 (pid: 0, stack limit = 0x0cba19dc)
<0>[ 3997.837662] Stack: (0xc0d01db8 to 0xc0d02000)
<0>[ 3997.843825] 1da0:                                                       c5f597fc c0d01dec
<0>[ 3997.848089] 1dc0: c5f5977c 00000001 c5f52480 1cd4a000 c0d01e70 bf8c02e4 c0d01df0 c5f51f80
<0>[ 3997.856249] 1de0: c5f597fc bf8fe674 0000000c bf8c0368 c0d01df0 c0d01df0 00000000 00000001
<0>[ 3997.864408] 1e00: 00000004 00000000 c5f51f80 bf8c0448 c5f58480 c5f51f80 00000040 c0d01e67
<0>[ 3997.872567] 1e20: c5f52480 bf9006fc 00000001 c5f58480 00000040 c0d01e67 c0d01e68 c086e3ec
<0>[ 3997.880728] 1e40: c5f58480 dd990e40 0000012c c0c46e40 c0d01e68 c086e730 c0d01e6c 0005a458
<0>[ 3997.888888] 1e60: c27c9800 00000001 c0d01e68 c0d01e68 c0d01e70 c0d01e70 c0d01f00 00000000
<0>[ 3997.897047] 1e80: 00000003 c0d0308c c0d03080 40000003 c0d00000 00000100 c0d01ea0 c03012cc
<0>[ 3997.905207] 1ea0: c1669060 c0374e7c 00000009 c0c46040 0005a457 04200002 00000000 ffffe000
<0>[ 3997.913365] 1ec0: 00000000 00000043 00000000 c0d7ea64 c0c452fc c0d01f00 c0c456e8 c0325898
<0>[ 3997.921526] 1ee0: c0c452f0 c0374f24 c0d01f20 de802000 c0d0532c de80200c c0d7ea64 c06698f0
<0>[ 3997.929685] 1f00: c0308590 60000013 ffffffff c0d01f54 00000000 c0d00000 10c5387d c0300b7c
<0>[ 3997.937846] 1f20: 02083446 00000000 00000001 c0315a00 c0d00000 00000000 c0d04f10 c0d04f54
<0>[ 3997.946004] 1f40: 00000000 00000000 10c5387d c0c456e8 c0dcbe98 c0d01f70 c030858c c0308590
<0>[ 3997.954165] 1f60: 60000013 ffffffff 00000051 10c5387d c0d00000 c0353044 c0d01f78 c0c456e8
<0>[ 3997.962325] 1f80: c0da6024 c0da6020 00000000 000000e6 c0da6024 c0da6020 00000000 c0d04ec0
<0>[ 3997.970485] 1fa0: 512f04d0 10c5387d c0da6014 c03533b4 c0da603c c0c010b8 ffffffff ffffffff
<0>[ 3997.978644] 1fc0: 00000000 c0c006d8 c0c35a54 e2d3e515 00000000 c0c00470 00000051 10c0387d
<0>[ 3997.986804] 1fe0: 0000136c 430c9fc8 512f04d0 10c5387d 00000000 00000000 00000000 00000000
<0>[ 3997.995023] [<bf8c2638>] (_ath10k_ce_completed_send_next_nolock [ath10k_core]) from [<bf8c02e4>] (ath10k_ce_completed_send_next+0x3c/0x5c [ath10k_core])
<0>[ 3998.003130] [<bf8c02e4>] (ath10k_ce_completed_send_next [ath10k_core]) from [<bf8fe674>] (ath10k_pci_htc_tx_cb+0x38/0x4c0 [ath10k_pci])
<0>[ 3998.016869] [<bf8fe674>] (ath10k_pci_htc_tx_cb [ath10k_pci]) from [<bf8c0448>] (ath10k_ce_per_engine_service_any+0x80/0x1d4 [ath10k_core])
<0>[ 3998.028736] [<bf8c0448>] (ath10k_ce_per_engine_service_any [ath10k_core]) from [<bf9006fc>] (ath10k_pci_napi_poll+0x4c/0x164 [ath10k_pci])
<0>[ 3998.041234] [<bf9006fc>] (ath10k_pci_napi_poll [ath10k_pci]) from [<c086e3ec>] (__napi_poll+0x34/0x188)
<0>[ 3998.053642] [<c086e3ec>] (__napi_poll) from [<c086e730>] (net_rx_action+0xd0/0x260)
<0>[ 3998.062927] [<c086e730>] (net_rx_action) from [<c03012cc>] (__do_softirq+0xe4/0x2a4)
<0>[ 3998.070568] [<c03012cc>] (__do_softirq) from [<c0325898>] (irq_exit+0xc0/0x114)
<0>[ 3998.078551] [<c0325898>] (irq_exit) from [<c0374f24>] (handle_domain_irq+0x68/0x98)
<0>[ 3998.085585] [<c0374f24>] (handle_domain_irq) from [<c06698f0>] (gic_handle_irq+0x80/0xb4)
<0>[ 3998.093223] [<c06698f0>] (gic_handle_irq) from [<c0300b7c>] (__irq_svc+0x5c/0x78)
<0>[ 3998.101553] Exception stack(0xc0d01f20 to 0xc0d01f68)
<0>[ 3998.109021] 1f20: 02083446 00000000 00000001 c0315a00 c0d00000 00000000 c0d04f10 c0d04f54
<0>[ 3998.114061] 1f40: 00000000 00000000 10c5387d c0c456e8 c0dcbe98 c0d01f70 c030858c c0308590
<0>[ 3998.122216] 1f60: 60000013 ffffffff
<0>[ 3998.130369] [<c0300b7c>] (__irq_svc) from [<c0308590>] (arch_cpu_idle+0x3c/0x50)
<0>[ 3998.133673] [<c0308590>] (arch_cpu_idle) from [<c0353044>] (do_idle+0x224/0x298)
<0>[ 3998.141313] [<c0353044>] (do_idle) from [<c03533b4>] (cpu_startup_entry+0x1c/0x20)
<0>[ 3998.148692] [<c03533b4>] (cpu_startup_entry) from [<c0c010b8>] (start_kernel+0x5c8/0x6a0)
<0>[ 3998.156074] Code: bad PC value
<4>[ 3998.164466] ---[ end trace 480a4abccde4425e ]---
<0>[ 3998.190692] Kernel panic - not syncing: Fatal exception in interrupt
<2>[ 3998.190732] CPU1: stopping
<4>[ 3998.196112] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G      D           5.15.74 #0
<4>[ 3998.198635] Hardware name: Generic DT based system
<4>[ 3998.206188] [<c030f6e0>] (unwind_backtrace) from [<c030b07c>] (show_stack+0x14/0x20)
<4>[ 3998.210789] [<c030b07c>] (show_stack) from [<c064f524>] (dump_stack_lvl+0x40/0x4c)
<4>[ 3998.218687] [<c064f524>] (dump_stack_lvl) from [<c030dea4>] (do_handle_IPI+0x12c/0x184)
<4>[ 3998.226066] [<c030dea4>] (do_handle_IPI) from [<c030df14>] (ipi_handler+0x18/0x2c)
<4>[ 3998.233965] [<c030df14>] (ipi_handler) from [<c037b644>] (handle_percpu_devid_irq+0x80/0x168)
<4>[ 3998.241606] [<c037b644>] (handle_percpu_devid_irq) from [<c0374f20>] (handle_domain_irq+0x64/0x98)
<4>[ 3998.250200] [<c0374f20>] (handle_domain_irq) from [<c06698f0>] (gic_handle_irq+0x80/0xb4)
<4>[ 3998.259052] [<c06698f0>] (gic_handle_irq) from [<c0300b7c>] (__irq_svc+0x5c/0x78)
<4>[ 3998.267297] Exception stack(0xc146df60 to 0xc146dfa8)
<4>[ 3998.274765] df60: 00127322 00000000 00000001 c0315a00 c146c000 00000001 c0d04f10 c0d04f54
<4>[ 3998.279804] df80: 00000000 00000000 00000000 c0c456e8 c0dcbe98 c146dfb0 c030858c c0308590
<4>[ 3998.287960] dfa0: 60000013 ffffffff
<4>[ 3998.296113] [<c0300b7c>] (__irq_svc) from [<c0308590>] (arch_cpu_idle+0x3c/0x50)
<4>[ 3998.299417] [<c0308590>] (arch_cpu_idle) from [<c0353044>] (do_idle+0x224/0x298)
<4>[ 3998.307055] [<c0353044>] (do_idle) from [<c03533b4>] (cpu_startup_entry+0x1c/0x20)
<4>[ 3998.314433] [<c03533b4>] (cpu_startup_entry) from [<42301530>] (0x42301530)

Mpilon · October 20, 2022, 3:56pm

my hunch is the 5.15 kernel is the killer here, NSS the victim.

suggestions for finding heavy-handed clobbering:

A) cross-reference the regs that hold values in the address space against the areas mentioned in the kernel log -- in/near. If you get a match, that bit of the kernel can be looked at.

B) more theoretical, because I have never done it: integrate kgdb into the kernel build, boot, and dump (disassemble?) the memory leading up to those fail PCs. Look at the regs involved at the failed instruction: any interesting addresses?

This is such a nice, clean failure that it may be a functionality change which happened in 5.15 or nss devel kit 11.

could try nss 11 on 5.10, and nss 10 on 5.15.

that's all I've got - I'll see meself out.

sppmaster · October 20, 2022, 7:19pm

I ask myself whether building such a firmware is really that hard a task, considering that these UBI32 cores only do very specific tasks. Still, it's good that we can use the prebuilt firmware, and it has proved to be really stable, at least with the latest master builds with kernel 5.10 and NSS 10, without needing to fix the CPU core frequency.

This seems really reasonable to try. Actually, what are the differences (or benefits) of NSS 12 vs 11 vs 10, apart from 12 being bigger than 10?

Mpilon · October 20, 2022, 7:49pm

As far as clock frequency issues go - my strong suggestion is to fix the clock at one particular value -- high/max frequency.

I don't think anything has been 'fixed' as far as clocking goes -- our problems are greatly reduced when clock Hz is fixed at one value.

And it's such a confounding factor to all the other issues being tackled - why not mandate one value for all to use - and maybe lock it into @ACwifidude's code base.

my 0.00002,
M.

quarky · October 21, 2022, 12:39am

Any input from the community is valuable, so you're being too modest here. The issue as I see it at the moment is that we cannot reliably trigger the suspected L2 cache corruption, so we cannot test the hypothesis and are kind of shooting in the dark here.

Maybe we can try creating a kmod that periodically stresses the CPU to trigger CPU clock scaling and hopefully recreates the issue consistently. My ipq806x routers are both in production use now, so I can't really test this out. I'll probably explore this once I can free up one of them.

For the NSS firmware crashes, I'm afraid we have to live with it tho.

quarky · October 21, 2022, 12:44am

It is really hard when we do not have documentation. Reverse engineering and code disassembly are tricky.

From what I can piece together from the nss-drv code, the UBI32 cores seem to communicate with the Krait cores via DMA.

So we have to figure out how to bootstrap the UBI32 cores, figure out the data exchange mechanism and set up a build sub-system for the firmware, among other stuff I've missed. It would be really cool, though, if the community could get it to work.

Mpilon · October 21, 2022, 12:58am

Thanks for the kind words - I'm mainly suggesting splitting the efforts on CPU clock scaling from the 5.15 NSS crashes. There are some clock values which keep the router up a long time - we could go with 1 of those and ignore L2 cpu cache timing for now.

trying to resolve both is going to be a challenge.

Is there any documentation on those NSS cores? compiler? instruction set? sample public eval code?

mooninite · October 21, 2022, 1:27am

ACwifidude, do you plan to do a release to sync with 22.03.2, etc. to include the latest CVE fixes? Thank you.

quarky · October 21, 2022, 1:33am

Not publicly available, as far as I'm aware.

D43m0n · October 21, 2022, 7:00am

I believe @Ansuel is working on a theory, don’t know if we know?

There is a package called stress that might just do the trick already. Tweak with options and put it in a cronjob?

I’m preparing another R7800 I picked up recently. I’m also refactoring my uci/shell scripts on this unit. It’s idle practically all the time now; I set the performance governor yesterday and plugged in a Shelly Plug S to do some power consumption measurements. I’ll set it to the ondemand governor in a moment to compare power consumption and thermal zones as well. I must say that fixing the CPUs to one fixed frequency has made a huge difference for my R7800s.

If temperature and power consumption at 1700 MHz are not too far off from those at 600/800 MHz, I think you can choose whichever frequency suits your situation. But let’s see some numbers first.

quarky · October 21, 2022, 7:20am

It would be ideal if we can reliably trigger this issue which can be used to prove or disprove theories.

Cool. I'll take a look at this. As long as it can cause CPU clock freq to change and can be used to reliably trigger the L2 scaling bug, it will be invaluable to help fix the bug.

Edit: It looks like the stress package is a userland tool. It may not be able to trigger the issue, as I think we need a process running in the kernel to trigger it. I could be wrong, though. It may instead manifest as a user process core dump.
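
For reference, the userland approach boils down to alternating busy and idle phases so that a demand-driven governor keeps ramping up and down. The sketch below is hypothetical and not the stress package itself: `busy_for` and `duty_cycle` are made-up names, it loads a single core only, and whether this kind of load can reproduce the bug at all is exactly the open question.

```python
import time


def busy_for(seconds):
    """Spin on the CPU for roughly `seconds`; return the number of loop iterations."""
    deadline = time.monotonic() + seconds
    iterations = 0
    while time.monotonic() < deadline:
        iterations += 1
    return iterations


def duty_cycle(busy_s, idle_s, cycles):
    """Alternate busy and idle phases; return the total busy-loop iterations."""
    total = 0
    for _ in range(cycles):
        total += busy_for(busy_s)  # load phase: governor should raise the clock
        time.sleep(idle_s)         # idle phase: governor should lower it again
    return total


if __name__ == "__main__":
    # Three short cycles as a smoke test; a real soak test would run far longer.
    duty_cycle(busy_s=0.2, idle_s=0.2, cycles=3)
```

Comparing the cpufreq transition counter before and after a run is one way to confirm that the load pattern really causes frequency changes.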

D43m0n · October 21, 2022, 9:59am

I've got some graphs on power consumption and temperature on an idle running R7800 with only 5GHz active, one LAN port active and nothing else.

CPU frequency:

You can see it's not really doing anything:

The smallest (first) load spike is when I logged in. The middle load spike is when I switched the governor from performance back to ondemand (which is a minimum of 800MHz in my case). The largest load spike is when I checked the graphs in Luci:

Here's the power consumption over the last 3 hours. I picked a time frame of 3 hours instead of 1 hour since this shows the trend a bit better in the decrease of power consumption when I switched the governor to ondemand again. In particular between 11:00 and 11:30 you can see the trend of power consumption going down a bit:


And to see what this does with thermal zones inside the R7800, I've picked a few graphs but not all of them:

It does matter, in the sense that a change in CPU frequency is visible when measured. In the case of NSS offloading, CPU usage is usually low. WiFi is also partially offloaded to NSS cores. In terms of "thermal wear" or power consumption the differences are minimal. You could also meet in the middle and set a fixed CPU frequency of 1400MHz or 1000MHz and be done with it. In my case (and several others here too) setting a fixed CPU frequency contributes to a lot fewer unexpected reboots.

altuntepe · October 21, 2022, 1:53pm

Hello everyone, why doesn't ACwifidude add the UniFi HD device? It uses the ipq8064, and CPU usage reaches 100% in AP mode. Could you please compile an NSS-enabled firmware for it?

pattagghiu · October 21, 2022, 6:47pm

No, in my case it's a stand-alone 7800.

That's interesting. I have a QNAP NAS, a Sky receiver, 2 smart TVs, a Denon receiver and a couple of Windows PCs wired, and I see no error packages. Can this be something filtered by managed switches? (I have VLANs)

Yes, I use smcproxy, but not for an IPTV stream. I use it (or better, I'd like to use it) to pass multicast traffic between VLANs, to let my smart TVs see and use the DLNA server in other subnets.

(sorry for late reply, new job here, not much time to play :))