Netgear R7800 exploration (IPQ8065, QCA9984)

quarky · February 3, 2022, 2:22am

From my reading of the code, the container_of() macro just performs pointer arithmetic and never dereferences the pointer, so by itself it shouldn't cause any NULL pointer dereference.
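
To make that concrete, here is a minimal userspace sketch. The macro is a simplified form of the kernel's container_of() (the real one adds type checking), and the toy_hw / toy_mux types and helper names are made up for illustration, not taken from the krait clock code.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified container_of(): subtract the member offset from the member
 * pointer. Nothing is read from memory here. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical types, for illustration only. */
struct toy_hw { int id; };
struct toy_mux { int parent; struct toy_hw hw; };

/* Pure pointer arithmetic: cannot fault on its own. */
static struct toy_mux *to_toy_mux(struct toy_hw *hw)
{
    return container_of(hw, struct toy_mux, hw);
}

/* The first real memory access is the ->parent load below, so a bogus
 * hw pointer would fault here, not inside to_toy_mux(). */
static int toy_mux_parent(struct toy_hw *hw)
{
    return to_toy_mux(hw)->parent;
}

/* Round trip: member pointer -> containing struct -> field. */
static void toy_round_trip_check(void)
{
    struct toy_mux m = { .parent = 3, .hw = { .id = 7 } };

    assert(to_toy_mux(&m.hw) == &m);
    assert(toy_mux_parent(&m.hw) == 3);
}
```

So if a bad pointer is involved, the fault would be expected at the first load or store through the resulting pointer, not at the container_of() itself.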

From the panic log, the problem seems to be caused by a NULL pointer dereference in the __timer_delay() function, since the CPU PC points at that function. But __timer_delay() is just doing some integer computation and executing a bunch of CPU nops. I can't see anything there that would cause a NULL pointer dereference. Interesting ...

Maybe I'm not reading the panic log correctly.

anon98444528 · February 3, 2022, 3:05am

@ansuel i saw that you rebased 4748 - thx for that and there are no more merge errors; however, i'm getting new errors about patches failing while building the toolchain after a make distclean. In particular, 765-3-net-next-net-dsa-stop-updating-master-MTU-from-master.c.patch had all hunks fail (it looks like these changes were already accepted?) and for 765-4 a couple succeeded with fuzz, but at least one failed.

It's possible that the one 5.15 build that did succeed for me above was done with a toolchain built for 5.10 - i didn't realize the toolchain is dependent on the kernel version (or i thought that, if it mattered, the toolchain would be rebuilt after selecting the testing 5.15 kernel). Not sure if that could have caused the build time errors about the missing mr42/52 dtb files or if you just haven't got to that yet.

No hurry, I'll play some more and check back later.

ninjanoir78 · February 3, 2022, 1:04pm

same here........

Ansuel · February 3, 2022, 3:45pm

Yes sorry. I pushed the wrong patchset. Now it's the one I use in my buildroot

Ansuel · February 3, 2022, 3:56pm

IMHO to investigate the problem we shouldn't care about the NULL pointer dereference but about the fact that these panics happen right after the mux notifier...
IMHO the NULL pointer is caused by the cpu being in a bad state, which causes all sorts of problems... (stall... random NULL pointer... random not implemented ops)

After all these years I'm still convinced that the krait notifier for the safe parent is implemented in a bad way, and that in a corner case the whole mux configuration can end up wrong, causing the cpu to be clocked at an abnormal freq or a too low freq... Now that the regulator is actually working, the problem is even worse: in the old days the regulator was set to the max, so the cpu could handle an abnormal freq spike, but now that it's set to the correct voltage, if a spike happens the system will crash 99% of the time for insufficient voltage.
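
As a way to picture the kind of corner case meant here, this is a toy userspace model of a "park on the safe parent, then restore" notifier. Everything in it (the safe_mux struct, the event names, the double-PRE scenario) is hypothetical and chosen only for illustration; it is not the krait-cc code and not a diagnosis of the actual bug.

```c
#include <assert.h>

/* Toy model only: names and logic are hypothetical. */
enum mux_event { PRE_RATE_CHANGE, POST_RATE_CHANGE };

struct safe_mux {
    int sel;       /* currently selected parent */
    int safe_sel;  /* parent that is always safe to run from */
    int saved_sel; /* parent to restore once the rate change is done */
};

/* Naive notifier: park on the safe parent before the change and restore
 * afterwards. If PRE fires twice in a row, saved_sel is overwritten with
 * safe_sel and the original parent is lost. */
static void naive_notifier(struct safe_mux *m, enum mux_event ev)
{
    if (ev == PRE_RATE_CHANGE) {
        m->saved_sel = m->sel;
        m->sel = m->safe_sel;
    } else {
        m->sel = m->saved_sel;
    }
}

/* Guarded variant: only remember the parent if not already parked. */
static void guarded_notifier(struct safe_mux *m, enum mux_event ev)
{
    if (ev == PRE_RATE_CHANGE) {
        if (m->sel != m->safe_sel)
            m->saved_sel = m->sel;
        m->sel = m->safe_sel;
    } else {
        m->sel = m->saved_sel;
    }
}

/* Replay the PRE, PRE, POST sequence against both variants. */
static void double_pre_check(void)
{
    struct safe_mux a = { .sel = 1, .safe_sel = 2, .saved_sel = 1 };
    struct safe_mux b = a;

    naive_notifier(&a, PRE_RATE_CHANGE);
    naive_notifier(&a, PRE_RATE_CHANGE);
    naive_notifier(&a, POST_RATE_CHANGE);
    assert(a.sel == 2); /* stuck on the safe parent */

    guarded_notifier(&b, PRE_RATE_CHANGE);
    guarded_notifier(&b, PRE_RATE_CHANGE);
    guarded_notifier(&b, POST_RATE_CHANGE);
    assert(b.sel == 1); /* original parent restored */
}
```

In this toy the wrong selection is harmless, but on real hardware a mux left on the wrong parent means a cpu frequency that no longer matches the voltage the regulator was asked for.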

Also in our system we are lacking the safe parent for the gmac... which looks to be called even if the nss core never changes frequency...

I need to find some time to stop working on qca8k driver and check if the coordinated clk patch that i posted earlier can be actually applied and used in our system and prepare a patch so you guys can test it.

Ansuel · February 3, 2022, 4:45pm

@quarky I was checking: did you ever experiment with the qca-rfs module?

the thing is that this is pure luck. You can totally have an idle system + a good chip + a good power brick and never experience this bug. The bug is triggered by a mix of factors that in the end cause a glitch in the system.

facboy · February 3, 2022, 5:42pm

ah ok, i got u.

anon98444528 · February 3, 2022, 5:48pm

I also correlate keeping my min cpu freq at 800 as contributing to maintaining long uptimes (as well as the great work by those maintaining ipq806x systems). However, there is also at least one comment in the forums that the bug @facboy mentioned was fixed and this is no longer necessary.

@facboy you could switch back to 325 MHz (or whatever the old min freq was) and see if you get a crash. A crash over the next 4-5 days would both help to confirm that keeping the cpu at 800 mitigates the issue and remove a bit of the "luck" factor @ansuel (justifiably) suggests.

anon98444528 · February 3, 2022, 6:54pm

hmm, using your latest push to pr 4748 (i did notice that pr 4828 is no longer needed for k515/dsa) i got this error

net/dsa/tag_qca.c: In function 'qca_tag_rcv':
net/dsa/tag_qca.c:46:25: error: 'struct dsa_switch' has no member named 'tagger_data'
   46 |         tagger_data = ds->tagger_data;
      |                         ^~
net/dsa/tag_qca.c: In function 'qca_tag_connect':
net/dsa/tag_qca.c:98:11: error: 'struct dsa_switch' has no member named 'tagger_data'
   98 |         ds->tagger_data = tagger_data;
      |           ^~
net/dsa/tag_qca.c: In function 'qca_tag_disconnect':
net/dsa/tag_qca.c:105:17: error: 'struct dsa_switch' has no member named 'tagger_data'        
  105 |         kfree(ds->tagger_data);
      |                 ^~
net/dsa/tag_qca.c:106:11: error: 'struct dsa_switch' has no member named 'tagger_data'        
  106 |         ds->tagger_data = NULL;
      |           ^~
net/dsa/tag_qca.c: At top level:
net/dsa/tag_qca.c:112:10: error: 'const struct dsa_device_ops' has no member named 'connect'  
  112 |         .connect = qca_tag_connect,
      |          ^~~~~~~
net/dsa/tag_qca.c:112:9: warning: the address of 'qca_tag_connect' will always evaluate as 'true' [-Waddress]
  112 |         .connect = qca_tag_connect,
      |         ^
net/dsa/tag_qca.c:113:10: error: 'const struct dsa_device_ops' has no member named 'disconnect'
  113 |         .disconnect = qca_tag_disconnect,
      |          ^~~~~~~~~~
net/dsa/tag_qca.c:113:23: warning: excess elements in struct initializer
  113 |         .disconnect = qca_tag_disconnect,
      |                       ^~~~~~~~~~~~~~~~~~
net/dsa/tag_qca.c:113:23: note: (near initialization for 'qca_netdev_ops')
make[7]: *** [scripts/Makefile.build:277: net/dsa/tag_qca.o] Error 1
make[6]: *** [scripts/Makefile.build:540: net/dsa] Error 2
make[6]: *** Waiting for unfinished jobs....

Ansuel · February 3, 2022, 7:08pm

2 other missing patches force pushed...

anon98444528 · February 3, 2022, 7:35pm

so I mentioned the missing dtb's earlier. I know how to bypass this build error - I just didn't this time as i wanted to see if it resulted from something i did wrong in previous attempts. It does not look like my issue:

FATAL ERROR: Couldn't open "/home/sn/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/linux-5.15.19/arch/arm/boot/dts/qcom-ipq8068-mr52.dtb": No such file or directory
mkimage: Can't open /home/sn/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/meraki_mr52-fit-uImage.itb.new.tmp: No such file or directory
make[5]: *** [Makefile:45: /home/sn/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/meraki_mr52-fit-uImage.itb] Error 255
Parallel mksquashfs: Using 1 processor
Creating 4.0 filesystem on /home/sn/openwrt/build_dir/target-arm_cortex-a15+neon-vfpv4_musl_eabi/linux-ipq806x_generic/root.squashfs, block size 262144.
^

To work around this, I comment out the following lines in target/linux/ipq806x/image/generic.mk:

...
#TARGET_DEVICES += meraki_mr42
...
#TARGET_DEVICES += meraki_mr52

and resume the build.

You may need to look at this so others can build using the pr.

HTH

EDIT: loading a 5.15 multi cpu dsa image now

EDIT: it worked and also (forcefully) upgraded without issue from k510 dsa. I'll run it for a while and report back later.

Ansuel · February 3, 2022, 8:12pm

thx for the report... I should have fixed it.

facboy · February 3, 2022, 8:53pm

384mhz. i'll try it for a bit, though my r7800 sees reduced duty now as it is mainly used as a WAP - i'm trying out a nanopi r4 as the main router.

quarky · February 3, 2022, 10:00pm

qca-rfs is not compatible with ipq806x. It requires the ess switch which ipq806x is not using, IIRC.

anon98444528 · February 3, 2022, 11:29pm

I've already pulled this build off and gone back to a k510+dsa (pr4036 only).

I did a netperf from a 5g wifi client that is capable of doing 450+ mbps to a linux box attached via wire to the r7500v2 (wan port, using vlan's) and, on the 5.15 pr 4748 build, I could only do 200 mbps at best. One r7500v2 cpu is maxed out and the other is at 50+%. Going back to a k510+dsa build, i regain the netperf rates for this client with much much less cpu usage.

I seem to recall that vlans might be an issue - so perhaps nothing new here.

Unfortunately, I confounded these results by also changing from wpad (full) to wpad-wolfssl (full, pr4748 build), iw to iw-full, ath10k-ct htt-full to ath10k-htt. I don't think these changes would explain the excessive cpu usage with pr 4748.

EDIT: Not to mention that my 5.10 builds are from aug-nov time frame last year - I'll update a 5.10 build with pr 4036 to eliminate any such differences.

Ansuel · February 4, 2022, 12:10am

@anon98444528 can you test this https://github.com/Ansuel/openwrt/tree/5.15-improve (there is an extra commit)

anon98444528 · February 4, 2022, 12:22am

i will, but i'm going to do an updated 5.10+dsa build first as a baseline otherwise I'll always be second guessing the results.

May take a few days, as the r7500v2 will be in use for a bit.

Ansuel · February 4, 2022, 12:23am

If you want to have some fun in both tests, comment out ds->pcs_poll = true; in qca8k.c

Anyway from what I observed the slowdown seems to be with the internal phy part.
The mdio patch was introduced to enable the assisted learning for multicpu, which requires multiple writes/reads to access the fdb... with this we now do it in one go directly. Probably the mdio is faster for the phy part... We can totally do it both ways

I didn't trust the phy part and it would make sense if it does cause some slowdown (not down to 200mbps, but that should be fixed by the extra patch i posted)
In theory disabling the pcs_poll should give our perf back... at worst we can just disable the eth mgmt for the phy mdio part... the mdio part is now optimized and uses one write/read instead of 3, so it should be fine...

In fact... @nmrh thx for the bench that's exactly what i needed... As it looks stable now...

tell me if you need some instructions on how to disable eth for phy

anon98444528 · February 4, 2022, 12:32am

best tell me what you want. But sleep first, it's got to be late for you and I won't get to this quickly.

Ansuel · February 4, 2022, 12:39am

Anyway, to disable the eth mgmt for phy

in qca8k_internal_mdio_write and qca8k_internal_mdio_read just comment out the qca8k_phy_eth_command part and the if around it.

This way phy will use mdio instead of eth mgmt