Rockchip (rtl8211f) stmmac RX crash

Hello all,

I'm testing the latest snapshot builds on my R6S after upgrading to a 1.2 Gbps symmetric connection.

Both of the 2.5 Gbps (rtl8125) PCIe NICs work perfectly; I can get the full upload and download speeds (1100 Mbps+ in both directions).

However, when using the 1 Gbps (rtl8211f) MAC-attached NIC I can get around 950 Mbps download, but as soon as upload exceeds around 500 Mbps I get a kernel panic. Restricting upload to below 500 Mbps seems to resolve the issue (or at the very least makes it harder to trigger). See the logs below:

[  120.746602] Mem abort info:
[  120.746848]   ESR = 0x000000009600014f
[  120.747189]   EC = 0x25: DABT (current EL), IL = 32 bits
[  120.747668]   SET = 0, FnV = 0
[  120.747943]   EA = 0, S1PTW = 0
[  120.748225]   FSC = 0x0f: level 3 permission fault
[  120.748650] Data abort info:
[  120.748908]   ISV = 0, ISS = 0x0000014f, ISS2 = 0x00000000
[  120.749392]   CM = 1, WnR = 1, TnD = 0, TagAccess = 0
[  120.749835]   GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
[  120.750311] swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000003ddd000
[  120.750902] [ffff000003210000] pgd=18000001ffff8003, p4d=18000001ffff8003, pud=18000001ffff7003, pmd=18000001fffde003, pte=0060000003210783
[  120.752014] Internal error: Oops: 000000009600014f [#1] PREEMPT SMP
[  120.752562] Modules linked in: pppoe ppp_async nft_fib_inet nf_flow_table_inet pppox ppp_generic nft_reject_ipv6 nft_reject_ipv4 nft_reject_inet nft_reject nft_redir nft_quota nft_numgen nft_nat nft_masq nft_log nft_limit nft_hash nft_flow_offload nft_fib_ipv6 nft_fib_ipv4 nft_fib nft_ct nft_chain_nat nf_tables nf_nat nf_flow_table nf_conntrack slhc r8169 nfnetlink nf_reject_ipv6 nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 crc_ccitt gpio_button_hotplug(O)
[  120.756247] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G           O       6.6.45 #0
[  120.756894] Hardware name: FriendlyElec NanoPi R6S (DT)
[  120.757351] pstate: 80400009 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[  120.757959] pc : dcache_inval_poc+0x40/0x58
[  120.758331] lr : arch_sync_dma_for_cpu+0x2c/0x3c
[  120.758739] sp : ffff80008000bcf0
[  120.759030] x29: ffff80008000bcf0 x28: ffff0001018e8900 x27: ffff000104920000
[  120.759657] x26: 0000000000000000 x25: ffff000103d28500 x24: ffff0001018e8900
[  120.760284] x23: ffff0001018ec900 x22: 00000000fffffc36 x21: 0000000000000002
[  120.760910] x20: ffff000100bf6410 x19: 0000000000589000 x18: 0000000000000000
[  120.761537] x17: 1128298ef1fd0a08 x16: 01010000efc30001 x15: ffffffffffffffff
[  120.762164] x14: ffffffffffffffff x13: ffffffffffffffff x12: ffffffffffffffff
[  120.762790] x11: ffffffffffffffff x10: ffffffffffffffff x9 : ffffffffffffffff
[  120.763417] x8 : ffffffffffffffff x7 : 0000000000000640 x6 : dead00000000003f
[  120.764043] x5 : 0000000000000001 x4 : 0000000000000000 x3 : 000000000000003f
[  120.764670] x2 : 0000000000000040 x1 : ffff000100588c00 x0 : ffff000003210000
[  120.765296] Call trace:
[  120.765512]  dcache_inval_poc+0x40/0x58
[  120.765849]  dma_sync_single_for_cpu+0xec/0x110
[  120.766250]  stmmac_napi_poll_rx+0x30c/0xd9c
[  120.766628]  __napi_poll+0x38/0x178
[  120.766939]  net_rx_action+0x114/0x23c
[  120.767270]  handle_softirqs+0x108/0x248
[  120.767617]  __do_softirq+0x14/0x20
[  120.767926]  ____do_softirq+0x10/0x1c
[  120.768249]  call_on_irq_stack+0x24/0x4c
[  120.768594]  do_softirq_own_stack+0x1c/0x28
[  120.768963]  irq_exit_rcu+0xbc/0xd8
[  120.769272]  el1_interrupt+0x38/0x68
[  120.769590]  el1h_64_irq_handler+0x18/0x24
[  120.769951]  el1h_64_irq+0x68/0x6c
[  120.770251]  cpuidle_enter_state+0x130/0x2f0
[  120.770625]  cpuidle_enter+0x38/0x50
[  120.770941]  do_idle+0x19c/0x1f0
[  120.771229]  cpu_startup_entry+0x38/0x3c
[  120.771575]  __cpu_disable+0x0/0xdc
[  120.771883]  __secondary_switched+0xb8/0xbc
[  120.772255] Code: 8a230000 54000060 d50b7e20 14000002 (d5087620) 
[  120.772787] ---[ end trace 0000000000000000 ]---
[  120.773192] Kernel panic - not syncing: Oops: Fatal exception in interrupt
[  120.773790] SMP: stopping secondary CPUs
[  120.774203] Kernel Offset: disabled
[  120.774507] CPU features: 0x0,c0000000,70028141,1000700b
[  120.774971] Memory Limit: none
[  120.775242] Rebooting in 3 seconds..
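
For anyone reading the "Mem abort info" lines, the individual fields can be recovered from the raw ESR value by hand. Below is a minimal userspace sketch, assuming the standard AArch64 ESR_ELx layout for a data abort; the struct and function names are made up for illustration and are not kernel APIs.

```c
#include <stdint.h>
#include <stdio.h>

/* Selected fields of an AArch64 ESR_ELx value for a data abort. */
struct esr_fields {
    unsigned ec;   /* exception class, bits [31:26] */
    unsigned il;   /* instruction length bit, bit 25 (1 = 32-bit instruction) */
    unsigned iss;  /* instruction specific syndrome, bits [24:0] */
    unsigned cm;   /* cache maintenance operation, ISS bit 8 */
    unsigned wnr;  /* write-not-read, ISS bit 6 */
    unsigned fsc;  /* fault status code, ISS bits [5:0] */
};

/* Hypothetical helper: split a raw ESR value into the fields above. */
struct esr_fields decode_esr(uint64_t esr)
{
    struct esr_fields f;

    f.ec  = (unsigned)((esr >> 26) & 0x3f);
    f.il  = (unsigned)((esr >> 25) & 0x1);
    f.iss = (unsigned)(esr & 0x1ffffff);
    f.cm  = (f.iss >> 8) & 0x1;
    f.wnr = (f.iss >> 6) & 0x1;
    f.fsc = f.iss & 0x3f;
    return f;
}

/* Hypothetical helper: print the fields in a form close to the log. */
void print_esr(uint64_t esr)
{
    struct esr_fields f = decode_esr(esr);

    printf("EC = 0x%02x, IL = %u, ISS = 0x%08x\n", f.ec, f.il, f.iss);
    printf("CM = %u, WnR = %u, FSC = 0x%02x\n", f.cm, f.wnr, f.fsc);
}
```

Calling `print_esr(0x9600014f)` should give the same EC, ISS, CM, WnR and FSC values as the log. CM = 1 says the fault was raised by a cache maintenance instruction, which fits the pc being inside dcache_inval_poc.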

I have validated this on a few different builds, all with the same issue.

I'm a bit lost as to how to debug this one, so any advice would be appreciated.

Thanks in advance.

After some debugging I can see that the crash in the trace above happens because the RX buffer lengths are:

buf1_len = 0
buf2_len = 4294966042

This buf2 length runs far outside the dma_buf size (that's 1536).

@Ansuel / @robimarko / @slh

Have any of you seen any buffer overflow issues in stmmac and rtk chips?

After some more debugging I have found that, in stmmac_rx_buf2_len, the length parameter is larger than the result of stmmac_get_rx_frame_len.

For example, for a crash with buf2_len = 4294966330

from within the stmmac_rx_buf2_len function I see the values
plen = 2106
len = 3072

The return value is plen - len, i.e. -966, which is 4294966330 as a uint32 and matches the buf2_len.

I am unsure whether it's the result of stmmac_get_rx_frame_len that's problematic or the supplied length parameter.
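
The wrap-around described above can be reproduced in userspace. This is only a sketch of the arithmetic, not the driver code: the function names are hypothetical, and the "checked" variant is there to show what a guard would look like, not as a proposed fix (it would hide the question of why len ends up larger than plen).

```c
#include <stdint.h>

/* Hypothetical stand-in for the final subtraction: both operands are
 * unsigned 32-bit values, so a len larger than plen wraps around. */
uint32_t buf2_len_unchecked(uint32_t plen, uint32_t len)
{
    return plen - len;
}

/* Hypothetical guarded variant: report 0 instead of wrapping. */
uint32_t buf2_len_checked(uint32_t plen, uint32_t len)
{
    return plen > len ? plen - len : 0;
}

/* Read a wrapped 32-bit length back as the signed difference it came from. */
int64_t signed_difference(uint32_t v)
{
    return v > INT32_MAX ? (int64_t)v - ((int64_t)1 << 32) : (int64_t)v;
}
```

With plen = 2106 and len = 3072 the unchecked version returns 4294966330. Read back as a signed value, that is -966, and the earlier buf2_len of 4294966042 is -1254, so both crashes look like the same underflow.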

There are pending fixes for the R6S/R6C on the kernel mailing list: https://patchwork.kernel.org/project/linux-rockchip/list/?series=861351
and improvements for RK3588 in OpenWrt: https://github.com/openwrt/openwrt/pull/16149
I tested on the R6C and saw no issues with the stmmac driver.

Were you able to exceed 500 Mbps upload and download speeds?

I cannot see anything in those patches that would change the behaviour of the GMAC NIC in relation to the crash I'm seeing.

Yes, about 920 Mbps up and down.

in https://patchwork.kernel.org/project/linux-rockchip/patch/20240612205056.397204-3-seb-dev@mail.de/

+/* RTL8211F-CG Ethernet */
 &mdio1 {
 	rgmii_phy1: ethernet-phy@1 {
 		compatible = "ethernet-phy-id001c.c916";
 		reg = <0x1>;
+		phy-supply = <&vcc_3v3_s0>;
 		pinctrl-names = "default";
-		pinctrl-0 = <&rtl8211f_rst>;
+		pinctrl-0 = <&gmac1_rstn_l>;
 		reset-assert-us = <20000>;
 		reset-deassert-us = <100000>;
 		reset-gpios = <&gpio3 RK_PB7 GPIO_ACTIVE_LOW>;
 	};
 };

Unfortunately none of those changes have resolved the issue for me :disappointed:

Something interesting to add: when other devices are connected to the 1 Gbps port, the issue doesn't show.

I have just tested with a laptop and a desktop; both could hit around 940 Mbps in both directions (with a hefty 120 ms+ hit to latency at those speeds).

The device that causes the crash when connected is a Dynalink DL-WRX36 (ipq8074) router working as an access point, so the R6S is connected to one of the switch ports on the DL-WRX36. Every time, that device causes a crash in the stmmac driver when under load.

check this out if you haven't already:

Also maybe try changing this:
(those 3 extra lines in the gmac section)

&gmac1 {
	clock_in_out = "output";
	phy-handle = <&rgmii_phy1>;
	phy-mode = "rgmii-rxid";
	pinctrl-0 = <&gmac1_miim
		     &gmac1_tx_bus2
		     &gmac1_rx_bus2
		     &gmac1_rgmii_clk
		     &gmac1_rgmii_bus>;
	pinctrl-names = "default";

	snps,aal;
	snps,rxpbl = <0x4>;
	snps,txpbl = <0x2>;

	tx_delay = <0x42>;
	status = "okay";
};

Hell, maybe that even fixes the VLAN problem, who knows.

You could also check whether the problem is with EEE, since it only happens with some devices:

ethtool --set-eee eth0 eee off

There are other things you could try in the DTS too, like
(if it's not set somewhere else):

                        rx-fifo-depth = <4096>;
                        tx-fifo-depth = <2048>;

So, e.g.:

&gmac1 {
	clock_in_out = "output";
	phy-handle = <&rgmii_phy1>;
	phy-mode = "rgmii-rxid";
	pinctrl-0 = <&gmac1_miim
		     &gmac1_tx_bus2
		     &gmac1_rx_bus2
		     &gmac1_rgmii_clk
		     &gmac1_rgmii_bus>;
	pinctrl-names = "default";

	rx-fifo-depth = <4096>;
	tx-fifo-depth = <2048>;

	snps,aal;
	snps,rxpbl = <0x4>;
	snps,txpbl = <0x2>;

	tx_delay = <0x42>;
	status = "okay";
};