[PoC] [WIP] DSA support for ath79 built-in switch

Yes, hoping they have compatible regs (you need to set it to the new tagger), but I'm not sure at this point. We should check, honestly.

I whipped up a patch to do that by matching a different OF compatible string; building now.

Edit: And it crashes at boot. Enough for today :wink:

And it was just me being an idiot and writing to an unallocated block of memory during probe.
Output of tcpdump now looks better:

15:10:17.385134 AF Unknown (4294967295), length 344: 
        0x0000:  ffff ffff ffff 30b5 c2dd 81cc 8082 0800  ......0.........
        0x0010:  4500 0148 0000 0000 4011 79a6 0000 0000  E..H....@.y.....
        0x0020:  ffff ffff 0044 0043 0134 800f 0101 0600  .....D.C.4......
        0x0030:  1bd3 2a89 257f 0000 0000 0000 0000 0000  ..*.%...........
        0x0040:  0000 0000 0000 0000 30b5 c2dd 81cc 0000  ........0.......
        0x0050:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0060:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0070:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0080:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0090:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x00a0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x00b0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x00c0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x00d0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x00e0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x00f0:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0100:  0000 0000 0000 0000 0000 0000 0000 0000  ................
        0x0110:  0000 0000 0000 0000 6382 5363 3501 0139  ........c.Sc5..9
        0x0120:  0202 4037 0801 0306 0c0f 1c2a 790c 0f74  ..@7.......*y..t
        0x0130:  706c 696e 6b2d 6d72 3334 3230 7632 3c0c  plink-mr3420v2<.
        0x0140:  7564 6863 7020 312e 3335 2e30 ff00 0000  udhcp.1.35.0....
        0x0150:  0000 0000 0000 0000                      ........

The actual header at 0xC now matches the datasheet. However, still no traffic on the port at all in either direction.
I wonder if that PORT_NUM field of the header is a bitmap, or an actual port number, as is the case for ingress traffic on AR83x7.
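
To make the two possible readings of PORT_NUM concrete, here is a minimal Python sketch. The 4-bit field width and the helper names are assumptions for illustration only; they are not taken from the datasheet or from the driver.

```python
from typing import List

PORT_FIELD_WIDTH = 4  # assumption: PORT_NUM is a 4-bit field


def ports_from_bitmap(field: int, width: int = PORT_FIELD_WIDTH) -> List[int]:
    """Read the field as a bitmap: every set bit selects one port."""
    return [bit for bit in range(width) if field & (1 << bit)]


def port_from_number(field: int) -> int:
    """Read the field as a plain port index."""
    return field


if __name__ == "__main__":
    for value in (0x1, 0x2, 0x3, 0x4):
        print(f"field={value:#x}: bitmap -> ports {ports_from_bitmap(value)}, "
              f"number -> port {port_from_number(value)}")
```

The two readings only agree for a few values, so a frame tagged for one port under the wrong interpretation can end up addressed to a different port, or to several at once.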

Also, there is still no traffic on GMAC0, so there have to be some differences between ar933x and ar934x w.r.t. MDIO register mapping. And this is probably the cause of the lack of traffic on the switch as well. I need to dig into that.

you should have logs for received tagged packets

There aren't any, so probably the link between GMAC1 and the switch is as misconfigured as the one between PHY0 and GMAC0. I'll dig into that later. It's great that the driver is regmap-based, because debugging will be relatively easy.

Mhhh, if that's the case then either the switch is not sending packets back or the system is not receiving them...

Anyway, from the documentation the port num is a bitmap... but the receive path is not clear in the documentation, so it can be that on receive it does give us the number... (that's why I added the debug print log)

Anyway parsing the packet gives strange results... (remember that it should be converted to little endian)
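
As a quick illustration of the byte-order point, the sketch below takes the first 16 bytes of the frame from the tcpdump output earlier in the thread and reads the two header bytes at offset 0xC both ways. It only shows the raw values; the individual header fields are deliberately not decoded here, since the exact layout for this chip is not confirmed in the thread. The function name is made up for the example.

```python
import struct

# First 16 bytes of the captured frame, copied from the tcpdump hexdump.
FRAME_START = bytes.fromhex("ffffffffffff30b5c2dd81cc80820800")

TAG_OFFSET = 0x0C  # the header sits right after the two MAC addresses
TAG_LEN = 2


def read_tag(frame: bytes):
    """Return (tag as big endian, tag as little endian, following EtherType)."""
    tag_bytes = frame[TAG_OFFSET:TAG_OFFSET + TAG_LEN]
    (tag_be,) = struct.unpack(">H", tag_bytes)
    (tag_le,) = struct.unpack("<H", tag_bytes)
    (ethertype,) = struct.unpack(
        ">H", frame[TAG_OFFSET + TAG_LEN:TAG_OFFSET + TAG_LEN + 2])
    return tag_be, tag_le, ethertype


if __name__ == "__main__":
    be, le, ethertype = read_tag(FRAME_START)
    # prints: as-is 0x8082, little endian 0x8280, ethertype 0x0800
    print(f"as-is {be:#06x}, little endian {le:#06x}, ethertype {ethertype:#06x}")
```

Reading the same two bytes in the wrong order moves every bit field to a different position, which would explain "strange results" when parsing.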

I converted WR743NDv1 et al. (AR7240) and again I get traffic, but there are some woes with regard to PHY mapping and interrupts that need digging around. Only LAN2 and LAN4 work on it; the others spew out oopses about unhandled IRQs. This is gonna be fun.

And this is getting funny indeed. One time, the WR743 did work just like that, with every port including WAN, but most of the time I get at most two PHYs running, with the others getting disabled due to MDIO write timeouts between the MAC and the switch. The main difference between the AR7240 and the AR9331, which worked, is that the switch is connected to MDIO0, not MDIO1.
The usual error message:

[  125.980138] ar9331_switch mdio.0:10 lan1: configuring for phy/internal link mode
[  125.989847] ar9331_switch mdio.0:10: PHY write error: -145

[  125.995492] ------------[ cut here ]------------
[  126.000123] WARNING: CPU: 0 PID: 222 at drivers/net/phy/phy.c:942 phy_state_machine+0xb0/0x340
[  126.008830] Modules linked in: ath9k ath9k_common pppoe ppp_async iptable_nat ath9k_hw ath xt_state xt_nat xt_conntrack xt_REDIRECT xt_MASQUERADE xt_FLOWOFFLOAD xt_CT wireguard pppox ppp_generic nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_pptp nf_nat_irc nf_nat_h323 nf_nat_amanda nf_nat nf_flow_table nf_conntrack_tftp nf_conntrack_snmp nf_conntrack_sip nf_conntrack_pptp nf_conntrack_irc nf_conntrack_h323 nf_conntrack_broadcast nf_conntrack_amanda nf_conntrack mac80211 lzo libchacha20poly1305 libblake2s ipt_REJECT cfg80211 xt_time xt_tcpudp xt_multiport xt_mark xt_mac xt_limit xt_comment xt_TCPMSS xt_LOG ts_kmp ts_fsm ts_bm slhc poly1305_mips nf_reject_ipv4 nf_log_syslog nf_defrag_ipv6 nf_defrag_ipv4 mdio_netlink lzo_rle lzo_decompress lzo_compress libcurve25519_generic libblake2s_generic iptable_raw iptable_mangle iptable_filter ip_tables crc_ccitt compat chacha_mips asn1_decoder fuse xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet
[  126.009586]  ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 ip6_udp_tunnel udp_tunnel vfat fat nls_utf8 nls_iso8859_1 nls_cp437 nls_base sha256_generic libsha256 seqiv jitterentropy_rng drbg kpp hmac cmac crypto_acompress sd_mod scsi_mod scsi_common gpio_button_hotplug ext4 mbcache jbd2 crc16 mii crc32c_generic
[  126.145857] CPU: 0 PID: 222 Comm: kworker/0:2 Not tainted 5.15.34 #0
[  126.152251] Workqueue: events_power_efficient phy_state_machine
[  126.158233] Stack : 807f0000 80eb5880 8040eaf0 00000009 80c0555c 800c2210 80f621f0 807efea3
[  126.166696]         809b32d4 000000de 80f75d5c 807f0000 80726058 00000001 80f75d30 cc39136f
[  126.175155]         00000000 00000000 80726058 80f75bb0 ffffefff 00000000 00000000 ffffffea
[  126.183616]         000000bb 80f75bbc 000000bb 807f5fd8 809b0000 00000009 00000000 8040eaf0
[  126.192060]         00000009 80c0555c 81000605 80eb58c0 00000018 803b1100 00000000 809b0000
[  126.200518]         ...
[  126.203000] Call Trace:
[  126.205454] [<80067068>] show_stack+0x28/0xf0
[  126.209863] [<80086d1c>] __warn+0xc0/0x12c
[  126.214026] [<80086de4>] warn_slowpath_fmt+0x5c/0xac
[  126.219023] [<8040eaf0>] phy_state_machine+0xb0/0x340
[  126.224133] [<8009e57c>] process_one_work+0x224/0x4b4
[  126.229211] [<8009e98c>] worker_thread+0x180/0x5c4
[  126.234048] [<800a66b4>] kthread+0x140/0x164
[  126.238365] [<800623b8>] ret_from_kernel_thread+0x14/0x1c
[  126.243813] 
[  126.245313] ---[ end trace 9296a3e8dac35458 ]---

Occasionally, I get similar timeouts on link state change IRQs from the switch, but the likely reason behind them is the same.
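
The -145 in the log is consistent with a timeout: MIPS uses its own errno numbering in the kernel, and 145 is ETIMEDOUT there (on x86 it would be 110). A minimal Python sketch for decoding such log lines off-target follows. The table is deliberately partial and the function name is hypothetical.

```python
import re

# Partial table of Linux errno values as numbered on MIPS.
# Only a handful of entries are listed; extend as needed.
MIPS_ERRNO = {
    5: "EIO",
    16: "EBUSY",
    19: "ENODEV",
    22: "EINVAL",
    145: "ETIMEDOUT",
}

ERROR_RE = re.compile(r"error: -(\d+)")


def decode_error(line: str):
    """Return the symbolic errno for a kernel log line, or None if no code is found."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    code = int(match.group(1))
    return MIPS_ERRNO.get(code, f"errno {code} (not in table)")


if __name__ == "__main__":
    sample = "[  125.989847] ar9331_switch mdio.0:10: PHY write error: -145"
    print(decode_error(sample))  # prints ETIMEDOUT
```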

I pushed my current tree to Github as well.

A small update. I just discovered on one of my previous builds that the qca8k driver seemed to conflict with ar9331, despite the drivers matching on OF compatible strings, which results in some woes on my WDR4300 and C7v2: for some reason it takes considerable time for them to attach to the network. I need to figure out what's going on when they connect again.

Wait, they have a different switch? I thought they were ar8327.

They are, but I have both ar9331 and qca8k enabled in the kernel in that build. qca8k probes and works, but for some reason both devices experience a very high CPU load and it takes minutes for them to get a DHCP lease. I have yet to discover why :expressionless:

You have removed the swconfig driver, right?

Yes. After dropping the patches for ar9331 it persisted for some reason, maybe kernel config. This is strange, both units show load average around 4-5, despite top reporting 99% idle.
I'll try a clean build. Will post a note here when I figure it out.

Edit: it seems that something broke when I updated the kernel to 5.15; I haven't rebased your patches for qca8k in a long time. Just having kernel 5.15 enabled did this, even without the patches for ar9331. So while this was a false alarm, something is up with qca8k and kernel 5.15 (currently at 5.15.45), because devices on kernel 5.10, or without an ar8327, did not misbehave for me. While at it, I noticed that the blinking pattern for the LEDs wasn't set correctly on 5.15 with the patches my branch carries for that.

You said that both drivers are slow, or only qca8k? The qca8k in 5.15 has the mgmt Ethernet stuff that is not present in 5.10, and it does work differently in your switch... That causes the init of the switch to be really slow... You can disable that by tweaking the change master function in the qca8k driver so that it returns early (it's a workaround while we bisect problems with this).

Acked. I'll stay at 5.10 for the time being for those devices. My intention was to check the DSA performance with recent fw4 update, but my build still includes fw3 for some reason. I need to change the config manually and rebuild, and maybe then refresh patches for qca8k.

What's strange is that once the router has booted after a couple of minutes, it is responsive, but IIRC 5 kernel threads are in the "disk sleep" state according to htop. Kinda makes sense, as the device has 5 PHYs.
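
Threads in "disk sleep" (state D, uninterruptible sleep) would also explain a load average of 4-5 with the CPU nearly idle, since Linux counts uninterruptible tasks in the load average alongside runnable ones. Below is a small Python sketch that lists such tasks by reading /proc. The function names are made up for the example, and it assumes a Python interpreter is available on the system being inspected.

```python
import glob
from typing import List, Tuple


def parse_stat(line: str) -> Tuple[str, str]:
    """Return (comm, state) from the contents of one /proc/<pid>/stat file."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the last ')' instead of on whitespace.
    head, tail = line.rsplit(")", 1)
    comm = head.split("(", 1)[1]
    state = tail.split()[0]
    return comm, state


def uninterruptible_tasks() -> List[str]:
    """Names of all tasks currently in state D (uninterruptible sleep)."""
    names = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as stat_file:
                comm, state = parse_stat(stat_file.read())
        except (OSError, ValueError, IndexError):
            continue  # the task exited while we were scanning
        if state == "D":
            names.append(comm)
    return names


if __name__ == "__main__":
    print(uninterruptible_tasks())
```

If the five stuck entries turn out to be kworker threads, their kernel stacks would be the next thing to look at.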

Going back to ar9331, I'll continue to play with that tomorrow.