Transmit queue timeouts are unfortunately still a thing on devices with an mt7530 switch, as is evident from this thread: Mtk_soc_eth watchdog timeout after r11573
One patch that has made a big difference in how often this issue is triggered is the one that disables flow control on the CPU port: https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=498f1f4f5df2d077ba524f5735906bb52c12d580
However, this patch does not disable flow control completely. It only disables it on the CPU port. Devices connected directly to the switch still see flow control being advertised, and thus will still send pause frames, which seem to be the direct cause of all the issues as documented by @kristrev. This is evident from running ethtool on a PC connected to the router: Link partner advertised pause frame use: Symmetric.
Now I REALLY want to try and disable flow control on ALL ports, to hopefully prevent any connected devices from sending any pause frames (if they honor the advertisement). I am hoping this will fix the timeouts completely. I have 6 mt7621 devices in production use. 5 of them can get hundreds of days of uptime without any issues. And 1 is running into the timeout issue ~once a week. The big difference is that the one having an issue is connected to an unmanaged switch, which presumably uses pause frames.
I am trying to reverse engineer the patch that disables flow control, so I can hopefully apply it to the other ports as well. But this has proven to be a bit more difficult than I had hoped. I am using the following document as reference: http://47.107.224.89/redmine/attachments/download/49/MT7621_ProgrammingGuide_GSW_v01.pdf
The big change in the above mentioned patch that disabled flow control on the CPU port, is this:
/* (GE1, Force 1000M/FD, FC ON, MAX_RX_LENGTH 1536) */
mtk_switch_w32(gsw, 0x2305e33b, GSW_REG_MAC_P0_MCR);
mt7530_mdio_w32(gsw, 0x3600, 0x5e33b);
is changed to:
/* (GE1, Force 1000M/FD, FC OFF, MAX_RX_LENGTH 1536) */
mtk_switch_w32(gsw, 0x2305e30b, GSW_REG_MAC_P0_MCR);
mt7530_mdio_w32(gsw, 0x3600, 0x5e30b);
Looking at the mt7530_mdio_w32 function, we can see that it is writing a 32-bit value to the switch registers. 0x3600 is the location, which is the register for MAC 6 according to the MediaTek documentation. In the old version this value is:
0x5e33b = 1011110001100111011
And the new value is:
0x5e30b = 1011110001100001011
As we can see, counting from zero and from right to left, bits 4 and 5 have been switched from 1 to 0. Looking at the documentation, these bits are FORCE_TX_FC_P6 and FORCE_RX_FC_P6 respectively, i.e. they enable/disable flow control for TX and RX on MAC 6.
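The bit comparison above can be double-checked with a few lines of C. This is only a sketch: the bit positions are my reading of the programming guide as described above, and the helper names `changed_bits` and `clear_flow_control` are made up for illustration, they are not part of the driver.

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions as read from the programming guide (see text above). */
#define FORCE_TX_FC (1u << 4)
#define FORCE_RX_FC (1u << 5)

/* Hypothetical helper: XOR leaves a 1 wherever the two values differ. */
uint32_t changed_bits(uint32_t old_val, uint32_t new_val)
{
    return old_val ^ new_val;
}

/* Hypothetical helper: clear both forced flow-control bits and
 * leave every other bit of the register value untouched. */
uint32_t clear_flow_control(uint32_t mcr)
{
    return mcr & ~(FORCE_TX_FC | FORCE_RX_FC);
}

/* Sanity check against the values from the patch. */
void check_patch_values(void)
{
    assert(changed_bits(0x5e33b, 0x5e30b) == (FORCE_TX_FC | FORCE_RX_FC));
    assert(clear_flow_control(0x5e33b) == 0x5e30b);
}
```

The same mask also turns 0x2305e33b into 0x2305e30b, which matches the value the patch passes to mtk_switch_w32.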
The registers for MAC 0 through 5 are also in the same programming guide: 0x3000 through 0x3500. Now I am running into multiple issues that prevent me from knowing how to disable flow control on the other ports:
- The programming guide clearly states "We would suggest don't use the register 0x3000 to 0x3400. It may not work.". What's preventing us from writing the registers for MAC 0 through MAC 4? What happens if we try anyway? I'd love to just try, but this is where the next issues come into play.
- The 0x5e30b value contains bits specifically tailored for MAC 6. For example, it's configured as NOT being connected to a PHY and operates in MAC mode, which makes sense since it's the CPU port. However, I am unsure what values I should write for the other MACs. Which 5 MACs are the ones connected to the 5 physical ports on my device? And why is there a 7th MAC? Are the PHY/MAC mode bit and external PHY bit the only two bits that should differ between the CPU port and the 5 physical ports, or should more bits be changed?
- Why is the same value (with some more bits in front set) also written with that weird mtk_switch_w32 function? It's using the GSW_REG_MAC_P0_MCR (0x100) constant as the register address. From the name of the constant it seems to be a register for Port 0. If I were to want to disable flow control on all ports, would I also have to use this function multiple times?
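To make the first two questions concrete, this is roughly what I have in mind, written against a fake register array instead of the real switch. It assumes the MAC control registers are spaced 0x100 apart (0x3000 for MAC 0 up to 0x3600 for MAC 6, as listed above) and that bits 4 and 5 mean the same thing on every MAC, which I have not confirmed. `fake_r32` and `fake_w32` are hypothetical stand-ins for the driver's accessors, and none of this has been tested on hardware (keep the guide's warning about 0x3000 to 0x3400 in mind).

```c
#include <assert.h>
#include <stdint.h>

#define PMCR_BASE   0x3000u  /* MAC 0 control register, per the guide */
#define PMCR_STRIDE 0x100u   /* assumed spacing: 0x3000, 0x3100, ... 0x3600 */
#define NUM_MACS    7u
#define FC_MASK     ((1u << 4) | (1u << 5))  /* forced TX/RX flow control */

/* Fake register file so the sketch runs without a switch attached. */
static uint32_t fake_regs[NUM_MACS];

/* Address of the control register for a given MAC. */
uint32_t pmcr_addr(unsigned int port)
{
    assert(port < NUM_MACS);
    return PMCR_BASE + PMCR_STRIDE * port;
}

/* Hypothetical stand-in for a 32-bit register read. */
uint32_t fake_r32(uint32_t addr)
{
    return fake_regs[(addr - PMCR_BASE) / PMCR_STRIDE];
}

/* Hypothetical stand-in for a 32-bit register write. */
void fake_w32(uint32_t addr, uint32_t val)
{
    fake_regs[(addr - PMCR_BASE) / PMCR_STRIDE] = val;
}

/* Clear the flow-control bits on every MAC, keeping all other bits. */
void disable_fc_all_ports(void)
{
    for (unsigned int port = 0; port < NUM_MACS; port++) {
        uint32_t addr = pmcr_addr(port);
        fake_w32(addr, fake_r32(addr) & ~FC_MASK);
    }
}
```

Reading the current value first and only masking out the two bits would sidestep the question of which other bits each MAC needs, provided the registers can be read back reliably.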
There is also another potential solution that I found, but also here I am missing some crucial knowledge that is preventing me from implementing a patch. The earlier mentioned Mediatek document also mentions something interesting specifically regarding flow control in section 2.16. Namely, bit 31 at register 0x1fe0 enables/disables flow control globally (FC_EN). My questions regarding this are:
- The 0x1fe0 register is never written to in any of OpenWrt's code. Therefore, it's still at its default value. However, AFAIK it is impossible to write only a single bit. So do I need to read from that register (how?) and clear bit 31? Or does the "Reset" row in the documentation mean the default values after a reset? Because if that's the case, I can simply copy those values, clear bit 31, and write those 32 bits in one go without having to read first.
- Again, should I write to these registers with the mt7530_mdio_w32 function, the mtk_switch_w32 function, or both?
- Should I write to this register at the same location in the code as where FC is disabled on MAC 6? I.e., just after the switch has been reset?
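For the read-then-write option, the pattern I am thinking of is a plain read-modify-write. The sketch below runs against a mock register and takes the accessors as function pointers, because I do not know yet which read/write functions are the right ones for this register. `mock_read`, `mock_write` and `clear_fc_en` are hypothetical names, and the start value used in any test is arbitrary, not the documented reset value.

```c
#include <stdint.h>

#define FC_CTRL_REG 0x1fe0u     /* register from section 2.16 of the guide */
#define FC_EN       (1u << 31)  /* global flow control enable bit */

typedef uint32_t (*reg_read_fn)(uint32_t addr);
typedef void (*reg_write_fn)(uint32_t addr, uint32_t val);

/* Mock register so the sketch runs without hardware. */
static uint32_t mock_reg;

uint32_t mock_read(uint32_t addr)
{
    (void)addr;  /* single mock register, address is ignored */
    return mock_reg;
}

void mock_write(uint32_t addr, uint32_t val)
{
    (void)addr;
    mock_reg = val;
}

/* Read-modify-write: clear only FC_EN and keep the other 31 bits. */
void clear_fc_en(reg_read_fn rd, reg_write_fn wr)
{
    uint32_t val = rd(FC_CTRL_REG);
    wr(FC_CTRL_REG, val & ~FC_EN);
}
```

If the "Reset" row really is the power-on default, the read could be skipped and the constant written directly, but the read-modify-write version does not depend on that assumption.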
Thank you very much in advance for any help! By the way, I am trying to write this patch for the 19.07 codebase (non dsa-driver).