Thank you so much, everyone, for the pointers in the right direction. Now it's time to deploy this build to the router having issues and see if it prevents the problem from cropping up. My final patch ended up doing the following:
Disabling flow control on all 7 MACs
Disabling flow control globally for good measure
Disabling pause advertisement on all 5 PHYs
The patch was made for the 19.07 branch (I applied it to 19.07.4 myself). For those who want to play around with it as well to test for stability: in the "gsw_mt7621.c" file, change the following:
/* (GE1, Force 1000M/FD, FC OFF, MAX_RX_LENGTH 1536) */
mtk_switch_w32(gsw, 0x2305e30b, GSW_REG_MAC_P0_MCR);
mt7530_mdio_w32(gsw, 0x3600, 0x5e30b);
to
/* (GE1, Force 1000M/FD, FC OFF, MAX_RX_LENGTH 1536) */
mtk_switch_w32(gsw, 0x2305e30b, GSW_REG_MAC_P0_MCR);
for (i = 0; i <= 6; i++) {
    mt7530_mdio_w32(gsw, 0x3000 + (i * 0x100), 0x5e30b);
}
/* Disable Flow Control Globally */
val = mt7530_mdio_r32(gsw, 0x1FE0);
val &= ~BIT(31);
mt7530_mdio_w32(gsw, 0x1FE0, val);
/* turn off pause advertisement on all PHYs */
for (i = 0; i <= 4; i++) {
    val = _mt7620_mii_read(gsw, i, 4);
    val &= ~BIT(10);
    _mt7620_mii_write(gsw, i, 4, val);
}
Good job. I have had no more timed-out errors since I separated each port into a different VLAN (I guess because this way the devices do not communicate directly through the switch). I will test your patch with the same VLAN on the ports where the computers are part of the same subnet, to see what happens. Maybe it was causing problems due to the use of pause frames.
Please keep me posted! I'm really curious to see if this helps others as well, because if so, I can make a pull request. I've just flashed the one device that was having the "transmit queue timed out" issue about once a week. Curious to see how it's going to do with flow control disabled. Fingers crossed.
Unfortunately it doesn't work properly. It forces the link to 1 Gbps on all ports, even if nothing is connected. Devices that only support 100 Mbps do not work.
I think this is not necessary for GMAC 0~4:
/* (GE1, Force 1000M/FD, FC OFF, MAX_RX_LENGTH 1536) */
mtk_switch_w32(gsw, 0x2305e30b, GSW_REG_MAC_P0_MCR);
for (i = 0; i <= 6; i++) {
    mt7530_mdio_w32(gsw, 0x3000 + (i * 0x100), 0x5e30b);
}
These bits should be set like this in registers 0x3000 ~ 0x3400, or not set at all (and check that pause frames are still disabled):
Yes, you are right. MACs 0 through 4 are all connected to the external PHYs and shouldn't be touched, I think. Bits 4 & 5 only work if bit 15 is set to 1, but setting bit 15 to 1 makes the MAC use all the bits prefixed with "FORCE_", and thus forces the link to a certain speed. The patch should probably be changed to:
/* (GE1, Force 1000M/FD, FC OFF, MAX_RX_LENGTH 1536) */
mtk_switch_w32(gsw, 0x2305e30b, GSW_REG_MAC_P0_MCR);
for (i = 5; i <= 6; i++) {
    mt7530_mdio_w32(gsw, 0x3000 + (i * 0x100), 0x5e30b);
}
/* Disable Flow Control Globally */
val = mt7530_mdio_r32(gsw, 0x1FE0);
val &= ~BIT(31);
mt7530_mdio_w32(gsw, 0x1FE0, val);
/* turn off pause advertisement on all PHYs */
for (i = 0; i <= 4; i++) {
    val = _mt7620_mii_read(gsw, i, 4);
    val &= ~BIT(10);
    _mt7620_mii_write(gsw, i, 4, val);
}
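Following the suggestion above of clearing only the relevant bits in 0x3000 ~ 0x3400 rather than writing a full force value, a read-modify-write version could look something like the sketch below. It only clears the two flow-control enable bits (bits 4 and 5 of the per-port register at 0x3000 + port * 0x100, as discussed above) and leaves everything else, including bit 15, untouched. The fake_r32/fake_w32 helpers just simulate the mt7530_mdio_r32/mt7530_mdio_w32 accessors so the snippet is self-contained; they are not part of the real driver.

```c
#include <stdint.h>

#define PMCR_BASE   0x3000u
#define PMCR_STRIDE 0x100u
#define PMCR_TX_FC  (1u << 5)   /* TX flow control (only effective when bit 15 is set) */
#define PMCR_RX_FC  (1u << 4)   /* RX flow control (only effective when bit 15 is set) */

static uint32_t regs[8];        /* fake per-port register shadow, index = port */

/* stand-ins for mt7530_mdio_r32()/mt7530_mdio_w32() */
static uint32_t fake_r32(uint32_t reg)
{
    return regs[(reg - PMCR_BASE) / PMCR_STRIDE];
}

static void fake_w32(uint32_t reg, uint32_t val)
{
    regs[(reg - PMCR_BASE) / PMCR_STRIDE] = val;
}

/* clear only the flow-control bits, leave speed/duplex/force bits alone */
static void disable_port_fc(int port)
{
    uint32_t reg = PMCR_BASE + (uint32_t)port * PMCR_STRIDE;
    uint32_t val = fake_r32(reg);

    val &= ~(PMCR_TX_FC | PMCR_RX_FC);
    fake_w32(reg, val);
}
```

With this approach a port that was at 0x5e33b (flow control on) ends up at 0x5e30b, matching the value the patch writes, without stomping on auto-negotiated settings on ports where bit 15 isn't set.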
An update on the flashed router, by the way: it has been running for a week now, but unfortunately I did see the "transmit queue timed out" error. It didn't result in a reboot. I am not sure whether the connection was interrupted, or for how long, since I wasn't using it when it happened. Still, it's not very promising to see the issue crop up again.
Since May there had been no problems. It rebooted after 12 hours of running with ports 1~4 in the same VLAN. I give up; it is best to use an external switch and put each port of the MediaTek switch in a different VLAN (with a software bridge if necessary).
It is ironic that a SoC intended for use as a switch cannot be used as one. I hope the same does not happen in the official firmware (for example in EdgeOS from Ubiquiti); otherwise this SoC is rubbish.
I'll leave it as before: GMAC port 5 (6) FC off, the interrupt handling patch, and a separate VLAN for each port. This way it is rock solid.
Well, I am not entirely sure yet. The problematic router has been running for 2.5 weeks now without crashing. Previously, it would crash once a week on average. While 2.5 weeks without crashing was possible before, it was pretty rare. So I am going to let it run for a bit longer to see if it can reach 4 weeks of uptime. If so, I will flash it again with the latest version of the patch (which doesn't force 1000 Mbit on all ports) and see if that is still sufficient to prevent it from crashing. I'm starting to become cautiously optimistic at the moment.
So, to add to the work you did, I've allowed flow control on ports 0-4 and left flow control off on ports 5 and 6.
Without flow control the CPU works twice as hard, which can cause kernel crashes.
But I've added a reset function on the switch, since pause frames can still be sent from other switches. I've experienced complete lockups of the switch, so with this new reset function it clears itself automatically.
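The poster's actual reset code isn't shown in the thread, but the pattern being described, counting transmit-queue timeouts and re-initialising the switch once a threshold is hit, could be sketched roughly like this. All names here (on_tx_timeout, switch_hard_reset, TIMEOUT_LIMIT) are made up for illustration; in a real driver the reset routine would re-run the switch init sequence.

```c
#include <stdbool.h>

#define TIMEOUT_LIMIT 3          /* hypothetical: tolerate a few timeouts before resetting */

static int timeout_count;
static int resets_done;          /* counter standing in for the real reset routine */

static void switch_hard_reset(void)
{
    /* a real implementation would re-run the mt7530 init sequence here */
    resets_done++;
    timeout_count = 0;           /* start counting fresh after recovery */
}

/* would be called from the driver's tx-timeout handler */
static void on_tx_timeout(void)
{
    if (++timeout_count >= TIMEOUT_LIMIT)
        switch_hard_reset();
}
```

The threshold avoids resetting on a single harmless timeout while still recovering automatically from a hard lockup.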
This is definitely something I will add to my builds as a precaution. I haven't experienced any crashes yet (thankfully!), but improving the reset functionality isn't a bad idea.
Which flow control option causes this exactly? I agree that forcing the flow control options on MACs 0 through 4 isn't viable, since those ports need to be able to auto-negotiate things like link speed. But as far as I know, disabling flow control globally shouldn't be an issue, right?
At the moment I haven't had any kernel crashes, and I didn't see worse performance or higher CPU usage either. What exactly were you seeing? And was this still an issue with my second version, which doesn't touch MACs 0 through 4?
So that's the patch I used that you found; it never got applied in mainstream. Which is a shame, as it does alleviate things when the mt7530 goes mental and hard-locks.
That stack trace seems to be about flow offload, which is something completely different from flow control. With flow control, pause frames can be sent at the Ethernet level to tell the other end of the link to pause transmission when it is overwhelmed, so the queues can clear. Without flow control, there is no way to signal the other end of the link, which leads to dropped frames. The layers on top (TCP/IP) should then slow down and retransmit, basically shifting the responsibility for slowing down to the higher layers. Ultimately, having flow control enabled or disabled shouldn't impact CPU usage at all:
With flow control, the other side is asked to slow down.
Without flow control, the other side isn't asked to slow down and the frames are simply dropped instead. These frames never reach the CPU, and therefore have zero impact on CPU usage.
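For reference, the pause frames discussed above are ordinary IEEE 802.3 MAC control frames. A sketch of their on-wire layout (field values per the 802.3 PAUSE definition; this struct is for illustration only, not for injecting frames):

```c
#include <stdint.h>

/* IEEE 802.3 MAC control PAUSE frame, minus the trailing 4-byte FCS */
struct pause_frame {
    uint8_t  dst[6];        /* fixed multicast address 01:80:C2:00:00:01 */
    uint8_t  src[6];        /* sender's MAC address */
    uint16_t ethertype;     /* 0x8808 = MAC control */
    uint16_t opcode;        /* 0x0001 = PAUSE */
    uint16_t pause_quanta;  /* pause time, in units of 512 bit times; 0 = resume */
    uint8_t  padding[42];   /* pad up to the 64-byte minimum frame size */
} __attribute__((packed));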
It's been up for 3 weeks now without reboots or lockups, with the first version of these patches (the one that forces Ethernet to 1 Gbps). It does have a few "transmit queue timed out" errors in the log, but they didn't result in anything breaking. Without these patches, it would lock up or hard-reboot once a week on average. I'm going to let it run for one more week to reach the 1-month mark. I will then flash it with the second version of the patch, which doesn't force the Ethernet speeds, and reevaluate stability. Looking good so far, though!
Another update: four weeks without a single hard lock or reboot now. I am not sure if this fixes it for everyone, but personally I consider it fixed, since this used to happen once a week on average. My steps will be:
This patch only touches mt7621 devices, which all have an mt7530 switch and hence probably all have this bug. The reason not all mt7621 devices require it (for example, I have an mt7621 device at home that doesn't) is that the bug is triggered by other devices sending pause frames in a certain way. So if no incompatible devices are connected, you will never see this bug pop up.
In my opinion, the best way to implement this would be to add support so that ethtool can be used for toggling pause frames on/off: by default this functionality would be in one state, and people could change it if they run into issues. However, this driver is only used in the 19.07 branch, since the master branch is using DSA drivers. Therefore, I think it's unlikely a patch implementing this new feature would be accepted in a stable branch. So for me personally, this small patch is the best solution for getting the 19.07 branch stable.
I will evaluate again once a stable 20.xx branch is out to see what the best course of action is for that particular branch. I haven't tested the master branch myself on the problematic device, since it's used in production at a small business, so I can't really tinker with it.
I have been using that patch (interrupt handling) since May; without it, the transmit timed out error happens every day. But this alone does not cause reboots or hangs.
I have tested your patch on 19.07.4 with more than one port in the same VLAN, and within 12 hours it rebooted.
For me, the only solution to the reboots and hangs is to separate each port into its own VLAN, and if you don't want to see more transmit timeouts, use the interrupt handling patch.
Do you have multiple ports on the same VLAN on the problematic router?
I am currently on 19.07.4 with no problems, but I can't put more than one port on the same VLAN; this is what triggers the reboots.