Can you elaborate on this? I have read quite a few commits and tickets about this same issue (I am experiencing it myself on an 860L), but this is the very first time I have seen an established and well-respected developer say "Unlikely it will ever get fixed."
I had hoped that the ongoing initial ramips 4.19 support plus the 5.x backports might fix this, but if you say that is unlikely, that is not very good news. If it really is unfixable, do you have any recommendations for how to avoid it? For example, at some point there was a recommendation to disable flow control on the connected client devices, which did not work for me, but any other suggestions would be welcome.
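For anyone who wants to try the client-side workaround mentioned above on a Linux client, here is a minimal sketch in Python. It assumes the client has the `ethtool` binary installed and that the NIC driver allows changing pause parameters (not all do). The interface name and the helper names `pause_off_command` and `disable_flow_control` are my own, purely for illustration.

```python
import subprocess
from typing import List


def pause_off_command(iface: str) -> List[str]:
    """Build the ethtool call that turns off pause-frame flow control.

    `ethtool -A` sets pause parameters; autoneg, rx and tx are all
    switched off so the link partner cannot re-enable them.
    """
    return ["ethtool", "-A", iface, "autoneg", "off", "rx", "off", "tx", "off"]


def disable_flow_control(iface: str) -> None:
    """Run the command above. Needs root; raises CalledProcessError
    if the driver rejects the change."""
    subprocess.run(pause_off_command(iface), check=True)
```

Calling `disable_flow_control("eth0")` as root would apply it to `eth0` (an assumed interface name), and `ethtool -a eth0` shows the current pause settings afterwards. The setting does not survive a reboot unless you add it to the client's network configuration.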
I am sorry, but I think I am "barking" up the right tree, as I (and many others) have exactly the same issue as the topic starter. The linked conversation, although I did not start it, may be in the wrong place; nonetheless, the linked part is relevant to this topic, and that is what I would like neheb to comment on.
That conversation is in the wrong place (as they note in the responses there). However, the final comment actually discusses a (possible) bug in the flow control mechanism of the mt7621 gigabit switch, which is resolved by disabling flow control.
FYI, I have been using the patched version for 6 days now, and there are no mtk_soc_eth timeouts or any other kernel errors in the log. The router sits on a 1000/300 Mbit PPPoE FTTH connection, and I have also run quite a few full-speed speed tests with SW and HW offload turned on. Of course this is not a conclusive test yet, but given that the last couple of master builds developed this fault rather quickly (usually within 2-3 days), I would say there is a good chance that this patch actually fixes this long-outstanding issue.
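If you want to check your own log the same way, here is a rough sketch of a filter, assuming you have saved the kernel log to a text file (for example with `dmesg > kernel.log`). The pattern is an assumption on my part: it matches any line that mentions `mtk_soc_eth` followed by "timed out", and the exact wording of the message may differ between kernel versions. The names `TIMEOUT_RE` and `find_timeouts` are just illustrative.

```python
import re

# Assumption: the timeout lines mention the driver name and "timed out".
# Adjust the pattern if your kernel words the message differently.
TIMEOUT_RE = re.compile(r"mtk_soc_eth.*timed out", re.IGNORECASE)


def find_timeouts(log_text):
    """Return the log lines that look like mtk_soc_eth transmit timeouts."""
    return [line for line in log_text.splitlines() if TIMEOUT_RE.search(line)]


def find_timeouts_in_file(path):
    """Same as find_timeouts, but reads the log from a saved file."""
    with open(path, "r", errors="replace") as fh:
        return find_timeouts(fh.read())
```

An empty list from `find_timeouts_in_file("kernel.log")` after several days of uptime is the "clean log" I am describing. Keep in mind that the kernel ring buffer is limited in size, so on a busy router older messages may already have been overwritten.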
On my end, sometimes after an error like this the router's switch was either very slow or not working at all. That is, the WAN interface was reachable externally, but the LAN side was disconnected from the router, although the LAN devices could still talk to each other. This did not happen every time the timeout appeared in the kernel log, though.
I have been seeing this issue with varying frequency since the early days of 4.14, but the one constant is that the router has never gone 10 days without developing at least one of these timeouts. So I will continue the test and report back, but nonetheless it looks promising.
What would be beneficial is a pull request in which the modifications of this patch can be turned on and off without the need to create a custom build. That way more people could test it.
Please, can you tell me where to find the patch for disabling flow control? I only found a very old one for kernel 4.4.
I get those errors every few days. Most of the time nothing happens, but sometimes they leave the router inaccessible and I have to restart it by disconnecting the power. It is currently on a 19.07 snapshot (a bit old, kernel 4.14.152). If this is not fixed, the router will be trash and I will never buy anything with a MediaTek chip again.
@neheb: I am almost at 10 days of uptime, and with this patch applied there are no kernel errors at all. Since this issue was "introduced", I have never before had a clean kernel log for 10 days. So it seems this patch actually fixes, or at least works around, this issue. Adding it to master would allow more people to test it on more platforms.