For those who are interested, the above linked topic by me contains patches to disable flow control on ALL MACs instead of only one, disable it globally as well AND disable pause frame advertisement on the PHYs. I have tested it on my home router and everything is running fine.
However, this router has always been stable, so I am not sure whether it actually fixes the "transmit queue has timed out" issue. I will deploy it this week to the router that has the issue. If anyone else wants to test it, feel free.
On the other hand, can someone tell me what the recommended settings for mt7621 are? In the last couple of weeks I noticed that with default settings (packet steering and soft flow offload ON) I cannot reach more than 350 Mbit/s and only a single core is utilized (with PPPoE).
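One way to confirm the single-core observation is to compare the per-CPU NET_RX softirq counters in /proc/softirqs before and after a speed test. A minimal sketch; the helper name net_rx_per_cpu is made up for illustration, and the file path is a parameter only so the function can be tried against a saved copy:

```shell
# Hypothetical helper: print the NET_RX softirq count for each CPU.
# Reads /proc/softirqs by default; pass another file to inspect a saved copy.
net_rx_per_cpu() {
    awk '$1 == "NET_RX:" {
        for (i = 2; i <= NF; i++) printf "cpu%d %s\n", i - 2, $i
    }' "${1:-/proc/softirqs}"
}
```

If one counter grows much faster than the others during an iperf run, receive processing is landing on a single core.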
It only solves the PPPoE disconnect issue. No HW offload, I tried it. However it would be very nice if someone can clarify what these commits are actually achieving and what are the recommended settings, as it is quite clear that the default settings and only enabling software offload is far from enough.
Testing it now on my R6850 router (mt7621a/t). I was getting lots of modem hangups on PPPoE, roughly every hour or two. So far, 5 hours in, PPPoE is still up with that commit.
The packet steering with software offload became the default in one of the commits months ago. I can't recall, but it said it provides more performance with it enabled along with SW offload. Maybe you can try it now with HW NAT, since along with that latest MT76 patch, HW NAT on my device works (based on the fact that with it enabled, SQM is ignored as intended. That's as far as I can test with my capabilities).
Packet steering and SW offload are enabled, yet without tweaking kernel parameters this default combination tops out at 350 Mbit/s and is limited to a single core. This is clearly not the desired behavior.
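The post does not say which kernel parameters were tweaked. One plausible candidate is the RPS (receive packet steering) CPU mask in sysfs, so here is a rough sketch of setting it by hand. The helper names rps_mask and set_rps are made up, and eth0 and four logical CPUs are assumptions for a dual-core mt7621:

```shell
# Hypothetical helper: hex CPU mask covering the first N CPUs (4 -> f).
rps_mask() {
    printf '%x\n' $(( (1 << $1) - 1 ))
}

# Hypothetical helper: write a mask to every RX queue of a device.
# The sysfs root is a parameter so the function can be dry-run on a scratch dir.
set_rps() {
    dev=$1; mask=$2; root=${3:-/sys/class/net}
    for q in "$root/$dev"/queues/rx-*; do
        if [ -e "$q/rps_cpus" ]; then
            echo "$mask" > "$q/rps_cpus"
        fi
    done
}

# Example (commented out; eth0 is an assumed interface name):
#   set_rps eth0 "$(rps_mask 4)"
```

Whether this helps with PPPoE on this SoC is untested here; it only shows where the knob lives.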
I found an interesting patch set among the preliminary 5.9 kernel support in Felix's repository:
The PPE (packet processing engine) is used to offload NAT/routed or even bridged flows. This patch brings up the PPE and uses it to get a packet hash. It also contains some functionality that will be used to bring up flow offloading later.
I was almost sure this bug was fixed in the latest trunk with the DSA changes and patches, but it occurred again. The router continued to work after the exception.
I found that the DSA-driven mt7530 switch can now set VLAN through UCI. Netifd provided support in the latest submission:
These are my UCI settings:
uci set network.sw=interface
uci set network.sw.type='bridge'
uci add network bridge-vlan
uci set network.@bridge-vlan[0].device='br-sw'
uci set network.@bridge-vlan[0].vlan='1'
uci set network.@bridge-vlan[0].ports='lan1:t lan2 lan3'
uci add network bridge-vlan
uci set network.@bridge-vlan[1].device='br-sw'
uci set network.@bridge-vlan[1].vlan='3'
uci set network.@bridge-vlan[1].ports='lan1:t lan4'
uci add network bridge-vlan
uci set network.@bridge-vlan[2].device='br-sw'
uci set network.@bridge-vlan[2].vlan='4'
uci set network.@bridge-vlan[2].ports='lan1:t'
uci set network.lan.ifname='br-sw.1 bat0'
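Note that uci set only stages changes; they take effect after a commit and a network reload. A minimal sketch of applying them and checking the result, assuming the bridge utility (iproute2) is installed on the router. The function name is made up:

```shell
# Sketch only: run on the router after the uci commands above.
# Hypothetical wrapper; nothing is executed until the function is called.
apply_and_show_vlans() {
    uci commit network &&
    /etc/init.d/network reload &&
    bridge vlan show
}
```

The output of bridge vlan show should list lan1 as a tagged member of VLANs 1, 3 and 4 if the bridge-vlan sections were picked up.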
The VLAN is set correctly and the IPTV multicast data is transmitted stably on the VLAN, but batman-adv causes the kernel to panic:
Finally, with 19.07.4 the router has locked up again. After several consecutive reboots I had to disconnect it from the power supply, but there was no transmit timed out error. The configuration is the same as on 19.07.1, which had been stable for months. This coincided with my swapping the 2 A power supply I installed in May back for the original one.
I don't know if this may have been the cause. It seems strange to me that the original power supply cannot deliver the current the router needs (the router draws 5 W maximum and the power supply is rated at 6 W). I have returned to 19.07.1 to see what happens with the original power supply, which I never used during the stable period.
I have reverted to 19.07.1 with the same configuration that was not giving me problems, and it keeps restarting or hanging within 24 hours. To speed up testing, I run iperf from WAN to LAN, which makes it hang or reboot almost immediately.
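For anyone who wants to reproduce this, a rough sketch of timing the hang: run the iperf load in the background and probe the router once a second until it stops answering. The helper name first_failure and the addresses in the example are placeholders, not taken from the setup described:

```shell
# Hypothetical helper: run a probe command up to $1 times, one second apart,
# and report the first attempt that fails.
first_failure() {
    max=$1; shift
    i=1
    while [ "$i" -le "$max" ]; do
        if ! "$@" >/dev/null 2>&1; then
            echo "probe failed on attempt $i at $(date)"
            return 1
        fi
        i=$((i + 1))
        sleep 1
    done
    echo "probe still succeeding after $max attempts"
}

# Example (commented out; both addresses are placeholders):
#   iperf -c 192.168.1.100 -t 600 -P 4 &
#   first_failure 600 ping -c 1 -W 1 192.168.1.1
```

Logging the time of the first failed probe makes it easier to compare firmware versions or power supplies against each other.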
I have tried different Switch VLAN configurations, disabling all ports except eth0 (in VLAN2) and eth1 (in VLAN1). I have also disabled Software Flow Offloading, stopped OpenVPN, etc. Unsuccessfully. Everything indicates that the power adapter that comes with the Edgerouter X is not enough or is defective.
There are definitely no devices and services on the network that use any type of VPN.
If a VPN were in use, the entries would appear in the log constantly.
This is a one-time error occurring a few minutes before error 320.
It is obviously connected with overloading of some of the SoC's blocks or buses, or with wrong timings/buffers.
PS. I have been playing around with the CPU/OCP/SYS divider via the bootstrap resistors.
880/293/220 MHz is more stable than 880/220/220 MHz.
The bandwidth of the OCP bus matters.
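The two sets of figures are consistent with the OCP bus running at CPU/3 versus CPU/4. That reading is inferred from the numbers only, not from a datasheet, and the helper name bus_mhz is made up. A quick check of the arithmetic:

```shell
# Hypothetical helper: integer bus clock in MHz for a CPU clock and a divider.
bus_mhz() {
    echo $(( $1 / $2 ))
}

bus_mhz 880 3   # prints 293
bus_mhz 880 4   # prints 220
```

Under that assumption the more stable setting gives the OCP bus roughly a third more clock (293 vs 220 MHz).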
The hangs and reboots came back after four stable months, and I can't find a way to make it work properly again.
While it was working correctly, it was using 19.07.1 with GMAC Port 5 FC Off and Interrupt Handling Patch (https://patchwork.ozlabs.org/project/openwrt/patch/20190306040846.21746-1-rosenp@gmail.com/). Each Ethernet port in a different VLAN (or with more than one). In this way it was 100% stable, but I made the changes below and since then I have had problems with hangs and reboots when squeezing the connection:
I updated to 19.07.4.
I added another managed switch in cascade to port eth1 where I connected the machines that were on ports eth2, eth3 and eth4. Thus I eliminated the need to use software-bridge between the different VLANs of each port. I disabled the ports eth2, eth3 and eth4 (they do not belong to any VLAN). Ports eth0 (WAN) and eth1 (LAN) remain unchanged, VLAN2 and VLAN1 respectively.
I reconnected the original power supply; while the router was stable I had another one in use to rule out power problems.
I have tried to revert these changes to the previous working setup, but keeping the switch and not connecting anything to ports eth2, eth3 and eth4 (although they have a VLAN assigned and belong to a software bridge). The same problem continues: if I use the WAN intensively, communication between the Ethernet ports is lost and the router cannot be accessed, or it reboots outright. The syslog is clean, with no transmit timed out errors or kernel crashes.
I cannot find a logical explanation, having tried to roll it back to (almost) how it was before. Actually, the only difference right now is that nothing is connected to ports eth2, eth3 and eth4. I already had a switch connected to eth1; I just added another one in cascade.
I will compile 19.07.1 without Interrupt Handling Patch, which was reverted in branch 19.07 for alleged problems. But I remember that without that patch, the logs were flooded with transmit timed out errors on a daily basis.