Xiaomi 4A GE br-lan issues

Hello everyone. I have a question about an issue I came across with the Xiaomi 4A GE. I think there is a problem with the internal switch or the br-lan interface. I really do not know how to debug or resolve it, so I would like some ideas, tips and directions to find out what is going on.

Here is my situation (at my brother's house): I made a simple 5 GHz mesh with three 4A-GE routers. The master node is connected to the ISP and provides WiFi on 2.4 GHz. A PC and a TV are connected to the other two nodes. I used 22.03.4/5 for a very long time and it ran great. Last week we upgraded to 23.05.2 (there were issues, I wrote about them at: Internet keeps stopping in Xiaomi 4a(gigabit edi) router with openwrt - #5 by ilija.culap). We switched to SNAPSHOT and then to 23.05.3. Every one of them had issues where one of the nodes could not connect to the internet.

Issue:

  • Occurs after 12 - 24 hours
  • Mesh link is okay
  • No OOM
  • No errors in logread or dmesg
  • An already connected Ethernet client (with an IP address) can still reach the router (web and SSH)
  • Cannot ping the gateway
  • No new clients can connect or existing reconnect, because DHCP is on master node
  • ifconfig br-lan down ; ifconfig br-lan up does not help
  • Resetting wifi does not help
  • Reboot helps

Here is some information about the network:

  • 3 nodes, 4A-GE, 5GHz
  • Master node with DHCP and firewall. This node causes no problems
  • On nodes 2 and 3: firewall, dnsmasq and odhcpd are disabled
  • On nodes 2 and 3: wan port is added to br-lan
  • On all nodes IPv6 is disabled for br-lan

Some files from node 2 (same as 3):

Info
{
	"kernel": "5.15.150",
	"hostname": "Culap-X2",
	"system": "MediaTek MT7621 ver:1 eco:3",
	"model": "Xiaomi Mi Router 4A Gigabit Edition",
	"board_name": "xiaomi,mi-router-4a-gigabit",
	"rootfs_type": "squashfs",
	"release": {
		"distribution": "OpenWrt",
		"version": "23.05.3",
		"revision": "r23809-234f1a2efa",
		"target": "ramips/mt7621",
		"description": "OpenWrt 23.05.3 r23809-234f1a2efa"
	}
}
Network
config interface 'loopback'
	option device 'lo'
	option proto 'static'
	option ipaddr '127.0.0.1'
	option netmask '255.0.0.0'

config globals 'globals'
	option ula_prefix 'xxxx:xxxx:xxxx::/48'
	option packet_steering '1'

config device
	option name 'br-lan'
	option type 'bridge'
	option ipv6 '0'
	list ports 'lan1'
	list ports 'lan2'
	list ports 'wan'

config interface 'lan'
	option device 'br-lan'
	option proto 'dhcp'
Wireless
config wifi-device 'radio0'
	option type 'mac80211'
	option path '1e140000.pcie/pci0000:00/0000:00:01.0/0000:02:00.0'
	option channel '1'
	option band '2g'
	option htmode 'HT20'
	option disabled '1'

config wifi-device 'radio1'
	option type 'mac80211'
	option path '1e140000.pcie/pci0000:00/0000:00:00.0/0000:01:00.0'
	option channel '40'
	option band '5g'
	option htmode 'VHT80'
	option cell_density '0'

config wifi-iface 'wifinet0'
	option device 'radio1'
	option mode 'mesh'
	option encryption 'sae'
	option mesh_id 'MX-32468034'
	option mesh_fwding '1'
	option mesh_rssi_threshold '-80'
	option mesh_gate_announcements '1'
	option mesh_hwmp_rootmode '3'
	option key 'xxxxxxxxxxxxx'
	option network 'lan'

What I tried:

  • I have one spare 4A-GE that I added to my own mesh (2 x Xiaomi AX3200) so I can test and investigate. It crashed with 23.05.2 (at night), but I could not test anything because I could not add any new client (no DHCP). This router is now connected to my PC
  • I thought there was an issue with DHCP renewal (12h lease time), so I reduced the lease time to 2m on my mesh, but this did not trigger the issue. I ran it for 1 hour
  • I tried downloading and uploading some files through the router (around 120GB across all 3 ports), but this did not trigger it either.

My next steps are to remove the wan port from br-lan, activate "Bring up empty bridge" and let all services run on the router. The problem is that I have to wait 24h to see if it works, so I hope someone can tell me what else I could do and how to trigger this behavior. Here is something that seems very similar: Br-lan client can not connect problem (It seems that wireless devices (wifi-iface) disconnect themselves from br-lan.)

Thank you all in advance
Best regards

Any chance of an opkg list of packages installed?

For best wifi performance, always set your country in the wifi config. Set the same country on both radios even if you are only using one.

"Mesh is OK" and "can't ping the gateway" (presumably via mesh) are contradictory statements. You may want to set static IPs on the LAN of all nodes to simplify ping testing.

If you think the problem is related to DHCP renewal time, you could speed up troubleshooting by using short lease times. Be absolutely certain that only one DHCP server is active in the network.

Hi. Thank you for your reply. I created images for those routers with Imagebuilder. Here is the command:

make image PROFILE=xiaomi_mi-router-4a-gigabit PACKAGES='-uboot-envtools base-files busybox ca-bundle dnsmasq dropbear firewall4 fstools kmod-gpio-button-hotplug kmod-leds-gpio kmod-mt7603 kmod-mt76x2 kmod-nft-offload libc libgcc libustream-mbedtls logd luci mtd netifd nftables odhcp6c odhcpd-ipv6only procd procd-seccomp procd-ujail uboot-envtools uci uclient-fetch urandom-seed urngd uhttpd uhttpd-mod-ubus libiwinfo-lua luci-base luci-app-firewall luci-mod-admin-full luci-theme-bootstrap luci-app-opkg -ppp -ppp-mod-pppoe -wpad-basic-wolfssl -wpad-basic-mbedtls wpad-mesh-mbedtls opkg iperf3 htop' 

For best wifi performance, always set your country in the wifi config. Set the same country on both radios even if you are only using one.

Thank you for your reply. I forgot them indeed. Thank you.

"Mesh is OK" and "can't ping the gateway" (presumably via mesh) are contradictory statements. You may want to set static IPs on the LAN of all nodes to simplify ping testing.

By "Mesh is OK" I meant the Layer-2 infrastructure. I tested it with iw dev <mesh-iface> station dump. By "can't ping the gateway" I meant an IP ping (Layer-3). There are static leases on the master node and every node gets its IP.

If you think the problem is related to DHCP renewal time, you could speed up troubleshooting by using short lease times. Be absolutely certain that only one DHCP server is active in the network.

I did that. I changed it to 2 min and let it run for more than 1h. That did not trigger my issue.

So I think I found a solution to my problem. The trouble is that I do not know why it is happening. After checking "Bring up empty bridge" in the br-lan settings, the problem goes away (tested for about 30h).

My theory: Somehow the wifi-iface disconnects itself from the bridge -> the bridge goes down -> some device connects over Ethernet -> the bridge comes up, but without the wifi-iface (mesh interface). Without the mesh wifi-iface there is no DHCP and no connection to the rest of the mesh. Devices that already have their IP leases can still access the router (the LAN also has an IP address).
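
This theory can be checked directly the next time a node stops forwarding. Below is a minimal sketch, assuming the bridge is br-lan and the mesh interface is called phy1-mesh0 (the name may differ per node). The helper name in_bridge and its optional third argument are made up for illustration; the only thing relied on is the standard Linux sysfs layout, where each bridge port appears under /sys/class/net/<bridge>/brif/.

```shell
#!/bin/sh
# in_bridge BRIDGE IFACE [SYSFS_ROOT]
# Succeeds (exit 0) if IFACE is currently a port of BRIDGE.
# SYSFS_ROOT is a hypothetical override, only useful for testing.
in_bridge() {
	bridge="$1"
	iface="$2"
	sysroot="${3:-/sys/class/net}"
	[ -e "$sysroot/$bridge/brif/$iface" ]
}

# Example check; both interface names are assumptions.
if [ -d /sys/class/net/br-lan ]; then
	if in_bridge br-lan phy1-mesh0; then
		echo "phy1-mesh0 is a port of br-lan"
	else
		echo "phy1-mesh0 is NOT in br-lan"
	fi
fi
```

If the mesh interface is still listed as a bridge port while the node is unreachable, the bridge-membership theory is unlikely to be the cause.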

I hope that this information will help someone. Best regards

So sadly the problem is still there. I can now tell that "Bring up empty bridge" did not solve the problem. Removing wan from the bridge did not solve it either. What I also did, and it did not help, was add the country code to both radios (which improved mesh performance, thx) and set static addresses for both nodes.

Then I ran a test: I changed the channel to 64 and disabled "Disassociate on low ack" for both mesh wifi-ifaces. There were no problems for about 25-30h. I have now switched back to channel 40 and have to wait.

With my own 4A-GE in my mesh I had no luck crashing it. I tried so many things, but no luck.

On the internet I found some bugs related to the 4A-GE, MT7621, 3Gv1 etc., but I have to confirm it is the same error (they are all old). I can post some links soon.


Is the WAN connection PPPoE or DHCP?
Have you got enough space for mesh11sd as @bluewavenet has developed a new autoconfiguration that this sounds suited for since version 3.1
https://openwrt.org/docs/guide-user/network/wifi/mesh/mesh11sd

When a mesh interface comes up, it connects to a peer node then multicasts its presence into the mesh. All will be well for a few hours, depending on traffic, then the node will be dropped as it has not repeated its multicast notification.

For this notification to be automated you must turn on the mesh's built in mac-routing protocol (HWMP).

Unfortunately this cannot be done from the wireless config as it must be done after the mesh interface is established.

This is where the mesh11sd package comes in. It will monitor the mesh and turn on HWMP (along with a few other essentials), then continues to monitor the mesh status.

The version of mesh11sd in OpenWrt 23.05.x has a few issues, and will not work for you as you actually have the mesh configured in the wireless config.

A new version is coming soon, so I recommend you wait until then - hopefully just a few days.

Good to know this!

I have a MT7628 router running OpenWrt v23.5.2 in non-mesh mode, connected to a Linksys mesh WiFi network. It works for 1~7 days, then randomly disconnects and cannot reconnect.
Then we reduced the Linksys mesh devices from 3 sets to 2 sets. All OK for now.

Is the WAN connection PPPoE or DHCP?

It is DHCP

Have you got enough space for mesh11sd as @bluewavenet has developed a new autoconfiguration that this sounds suited for since version 3.1

I have enough space, but I do not want to use mesh11sd. It is very complicated.

Thank you for your reply @bluewavenet. I think that was indeed my problem. Yesterday I noticed some mpaths on my main mesh nodes. Then I looked online and found:

According to GitHub it is not a bug but a symptom of mesh parameters not being set (if I understood correctly). So I tried removing all mpaths manually, and the internet connection from the mesh nodes recovered immediately. The only thing I am not sure about is which params are important. I have some in the wireless config (I know they should not work, but they were there according to iw dev ... mesh_param dump).

As I said, I do not want to use mesh11sd. I created a script that sets all mesh_params and removes those mpaths. The script runs every minute. I also removed every mesh_param from the wireless config file. So now I have to sit back and watch whether everything works as intended.

If anybody is interested in my script, I can post it here.
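
For readers who want to try the same idea, here is a rough sketch of the mpath cleanup part. It is not the script mentioned above, only an illustration. It assumes that a "bogus" mpath is one whose NEXT HOP column is 00:00:00:00:00:00; the function names are made up. The iw subcommands used (mpath dump and mpath del) are standard.

```shell
#!/bin/sh
# Sketch only, not the original poster's script.

# bogus_mpaths: read "iw dev <iface> mpath dump" output on stdin and print
# the DEST ADDR of every entry whose NEXT HOP is all zeros (assumed to be
# the unresolved entries).
bogus_mpaths() {
	awk '$2 == "00:00:00:00:00:00" { print $1 }'
}

# remove_bogus_mpaths IFACE: delete each such entry from the path table.
remove_bogus_mpaths() {
	iface="$1"
	iw dev "$iface" mpath dump | bogus_mpaths | while read -r mac; do
		iw dev "$iface" mpath del "$mac"
	done
}
```

Calling remove_bogus_mpaths with the mesh interface name from a cron job would approximate the "runs every minute" behaviour, but as noted further down the thread, the entries may come back if the mesh parameters themselves are not right.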

Good to know this!

I have a MT7628 router running OpenWrt v23.5.2 in non-mesh mode, connected to a Linksys mesh WiFi network. It works for 1~7 days, then randomly disconnects and cannot reconnect.
Then we reduced the Linksys mesh devices from 3 sets to 2 sets. All OK for now.

That is very interesting. My mesh, consisting of 2 nodes, has no problems at all, but those bogus mpaths are there sometimes. My brother's mesh consists of 3 nodes. Maybe you have the same error as we do (I deployed my script onto my mesh too). Maybe you could post the output of:

iw dev <mesh-iface> mpath dump

And maybe even with three nodes, plus the same output from the main node.

Here is the output from my brother's main node:

root@Culap-X1:~# iw dev phy1-mesh0 mpath dump
DEST ADDR         NEXT HOP          IFACE	SN	METRIC	QLEN	EXPTIME	DTIM	DRET	FLAGS	HOP_COUNT	PATH_CHANGE
1c:cc:d6:db:XX:XX 00:00:00:00:00:00 phy1-mesh0	0	0	0	0	1600	4	0x0	0	0
40:d3:ae:19:XX:XX 00:00:00:00:00:00 phy1-mesh0	0	0	0	0	1600	4	0x0	0	0
64:64:4a:3b:XX:XX 64:64:4a:3b:XX:XX phy1-mesh0	37204	118	0	5700	100	0	0x15	1	201
64:64:4a:3c:XX:XX 64:64:4a:3c:XX:XX phy1-mesh0	37861	122	0	5700	100	0	0x15	1	501

The Linksys mesh network and router are not near my home, so I cannot touch them.

We noticed that the 7628 router could receive 2 nodes at the same time when 3 mesh devices were in use, with one node's signal being stronger. With 2 mesh devices in use, the router receives only one signal.
We do not know the real reason why.

Indeed, it is because you have not configured the mesh correctly.

But then they will come back eventually.

Sure, just like mesh11sd does. I guess you know what to set for the parameters?

It is designed to install and forget.
It can be complicated if you want to do all sorts of "complicated" stuff of your own that is not mesh related .
This is because there is a bug that does not turn off all the autoconfiguration so you can get clashes, but that is fixed in the upcoming release.

Those two meshnodes have 201 and 501 path changes respectively. This means your mesh is very unstable.
I'll let you figure it out :wink:

But then they will come back eventually.

My script checks that and removes them accordingly.

Sure, just like mesh11sd does. I guess you know what to set for the parameters?

Exactly, mesh11sd-lite :slight_smile: . I am not sure. Here is what I am setting on all nodes:

setParams () {
	# Peering and forwarding
	iw dev "$MESH_IFACE" set mesh_param mesh_rssi_threshold="-70"
	iw dev "$MESH_IFACE" set mesh_param mesh_hwmp_rootmode="3"
	iw dev "$MESH_IFACE" set mesh_param mesh_max_peer_links="10"
	iw dev "$MESH_IFACE" set mesh_param mesh_fwding="1"

	# Gate / upstream announcements
	iw dev "$MESH_IFACE" set mesh_param mesh_connected_to_gate="0"
	iw dev "$MESH_IFACE" set mesh_param mesh_connected_to_as="1"
	iw dev "$MESH_IFACE" set mesh_param mesh_gate_announcements="1"
}

For rootmode I could not find any explanation of which mode is which. If I understood correctly, connected_to_as is for a node that is connected to the upstream router, and connected_to_gate is for when the node is also an AP. But I am not sure.

This is because there is a bug that does not turn off all the autoconfiguration so you can get clashes, but that is fixed in the upcoming release.

Exactly. I tried it yesterday. I tried to build my image with Imagebuilder, but the node configured itself and changed everything (for example my hostname). I hope that I can try it out in the future and hopefully figure it out.

Those two meshnodes have 201 and 501 path changes respectively. This means your mesh is very unstable.
I'll let you figure it out

I think that is because of those other mpaths. Here is how it looks now (with my script enabled):

root@Culap-X1:~# iw dev phy1-mesh0 mpath dump
DEST ADDR         NEXT HOP          IFACE	SN	METRIC	QLEN	EXPTIME	DTIM	DRET	FLAGS	HOP_COUNT	PATH_CHANGE
64:64:4a:3b:XX:XX 64:64:4a:3b:XX:XX phy1-mesh0	6238	116	0	4910	100	0	0x15	1	63
64:64:4a:3c:XX:XX 64:64:4a:3c:XX:XX phy1-mesh0	6581	122	0	4910	100	0	0x15	1	25

Thank you very much for your help @bluewavenet

These were 63 and 25 at the moment you did the mpath dump.
What was it a few minutes later?

Path change does not increment if the mesh is stable. The default is 1, the path change recorded when the mesh establishes. 63 and 25 are still very bad; I wonder what it is now? You could cheat by running wifi and resetting everything.
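
One way to answer the "what was it a few minutes later?" question is to sample the path table twice and compare. The following is a minimal sketch, assuming PATH_CHANGE is the last column of the mpath dump output (as in the dumps posted in this thread). The function names, the interface name phy1-mesh0 and the /tmp file names are hypothetical.

```shell
#!/bin/sh
# path_changes: read "iw dev <iface> mpath dump" output on stdin and print
# "DEST PATH_CHANGE" per entry. The header line is skipped because it does
# not start with a MAC-like octet.
path_changes() {
	awk '$1 ~ /^[0-9a-fA-F][0-9a-fA-F]:/ { print $1, $NF }'
}

# path_change_delta BEFORE_FILE AFTER_FILE: for each destination present in
# both samples, print how much PATH_CHANGE grew between them.
path_change_delta() {
	awk 'NR == FNR { before[$1] = $2; next }
	     ($1 in before) { print $1, $2 - before[$1] }' "$1" "$2"
}

# Usage on a node (interface name is an assumption):
#   iw dev phy1-mesh0 mpath dump | path_changes > /tmp/pc.before
#   sleep 300
#   iw dev phy1-mesh0 mpath dump | path_changes > /tmp/pc.after
#   path_change_delta /tmp/pc.before /tmp/pc.after
```

A delta of 0 for every destination over several minutes would match the "does not increment if the mesh is stable" description.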

What was it a few minutes later?

They were increasing. But I found out how to set the params properly (which param is for which node) and now it looks like this:

DEST ADDR         NEXT HOP          IFACE	SN	METRIC	QLEN	EXPTIME	DTIM	DRET	FLAGS	HOP_COUNT	PATH_CHANGE
64:64:4a:3c:44:5f 64:64:4a:3c:44:5f phy1-mesh0	1757	117	0	4940	100	0	0x15	1	27
64:64:4a:3b:f1:73 64:64:4a:3b:f1:73 phy1-mesh0	1689	116	0	4940	100	0	0x15	1	3

It does not change (I have watched for about 1 hour now). I would say my mesh is now much more stable.

Can somebody tell me whether skip_inactivity_poll and disassoc_low_ack play a role for mesh?

Hello everyone. After some days of testing and rewriting my script I can now confirm that my brother's mesh is much more stable. I was monitoring it yesterday with a simple ping and curl tester (it pings and curls every node every minute). So far there have been no issues whatsoever: after setting all mesh parameters, the mesh nodes no longer disconnect. In addition, I enabled STP on all bridges, and today I changed the RSSI threshold from -80 to -55 and added connected_to_gate to all nodes, even though nodes 2 and 3 do not have WiFi, only some Ethernet clients. But I also have some concerns:

  • PATH CHANGE on all nodes increases constantly
  • After some time of not using the mesh there are large delays for SSH and ping (for example). I have to check whether the nodes still work properly if I do nothing on the mesh for a day or two.
  • The RSSI threshold is set only in the mesh parameters, not in the wireless config. In mesh11sd it is set in both places. I am not sure if this is important.

As I said, there is another problem. The PATH CHANGE value in mpath dump increases constantly. If I understood correctly, it should not change from the value 1; a mesh node should not keep finding better paths. I do not know how to make this more stable, or whether this value plays a role in day-to-day usage.

Second problem: I tried adding one more router (only for testing purposes) to my mesh. It is an old ASUS AC51U.

ISP -> AX3200 ))) ((( AX3200 ))) ((( AC51U
                         |
                       My PC

I installed 23.05.3 on it and added my script so that all mesh parameters are set dynamically. After about 20 hours the same problem my brother had occurred for me: I could not access this 3rd router from my computer. So I tried accessing my main router (mesh node 1) over SSH, and from there node 3, and that worked. This is very strange. PATH CHANGE is also a very large number on all nodes.

Here are parameters:

root@ASUS:~# iw dev phy0-mesh0 mesh_param dump
mesh_retry_timeout = 100 milliseconds
mesh_confirm_timeout = 100 milliseconds
mesh_holding_timeout = 100 milliseconds
mesh_max_peer_links = 10
mesh_max_retries = 3
mesh_ttl = 31
mesh_element_ttl = 31
mesh_auto_open_plinks = 0
mesh_hwmp_max_preq_retries = 4
mesh_path_refresh_time = 1000 milliseconds
mesh_min_discovery_timeout = 100 milliseconds
mesh_hwmp_active_path_timeout = 5000 TUs
mesh_hwmp_preq_min_interval = 10 TUs
mesh_hwmp_net_diameter_traversal_time = 50 TUs
mesh_hwmp_rootmode = 2
mesh_hwmp_rann_interval = 5000 TUs
mesh_gate_announcements = 1
mesh_fwding = 1
mesh_sync_offset_max_neighor = 50
mesh_rssi_threshold = -80 dBm
mesh_hwmp_active_path_to_root_timeout = 6000 TUs
mesh_hwmp_root_interval = 5000 TUs
mesh_hwmp_confirmation_interval = 2000 TUs
mesh_power_mode = active
mesh_awake_window = 10 TUs
mesh_plink_timeout = 0 seconds
mesh_connected_to_gate = 0
mesh_nolearn = 0
mesh_connected_to_as = 0
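
When a script sets mesh parameters at runtime, it can help to read individual values back from a dump like the one above and compare them with what was intended. A small sketch, assuming the "name = value [unit]" line format shown in the dump; the function name mesh_param_value is made up.

```shell
#!/bin/sh
# mesh_param_value NAME: read "iw dev <iface> mesh_param dump" output on
# stdin and print the value of NAME without its unit.
mesh_param_value() {
	awk -v name="$1" '$1 == name && $2 == "=" { print $3 }'
}

# Usage on a node (interface name is an assumption):
#   iw dev phy0-mesh0 mesh_param dump | mesh_param_value mesh_rssi_threshold
```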

And here mpath dump from main router

root@Imagine-1:~# iw dev wl1-mesh0 mpath dump
DEST ADDR         NEXT HOP          IFACE	SN	METRIC	QLEN	EXPTIME	DTIM	DRET	FLAGS	HOP_COUNT	PATH_CHANGE
70:4d:7b:92:XX:XX 5e:02:14:b5:XX:XX wl1-mesh0	29130	31	0	5190	100	0	0x15	2	13593
5e:02:14:b5:XX:XX 5e:02:14:b5:XX:XX wl1-mesh0	70582	9	0	5990	100	0	0x15	1	81

After I remove all mpaths or restart wifi on all routers, it works again as it should. Is it possible that there is some bug in the mt76 driver or something like that?

Here is the script I was using:

This week I am going to post more information about mesh, mesh_params etc. that I found online. Maybe someone will find it useful.