This is unrelated to your issues, but I notice that you're running 22.03.5 on two of your routers. Considering that 22.03.x is going EOL this month, you really should upgrade to 23.05.3+ soon, that might also help with your meshing (getting the mesh relevant package versions closer to each other).
that is why I get confused: the documentation talks about portal, Peer, Gateway; I read I should connect 'normal wifi user devices' to a mesh peer. How is this possible if there are no roles?
but maybe the documentation means the DHCP server for the routers management part, not the DHCP server for the actual wifi clients I add on the routers in the mesh. Of course I don't care what are the routers IP addresses (if I can find them back...).
that is my fall back position to understand it ;-). And my expectation is that I add on top of the mesh the wifi interfaces with the SSIDs I want to use on each router.
this is another confusing part for me... The basic idea is to install the daemon, and all devices running the daemon will extend the mesh.
Only if you want some security do you need to configure each device with your mesh name and key. I don't know many use cases where this no-security option is valid. I would always have written the documentation with security in mind, and only refer to the full auto version as a 'test case' or with a big warning that it is fully insecure. I don't want anyone who brings his router to connect to my network without any security. I don't want anyone to plug their ethernet cable into my ethernet ports without me supervising it.
There is: 23.05.x was released in mid October 2023 (with 23.05.3 as the latest maintenance release in late March). All of your devices are supported by it (gl-mt6000 support was backported to 23.05.3, but you can also stick to snapshots there), so there is no reason to stay on older releases on any of them (but many reasons to stay updated, to profit from mt76 improvements and more).
I fully agree: it is just that I took the 'default' when I installed my devices a month ago and then somehow it resulted in a version that is old. I guess I need to read more on this topic.
All mesh nodes have the same role in the mesh ie they are all peers and provide links into the layer 2 mesh backhaul (like virtual ethernet ports on the virtual switch).
A meshnode can also have an upstream connection to another network, for example an Internet feed. If it does it will advertise the fact to its peers that it has a portal to somewhere else. This type of node is a Portal node or mesh_portal.
A meshnode can also have a downstream connection to another network, for example an access point providing wifi connectivity for end users. This type of node is a Gateway node or mesh_gate.
A meshnode can be both a Portal and a Gateway, eg if that node has both upstream and downstream links.
A meshnode with neither upstream nor downstream, is just a peer in the mesh, providing forwarding services to its peers.
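To make the role naming above concrete, here is a minimal sketch of it in Python. The function name `mesh_roles` and the role strings are made up for illustration; they are not part of mesh11sd or any OpenWrt API. It only restates the rule that every node is a peer, and that "portal" and "gateway" simply describe which extra links a node happens to have.

```python
from __future__ import annotations


def mesh_roles(has_upstream: bool, has_downstream: bool) -> set[str]:
    """Return the role names that apply to a mesh node (illustrative only)."""
    # Every node is a peer: it forwards traffic on the layer 2 backhaul.
    roles = {"peer"}
    if has_upstream:
        # An upstream link (for example an Internet feed) makes it a portal.
        roles.add("portal")
    if has_downstream:
        # A downstream link (for example an access point) makes it a gateway.
        roles.add("gateway")
    return roles


# A node with an Internet feed and its own access point is all three at once.
print(sorted(mesh_roles(has_upstream=True, has_downstream=True)))
```

So there are no roles to assign by hand: a node's description follows from how it is connected.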
No, it provides ipv4 dhcp to all devices downstream in the mesh, including mesh peers and user devices connected to mesh gateway access points.
Mesh11sd provides its own tools for remote access to other nodes for admin purposes, using ipv6.
Alternatively you can look at the /tmp/dhcp.leases file on the portal node to get peer node ipv4 addresses.
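If you want to pull the peer addresses out of that file with a script, something like the sketch below should do. It assumes the usual dnsmasq lease layout (expiry time, MAC, IPv4 address, hostname, client id, separated by spaces); the sample line and its hostname are invented for illustration, and `parse_leases` is a hypothetical helper name.

```python
from __future__ import annotations

from typing import NamedTuple


class Lease(NamedTuple):
    expires: int   # lease expiry as a Unix timestamp
    mac: str
    ipv4: str
    hostname: str  # "*" when the client sent no hostname


def parse_leases(text: str) -> list[Lease]:
    """Parse dnsmasq-style lease lines, keeping only the IPv4 entries."""
    leases = []
    for line in text.splitlines():
        fields = line.split()
        # Skip blank or short lines and anything not starting with a timestamp.
        if len(fields) < 4 or not fields[0].isdigit():
            continue
        # Skip entries whose address field is not IPv4.
        if ":" in fields[2]:
            continue
        leases.append(Lease(int(fields[0]), fields[1], fields[2], fields[3]))
    return leases


# Invented sample line; on the portal node you would pass the contents
# of /tmp/dhcp.leases instead.
sample = "1714000000 aa:bb:cc:dd:ee:01 192.168.1.101 meshnode-ee01 01:aa:bb:cc:dd:ee:01\n"
for lease in parse_leases(sample):
    print(lease.hostname, lease.ipv4, lease.mac)
```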
The use case for using the default hashed id and key is a public network where someone adding their own mesh node (eg a travel router configured as a meshnode) poses no security issues at all.
For secure home/office use case then yes you would configure your own secret mesh id and mesh key strings to be hashed by mesh11sd. This is most simply done by adding to the config in the Firmware Selector to produce a common image (or series of images if meshnode hardware varies).
But it states the daemon can be installed to set some of the parameters. Is doing the setup as described on that page referenced actually compatible with the daemon. I would expect the daemon not to configure any mesh network if you manually setup a mesh wifi.
You mentioned the root cause of my problem is lacking activation of HWMP; I do find several parameters about HWMP; but not one to 'activate' it. Is the daemon by default activating it?
Yes that is true, including enabling HWMP. The problem is the current version of mesh11sd has a bug where "manual" configuration mode does not turn off all the auto config, resulting in failure of the mesh with some drivers. This is why I said wait until the new version arrives....
It is one cause of the problem.
For a portal node mesh_hwmp_rootmode should be set to 4
For a peer node mesh_hwmp_rootmode should be set to 2
By default it is set to 0
This can be set by an iw command, but it does not survive restart of the wireless, let alone a reboot.
Mesh11sd monitors the mesh and sets this as required, along with numerous other parameters.
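For reference, a manual one-off setting could look roughly like the sketch below. The interface name `mesh0` is an assumption (check `iw dev` for the real name on your router), and `rootmode_command` / `apply_rootmode` are hypothetical helper names. The values 4 and 2 are the ones given above. As said, anything set this way is lost when the wireless restarts.

```python
import subprocess

# Root mode values from the post above: 4 for a portal node, 2 for a peer.
ROOTMODE_BY_ROLE = {"portal": 4, "peer": 2}


def rootmode_command(ifname: str, role: str) -> list:
    """Build the iw command that sets mesh_hwmp_rootmode for the given role."""
    value = ROOTMODE_BY_ROLE[role]
    return ["iw", "dev", ifname, "set", "mesh_param",
            f"mesh_hwmp_rootmode={value}"]


def apply_rootmode(ifname: str, role: str) -> None:
    """Run the command; raises CalledProcessError if iw reports a failure."""
    subprocess.run(rootmode_command(ifname, role), check=True)


# Only print the command here; call apply_rootmode("mesh0", "portal")
# on the router itself to actually change the setting.
print(" ".join(rootmode_command("mesh0", "portal")))
```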
So I upgraded my routers to OpenWrt 23.05.3.
First router 2 => I got the upgrade firmware and upgraded, and the router didn't register on the mesh: logical, since I had to replace the wpad version with a wpad-mesh supporting version. I connected the router to the ethernet and fixed it... Rookie mistake...
Router 3: created a firmware with the correct wpad-mesh package, and the upgrade worked in one go.
Router 1: created a firmware with the correct wpad-mesh package, installed, all working fine... except the router is not visible in the layer 3... It works perfectly fine, has the mesh working, other routers register to the mesh, its wifi network allows clients. But it became unmanageable... very funny situation... Somehow it must have forgotten it should do a DHCP request...
and fixed: the router popped up with another Mac Address and so another IP address... funny ;-).
Ok; but I can set this and this should make my router stable. Or am I missing something?
My routers normally have just like 2 reboots a year... (mostly because of power outage ;-)).
I can create a script that sets these... (versus a daemon that is doing things I don't understand ;-)).
Yes you can write a script that monitors the mesh interface, then when it has established it will set the necessary mesh parameters. Then this script must continue to monitor the mesh and the set parameters, resetting or adjusting them as required. For example if you make a change to an AP encryption key and restart the wireless to activate it, this will also reset the mesh. In this case your script that is still running will detect this and set those parameters again.
Oh yes, wait a minute, this means your script must run as a service daemon to do this......
that was the part of my reaction you should have ignored ;-).
My point was more that I was wondering if I could fix this issue manually and have a stable router running for weeks instead of a few days.
I'll check if I can set any of these parameters by hand.
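To show what such a script boils down to, here is a minimal sketch. It is not how mesh11sd does it, and it only handles one parameter. The interface name `mesh0` and the helper names are assumptions, as is the idea that `iw dev <ifname> get mesh_param mesh_hwmp_rootmode` prints the current value as the first token of its output.

```python
import subprocess
import time


def iw(*args):
    """Run iw with the given arguments; return (exit code, stdout)."""
    result = subprocess.run(["iw", *args], capture_output=True, text=True)
    return result.returncode, result.stdout


def enforce_rootmode(ifname, wanted, run=iw):
    """Check mesh_hwmp_rootmode once and set it again if it has drifted."""
    code, out = run("dev", ifname, "get", "mesh_param", "mesh_hwmp_rootmode")
    tokens = out.split()
    if code != 0 or not tokens:
        return "down"    # mesh interface not up (yet), try again later
    if tokens[0] == str(wanted):
        return "ok"      # nothing to do
    # The value was reset, for example by a wireless restart: set it again.
    run("dev", ifname, "set", "mesh_param", f"mesh_hwmp_rootmode={wanted}")
    return "reset"


def watch(ifname, wanted, interval=10.0):
    """Keep checking forever, which is exactly what a service daemon does."""
    while True:
        enforce_rootmode(ifname, wanted)
        time.sleep(interval)
```

On a portal node you would start it with `watch("mesh0", 4)` from an init script. The `run` argument is only there so the check can be tried with a fake `iw` on a machine without a mesh interface.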
Side question: I hit the issue on the Portal mesh node: but actually it should also happen on the peer nodes I understand. Can it be explained why I see it on the portal one 'first'. I cannot remember another one with the issue.
Can you explain exactly what you mean? You mean the issue of the mesh stopping working?
If so, it is the remote nodes losing their layer 2 paths to the portal. Restarting the portal will reconnect to the nodes and multicast its presence, but then the "timing out" problem starts all over again.
I assume the issue is on the portal node, in the sense that all nodes in the mesh stop working and restarting the portal fixes the issue; restarting a remote node does nothing to fix it.
They don't stop working, they just stop passing ipv4 traffic because the layer 2 mesh backhaul times out so in turn arp stops working.
If, by chance, due to the locations of your mesh nodes, your backhaul is a star layout with the portal in the middle, then just restarting a peer will have no effect on kicking things back into life.
Nothing has crashed. Using the analogy of the switch, a node is like a port on the switch. If you unplug the ethernet from the port of a real switch, it does not mean the switch has stopped working.....