BUG Report: 802.11s Mesh (V19.07.4)

I am not sure I understand...

Let's recap. The WiFi client connects to the wlan interface on OpenWrt, which must be bridged to the mesh interface so that packets propagate from the WiFi client to the other mesh points. One of these mesh points is connected by Ethernet cable to the main router of the house/office/whatever, which runs a DHCP server and provides the internet connection (default route). That means the Ethernet port connecting this mesh point to the main router must also be in the same bridge as the mesh and wlan interfaces; otherwise layer 2 packets will not be propagated.
Is my understanding of the use case explained here flawed? What differs?
How can layer2 packets be propagated to the network without bridging the different interfaces involved?

Way back in April 2019 I posted on this thread the reasons this can happen and what to do to prevent it. It is not a bug. It is a configuration issue.
@protectivedad has posted what looks like a good implementation and others have indicated what might be in their own.

Yes, all nodes must be running on the same interconnected and fully bridged layer 2 network. This network is built between nodes by the 802.11s protocol.

A typical mesh node will have interfaces wlan0 (with an SSID for clients to connect to), wlan0-1 (the mesh interface) and one or more Ethernet interfaces (which can be used by wired clients, as the connection to the mesh's Internet feed, or as wired links between mesh nodes). Within each node, all these interfaces should be bridged (e.g. wlan0, wlan0-1, eth0 and eth1 are all members of br-lan).
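To check the bridging described above on a running node, something like the following could be used. It is only a sketch: it relies on Linux listing the ports of a bridge under /sys/class/net/<bridge>/brif/, and the function name missing_bridge_members is made up for this example.

```shell
#!/bin/sh
# Sketch: report which of the given interfaces are NOT ports of a bridge.
# Linux exposes the ports of a bridge as entries in /sys/class/net/<bridge>/brif/.

# missing_bridge_members BRIF_DIR IFNAME...
# Prints every IFNAME that has no entry in BRIF_DIR, one per line.
missing_bridge_members() {
    local brif_dir="$1"
    local ifname
    shift
    for ifname in "$@"; do
        # -e follows the symlink, -L also accepts a dangling one
        [ -e "$brif_dir/$ifname" ] || [ -L "$brif_dir/$ifname" ] || echo "$ifname"
    done
}
```

With the interface names from the example above, the call would be `missing_bridge_members /sys/class/net/br-lan/brif wlan0 wlan0-1 eth0 eth1`. No output means every listed interface is in the bridge.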

I am the OP.

If you reproduce my setup - you will see it is a bug.

Indeed. So configure it for the edges.
It is not a bug as such. It is configuration that needs to be done after the mesh interface has come up.

I have read your workaround. But it seems to depend on a script that runs every now and then - which applies a fix.

One of the big advantages of 802.11s is the VERY fast response to topology changes (when all nodes can hear each other). The way I see it is that your workaround would mean there are gaps of time where the connectivity may be lost - until the next loop of the script.

To put this in context: I had a client who required gaps in communication of at most 250 ms with nodes that were mobile (i.e. vehicles), and my previous work with 802.11s on OpenWrt told me that re-routing on topology changes took a few milliseconds.

I don't think your workaround would be compatible with comfortably beating the 250 ms objective.

Am I right?

"re-routing" in the sense of self healing layer 2 "routing" can indeed be very fast and there are many timeout parameters that can be set with "iw" to tune this.

I am using the term "routing" here to mean "layer 2 routing" and it has nothing to do with IP routing of course. Perhaps we could call it dynamic bridging - but in reality it is like a virtual layer 2 switch with each mesh node acting like a switch port.
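To see which of those timing parameters exist on a given node, the full parameter dump can be filtered. This is a rough sketch: it assumes each line of the dump has the form "<name> = <value>", and the helper name mesh_timing_params is hypothetical.

```shell
#!/bin/sh
# Sketch: keep only the timing-related lines of a mesh_param dump read on stdin.
# Assumes dump lines look like "<name> = <value>".
mesh_timing_params() {
    grep -E '^mesh_[a-z0-9_]*(timeout|interval|time) *='
}
```

Typical use would be `iw dev wlan0-1 get mesh_param | mesh_timing_params`, where wlan0-1 is the mesh interface name used earlier in this thread. A single value can then be changed with `iw dev wlan0-1 set mesh_param <name>=<value>`.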

However 802.11s is not intended for use with mobile mesh nodes, particularly fast moving ones (moving relative to others).

Quickly moving a node out of range of its immediate peers will mean that node will be dropped until it can rejoin the mesh in a new location. This can take a few seconds. But if that node keeps moving it will probably never rejoin.

It is not a workaround or a fix. It is a means of adding parameters to the mesh configuration, which can only be done after the mesh interface has come up. It does not have to run every now and again, just once after the interface has come up. You could either check every now and again to make sure the config is still correct, or get hotplug to do it for you.
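A hotplug variant might look roughly like this. Treat it as a sketch under assumptions, not a tested implementation: the file location (for example /etc/hotplug.d/net/50-mesh-param), the variable names MESH_DEV, MESH_PARAMS and MESH_TRIES, and the function name are all made up here, and it assumes hotplug exports ACTION and INTERFACE for net events. Because the event can fire before the interface has actually joined the mesh, the set call is retried a few times.

```shell
#!/bin/sh
# Sketch of a hotplug handler (location and names are assumptions, see above).

IW="${IW:-iw}"                    # overridable so the logic can be dry-run
MESH_DEV="${MESH_DEV:-wlan0-1}"   # mesh netdev name used earlier in this thread
MESH_PARAMS="${MESH_PARAMS:-mesh_gate_announcements=1}"
MESH_TRIES="${MESH_TRIES:-10}"    # how many times to retry, one second apart

apply_mesh_params() {
    local action="$1"
    local ifname="$2"
    local tries=0

    # Only act when the mesh netdev itself appears
    [ "$action" = "add" ] && [ "$ifname" = "$MESH_DEV" ] || return 0

    while [ "$tries" -lt "$MESH_TRIES" ]; do
        # MESH_PARAMS is deliberately unquoted: one argument per param=value
        $IW dev "$ifname" set mesh_param $MESH_PARAMS && return 0
        tries=$((tries + 1))
        sleep 1
    done
    return 1
}

apply_mesh_params "$ACTION" "$INTERFACE"
```

Note that the retry loop blocks the handler for up to MESH_TRIES seconds when the mesh never comes up, so keep that value small.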

Yes, your understanding seems correct.

A client that is not itself a mesh node, and/or a gateway that is not itself a mesh node, has to be bridged to the mesh or routed to/from the mesh one way or another.

However, that's irrelevant to the problem. The posters had issues between the mesh nodes. So they would've still had the same problem if all edge clients / gateways were eliminated and just the mesh nodes tried to talk to each other.

I think you thoroughly misread me. I'm aware that the mesh is spanned by the 802.11s wireless nodes.

I don't think I misread at all.

Unless they set

mesh_fwding='1'
mesh_gate_announcements='1'
mesh_rssi_threshold='-80' (or some other sensible value that gives a good connection)

then yes they would indeed have the same problem.
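For anyone who wants to set these three options from the command line, here is a small sketch that only prints the uci commands, so they can be reviewed before being applied. The function name and the section name used in the usage line are hypothetical, and whether a given build actually applies all three options when the interface comes up is something to verify with `iw dev <mesh-dev> get mesh_param` afterwards.

```shell
#!/bin/sh
# Sketch: print "uci batch" commands that set the three mesh options on one
# wifi-iface section of /etc/config/wireless. Nothing is changed by this function.

# mesh_uci_commands SECTION [RSSI_THRESHOLD]
mesh_uci_commands() {
    local section="$1"
    local rssi="${2:--80}"   # defaults to the -80 suggested above
    echo "set wireless.$section.mesh_fwding=1"
    echo "set wireless.$section.mesh_gate_announcements=1"
    echo "set wireless.$section.mesh_rssi_threshold=$rssi"
    echo "commit wireless"
}
```

If the mesh wifi-iface section were named mesh0, `mesh_uci_commands mesh0` shows the commands and `mesh_uci_commands mesh0 | uci batch` applies them; wireless then has to be restarted (for example with `wifi`) for them to take effect.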

Well, I think you still do. You try to explain things to me that are already clear to me.

My first reply was to nemesis, due to them suggesting something irrelevant to the problem at hand. Therefore I asked them about their understanding of the mesh and the problem. (And they answered and do indeed seem to understand it well enough.)

I appreciate that you gave a useful solution to the problem and I never denied that, there's no need to repeat that to me over and over.

I apologise for repeating myself. I am too used to people not reading full threads and instead skipping to the end and not noticing solutions have been suggested/put forward.
This thread should be closed now in my opinion.

There is an argument that the solution should be incorporated somehow into the mesh configuration processing.... That is probably for a new thread.

This is most likely a misconfiguration and not a bug (my 2 cents).
In these cases sharing the relevant configuration of the devices may help debugging.

Hi!

Thanks for the pointers in the right direction.
Mesh forwarding and the RSSI threshold seem to be set correctly, but for the gate announcements I had to use the scripts above.
The hotplug script did not work for me for some reason (current OpenWrt master build), so I built a combination of the two posted solutions like so:

In /etc/crontabs/root

*/5 * * * * /usr/bin/mesh_param.sh

and in /usr/bin/mesh_param.sh

#!/bin/sh

. /lib/functions.sh

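parse_list() {
        local value="$1"
        local _device="$2"
        local _name="${value%%=*}"
        local _current

        # Query only this parameter; without a name, iw dumps all of them,
        # so the comparison could never match.
        _current="$(iw dev "$_device" get mesh_param "$_name" 2>/dev/null)"
        [ "$_current" != "${value#*=}" ] && iw dev "$_device" set mesh_param "$value"
}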

parse_interface() {
        local section="$1"
        local _mode
        local _mesh_param
        local _mesh_id
        local _device

        config_get _mode "$section" mode
        # Not mesh then exit
        [ "$_mode" != "mesh" ] && return 0

        config_get _mesh_param "$section" mesh_param
        # No mesh_param list then exit
        [ -z "$_mesh_param" ] && return 0

        config_get _mesh_id "$section" mesh_id
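        # Wait for the mesh device to appear, but give up after about a minute
        # so that cron runs cannot pile up while the mesh is down
        local _tries=0
        while [ "$_tries" -lt 12 ]; do
                sleep 5
                _tries=$((_tries + 1))
                _device=$(iwinfo | grep "$_mesh_id" | awk '{print $1}')
                [ -z "$_device" ] && continue

                config_list_foreach "$section" mesh_param parse_list $_device
                break
        done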
}

config_load wireless
config_foreach parse_interface wifi-iface

Together with the aforementioned option list mesh_param 'mesh_gate_announcements=1' this works nicely.

Interestingly enough, the mac80211.sh script seems to pass all three parameters correctly, but the set for mesh_gate_announcements is ignored while the other two take effect (I did some "echo debugging", and it does call the iw binary to set all three parameters).

On another, probably related, note: I can't get multi-hop meshes to work reliably even with the options set correctly. I'm running current OpenWrt master on IPQ4019 (hap-ac2 and disc-lite5-ac; tested with both the upstream and the -ct driver & firmware, with exactly the same results).

The mesh builds correctly and works but when it's not a full mesh (so forwarding is required), the moment I start iperf the mesh is dead and does not recover without restarting the mesh interfaces on all involved routers...

Has anyone experienced a similar issue and knows how to work around this? (In my case it's only a 3-device mesh that is pretty much static so using two separate, bridged meshes is possible but I'd like to have the airtime management benefit of running a single mesh)

Update:
With my laptop in monitor mode I can confirm that even with the correct settings applied, ARP frames aren't relayed anymore after a traffic burst.
My SSH session stays alive after starting iperf, but iperf drops to 0 bit/s after the first few packets, and from that point on ARP doesn't get relayed until I restart the mesh interface.
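In case it helps anyone narrow this down, here is a small sketch for capturing the mesh state of a node before and after the traffic burst, so the two captures can be compared. The function name mesh_snapshot is made up; the three iw subcommands it calls (station dump, mpath dump, get mesh_param) are the standard ones.

```shell
#!/bin/sh
# Sketch: dump the mesh state of one interface so two snapshots can be compared.

IW="${IW:-iw}"   # overridable for a dry run

# mesh_snapshot MESH_DEV
mesh_snapshot() {
    local dev="$1"
    echo "### station dump"
    $IW dev "$dev" station dump
    echo "### mpath dump"
    $IW dev "$dev" mpath dump
    echo "### mesh_param"
    $IW dev "$dev" get mesh_param
}
```

For example, run `mesh_snapshot wlan0-1 > /tmp/before.txt` on each node, start iperf, then run `mesh_snapshot wlan0-1 > /tmp/after.txt` and compare the files. Counters will differ in any case; the peer link states and the next hops in the path table are probably the interesting part.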