Three routers in 802.11s mesh. Single main node, two satellites. Both satellites work if the main node is the closest, but won't work if a satellite is closest

I've got a three router mesh setup, with the whole setup working as an AP (no DHCP servers running on the routers, firewalls disabled, etc.). I'm using tri-band routers (with two 5 GHz radios), with one of the 5 GHz radios working as the backhaul.

This all works fine if I place the routers so that the main node is in the middle, one satellite is to the left of that main node, and the other satellite is to the right of that main node. I'm able to ping all nodes, traffic seems to flow correctly, etc. in this case. Graphically, this is what works:

Sat1 ----------- Main Node ------------- Sat 2

However, if I place the nodes such that one satellite tries to communicate through another satellite in order to get back to the main node (because this is the best route), it doesn't work. Graphically, this is what doesn't work:

Main Node -------- Sat 1 -------------- Sat 2

In this case (which is really the one I need, since the network hardware that's all being hooked up is at one end of the building), Main Node can ping and access Sat 1 (and vice-versa), and Sat 1 and Sat 2 can ping and access each other, but Main Node and Sat 2 cannot communicate with each other. Further, no devices plugged into Sat 2 can communicate with Main Node (but can communicate with devices on Sat 1).

All nodes have firewall disabled, odhcpd disabled, and dnsmasq disabled. In the non-working case, the nodes still all seem to know about each other, as a run of iw dev phy2-mesh0 mpath dump shows that both Sat 2 and Main Node know the MAC address of each other and know that they can reach each other via a next hop of Sat 1 (which should be correct?), but I've never gotten any packets to make it between the two.

Various things I've tried:

  • Changing `mesh_hwmp_rootmode` value on the main node (was initially 4, also tried 2).
    
  • Changing `mesh_hwmp_rootmode` value on the satellites (was initially 0, also tried 2).
    
  • Enabling `multicast_to_unicast_all` on all nodes.
    
  • Enabling `mesh_fwding` on all nodes (it was already enabled on the main node, but not the satellites -- this was the one I thought would fix it, but it did not).
    

This mesh isn't using mesh11sd; instead I just manually configured it, as I thought it would be doable that way (but maybe not?). Snippet of the configs, as configured currently:

Satellite nodes:

config wifi-iface 'mesh'
        option device 'radio2'
        option encryption 'sae'
        option key 'redacted'
        option mesh_id 'MESH'
        option mode 'mesh'
        option network 'lan'
        option mesh_fwding '1'
        option mesh_gate_announcements '0'
        option mesh_hwmp_rootmode '0'
        option mesh_max_peer_links '3'
        option mesh_ttl '5'
        option mesh_element_ttl '3'
        option mesh_hwmp_max_preq_retries '2'
        option mesh_rssi_threshold '-75'
        option multicast_to_unicast_all '1'

Main node:

config wifi-iface 'mesh'
        option device 'radio2'
        option encryption 'sae'
        option key 'redacted'
        option mesh_id 'MESH'
        option mode 'mesh'
        option network 'lan'
        option mesh_fwding '1'
        option mesh_gate_announcements '1'
        option mesh_hwmp_rootmode '2'
        option mesh_max_peer_links '5'
        option mesh_ttl '5'
        option mesh_element_ttl '3'
        option mesh_hwmp_max_preq_retries '2'
        option mesh_rssi_threshold '-75'
        option multicast_to_unicast_all '1'

Anyone know what else I should try? This is driving me nuts. It feels like a Layer 3 problem, but I'm not sure why that would be when everything is just bridged on all the nodes and the firewall isn't even running.

Full disclosure: this is an NSS build (on LN1301 / MX4300), so it is possible this is just an NSS issue, but I'm hoping I've just screwed something up in the config and it's workable...

Thanks!

This is typical for the "built in" basic static 802.11s mesh config. It will only reliably work between any two nodes with a single hop.

This is not a bug. Rather, it is "something missing".

The mesh config options you have added to the wireless config are along the right lines, so well researched so far. The problem is that most of those options (better described as parameters), although "supported" in the wireless config, are never activated (this is the "something missing").
The reason is simple: the uci wireless config is used to set up the wireless vifs (virtual interfaces) and then bring them up, but most of the mesh parameters can only be set AFTER the mesh vif is up, so they fail.

Which mesh parameters actually fail to be set depends somewhat on the wireless driver being used, but generally it is safe to say that only mesh_fwding and mesh_rssi_threshold are actually activated. There is much more to it than that, as the actual HWMP settings need to be dynamic, but this is not the place to go into that detail.
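
As a rough illustration of the point above, the parameters can be inspected and set by hand once the mesh vif is up, using the `get mesh_param` / `set mesh_param` subcommands of `iw`. This is only a sketch: the interface name `phy2-mesh0` is taken from the original post, the helper names `mesh_get_cmd` and `mesh_set_cmd` are made up for this example, and the helpers only print the commands so they can be reviewed before being run on a router. Anything set this way is runtime state and is lost when the interface is restarted.

```shell
#!/bin/sh
# Assumption: the mesh vif is phy2-mesh0, as in the original post.
MESH_IF="${MESH_IF:-phy2-mesh0}"

# Print the iw command that reads back the value the kernel is actually using.
# $1 = mesh parameter name (e.g. mesh_hwmp_rootmode)
mesh_get_cmd() {
    printf 'iw dev %s get mesh_param %s\n' "$MESH_IF" "$1"
}

# Print the iw command that sets one mesh parameter on a vif that is already up.
# $1 = mesh parameter name, $2 = value
mesh_set_cmd() {
    printf 'iw dev %s set mesh_param %s=%s\n' "$MESH_IF" "$1" "$2"
}

# Usage on the router (review the printed command, then pipe it to sh):
#   mesh_get_cmd mesh_hwmp_rootmode | sh
#   mesh_set_cmd mesh_hwmp_rootmode 2 | sh
```

Comparing the `get mesh_param` output with the values in /etc/config/wireless should show which options were never applied.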

Configuring it manually like that is not doable, at least not in a consistent way.

For more than a simple two node mesh, you need some kind of mesh management service. The two most common are BATMAN and Mesh11sd.

BATMAN does not even use the kernel's built-in 802.11s mesh protocol (HWMP); it uses its own, and works best for very large mesh networks.

Mesh11sd, below version 6, is designed more for local WISP implementations (but v6 is more flexible).

It is, effectively, a Layer 3 problem. Your config/layout almost certainly results in continuous layer 2 path changes, flipping between single hop and two hop. The effect on layer 3 and above is usually fatal, e.g. TCP connections establish, drop, re-establish, drop, and so on.
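
To see whether the path really is flipping between one hop and two, the output of `iw dev phy2-mesh0 mpath dump` can be summarised each time it is run. A minimal sketch, assuming the first two columns of the dump are the destination and next-hop MAC addresses; `mpath_hops` is a hypothetical helper name, and a destination whose next hop is a different MAC is reported as relayed.

```shell
#!/bin/sh
# Summarise an "iw dev <if> mpath dump" read from stdin.
# Assumption: column 1 is the destination MAC and column 2 the next-hop MAC.
# The header line has no ":" in its first field, so it is skipped.
mpath_hops() {
    awk '$1 ~ /:/ && $2 ~ /:/ {
        kind = ($1 == $2) ? "direct" : "relayed"
        print $1, $2, kind
    }'
}

# Usage on the router, re-checking every 2 seconds:
#   while true; do
#       iw dev phy2-mesh0 mpath dump | mpath_hops
#       echo ---
#       sleep 2
#   done
```

If the same destination keeps switching between direct and relayed, that would match the flip-flopping described above.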

Without installing the "something missing", your only hope with this setup is to very carefully position the three nodes and/or try various levels of mesh_rssi_threshold to enforce some level of stability.

  1. I would try setting mesh_rssi_threshold to -65dBm on all nodes and rebooting.

  2. If the issue is still present, try -62dBm, test again. Keep going in 3dBm steps until it (hopefully) starts working.

  3. If one node is failing to connect, go the other way, i.e. try -68dBm, test, and repeat until (hopefully) it starts working.
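
The stepping procedure above could be scripted roughly like this. It is a sketch under a few assumptions: the wifi-iface section is named `mesh` as in the posted config (so the option is `wireless.mesh.mesh_rssi_threshold`), and `rssi_steps` and `rssi_uci_cmds` are hypothetical helper names. The second helper only prints the uci commands so they can be checked before running.

```shell
#!/bin/sh
# Print a sequence of thresholds.
# $1 = start value in dBm, $2 = signed step in dB, $3 = number of values
rssi_steps() {
    i=0
    while [ "$i" -lt "$3" ]; do
        echo $(( $1 + i * $2 ))
        i=$((i + 1))
    done
}

# Print the uci commands that store one threshold value.
# Assumption: the wifi-iface section is named 'mesh', as in the posted config.
# $1 = threshold in dBm
rssi_uci_cmds() {
    printf "uci set wireless.mesh.mesh_rssi_threshold='%s'\n" "$1"
    echo "uci commit wireless"
}

# Usage: list the values to try, apply one on every node, then reboot and test.
#   rssi_steps -65 3 4      # stricter direction
#   rssi_steps -65 -3 4     # more permissive direction
#   rssi_uci_cmds -65 | sh
```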

If you achieve the required result, bear in mind it is not guaranteed to keep working because we are working with microwaves here and physics can have a big effect. This sounds silly, but someone opening the fridge door in the kitchen could start the path flip-flopping again - even your cat walking up the stairs could do it.

A cumbersome alternative would be to have two separate mesh ids, one for A to B, the other for B to C, if you see what I mean. The downside of this, apart from being pretty complex, is that for max throughput you would need quad-radio nodes and more bandwidth than 5 GHz can give you :scream:

If you want to discuss mesh11sd, or even test the upcoming version 6, let me know.

Thanks, this is all very helpful information! I had been trying to understand some of the intricacies of the mesh settings, and the limitations of setting them via LuCI/UCI, and this has clarified a lot of what I had missed in my previous attempts.

I would definitely be interested in trying a mesh11sd setup with these routers, though I fear there also might be a Qualcomm NSS issue at play. In the NSS discussion thread, one user indicated that their issue with two hop communications went away when they switched to the FOSS build of OpenWrt (away from NSS), though with the downside of reduced performance and increased CPU load (due to lack of hardware NSS acceleration). And Qosmio indicated in that thread that he believes there may be an NSS issue with 2 hop communications inside of the mesh. However, maybe the underlying issue is simply that OpenWrt's OSS driver is more forgiving when the mesh parameters are not configured properly? I suppose it's worth a shot!

Possibly some other problem, but this is expected anyway.

802.11s is a vif (Virtual InterFace) thing, operating on the radio at layer 2 so, theoretically, should be unaffected by hardware acceleration of packet routing decisions.

You are missing the point - it is not possible to fully enable and tune the 802.11s mac-routing (aka HWMP) by uci config options in /etc/config/wireless alone - something else is needed - Mesh11sd is an example.

Like I said, "theoretically", 802.11s should not be affected by NSS. That user may well have changed the locations of his nodes, or just been very lucky - without proper testing, we will never know.