Ping anomaly in trunk image (batman, mesh)

On my network, there are five routers with different architectures. Whenever I get around to it, each gets a new image built from trunk.
The last image was from January and I didn't notice anything out of the ordinary.
But the images from last week and the day before yesterday appear to have unstable WiFi.
From time to time, there are huge stutters and strange delays. Everything in /etc/config/ is the same as it was before.

The AP reports a noise floor of -107 dBm both while it's working and when it's wonky.
The busy/active ratio is about 0.1. I calculated it from the counter offsets before and after a ping capture while it was wonky, found it to be 0.1 then as well, and decided it's unlikely to be a relevant factor.
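
For reference, a ratio like this can be worked out from two samples of the cumulative channel survey counters. This is a minimal sketch, assuming each sample carries active and busy times in milliseconds; the function name, the dictionary keys and the sample numbers are made up for illustration and are not readings from this AP:

```python
def busy_ratio(before, after):
    """Return the busy/active ratio over the interval between two counter samples."""
    active = after["active_ms"] - before["active_ms"]
    busy = after["busy_ms"] - before["busy_ms"]
    if active <= 0:
        raise ValueError("active time did not advance between samples")
    return busy / active


# Illustrative samples only (hypothetical values).
before = {"active_ms": 1_000_000, "busy_ms": 100_000}
after = {"active_ms": 1_020_000, "busy_ms": 102_000}
print(f"{busy_ratio(before, after):.2f}")  # prints 0.10
```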

This is an average output of ping:

  • from the AP to the client:
--- 192.168.0.124 ping statistics ---
20 packets transmitted, 20 packets received, 0% packet loss
round-trip min/avg/max = 2.597/128.578/259.988 ms
  • the other way around:
--- n1212.fritz.box ping statistics ---
20 packets transmitted, 20 received, 0% packet loss, time 19030ms
rtt min/avg/max/mdev = 1.914/10.191/69.681/17.618 ms

Both pings were run simultaneously. Note the different deviation when pinging the client.

But when it's particularly crazy:

  • AP -> client
--- 192.168.0.124 ping statistics ---
40 packets transmitted, 12 packets received, 70% packet loss
round-trip min/avg/max = 3.095/14.371/45.857 ms
  • client -> AP
--- n1212.fritz.box ping statistics ---
40 packets transmitted, 22 received, 45% packet loss, time 117619ms
rtt min/avg/max/mdev = 3.955/286.780/1033.594/454.832 ms
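
The two ping implementations print slightly different summary lines, so when comparing runs it can help to recompute the loss from the packet counts. A small sketch that handles both formats shown above; the regular expression and the function name are just one possible way to do it:

```python
import re

# Matches both "12 packets received" (BusyBox) and "22 received" (iputils).
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")


def loss_percent(summary_line):
    """Recompute packet loss in percent from a ping summary line."""
    match = SUMMARY.search(summary_line)
    if match is None:
        raise ValueError("not a ping summary line")
    sent, received = map(int, match.groups())
    return 100.0 * (sent - received) / sent


ap_to_client = "40 packets transmitted, 12 packets received, 70% packet loss"
client_to_ap = "40 packets transmitted, 22 received, 45% packet loss, time 117619ms"
print(loss_percent(ap_to_client), loss_percent(client_to_ap))  # prints 70.0 45.0
```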

What could cause such a difference?
Both directions require a round trip - shouldn't the timings be similar?
Is there something else I should have a look at?

You might at least tell us which routers you are talking about.
There are large differences between targets, especially regarding the WiFi drivers, so just "five different routers" and "from January" + "last week" do not say much about the actual versions.

The 70% or 45% packet loss makes me think of MAC duplication, which might cause half of the packets to get routed wrongly. Are you sure that you are not overriding the hardware MACs in the network settings? (If two or three routers have the same MAC override, some packets will get lost.)
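
One quick way to check for this is to collect the interface MACs from each router and look for any address that shows up on more than one device. A rough sketch of that comparison; the router names and addresses below are placeholders, not taken from this thread:

```python
from collections import defaultdict


def duplicate_macs(macs_by_router):
    """Return {mac: [routers...]} for every MAC seen on more than one router."""
    owners = defaultdict(set)
    for router, macs in macs_by_router.items():
        for mac in macs:
            owners[mac.lower()].add(router)  # compare case-insensitively
    return {mac: sorted(routers) for mac, routers in owners.items() if len(routers) > 1}


# Hypothetical inventory for illustration only.
inventory = {
    "router-a": ["02:00:00:00:00:01", "02:00:00:00:00:02"],
    "router-b": ["02:00:00:00:00:03"],
    "router-c": ["02:00:00:00:00:01"],
}
print(duplicate_macs(inventory))
```

An empty result means no address is shared between the listed routers.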

Hi and thanks for your reply!

Ooops... This is the exact version of OpenWrt. The router for this test is a Compex WPJ428 (IPQ40XX), which has dual-band ath10k onboard. I didn't want to use an MT7621 device because there was quite a hefty update for mt76. Only 2.4 GHz is in use.

        option hwmode '11g'
        option path 'platform/soc/a000000.wifi'
        option htmode 'HT20'
        option country 'DE'
        option channel '1'

Very promising idea! I didn't override any MAC addresses but I have two batman meshes (the other four routers are two pairs with IBSS (ath79) / 802.11s (mt7621)) that also got upgraded at the same time.

I'll disable them and keep an eye out for the client in the bridge MAC table on the AP.

The issue seems to have been caused by the mesh.

This is the setup:

    [uplink]           (802.11s)
    (cable) \       ~~~~~~~~~~~~~~~~
             \    /                  \
WPJ428 ---- WR2100  ---- WDR4300  --- WR2100  (all cable)
               ~            ~
     (802.11s) ~            ~ (IBSS)
             WR2100       WP563

I've had this setup for a while now and problems only started to appear lately when, I assume, broadcasts were "multiplied" in one mesh and put back on the wire, only to be duplicated in the other and put back again.

Hardly ever happens. It's working right now. But when it does, it stops as soon as any one of the mesh connections (or, in fact, the cable) is cut.
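
That behaviour fits a layer-2 loop: as long as the links form a cycle, a broadcast can keep circulating, and removing any single link on the cycle breaks it. A toy sketch of the idea, using a made-up three-node topology (A, B and C are placeholders, not the routers above) and a plain union-find cycle check; it says nothing about how batman-adv itself deals with loops:

```python
def has_loop(links):
    """Return True if the undirected link list contains a cycle."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a == root_b:
            return True  # both ends are already connected: this link closes a loop
        parent[root_a] = root_b
    return False


# Hypothetical topology: two cable links plus one mesh link in parallel.
links = [("A", "B"), ("B", "C"), ("A", "C")]
print(has_loop(links))
for cut in links:
    remaining = [link for link in links if link != cut]
    print(cut, has_loop(remaining))
```

With all three links present the check reports a loop; with any one of them removed it does not.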

Sounds like you created circular routing.

Yeah :smiley:
I assumed batman would handle the duplicate connections in- and outside the mesh.
And it does - in a very impressive way. But it appears to fail if the mesh devices are too close together.
That seems to be a crucial factor in determining how often the loop occurs.

Maybe batman handles broadcasts from the mesh interface differently from those received on hard interfaces..
Don't know..
Don't know whether I'll be looking into it closer.