Losing Batman connectivity after a while

Hi all,

I have built a small mesh setup with two routers. It works fine for the most part, but at some point I lose the ability to ping the other device over the wifi/batman connection. Does anyone have experience with this kind of behavior?
I will refer to one of the routers as "gate" (the one with WAN access, which still works from my point of view) and the other as "repeater" (the node that is meant to relay the signal).

As far as I can tell, restarting the wireless network on the gate does not help, but restarting the repeater does the trick. By restarting I mean pulling the power plug. Restarting only the wireless on the repeater would probably do the same.

edit
I left it alone overnight and, interestingly, the connection came back up without me doing anything.

Sat Jun 13 05:09:54 2020 daemon.notice wpa_supplicant[1820]: if-bat: new peer notification for 9c:c7:a6:b3:b4:4e
Sat Jun 13 05:09:54 2020 daemon.notice wpa_supplicant[1820]: if-bat: mesh plink with 9c:c7:a6:b3:b4:4e established
Sat Jun 13 05:09:54 2020 daemon.notice wpa_supplicant[1820]: if-bat: MESH-PEER-CONNECTED 9c:c7:a6:b3:b4:4e

Unfortunately I don't have more log lines, because I have a DDNS service running which happily threw errors while the connection was down and filled up the log...
\edit
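To keep the mesh peering events from being drowned out by other services next time, the log could be filtered for them. This is only a minimal sketch: the function name `mesh_events` is made up, and the pattern is based on the wpa_supplicant lines quoted above plus the `MESH-PEER-DISCONNECTED` event that wpa_supplicant emits when a peer link goes away.

```shell
#!/bin/sh
# Sketch: keep only mesh peering events from syslog lines read on stdin.
# "mesh_events" is a hypothetical helper name, not part of OpenWrt.
mesh_events() {
    grep -E 'mesh plink|MESH-PEER-(CONNECTED|DISCONNECTED)'
}
```

On the router it would be used as `logread | mesh_events`, or `logread -f | mesh_events` to follow the log while waiting for the link to drop.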

Pretty much everything shown below is from the gate point of view.

Here are the details:
Wifi connectivity is weak, as always, but still present.


batctl n / batctl o still show a neighbor but batctl p fails

root@gate2:~# batctl n
[B.A.T.M.A.N. adv openwrt-2019.2-5, MainIF/MAC: if-bat/9c:c7:a6:b3:b4:4e (bat0/ea:41:15:e2:98:18 BATMAN_IV)]
IF             Neighbor              last-seen
       if-bat     e8:de:27:bc:6b:f4    0.760s
root@gate2:~# batctl p e8:de:27:bc:6b:f4
PING e8:de:27:bc:6b:f4 (e8:de:27:bc:6b:f4) 20(48) bytes of data
Reply from host e8:de:27:bc:6b:f4 timed out
^C--- e8:de:27:bc:6b:f4 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss
rtt min/avg/max/mdev = 0.000/0.000/0.000/0.000 ms
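The state shown above (neighbor still listed, layer-2 ping failing) could be detected automatically with something like the following. It is a rough sketch that assumes the three-column `batctl n` layout printed above; other batctl versions may format the table differently. The function names and the `BATCTL` variable are made up for illustration.

```shell
#!/bin/sh
# Sketch: report neighbors that batman-adv still lists but that no longer
# answer a batctl ping. BATCTL is a hypothetical override for testing.
BATCTL="${BATCTL:-batctl}"

# Read `batctl n` output on stdin and print one neighbor MAC per line.
# Assumes rows of the form: IF  Neighbor  last-seen (as in the output above).
neighbor_macs() {
    awk 'NF == 3 && length($2) == 17 && $2 ~ /^[0-9a-fA-F:]+$/ { print $2 }'
}

# Print every listed neighbor that fails three batctl pings in a row.
stale_neighbors() {
    for mac in $($BATCTL n | neighbor_macs); do
        $BATCTL ping -c 3 "$mac" >/dev/null 2>&1 || echo "$mac"
    done
}
```

Calling `stale_neighbors` on the gate while the problem is present should then print the repeater's MAC, which makes it easy to log a timestamp for when the link went bad.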

I set up Prometheus scraping to monitor this behavior. I don't want to go into the metric shown here in detail, just give you some intuition of what is happening: at some point connectivity gets worse and then dies off completely.
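For anyone who wants to collect something similar, one possible approach is to turn the `batctl p` summary into a value in the Prometheus text format. This is not my actual exporter, just a sketch: the metric name `batman_ping_loss_percent`, the function names and the output path in the comment are all made up.

```shell
#!/bin/sh
# Sketch: derive a packet-loss metric from a ping summary.
# All names here are hypothetical, not from an existing exporter.

# Read ping output on stdin and print the loss percentage (digits only).
loss_percent() {
    sed -n 's/.* \([0-9][0-9]*\)% packet loss.*/\1/p'
}

# Print one sample in Prometheus text exposition format.
# $1 = neighbor MAC, $2 = loss percentage
emit_loss_metric() {
    printf 'batman_ping_loss_percent{neighbor="%s"} %s\n' "$1" "$2"
}

# Possible usage on the router (the path is an assumption):
#   loss=$(batctl ping -c 10 "$mac" | loss_percent)
#   emit_loss_metric "$mac" "$loss" > /tmp/batman_loss.prom
```

The resulting file could be picked up by a textfile-style collector, assuming one is installed on the node.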

Some hardware and software versions:
Hardware: Fritz 3370 and TP-Link TL-WR1043ND v2
OpenWRT version: 19.07.3 (self-compiled)
Kernel version: 4.14.171
batctl openwrt-2019.2-3 [batman-adv: openwrt-2019.2-5]

Here are my configs:
/etc/config/wireless

config wifi-device 'radio0'
        option type 'mac80211'
        option path 'pci0000:00/0000:00:00.0/0000:01:00.0'
        option htmode 'HT20'
        option hwmode '11g'
        option country 'DE'
        option legacy_rates '1'
        option channel '2'

config wifi-iface 'lan'
        option device 'radio0'
        option mode 'ap'
        option encryption 'psk2+ccmp'
        option ssid 'FeWo'
        option ifname 'if-lan'
        option key 'xxx'
        option disabled '0'
        option network 'lan'

config wifi-iface 'batman'
        option ifname 'if-bat'
        option device 'radio0'
        option mode 'mesh'
        option mesh_id 'mesh-bridge'
        option mesh_fwding '0'
        option encryption 'sae'
        option key 'xxx'
        option network 'mesh'
        option disabled '0'

config wifi-iface 'wifinet0'
        option device 'radio0'
        option mode 'ap'
        option encryption 'psk2+ccmp'
        option key 'xxx'
        option ssid 'Gast'
        option ifname 'if-guest'
        option network 'guest'
        option isolate '1'

/etc/config/network
(without wan and vlan setup)

config interface 'lan'
	option type 'bridge'
	option ipaddr 'x.x.x.x'
	option proto 'static'
	option netmask 'x.x.x.x'
	option ifname 'bat0.1 eth0.1'
	list dns '1.1.1.1'
	list dns '8.8.8.8'
	option ip6assign '64'
	option ip6class 'local wan6'
	option ip6hint '0'

config interface 'bat0'
	option proto 'batadv'
	option routing_algo 'BATMAN_IV'
	option aggregated_ogms '1'
	option ap_isolation '0'
	option bonding '0'
	option fragmentation '1'
	option gw_mode 'off'
	option log_level '0'
	option orig_interval '1000'
	option bridge_loop_avoidance '1'
	option distributed_arp_table '1'
	option multicast_mode '1'
	option network_coding '0'
	option hop_penalty '30'
	option isolation_mark '0x00000000/0x00000000'

config interface 'mesh'
	option mtu '2304'
	option proto 'batadv_hardif'
	option master 'bat0'

config interface 'guest'
	option proto 'static'
	option ipaddr 'x.x.x.x'
	option netmask 'x.x.x.x'
	option type 'bridge'
	option ifname 'if-guest bat0.2'
	option ip6assign '64'
	list dns '8.8.8.8'
	list dns '4.4.4.4'
	option ip6hint '2'

The repeater config looks quite similar, but there lan is configured as a DHCP client and guest only bridges the guest wifi and bat0.2.

Any guess helps, and I am happy to provide missing information.

I'd like to bump this thing, since the issue still exists.
What I have done so far is add a script that pings Google every 15 minutes and restarts the devices if the ping does not get through. It is quite an aggressive approach, and it works quite well, but it is really just an ugly workaround for the problem at hand.

I also noticed that when the issue starts to appear, more and more packets get lost on that link. Yet the batman neighbors are still detected.
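The workaround is roughly along these lines. This is a simplified sketch rather than the exact script: the target address 8.8.8.8, the variable names and the default recovery command are assumptions, and the commands are overridable so the logic can be tried without actually rebooting anything.

```shell
#!/bin/sh
# Watchdog sketch: ping a target and run a recovery command on failure.
# TARGET, PING_CMD and RECOVER_CMD are hypothetical knobs with assumed defaults.
TARGET="${TARGET:-8.8.8.8}"
PING_CMD="${PING_CMD:-ping -c 3 -W 5}"
RECOVER_CMD="${RECOVER_CMD:-reboot}"

# Succeeds if the target answered the ping.
link_is_up() {
    $PING_CMD "$TARGET" >/dev/null 2>&1
}

# One watchdog pass: report the state and recover if the link is down.
watchdog_once() {
    if link_is_up; then
        echo "link ok"
    else
        echo "link down, running: $RECOVER_CMD"
        $RECOVER_CMD
    fi
}
```

Saved with a final `watchdog_once` line to a file such as /root/mesh-watchdog.sh (a made-up path), it can be run from cron with an entry like `*/15 * * * * /root/mesh-watchdog.sh`. Setting `RECOVER_CMD=wifi` would only restart wireless instead of rebooting, which is less drastic but I have not verified that it is enough.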