Bridge spontaneously stops routing IPv6 and/or IPv4 until kicked

Hey folks,
Every so often, one or more interfaces on my access points stop routing traffic. Most often, IPv6 routing breaks and IPv4 keeps working, but sometimes IPv4 breaks and IPv6 keeps working, or both break together.

When IPv6 routing is broken...

  1. The access point's log contains one recent instance of: "br-guest: received packet on eth1.40 with own address as source address (addr:aa:aa:aa:03:00:40, vlan:0)".

  2. Running tcpdump on a router upstream of the access point while pinging the broken interface shows a steady stream of icmp6 (NDP) neighbor-solicitation packets but no neighbor-advertisements in reply.

  3. Running tcpdump on the access point itself unbreaks the routing! Right away, the dumps show a neighbor-solicitation packet, its neighbor-advertisement reply, and echo-requests/replies getting through as expected.

  4. Similarly, restarting the interface on the access point unbreaks the routing.

  5. Once unbroken, routing continues to work for some indefinite period. I haven't observed any pattern to the interval.

(I haven't investigated the IPv4 breaks closely yet but I'd guess the traffic is getting dropped similarly.)
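For anyone wanting to repeat the captures described above, here is a rough sketch. The interface names are assumptions. The -p flag asks tcpdump not to put the interface into promiscuous mode, which might let you watch the access point without unbreaking it (no guarantee, since an interface can be promiscuous for other reasons).

```shell
#!/bin/sh
# Build a tcpdump filter matching only NDP neighbor solicitations (ICMPv6
# type 135) and neighbor advertisements (type 136). The ip6[40] offset
# assumes no IPv6 extension headers sit in front of the ICMPv6 header.
ndp_filter() {
  echo 'icmp6 and (ip6[40] == 135 or ip6[40] == 136)'
}

# Capture NDP traffic on the given interface.
# -n: no name resolution; -p: do not enable promiscuous mode.
ndp_capture() {
  tcpdump -n -p -i "$1" "$(ndp_filter)"
}

# Example (interface names are assumptions):
#   ndp_capture eth0       # on the upstream router
#   ndp_capture br-guest   # on the access point
```

If the capture with -p stays silent while a capture without -p shows the solicitations, that would point at the promiscuous-mode side effect rather than at tcpdump itself.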

More details...

  1. The access points each have 2 radios each with 3 networks configured (for lan, infra, guest). These networks are bridged to their corresponding Ethernet vlans. For example, the "br-guest" bridge mentioned above contains eth1.40, wifi-guest (2.4 GHz), wifi-guest (5 GHz).

  2. Each of the 3 bridged interfaces on the access point has a unique MAC address.

  3. I'm pretty confident that my network doesn't contain any loops or meshes.

  4. Enabling/disabling STP has no effect.

  5. Enabling/disabling wifi networks has no effect.

  6. Making changes upstream of the access points has no effect.

  7. The only thing that seems to unbreak routing is touching the interface, either by restarting it or by attaching tcpdump (which puts it into promiscuous mode). This works every time.

  8. Remember, the interface isn't completely dead. Routing over IPv4 can continue even while IPv6 is broken and vice-versa.

  9. This problem affects both of my Archer C7 v2s running OpenWrt 19.07.7.
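Since promiscuous mode keeps coming up, it may help to check whether an interface is currently promiscuous. A minimal sketch, assuming the flags file in sysfs reflects the kernel's live interface flags (I believe it does, including promiscuity requested by tcpdump, but treat that as an assumption):

```shell
#!/bin/sh
# IFF_PROMISC is bit 0x100 (256) in the kernel's interface flags.
IFF_PROMISC=256

# Succeed if the given flags value (e.g. "0x1303") has IFF_PROMISC set.
flags_promisc() {
  [ $(( $1 & IFF_PROMISC )) -ne 0 ]
}

# Succeed if the named device is promiscuous according to sysfs.
is_promisc() {
  flags_promisc "$(cat "/sys/class/net/$1/flags")"
}

# Example:
#   is_promisc br-guest && echo "br-guest is promiscuous"
```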

So I'm pretty stumped at this point. It seems like the kernel just gives up routing IPv4 or IPv6 traffic over the bridge interface when it encounters a self-addressed packet. I've no idea where that packet is coming from anyhow.

Advice welcome!
Jeff.

That's wild. So there's a bug in the driver/firmware causing multicast packets to be dropped when interfaces are bridged unless the interface is in promiscuous mode?

This seems like something a lot of folks would encounter. Is there a recommended fix/workaround? (I suppose I could try adding a hotplug script...)

Hmm, well that was interesting.

I ran "ifconfig br-guest promisc" and it unbroke the interface temporarily. A few minutes later, the bridge stopped routing packets again and syslog showed "br-guest: received packet on eth1.40 with own address as source address (addr:aa:aa:aa:03:00:40, vlan:0)"

There are 3 vlans on the same physical interface. I noticed that if I ping interfaces on one vlan then it breaks the interfaces on another vlan. They sort of flip-flop. So the workaround seems to be to make all bridges promiscuous to cover all vlans.

Just for fun, I tried "ifconfig eth1.40 promisc" and that broke the bridge completely. Same thing with "ifconfig eth1". So it's important to set the bridge promiscuous rather than its underlying physical interfaces.

I wonder if I'd need to do anything special for any other vlan interfaces that aren't in bridges. (In my case, there are 3 bridges corresponding to 3 vlans so it works out evenly.)

For anyone following along in the future, here's my current workaround.

Note: The problem is still not fixed! Even with all bridges set to promiscuous mode I sometimes see one or another break intermittently when I run a script that pings each interface one after the next. Sometimes the pings will hang. It does seem to "fix" itself on the next round-robin attempt but there's still something funny going on behind the scenes.

vim /etc/hotplug.d/iface/99-bridge-promisc

#!/bin/sh
# Bridge interfaces may get into a bad state where they drop multicast packets
# such as NDP solicitations.  As a workaround, set them to promiscuous mode.
case "$ACTION" in
  ifup)
    if [ "${DEVICE:0:3}" = "br-" ]; then
      ifconfig "$DEVICE" promisc
    fi
    ;;
  ifdown)
    ;;
esac

echo "/etc/hotplug.d/iface/99-bridge-promisc" >>/etc/sysupgrade.conf
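
The round-robin ping check mentioned in the note above could look roughly like this. It is only a sketch: the addresses are placeholders from the documentation ranges, and some systems may need ping6 (or ping -6) for the IPv6 targets.

```shell
#!/bin/sh
# Ping each target once, in order, and print which ones answer.
# PING can be overridden, which also makes the loop easy to dry-run.
PING="${PING:-ping}"

check_targets() {
  for target in "$@"; do
    # -c 1: send one echo request; -W 2: wait at most 2 seconds for a reply.
    if $PING -c 1 -W 2 "$target" >/dev/null 2>&1; then
      echo "ok   $target"
    else
      echo "FAIL $target"
    fi
  done
}

# Placeholder addresses, one per vlan interface on the access point:
#   check_targets 192.0.2.10 192.0.2.20 2001:db8:40::1
```

Running it in a loop with a timestamp in front of each line makes it easier to line up failures with the log messages.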

The first solution that springs to mind is to just drop these "own address as source address" packets with an iptables rule.

However, this link purports to have a better solution. It's probably worth a try...

When a Linux bridge receives a packet with a new source MAC address from a particular bridge port, it stores the MAC address along with the port number in its MAC learning table. A timer is associated with each entry in the table, so that the entry expires after a certain period (so-called “ageing time”), unless it is refreshed before then. By default the ageing time in a Linux bridge is set to 300 seconds.

To resolve this problem, we need to disable MAC address learning in a Linux bridge. To do that, we set the “ageing time” to 0 with the following command:

# brctl setageing <bridge-interface> 0
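
To apply that to every bridge on a device rather than one at a time, something like the sketch below might do. The SYSFS_NET override is my own addition so the listing can be tried on a scratch directory. As I understand it, an ageing time of 0 makes learned entries expire right away, so the bridge floods frames to all ports instead of remembering where each MAC lives.

```shell
#!/bin/sh
# List bridge devices: a netdev is a bridge if it has a "bridge" directory
# in sysfs. SYSFS_NET is a hypothetical override for trying this elsewhere.
list_bridges() {
  root="${SYSFS_NET:-/sys/class/net}"
  for dev in "$root"/*; do
    [ -d "$dev/bridge" ] && basename "$dev"
  done
  return 0
}

# Set the ageing time to 0 on every bridge found.
disable_ageing_all() {
  for br in $(list_bridges); do
    brctl setageing "$br" 0
  done
}

# Example:
#   list_bridges          # review the list first
#   disable_ageing_all
```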

Ooh! There's an interesting idea. I might be concerned about stuff breaking if old entries don't age out eventually though.

Meanwhile, I noticed that the "own address as source address" message was getting written to the log roughly every 4 minutes now that I've set all of the bridges promiscuous. And after comparing traces from a few occurrences, here's the culprit:

00:52:33.634003 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    0.0.0.0 > all-systems.mcast.net: igmp query v2
00:52:33.666036 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    0.0.0.0 > all-systems.mcast.net: igmp query v2

Same thing with -XX to show Ethernet frames:

01:01:05.633997 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    0.0.0.0 > all-systems.mcast.net: igmp query v2
        0x0000:  0100 5e00 0001 aaaa aa02 0040 0800 46c0  ..^........@..F.
        0x0010:  0020 0000 4000 0102 0417 0000 0000 e000  ....@...........
        0x0020:  0001 9404 0000 1164 ee9b 0000 0000       .......d......
01:01:05.666467 IP (tos 0xc0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA))
    0.0.0.0 > all-systems.mcast.net: igmp query v2
        0x0000:  0100 5e00 0001 aaaa aa02 0040 0800 46c0  ..^........@..F.
        0x0010:  0020 0000 4000 0102 0417 0000 0000 e000  ....@...........
        0x0020:  0001 9404 0000 1164 ee9b 0000 0000 0000  .......d........
        0x0030:  0000 0000 0000 0000

So we have received a packet destined to 01:00:5e:00:00:01 (the Ethernet multicast address for 224.0.0.1, all-systems) from aa:aa:aa:02:00:40 (the br-guest interface on this device). I wonder how it got here...
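
For reading the hex dump: the first 6 bytes of the frame are the destination MAC and the next 6 are the source. The destination follows the standard IPv4 multicast mapping (prefix 01:00:5e plus the low 23 bits of the group address). A small sketch of both, with function names of my own choosing:

```shell
#!/bin/sh
# Map an IPv4 multicast group to its Ethernet multicast address:
# the fixed prefix 01:00:5e followed by the low 23 bits of the group.
mcast_mac() {
  old_ifs=$IFS; IFS=.
  set -- $1            # split the dotted quad into $1..$4
  IFS=$old_ifs
  printf '01:00:5e:%02x:%02x:%02x\n' $(( $2 & 127 )) "$3" "$4"
}

# Split the first 12 bytes of a tcpdump -XX hex line into dst and src MACs.
frame_macs() {
  hex=$(echo "$1" | tr -d ' ')
  dst=$(echo "$hex" | cut -c1-12 | sed 's/../&:/g; s/:$//')
  src=$(echo "$hex" | cut -c13-24 | sed 's/../&:/g; s/:$//')
  echo "dst=$dst src=$src"
}

# Example:
#   mcast_mac 224.0.0.1
#   frame_macs "0100 5e00 0001 aaaa aa02 0040 0800 46c0"
```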

Have you compared all the interfaces on both routers for overlap? You're creating 6 virtual wlan interfaces on each router and they're both the same model of router, so if purchased close together, they may well have mac addresses similar enough that the creation of so many virtual wlan interfaces could cause overlap.

....just a stab in the dark here, as you did mention that the bridged interfaces have unique macs

You might be onto something. I compared the mac addresses for all interfaces. As expected, the routers do not overlap mac addresses with each other. However, it seems that the mac addresses for eth1 and wlan1 do coincide! How odd.

It looks like the manufacturer provisions these devices with 3 mac addresses at the factory: one for eth0, one for wlan0, and one shared by eth1 and wlan1. I guess TP-Link can exhaust a pool of mac addresses pretty quickly so it makes sense to use fewer of them where possible. Maybe the assumption is that eth1 and wlan1 will never be on the same network if eth1 is mapped to a wan zone and wlan1 is mapped to a lan zone.
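
A quick way to spot shared addresses like this, sketched under the assumption that sysfs is available (the SYSFS_NET override is hypothetical and only there for trying it on a scratch directory):

```shell
#!/bin/sh
# Print every MAC address used by more than one interface, followed by the
# names of the interfaces that share it.
dup_macs() {
  root="${SYSFS_NET:-/sys/class/net}"
  for dev in "$root"/*; do
    [ -r "$dev/address" ] && echo "$(cat "$dev/address") $(basename "$dev")"
  done | sort | awk '
    { names[$1] = names[$1] " " $2; count[$1]++ }
    END { for (m in count) if (count[m] > 1) print m ":" names[m] }
  '
}

# Example:
#   dup_macs
```

Some sharing is expected and harmless: a vlan interface normally inherits the address of its parent, so the output needs reading with the topology in mind.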

Lemme switch that around a bit and see what happens...

Or you could just deliberately override the mac address of eth1 in the interfaces setting so that it's unique

Maybe. Because eth1 is being split into vlans which are bridged, it doesn't show up as a distinct interface in LuCI. It was easier to just reconfigure the switch to use eth0. Remarkably I didn't brick anything in the process.

Unfortunately, it doesn't seem to have solved the problem. Maybe rebooting the device will shake something loose?

Alas, I'm still having intermittent routing problems and I'm still seeing occasional log messages about "own address as source address". IPv4 and IPv6 pings sometimes hang inexplicably until I ping an interface on a different vlan or kick the interface that's stuck.

So far...

  1. I set all of the bridges to promiscuous mode based on a suggestion from another thread. No change.

  2. I figured out that the self-addressed packets are IGMP queries that are generated at roughly 4-minute intervals but I don't know where they originate. They correlate with issues routing NDP packets (which are multicast) but I don't know whether the packet is a symptom, cause, or coincidence.

  3. I disabled IGMP snooping on the access points. No change.

  4. I set the mac address learning time to 0 using "brctl setageing" for each bridge. No change.

  5. I've confirmed that all bridge and vlan interfaces have unique mac addresses. The wifi interfaces for lan, infra, and guest networks all share the BSSID of the underlying radios (this is the default configuration and it should be ok, I think).
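
To double-check that settings like the ones in items 3 and 4 actually took effect, the kernel's view of a bridge can be read back from sysfs. A minimal sketch; the SYSFS_NET override is hypothetical and only there for trying it on a scratch directory:

```shell
#!/bin/sh
# Print the multicast snooping and ageing settings of one bridge as the
# kernel reports them in sysfs.
bridge_status() {
  dir="${SYSFS_NET:-/sys/class/net}/$1/bridge"
  snoop=$(cat "$dir/multicast_snooping")
  ageing=$(cat "$dir/ageing_time")
  # sysfs reports ageing_time in hundredths of a second.
  echo "$1 snooping=$snoop ageing=$(( ageing / 100 ))s"
}

# Example:
#   bridge_status br-guest
```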

Any other ideas?

Try the brute force approach and block those packets with an iptables rule

No luck blocking the packet with iptables though I might just not have found the right incantation.
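
One possible reason, offered as a guess: iptables works at the IP layer and normally doesn't see frames that are only being bridged, while ebtables filters at the bridge layer and can match on source MAC. My understanding is that the nat table's PREROUTING chain runs before the bridge learns the source address, but I haven't verified that on this kernel. A sketch that only prints the rule so it can be reviewed first (ebtables may need installing; the port and address are illustrative):

```shell
#!/bin/sh
# Build an ebtables rule that drops frames arriving on a bridge port with a
# given source MAC. Both arguments are illustrative; substitute your own
# port name and address.
drop_own_src_rule() {
  echo "ebtables -t nat -A PREROUTING -i $1 -s $2 -j DROP"
}

# Example: print the rule, review it, then run it by hand.
#   drop_own_src_rule eth1.40 aa:aa:aa:02:00:40
```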

As far as I can tell, the "own address as source address" warning emitted by the kernel is fairly benign. The bridge device driver emits this warning when it is asked to add a local MAC address to the forwarding database but it doesn't return any errors or have any other side-effects on its behavior.

Aside: I accidentally found myself looking at the latest version of the kernel source and got all excited when I found the "neigh suppress" code path which very specifically mucks around with ARP and NDP packets on bridges. However, that feature isn't in 4.14.221.