So I've installed 23.05.0-rc1 through rc3 on most of my devices without any new (or at least previously unknown) issues compared to 22.03. However, one device (TP-Link Archer C7 v2) is where my WiFi mesh connects to my wired LANs, and so it's also where I initiate the GRE tunnels that plumb the VLANs through to the wireless APs.
So here's the weird thing:
- The wireless APs (e.g., UniFi AP LR) are fine with 23.05.0
- The GRE tunnels all work if the source (the C7) is on 22.03.4
- If I reboot the C7 into 23.05.0, some tunnels work and some don't -- the failing ones just never come up
- The configuration of the ~10 VLANs / tunnels is largely the same, since all I'm doing is a bunch of boilerplate mapping eth1.# to @trunk1.#
I say "weird" because normally I'd expect consistent success or failure, but here there's no difference between the tunnels that work vs the ones that fail. So below is the boilerplate, for # in 3..12:
config interface 'NAME'
    option type 'bridge'
    option device 'br-name'
    option proto 'none'
    option defaultroute '0'

config switch_vlan
    option device 'switch0'
    option vlan '#'
    option ports '0t 1t'
    option vid '#'

config device
    option type 'bridge'
    option name 'br-name'
    option ipv6 '0'
    list ports 'eth1.#'
    list ports '@trunk0.#'
    list ports '@trunk1.#'
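For context, the @trunk0 / @trunk1 aliases above point at gretap interfaces; the stanza for one of those looks roughly like this (the addresses and the tunlink here are placeholders, not my exact config):

config interface 'trunk1'
    option proto 'gretap'
    # local and remote tunnel endpoints (placeholder addresses)
    option ipaddr '192.0.2.1'
    option peeraddr '192.0.2.2'
    # bound to the wired uplink; adjust as needed
    option tunlink 'lan'
    option mtu '1500'

netifd then creates the gre4t-trunk1 device, and the @trunk1.# ports in the bridges become the gre4t-trunk1.# VLAN devices you see in the log.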
Any ideas?
FWIW there are no errors in the logs. If I boot 22.03.4 I get messages like the ones below, whereas with 23.05.0 gre4t-trunk1.12 simply never comes up:
# logread | egrep -i 'VLAN.*trunk[0-9]\.[1-9]'
Fri Aug 25 08:03:38 2023 daemon.notice netifd: VLAN 'gre4t-trunk0.8' link is up
Fri Aug 25 08:03:38 2023 daemon.notice netifd: VLAN 'gre4t-trunk0.4' link is up
Fri Aug 25 08:03:38 2023 daemon.notice netifd: VLAN 'gre4t-trunk0.3' link is up
Fri Aug 25 08:03:39 2023 daemon.notice netifd: VLAN 'gre4t-trunk1.3' link is up
Fri Aug 25 08:03:39 2023 daemon.notice netifd: VLAN 'gre4t-trunk1.12' link is up
Hmm, I'm currently running a GRE tunnel and it works fine on 23.05.0-RC3 (on a DSA switch). I see you're on a non-DSA switch? I think it should work even if it's not DSA.
I'm just missing some important information here:
What have you configured inside the GRE interface?
Can you show that part of the network configuration?
The only issue I had once: when editing the GRE interface in LuCI I kept the Network Interface field empty, and the bind interface as well, because filling them in caused issues on the master router. On the dumb AP I only added wan to the bind interface.
Also note that after you reboot OpenWrt the GRE kmods get loaded again, but... if you see a message that the trunk is not connected, that is normal; just ignore it, it should work fine.
You can validate this by looking at the RX and TX counters on the trunk interface.
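For example something like this (the device name is just an example, use your own trunk device):

# packet counters on the gretap device; if they keep increasing, traffic is flowing
cat /sys/class/net/gre4t-trunk1/statistics/rx_packets
cat /sys/class/net/gre4t-trunk1/statistics/tx_packets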
When I look at your snippets I notice you are missing the force link checkbox.
You can add it with: uci set network.trunk0.force_link='1' and then uci commit network. With force_link set, the interface stays up even when there is no link.
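So roughly this sequence, assuming the interface section is called trunk0 like in your snippet:

uci set network.trunk0.force_link='1'
uci commit network
# apply without a full reboot
/etc/init.d/network reload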
That's on my DSA configuration though; I'm not sure if with swconfig there is something equivalent, or whether on the bridge you need to select an option to keep the bridge up.
Here is an example I have on my wireless bridge device:
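Something along these lines -- a sketch with placeholder names, not a literal copy of my file:

config device
    option name 'br-wireless'
    option type 'bridge'
    # keep the bridge up even when it has no active ports
    option bridge_empty '1'
    option mtu '1500'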
Thanks. I think I had tried that one before, but I tried it again anyway; it doesn't help.
So I've found a clear, proximate symptom: I was about to tcpdump the GRE interface to see what was going on, and when running ifconfig -a I noticed that gre4t-trunk1.12 doesn't even exist: interface creation stopped at gre4t-trunk1.3 (to be clear, .4 - .11 aren't configured on that trunk, and trunk2 is disabled).
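For anyone following along, the quick check is just listing which devices actually got created (plain busybox commands):

# list the gretap devices and their VLAN sub-devices that actually exist
ls /sys/class/net/ | grep gre4t
# or, equivalently
ip link | grep -i gre4t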
Here's the config from my wireless bridge device that doesn't work (tried with and without bridge_empty):
But those symptoms make me think... It's not that you can't run two GRE instances, so that isn't the problem either.
Then there is MTU. I know from experience that MTU can cause really weird behaviour when it is too low or too high.
In your GRE interfaces I did see you override it to 1500; that's a good value, since 1500 is what the GRE documentation lists as a stable MTU.
But that doesn't say anything about the devices' MTU. On my DSA configuration each device and bridge device also has its own MTU, and I have set those to 1500 as well, even though I would have expected them to be overridden by the GRE interface.
You could try that and see if it affects the symptom you have?
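In /etc/config/network that would just mean adding an mtu option to the device stanza from your snippet, something like:

config device
    option type 'bridge'
    option name 'br-name'
    # explicit MTU on the bridge device itself
    option mtu '1500'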
1500 is the default, but I set it explicitly anyway and it didn't help. Also, in my experience MTU problems affect the ability to pass traffic, not whether the interface comes up. The bottom line is that it never even creates the interface.
I also tried much of the other stuff you had in your config, as well as adding it to trunk0, since the only difference between the VLAN that works and the one that doesn't is that the working one is on all of the tunnels / trunks. That didn't help, either.
At this point I'm out of ideas and need to revert since I have downstream stuff that is unhappy with being offline so long.
I totally get what you're saying (I've been a sysadmin on networked systems since 1987), but there are still four contradictory facts:
1. This exact same config works in <= 22.03.4
2. Some of the identically configured tunnels still work in 23.05.0
3. MTU problems typically manifest as full-sized packets not being passed, which is not the issue here
4. OpenWrt never even brings the interface up
As the device is aging (8 years old!), I'd rather not reflash it again, since I don't know how many reflashes it has left in it, unless I'm more confident of a "smoking gun" or an obvious debugging path.
Update: I tried 23.05.0-rc4 and got the same results.
Btw to @mattimat's point, I spent a couple of hours playing around with MTU settings on 22.03.4 and 23.05.0-rc3, including tcpdumping the traffic and ping -s # -M do, and verified that my MTU settings are functional. To understand why, read the later posts in [SOLVED] Gretap Tunnel and MTU-Size Problem about Layer 2 bridges. Basically by going with the default settings and not forcing MTU changes, larger GRE packets will be fragmented at the source and reassembled at the destination, which hurts performance but otherwise works just fine.
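For reference, this is the sort of check I mean, run from a host behind one of the APs toward a host on the wired side (addresses are placeholders; 1472 = 1500 minus 28 bytes of IP + ICMP headers):

# must succeed if the end-to-end MTU really is 1500
ping -M do -s 1472 192.168.1.1
# should fail locally with "message too long" if 1500 is the ceiling
ping -M do -s 1473 192.168.1.1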
And again, all this MTU discussion ignores that 23.05.0-rc# fails to even create the gre4t-trunk1.12 interface.
@vgaetera Thank you! I already renamed, reflashed, and verified it works.
You have no idea how much time I've wasted on this. It might have been obvious if I'd brought up, say, gre4t-trunk1.1 - gre4t-trunk1.19 and watched it stop creating interfaces at gre4t-trunk1.9, but of course I don't tunnel all my VLANs, just the ones used by the mesh APs, so the numbering is sparse.