Troubleshoot bridge mode connectivity

Hi there! So today I'm going for a bit of a long shot, as I think I'm in way over my head trying to figure this out.

A couple of months ago I moved to a new place, with the same ISP as the old one and the same account number and service tier, while keeping my folks' home connected with an identical installation and a new phone number.

The stack that I'm employing is a Minisforum MS-01 running Proxmox 9.1.5, with an SFP module to handle the ISP's fiber line to the premises and VirtIO NICs for both the LAN and the WAN, configured so that the VM working as a router doesn't even have to consider the VLAN tag that my ISP requires (Telmex México, so 881 if using bridge mode).

For better or for worse, my ISP still relies upon PPPoE to connect with them, and the only way to force them to hand over the credentials is to get our FCC equivalent involved and twist their arm a little, but fortunately that has been done and I'm able to get a link going.

For the time being, the installation at my folks' is running OPNsense, and I was intending to do the same and keep a GL.iNet travel router as my OpenWRT tinkering toy when going around. However, it seems that something's different at my new place and I've been struggling to get the Internet going in different ways for the last couple of months.

At first, I thought the issue might be with the way my MTU was being configured, so with a little back and forth I messed around with it and my MSS and got nowhere fast.
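For reference, the usual PPPoE arithmetic can be sketched as below. This is only an illustration, assuming a standard 1500-byte Ethernet MTU on the WAN side and no IP or TCP options; the function names are made up for the example, and 1464 is the value from the posted network.wan.mtu setting.

```python
# Rough MTU/MSS arithmetic for a PPPoE WAN link.
# Assumptions: 1500-byte Ethernet MTU, no IP or TCP options.
ETHERNET_MTU = 1500
PPPOE_OVERHEAD = 8   # 6-byte PPPoE header + 2-byte PPP protocol field
IPV4_HEADER = 20
IPV6_HEADER = 40
TCP_HEADER = 20


def pppoe_mtu(link_mtu: int = ETHERNET_MTU) -> int:
    """Largest IP packet that fits in one PPPoE frame on the given link."""
    return link_mtu - PPPOE_OVERHEAD


def tcp_mss(mtu: int, ipv6: bool = False) -> int:
    """TCP payload size that fits in one packet of the given MTU."""
    ip_header = IPV6_HEADER if ipv6 else IPV4_HEADER
    return mtu - ip_header - TCP_HEADER


if __name__ == "__main__":
    # Compare the PPPoE default with the value configured in the post.
    for mtu in (pppoe_mtu(), 1464):
        print(f"MTU {mtu}: MSS IPv4 {tcp_mss(mtu)}, MSS IPv6 {tcp_mss(mtu, ipv6=True)}")
```

The 802.1Q tag sits in the Ethernet header rather than in the payload, so it is normally not subtracted from the MTU here.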

Then I thought something about OPNsense itself could be a part of the problem, so I detached the virtual disk image and put OpenWRT in its place to see if things improved.

The initial giveaway symptom with AWS-hosted downloads went away, but after a couple of days the OpenWRT VM started having issues downloading the package index. Since yesterday I've gone back to my ISP's ONT in bridge mode in order to rule out the SFP module (an OCI DFP-34X-2C2) as part of the problem.

Now, regardless of the defined MTU, neither the VM nor any of the downstream devices seem to be able to connect to the outside world. It mainly shows up as DNS timeouts, even though I can ping the DNS servers' IPs, and the same thing happens with the ISP's own DNS servers.
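Since ping works while DNS times out, one way to narrow it down is to probe port 53 over UDP and TCP separately, against more than one resolver. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the servers to test are passed on the command line (for example the ISP resolver from the nslookup below plus any public resolver you trust).

```python
import socket
import struct
import sys


def build_query(name: str, qtype: int = 1, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query: recursion desired, one question, class IN."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    )
    return header + labels + b"\x00" + struct.pack("!HH", qtype, 1)


def probe_udp(server: str, name: str, timeout: float = 3.0) -> str:
    """Send one query over UDP and report whether any matching reply arrives."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            reply, _ = sock.recvfrom(4096)
        except socket.timeout:
            return "timeout"
        except OSError as exc:
            return f"error: {exc}"
    return "ok" if reply[:2] == query[:2] else "unexpected reply"


def probe_tcp(server: str, name: str, timeout: float = 3.0) -> str:
    """Send the same query over TCP (2-byte length prefix) and wait for a reply."""
    query = build_query(name)
    try:
        with socket.create_connection((server, 53), timeout=timeout) as sock:
            sock.sendall(struct.pack("!H", len(query)) + query)
            prefix = sock.recv(2)
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        return f"error: {exc}"
    return "ok" if len(prefix) == 2 else "connection closed"


if __name__ == "__main__":
    # Usage: python3 dnsprobe.py <server-ip> [<server-ip> ...]
    for server in sys.argv[1:]:
        print(server, "udp:", probe_udp(server, "openwrt.org"),
              "tcp:", probe_tcp(server, "openwrt.org"))
```

If UDP times out but TCP answers, or one resolver answers and another doesn't, that points at something more specific than a dead link.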

The AP I'm using and both the NICs and switches involved are able to handle traffic within the LAN without any issues; it's just things going through the gateway that seem to have problems.

And while I wouldn't rule out my ISP as the root cause of the issues, I'd like to be sure there's nothing on my end that's contributing to the problem before going after them for support.

My configuration and current diagnostics are as follows:

uci show network.wan
network.wan=interface
network.wan.device='eth1'
network.wan.proto='pppoe'
network.wan.username='REDACTED'
network.wan.password='REDACTED'
network.wan.ipv6='auto'
network.wan.reqprefix='64'
network.wan.norelease='1'
network.wan.mtu='1464'

uci show dhcp
dhcp.@dnsmasq[0]=dnsmasq
dhcp.@dnsmasq[0].domainneeded='1'
dhcp.@dnsmasq[0].localise_queries='1'
dhcp.@dnsmasq[0].rebind_protection='0'
dhcp.@dnsmasq[0].local='/lan/'
dhcp.@dnsmasq[0].domain='lan'
dhcp.@dnsmasq[0].expandhosts='1'
dhcp.@dnsmasq[0].cachesize='1000'
dhcp.@dnsmasq[0].authoritative='1'
dhcp.@dnsmasq[0].readethers='1'
dhcp.@dnsmasq[0].leasefile='/tmp/dhcp.leases'
dhcp.@dnsmasq[0].localservice='1'
dhcp.@dnsmasq[0].ednspacket_max='1232'
dhcp.@dnsmasq[0].dnsforwardmax='500'
dhcp.@dnsmasq[0].dhcpleasemax='250'
dhcp.@dnsmasq[0].sequential_ip='1'
dhcp.@dnsmasq[0].allservers='1'
dhcp.lan=dhcp
dhcp.lan.interface='lan'
dhcp.lan.start='100'
dhcp.lan.limit='150'
dhcp.lan.leasetime='12h'
dhcp.lan.dhcpv4='server'
dhcp.lan.dhcpv6='server'
dhcp.lan.ra='server'
dhcp.lan.ra_flags='managed-config' 'other-config'
dhcp.wan=dhcp
dhcp.wan.interface='wan'
dhcp.wan.ignore='1'
dhcp.odhcpd=odhcpd
dhcp.odhcpd.maindhcp='0'
dhcp.odhcpd.leasefile='/tmp/hosts/odhcpd'
dhcp.odhcpd.leasetrigger='/usr/sbin/odhcpd-update'
dhcp.odhcpd.loglevel='4'
dhcp.odhcpd.piofolder='/tmp/odhcpd-piofolder'

nslookup openwrt.org 189.233.14.29
;; connection timed out; no servers could be reached

Is there anything else I should be looking into?
Thanks in advance!

Proxmox (and other virtualization approaches) adds a lot of additional variables -- everything from the physical port handling to the host OS and the hypervisor's bridging of the ports to the VM, and a number of other elements that can cause serious difficulties in terms of getting a working setup.

I would remove all the extra complexity of your router and run bare metal. Either x86 on the device that is currently running the VMs, or use the travel router (assuming you have a means of converting the fiber to copper).

You mention VLAN and VirtIO. If you are somehow using a VLAN ID inside a VM and also creating a VLAN interface on the host with the same ID on the same interface, I think only the one on the host will work.

No kidding.
As a last resort I did a factory reset on the ONT and connected through it with my phone, my laptop and the VM.

Even though the VM gets both its own IPv4 and IPv6 address from the ONT's DHCP server, everything but the VM is currently able to browse with just the ONT as a gateway.

Frankly at this point I'm more curious about what led to the perfect storm (given there's no apparent difference between the two sites), but the way things are going it seems the root cause is closer to QEMU and Proxmox than to the rest of the networking stack using them.

To a degree that doesn't sound like a bad idea, but there are several services I wanted to run on the same device and IP address, like Caddy, the ESPHome Builder or UniFi OS, that are unlikely to behave properly under the bare Docker stack in OpenWRT, much less on FreeBSD if I were to go back to OPNsense for this machine.

If I truly needed to run everything bare metal, the computer I intend to employ for that is absurdly overspecced for doing only that on a SOHO network. At this time, I lack a media converter too, but given I might need one either way to debug the SFP module, it doesn't sound like too bad an idea to grab one anyway.

Maybe doing some PCIe passthrough to directly control the onboard networking from within the VM could provide different results, but given that not even USB devices seem to be passing through cleanly, perhaps another approach is worth considering.

That's right, though I only have the VLAN ID set up within the VM's hardware properties, so at least on my side there's no QinQ configuration for the gateway.

qm config 101
affinity: 0,2,4,6,8,10
agent: 1
balloon: 1
bios: ovmf
boot: order=virtio0
cores: 6
cpu: host,flags=+aes
efidisk0: local-zfs:vm-101-disk-0,efitype=4m,ms-cert=2023,pre-enrolled-keys=1,size=1M
machine: q35,viommu=virtio
memory: 16384
meta: creation-qemu=10.1.2,ctime=1765847625
name: Router
net0: virtio=BC:24:11:71:71:2B,bridge=vmbr0,mtu=1,queues=6
net1: virtio=BC:24:11:E3:DB:1E,bridge=vmbr1,mtu=1,queues=6,tag=881
numa: 0
onboot: 1
ostype: other
parent: Before_Redo
scsihw: virtio-scsi-single
serial0: socket
smbios1: uuid=0d71e6f2-31cd-460f-a211-5d1aa5c23009
sockets: 1
startup: order=1
tpmstate0: local-zfs:vm-101-disk-1,size=4M,version=v2.0
virtio0: local-zfs:vm-101-disk-2,iothread=1,size=256G
vmgenid: 51f45d13-e09c-4054-be20-6c782c2ff13a
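To make the two NIC lines above easier to read, here is a small sketch that splits a Proxmox netX line into its options and summarizes the MTU and VLAN handling. The helper names are hypothetical; the only Proxmox-specific fact it relies on is that mtu=1 on a VirtIO NIC means "inherit the MTU of the bridge" rather than a literal 1-byte MTU.

```python
def parse_net_line(line: str) -> dict:
    """Split a 'netX: key=value,key=value' line from qm config into a dict."""
    key, _, value = line.partition(":")
    options = {"id": key.strip()}
    for item in value.strip().split(","):
        name, _, val = item.partition("=")
        options[name.strip()] = val.strip()
    return options


def describe(net: dict) -> str:
    """Summarize how the host treats MTU and VLAN tagging for this NIC."""
    mtu = net.get("mtu")
    if mtu is None:
        mtu_note = "MTU not set (default)"
    elif mtu == "1":
        mtu_note = "MTU inherited from bridge"
    else:
        mtu_note = f"MTU forced to {mtu}"
    tag = net.get("tag")
    vlan_note = f"host tags traffic with VLAN {tag}" if tag else "untagged"
    return f"{net['id']} on {net.get('bridge', '?')}: {mtu_note}; {vlan_note}"


if __name__ == "__main__":
    lines = [
        "net0: virtio=BC:24:11:71:71:2B,bridge=vmbr0,mtu=1,queues=6",
        "net1: virtio=BC:24:11:E3:DB:1E,bridge=vmbr1,mtu=1,queues=6,tag=881",
    ]
    for line in lines:
        print(describe(parse_net_line(line)))
```

With this configuration the effective MTU of the WAN NIC is whatever vmbr1 is set to on the host, so that bridge (and its physical port) is worth checking alongside the MTU inside the VM.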

I don’t think I mean QinQ. Say you have a switch that has your LAN tagged as 881, and then you have, say, eth0 on your host.

If you create a VLAN interface for 881 off eth0, or put eth0 in vmbr0 and then make a VLAN interface 881 off vmbr0, it will cut off the VM from using that VLAN ID through that interface. I’m not familiar with Proxmox; I work around it by creating veth interfaces and creating the VLAN tag on one of them.

I guess that this seems like an unlikely issue for you.
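A quick way to rule the scenario above in or out is to look for a host-side VLAN sub-interface with the same ID. The sketch below scans the text of /etc/network/interfaces (where Proxmox keeps its network config) for dotted names such as vmbr1.881. It assumes the dotted naming style only, so interfaces declared through vlan-raw-device would be missed, and the sample config and interface names are invented for illustration.

```python
import re
from typing import List


def find_vlan_subinterfaces(interfaces_text: str, vlan_id: int) -> List[str]:
    """Return iface names of the form '<parent>.<vlan_id>' declared in the text."""
    pattern = re.compile(r"^\s*iface\s+(\S+\.%d)\s" % vlan_id, re.MULTILINE)
    return pattern.findall(interfaces_text)


# Hypothetical host config, for illustration only.
SAMPLE = """\
auto vmbr1
iface vmbr1 inet manual
    bridge-ports enp2s0f0np0
    bridge-stp off
    bridge-fd 0

auto vmbr1.881
iface vmbr1.881 inet manual
"""

if __name__ == "__main__":
    hits = find_vlan_subinterfaces(SAMPLE, 881)
    if hits:
        print("host already claims VLAN 881 on:", ", ".join(hits))
    else:
        print("no host-side VLAN 881 sub-interface found")
```

On a real host you would read the file contents instead of SAMPLE; an empty result supports the guess that this particular conflict isn't in play.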

The idea is not necessarily to scrap your entire idea of using that machine for VMs, but rather to eliminate variables. Running bare metal (at least temporarily) greatly simplifies the entire setup and provides way more insight into the situation -- namely whether there are any issues with the general ISP setup/config (i.e. hardware with the fiber line and the ONT, functional testing of the ONT, and all the way up to the WAN config within OpenWrt). This only needs to be temporary.

Or, yes, a media converter so you can get to copper and then connect the travel router would be the other viable option here.

While I understand the reasons that people like to virtualize routers, I have a number of reasons that I don't recommend it, including the variables/complexity that are introduced, making troubleshooting really difficult.

Okay, I think I follow. I'll go ahead and flash an image to a USB drive and try to boot from it. Worst case scenario, that'll confirm the problem is in the abstraction and not the hardware. If everything goes according to plan, I'll report back with the results of the test.

If you require drivers for your network hardware in that box, be sure to build a custom image with the necessary packages (using the firmware selector is the easiest way). This can avoid annoying chicken-or-egg situations.

I think I'm covered with the following package selection (the onboard networking is comprised of a pair of Intel I226 controllers, one being the LM variant and the other one the V variant):

base-files
bind-tools
ca-bundle
dnsmasq
dropbear
e2fsprogs
ethtool
firewall4
fstools
grub2-bios-setup
intel-microcode
kmod-amazon-ena
kmod-amd-xgbe
kmod-bnx2
kmod-button-hotplug
kmod-drm-i915
kmod-dwmac-intel
kmod-e1000
kmod-e1000e
kmod-forcedeth
kmod-fs-vfat
kmod-hid
kmod-hid-generic
kmod-igb
kmod-igc
kmod-input-evdev
kmod-ixgbe
kmod-nft-offload
kmod-r8169
kmod-tg3
kmod-usb-hid
kmod-usb-uhci
libc
libgcc
libustream-mbedtls
lm-sensors
logd
losetup
luci
mkf2fs
mtd
nano
netifd
nftables
odhcp6c
odhcpd-ipv6only
opkg
partx-utils
ppp
ppp-mod-pppoe
procd-ujail
resize2fs
rng-tools
smartmontools
tcpdump
uci
uclient-fetch
urandom-seed
urngd

After trying to build an image with these packages and booting off a USB drive, it seems that I can get to GRUB with no issues and control the menu with the keyboard, but as soon as I get to the OpenWRT dmesg output, the keyboard won't work even if the output recognizes when it's connected or disconnected.

Have you tried other USB ports and/or another keyboard?

Alternatively, you should be able to connect via ssh if you have the box connected to your network.

Yep, all the front and back ports appear to show the same behavior (the kernel log shows device connection and disconnection, but no keyboard input actually shows up on screen).

I'd have loved to try that instead, but it seems that the system is either not including the RJ45 ports in the LAN bridge or not detecting them in the first place.

What happens if you just try to boot the completely standard x86/64 image (i.e. no customization)?

So far it doesn't seem to have made a difference. All I see on screen is the path the device uses but neither the input nor the NICs work for now.

[41.482001] input: Logitech USB Receiver as /devices/pci0000:00/0000:00:14.0/usb3/3-8/3-8:1.1/0003:046D:C547.0005/input/input5

There's also a pair of entries for the keyboard and mouse profiles, though since I had to type that by hand, I'd prefer to avoid doing so whenever possible.

If it helps anything at all, I know for a fact that the keyboard I'm using does work both on the same computer's BIOS and GRUB2, plus EDK II through Coreboot and GRUB2 on another device (Lenovo ThinkCentre M900 Tiny), not to mention when using Proxmox in both of the aforementioned devices.