Kernel 5.4.96 broke untagged DSA VLANs on my mvebu routers

Some of the DSA related change from patch found its way into the 5.4.97 push...

1 Like

5.4.97 didn't fix the issue for me -- I only mentioned 5.4.96 because that is where the issue started. I'm a little bit surprised nobody else is complaining, it makes me wonder if the issue isn't really isolated to my setup somehow.

I saw that patch series you linked and although it seems like a similar issue I'm not sure it addresses this. I'm certainly a DSA/switching noob, however, so I'll give it a try without my patch reversion if it gets pulled into trunk.

I recommend you just apply that whole patchset that @anomeome linked.
I have a version rebased for 5.4.97 here.

I applied the patch on 5.4.97 and I get the same effect:

I have 3 network bridges that are configured as untagged VLANs using DSA. The first bridge that is set up works, traffic ends up being dropped on all the others (no network is set up, the activity lights are completely inactive).

Reverting the single patch noted in the first post is still the fix.

Maybe you have indeed stumbled upon a new issue... fun!

From the text of that patch it almost seems like something isn't configured right in the hardware and the bug that is fixed was allowing the switch to still work.

commit bc4e7277cc93b12b6c51b803c32b89700437d607
Author: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
Date:   Mon Jan 25 13:41:16 2021 +0100

    net: switchdev: don't set port_obj_info->handled true when -EOPNOTSUPP
    
    commit 20776b465c0c249f5e5b5b4fe077cd24ef1cda86 upstream.
    
    It's not true that switchdev_port_obj_notify() only inspects the
    ->handled field of "struct switchdev_notifier_port_obj_info" if
    call_switchdev_blocking_notifiers() returns 0 - there's a WARN_ON()
    triggering for a non-zero return combined with ->handled not being
    true. But the real problem here is that -EOPNOTSUPP is not being
    properly handled.
    
    The wrapper functions switchdev_handle_port_obj_add() et al change a
    return value of -EOPNOTSUPP to 0, and the treatment of ->handled in
    switchdev_port_obj_notify() seems to be designed to change that back
    to -EOPNOTSUPP in case nobody actually acted on the notifier (i.e.,
    everybody returned -EOPNOTSUPP).
    
    Currently, as soon as some device down the stack passes the check_cb()
    check, ->handled gets set to true, which means that
    switchdev_port_obj_notify() cannot actually ever return -EOPNOTSUPP.
    
    This, for example, means that the detection of hardware offload
    support in the MRP code is broken: switchdev_port_obj_add() used by
    br_mrp_switchdev_send_ring_test() always returns 0, so since the MRP
    code thinks the generation of MRP test frames has been offloaded, no
    such frames are actually put on the wire. Similarly,
    br_mrp_switchdev_set_ring_role() also always returns 0, causing
    mrp->ring_role_offloaded to be set to 1.
    
    To fix this, continue to set ->handled true if any callback returns
    success or any error distinct from -EOPNOTSUPP. But if all the
    callbacks return -EOPNOTSUPP, make sure that ->handled stays false, so
    the logic in switchdev_port_obj_notify() can propagate that
    information.
    
    Fixes: 9a9f26e8f7ea ("bridge: mrp: Connect MRP API with the switchdev API")
    Fixes: f30f0601eb93 ("switchdev: Add helpers to aid traversal through lower devices")
    Reviewed-by: Petr Machata <petrm@nvidia.com>
    Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk>
    Link: https://lore.kernel.org/r/20210125124116.102928-1-rasmus.villemoes@prevas.dk
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>
    Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

@SkewedZeppelin I take it that patch for 5.4.97 is an updated version for the last one

Have you checked to see if issue exists in 5.10.x

I have not. If it moves into "testing" on trunk I will make the switch.

There has been some similar history on this in the post-DSA era: luci issue 2798

mv88e6085 f1072004.mdio-mii:10 lan0: configuring for phy/gmii link mode
mv88e6085 f1072004.mdio-mii:10: p0: hw VLAN 1 already used by br-guest
mv88e6085 f1072004.mdio-mii:10 lan1: configuring for phy/gmii link mode
mv88e6085 f1072004.mdio-mii:10: p1: hw VLAN 1 already used by br-guest
mv88e6085 f1072004.mdio-mii:10 lan2: configuring for phy/gmii link mode
mv88e6085 f1072004.mdio-mii:10: p2: hw VLAN 1 already used by br-guest
mv88e6085 f1072004.mdio-mii:10 lan3: configuring for phy/gmii link mode
mv88e6085 f1072004.mdio-mii:10: p3: hw VLAN 1 already used by br-guest

Similar behavior is in my bootlog, both working and non-working versions. But I think clearing up this configuration so it doesn't send the -EONOTSUPP flag may be wise -- the question is how.

Sorry for the bump here, but 5.10 kernel was moved to mvebu for testing so I gave it a shot -- TL;DR is that the behavior is the same.

Only the first wired VLAN set up gets activity, all others fail. If I revert the patch (from upstream kernel) equivalent to 5.4.96, all is good.

Could you report it to linux-stable mailing list?

Could you report it to linux-stable mailing list?

If this was an issue with the kernel I would expect others would see this same issue. I'm <5% sure I have VLANs set up appropriately with the conversion to DSA and I don't have enough knowledge about what the switch driver needs to do here.

Are you using external scripts, or netifd to set things up.

I have the same problem and I let netifd handle everything.

I did however stumble upon a workaround while hunting down things to try on google last night:
Setting vlan_default_pvid on the bridge devices seems to help. I now have this in /etc/hotplug.d/iface/01-hack-vids:

[ "$ACTION" != ifup -a "$ACTION" != ifupdate ] && exit 0
if [ "$INTERFACE" = "foo" ]; then
	vid=1
elif [ "$INTERFACE" = "bar" ]; then
	vid=2
elif [ "$INTERFACE" = "baz" ]; then
	vid=3
else
	exit 0
fi

ip link set dev "br-$INTERFACE" type bridge vlan_default_pvid "$vid"

I have basically zero idea what I'm doing here, but at least with this things seem to work for now (that's on 5.10 testing kernel btw.)

1 Like

I let netifd handle everything -- I'm not sure where adding scripts is going to interfere and I can't find where network setup documentation has been updated with respect to DSA so I was waiting (patiently until something broke).

I'm not sure it's the entire solution but it definitely directly addresses this issue. The problem is vlan ids all default to the same value causing the warnings I posted a few steps above. The patch in question changed the logic in how the switch driver is passing messages -- prior to this the "setup handled" flag would get passed even if a "not supported" flag is returned from a deeper function call but after the patch this does not happen. I'm guessing the driver for these VLANs never finishes setting them up.

Although DSA isn't explicitly mentioned here I think the very last paragraph hints at how you can get rid of your script and fix this problem directly in setup: openwrt vlan switch configuration (specifically the dot-notation on ifname creating VLAN IDs 106 and 204 in this case.

config interface 'lan'
	option type 'bridge'
	option ifname 'eth0.106'
	option proto 'static'
	option ipaddr '192.168.1.1'
	option netmask '255.255.255.0'
 
config interface 'wan'
	option ifname 'eth0.204'
	option proto 'dhcp'

I have tested and this indeed clears the warning/error message in the kernel logs. I need to rebuild without deleting this patch to make sure it truly fixes this issue.

My bad -- I mostly ignored the documentation (linked page) because it's still addressing switch setup using the "switch" section in /etc/config/network. The VLAN information section is still applicable other than that.

EDIT: I still don't see how to configure tagged vs. untagged without the "switch" section. I don't see how to specify independently how cpu and lan/wan ports get specified as tagged or untagged.

2 Likes

I was playing around, trying to ferret out information on what can be done with netifd, as things have changed since I last poked around. In case it is of any use here is a snippet:

/etc/config/network

...

config device
	option type 'bridge'
	option name 'switch0'

config bridge-vlan
	option device 'switch0'
	option vlan '30'
	option ports 'lan3 lan4'
	option alias 'foo'

config bridge-vlan
	option device 'switch0'
	option vlan '20'
	option ports 'lan2 lan4'
	option alias 'bar'

config bridge-vlan
	option device 'switch0'
	option vlan '10'
	option ports 'lan1 lan4:*'
#	option ports 'lan1 lan2 lan3 lan4'
#	option ports 'lan1 lan2 lan3 lan4:t'
#	option ports 'lan1 lan2 lan3 lan4:*'
#	option ports 'lan1 lan2 lan3 lan4:t*'
	option alias 'lan'

###################################
config interface 'lan'
	option ifname 'switch0.lan'
#	option ifname 'switch0.foo'
	option proto 'static'
	option ipaddr '192.168.10.1'
	option netmask '255.255.255.0'
	option ip6assign '60'

...

root@mamba:/etc/config# bridge vlan
port              vlan-id  
lan4              10 PVID Egress Untagged
                  20 Egress Untagged
                  30 Egress Untagged
lan3              30 PVID Egress Untagged
lan2              20 PVID Egress Untagged
lan1              10 PVID Egress Untagged
switch0           10
                  20
                  30

just some info, I do not have anything connected to this device to test what is actually occurring. Some info to be found at PR4307 also.

1 Like

That looks very interesting, thanks. I'll have to read up and experiment some with this.

As a side note: I just tried your earlier suggestion and used: option ifname 'lan3.2' and that also appears to work for my usecase without any additional scripts

@anomeome

I believe the piece I'm missing is configuring the switch as a device via netifd. I'm not completely sure how to set it up as in a few cases where routers are buried on an internal network I'm pulling in the wan port as part of a VLAN, effectively using the router with 5 "lan" ports. Since these are on two different CPUs I'm guessing I need two switch devices and I'll need to investigate the syntax for this.

The thread you linked is informative and it's possible the information is in there. I likely won't have time to play with this until the weekend.

I think I have this resolved. I can run a warning/error free boot with working VLANs from trunk (no patch as in prior posts) with the following setup:

/etc/config/network

config device
    option type 'bridge'
    option name 'switch0'

config bridge-vlan
    option device 'switch0'
    option vlan '1'
    option ports 'lan1 lan2 lan3'

config bridge-vlan
    option device 'switch0'
    option vlan '2'
    option ports 'lan4'

config bridge-vlan
    option device 'switch0'
    option vlan '3'
    option ports 'wan'

config interface 'lan'
    option ifname 'switch0.1'
    option type 'bridge'
    option proto 'static'
    option ipaddr '10.10.1.1'
    option netmask '255.255.255.0'

config interface 'guest'
    option ifname 'switch0.2'
    option type 'bridge'
    option proto 'dhcp'

config interface 'iot'
    option ifname 'switch0.3'
    option type 'bridge'
    option proto 'static'
    option ipaddr '10.10.3.1'
    option netmask '255.255.255.0'

No changes to wireless config, which attach to all 3 networks. I have a few other configurations where wan is bridged with a lan port, I have yet to test that. Also, per the PR anomeome linked for now we still need "bridge" specified in the interface config although it appears at some point this will be going away.

This syntax is slightly different than that shown in the PR -- I couldn't get the PR syntax to work (using list for lan1, lan2, etc).

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.