How can I clear a NO_DEVICE error on an interface? (OpenWRT 18.06.5 on MT7628 with BG96 USB modem)

detly · December 4, 2021, 2:21pm

Summary: Network interfaces whose underlying devices are power cycled sometimes get "stuck" with a NO_DEVICE error. After that, ifup has no effect until the entire network stack is restarted with service network restart. I'm looking for a less disruptive way to clear the error on these "stuck" interfaces.

I have an MT7628-based device (VoCore2) that also has a Quectel BG96 USB modem connected to it. It runs an OpenWRT 18.06.5 custom build. The modem's power can be controlled via GPIO pins. As part of some application code I'm writing, I need to be able to power cycle the modem and bring it up using ifup bg96 without stopping other network devices.

Most of the time, it works fine. What I've found is that under some circumstances, ifup will seem to stop "working" on this device. When this happens, I stop my application code and try a few operations manually, and this is what I find.

ifup bg96 succeeds (exit status 0) without error message (I assume because it just dispatches the real work via ubus)
the bg96 interface never comes up and no new routes are added
there is no mention of this interface, or of ifup, in the syslog (as viewed with logread)
ifstatus shows the following:

{
	"up": false,
	"pending": false,
	"available": false,
	"autostart": true,
	"dynamic": false,
	"proto": "qmi",
	"data": {
		
	},
	"errors": [
		{
			"subsystem": "interface",
			"code": "NO_DEVICE"
		}
	]
}
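For anyone wanting to detect this state from a script, here is a minimal sketch. It assumes ifstatus prints the error in the same textual form as the output above; is_stuck_no_device is a made-up helper name, not part of OpenWRT.

```shell
#!/bin/sh
# Hypothetical helper (not part of OpenWRT): succeeds if ifstatus reports
# a NO_DEVICE error for the given interface.
# Assumption: the error appears as "code": "NO_DEVICE" in the ifstatus JSON,
# as in the output quoted above. This is a plain text match, not a JSON parse.
is_stuck_no_device() {
	ifstatus "$1" 2>/dev/null | grep -Eq '"code":[[:space:]]*"NO_DEVICE"'
}

# Example use:
#   is_stuck_no_device bg96 && echo "bg96 is stuck"
```

A text match is crude but avoids depending on a JSON tool; a proper JSON query would be more robust if the output format ever changes.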

At this point, if I run service network restart, ifup works again. But I cannot restart the entire network from my application code every time I power cycle the modem, and unfortunately service network reload does not clear the problem.

I assume that ifup is failing because of the NO_DEVICE error persisting on the interface. If that's correct, then I'm looking for a less drastic way to clear the error and try to bring the interface up again. Does such a thing exist?

I dug around in OpenWRT's netifd source, and found some code relating to clearing errors. But I also see this comment:

/* don't flush the errors in case the configured protocol handler matches the
    running protocol handler and is having the last error capability */

Is the "last error capability" the thing that's causing me trouble? Can I disable it somehow? What is it for? Searching around it seems like it's related to PPP interfaces, but the BG96 uses QMI, so...?

Or am I going down the wrong path with the last error stuff?

Here is the network interface configuration for the bg96 device:

network.bg96=interface
network.bg96.auto='true'
network.bg96.proto='qmi'
network.bg96.device='/dev/bg96_gsm'

The device /dev/bg96_gsm is a symlink to one of the /dev/cdc-wdm0 devices, set up by a hotplug script that checks the USB vendor/device IDs. It definitely exists and is ready for use when I run the commands above manually. It's possible that the application code runs ifup so soon after power-up that OpenWRT thinks the device doesn't exist yet. I'd like to avoid that race in the first place, but I'm also looking for a way to recover from it without restarting the whole network service.

Any pointers or advice appreciated.

detly · December 6, 2021, 2:58am

The plot thickens:

I can't find anywhere in netifd where the interface's error list is actually checked, except for the purpose of displaying it via ubus. It certainly doesn't seem to block an interface from being brought up. I assume, then, that a protocol script checks it before proceeding, but I can't find any evidence of that happening either.
The errors should be flushed when an interface is brought up, unless the protocol is configured with lasterror=1. It doesn't look like QMI does this though.

Edit: in the QMI protocol script (uqmi/files/lib/netifd/proto/qmi.sh) there's this, which seems to be how the errors appear. I'm still no wiser as to how to clear them.

device="$(readlink -f $device)"
[ -c "$device" ] || {
	echo "The specified control device does not exist"
	proto_notify_error "$interface" NO_DEVICE
	proto_set_available "$interface" 0
	return 1
}

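It looks like the real problem is the proto_set_available "$interface" 0 call rather than the error itself: if I can just set available back to true somehow, the interface should come good again. But I can't see how to do this.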

detly · December 6, 2021, 3:36am

After more study of the netifd and QMI protocol code, I managed to reverse engineer a command that seems to work by telling netifd to make the device 'available' again:

ubus call network.interface.bg96 notify_proto '{"action": 5, "available": true}'

I'm not sure if there's a more canonical way to do this kind of manipulation, so if there is (or if there's an issue with this command that I'm not aware of), please let me know!
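In case it helps anyone else, this is roughly how the workaround could be wrapped for use after a power cycle. It is only a sketch: wait_for_chardev and recover_interface are names I made up, the 30 second default timeout is arbitrary, and the only netifd interaction is the notify_proto call above followed by ifup.

```shell
#!/bin/sh
# Sketch of a recovery helper. The function names and the default timeout
# are assumptions for illustration, not anything provided by OpenWRT.

# wait_for_chardev PATH SECONDS
# Polls once per second until PATH resolves to a character device,
# mirroring the readlink/-c check in qmi.sh. Fails after SECONDS tries.
wait_for_chardev() {
	dev="$1"
	remaining="$2"
	while [ "$remaining" -gt 0 ]; do
		[ -c "$(readlink -f "$dev")" ] && return 0
		sleep 1
		remaining=$((remaining - 1))
	done
	return 1
}

# recover_interface IFACE DEVICE [SECONDS]
# Waits for the control device, marks the interface available again
# (same call as the manual workaround), then asks netifd to bring it up.
recover_interface() {
	iface="$1"
	dev="$2"
	timeout="${3:-30}"
	wait_for_chardev "$dev" "$timeout" || return 1
	ubus call "network.interface.$iface" notify_proto \
		'{"action": 5, "available": true}' || return 1
	ifup "$iface"
}

# Example use after powering the modem back on:
#   recover_interface bg96 /dev/bg96_gsm 30
```

Waiting for the device node first should also avoid hitting the NO_DEVICE path in qmi.sh straight away again, which is what I suspect my application code was doing.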