Summary: I'm finding that network interfaces related to devices that are power cycled are getting "stuck" in a state where they have a `NO_DEVICE` error, after which `ifup` will simply not work, at all, until the entire network stack is restarted with `service network restart`. I'm looking for a better way to clear the errors on these "stuck" devices.
I have an MT7628-based device (VoCore2) that also has a Quectel BG96 USB modem connected to it. It runs an OpenWRT 18.06.5 custom build. The modem's power can be controlled via GPIO pins. As part of some application code I'm writing, I need to be able to power cycle the modem and bring it up using `ifup bg96` without stopping other network devices.
Most of the time, it works fine. What I've found is that under some circumstances, `ifup` will seem to stop "working" on this device. When this happens, I stop my application code and try a few operations manually, and this is what I find:
- `ifup bg96` succeeds (exit status 0) without error message (I assume because it just dispatches the real work via `ubus`)
- the `bg96` interface never comes up and no new routes are added
- there is no mention of this interface, or of `ifup`, in the syslog (as viewed with `logread`)
- `ifstatus` shows the following:
{
    "up": false,
    "pending": false,
    "available": false,
    "autostart": true,
    "dynamic": false,
    "proto": "qmi",
    "data": {
    },
    "errors": [
        {
            "subsystem": "interface",
            "code": "NO_DEVICE"
        }
    ]
}
At this point, if I run `service network restart`, the network service will restart and, once again, `ifup` will work. But I cannot simply restart the entire network in my application code every time I power cycle the device. Unfortunately, `service network reload` does not work.
I assume that `ifup` is failing because of the `NO_DEVICE` error persisting on the interface. If that's correct, then I'm looking for a less drastic way to clear the error and try to bring the interface up again. Does such a thing exist?
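For illustration, the stuck state shown in the `ifstatus` output above could be detected from a script along these lines. This is only a minimal sketch: `iface_no_device` is a made-up helper name, and it greps the `ifstatus` text rather than parsing the JSON properly, which assumes the error is printed in the same `"code": "NO_DEVICE"` form as above.

```shell
#!/bin/sh
# Hypothetical helper (name is illustrative, not part of OpenWRT).
# Exits 0 if the interface named in $1 currently reports a NO_DEVICE error
# in its ifstatus output, and non-zero otherwise.
# Assumption: matching the text with grep is good enough for a quick check;
# a real JSON parser would be more robust.
iface_no_device() {
    ifstatus "$1" 2>/dev/null | grep -q '"code": *"NO_DEVICE"'
}
```

It could be used as `iface_no_device bg96 && echo "bg96 is stuck"`, so that the heavy `service network restart` fallback is only taken when the interface is really in this state.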
I dug around in OpenWRT's `netifd` source, and found some code relating to clearing errors. But I also see this comment:
/* don't flush the errors in case the configured protocol handler matches the
running protocol handler and is having the last error capability */
Is the "last error capability" the thing that's causing me trouble? Can I disable it somehow? What is it for? Searching around it seems like it's related to PPP interfaces, but the BG96 uses QMI, so...?
Or am I going down the wrong path with the last error stuff?
Here is the network interface configuration for the `bg96` device:
network.bg96=interface
network.bg96.auto='true'
network.bg96.proto='qmi'
network.bg96.device='/dev/bg96_gsm'
The device `/dev/bg96_gsm` is a symlink to one of the `/dev/cdc-wdm0` devices, set up by a hotplug script that checks the USB vendor/device IDs. It definitely exists and is ready for use when I run the commands above manually. It's possible the application code does things fast enough to cause OpenWRT to think the device doesn't exist yet, and so (as well as avoiding it in the first place) I'm looking for a way to recover from that without restarting the whole network service.
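If the race described above is the trigger, one way to avoid it would be to wait for the device node to appear before calling `ifup`. The following is a minimal sketch under that assumption: `wait_for_chardev` is a hypothetical helper name, the 30-second default is arbitrary, and whether this actually prevents the `NO_DEVICE` state is untested.

```shell
#!/bin/sh
# Hypothetical helper (name and default timeout are illustrative).
# Waits up to $2 seconds (default 30) for $1 to exist as a character device.
# [ -c ] follows symlinks, so a hotplug-created symlink such as /dev/bg96_gsm
# only passes once the node it points to exists as well.
wait_for_chardev() {
    node=$1
    remaining=${2:-30}
    while [ ! -c "$node" ]; do
        # Give up once the time budget is used up.
        [ "$remaining" -gt 0 ] || return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

After powering the modem back on, the application could then run `wait_for_chardev /dev/bg96_gsm 30 && ifup bg96` instead of calling `ifup bg96` immediately.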
Any pointers or advice appreciated.