Master-nginx: Boot hangs at loading kernel after sysupgrade

I have multiple mvebu routers, various makes and models. After sysupgrade, several of these won't boot (or boot strangely).

I hooked up serial to one and after a successful sysupgrade message I get this:

umount: can't unmount /dev: Resource busy
umount: can't unmount /tmp: Resour

I'm guessing this last is cut short due to a reboot, so there may have been more messages coming. In any case, the restart proceeds until "loading kernel" and hangs there. The kernel info is correct, new build date, etc and checksums OK.

I'm seeing this on WRT1900ACS and WRT3200ACM. Interestingly, not on WRT1200AC which I believe is same hardware as WRT1900ACS, just 2 antennas instead of 4.

Any ideas on how to troubleshoot? I tried flashing both img and bin, same result.

After some config trial and error I get a loading kernel -- see the diff below, I'm not sure which is the culprit or why I was messing with these. Either they were fat-fingered or they've been there a while and something changed making one of them matter. (I'm guessing the latter because I check diffconfigs after most upgrades and didn't notice these as changes.)

However, the boot is now hanging at procd: ubus. I have added all new users (ntp, logd, ubus) to /etc/passwd, shadow and group. I also added the socket config to /etc/config/rpcd just in case (see newest img upgrade failed). Any other config changes that matter? For the record I sysupgraded without keeping changes and the boot is clean -- if it comes down to it I can reconfigure from scratch, I just figured this should have been a straightforward change.

I still get the umount errors (see orig message) when sysupgrading -- not sure if this is related.

Config diff, from non-booting to booting kernel:

< CONFIG_KERNEL_DEBUG_LL=y
< CONFIG_KERNEL_DEBUG_LL_UART_NONE=y
< # CONFIG_KERNEL_KALLSYMS is not set
< # CONFIG_KERNEL_MEMCG_SWAP is not set
> # CONFIG_KERNEL_SWAP is not set

Increasing debug level for boot logging reveals this culprit:

[   12.508718] procd: Finished hotplug exec instance, pid=1139
[   12.514507] procd: Launched hotplug exec instance, pid=1144
sh: write error: No such device
[   12.525157] procd: Finished hotplug exec instance, pid=1144
[   13.009273] procd: Coldplug complete
[   13.012878] procd: Change state 1 -> 2
[   13.016665] procd: - watchdog -
[   13.019833] procd: Ping
[   13.022304] procd: Opened watchdog with timeout 30s
[   13.027267] procd: - ubus -
[   13.030172] procd: Create service ubus
[   13.034022] procd: Start instance ubus::instance1
[   13.038907] procd: Started instance ubus::instance1[1149]
[   13.044412] procd: running /etc/init.d/ubus running
[   13.049384] procd: glob failed on /etc/init.d/ubus
[   13.054238] procd: Instance ubus::instance1 exit with error code 65280 after 0 seconds
[   13.079366] procd: Connection to ubus failed

The router from this point continually retries to start ubus, failing each time.

I see no /etc/init.d/ubus on any of my routers, even the ones that boot past this point. I also don't see this file installed or created in the package/system/ubus/Makefile. Any idea what's going on here?

in future, be a good sport and state clearly what versions you are working with... cheers...

Thanks for the response. I will update the original post to state I'm building from master and this seems related to the updates over the last few weeks to run logd, ubus, ntp as non-root.

So I had linked to the post in your first response indicating I had made those changes. I use nginx not uhttpd so I believe the only one that matters is the rpcd config. In that case I just pre-updated the config as it doesn't seem to matter for builds prior to the non-root upgrades. With a uci get rpcd.@rpcd[0].socket I get a return of /var/run/ubus/ubus.sock. On the off-chance it matters I updated the uhttpd config to include the socket change, with no luck.

I'm guessing there's something else legacy that matters in my config because I get a clean boot if I don't keep changes.

fgrep -r sock /sbin/ubusd /etc/ /rom | grep 'run/ubus'; find /tmp/ | grep ubus.sock; grep BUILD_ID /etc/os-release

Results -- this is from the version from which I am upgrading (I modified the /rom to /rom/etc as the command never finishes with just /rom):

fgrep -r sock /sbin/ubusd /etc/ /rom/etc/ | grep '/run/ubus'; find /tmp/ | grep ubus.sock; grep BUILD_ID /etc/os-release
fgrep: /etc/localtime: No such file or directory
/sbin/ubusd:/var/run/ubus.sock
/etc/config/rpcd:       option socket '/var/run/ubus/ubus.sock'
/etc/config/uhttpd:     option ubus_socket '/var/run/ubus/ubus.sock'
fgrep: /etc/ppp/resolv.conf: No such file or directory
/etc/nginx/conf.d/luci.locations:        ubus_socket_path /var/run/ubus.sock;
/etc/nginx/luci_uwsgi.conf:        ubus_socket_path /var/run/ubus/ubus.sock;
/rom/etc/config/rpcd:   option socket /var/run/ubus.sock
fgrep: /rom/etc/localtime: No such file or directory
fgrep: /rom/etc/ppp/resolv.conf: No such file or directory
/rom/etc/uci-defaults/60_nginx-luci-support:        ubus_socket_path /var/run/ubus.sock;
/tmp/run/ubus.sock
BUILD_ID="r14721+18-4ff7bdfeeb"

The uci-defaults for nginx updates a file that I do not reference in my nginx config -- I reference the luci_uwsgi.conf file in /etc/nginx.

ok, so that's half of the picture...

that's unchartered territory...

what is a non-root upgrade? and why exactly does it not matter?

either way... please upgrade without keeping settings then manually re-apply what you need to... no point digging a hole to china when you can fly...

I was referring to the series of commits that modify ntp, logd, ubusd to run as non-root. It doesn't seem to matter because even with that line in the config, prior to these commits the ubus socket is still at /var/run/ubus.sock.

I may eventually bite the bullet and do what you're suggesting but I'm also inclined to understand what's going on here. Not being familiar with how procd, ubus etc interact the code is hard to walk through. I can't figure out why procd is reporting that ubus is trying to access /etc/init.d/ubus -- whether it does that all the time regardless and the failure to find it doesn't matter, etc.

Is it the fail on /etc/init.d/ubus that is the problem or is it something else that fails and emits the error code? I can find where the message is printed and what calls it, but I can't find the definitions for what 65280 means.

Below is what happens during a "working" boot. It fails the /etc/init.d/ubus access but looks like there are follow-on items that include reading random numbers.

[   12.374312] procd: - ubus -
[   12.377142] procd: Create service ubus
[   12.380939] procd: Start instance ubus::instance1
[   12.385835] procd: Started instance ubus::instance1[1108]
[   12.391306] procd: running /etc/init.d/ubus running
[   12.398885] procd: glob failed on /etc/init.d/ubus
[   12.406008] random: ubusd: uninitialized urandom read (4 bytes read)
[   12.428580] random: ubusd: uninitialized urandom read (4 bytes read)
[   12.435054] random: ubusd: uninitialized urandom read (4 bytes read)
[   12.442179] procd: Connected to ubus, id=24685d4c
[   12.446939] procd: Change state 2 -> 3

From ubus code: ubusd_id.c


  40 void ubus_init_id_tree(struct avl_tree *tree)
  41 {
  42         if (random_fd < 0) {
  43                 random_fd = open("/dev/urandom", O_RDONLY);
  44                 if (random_fd < 0) {
  45                         perror("open");
  46                         exit(1);
  47                 }
  48         }
  49 
  50         avl_init(tree, ubus_cmp_id, false, NULL);
  51 }

This appears to call above code, accouting for 2 of the randoms: ubusd_obj.c

  227 static void __constructor ubusd_obj_init(void)
 228 {
 229         ubus_init_id_tree(&objects);
 230         ubus_init_id_tree(&obj_types);
 231         ubus_init_string_tree(&path, false);
 232         ubusd_event_init();
 233         ubusd_acl_init();
 234         ubusd_monitor_init();
 235 }

The third area that calls the init_id routine doesn't appear to get specifically called by any code but is a constructor so gets called automatically when ubus is started I'm guessing. These are the only places /dev/random is read so either this fails or something before the random reads that doesn't emit any debug text.

So I have figured out the files (possibly not all) that matter here.

  1. Sysupgrading when /etc/shadow, /etc/passwd and /etc/group are included fails to boot, halting when procd starts ubus.
  2. Sysupgrading without keeping changes, and replacing just passwd/shadow/group with those contained in the sysupgrade also fails to boot in the same manner. I replaced these one at a time subsequently: group didn't matter (booted fine after replacing), replacing passwd caused the boot to hang. I did not reflash and try shadow.
  3. Sysupgrading without keeping changes, and replacing everything except these 3 files boots fine.

The only real difference in passwd/shadow is an additional user I add for samba.

Additional testing: editing the files to add the samba user I mentioned above works fine -- I add the user and the router reboots just fine. There seem to only be issues with overwriting passwd (and possibly shadow) through copy/move or tar. Strange?

As mentioned above I'm flashing from r14721+18-4ff7bdfeeb to latest master, right now r14887+17-a47279154e.

So I believe I have this resolved. I still don't understand which commit caused the issue but keeping files was broken at r14721 and as of r14898+17-c02096361c it seems to no longer be an issue.

To upgrade and get a booting router:

  • save off a config, external to router
  • sysupgrade -n (not keeping files)
  • ssh into router, transfer config over and extract to a temp location
  • in the temp location, move /etc/passwd,shadow,groups to a different temp location
  • cp all remaining config files recursively to their proper locations
  • edit passwd/shadow/groups manually to add info from the files reserved from the config

This then boots and is once again sysupgradable keeping config.

This topic was automatically closed 10 days after the last reply. New replies are no longer allowed.