Ubus stops working - Failed to connect to ubus

Hi,

I am little bit clueless now. After few minutes on my Archer C7 v4, running OpenWrt Git master, the ubusd stops working. When I want to run ubus list, it writes Failed to connect to ubus. I am not able to get logread (same error) unless it was running early from startup. I tried to debug it to get the point where it is stuck, so I added unstripped /lib/libc.so and /sbin/ubusd libraries, but backtrace shows:

root@master-c7:/etc/config# gdb /sbin/ubusd 606
GNU gdb (GDB) 8.3.1
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "mips-openwrt-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /sbin/ubusd...
Attaching to program: /sbin/ubusd, process 606
Reading symbols from /lib/ld-musl-mips-sf.so.1...
0x77ea3718 in __syscall6 (n=4313, f=16, e=0, d=1, c=10, b=2011685896, a=4) at ./arch/mips/syscall_arch.h:114
114     ./arch/mips/syscall_arch.h: No such file or directory.
(gdb) bt
#0  0x77ea3718 in __syscall6 (n=4313, f=16, e=0, d=1, c=10, b=2011685896, a=4) at ./arch/mips/syscall_arch.h:114
#1  epoll_pwait (fd=4, ev=0x77e7e408, cnt=10, to=1, sigs=0x77e86030) at src/linux/epoll.c:27
#2  0x77ea37a4 in epoll_wait (warning: GDB can't find the start of the function at 0x5557afff.

    GDB is unable to find the start of the function at 0x5557afff
and thus can't determine the size of that function's stack frame.
This means that GDB may be unable to access that stack frame, or
the frames below it.
    This problem is most likely caused by an invalid program counter or
stack pointer.
    However, if you think GDB should simply search farther back
from 0x5557afff for code which looks like the beginning of a
function, you can increase the range of the search using the `set
heuristic-fence-post' command.
fd=<optimized out>, ev=<optimized out>, cnt=<optimized out>, to=<optimized out>) at src/linux/epoll.c:36
#3  0x5557b000 in ?? ()

So basically nothing interresting. Ubusd unfortunatelly has no logging, so how to actually debug this? Why ubusd could be rejecting connections?

So basically nothing interresting.

One usually get better (and meaningful) results with gdb server.

Ubusd unfortunatelly has no logging

Spitting some DEBUG() is usually the fastest way of debugging. I usually resort to gdb if everything else fails or need to see the stuff in the memory.

I recompiled with CONFIG_PKG_ASLR_PIE_NONE=y, because the gdb did not see all the libraries and this could be the reason. Now I got better info - I wanted to check uloop status:

...
Reading symbols from /sbin/ubusd...
Attaching to program: /sbin/ubusd, process 607
Reading symbols from /lib/libubox.so...
Reading symbols from /lib/libblobmsg_json.so...
Reading symbols from /usr/lib/libjson-c.so.4...
Reading symbols from /lib/libgcc_s.so.1...
Reading symbols from /lib/ld-musl-mips-sf.so.1...
0x77eb0ef0 in __syscall6 (n=4313, f=16, e=0, d=1, c=20, b=2011743240, a=4) at ./arch/mips/syscall_arch.h:114
114     ./arch/mips/syscall_arch.h: No such file or directory.
(gdb) print cur_fds
$1 = {{fd = 0x414000 <server_fd>, events = 1}, {fd = 0x9594c0, events = 1}, {fd = 0x959720, events = 1}, {
    fd = 0x91c190, events = 1}, {fd = 0x91b810, events = 1}, {fd = 0x91b350, events = 1}, {fd = 0x91ae90,
    events = 1}, {fd = 0x91a770, events = 1}, {fd = 0x91a050, events = 1}, {fd = 0x919930, events = 1}, {
    fd = 0x919210, events = 1}, {fd = 0x918890, events = 1}, {fd = 0x918170, events = 1}, {fd = 0x917a50,
    events = 1}, {fd = 0x917350, events = 1}, {fd = 0x916950, events = 1}, {fd = 0x77f35d00, events = 1}, {
    fd = 0x945c60, events = 1}, {fd = 0x91e530, events = 1}, {fd = 0x91dbb0, events = 1}}
(gdb) print cur_fd
$2 = 0
(gdb) print cur_nfds
$3 = 0

The value cur_nfds = 0 is suspicious, maybe gdb is wrong again... I will check it during time when it works fine (after restart) - later today.

Note: I increased the size of uloop array from 10 to 20 just to be sure.

I will definitely try some debug logging later.

Something really interesting:

root@master-c7:~# netstat -np | grep ubusd | wc -l
netstat: showing only processes with your user ID
1015
root@master-c7:~# netstat -np | grep dnsmasq | wc -l
netstat: showing only processes with your user ID
1014

The connections to ubusd are created by dnsmasq, so there might be some error in dnsmasq, which keeps connections to ubusd.

Recently there was an update for dnsmasq from 2.80 to 2.81rc, so there might be a bug there.

1 Like

Ok, one resource leak fixed in dnsmasq itself in ubus_init - missing ubus_free. I will create a fix later. Next issues on list are why ubus_add_object failed and why same dnsmasq is started three times during boot (same config, same arguments...)...

Patch for dnsmasq: http://lists.thekelleys.org.uk/pipermail/dnsmasq-discuss/2020q1/013900.html

Reason for ubus_add_object failing is that the dnsmasq object already exists. There is no network suffix like in hostapd and wpa_supplicant cases (see ubus list output to see what I mean - single dnsmasq for first instance, but multiple hostapd.* and wpa_supplicant.*). The --enable-ubus option in /etc/init.d/dnsmasq is useless when you have more dnsmasq instances, the result is that dnsmasq for each query retries to open the connection to ubus and fails.

Solution could be either to remove --enable-ubus from OpenWrt init script, or enhance dnsmasq command line by allowing to specify ubus name (to include some unique suffix).