Procd process is deadlocked

In the process of using 1907, I found that procd occasionally had deadlocks. The following is my system information

NAME="OpenWrt"
VERSION="19.07.0-rc1"
ID="openwrt"
ID_LIKE="lede openwrt"
PRETTY_NAME="OpenWrt 19.07.0-rc1"
VERSION_ID="19.07.0-rc1"
HOME_URL="https://openwrt.org/"
BUG_URL="https://bugs.openwrt.org/"
SUPPORT_URL="https://forum.openwrt.org/"
BUILD_ID="r10649-c4fdb377a2"
OPENWRT_BOARD="ath79/nand"
OPENWRT_ARCH="mips_24kc"
OPENWRT_TAINTS="busybox"
OPENWRT_DEVICE_MANUFACTURER="OpenWrt"
OPENWRT_DEVICE_MANUFACTURER_URL="https://openwrt.org/"
OPENWRT_DEVICE_PRODUCT="Generic"
OPENWRT_DEVICE_REVISION="v0"
OPENWRT_RELEASE="OpenWrt 19.07.0-rc1 r10649-c4fdb377a2"

While booting to the firewall, the system stopped booting, and the following processes were running. I waited 30 minutes, but the processes remained unchanged

  PID USER       VSZ STAT COMMAND
    1 root      1572 S    /sbin/procd
    2 root         0 SW   [kthreadd]
    4 root         0 IW<  [kworker/0:0H]
    6 root         0 IW<  [mm_percpu_wq]
    7 root         0 SW   [ksoftirqd/0]
    8 root         0 IW   [kworker/u2:1]
   89 root         0 SW   [oom_reaper]
   90 root         0 IW<  [writeback]
   92 root         0 SW   [kcompactd0]
   93 root         0 IW<  [crypto]
   95 root         0 IW<  [kblockd]
  109 root         0 IW<  [watchdogd]
  122 root         0 IW   [kworker/0:1]
  130 root         0 SW   [kswapd0]
  195 root         0 SW   [spi0]
  305 root         0 IW<  [ipv6_addrconf]
  321 root         0 SW   [ubi_bgt0d]
  326 root         0 IW<  [kworker/0:1H]
  396 root         0 IW   [kworker/0:2]
  426 root         0 SW   [scsi_eh_0]
  427 root         0 IW<  [scsi_tmf_0]
  428 root         0 SW   [usb-storage]
  493 root         0 SW   [ubifs_bgt0_1]
  575 root      1208 S    /sbin/ubusd
  705 root      1028 S    /sbin/urngd
  973 root         0 IW<  [cfg80211]
 1020 root         0 IW<  [ath10k_wq]
 1021 root         0 IW<  [ath10k_aux_wq]
 1929 root      1080 S    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 1931 root      1352 S    /bin/sh /etc/rc.common /etc/rc.d/S19firewall boot
 1943 root      2032 R    fw3 -q start
 2189 root      1684 S    /sbin/netifd
 2235 root         0 IW   [kworker/u2:0]
 2301 root      1272 S    /bin/sh /sbin/hotplug-call iface
 2502 dnsmasq   2752 S    /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
 2614 root      3520 S    /usr/bin/gltertf
 2646 root      1276 S    /bin/sh /sbin/hotplug-call iface
 2647 root      1348 S    /bin/sh /etc/rc.common /etc/init.d/firewall enabled
 2662 root      1228 S    flock 1000
 2704 root      1788 S    /usr/sbin/hostapd -s -P /var/run/wifi-phy1.pid -B /var/run/hostapd-phy1.conf
 2707 root      1768 S    /usr/sbin/hostapd -s -P /var/run/wifi-phy0.pid -B /var/run/hostapd-phy0.conf
 3125 root      1148 S    /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
 3202 root      1236 S    -ash
 6640 root      1248 S    /sbin/logd -S 64
15851 root      1228 R    ps

I saw the process flock 1000 and found it in the procd_lock function in /lib/functions/procd.sh

how occasionally is occasionally?

i'm thinking you'll need to test this on rc2... now that it's out...

It's a very small chance, about a 1 in 100 chance.
I tried to change flock 1000 to flock -n 1000, which is currently being tested.
I'll test rc2
Thanks.

something like this in /etc/init.d/debugboot may assist to track down issues... ( seems hotplug / interface related )

#!/bin/sh /etc/rc.common
START=9
boot() {
export DEBUG=1
export INIT_TRACE=1
export PROCD_DEBUG=1
}

/etc/init.d/debugboot enable

Should I add these options under each initialization process?
The last time I had a problem, I found that the system stopped at the place where the firewall was initialized. I thought I should add debugging information before and after the firewall was initialized

Depends when / where your trying to narrow it to... as they are verbose with all the json stuff...

Feel free to try anywhere... ( i.e. PROCD_DEBUG just in /etc/init.d/firewall )

Ok, I see
Thanks

I found that the initialization process competes with fw3 for the same mutex, causing a deadlock.


After the DEBUG option is added, there is a prompt message.

Thu Dec  5 03:25:33 2019 user.notice root: warning: procd flock for firewall failed
1 Like

you've got another thread re: ath9 issues? ... ( disable wifi or similar as a simplest means to check correlations... )

but yes... i agree that the presented failure mode is problematic...

procd flocks like crazy on init... and a good guess would be that some "async-esq" patch or backport is incomplete / incompatible with your failure scenario...

i notice master init.d/firewall has this... ( boot() )

        # Be silent on boot, firewall might be started by hotplug already,
        # so don't complain in syslog.
        QUIET=-q

FWIW... so at least on master... both init/procd and hotplug both start "firewall" ... in which case a hacky workaround would be an exit/(sleep && reload)/while ! flock sleep.... in boot() if running... or similar...

1 Like

Yes, my ATH9K crashes probabilistically in use, especially in STA mode, but I don't think there's a particular correlation between them.
The deadlock caused by procd is at the user level and, moreover, during system startup.
ATH9K is in normal use of the system, and from the LOG point of view, it is an interrupt exception.
I'm trying to do all kinds of things, and the biggest problem is that I'm having a hard time replicating these two problems, basically, every 24 hours
I can try disabling all hotplug events. Thanks for your advice

1 Like

for testing init perhaps... but i'd be unethical if I don't state that this will likely break things and potentially lead to an insecure system long/medium/short term :wink: ( but i'm sure you knew this already... just have to say it in case someone else reads )

1 Like

I'm just trying to locate the problem, it's important to have hot plug events in the system, and I'll update here when I find the problem.

1 Like