In the process of using 1907, I found that procd occasionally had deadlocks. The following is my system information
OPENWRT_RELEASE="OpenWrt 19.07.0-rc1 r10649-c4fdb377a2"
While booting to the firewall, the system stopped booting, and the following processes were running. I waited 30 minutes, but the processes remained unchanged
PID USER VSZ STAT COMMAND
1 root 1572 S /sbin/procd
2 root 0 SW [kthreadd]
4 root 0 IW< [kworker/0:0H]
6 root 0 IW< [mm_percpu_wq]
7 root 0 SW [ksoftirqd/0]
8 root 0 IW [kworker/u2:1]
89 root 0 SW [oom_reaper]
90 root 0 IW< [writeback]
92 root 0 SW [kcompactd0]
93 root 0 IW< [crypto]
95 root 0 IW< [kblockd]
109 root 0 IW< [watchdogd]
122 root 0 IW [kworker/0:1]
130 root 0 SW [kswapd0]
195 root 0 SW [spi0]
305 root 0 IW< [ipv6_addrconf]
321 root 0 SW [ubi_bgt0d]
326 root 0 IW< [kworker/0:1H]
396 root 0 IW [kworker/0:2]
426 root 0 SW [scsi_eh_0]
427 root 0 IW< [scsi_tmf_0]
428 root 0 SW [usb-storage]
493 root 0 SW [ubifs_bgt0_1]
575 root 1208 S /sbin/ubusd
705 root 1028 S /sbin/urngd
973 root 0 IW< [cfg80211]
1020 root 0 IW< [ath10k_wq]
1021 root 0 IW< [ath10k_aux_wq]
1929 root 1080 S /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
1931 root 1352 S /bin/sh /etc/rc.common /etc/rc.d/S19firewall boot
1943 root 2032 R fw3 -q start
2189 root 1684 S /sbin/netifd
2235 root 0 IW [kworker/u2:0]
2301 root 1272 S /bin/sh /sbin/hotplug-call iface
2502 dnsmasq 2752 S /usr/sbin/dnsmasq -C /var/etc/dnsmasq.conf.cfg01411c -k -x /var/run/dnsmasq/dnsmasq.cfg01411c.pid
2614 root 3520 S /usr/bin/gltertf
2646 root 1276 S /bin/sh /sbin/hotplug-call iface
2647 root 1348 S /bin/sh /etc/rc.common /etc/init.d/firewall enabled
2662 root 1228 S flock 1000
2704 root 1788 S /usr/sbin/hostapd -s -P /var/run/wifi-phy1.pid -B /var/run/hostapd-phy1.conf
2707 root 1768 S /usr/sbin/hostapd -s -P /var/run/wifi-phy0.pid -B /var/run/hostapd-phy0.conf
3125 root 1148 S /usr/sbin/dropbear -F -P /var/run/dropbear.1.pid -p 22 -K 300 -T 3
3202 root 1236 S -ash
6640 root 1248 S /sbin/logd -S 64
15851 root 1228 R ps
I saw the process flock 1000 and found it in the procd_lock function in /lib/functions/procd.sh
how occasionally is occasionally?
i'm thinking you'll need to test this on rc2... now that it's out...
It's a very small chance, about a 1 in 100 chance.
I tried to change flock 1000 to flock -n 1000, which is currently being tested.
I'll test rc2
something like this in /etc/init.d/debugboot may assist to track down issues... ( seems hotplug / interface related )
Should I add these options under each initialization process?
The last time I had a problem, I found that the system stopped at the place where the firewall was initialized. I thought I should add debugging information before and after the firewall was initialized
Depends when / where your trying to narrow it to... as they are verbose with all the json stuff...
Feel free to try anywhere... ( i.e. PROCD_DEBUG just in /etc/init.d/firewall )
I found that the initialization process competes with fw3 for the same mutex, causing a deadlock.
After the DEBUG option is added, there is a prompt message.
Thu Dec 5 03:25:33 2019 user.notice root: warning: procd flock for firewall failed
you've got another thread re: ath9 issues? ... ( disable wifi or similar as a simplest means to check correlations... )
but yes... i agree that the presented failure mode is problematic...
procd flocks like crazy on init... and a good guess would be that some "async-esq" patch or backport is incomplete / incompatible with your failure scenario...
i notice master init.d/firewall has this... ( boot() )
# Be silent on boot, firewall might be started by hotplug already,
# so don't complain in syslog.
FWIW... so at least on master... both init/procd and hotplug both start "firewall" ... in which case a hacky workaround would be an exit/(sleep && reload)/while ! flock sleep.... in boot() if running... or similar...
Yes, my ATH9K crashes probabilistically in use, especially in STA mode, but I don't think there's a particular correlation between them.
The deadlock caused by procd is at the user level and, moreover, during system startup.
ATH9K is in normal use of the system, and from the LOG point of view, it is an interrupt exception.
I'm trying to do all kinds of things, and the biggest problem is that I'm having a hard time replicating these two problems, basically, every 24 hours
I can try disabling all hotplug events. Thanks for your advice
for testing init perhaps... but i'd be unethical if I don't state that this will likely break things and potentially lead to an insecure system long/medium/short term ( but i'm sure you knew this already... just have to say it in case someone else reads )
I'm just trying to locate the problem, it's important to have hot plug events in the system, and I'll update here when I find the problem.