[BUG] rpcd service does not process all states correctly

While developing the luci package, I came across a strange quirk of the rpcd daemon.
To demonstrate this behavior, I asked ChatGPT to write a simple script that tests the operation of the rpcd daemon:

#!/bin/sh

EXE_DIR=$( cd "$(dirname "$0")" 2>/dev/null && pwd )

RPC_URL="http://127.0.0.1/ubus"

usage() {
	echo "Usage:"
	echo "  $0 -p <password> -c <command> [-s] [-T <N>]"
	echo
	echo "Options:"
	echo "  -p  root password"
	echo "  -c  command string (no quotes inside!)"
	echo "  -s  run via 'sh -c'"
	echo "  -T  repeat the request N times, stop on first error"
	exit 1
}

PASSWORD=""
CMDLINE=""
USE_SH=0
TEST_MAX=0

while getopts "p:c:sT:" opt; do
	case "$opt" in
		p) PASSWORD="$OPTARG" ;;
		c) CMDLINE="$OPTARG" ;;
		s) USE_SH=1 ;;
		T) TEST_MAX="$OPTARG" ;;
		*) usage ;;
	esac
done

[ "$PASSWORD" = "@" ] && PASSWORD=xxxxxxxxxxx
[ -z "$PASSWORD" ] && usage
[ -z "$CMDLINE" ] && usage

LOGIN_JSON=$( cat <<EOF
{
	"jsonrpc": "2.0",
	"id": 1,
	"method": "call",
	"params": [
		"00000000000000000000000000000000",
		"session",
		"login",
		{
			"username": "root",
			"password": "$PASSWORD"
		}
	]
}
EOF
)

LOGIN_REPLY=$( curl -s "$RPC_URL" -H 'Content-Type: application/json' -d "$LOGIN_JSON" )

SESSION=$( echo "$LOGIN_REPLY" | sed -n 's/.*"ubus_rpc_session"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' )

if [ -z "$SESSION" ]; then
	echo "ERROR: login failed"
	echo "$LOGIN_REPLY"
	exit 2
fi

if [ "$USE_SH" -eq 1 ]; then
	# sh -c "string"
	CMD="/bin/sh"
	PARAMS_JSON="\"-c\",\"$CMDLINE\""
else
	# split by spaces
	set -- $CMDLINE
	CMD="$1"
	shift
	PARAMS_JSON=""
	for a in "$@"; do
		PARAMS_JSON="$PARAMS_JSON,\"$a\""
	done
	PARAMS_JSON="${PARAMS_JSON#,}"
fi

EXEC_JSON=$( cat <<EOF
{
	"jsonrpc": "2.0",
	"id": 2,
	"method": "call",
	"params": [
		"$SESSION",
		"file",
		"exec",
		{
			"command": "$CMD",
			"params": [ $PARAMS_JSON ]
		}
	]
}
EOF
)

TEST_ITER=0
while true; do
	TEST_ITER=$((TEST_ITER + 1))
	TEST_RESP=$( curl -s "$RPC_URL" -H 'Content-Type: application/json' -d "$EXEC_JSON" )
	echo "$TEST_RESP"
	if ! echo "$TEST_RESP" | grep -q '"result":\[0' ; then
		echo "ERROR: response with error code. Iteration = $TEST_ITER"
		exit 1
	fi
	[ "$TEST_ITER" -ge "$TEST_MAX" ] && break
done

echo ""

To test, you need to run the following command:

./rpctest.sh -p <password> -s -c 'start-stop-daemon -S -b -p /tmp/rpctest.pid -x sleep -- 1 '

So, on a 4-core mt7986A, this script always runs instantly and produces the following result:

{"jsonrpc":"2.0","id":2,"result":[0,{"code":0}]}

But on a single-core mt7621, in about 30% of cases you have to wait more than 30 seconds for a response and then get this one (7 = UBUS_STATUS_TIMEOUT):

{"jsonrpc":"2.0","id":2,"result":[7]}

But this command always runs successfully:

> ./rpctest.sh -p @ -s -c 'start-stop-daemon -S -b -p /tmp/rpctest.pid -x sleep -- 1 ; awk -V '
{"jsonrpc":"2.0","id":2,"result":[0,{"code":0,"stdout":"GNU Awk 5.3.2, API 4.0\nCopyright (C) 1989, 1991-2025 Free Software Foundation.\n\nThis program is free software; you can redistribute it and/or modify\nit under the terms of the GNU General Public License as published by\nthe Free Software Foundation; either version 3 of the License, or\n(at your option) any later version.\n\nThis program is distributed in the hope that it will be useful,\nbut WITHOUT ANY WARRANTY; without even the implied warranty of\nMERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the\nGNU General Public License for more details.\n\nYou should have received a copy of the GNU General Public License\nalong with this program. If not, see http://www.gnu.org/licenses/.\n"}]}

But this command does not always execute successfully:

> ./rpctest.sh -p @ -s -c 'start-stop-daemon -S -b -p /tmp/rpctest.pid -x sleep -- 1 ; echo 123 '
{"jsonrpc":"2.0","id":2,"result":[7]}

> ./rpctest.sh -p @ -s -c 'start-stop-daemon -S -b -p /tmp/rpctest.pid -x sleep -- 1 ; echo 123 '
{"jsonrpc":"2.0","id":2,"result":[0,{"code":0,"stdout":"123\n"}]}

Or, similarly:

> ./rpctest.sh -p @ -s -c '( exec >/dev/null 2>&1 ; sleep 1 ) &'
{"jsonrpc":"2.0","id":2,"result":[7]}

> ./rpctest.sh -p @ -s -c '( exec >/dev/null 2>&1 ; sleep 1 ) &'
{"jsonrpc":"2.0","id":2,"result":[0,{"code":0}]}

I think it's something else. Is it really single-core vs multi-core?

Yes. mt7621st vs mt7986A

ping @systemcrash , @jow , @nbd , @efahl


Haha, not me, I have no idea how any of that LuCI stuff works. I'm just an innocent ignorant bystander.


Do you get the same behaviour via direct ubus calls to the same object? (This requires the callee to be registered there.)

Is the behaviour in ubus or rpcd? rpcd just connects to the ubus socket.

This usually implies a bus timeout, but not always.

I didn't understand your question. Could you explain it with a script?

So, take a look at the script that runs the test.
The results should be consistent even on an old single-core processor.


When you find something weird like this, chances are, you’re the first, and you know most about the problem, and if you keep digging, you eventually fix it.

I don’t know why it does what it does, SMP isn’t my strong-point. Might want to strace it.

You may need to use a deferred request. I ran into a similar issue with ESP32s (which are dual core?). In my case the actual problem was with the underlying socket protocol of the ESP32; however, using a deferred request allows my code to function without errors. Updates are still not consistent time-wise, but they always come in eventually, consistently.

I had to learn how it works myself. And in the process, a problem emerged:

We can insert a sleep(1) before the ustream_poll functions and everything will always work reliably.
But that's not our method!
We'll write an additional function:

static int
rpc_prepare_stream(struct ustream_fd *str, int timeout_ms)
{
    /* cap the wait below one second so tv_usec stays in range */
    int t_microsec = (timeout_ms >= 1000) ? 999 * 1000 : timeout_ms * 1000;
    fd_set rfds;
    struct timeval tv = { 0, t_microsec };
    FD_ZERO(&rfds);
    FD_SET(str->fd.fd, &rfds);
    int rsel = select(str->fd.fd + 1, &rfds, NULL, NULL, &tv);
    if (rsel > 0 && !FD_ISSET(str->fd.fd, &rfds)) {
        /* should not happen: select reported readiness, but not for our fd */
        return -1000000 - rsel;
    }
    return rsel;
}

Now we can insert the following lines of code in place of the sleep(1):

    int orc = rpc_prepare_stream(&c->opipe, 999);
    ULOG_INFO("opipe stream prepare = %d \n", orc);
    int erc = rpc_prepare_stream(&c->epipe, 999);
    ULOG_INFO("epipe stream prepare = %d \n", erc);
    
    //sleep(1);  // ADDED FOR FIX BUG

    ustream_poll(&c->opipe.stream);
    ustream_poll(&c->epipe.stream);

After this, everything always works correctly!

But if 999 is replaced with 0, then sometimes we can see something like this in the log:

Wed Feb  4 18:09:07 2026 kern.debug RPCD: RPCD: file_exec...
Wed Feb  4 18:09:07 2026 kern.debug RPCD: RPCD: file_exec: pid = 14423  timeout = 5000
Wed Feb  4 18:09:07 2026 kern.debug RPCD: RPCD: file_exec: execv...
Wed Feb  4 18:09:07 2026 kern.debug RPCD: RPCD: file_exec_process_cb: pid = 14423
Wed Feb  4 18:09:07 2026 kern.debug RPCD: RPCD: opipe FLAGS = 0 , STATUS = 128
Wed Feb  4 18:09:07 2026 kern.debug RPCD: RPCD: epipe FLAGS = 0 , STATUS = 128
Wed Feb  4 18:09:07 2026 kern.debug RPCD: opipe stream fd = 8, r.data_bytes = 0
Wed Feb  4 18:09:07 2026 kern.debug RPCD: epipe stream fd = 20, r.data_bytes = 0
Wed Feb  4 18:09:07 2026 kern.debug RPCD: opipe stream prepare = 1
Wed Feb  4 18:09:07 2026 kern.debug RPCD: epipe stream prepare = 0
Wed Feb  4 18:09:12 2026 kern.debug RPCD: RPCD: file_exec_timeout_cb: pid = 14423
Wed Feb  4 18:09:12 2026 kern.debug RPCD: RPCD: file_exec_reply: pid = 14423  rv = 7

And in this case, the RPC request hangs for the full exec timeout (30 seconds by default; I have it set to 5 seconds for testing).


And I also noticed this...
If we add logging to these functions:

static void
rpc_file_exec_opipe_state_cb(struct ustream *s)
{
    struct rpc_file_exec_context *c =
        container_of(s, struct rpc_file_exec_context, opipe.stream);

    ULOG_INFO("opipe state: opipe.eof = %d, epipe.eof = %d \n", c->opipe.stream.eof, c->epipe.stream.eof);

    if (c->opipe.stream.eof && c->epipe.stream.eof)
        rpc_file_exec_reply(c, UBUS_STATUS_OK);
}

static void
rpc_file_exec_epipe_state_cb(struct ustream *s)
{
    struct rpc_file_exec_context *c =
        container_of(s, struct rpc_file_exec_context, epipe.stream);

    ULOG_INFO("epipe state: opipe.eof = %d, epipe.eof = %d \n", c->opipe.stream.eof, c->epipe.stream.eof);

    if (c->opipe.stream.eof && c->epipe.stream.eof)
        rpc_file_exec_reply(c, UBUS_STATUS_OK);
}

then, when the problematic situation arises, we see the following in the syslog:

Wed Feb  4 20:24:24 2026 kern.debug RPCD: RPCD: file_exec...
Wed Feb  4 20:24:24 2026 kern.debug RPCD: RPCD: file_exec: pid = 23739  timeout = 5000
Wed Feb  4 20:24:24 2026 kern.debug RPCD: RPCD: file_exec: execv...
Wed Feb  4 20:24:24 2026 kern.debug RPCD: opipe state: opipe.eof = 0, epipe.eof = 0
Wed Feb  4 20:24:24 2026 kern.debug RPCD: RPCD: file_exec_process_cb: pid = 23739
Wed Feb  4 20:24:24 2026 kern.debug RPCD: RPCD: opipe FLAGS = 0 , STATUS = 128
Wed Feb  4 20:24:24 2026 kern.debug RPCD: RPCD: epipe FLAGS = 0 , STATUS = 128
Wed Feb  4 20:24:24 2026 kern.debug RPCD: opipe stream fd = 8, r.data_bytes = 0
Wed Feb  4 20:24:24 2026 kern.debug RPCD: epipe stream fd = 20, r.data_bytes = 0
Wed Feb  4 20:24:24 2026 kern.debug RPCD: opipe stream prepare = 1
Wed Feb  4 20:24:24 2026 kern.debug RPCD: epipe stream prepare = 0
Wed Feb  4 20:24:24 2026 kern.debug RPCD: opipe state: opipe.eof = 1, epipe.eof = 0
Wed Feb  4 20:24:24 2026 kern.debug RPCD: epipe state: opipe.eof = 1, epipe.eof = 0
Wed Feb  4 20:24:29 2026 kern.debug RPCD: RPCD: file_exec_timeout_cb: pid = 23739
Wed Feb  4 20:24:29 2026 kern.debug RPCD: RPCD: file_exec_reply: pid = 23739  rv = 7

When the command executes correctly, we see the following in the syslog:

Wed Feb  4 20:26:07 2026 kern.debug RPCD: opipe stream prepare = 1
Wed Feb  4 20:26:07 2026 kern.debug RPCD: epipe stream prepare = 1
Wed Feb  4 20:26:07 2026 kern.debug RPCD: opipe state: opipe.eof = 1, epipe.eof = 1
Wed Feb  4 20:26:07 2026 kern.debug RPCD: RPCD: file_exec_reply: pid = 23950  rv = 0

A very rare situation:

Wed Feb  4 20:42:56 2026 kern.debug RPCD: opipe stream prepare = 0
Wed Feb  4 20:42:56 2026 kern.debug RPCD: epipe stream prepare = 0
Wed Feb  4 20:42:56 2026 kern.debug RPCD: opipe state: opipe.eof = 0, epipe.eof = 0
Wed Feb  4 20:42:56 2026 kern.debug RPCD: epipe state: opipe.eof = 0, epipe.eof = 0
Wed Feb  4 20:43:01 2026 kern.debug RPCD: RPCD: file_exec_timeout_cb: pid = 26309
Wed Feb  4 20:43:01 2026 kern.debug RPCD: RPCD: file_exec_reply: pid = 26309  rv = 7

I added additional logging to the libubox library and this is what I got in a problematic situation:

Thu Feb  5 13:41:13 2026 kern.debug RPCD: RPCD: file_exec...
Thu Feb  5 13:41:13 2026 kern.debug RPCD: RPCD: file_exec: pid = 19743  timeout = 5000
Thu Feb  5 13:41:13 2026 kern.debug RPCD: RPCD: file_exec: execv...
Thu Feb  5 13:41:13 2026 kern.debug RPCD: STATE_CB_opipe: opipe.eof = 0, epipe.eof = 0
Thu Feb  5 13:41:13 2026 kern.debug RPCD: RPCD: file_exec_process_cb: pid = 19743, stat = 0x0
Thu Feb  5 13:41:13 2026 kern.debug RPCD: opipe stream: fd = 8, eof = 0, r.data_bytes = 0
Thu Feb  5 13:41:13 2026 kern.debug RPCD: epipe stream: fd = 20, eof = 0, r.data_bytes = 0
Thu Feb  5 13:41:13 2026 kern.debug RPCD: opipe stream: select(0) = 1, revents = 0x10
Thu Feb  5 13:41:13 2026 kern.debug RPCD: epipe stream: select(0) = 0, revents = 0x00
Thu Feb  5 13:41:13 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 8 ...
Thu Feb  5 13:41:13 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 8 => END: len = 0
Thu Feb  5 13:41:13 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 ...
Thu Feb  5 13:41:13 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11
Thu Feb  5 13:41:13 2026 kern.debug RPCD: RPCD: file_exec_process_cb: END:: opipe.eof=1, epipe.eof=0
Thu Feb  5 13:41:13 2026 kern.debug RPCD: STATE_CB_opipe: opipe.eof = 1, epipe.eof = 0
Thu Feb  5 13:41:13 2026 kern.debug RPCD: STATE_CB_epipe: opipe.eof = 1, epipe.eof = 0
Thu Feb  5 13:41:18 2026 kern.debug RPCD: RPCD: file_exec_timeout_cb: pid = 19743
Thu Feb  5 13:41:18 2026 kern.debug RPCD: RPCD: file_exec_reply: pid = 19743  rv = 7

ERROR: 11 is EAGAIN.
After this error, we never see EOF on the pipe.

This is what the "corrected function" looks like now:

static void ustream_fd_read_pending(struct ustream_fd *sf, bool *more)
{
	struct ustream *s = &sf->stream;
	int buflen = 0;
	ssize_t len;
	char *buf;
	int trynum = 0;

	ULOG_INFO("USTREAM: ustream_fd_read_pending: fd = %d ... \n", sf ? sf->fd.fd : -1);

	do {
		if (s->read_blocked)
			break;

		buf = ustream_reserve(s, 1, &buflen);
		if (!buf)
			break;

		len = read(sf->fd.fd, buf, buflen);
		if (len < 0) {
			if (errno == EINTR)
				continue;

			if (errno == EAGAIN) {
				/* retry instead of giving up: the writer may not have closed its end yet */
				ULOG_INFO("USTREAM: ustream_fd_read_pending: fd = %d => ERROR: %d   (trynum = %d) \n", sf->fd.fd, errno, trynum);
				trynum++;
				if (trynum > 1000000)
					return;
				continue;
			}

			if (errno == ENOTCONN) {
				ULOG_INFO("USTREAM: ustream_fd_read_pending: fd = %d => ERROR: %d \n", sf->fd.fd, errno);
				return;
			}

			len = 0;
		}

		if (!len) {
			if (!s->eof)
				ustream_state_change(s);
			s->eof = true;
			ustream_fd_set_uloop(s, false);
			ULOG_INFO("USTREAM: ustream_fd_read_pending: fd = %d => END: len = %d \n", sf->fd.fd, (int)len);
			return;
		}

		ustream_fill_read(s, len);
		*more = true;
	} while (1);

	ULOG_INFO("USTREAM: ustream_fd_read_pending: fd = %d => end \n", sf ? sf->fd.fd : -1);
}

In problematic situations there are never more than 16 reading attempts:

Thu Feb  5 14:48:56 2026 kern.debug RPCD: RPCD: file_exec...
Thu Feb  5 14:48:56 2026 kern.debug RPCD: RPCD: file_exec: execv...
Thu Feb  5 14:48:56 2026 kern.debug RPCD: RPCD: file_exec: pid = 26010  timeout = 5000
Thu Feb  5 14:48:56 2026 kern.debug RPCD: STATE_CB_opipe: opipe.eof = 0, epipe.eof = 0
Thu Feb  5 14:48:56 2026 kern.debug RPCD: RPCD: file_exec_process_cb: pid = 26010, stat = 0x0
Thu Feb  5 14:48:56 2026 kern.debug RPCD: opipe stream: fd = 8, eof = 0, r.data_bytes = 0
Thu Feb  5 14:48:56 2026 kern.debug RPCD: epipe stream: fd = 20, eof = 0, r.data_bytes = 0
Thu Feb  5 14:48:56 2026 kern.debug RPCD: opipe stream: select(0) = 1, revents = 0x10
Thu Feb  5 14:48:56 2026 kern.debug RPCD: epipe stream: select(0) = 0, revents = 0x00
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 8 ...
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 8 => END: len = 0
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 ...
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 0)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 1)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 2)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 3)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 4)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 5)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 6)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 7)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 8)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 9)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 10)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 11)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 12)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 13)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 14)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => ERROR: 11   (trynum = 15)
Thu Feb  5 14:48:56 2026 kern.debug RPCD: USTREAM: ustream_fd_read_pending: fd = 20 => END: len = 0
Thu Feb  5 14:48:56 2026 kern.debug RPCD: RPCD: file_exec_process_cb: END:: opipe.eof=1, epipe.eof=1
Thu Feb  5 14:48:56 2026 kern.debug RPCD: STATE_CB_opipe: opipe.eof = 1, epipe.eof = 1
Thu Feb  5 14:48:56 2026 kern.debug RPCD: RPCD: file_exec_reply: pid = 26010  rv = 0

I made this patch:

In all cases, this patch works as intended and does not break the current implementation.

Almost the same thing was done in libuv:

I just wait for the pipe to reach the correct state before starting to read.


Improved version of the patch:

There is also a fix for the useless wait for the process to finish when it is known that the process has already completed.

As it turns out, the problem occurs everywhere; it is just detected most often on very slow single-core systems.

I slightly modified the testing script that I published in the first post of the current topic:

In the testing script, I added the number of iterations for testing and an exit after catching an error response.

Here is the result of running the script on a 4-core mt7986A CPU:

> ./rpctest.sh -p @ -s -c '( exec >/dev/null 2>&1 ; echo 123 ) & ' -T 3000000
{"jsonrpc":"2.0","id":2,"result":[0,{"code":0}]}
{"jsonrpc":"2.0","id":2,"result":[0,{"code":0}]}
{"jsonrpc":"2.0","id":2,"result":[0,{"code":0}]}
.....
{"jsonrpc":"2.0","id":2,"result":[0,{"code":0}]}
{"jsonrpc":"2.0","id":2,"result":[0,{"code":0}]}
{"jsonrpc":"2.0","id":2,"result":[0,{"code":0}]}
{"jsonrpc":"2.0","id":2,"result":[7]}
ERROR: response with error code. Iteration = 2178

@remittor are you planning on creating a PR to have this fixed in OpenWrt tree?

I'm not planning on making a pull request, as there are many possible solutions.
I've just implemented a better fix:

The project maintainer clearly has his own views on how to solve this bug.


Bug report: https://github.com/openwrt/rpcd/issues/26
