Documentation on using procd-ujail / ujail?

Torxgewinde · November 10, 2022, 7:24pm

Hello,
Is there some documentation on using ujail and explanations on the command line switches?

So far, I found the internal help:

ujail <options> -- <binary> <params ...>
  -d <num>	show debug log (increase num to increase verbosity)
  -S <file>	seccomp filter config
  -C <file>	capabilities drop config
  -c		set PR_SET_NO_NEW_PRIVS
  -n <name>	the name of the jail
  -e <var>	import environment variable
namespace jail options:
  -h <hostname>	change the hostname of the jail
  -N		jail has network namespace
  -f		jail has user namespace
  -F		jail has cgroups namespace
  -r <file>	readonly files that should be staged
  -w <file>	writeable files that should be staged
  -p		jail has /proc
  -s		jail has /sys
  -l		jail has /dev/log
  -u		jail has a ubus socket
  -U <name>	user to run jailed process
  -G <name>	group to run jailed process
  -o		remont jail root (/) read only
  -R <dir>	external jail rootfs (system container)
  -O <dir>	directory for r/w overlayfs
  -T <size>	use tmpfs r/w overlayfs with <size>
  -E		fail if jail cannot be setup
  -y		provide jail console
  -J <dir>	create container from OCI bundle
  -i		start container immediately
  -P <pidfile>	create <pidfile>

Warning: by default root inside the jail is the same
and he has the same powers as root outside the jail,
thus he can escape the jail and/or break stuff.
Please use seccomp/capabilities (-S/-C) to restrict his powers

If you use none of the namespace jail options,
ujail will not use namespace/build a jail,
and will only drop capabilities/apply seccomp filter.

and a bit of background from John Crispins message:

OpenWrt service hardening and jailing
=====================================

Current firmware builds have the problem, that a lot of services are
running as root. This is especially critical for those services exposed
to the network. Once an attacker has managed to compromise such a
service he has full control over the system. Even if we change all
processes to run as non root, there is still the risk of root privilege
escalation using for example a bug in a kernel syscall implementation.
Patching all services that we run (we have over 1000 packages) is not
possible. The problem is far to diverse and distributed.  Linux has many
on board features to reduce these attack vectors. Lets create a solution
that allows us to use these features without touching all services.

Adding Seccomp support to procd
===============================

Seccomp is a Berkley Packet Filter (BPF) base syscall filter. Whenever a
service does a syscall, a BPF filter runs and checks if the syscall is
allowed. Currently the filters are setup to allow whitelisted syscalls.
All other syscalls get handled by the default policy. This can be.

* kill the service (policy = 0)
* return -1 and set errno to n (policy = n)

In order to define a syscall filter for a service, it must as a
prerequisite be a procd managed service. Once this is done we simply add
the following line to the init.d script

-> procd_set_param seccomp /etc/seccomp/<service_name>.json

This will tell procd that when the service is started to use the filter
rules defined inside the json file. As we do not want to patch every
service to support seccomp natively we use the following LD_PRELOAD
trick. The service is started with /lib/libpreload-seccomp.so preloading
the __libc_start_main() function of the libc. The preloaded function
does the following things.

1) get a pointer to the real __libc_start_main()
2) get a pointer to the real main() function of the service
3) call the real __libc_start_main() but pass it a pointer to a dummy
main() function
4) inside the dummy main() we read the seccomp filter from json and apply it
5) call the real main()

Doing this allows us to run the dynamic loader (ld.so) and libc_init
code before we apply the seccomp filter. This reduces the amount of
whitelisted syscall required to run the service.

We have extended the kernels seccomp implementation and added an
additional return action that works exactly like the errno action
described above but also prints a line to the kernel log reporting which
process trigger the exception and what syscall number is missing. This
could also be achieved using the trap action but that would require
additional stack parsing inside user-land, which is ugly.


Enabling seccomp support
========================

As this procd feature is still new and under development it is not
enabled by default. In order to enable it you need to select the
following option inside "make menuconfig"

-> Global build settings  ---> [*] Enable seccomp support

The feature has so far been implemented for mips, i386 and x86_64. Other
architectures will follow shortly.


Creating a seccomp filter json file
===================================

It would be very tedious to dig through all packages and figure out
which syscalls are used, specially as many of them are hidden inside the
libc functions that get used. To make things easier we have written a
trivial strace like tool called utrace that will automagically create
the json file for you. Simply calling

-> /sbin/utrace /bin/echo foo

will create a file called /tmp/echo.json.$pid The syscalls are ordered
with the one called most often listed first, so that it is the first
inside the actual filter rules set inside the kernel. To make things
even easier an extra init.d command called "trace" was added. By calling

-> /etc/init.d/<service_name> trace

you can make procd start the service in trace mode. Once the service is
stopped the json file is written to /tmp/.


Adding process jailing to procd
===============================

This feature uses Linux namespaces. If you do not know what namespaces
are (there are currently 5 of them) please refer to this LWN article.

-> http://lwn.net/Articles/531114/

In a nutshell a namespace is a container that separates various aspects
of the user-land from the rest of the system. Namespaces are the base
feature that allows us to run virtual containers using projects such as LXC.

The first namespace that we use is the mount namespace. Once we have
spawned our service inside a mount namespace it cannot see an mounts
outside the container. This in turn allows us to use the pivot_root()
syscall and effectively creating a separate rootfs for the service. As
we do not want the service to see all files on the system, we stage the
required ones into the container. The jailing tool will automatically
detect the libraries needed to run the service, all other files need to
be defined inside the procd init script. The following 3 new commands
are used for this.

* procd_add_jail <name> <features>
	this will tell procd to create a jail and call it <name>. it will also
bring enable the following
* procd_add_jail_mount <file1> <file2> ...
	procd will add the listed files as readonly to the container
* procd_add_jail_mount_rw <file1> <file2> ...
	procd will add the listed files as writable to the container

Due to lack of better option we currently create a tmpfs and then do a
rebind mount for each of the files. Readonly files get a "-o remount,ro"
applied. Once all files are mounted we also run "-o remount,ro" on the
whole tmpfs. This has the big disadvantage that the mount table is
bloated with lots of single mounts. In a future step we intended to
create a new filesystem called jailfs that works similarly to overlayfs
and does the magic inside a single mount, but more on that later.


Loading the jail
================

Instead of spawning the process directly, procd uses the /sbin/ujail
tool. This will do the following things

1) create the readonly tmpfs with all the files inside
2) call the clone() syscall which spawns the namespaces and executes the
ujail stub running inside the container
3) the stub will then call pivot_root() and jump into the tmpfs
4) finally the actual service is spawned

It is possible to tell the stub to also apply a seccomp filter to the
service using the same method mentioned above. In addition to the mount
namspace we also enable the process namespace. this makes sure that our
stub runs inside the container with pid 1, the service with pid 2 and
both are not able to see any other pids running on the system.

Enabling the UTS namespace allows us to set a separate hostname inside
the container.

There are 2 more namespaces for which support will be added soon. These
are the user namespace which will make the service think it is running
as uid/gid 0, but in reality it is not root but a different user (the
namespace uses an offset), such a nobody with the according restrictions
applied. The fifth namespace is the network namespace which will add
virtual network interfaces to the container. But more on that later.

Having this chain of processes allows procd to manage the jailed process
just like any other process. All the features like restartig the service
when the config has changed work as if the jail was not loaded.


Resource usage overhead and performance hit
===========================================

The actual code used to bring up the setup is minimal and currently
weighs in at around 1,5k lines of code. The initial measurement showed
that a container has a ram footprint of around 150KB. The overhead of
seccomp is hard to measure but most likely linear. The more syscalls get
called, the higher the performance hit is. However, as almost all
architectures already have a BPF JIT inside the kernel these filters get
converted to native code that is executed. The utrace tool will sort the
syscalls by the number of invocations, leading the most used syscall
being the first one in the filter chain. With routers becoming faster,
having more ram and recently being multi core the tradeoff between a
little performance hit and a more advanced security is something that
the user has too evaluate based on the scenario where the router is deployed


Next steps
==========

With the core functionality now being implemented a whole wealth of
things are still missing

* user namespace - this will allow us to separate the users inside the
jail from the primary system. A privilege escalation will not grant the
attacker the same rights as without the container. It also allows us to
run services that require root without them actually being root.

* network namespace - this still needs a bit of brainstorming. The
network namespace would require netifd to track and manage the various
virtual interfaces of the containers and setup routing and firewalling
for these. A far simpler solution might be to use a LD_PRELOAD library
that hooks into the bind() and connect() syscalls and filters which IPs,
ports and protocola the container has access to

* jail delegation - currently each service has its own jail, ideally we
can group services together and have jails running more than one service
based on some set of rules.

* eBPF - seccomp uses BPF but there is also eBPF, which to my
understanding allows us to not only filter syscalls based on their
number but also gives the ability to look at the parameters passed to
the syscalls. This would allow us to kill exploits such as path
traversals. We could simply block the open() syscall from opening files
that have a ".." inside the path. eBPF also requires a piece of c code
to be written which then gets passed to llvm, for which there is backend
that will convert the code to a BPF filter. This requires us to add
support for the llvm backend to OpenWrt and also to write the filters.
This is far more complex than using normal BPL filters.

* jailfs - as mentioned before, we are considering to build a small
overlayfs like filesystem that will do the filtering of which files the
container can see. This would obsolete the use of the rebind mounts and
be far more lightweight. This still needs to be evaluated.

* ubus ACL - if the container has ubus support, it will also have access
to all features provided by ubus, which is almost equivalent to being
root. Adding support to ubus for adding permissions based on the callers
uid/gid is easy, the hard part is defining the format in which the ACL
rules are defined. We are still brainstorming on this one.

* cgroup support - we want to add support for cgroups to allow us to
limit the amount of ram resources and cpu cycles a container can use.
This way a exploited daemon cannot DoS the cpu by running "while(true)
{}” or trigger an OOM

* ssl certificate pinning - there is a iptables module that allows us to
pin SSL certificates. Adding the module is easy, but handling the
tracking and allowing/blocking the certificates still needs some
brainstorming and coding.

If there are features that we are not aware of yet or that we forgot to
list, then please let us know about them.

Comments and ideas are welcome ...

Is this the information we have, or am I missing something? Search engines did not reveal a good, single source of information.