Local builds are not 100% reliable

My build procedure looks like this:

make menuconfig
make clean
make -j40 kernel_menuconfig
make -j40 defconfig download clean world

I use that extra clean at the beginning because it improves the chances of success. About 75% of the time there is no problem, but the other 25% the build fails at random spots. It seems like some sort of race condition, because if I increase the parallelism further, say to -j50, it fails about 90% of the time. And it fails in totally random places, almost never the same one twice.

I want faster builds, but I hate having it fail randomly and ending up rebuilding several times in a row before finally getting a good build.

Any hints?

PS. This build machine is enterprise-class, with ECC RAM and redundant drives and subsystems. A hardware problem is unlikely, as this machine builds lots of big projects without trouble.

PPS. I'm working on collecting logs from a bunch of failures to maybe pinpoint what is going on. It's a PITA, though: because of the random nature, a whole bunch of things often fail at the same time and their logs are interleaved, so it's almost impossible to tell what actually happened. Is there a way to generate individual per-package build logs?
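
Edit: to partly answer my own question, the build system apparently supports per-package logs via a BUILD_LOG flag (going off my reading of the docs here, so treat it as a pointer rather than gospel):

#Write a separate log per package under logs/ instead of one interleaved stream
make -j40 BUILD_LOG=1 world

#After a failure, look under logs/, e.g. something like:
#logs/package/feeds/packages/perl/compile.txt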

Firstly, increasing -j values beyond cores+1 is silly and does not lead to faster results unless you're talking about distcc builds. Secondly, there is what I believe to be a race condition that causes tc-something to fail. It's been happening consistently for months now. My solution is to build like this, which happily restarts and finishes.

My shell init file defines MAKEFLAGS='-j33' so that is implied below.

make clean && make defconfig && ( nice -19 make || nice -19 make )
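
For anyone copying this, the one-liner relies on that MAKEFLAGS line in my shell init, so written out in full it's roughly the following (the -j33 value is just what suits my box; nice -n 19 is the clearer spelling of nice -19):

#In ~/.bashrc or similar; make picks this up automatically
export MAKEFLAGS='-j33'

#The || simply retries once if the first pass trips the race
make clean && make defconfig && ( nice -n 19 make || nice -n 19 make )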

-j50 is not even using all the available processors. I use -j40 because it succeeds more often. Also, I go up to 25% above nproc because it does in fact speed up the build a little (if you have a fast disk).

This isn't one specific package. It's totally random. Something innocuous like perl will fail to build (and/or a bunch of other stuff). Then I do the exact same clean/build again and maybe there's no problem, or maybe it fails on something totally unrelated. It's weird.

But maybe this hints at something. Is most of the official image building done on smaller machines? It does seem that the fewer parallel processes you use, the more reliably the build succeeds (but it's so sloowww...).

  1. I'm using make -j5 and about 90% of the time it passes in one take.
  2. If it fails, there's no need to clean or add V=s; just run make again.

When it succeeds, the output is byte-for-byte identical every time, so nothing is wrong with the result.


Hi,

I am familiar with the scenario you describe and have seen it on a variety of builds across all architectures.

The problem is a cart-before-horse race condition that is inherent to parallel builds.

While developers can make every effort to coax builds into occurring in a desired order, as a build is spread across more parallel workers it becomes increasingly difficult to ensure that program A is not built until its dependency library B is built. So when you wind up your modern 64-core 10 THz super blaster and try to build OpenWrt in under 2 minutes: boom, it fails. And while it may seem random, heh, well, it mostly is. This is a common trait of loosely orchestrated parallel builds. So many things are happening so fast that it is anyone's guess which will complete first. The probability of a problem scales rapidly with the core count, because it becomes easier for the process to, for lack of a better term, "get ahead of itself".

This can really only be answered a couple of ways. One: accept that the big block 502 bored .090 over is going to totally bend the Yugo frame, and lighten up on the lead foot (I know, it's really hard for me to do too). Two: embark on a parallel build system that can cross-reference everything to build against every other thing to build, and then add a blocking/ordering process to orchestrate a fully dependency-aware parallel build.

Option 1 is easy: cut down your -j number and the build race crashes will decrease.

Option 2 is no easy task: empirical determination and ordering of the full cascade of dependencies across hundreds of applications and libraries. If you do take it on, I have a 48-core/96-thread box that can confirm whether you have completed the task successfully. Until then, it seems to be pretty good at doing builds at around -j24. It still trips maybe 1 out of 5 times, but it sure beats crawling along at a low -j. If I am in a hurry I will just do something like this: make -j 64 && make -j 32 && make -j 16 && make -j 8, although I usually don't even use the -j 8.
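
If you'd rather not type that chain out, a minimal loop does the same thing (just a sketch of the idea, nothing official):

#Step the parallelism down each time the build trips on the race
for j in 64 32 16 8; do
    make -j"$j" && break
done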

It isn't really weird once you understand that -j is just a legitimized compiler fork bomb.

There was a time when a load average spiking over 50 was deemed bad to very bad. Now we have locomotive engines disguised as PCs, and we push them on purpose, trading the plus of saved time for the minus of having to cope with the limits of massively parallel builds.

Best regards,
Dan


I use this and my builds never fail.

#Rebuild Tool Chain
make -j$(($(nproc)+1)) toolchain/install

#Building System
make -j$(($(nproc)+1))

What is "nproc" on your machine?

Physical machine: 32 x Intel(R) Xeon(R) CPU E5-2690 0 @ 2.90GHz (2 sockets), with 24 vCPUs assigned to my builder.

Just to further clarify, the host is running on Proxmox VE 7.2.7 and my builder is Ubuntu 20.04.4 on kernel 4.4.0-122-generic

Actually, you have 16 cores and 32 threads: a subtle but, in this context, meaningful distinction. It matters because the 2:1 thread-to-core ratio lowers each individual thread's sprint speed, which in turn improves your odds of missing the race condition. It isn't a full 2:1 difference, but the fact that the cores are shared among threads is a measurable factor.
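
If you want to see the split on your own box, lscpu spells it out (on Linux; the output labels may vary slightly by version):

#Show sockets, physical cores per socket, and threads per core
lscpu | grep -E '^(Socket|Core|Thread)'

Keep in mind nproc reports the thread count, not the core count.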

I actually have experience with gigantic-code-base build systems, so I might look into this. I just don't have time these days.

I had assumed this wouldn't be a problem in a project that has been around this long; usually when I hit an OpenWrt problem, it's because I don't know what I'm doing. I guess it's still relatively uncommon to use large build machines.

To relate to @mTek: I use a similar approach.

  • Clone the source, or have a clean working directory
  • make download
  • make -j$(($(nproc)+1)) toolchain/install
  • make -j$(($(nproc)+1))
  • Then make your local modifications, e.g. via make menuconfig
  • And then build the final image.

Most of the time you probably won't want to build everything from scratch, and even when you do, going through smaller stages seems to lower the chance of hitting a race condition. This is only based on personal experience, but it feels like building the toolchain first, then a vanilla system, and then finally the extra stuff makes it more "reliable" (roughly the flow sketched below).
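
Put together, and borrowing the retry trick from earlier in the thread, the staged flow looks roughly like this (purely illustrative; pick a -j that suits your machine):

#Fetch all sources first, then build the toolchain, then everything else
make download && \
make -j"$(nproc)" toolchain/install && \
( make -j"$(nproc)" || make -j"$(nproc)" )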

Here is my full build script for the Belkin RT3200... I'm doing this for about 6 other devices, and this flow works for me.

#!/bin/bash
set -e    #Abort on the first failed command rather than ploughing on

#Setting Date and time
date="$(date '+%Y_%m_%d')"
time="$(date '+%H_%M_%S')"

#Variables for build system
device=rt3200
manufacturer=Belkin
version=22.03
tags=v22.03.0-rc6
src_location=https://github.com/openwrt/openwrt.git

#Create device build location
mkdir -p ~/openwrt_$device

#Changing directory
cd ~/openwrt_$device

#Create final output directory (date/time nesting in one go)
mkdir -p ~/openwrt_$device/$date/$time

#Removing old build files
rm -rf build_files

#Cloning Additional Files
git clone git@git.privaterepo.com:openwrt/devices.git build_files/

#Updating source
git clone --branch openwrt-$version --single-branch $src_location openwrt-$tags

#Changing Working Directory
cd ~/openwrt_$device/openwrt-$tags

#Switching to specific tag
git checkout tags/$tags

#Updating feeds
rm -f feeds.conf
cp feeds.conf.default feeds.conf
./scripts/feeds update -a
./scripts/feeds install -a

#Setting up customizations
mkdir -p ~/openwrt_$device/openwrt-$tags/files/

#Setting Build Options
cp -r ~/openwrt_$device/build_files/$manufacturer/$device/files/* ~/openwrt_$device/openwrt-$tags/files/
cp -r ~/openwrt_$device/build_files/$manufacturer/$device/.config ~/openwrt_$device/openwrt-$tags/.config

make olddefconfig

#Rebuild Tool Chain
make -j$(($(nproc)+1)) toolchain/install

#Building System
make -j$(($(nproc)+1))

cp -R ~/openwrt_$device/openwrt-$tags/bin ~/openwrt_$device/$date/$time/

#Removing Openwrt Directory
rm -rf ~/openwrt_$device/openwrt-$tags