How to debug a do_page_fault/SIGSEGV?

I managed to build a package that I've been using on OpenWrt 19.07.2 for 23.05.4 (using the prebuilt SDK).
Unfortunately it segfaults:

[12834.446270] do_page_fault(): sending SIGSEGV to svd for invalid read access from 0003bd20
[12834.453440] epc = 0003bd20 in svd[400000+11000]
[12834.457357] ra  = 0003bd20 in svd[400000+11000]

I rebuilt the package with debug info and without stripping, but even using the gdb from the SDK on my PC, I cannot see which line corresponds to either 400000 or 11000 (with "info line *").
I cannot install gdb on the device (not enough space), but I think it'd be enough if I could determine which source line is causing the segfault.
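
For reference, as far as I can tell from the kernel's print_vma_addr() helper, the bracketed part of those messages is the start address and the length (both hex) of the nearest file-backed mapping, not an offset into the file. A minimal sketch, assuming that format, which checks whether the reported address really falls inside the mapping; the struct and function names are made up for illustration:

```c
#include <stdbool.h>
#include <stdio.h>

/* One parsed "epc = ADDR in NAME[BASE+SIZE]" line; every number is hex. */
struct fault_line {
    unsigned long addr;   /* reported program counter (epc or ra) */
    char name[32];        /* file backing the reported mapping */
    unsigned long base;   /* start address of that mapping */
    unsigned long size;   /* length of that mapping in bytes */
};

/* Parse the text that follows the dmesg timestamp; true on success. */
static bool parse_fault_line(const char *s, struct fault_line *out)
{
    char reg[8]; /* register label, "epc" or "ra" */
    return sscanf(s, "%7s = %lx in %31[^[][%lx+%lx]", reg, &out->addr,
                  out->name, &out->base, &out->size) == 5;
}

/* True if addr lies inside [base, base + size). */
static bool in_mapping(const struct fault_line *f)
{
    return f->addr >= f->base && f->addr - f->base < f->size;
}

#ifdef DEMO_MAIN /* build with -DDEMO_MAIN for a standalone check */
int main(void)
{
    struct fault_line f;
    if (!parse_fault_line("epc = 0003bd20 in svd[400000+11000]", &f))
        return 1;
    if (in_mapping(&f))
        printf("0x%lx is %s+0x%lx\n", f.addr, f.name, f.addr - f.base);
    else
        printf("0x%lx is outside %s [0x%lx, 0x%lx)\n", f.addr, f.name,
               f.base, f.base + f.size);
    return 0;
}
#endif
```

For the log above, 0x3bd20 lies below a mapping that starts at 0x400000, so it is not an address inside svd's code at all, which would explain why gdb has no line information for it. With epc equal to ra, a jump through a bad pointer is one possible reading, but that is only a guess.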

You can install anything in /tmp if you pay attention to the opkg help text and config files.


That seemed promising, but unfortunately it crashed the router as soon as I started the program in gdb :frowning_face:

Add zram-swap?
What does ubus call system board show?
gdb is a few MB to install plus a few MB to run; both should fit in RAM along with the debugged program.

{
        "kernel": "5.15.162",
        "hostname": "OpenWrt",
        "system": "Danube rev 1.5",
        "model": "Astoria Networks ARV7518PW",
        "board_name": "arcadyan,arv7518pw",
        "rootfs_type": "squashfs",
        "release": {
                "distribution": "OpenWrt",
                "version": "23.05.4",
                "revision": "r24012-d8dd03c46f",
                "target": "lantiq/xway",
                "description": "OpenWrt 23.05.4 r24012-d8dd03c46f"
        }
}

64 MB: likely not enough room for gdb twice over (installed in tmpfs and running) plus a debuggable program. OpenWrt boots into something like 25-30 MB of RAM, so any extra package may be the last straw.

Now I modified the program so that it compiles with a newer version of sofia-sip (where I suspect the segfault lies) but it's weird: it stops without even producing a segfault in dmesg.
Oh, well, I don't have a pressing need to use 23.05.4 (which, even with a stripped down image, seems too big for this device) but I'd have liked to keep my program up to date (this one btw: https://github.com/olivluca/danube-voip)

svd is the culprit...

Are you building with gcc? Are you using -g?

gcc -g -o bestapp bestapp.c

Did anything during the build raise errors or warnings?

-Wall -Wextra -Werror

may help you spot something during the build.

Do you have enough room for valgrind?



Name:
    valgrind
Version:
    3.18.1-1
Description:
    Valgrind is an award-winning suite of tools for debugging and profiling Linux programs. With the tools that come with Valgrind, you can automatically detect many memory management and threading bugs, avoiding hours of frustrating bug-hunting, making your programs more stable. You can also perform detailed profiling, to speed up and reduce memory use of your programs.
Installed size:
    1495kB
Dependencies:
    libc, libpthread, librt
Categories:
    development
Repositories:
    base
Architectures:
    aarch64_cortex-a53, aarch64_cortex-a72, aarch64_generic, arm_cortex-a15_neon-vfpv4, arm_cortex-a7, arm_cortex-a7_neon-vfpv4, arm_cortex-a8_vfpv3, arm_cortex-a9, arm_cortex-a9_neon, i386_pentium4, mips_24kc, mips_mips32, mipsel_24kc, mipsel_24kc_24kf, mipsel_74kc, mipsel_mips32, powerpc_464fp, powerpc_8540, x86_64
OpenWrt release:
    OpenWrt-22.03.0
File size:
    1493kB
License:
    GPL-2.0+
Maintainer:
    Felix Fietkau

If so, recompile with -g and run it under valgrind; this will give you the file and line number of the fault.

Without -g:

==48500== Memcheck, a memory error detector
==48500== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==48500== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==48500== Command: ./crash bang
==48500== 
==48500== Use of uninitialised value of size 8
==48500==    at 0x10919D: main (in /home/hostle/dbg/crash)
==48500== 
==48500== Invalid read of size 4
==48500==    at 0x10919D: main (in /home/hostle/dbg/crash)
==48500==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==48500== 
==48500== 
==48500== Process terminating with default action of signal 11 (SIGSEGV)
==48500==  Access not within mapped region at address 0x0
==48500==    at 0x10919D: main (in /home/hostle/dbg/crash)
==48500==  If you believe this happened as a result of a stack
==48500==  overflow in your program's main thread (unlikely but
==48500==  possible), you can try to increase the size of the
==48500==  main thread stack using the --main-stacksize= flag.
==48500==  The main thread stack size used in this run was 8388608.
==48500== 
==48500== HEAP SUMMARY:
==48500==     in use at exit: 8 bytes in 1 blocks
==48500==   total heap usage: 1 allocs, 0 frees, 8 bytes allocated
==48500== 
==48500== LEAK SUMMARY:
==48500==    definitely lost: 0 bytes in 0 blocks
==48500==    indirectly lost: 0 bytes in 0 blocks
==48500==      possibly lost: 0 bytes in 0 blocks
==48500==    still reachable: 8 bytes in 1 blocks
==48500==         suppressed: 0 bytes in 0 blocks
==48500== Rerun with --leak-check=full to see details of leaked memory
==48500== 
==48500== Use --track-origins=yes to see where uninitialised values come from
==48500== For lists of detected and suppressed errors, rerun with: -s
==48500== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

With -g:

hostle@dani:~/dbg$ valgrind ./crash bang
==48373== Memcheck, a memory error detector
==48373== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==48373== Using Valgrind-3.18.1 and LibVEX; rerun with -h for copyright info
==48373== Command: ./crash bang
==48373== 
==48373== Use of uninitialised value of size 8  <-- 8 bytes ... generally a pointer
==48373==    at 0x10919D: main (cb.c:18)
==48373== 
==48373== Invalid read of size 4   <-- 4 bytes ..probably an uninitialized integer ptr
==48373==    at 0x10919D: main (cb.c:18)
==48373==  Address 0x0 is not stack'd, malloc'd or (recently) free'd
==48373== 
==48373== 
==48373== Process terminating with default action of signal 11 (SIGSEGV)
==48373==  Access not within mapped region at address 0x0
==48373==    at 0x10919D: main (cb.c:18)
==48373==  If you believe this happened as a result of a stack
==48373==  overflow in your program's main thread (unlikely but
==48373==  possible), you can try to increase the size of the
==48373==  main thread stack using the --main-stacksize= flag.
==48373==  The main thread stack size used in this run was 8388608.
==48373== 
==48373== HEAP SUMMARY:
==48373==     in use at exit: 8 bytes in 1 blocks
==48373==   total heap usage: 1 allocs, 0 frees, 8 bytes allocated
==48373== 
==48373== LEAK SUMMARY:
==48373==    definitely lost: 0 bytes in 0 blocks
==48373==    indirectly lost: 0 bytes in 0 blocks
==48373==      possibly lost: 0 bytes in 0 blocks
==48373==    still reachable: 8 bytes in 1 blocks
==48373==         suppressed: 0 bytes in 0 blocks
==48373== Rerun with --leak-check=full to see details of leaked memory
==48373== 
==48373== Use --track-origins=yes to see where uninitialised values come from
==48373== For lists of detected and suppressed errors, rerun with: -s
==48373== ERROR SUMMARY: 2 errors from 2 contexts (suppressed: 0 from 0)
Segmentation fault (core dumped)

It is an 8/64 device (8 MB flash, 64 MB RAM).

Yes, svd is the program I'm maintaining, but the culprit could also be sofia-sip.
I started with my version of sofia-sip which gave the segfault (either svd or sofia-sip).
Now I "fixed" svd so that it compiles with the one from the telephony feed (removing the --disable-stun since I need it) and svd simply stops with no error reported.
I fixed all the compilation warnings for svd (mostly harmless %d instead of %ld format specifiers). I didn't touch sofia-sip (I assume it has been tested somehow, since it's provided in the feeds), which throws some warnings too.
I managed to install valgrind in RAM (as well as sofia-sip and svd; with debugging information they're too big to fit in flash), but unfortunately, just like gdb, it crashes the router.

Just FYI, this is the situation on the router where everything is running smoothly:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/root                 2816      2816         0 100% /rom
tmpfs                    29556      1200     28356   4% /tmp
/dev/mtdblock5            3456      3332       124  96% /overlay
overlayfs:/overlay        3456      3332       124  96% /
tmpfs                      512         0       512   0% /dev


compared to the one with OpenWrt 23.05.4:

Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/root                 4864      4864         0 100% /rom
tmpfs                    28188        64     28124   0% /tmp
/dev/mtdblock5             704       640        64  91% /overlay
overlayfs:/overlay         704       640        64  91% /
tmpfs                      512         0       512   0% /dev

I'm keeping my attempts to build it for 23.05.4 in a different repository.

Of those 704 KB, you need 192 KB to keep the overlay writable.

Even with a minimal image with valgrind integrated (make image PROFILE=arcadyan_arv7518pw PACKAGES="libatomic1 valgrind") it eventually crashes.
I still have to install additional packages in ram (libopenssl3, zlib, sofia-sip, svd, kmod-ltq-ifxos, kmod-ltq-tapi, kmod-ltq-vmmc) because they don't fit in flash, so that may be it.
Probably I should just give up.

Edit: I tried to bundle gdb instead of valgrind but it is too big.

OK, you don't have the memory or the space to debug it on the device. The crash address is 11000 bytes from the start of the svd file; you can probably locate the function and guess which invalid input could put a zero in a parameter that should be an offset(?) into malloc'd memory(?).

That was with sofia-sip 1.12; with 1.13 svd just stops with no SIGSEGV, so without a working gdb I don't know how to see where things go wrong (unless I sprinkle the code with good old-fashioned printf).
Even with the original version (sofia-sip 1.12) I loaded svd into gdb (the mips cross gdb in the sdk) on the host and info line *11000 (the only gdb command I know :sweat_smile:) just told me it had no information about that address.
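
If it does come down to printf, a small tracing macro keeps the sprinkling manageable. This is only a sketch: trace_at() and TRACE() are hypothetical names, not something svd or sofia-sip provides, and it assumes stderr is still attached to something readable (if svd detaches from the terminal, the output would have to go to a file or syslog instead).

```c
#include <stdarg.h>
#include <stdio.h>

/* Write "[file:line] message" plus a newline to out and flush it.
 * stderr is normally unbuffered already; the flush matters when out is a
 * regular file, so the last line survives an abrupt exit. */
static void trace_at(FILE *out, const char *file, int line,
                     const char *fmt, ...)
{
    va_list ap;

    fprintf(out, "[%s:%d] ", file, line);
    va_start(ap, fmt);
    vfprintf(out, fmt, ap);
    va_end(ap);
    fputc('\n', out);
    fflush(out);
}

/* Convenience wrapper that records the call site automatically. */
#define TRACE(...) trace_at(stderr, __FILE__, __LINE__, __VA_ARGS__)

#ifdef DEMO_MAIN /* build with -DDEMO_MAIN for a standalone check */
int main(void)
{
    TRACE("starting, argc-like value=%d", 1);
    TRACE("about to call into the SIP stack");
    return 0;
}
#endif
```

The last line that shows up before svd stops narrows down where to look next; moving the TRACE() calls closer together on each run is a slow but workable bisection.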

Yes, indeed it looks like a dead end. If you somehow manage to reproduce the problem on x86(/64), debuggers may help there.

Hard to say what's causing the crash without any kind of debug info. Printf may be your only way of finding the offending line. However, I suspect it is crashing due to lack of space. From what you describe, I'm guessing you're simply out of memory: a request for memory likely cannot be fulfilled, the original author didn't foresee this happening, and the return value of a malloc is probably not being checked, so the program dies a horrible death instead of exiting gracefully and cleaning up.

Edit:
Just had a look at the output you posted... it seems the one that is running smoothly, as you say, is in fact using more memory than the suspect. This leads me to believe you may be missing a dependency of some sort, or that svd conflicts with an upstream change made to the firmware.
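
To illustrate the unchecked-allocation failure mode described above, here is a minimal sketch of a wrapper that reports a failed allocation instead of letting a NULL pointer travel further. checked_malloc() is a hypothetical helper for illustration, not a function from svd or sofia-sip.

```c
#include <stdio.h>
#include <stdlib.h>

/* Allocate n bytes; on failure, say what was being allocated and return
 * NULL so the caller can bail out cleanly instead of dereferencing it. */
static void *checked_malloc(size_t n, const char *what)
{
    void *p = malloc(n);

    if (p == NULL)
        fprintf(stderr, "out of memory: %zu bytes for %s\n", n, what);
    return p;
}

#ifdef DEMO_MAIN /* build with -DDEMO_MAIN for a standalone check */
int main(void)
{
    char *buf = checked_malloc(256, "demo buffer");

    if (buf == NULL)
        return 1; /* exit gracefully rather than crash later */
    free(buf);
    return 0;
}
#endif
```

One caveat: on Linux a malloc() can succeed and the process still be killed later, when the pages are actually touched and the system runs out of memory. A return-value check cannot catch that case, although it normally leaves an out-of-memory message in dmesg.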


Well, the new one, without running anything apart from the basic openwrt tasks, is using 18M of ram, the old one, with svd and luci, is using 16M.

Old one

root@voip:~# free
              total        used        free      shared  buff/cache   available
Mem:          59112       16184       28956        1200       13972       23932
Swap:             0           0           0

New one

root@OpenWrt:~# free
              total        used        free      shared  buff/cache   available
Mem:          56376       18180       23760          64       14436       18660
Swap:             0           0           0

I'm not 100% sure, but I think that memory allocations are checked and a diagnostic message is printed when one fails.

Yes sir... a segfault and a core dump, lol. You could try to save the core dump and load it in gdb separately.
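
A sketch of how the program could ask for a core dump itself, by raising its soft core-size limit to the hard limit at startup. enable_core_dumps() is a hypothetical helper; whether a core file is actually written still depends on assumptions outside the program: the kernel must be built with core dump support, and /proc/sys/kernel/core_pattern must point somewhere writable with enough room (on this device that means /tmp, so the core competes for the same RAM).

```c
#include <stdio.h>
#include <sys/resource.h>

/* Raise the soft RLIMIT_CORE to the hard limit.
 * Returns 0 on success, -1 on failure (errno is set by the failing call). */
static int enable_core_dumps(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_CORE, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max; /* an unprivileged process may go this far */
    return setrlimit(RLIMIT_CORE, &rl);
}

#ifdef DEMO_MAIN /* build with -DDEMO_MAIN for a standalone check */
int main(void)
{
    if (enable_core_dumps() != 0) {
        perror("enable_core_dumps");
        return 1;
    }
    puts("core limit raised to the hard limit");
    return 0;
}
#endif
```

If a core file does appear, it can be copied to the PC and opened with the SDK's cross gdb together with the unstripped svd binary, so gdb never has to run on the router.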