Problems with PPPoE and unenforced reboots

Dear All,

I just updated an old wndr 3700v2 to OpenWrt SNAPSHOT, ath79 r10012-5c742c86ef (thanks to @hnyman's excellent builds), things looked just fine. But it turns out that due to my flaky DSL connection (line stability changed recently from rock solid to several link losses per day) I am seeing issues with unemforced reboots. I use a BT homehub 5a as my brdged modem (using OpenWrt 18.6) and this acts as expected, it will reaquire sync once the dslam is willing to renegotiate again (and due to recently enabled vectoring the training takes a while). The OpenWrt router acts as expected intially, it will detect the missing PPPoE connection and will shut it down after my configured 12*5 seconds, it will then go sending PADI packets attempting to reconnect again. So far so good, but the very moment the dsl-modem reaches sync status again, the openwrt router simply reboots. I have observed this sequence 3 times today, the last time by disconnecting the dsl-line from the modem and waiting a minute and it seems that changing from ath79 r9933-1325e74e0c to r10012 made the difference from unproblematic automatic reconnects to unintended reboots. I wonder whether https://git.openwrt.org/?p=openwrt/openwrt.git;a=commit;h=06403981e1f2daf6bbcf05f5f2d2205d68240861 an update to ppp might be involved here.

I would love to hear whether anybody else has any ppp/pppoe stability issues, and will, assuming I can confirm a reliable method to force the reboots, try to put this in the issue tracker/the mailing list.

Okay, so I tried:

root@nacktmulle:~# cat /etc/os-release 
NAME="OpenWrt"
VERSION="SNAPSHOT"
ID="openwrt"
ID_LIKE="lede openwrt"
PRETTY_NAME="OpenWrt SNAPSHOT"
VERSION_ID="snapshot"
HOME_URL="http://openwrt.org/"
BUG_URL="http://bugs.openwrt.org/"
SUPPORT_URL="http://forum.lede-project.org/"
BUILD_ID="r10113-085f66de9a"
LEDE_BOARD="ath79/generic"
LEDE_ARCH="mips_24kc"
LEDE_TAINTS="no-all busybox"
LEDE_DEVICE_MANUFACTURER="OpenWrt"
LEDE_DEVICE_MANUFACTURER_URL="http://openwrt.org/"
LEDE_DEVICE_PRODUCT="Generic"
LEDE_DEVICE_REVISION="v0"
LEDE_RELEASE="OpenWrt SNAPSHOT r10113-085f66de9a"

And my, disconnect-the-phone-line-for-1-minute test did not result in a reboot of the pppoe-terminating OpenWrt device. So I am cautiously optimistic that my issue was transient and is solved now, but will keep my eyes open.

1 Like

So, it seems I was too fast, one more unsolicted reboot at the same time the modem resynced

Modem:

root@BTHH5A:~# uptime
 08:30:08 up 4 days,  9:08,  load average: 0.94, 0.23, 0.08
root@BTHH5A:~# /etc/init.d/dsl_control status
ATU-C Vendor ID:                          Broadcom 192.85
ATU-C System Vendor ID:                   Broadcom
Chipset:                                  Lantiq-VRX200
Firmware Version:                         5.9.1.4.0.7
API Version:                              4.17.18.6
XTSE Capabilities:                        0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2
Annex:                                    B
Line Mode:                                G.993.5 (VDSL2 with down- and upstream vectoring)
Profile:                                  17a
Line State:                               UP [0x801: showtime_tc_sync]
Forward Error Correction Seconds (FECS):  Near: 0 / Far: 40
Errored seconds (ES):                     Near: 205 / Far: 2385
Severely Errored Seconds (SES):           Near: 61 / Far: 719
Loss of Signal Seconds (LOSS):            Near: 3 / Far: 0
Unavailable Seconds (UAS):                Near: 4361 / Far: 4361
Header Error Code Errors (HEC):           Near: 0 / Far: 0
Non Pre-emtive CRC errors (CRC_P):        Near: 84 / Far: 0
Pre-emtive CRC errors (CRCP_P):           Near: 0 / Far: 0
Power Management Mode:                    L0 - Synchronized
Latency [Interleave Delay]:               0.13 ms [Fast]   0.0 ms [Fast]
Data Rate:                                Down: 103.963 Mb/s / Up: 31.996 Mb/s
Line Attenuation (LATN):                  Down: 11.7 dB / Up: 9.7 dB
Signal Attenuation (SATN):                Down: 11.7 dB / Up: 9.5 dB
Noise Margin (SNR):                       Down: 8.5 dB / Up: 13.7 dB
Aggregate Transmit Power (ACTATP):        Down: 0.2 dB / Up: 14.5 dB
Max. Attainable Data Rate (ATTNDR):       Down: 112.820 Mb/s / Up: 39.880 Mb/s
Line Uptime Seconds:                      8218
Line Uptime:                              2h 16m 58s

So sync exists since 2h 16, but the modem is up for 4 days.

PPPoE-Terminating router:

root@router:~# uptime
 15:57:44 up  2:19,  load average: 0.00, 0.02, 0.00

So it seems the router again crashed/rebooted around the time of the resync. I guess I will keep a shell with logread -f running and perform a few more resync tests later today.

Okay, logread -f did not show anything interesting, except that the crash only happened shortly after the pppoe link was reestablished. Need to figure out how to better diagnose this....

Okay, I believe https://bugs.openwrt.org/index.php?do=details&task_id=2297#comment6478 is exactly my issue:

cat /sys/kernel/debug/crashlog > /tmp/last_crash_log ; cat /tmp/last_crash_log
[...]
<6>[   41.238185] pppoe-wan: renamed from ppp0
<5>[   49.987165] random: crng init done
<6>[21118.066455] pppoe-wan: renamed from ppp0
<0>[21130.716522] usercopy: kernel memory overwrite attempt detected to c14d3000 (kmalloc-128) (336 bytes)
<1>[21130.725707] Kernel bug detected[#1]:
<1>[21130.729277] CPU: 0 PID: 10369 Comm: fw3 Not tainted 4.14.123 #0
<1>[21130.735178] task: 839bcb00 task.stack: 815fa000
<1>[21130.739695] $ 0   : 00000000 00000001 00000058 00000000
<1>[21130.744916] $ 4   : 8050cd58 8050cd58 805140b0 00005570
<1>[21130.750137] $ 8   : 00000000 00000000 00000001 00000000
<1>[21130.755358] $12   : 00000129 b824e4e6 00000128 00000000
<1>[21130.760579] $16   : c14d3000 00000150 00000000 c14d3150
<1>[21130.765801] $20   : 006df000 00000178 00000000 77f08000
<1>[21130.771021] $24   : 00000003 8026d2d4                  
<1>[21130.776243] $28   : 815fa000 815fbdc0 00000000 80139210
<1>[21130.781465] Hi    : 00000000
<1>[21130.784332] Lo    : 19d60000
<1>[21130.787215] epc   : 80139210 __check_object_size+0x1b0/0x1e0
<1>[21130.792866] ra    : 80139210 __check_object_size+0x1b0/0x1e0
<1>[21130.798506] Status: 1000fc03	KERNEL EXL IE 
<1>[21130.802680] Cause : 10800024 (ExcCode 09)
<1>[21130.806681] PrId  : 00019374 (MIPS 24Kc)
<1>[21130.810593] Modules linked in: ath9k ath9k_common pppoe ppp_async ath9k_hw ath pptp pppox ppp_mppe ppp_generic mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_esp xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_FLOWOFFLOAD xt_DSCP xt_CT xt_CLASSIFY usbserial slhc owl_loader nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack_netlink iptable_raw iptable_mangle iptable_filter ipt_ah ipt_ECN ip_tables crc_ccitt compat sch_cake sch_tbf sch_ingress sch_htb sch_hfsc em_u32 cls_u32
<1>[21130.881448]  cls_tcindex cls_route cls_matchall cls_fw cls_flow cls_basic act_skbedit act_mirred ledtrig_usbport xt_set ip_set_list_set ip_set_hash_netportnet ip_set_hash_netport ip_set_hash_netnet ip_set_hash_netiface ip_set_hash_net ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6t_NPT ip6t_MASQUERADE nf_nat_masquerade_ipv6 nf_nat nf_conntrack nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 msdos ip_gre gre ifb sit tunnel4 ip_tunnel tun vfat fat ntfs hfsplus cifs nls_utf8 nls_iso8859_15 nls_iso8859_1 nls_cp850 nls_cp437 nls_cp1250 sha256_generic
<1>[21130.952017]  sha1_generic md5 md4 hmac ecb des_generic usb_storage ohci_platform ohci_hcd ehci_platform sd_mod scsi_mod ehci_hcd gpio_button_hotplug ext4 mbcache jbd2 usbcore nls_base usb_common crc16 aead crypto_null cryptomgr crc32c_generic crypto_hash
<1>[21130.974527] Process fw3 (pid: 10369, threadinfo=815fa000, task=839bcb00, tls=77f16eb8)
<1>[21130.982418] Stack : 80520000 804bb474 804c3bb4 c14d3000 804b9c74 00000150 00000000 00000150
<1>[21130.990764]         c14d3000 006df028 8051e748 8325deec 82593000 82593040 c14d1000 80500000
<1>[21130.999112]         80500000 80520000 00000041 833ed910 8051e748 83017288 815fbf08 77e89640
<1>[21131.007458]         00000000 006e00c0 80c78040 8051e748 00000000 00000000 6e617400 00000000
<1>[21131.015806]         00000000 00000000 00000000 00000000 00000000 00000000 00000015 00000000
<1>[21131.024153]         ...
<1>[21131.026596] Call Trace:
<1>[21131.029037] [<80139210>] __check_object_size+0x1b0/0x1e0
<1>[21131.034388] [<8325deec>] xt_copy_counters_from_user+0xac/0x1bc [x_tables]
<1>[21131.041185] [<833ed910>] ipt_register_table+0x508/0xd80 [ip_tables]
<1>[21131.047436] Code: 02003825  0c02ce6c  2484b48c <000c000d> 8fb30028  8fb20024  8fb10020  8fb0001c  03e00008 
<1>[21131.057178] 
<4>[21131.059398] ---[ end trace 7c45c619bb1b90b8 ]---

and https://patchwork.ozlabs.org/patch/1112640/ might be the solution, now all I need is to go and test whether that really solves my problem. Which would be easier if I would actually build my own firmwares, instead of simply piggybacking on @hnyman's excellent builds for wndr3X00, I guess it slowly gets to the point where I need to have a go at this....

1 Like

So build a trunk build for the wndr3700v2 with the linked patch, and it seems much stabler. Two hard dsl-line disconnects that in the recent past caused a reboot pretty reliably, did not do so after applying the patch. Since the crash was not 100% guaranteed, I will monitor the router for a few days, but I have a hunch that the issue might be squashed by @hauke's patch.

As of 4.14.125 the patch seems to be part of the official Linux kernel and hence should not need to be carried by OpenWrt any more. I will still wait a few days, though before declaring success... (and will only try >=4.4.125 after the test with the patched r10199 either showed a crash or survived for say a week).