X86 router, occasional high load, connection loss, PCIe link lost, device now detached

Hi All,

Background:

Longtime lurker and user posting a troubleshooting question because I'm at a loss as to where to look next for a fix. I've got an x86 router running OpenWrt 19.07.7, a Dell T1600 specifically, its built in NIC is an Intel 82579LM and I've got two four port pci-e 82575GB NICs. I've got a few Mikrotik hap ac^2’s in AP bridge mode connected to the pci-e NICs and the WAN comes in on the integrated NIC. All pci-e ports are a bridged LAN. APs should be outside the scope of this question but including it for completeness.

Issue:

It can be a few days or an hour after a reboot that the load increases as high as 2.5 as reported by luci. Network is very quiet so the load is indicative of some issue. Checking the running procs shows nothing out of the ordinary. I lose internet connection and even connecting to the router via luci is intermittent and very slow at best. SSH seems to be OK. System and kernel logs are below, parsing the logs I found


kern.err kernel: [787860.921875] igb 0000:03:00.1 eth1: PCIe link lost, device now detached

Seeing backtraces in the logs surely is a bad sign, but I don’t understand the cause of losing PCIe link. I've reinstalled OpenWRT a couple times, the issue persists. A reboot will fix the issue temporarily for as long as 9 days or as short as 1 hour.

Any pointers on where to debug next? Any important info I'm leaving out that I can add to help find the issue? Post character limit is preventing full logs so posting a selection. Thanks for the assistance.

Network info:


root@OpenWrt:~# cat /etc/config/network

config interface 'loopback'

option ifname 'lo'

option proto 'static'

option ipaddr '127.0.0.1'

option netmask '255.0.0.0'

config globals 'globals'

option ula_prefix 'fd78:1725:c466::/48'

config interface 'lan'

option type 'bridge'

option proto 'static'

option ipaddr '192.168.1.1'

option netmask '255.255.255.0'

option ip6assign '60'

option ifname 'eth0 eth1 eth2 eth3 eth4 eth5 eth6 eth7'

config interface 'wan'

option proto 'dhcp'

option ifname 'eth8'

config interface 'wan6'

option proto 'dhcpv6'

option reqaddress 'try'

option reqprefix 'auto'

option ifname 'eth8'

System log at Mon Mar 22 02:36:49 2021 PCIe link lost, device now detached and it goes downhill from there:


Mon Mar 22 02:13:00 2021 daemon.info dnsmasq-dhcp[3884]: DHCPACK(br-lan) 192.168.1.172 ce:2c:4f:87:97:95 bobsiPhone

Mon Mar 22 02:21:30 2021 daemon.info dnsmasq-dhcp[3884]: DHCPREQUEST(br-lan) 192.168.1.201 92:be:95:63:30:39

Mon Mar 22 02:21:30 2021 daemon.info dnsmasq-dhcp[3884]: DHCPACK(br-lan) 192.168.1.201 92:be:95:63:30:39 bills-iPhone

Mon Mar 22 02:36:49 2021 kern.err kernel:   [787860.921875] igb 0000:03:00.1 eth1: PCIe link lost, device now detached

Mon Mar 22 02:36:50 2021 kern.err kernel: [787861.961340] igb 0000:03:00.0 eth0: PCIe link lost, device now detached

Mon Mar 22 02:36:53 2021 kern.warn kernel: [787865.645361] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:36:59 2021 kern.warn kernel: [787871.645379] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:37:05 2021 kern.warn kernel: [787877.645457] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:37:11 2021 kern.warn kernel: [787883.645208] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:37:17 2021 kern.warn kernel: [787889.645186] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:37:23 2021 kern.warn kernel: [787895.645180] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:37:29 2021 kern.warn kernel: [787901.645174] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:37:35 2021 kern.warn kernel: [787907.645158] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:37:41 2021 kern.warn kernel: [787913.645159] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:37:47 2021 kern.warn kernel: [787919.645144] igb 0000:03:00.0: clearing Rx timestamp hang

Mon Mar 22 02:37:51 2021 kern.err kernel: [787923.585016] INFO: rcu_sched self-detected stall on CPU

Mon Mar 22 02:37:51 2021 kern.err kernel: [787923.590207] 3-...: (6000 ticks this GP) idle=4a6/140000000000001/0 softirq=6492872/6492872 fqs=2997

Mon Mar 22 02:37:51 2021 kern.err kernel: [787923.595024] INFO: rcu_sched detected stalls on CPUs/tasks:

Mon Mar 22 02:37:51 2021 kern.err kernel: [787923.599452] (t=6002 jiffies g=1699568 c=1699567 q=73)

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.610265] NMI backtrace for cpu 3

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.613817] CPU: 3 PID: 46 Comm: kworker/3:1 Not tainted 4.14.221 #0

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.620217] Hardware name: Dell Inc. Precision T1600/06NWYK, BIOS A15 03/14/2013

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.627654] Workqueue: events 0xffffffffa0037ad0 [igb@ffffffffa0030000+0x20000]

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.635010] Call Trace:

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.637527] <IRQ>

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.639611] 0xffffffff81719620

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.642814] 0xffffffff8170a92a

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.646017] ? 0xffffffff810432b0

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.649394] 0xffffffff8170a991

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.652598] 0xffffffff81043304

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.655801] 0xffffffff8171763e

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.659004] 0xffffffff810c7aa5

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.662210] ? 0xffffffff810a4dd6

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.665588] ? 0xffffffff810a5012

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.668966] 0xffffffff810ca48a

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.672169] 0xffffffff810d7161

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.675376] 0xffffffff810cadaf

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.678579] 0xffffffff810caf56

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.681782] 0xffffffff818024f2

Mon Mar 22 02:37:51 2021 kern.warn kernel: [787923.684985]

kernel log at 257078.573553 NIC link down and goes downhill from there:


[ 14.875377] br-lan: port 8(eth7) entered forwarding state

[ 16.766174] e1000e: eth8 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None

[ 16.773763] IPv6: ADDRCONF(NETDEV_CHANGE): eth8: link becomes ready

[257078.573553] igb 0000:0a:00.0 eth6: igb: eth6 NIC Link is Down

[257078.579798] br-lan: port 7(eth6) entered disabled state

[257482.882868] igb 0000:0a:00.0 eth6: igb: eth6 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX/TX

[257482.892711] br-lan: port 7(eth6) entered blocking state

[257482.898068] br-lan: port 7(eth6) entered forwarding state

[257517.662346] igb 0000:0a:00.0 eth6: igb: eth6 NIC Link is Down

[257517.668630] br-lan: port 7(eth6) entered disabled state

[257522.402795] igb 0000:0a:00.0 eth6: igb: eth6 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: RX

[257522.412431] br-lan: port 7(eth6) entered blocking state

[257522.417788] br-lan: port 7(eth6) entered forwarding state

[433709.562269] conntrack: generic helper won't handle protocol 47. Please consider loading the specific helper module.

[787860.921875] igb 0000:03:00.1 eth1: PCIe link lost, device now detached

[787861.961340] igb 0000:03:00.0 eth0: PCIe link lost, device now detached

[787865.645361] igb 0000:03:00.0: clearing Rx timestamp hang

[787871.645379] igb 0000:03:00.0: clearing Rx timestamp hang

[787877.645457] igb 0000:03:00.0: clearing Rx timestamp hang

[787883.645208] igb 0000:03:00.0: clearing Rx timestamp hang

[787889.645186] igb 0000:03:00.0: clearing Rx timestamp hang

[787895.645180] igb 0000:03:00.0: clearing Rx timestamp hang

[787901.645174] igb 0000:03:00.0: clearing Rx timestamp hang

[787907.645158] igb 0000:03:00.0: clearing Rx timestamp hang

[787913.645159] igb 0000:03:00.0: clearing Rx timestamp hang

[787919.645144] igb 0000:03:00.0: clearing Rx timestamp hang

[787923.585016] INFO: rcu_sched self-detected stall on CPU

[787923.590207] 3-...: (6000 ticks this GP) idle=4a6/140000000000001/0 softirq=6492872/6492872 fqs=2997

[787923.595024] INFO: rcu_sched detected stalls on CPUs/tasks:

[787923.599452] (t=6002 jiffies g=1699568 c=1699567 q=73)

[787923.610265] NMI backtrace for cpu 3

[787923.613817] CPU: 3 PID: 46 Comm: kworker/3:1 Not tainted 4.14.221 #0

[787923.620217] Hardware name: Dell Inc. Precision T1600/06NWYK, BIOS A15 03/14/2013

[787923.627654] Workqueue: events 0xffffffffa0037ad0 [igb@ffffffffa0030000+0x20000]

[787923.635010] Call Trace:

[787923.637527] <IRQ>

[787923.639611] 0xffffffff81719620

[787923.642814] 0xffffffff8170a92a

[787923.646017] ? 0xffffffff810432b0

[787923.649394] 0xffffffff8170a991

[787923.652598] 0xffffffff81043304

[787923.655801] 0xffffffff8171763e

[787923.659004] 0xffffffff810c7aa5

[787923.662210] ? 0xffffffff810a4dd6

[787923.665588] ? 0xffffffff810a5012

[787923.668966] 0xffffffff810ca48a

[787923.672169] 0xffffffff810d7161

[787923.675376] 0xffffffff810cadaf

[787923.678579] 0xffffffff810caf56

[787923.681782] 0xffffffff818024f2

[787923.684985] 0xffffffff81801b10

[787923.688189] </IRQ>

[787923.690358] RIP: 0010:0xffffffff8171476f

[787923.694338] RSP: 0018:ffffc900001b3cf8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10

[787923.701942] RAX: 0000000037b9bc5c RBX: ffff8882235b0de0 RCX: 0008a88737b8fbec

[787923.709112] RDX: 000000000008a887 RSI: 0000000000000003 RDI: 0000000000025c17

[787923.716287] RBP: ffffc900001b3cf8 R08: 0000000000000000 R09: 0000000000000311

[787923.723458] R10: ffffc900001b3e48 R11: 0000000000000000 R12: 0000000000001186

[787923.730632] R13: 0000000000002001 R14: ffffc900001b3dbe R15: 0000000000000000

[787923.737809] 0xffffffff817146bc

[787923.741015] 0xffffffffa0043dda [igb@ffffffffa0030000+0x20000]

[787923.746894] 0xffffffffa00408b5 [igb@ffffffffa0030000+0x20000]

[787923.752779] 0xffffffffa00400db [igb@ffffffffa0030000+0x20000]

[787923.758660] 0xffffffffa004541f [igb@ffffffffa0030000+0x20000]

[787923.764539] 0xffffffffa0045a31 [igb@ffffffffa0030000+0x20000]

[787923.770418] 0xffffffffa0043cd7 [igb@ffffffffa0030000+0x20000]

[787923.776299] ? 0xffffffff810ca1dd

[787923.779676] 0xffffffffa0041521 [igb@ffffffffa0030000+0x20000]

[787923.785555] 0xffffffffa003666b [igb@ffffffffa0030000+0x20000]

[787923.791433] 0xffffffffa0037b01 [igb@ffffffffa0030000+0x20000]

[787923.797312] 0xffffffff810969e4

[787923.800515] 0xffffffff81096d4b

[787923.803718] 0xffffffff8109bbb2

[787923.806921] ? 0xffffffff81096b40

[787923.810296] ? 0xffffffff8109baa0

[787923.813673] 0xffffffff81800205

[787923.816880] 3-...: (6000 ticks this GP) idle=4a6/140000000000001/0 softirq=6492872/6492872 fqs=2997

[787923.826124] (detected by 0, t=6002 jiffies, g=1699568, c=1699567, q=73)

[787923.832873] Sending NMI from CPU 0 to CPUs 3:

[787923.837314] NMI backtrace for cpu 3

[787923.837314] CPU: 3 PID: 46 Comm: kworker/3:1 Not tainted 4.14.221 #0

[787923.837314] Hardware name: Dell Inc. Precision T1600/06NWYK, BIOS A15 03/14/2013

[787923.837315] Workqueue: events 0xffffffffa0037ad0 [igb@ffffffffa0030000+0x20000]

[787923.837315] task: ffff88822496af00 task.stack: ffffc900001b0000

[787923.837316] RIP: 0010:0xffffffff8171476f

[787923.837316] RSP: 0018:ffffc900001b3cf8 EFLAGS: 00000246

[787923.837316] RAX: 00000000663b97d1 RBX: ffff8882235b0de0 RCX: 0008a887663980ce

[787923.837317] RDX: 000000000008a887 RSI: 0000000000000003 RDI: 0000000000025c17

[787923.837317] RBP: ffffc900001b3cf8 R08: 0000000000000000 R09: 0000000000000311

[787923.837317] R10: ffffc900001b3e48 R11: 0000000000000000 R12: 0000000000001265

[787923.837318] R13: 0000000000002001 R14: ffffc900001b3dbe R15: 0000000000000000

[787923.837318] FS: 0000000000000000(0000) GS:ffff88822dcc0000(0000) knlGS:0000000000000000

[787923.837318] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033

[787923.837318] CR2: 00007ffec3614008 CR3: 0000000001e08005 CR4: 00000000000606e0

[787923.837319] Call Trace:

[787923.837319] 0xffffffff817146bc

[787923.837319] 0xffffffffa0043dda [igb@ffffffffa0030000+0x20000]

[787923.837319] 0xffffffffa00408b5 [igb@ffffffffa0030000+0x20000]

[787923.837320] 0xffffffffa00400db [igb@ffffffffa0030000+0x20000]

[787923.837320] 0xffffffffa004541f [igb@ffffffffa0030000+0x20000]

[787923.837320] 0xffffffffa0045a31 [igb@ffffffffa0030000+0x20000]

[787923.837321] 0xffffffffa0043cd7 [igb@ffffffffa0030000+0x20000]

[787923.837321] ? 0xffffffff810ca1dd

[787923.837321] 0xffffffffa0041521 [igb@ffffffffa0030000+0x20000]

[787923.837321] 0xffffffffa003666b [igb@ffffffffa0030000+0x20000]

[787923.837322] 0xffffffffa0037b01 [igb@ffffffffa0030000+0x20000]

[787923.837322] 0xffffffff810969e4

[787923.837322] 0xffffffff81096d4b

[787923.837322] 0xffffffff8109bbb2

[787923.837322] ? 0xffffffff81096b40

[787923.837323] ? 0xffffffff8109baa0

[787923.837323] 0xffffffff81800205

[787923.837323] Code: 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 55 48 89 e5 65 8b 35 d5 a9 8f 7e 0f ae e8 0f 31 48 c1 e2 20 48 89 d1 48 09 c1 0f ae e8 0f 31 <48> c1 e2 20 48 09 d0 48 89 c2 48 29 ca 48 39 fa 73 15 f3 90 65

[787925.645139] igb 0000:03:00.0: clearing Rx timestamp hang

[787931.645139] igb 0000:03:00.0: clearing Rx timestamp hang

[787937.645113] igb 0000:03:00.0: clearing Rx timestamp hang

[787943.645117] igb 0000:03:00.0: clearing Rx timestamp hang

[787949.645099] igb 0000:03:00.0: clearing Rx timestamp hang

[787955.645164] igb 0000:03:00.0: clearing Rx timestamp hang

[787961.645106] igb 0000:03:00.0: clearing Rx timestamp hang

[787967.645099] igb 0000:03:00.0: clearing Rx timestamp hang

[787973.645084] igb 0000:03:00.0: clearing Rx timestamp hang

[787979.645065] igb 0000:03:00.0: clearing Rx timestamp hang

[787985.645053] igb 0000:03:00.0: clearing Rx timestamp hang

[787991.645074] igb 0000:03:00.0: clearing Rx timestamp hang

[787997.645065] igb 0000:03:00.0: clearing Rx timestamp hang

[788003.645151] igb 0000:03:00.0: clearing Rx timestamp hang

[788009.645043] igb 0000:03:00.0: clearing Rx timestamp hang

[788015.645059] igb 0000:03:00.0: clearing Rx timestamp hang

[788021.645039] igb 0000:03:00.0: clearing Rx timestamp hang

[788027.645031] igb 0000:03:00.0: clearing Rx timestamp hang

[788033.645027] igb 0000:03:00.0: clearing Rx timestamp hang

[788039.644997] igb 0000:03:00.0: clearing Rx timestamp hang

[788045.645002] igb 0000:03:00.0: clearing Rx timestamp hang

[788051.645002] igb 0000:03:00.0: clearing Rx timestamp hang

[788057.644990] igb 0000:03:00.0: clearing Rx timestamp hang

[788063.644970] igb 0000:03:00.0: clearing Rx timestamp hang

[788069.644978] igb 0000:03:00.0: clearing Rx timestamp hang

[788075.644976] igb 0000:03:00.0: clearing Rx timestamp hang

[788081.644970] igb 0000:03:00.0: clearing Rx timestamp hang

[788087.644954] igb 0000:03:00.0: clearing Rx timestamp hang

[788093.644937] igb 0000:03:00.0: clearing Rx timestamp hang

[788099.644951] igb 0000:03:00.0: clearing Rx timestamp hang

[788103.634811] INFO: rcu_sched self-detected stall on CPU

[788103.640028] 3-...: (23964 ticks this GP) idle=4a6/140000000000001/0 softirq=6492872/6492872 fqs=11973

Start by using another OS/distro, if you want something simple and quick try opnsense or pfsense (different code base and quick to setup). If you're seeing the same issues you most likely have a hardware issue, if it works you're probably looking at a software issue of some sort. You should be able to at least temporary run these off a USB stick (resonably fast one).

2 Likes

PasteBin?

  • What's plugged into eth0 and eth1?
  • At any time, are you plugging, unplugging or powering equipment on those ports?

:+1:

Yea, I'd test and see if it happened elsewhere.

1 Like