PPPoE disconnects v17.01.0

Hey,

I've got a Linksys WRT1900ac v1 and i've noticed that my pppoe connection dies when the following circumstance occurs:

PC 1 connected via Lan - Gaming Overwatch (online)

PC 2 connected via Wlan - Live Video Streaming

After a random amount of time (between a few seconds and 10 minutes) pppoe drops.

If i revert to stock i don't get disconnects however i get the occasional spike (probably due to shit Qos on the stock linksys FW)

Here are the logs if it can help:

Mon Mar 13 12:25:00 2017 cron.info crond[1189]: USER root pid 4091 cmd /sbin/fan_ctrl.sh
Mon Mar 13 12:29:12 2017 daemon.info hostapd: wlan0: STA 32:b5:c2:74:65:12 WPA: group key handshake completed (RSN)
Mon Mar 13 12:29:12 2017 daemon.info hostapd: wlan0: STA 32:b5:c2:15:79:75 WPA: group key handshake completed (RSN)
Mon Mar 13 12:29:12 2017 daemon.info hostapd: wlan0: STA 30:b5:c2:6e:2d:59 WPA: group key handshake completed (RSN)
Mon Mar 13 12:29:12 2017 daemon.info hostapd: wlan0: STA a0:32:99:03:ea:12 WPA: group key handshake completed (RSN)
Mon Mar 13 12:30:00 2017 cron.info crond[1189]: USER root pid 4096 cmd /sbin/fan_ctrl.sh
Mon Mar 13 12:30:15 2017 daemon.info hostapd: wlan1: STA dc:0b:34:f1:f3:da WPA: group key handshake completed (RSN)
Mon Mar 13 12:30:36 2017 daemon.info dnsmasq-dhcp[4033]: DHCPINFORM(br-lan) 192.168.1.110 32:b5:c2:74:65:12
Mon Mar 13 12:30:36 2017 daemon.info dnsmasq-dhcp[4033]: DHCPACK(br-lan) 192.168.1.110 32:b5:c2:74:65:12 L-TLV-13000550
Mon Mar 13 12:35:00 2017 cron.info crond[1189]: USER root pid 4102 cmd /sbin/fan_ctrl.sh
Mon Mar 13 12:36:10 2017 daemon.info pppd[3918]: No response to 5 echo-requests
Mon Mar 13 12:36:10 2017 daemon.notice pppd[3918]: Serial link appears to be disconnected.
Mon Mar 13 12:36:10 2017 daemon.info pppd[3918]: Connect time 34.6 minutes.
Mon Mar 13 12:36:10 2017 daemon.info pppd[3918]: Sent 22159848 bytes, received 490167042 bytes.
Mon Mar 13 12:36:10 2017 daemon.notice netifd: Network device 'pppoe-wan' link is down
Mon Mar 13 12:36:10 2017 daemon.notice pppd[3918]: Connection terminated.
Mon Mar 13 12:36:10 2017 daemon.notice netifd: Interface 'wan' has lost the connection
Mon Mar 13 12:36:10 2017 daemon.warn dnsmasq[4033]: no servers found in /tmp/resolv.conf.auto, will retry
Mon Mar 13 12:36:10 2017 daemon.info pppd[3918]: Sent PADT
Mon Mar 13 12:36:10 2017 daemon.info pppd[3918]: Exit.
Mon Mar 13 12:36:10 2017 daemon.notice netifd: Interface 'wan' is now down
Mon Mar 13 12:36:10 2017 daemon.notice netifd: Interface 'wan' is setting up now
Mon Mar 13 12:36:11 2017 daemon.info pppd[4141]: Plugin rp-pppoe.so loaded.
Mon Mar 13 12:36:11 2017 daemon.info pppd[4141]: RP-PPPoE plugin version 3.8p compiled against pppd 2.4.7
Mon Mar 13 12:36:11 2017 daemon.notice pppd[4141]: pppd 2.4.7 started by root, uid 0
Mon Mar 13 12:36:11 2017 daemon.err pppd[4141]: Interface eth1 has MTU of 1492 -- should be at least 1500.
Mon Mar 13 12:36:11 2017 daemon.err pppd[4141]: This may cause serious connection problems.
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq[4033]: exiting on receipt of SIGTERM
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq[4202]: started, version 2.76 cachesize 150
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq[4202]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP no-DHCPv6 no-Lua TFTP no-conntrack no-ipset no-auth no-DNSSEC no-ID loop-detect inotify
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq[4202]: DNS service limited to local subnets
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq-dhcp[4202]: DHCP, IP range 192.168.1.100 -- 192.168.1.249, lease time 12h
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq[4202]: using local addresses only for domain lan
Mon Mar 13 12:36:13 2017 daemon.warn dnsmasq[4202]: no servers found in /tmp/resolv.conf.auto, will retry
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq[4202]: read /etc/hosts - 4 addresses
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq[4202]: read /tmp/hosts/odhcpd - 1 addresses
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq[4202]: read /tmp/hosts/dhcp.cfg02411c - 2 addresses
Mon Mar 13 12:36:13 2017 daemon.info dnsmasq-dhcp[4202]: read /etc/ethers - 0 addresses
Mon Mar 13 12:36:16 2017 daemon.info pppd[4141]: PPP session is 2707
Mon Mar 13 12:36:16 2017 daemon.warn pppd[4141]: Connected to 00:30:88:1b:d8:b7 via interface eth1
Mon Mar 13 12:36:16 2017 kern.info kernel: [19640.322205] pppoe-wan: renamed from ppp0
Mon Mar 13 12:36:16 2017 daemon.info pppd[4141]: Using interface pppoe-wan
Mon Mar 13 12:36:16 2017 daemon.notice pppd[4141]: Connect: pppoe-wan <--> eth1
Mon Mar 13 12:36:16 2017 daemon.notice pppd[4141]: PAP authentication succeeded
Mon Mar 13 12:36:16 2017 daemon.notice pppd[4141]: peer from calling number 00:30:88:1B:D8:B7 authorized
Mon Mar 13 12:36:16 2017 daemon.notice pppd[4141]: local IP address 79.181.77.211
Mon Mar 13 12:36:16 2017 daemon.notice pppd[4141]: remote IP address 79.181.64.1
Mon Mar 13 12:36:16 2017 daemon.notice pppd[4141]: primary DNS address 62.219.186.7
Mon Mar 13 12:36:16 2017 daemon.notice pppd[4141]: secondary DNS address 192.117.235.237
Mon Mar 13 12:36:16 2017 daemon.notice netifd: Network device 'pppoe-wan' link is up
Mon Mar 13 12:36:16 2017 daemon.notice netifd: Interface 'wan' is now up
Mon Mar 13 12:36:16 2017 daemon.info dnsmasq[4202]: reading /tmp/resolv.conf.auto
Mon Mar 13 12:36:16 2017 daemon.info dnsmasq[4202]: using local addresses only for domain lan
Mon Mar 13 12:36:16 2017 daemon.info dnsmasq[4202]: using nameserver 62.219.186.7#53
Mon Mar 13 12:36:16 2017 daemon.info dnsmasq[4202]: using nameserver 192.117.235.237#53
Mon Mar 13 12:36:17 2017 user.notice firewall: Reloading firewall due to ifup of wan (pppoe-wan)
Mon Mar 13 12:36:19 2017 daemon.info dnsmasq[4202]: exiting on receipt of SIGTERM
Mon Mar 13 12:36:19 2017 daemon.info dnsmasq[4318]: started, version 2.76 cachesize 150
Mon Mar 13 12:36:19 2017 daemon.info dnsmasq[4318]: compile time options: IPv6 GNU-getopt no-DBus no-i18n no-IDN DHCP no-DHCPv6 no-Lua TFTP no-conntrack no-ipset no-auth no-DNSSEC no-ID loop-detect inotify
Mon Mar 13 12:36:19 2017 daemon.info dnsmasq[4318]: DNS service limited to local subnets
Mon Mar 13 12:36:19 2017 daemon.info dnsmasq-dhcp[4318]: DHCP, IP range 192.168.1.100 -- 192.168.1.249, lease time 12h

Your router is sending "echo requests" to the device at the other end of the line, and not receiving any answer:
Mon Mar 13 12:36:10 2017 daemon.info pppd[3918]: No response to 5 echo-requests

You may want to send more echo requests before giving up, or even disabling it completely, but then PPPoE will not restart the connection when it goes down. Have a look at the "keepalive" parameter in your configuration.

Cool that's some great info there!

Is it possible my connection does not crash on the stock firmware because there are more echo requests configured there?
Any idea how I can check what the setting is in the stock FW?

Edit: In my stock FW there is a "redial period" setting which is set to 20 seconds, I guess thats the equivalent for the echo request? But how many echo requests are 20 seconds? Must be quite a lot O_o
Do you think this issue could be an indication of a faulty modem?

I guess "redial period" is the period between "redials" (or re-connections); I do not think this is the same. Perhaps the original firmware sends more echo requests (or more separated) before giving up, perhaps it does not send them at all.

After sniffing around the internet there seems to be a bug with LuCi interface these are the default settings for the echo failure threshold (notice they are set to 0)

+LCP echo failure threshold - set to 0 which ignores failures (default)
+LCP echo interval - set to 5 (default)

however somehow the pppoe config ignores the value 0 and just uses the value 5 instead, this is what causes the pppoe disconnects...
Apparently setting the echo failure threshold to a large number (like 500) should resolve the issue.

This happens when you are on shitty copper wiring (like me) that is prone to interference.

I'll test this workaround and will update this thread if it is resolved (fingers crossed), so other poor souls won't go through the same ordeal as i had to.
And maybe LEDE developers can fix this stupid bug in the future.

Yesterday i tested the LEDE firmware for my WRT1900ac with the new parameters for the LCP echo threshhold (i set it to 500) and tested it for a few hours with gaming + open stream at the same time.
The result was absolutely no disconnects! HOWEVER my online game was much less responsive and I had 1-5 second long lags and rubber banding while sqm with cake was doing the qos.
After this test i switched back to the stock Linksys FW and ran the same test again with an overloaded network and had a much more fluent gaming experience with absolutely no lags, hiccups etc.

So my conclusion can only be that the LEDE FW is much less forgiving than the Linksys when it comes to coping with a mediocre pppoe line.
Ofc the LEDE can be probably tweaked so it will perform better but i lack the knowledge to do that.

Bottom line the stock FW does its job well however it only supports qos on downstream (Linksys removed upstream qos a few FW releases ago go figure....) so as soon as someone uploads anything in the network my ping goes to hell....

Well, that or sqm has issues on routers where the ingress is coming via a switch port... Could you try again with setting the download shaper speed in luci-app-sq to zero, please? We have a standing issue currently when running on machines where a switch buffers the ingress traffic, and cake might resort to extreme dropping to alleviate that problem which in turn might take out sufficient consecutive LCP packets to disconnect the PPPoE link... just a wild hypothesis though...

I'll try to do what you said and i'll revert the LCP setting to default?

Should i set both downstream and upstream to 0?

Will that not render SQM useless?

You are right, setting both directions to 0 will effectively disable sqm. So in your case the hypothesis is that ingress shaping goes pear shaped, so by disabling that you should still get decent upstream debloating and at least not additional delay on downstream. Effectively this testing whether the ingess shaping is causing your issues.

About LCP that should not matter.... Once the unwanted dropping is disabled.

Was a solution for this problem ever found? I think im experiencing the same thing on 18.06.5

2 Likes