SQM, Bufferbloat, Maxed Bandwidth, (hfsc) Kernel log error, locks router



Helping out a non-profit organization work through some long-standing (WiFi) networking issues, I was asked to come in and evaluate their existing network setup: primarily wireless clients (10-15) daily, a combination of workstations and external device clients ( phone, tablet, laptop ) connecting to a single (1) WiFi router:

Service provider: Time Warner (aka Spectrum) Cable internet, business class, "Turbo" 20/2, with a separate Arris modem providing 2 line Telephony into building,

Existing issues cleaned up to-date:

Missing 75 ohm terminator on (3) way splitter in primary entrance point / electrical closet,
Inspecting, re-seating, & re-tightening all barrel connectors inside building,
Replacing any/all suspect network patch cables with non-molded connector ends,

Internet modem is ubee DW365, a basic 8 channel down / 4 channel up DOCSIS 3.0 unit, DNS off, DHCP off, Firewall set to (low), modem static IP is,

At the start, the modem log was showing several hundred (correctable) errors in connection stats. Following the above cabling maintenance & a modem restart, the modem log currently shows zero (0) errors in stats after 72 hours running.

Existing WiFi router on-site: Netgear R7000 Nighthawk - currently disconnected for testing,

Replacement test router installed on-site: Linksys WRT1900ACS v2, purchased 5/2017,
Bridging to ubee modem using Static IP, internal (LAN) IP is,

Running Davidc502 LEDE Snapshot from his site:

Kernel version 4.9.34
WiFi driver

NEW 06/28/2017 New builds have been released and are ready for download.


QOS settings are on, and set for 20/2 bandwidth,

SQM QOS setting on, and set for 87.5 % Ingress / Egress: 17,920 / 1,792

Queue Discipline is: cake, piece_of_cake,

Link Layer Adaptation is: Ethernet, Per Packet overhead is: 18,

per recommended settings in: https://lede-project.org/docs/howto/sqm
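
For anyone checking the numbers, here is a minimal sketch of how the 17,920 / 1,792 figures fall out of the 20/2 plan at 87.5 %. It assumes 1 Mbit/s is counted as 1024 kbit/s (that is the only multiplier that reproduces the values above); the function and variable names are made up for illustration and are not part of sqm-scripts.

```shell
#!/bin/sh
# Sketch: derive SQM shaper rates (kbit/s) from an advertised plan rate.
# Assumption: 1 Mbit/s = 1024 kbit/s, which reproduces 17,920 / 1,792.
sqm_rate_kbit() {
    # $1 = plan rate in Mbit/s
    # $2 = share of the plan in tenths of a percent (875 = 87.5 %)
    echo $(( $1 * 1024 * $2 / 1000 ))
}

DOWN_MBIT=20    # advertised download rate
UP_MBIT=2       # advertised upload rate
PERMILLE=875    # 87.5 %

echo "ingress: $(sqm_rate_kbit "$DOWN_MBIT" "$PERMILLE") kbit/s"
echo "egress:  $(sqm_rate_kbit "$UP_MBIT" "$PERMILLE") kbit/s"
```

The percentage is passed in tenths of a percent only because POSIX shell arithmetic is integer-only.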


The Linksys WRT1900ACS router is responsible for handling all WiFi traffic currently.

Primarily (5) fixed workstation clients, (4) WiFi, (1) LAN, (3) WiFi Printers,

(7 - 10) External Devices: Phone, Tablet, Laptop , ranging from older (Android) phones to newest iPhone / iPad (5G) models.

Symptoms, and Issue in Kernel Log occurs when:

(1) One workstation, a Wireless N client, is Uploading High-Definition videos to YouTube: this workstation is maxing out Egress bandwidth,

(1) One workstation, a local LAN wired client, is playing YouTube videos to a wall-mounted HDTV screen, rendered in HD (auto) 1,440 P resolution,

I was observing Realtime Graphed (br-lan) traffic and was able to deduce / confirm these (2) workstations were throwing the error, e.g.: one workstation maxing outbound bandwidth, one workstation maxing out inbound bandwidth.

I did some internet searching, the error "appears" to be related to a module responsible for providing SQM QOS:

Last Kernel Log entry follows:

[62601.242717] ------------[ cut here ]------------
[62601.247378] WARNING: CPU: 0 PID: 3 at net/sched/sch_hfsc.c:1400 hfsc_dequeue+0x318/0x340 [sch_hfsc]
[62601.256471] Modules linked in: pppoe ppp_async pppox ppp_generic nf_nat_pptp nf_conntrack_pptp nf_conntrack_ipv6 iptable_nat ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_policy xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_esp xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda ts_fsm ts_bm slhc nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4 nf_nat_ipv4 nf_nat_h323 nf_nat_amanda nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_tftp
[62601.328326] nf_conntrack_snmp nf_conntrack_sip nf_conntrack_rtcache nf_conntrack_proto_gre nf_conntrack_irc nf_conntrack_h323 nf_conntrack_broadcast ts_kmp nf_conntrack_amanda mwifiex_sdio mwifiex iptable_mangle iptable_filter ipt_ah ipt_ECN ip_tables crc_ccitt fuse sch_cake em_nbyte act_ipt cls_basic sch_prio sch_pie sch_gred em_meta sch_dsmark sch_teql em_cmp act_police em_text sch_codel sch_sfq sch_fq sch_red act_connmark nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_tbf sch_htb sch_hfsc sch_ingress mwlwifi mac80211 cfg80211 compat cryptodev xt_set ip_set_list_set ip_set_hash_netiface ip_set_hash_netport ip_set_hash_netnet ip_set_hash_net ip_set_hash_netportnet ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables msdos bonding ifb tun vfat fat ntfs nls_utf8 nls_iso8859_1 nls_cp437 sha512_generic sha256_generic md5 hmac authenc ohci_pci uhci_hcd ohci_platform ohci_hcd gpio_button_hotplug
[62601.432034] CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G W 4.9.34 #0
[62601.439372] Hardware name: Marvell Armada 380/385 (Device Tree)
[62601.445324] [] (unwind_backtrace) from [] (show_stack+0x10/0x14)
[62601.453102] [] (show_stack) from [] (dump_stack+0x7c/0x9c)
[62601.460355] [] (dump_stack) from [] (__warn+0xbc/0xec)
[62601.467259] [] (__warn) from [] (warn_slowpath_null+0x1c/0x24)
[62601.474867] [] (warn_slowpath_null) from [] (hfsc_dequeue+0x318/0x340 [sch_hfsc])
[62601.484134] [] (hfsc_dequeue [sch_hfsc]) from [] (__qdisc_run+0xe4/0x234)
[62601.492697] [] (__qdisc_run) from [] (__dev_queue_xmit+0x348/0x684)
[62601.500736] [] (__dev_queue_xmit) from [] (ip_finish_output2+0x220/0x280)
[62601.509297] [] (ip_finish_output2) from [] (ip_output+0x50/0xb0)
[62601.517073] [] (ip_output) from [] (ip_forward+0x364/0x3d0)
[62601.524413] [] (ip_forward) from [] (ip_rcv+0x258/0x2b8)
[62601.531493] [] (ip_rcv) from [] (__netif_receive_skb_core+0x6c4/0x8fc)
[62601.539795] [] (__netif_receive_skb_core) from [] (process_backlog+0x7c/0x11c)
[62601.548792] [] (process_backlog) from [] (net_rx_action+0xe8/0x2ac)
[62601.556831] [] (net_rx_action) from [] (__do_softirq+0xd0/0x204)
[62601.564608] [] (__do_softirq) from [] (run_ksoftirqd+0x2c/0x50)
[62601.572299] [] (run_ksoftirqd) from [] (smpboot_thread_fn+0x16c/0x184)
[62601.580600] [] (smpboot_thread_fn) from [] (kthread+0xd8/0xec)
[62601.588204] [] (kthread) from [] (ret_from_fork+0x14/0x3c)
[62601.595470] ---[ end trace 3c3ff5958f8d8bd0 ]---

The workstation uploading High-Definition videos to YouTube: this workstation is uploading (nn) video files, Egress bandwidth is pegged at (2m) for at least 60 minutes during the upload process.

The workstation, (LAN wired client), playing YouTube videos to a wall-mounted HDTV screen, is an 'average' business class workstation, running Windows 7 professional, using a dual-monitor setup, running a web browser (Google Chrome) in the 2nd monitor (HDTV) instance for rendering.

(If you're still reading, Thanks for hanging in there this far.)

When the Kernel Log error occurs, all clients are 'frozen' for a period of time, from one (1) to five (5) minutes. During this time, any/all internet access will fail. Internal DNS appears to be working properly, e.g. addresses resolve within a web browser, page rendering fails with a 'cannot contact server' error.


Research done / options discussed with management so far include:

  1. Replacing the Time Warner provided ubee DW365 modem (circa 2013) with a more modern modem, e.g. Netgear CM600 modem,

  2. -Splitting- Wifi traffic to (2) separate routers, one (1) for Corporate use, one (1) for Guest (External Device) use, and limiting connection bandwidth on the Guest router -> and giving the Guest router lower traffic priority (still researching how / the best way to do this)

  3. Reducing rendering resolution of the workstation playing YouTube videos from (Auto) 1440p to 720p, which shows a definite improvement in bandwidth reduction in Ingress: from 20m to 10/15m,

  4. Possibility of reducing resolution of videos taken / being uploaded to YouTube from 1440p ( newest iPhone) to 1080p, or 720p - although this option is being met with some resistance, which is understandable / not a preferred 'solution' to staff or management on-site.


Networking / connectivity issues have been occurring at this site for many months - I don't expect to solve all issues in a few days.

The LEDE router swap, cabling maintenance, etc. is being done as due-diligence, before contacting Time Warner / Spectrum for further diagnostics, as recommended by their 3rd-tier service techs on various forums (e.g. dslreports.com ),

The Linksys LEDE router, when working properly, is providing a much better client experience, based on feedback. When it 'chokes' though, it's taking down all users, typically at the busiest times of day, when online application updates & credit card processing are occurring, and needed.

I would appreciate any insight & observations / recommendations you may have to offer,


Could you post the output of:
cat /etc/config/sqm
tc -d qdisc
tc -s qdisc
here in the thread, please?

Interestingly, sqm-scripts auto-loads the hfsc module (sch_hfsc) but will only ever need it for a specific qos script, not piece_of_cake or simple, so you could edit /usr/lib/sqm/defaults.sh and replace the following line:
ALL_MODULES="act_ipt sch_$QDISC sch_ingress act_mirred cls_fw cls_flow cls_u32 sch_htb sch_hfsc"
with:
ALL_MODULES="act_ipt sch_$QDISC sch_ingress act_mirred cls_fw cls_flow cls_u32 sch_htb"
That should take care of the error. If it also solves your frozen states (which I doubt), please let me know, because that would be justification for removing the relevant qos script from the main sqm-scripts, which in turn would allow removing sch_hfsc from the default ALL_MODULES for everybody.
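
As a rough sketch of that edit, the sed call below drops sch_hfsc from the ALL_MODULES line. It works on a scratch copy created with mktemp so it can be tried anywhere; on the router the same sed expression would be pointed at /usr/lib/sqm/defaults.sh, ideally after making a backup. The scratch file name is just an illustration.

```shell
#!/bin/sh
# Sketch: remove sch_hfsc from the ALL_MODULES line, demonstrated on a scratch copy.
# On the router the real file is /usr/lib/sqm/defaults.sh (back it up first).
SCRATCH="$(mktemp)"

# Quoted heredoc delimiter keeps $QDISC literal, as it is in the real file.
cat > "$SCRATCH" <<'EOF'
ALL_MODULES="act_ipt sch_$QDISC sch_ingress act_mirred cls_fw cls_flow cls_u32 sch_htb sch_hfsc"
EOF

# Only touch the ALL_MODULES line; remove the module name and the space before it.
sed '/^ALL_MODULES=/s/ sch_hfsc//' "$SCRATCH" > "$SCRATCH.new"
cat "$SCRATCH.new"
```

Note that this only changes what sqm-scripts asks for in future; a module that is already loaded presumably stays loaded until it is removed or the router reboots.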

This sounds like a job for cake's advanced isolation modes, which in theory can give per-internal-host-IP fairness and should tackle your problem. Have a look at the last section of the sqm howto you cite above. But this will only ever work if sqm is instantiated on the true WAN interface that sees the external addresses. I have my doubts here; RobertH wrote:

modem static IP is

That does not look like a real routable IPv4 address to me... What is the story behind that? Is TWC using carrier grade NAT and only handing out silly 192.168 addresses?

If sqm-scripts is the culprit this should not help much, no? But that seems like a decent test to figure out where exactly the issue occurs; have you tried running mtr during your freeze episodes from inside the network to see how far you can still reach nodes?

You might be able to simply use a secondary SSID for Guests on the same radio as the corporate SSID, saving one router. But if you get cake's IP isolation up and running, simply hooking up another router as a full NAT router with its WAN port to the WRT1900ACS will get you one "internal" host address for all guests, which might be sufficient de-prioritization.

That sounds like a decent idea; but youtube traffic delivery can be quite bursty, so even then when the average rate is a better fit to your ingress bandwidth you still might see bursty latency variations caused by youtube.

That should be addressable by cake's isolation modes, but only if cake sees enough information, which often, after NAT, it does not, and certainly not after double NAT...

Great, that leaves some time for experiments :wink:

Okay, that sounds like the issue is the freezes, and in between things just work well enough, is that correct?

I would think trying to pinpoint the root cause of the freezes should have highest priority, followed by trying to get cake's per-IP fairness configured, as it seems like a good solution to some of your issues. But please note that some issues might actually be WiFi issues, so sqm-scripts might be the wrong place to try to fix those.

Best Regards

Hi moeller0,

Thanks for the quick reply.

This was how the TWC modem was configured initially - at the time of the LEDE router install, I did not want to make any 'substantial' changes that would prevent the (original) router on-premise from being re-connected, if things went badly.

But after re-reading the last section of the SQM how-to, I fully realize now that having the TWC cable modem set this way, double NAT'd, is not optimal, or needed:

I will set the cable modem to full (Bridge) mode when on-site tomorrow, to ensure the WAN interface is directly facing the Internet.

I have no empirical evidence at this point that switching out the ubee cable Modem will improve things, it's just a 'gut feeling'. After searching the web / reading other owner experiences with this particular Brand / Model modem, I am still suspicious of it - even though it's reporting zero errors at this point.

When I arrived this week to on-site review, I found that -none- of the networking equipment was properly Surge, or UPS protected - a conglomeration of various extension cords, basic power strips, etc. from varying outlets were employed to power a business-class HP LaserJet, a Shredder, phone system, etc. in the area where the network equipment is deployed - it was a mess. I'm still cleaning it up.

A new cable Modem, with proper firmware update support, and additional upstream channel capability would not make me unhappy here.

re: running MTR - No, I haven't - and that is an excellent idea, Thanks. When on-site Friday during the last freeze, I did run a Ping test: internal DNS on the LEDE router was resolving properly, but the ping failed immediately, it wasn't getting past the LAN interface of

That was the extent of testing I could run, before having to hard-reset the router, unfortunately.

It looks like I've got some more reading / training up to do personally as well, to quickly upgrade my skill set for more targeted troubleshooting. I've been a "Windows" guy for many years, my Linux skill set is novice, at best, at the moment.

This is what I was thinking as well, and I appreciate your insight here.

I picked up a 2nd Linksys Router today, a used EA8500, updated & running the latest stock firmware, and have been testing it with Media Prioritization set at 5 megabits / second for the External Devices, it seems to be working well.

I wish, "in a perfect world" .. :smile: Unfortunately, having the network lock-up, with a queue full of customers waiting, may not lend me any time for that, Sigh. But I will be devoting more time on-site this week to monitor the network, and will have the ability to do better observation & testing.

That is correct.

Feedback has been very positive regarding speed & responsiveness when the Router is working well: when the Router has choked, it's been in a high WiFi utilization situation each time, with a queue full of customers waiting.

Thanks, I will get & post the output of /etc/config/sqm & the other items requested ASAP, and will keep the thread updated on any findings or developments as I proceed,

Kind Regards,

I am quite curious how that will work out, please post success and/or failure.

Please note that there are some issues with Intel Puma 6 DOCSIS modems that might or might not be fixed by firmware...

To elaborate, I would try to establish computers on all internal segments (wired LAN, the different WiFi radios) and run mtr from all of those. In addition I would use a smokeping service (like the free service supplied by dslreports, after a free registration) to test the router's reachability from the outside (dslreports produces nice plots for different time scales). Finally I would install luci-app-statistics and collectd-mod-ping and set up pings from the router itself to hosts like google.com and internal addresses (that need to be configured to respond to pings) to also get a picture of what the router believes is reachable. (luci-app-statistics, in case you do not know already, can be used with a number of collectd "plugins" that need to be installed independently and can give nice diagnostics, like collectd-mod-interface and collectd-mod-iwinfo.)

There has been great work on the ath9k radios to get their bufferbloat down, so if your guest router uses ath9k you might want to set this up with LEDE as well, to profit from that nice work...

I guess there is never time to do it right, but there has always to be time to do it over...

I am an optimist, so take this as a good sign :wink:

Sure, take your time, I will also be intermittent in responding depending on my workload..

Best Regards


Lots of topology changes at this site, over the past several days.

I had to defer setting the ubee cable modem into bridge mode - I needed (IP) accessibility to the device, due to equipment placement, and could not lose that functionality:

using a direct wired port connection, to monitor modem status, is not feasible at the moment.

I also learned the guaranteed provisioned rate for the site was 15 / 2, -not- 20 / 2 as was originally conveyed.

The provisioned rate has been verified by customer invoice from the Cable provider.

On Sunday 7/9,

1). a LAN build-out was done to convert all critical Operations workstations to a cabled network back-bone segment ( ), using the Linksys WRT1900 ACS router, running LEDE.

This router has been dubbed "Operations". All WiFi radios in those workstations connected to this LAN segment have been turned off / disabled.

Both workstations noted above in the 1st post, (1) uploading high-def video files ("Marketing"), the other playing these high-def video files via Internet browser (aka: "Reception") are now on this wired, back-bone segment, with full bandwidth, and traffic managed by SQM QOS.

2). The (original) NetGear R7000 Nighthawk router that existed on-site was given a long-overdue firmware upgrade, partition sub-netted to ( ), and placed back into service as the "Staff Use" router.

No changes were made to the router SSID or password, to allow general staff, or guests, that previously had been given access to the network continued access.

This router, running updated stock NetGear firmware, has been bandwidth-limited to ( 50% ) of typical capacity using the NetGear firmware's QOS function in the R7000 router ( 7 / .8 ).

The QOS traffic shaping function in the NetGear R7000 router is working very well.

3). The Linksys EA8500 router has been tasked specifically for the Director & Management of the organization, to allow full-bandwidth capability, while securing -> and prioritizing their WiFi traffic over general staff use ( 2. above ).

The 3rd router is partition sub-netted to ( ), and placed into service as the "Corporate" router.

In conjunction with the new (wired) back-bone LAN segment, I was able to locate this router in an area that provides better signal coverage to the Director's office.

I have a USB -> Serial TTL cable on order to update the firmware on this router, to LEDE, this week.


There were (4) devices that were hitting the WRT1900ACS LEDE router with continuous "Auth" / "DeAuth" requests last week: at that time, it was the only router providing connectivity service at the site.

(2) were iPhones (2) were HP printers, all connected via WiFi.

iPhone 1 was identified as the Director's iPhone: the entries were due to poor WiFi coverage in her office. Printer 1 was the Director's HP OfficeJet printer - again due to poor WiFi coverage.

Both have been addressed by the new "Corporate" router, better location placement to provide signal, and changing the HP Officejet printer settings to ( Static IP ) address on the .. 5.1 segment.

iPhone 2 was a general staff iPhone user, which now connects to the bandwidth-limited "Staff Use" router above.

Printer 2 was a shared-use Operations / Management HP OfficeJet printer, which is now wired to the back-bone LAN segment, Wireless function on the printer is disabled, this printer is also now using a ( Static IP ) address on the ... 1.1 segment.


Following the changes above, additional entries appeared in the LEDE logs the following morning.

Again, these entries only appeared during high-bandwidth operations ( simultaneously uploading video files / playing videos via YouTube in web browser ) but were now occurring with both workstations on the local LAN segment.

After learning that the guaranteed provisioned rate ( 15/2 ) was different from what was originally set up in the router, I changed SQM QOS and QOS Ingress settings from (17,920) Ingress to (15,360).

  • The warnings have not re-appeared.

The ubee cable modem continues to show zero (0) errors in the status page - my initial suspicion that the cable modem was in-part causing the issue was incorrect.

Improper SQM QOS / QOS settings appear to have been the cause of the issue.
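
To put a number on how far off the original setting was, here is a small sketch that expresses a configured shaper rate as a percentage of the provisioned rate. It assumes the same 1 Mbit/s = 1024 kbit/s convention as the SQM figures above; the function and variable names are hypothetical.

```shell
#!/bin/sh
# Sketch: configured shaper rate as a percentage of the provisioned rate.
# Integer arithmetic only, so the result is truncated (116.67 -> 116).
percent_of_provisioned() {
    # $1 = configured rate in kbit/s, $2 = provisioned rate in kbit/s
    echo $(( $1 * 100 / $2 ))
}

OLD_INGRESS=17920               # 87.5 % of the assumed 20 Mbit/s plan
NEW_INGRESS=15360               # value after the correction
PROVISIONED=$(( 15 * 1024 ))    # 15 Mbit/s guaranteed rate, in kbit/s

echo "old ingress: $(percent_of_provisioned "$OLD_INGRESS" "$PROVISIONED")% of provisioned"
echo "new ingress: $(percent_of_provisioned "$NEW_INGRESS" "$PROVISIONED")% of provisioned"
```

With the shaper set above what the link can actually deliver, the queue builds up in the modem rather than in the router, which is the situation SQM is meant to avoid.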

SQM QOS traffic shaping in the LEDE router, now responsible for / providing management for LAN segment and .3.1 / .5.1 sub-nets, seems to be working very well:

General staff continues to have Internet access ability (albeit 2-3x slower), Operations staff on the LAN segment is able to conduct daily business functions without issue, Corporate / Management is also able to utilize full bandwidth, when needed.

It's only been a few days, but things are looking good so far: no lock-ups, or other issues.

I'll update this thread in a few days - hoping to get the EA8500 router up-leveled to LEDE this weekend.


You can create an additional interface on the router's WAN ethernet port that talks directly to the cable modem's private IP address. Here is what I have in /etc/config/network to access my dsl modem (eth1 is the physical port connected to the modem):

config interface 'WAN_4_MODEM'
	option proto 'static'
	option ifname 'eth1'
	option ipaddr ''
	option netmask ''
I believe that all cable modems live on, So with a bit of luck you could just add something similar (or add the new interface via the luci gui). For bonus points it might make sense to add that new interface to the firewall's wan zone.
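
If you prefer the command line over editing the file, the stanza above can be sketched as input for uci batch. This is only an illustration: the helper function name is made up, and the 192.0.2.2 / 255.255.255.0 values are documentation-range placeholders, not the real modem subnet, so substitute an address from your modem's own management network.

```shell
#!/bin/sh
# Sketch: print "uci batch" commands for an extra static interface on the WAN port.
# All three arguments are placeholders the reader must supply.
modem_iface_uci_batch() {
    # $1 = physical WAN device
    # $2 = router-side address inside the modem's management subnet
    # $3 = netmask of that subnet
    cat <<EOF
set network.WAN_4_MODEM=interface
set network.WAN_4_MODEM.proto='static'
set network.WAN_4_MODEM.ifname='$1'
set network.WAN_4_MODEM.ipaddr='$2'
set network.WAN_4_MODEM.netmask='$3'
EOF
}

# Placeholder values only (192.0.2.0/24 is reserved for documentation).
modem_iface_uci_batch eth1 192.0.2.2 255.255.255.0
```

On the router the output would be piped into uci batch, followed by uci commit network and a network reload; printing the commands first makes it easy to review them before anything is changed.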

So SQM was taking down the entire router just because the hardware was dropping or not keeping up with the packets being put out on the wire?

Well that's a bug if I ever heard one.


Your continued guidance, insight & help is most appreciated, Thanks.

I have a (2nd) EA8500 Router on order to configure to LEDE, the USB -> Serial TTL cable arrived this week. I am going to configure this "off-line" Router, using the recommendations you have made above, and (Disable) wireless in this unit: (swap) out the WRT1900ACS unit, and promote the configured EA8500 to run as the back-bone router.

The WRT1900ACS Router will be re-tasked as the "Corporate" router at that time.

Greetings, Thanks for the feedback.

I am reluctant to call anything a "bug" at this point, due to my over-provisioning the SQM QOS and QOS Ingress values to (116%) of the (guaranteed) provision rate from the cable provider initially,

The fact the LEDE firmware running on the unit is a "snapshot" build, and not a set production build,

The fact that the radio chipset drivers for the WRT1900ACS unit may not be "optimal", per the LEDE toH entry for the router,

The fact that 'incidental' reports from other WRT1900ACS owners have stated radio 'lockups' under heavy use / high utilization,

The fact that this unit is no longer in production from Linksys is also a "variable", at this point.

Edit: The Linksys site still shows this unit as production, and available. I was basing the above statement on various retail & internet sites discounting -> clearance-selling this model, and the # of "refurb" units for sale, e.g.: eBay.

There are just too many variables, including the initial incorrect QOS configuration, to point to any (1) root cause, at this point.

  • I will continue to update progress at this site: in the event a test-unit can be derived that may replicate the initial lock-up issue, or shed additional light that may help developers viewing this topic, as I proceed.



Just a quick update on this site:

The WRT1900ACS router continues to perform well, but is still receiving occasional:

"net/sched/sch_hfsc.c:1400 hfsc_dequeue+0x318/0x340 [sch_hfsc]" entries in the System Log, however the router isn't locking up, or otherwise 'stalling' when these occur now.

The WRT1900ACS router has been operational, running LEDE, for 12 days now.

I am planning on updating the firmware to the latest Davidc502 snapshot, once the EA8500 router is configured & qualified, this week.


I picked up (2) used EA8500 routers (from eBay), one was password (locked) out by the default Linksys firmware: the other was stock firmware accessible: but the 2.4 / 5G radios were turned off, as received.

Both routers have been flashed to the latest LEDE snapshot as of 7/20/17, and are running well.

Going through the steps of opening up & configuring the EA8500 routers, reading through & following the topics listed in the [ LEDE User Guide ] pages, along with helpful insight from different threads on the Forum has helped greatly this week.. I'm getting there, slowly. Have a much better understanding of LEDE directories / file structure, and using Command line mode -> how it relates to the Luci GUI, so (I think?) that's progress.

Will do. Before I take the WRT1900ACS router off-line, I will pull this info., & the System and Kernel logs on Monday evening 7/24. ( If there is any other info. you would like to see, please let me know. )


Now that I have the EA8500 setup, running well, and qualified to install on-site - I will be setting up our DSL router in (bridge) mode to match the site & testing here today, * Thanks again for this info, much appreciated.

Otherwise, the site continues to run well, and we are progressing forward on this project: we will be installing a Roku Ultra player on the HDTV (the Roku Ultra has an Ethernet port), and setting up a DLNA server on the local LAN segment, to get the high-def video file traffic off the Internet connection / playing in-house this week.


While I believe @weedy 's comment to be not too helpful (in finding or fixing the assumed bug) I would not rule out a qdisc problem. SQM-scripts itself really is just a convenience layer around the kernel's facilities and the iptables/iproute2 configuration tools, so a true sqm bug will lead to incorrectly set up shapers but will not lead to run-time problems such as those that you observed, so no hard feelings from my side, it is just that calling these issues sqm bugs* is barking up the wrong tree :wink:

Best Regards

*) There certainly are sqm bugs left to squash, but these are different in phenotype and should not lead to kernel crashes...

Make this:
cat /etc/config/sqm ; tc -d qdisc ; tc -s qdisc
instead of the one merged command above, it will work better :wink:

Best Regards