I'm helping a non-profit organization work through some long-standing (WiFi) networking issues, and was asked to come in and evaluate their existing network setup: primarily wireless clients (10-15 daily), a combination of workstations and external device clients (phone, tablet, laptop), all connecting to a single (1) WiFi router:
Service provider: Time Warner (aka Spectrum) Cable internet, business class, "Turbo" 20/2, with a separate Arris modem providing 2-line telephony into the building,
Existing issues cleaned up to date:
Replacing the missing 75 ohm terminator on the 3-way splitter at the primary entrance point / electrical closet,
Inspecting, re-seating, & re-tightening all barrel connectors inside building,
Replacing any/all suspect network patch cables with non-molded connector ends,
Internet modem is ubee DW365, a basic 8 channel down / 4 channel up DOCSIS 3.0 unit, DNS off, DHCP off, Firewall set to (low), modem static IP is 192.168.0.1,
At the start, the modem log was showing several hundred (correctable) errors in the connection stats. Following the above cabling maintenance & a modem restart, the modem log currently shows zero (0) errors after 72 hours of running.
Existing WiFi router on-site: Netgear R7000 Nighthawk - currently disconnected for testing,
Replacement test router installed on-site: Linksys WRT1900ACS v2, purchased 5/2017,
Bridging to ubee modem using Static IP 192.168.0.3, internal (LAN) IP is 192.168.1.1,
Running Davidc502 LEDE Snapshot from his site:
Kernel version 4.9.34
WiFi driver 10.3.4.0-20170606
NEW 06/28/2017 New builds have been released and are ready for download.
QOS settings are on, and set for 20/2 bandwidth,
SQM QOS setting is on, and set for 87.5% of line rate, Ingress / Egress: 17,920 / 1,792 kbit/s,
Queue Discipline is: cake, piece_of_cake,
Link Layer Adaptation is: Ethernet, Per Packet overhead is: 18,
per recommended settings in: https://lede-project.org/docs/howto/sqm
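The 17,920 / 1,792 figures above are consistent with taking 87.5% of the 20/2 line rate when 1 Mbit is counted as 1,024 kbit. A minimal sketch of that arithmetic follows; the function name and the 1,024 multiplier are my own assumptions for illustration, not something taken from the SQM documentation:

```python
# Sketch: derive SQM shaper rates from an advertised line rate.
# Assumption: 1 Mbit is counted as 1024 kbit, which is what reproduces
# the 17,920 / 1,792 figures quoted in this post.

KBIT_PER_MBIT = 1024  # assumption; using 1000 would give 17,500 / 1,750


def shaped_rate_kbit(line_rate_mbit: float, fraction: float = 0.875) -> int:
    """Return the shaper rate in kbit/s for a line rate and a fraction of it."""
    return int(line_rate_mbit * KBIT_PER_MBIT * fraction)


if __name__ == "__main__":
    print("ingress kbit/s:", shaped_rate_kbit(20))
    print("egress kbit/s:", shaped_rate_kbit(2))
```

If the provider's 20/2 is really 20,000 / 2,000 kbit/s, the same 87.5% would come out slightly lower, so the values entered may be closer to 89.6% of the true line rate than intended.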
The Linksys WRT1900ACS router is responsible for handling all WiFi traffic currently.
Clients are primarily (5) fixed workstations, (4) on WiFi and (1) on LAN, plus (3) WiFi printers,
(7 - 10) External Devices: phone, tablet, laptop, ranging from older (Android) phones to the newest iPhone / iPad (5 GHz) models.
The symptoms, and the issue in the Kernel Log, occur when:
(1) One workstation, a Wireless N client, is Uploading High-Definition videos to YouTube: this workstation is maxing out Egress bandwidth,
(1) One workstation, a local LAN wired client, is playing YouTube videos to a wall-mounted HDTV screen, rendered in HD (auto) 1440p resolution,
I was observing Realtime Graphed (br-lan) traffic and was able to deduce / confirm that these (2) workstations were triggering the error, i.e. one workstation maxing out outbound bandwidth and one workstation maxing out inbound bandwidth.
I did some internet searching, and the error "appears" to be related to a module responsible for providing SQM QOS:
Last Kernel Log entry follows:
[62601.242717] ------------[ cut here ]------------
[62601.247378] WARNING: CPU: 0 PID: 3 at net/sched/sch_hfsc.c:1400 hfsc_dequeue+0x318/0x340 [sch_hfsc]
[62601.256471] Modules linked in: pppoe ppp_async pppox ppp_generic nf_nat_pptp nf_conntrack_pptp nf_conntrack_ipv6 iptable_nat ipt_REJECT ipt_MASQUERADE xt_time xt_tcpudp xt_tcpmss xt_statistic xt_state xt_recent xt_policy xt_nat xt_multiport xt_mark xt_mac xt_limit xt_length xt_hl xt_helper xt_esp xt_ecn xt_dscp xt_conntrack xt_connmark xt_connlimit xt_connbytes xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_HL xt_DSCP xt_CT xt_CLASSIFY ums_usbat ums_sddr55 ums_sddr09 ums_karma ums_jumpshot ums_isd200 ums_freecom ums_datafab ums_cypress ums_alauda ts_fsm ts_bm slhc nf_reject_ipv4 nf_nat_tftp nf_nat_snmp_basic nf_nat_sip nf_nat_redirect nf_nat_proto_gre nf_nat_masquerade_ipv4 nf_nat_irc nf_conntrack_ipv4 nf_nat_ipv4 nf_nat_h323 nf_nat_amanda nf_nat nf_log_ipv4 nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_tftp
[62601.328326] nf_conntrack_snmp nf_conntrack_sip nf_conntrack_rtcache nf_conntrack_proto_gre nf_conntrack_irc nf_conntrack_h323 nf_conntrack_broadcast ts_kmp nf_conntrack_amanda mwifiex_sdio mwifiex iptable_mangle iptable_filter ipt_ah ipt_ECN ip_tables crc_ccitt fuse sch_cake em_nbyte act_ipt cls_basic sch_prio sch_pie sch_gred em_meta sch_dsmark sch_teql em_cmp act_police em_text sch_codel sch_sfq sch_fq sch_red act_connmark nf_conntrack act_skbedit act_mirred em_u32 cls_u32 cls_tcindex cls_flow cls_route cls_fw sch_tbf sch_htb sch_hfsc sch_ingress mwlwifi mac80211 cfg80211 compat cryptodev xt_set ip_set_list_set ip_set_hash_netiface ip_set_hash_netport ip_set_hash_netnet ip_set_hash_net ip_set_hash_netportnet ip_set_hash_mac ip_set_hash_ipportnet ip_set_hash_ipportip ip_set_hash_ipport ip_set_hash_ipmark ip_set_hash_ip ip_set_bitmap_port ip_set_bitmap_ipmac ip_set_bitmap_ip ip_set nfnetlink ip6t_REJECT nf_reject_ipv6 nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables x_tables msdos bonding ifb tun vfat fat ntfs nls_utf8 nls_iso8859_1 nls_cp437 sha512_generic sha256_generic md5 hmac authenc ohci_pci uhci_hcd ohci_platform ohci_hcd gpio_button_hotplug
[62601.432034] CPU: 0 PID: 3 Comm: ksoftirqd/0 Tainted: G W 4.9.34 #0
[62601.439372] Hardware name: Marvell Armada 380/385 (Device Tree)
[62601.445324]  (unwind_backtrace) from  (show_stack+0x10/0x14)
[62601.453102]  (show_stack) from  (dump_stack+0x7c/0x9c)
[62601.460355]  (dump_stack) from  (__warn+0xbc/0xec)
[62601.467259]  (__warn) from  (warn_slowpath_null+0x1c/0x24)
[62601.474867]  (warn_slowpath_null) from  (hfsc_dequeue+0x318/0x340 [sch_hfsc])
[62601.484134]  (hfsc_dequeue [sch_hfsc]) from  (__qdisc_run+0xe4/0x234)
[62601.492697]  (__qdisc_run) from  (__dev_queue_xmit+0x348/0x684)
[62601.500736]  (__dev_queue_xmit) from  (ip_finish_output2+0x220/0x280)
[62601.509297]  (ip_finish_output2) from  (ip_output+0x50/0xb0)
[62601.517073]  (ip_output) from  (ip_forward+0x364/0x3d0)
[62601.524413]  (ip_forward) from  (ip_rcv+0x258/0x2b8)
[62601.531493]  (ip_rcv) from  (__netif_receive_skb_core+0x6c4/0x8fc)
[62601.539795]  (__netif_receive_skb_core) from  (process_backlog+0x7c/0x11c)
[62601.548792]  (process_backlog) from  (net_rx_action+0xe8/0x2ac)
[62601.556831]  (net_rx_action) from  (__do_softirq+0xd0/0x204)
[62601.564608]  (__do_softirq) from  (run_ksoftirqd+0x2c/0x50)
[62601.572299]  (run_ksoftirqd) from  (smpboot_thread_fn+0x16c/0x184)
[62601.580600]  (smpboot_thread_fn) from  (kthread+0xd8/0xec)
[62601.588204]  (kthread) from  (ret_from_fork+0x14/0x3c)
[62601.595470] ---[ end trace 3c3ff5958f8d8bd0 ]---
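For anyone wanting to check how often this warning recurs, here is a rough sketch of pulling the source file, line, function and module out of a saved kernel log. The regular expression and the function name are my own and only assume the WARNING line format shown in the log above:

```python
import re

# Matches lines of the form:
# [ts] WARNING: CPU: n PID: n at <file>:<line> <func>+<offset> [<module>]
WARNING_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) "
    r"at (?P<file>\S+):(?P<line>\d+) (?P<func>\w+)\+\S+(?: \[(?P<module>\w+)\])?"
)


def find_warnings(log_text):
    """Return one dict per kernel WARNING line found in log_text."""
    return [m.groupdict() for m in WARNING_RE.finditer(log_text)]


# Two lines copied from the log in this post, used as sample input.
SAMPLE = (
    "[62601.242717] ------------[ cut here ]------------\n"
    "[62601.247378] WARNING: CPU: 0 PID: 3 at net/sched/sch_hfsc.c:1400 "
    "hfsc_dequeue+0x318/0x340 [sch_hfsc]\n"
)

if __name__ == "__main__":
    for w in find_warnings(SAMPLE):
        print(w["ts"], w["file"], w["line"], w["func"], w["module"])
```

Run against the full log, this would make it easier to line up each warning timestamp with the times users report being frozen.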
The workstation uploading High-Definition videos to YouTube is uploading (nn) video files; Egress bandwidth is pegged at 2 Mbps for at least 60 minutes during the upload process.
The workstation (LAN wired client) playing YouTube videos to a wall-mounted HDTV screen is an 'average' business class workstation, running Windows 7 Professional with a dual-monitor setup, and a web browser (Google Chrome) running on the 2nd monitor (the HDTV) for rendering.
(If you're still reading, thanks for hanging in there this far.)
When the Kernel Log error occurs, all clients are 'frozen' for a period of time, from one (1) to five (5) minutes. During this time, any/all internet access will fail. Internal DNS appears to be working properly, i.e. addresses resolve within a web browser, but page rendering fails with a 'cannot contact server' error.
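To confirm the "DNS resolves but pages fail" observation from a workstation during a freeze, something like the following could be run. This is only a sketch using Python's standard socket module; the function name, port and timeout are my own choices:

```python
import socket


def probe(host, port=443, timeout=5.0):
    """Return (resolved_ip_or_None, tcp_connect_ok) for host:port."""
    try:
        # Step 1: name resolution only.
        ip = socket.gethostbyname(host)
    except OSError:
        return None, False
    try:
        # Step 2: an actual TCP connection to the resolved address.
        with socket.create_connection((ip, port), timeout=timeout):
            return ip, True
    except OSError:
        return ip, False
```

Calling `probe("www.youtube.com")` during a freeze and getting an IP address back together with `False` would match what the browsers are showing: resolution succeeds, but traffic is not getting through.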
Research done / options discussed with management so far include:
Replacing the Time Warner provided ubee DW365 modem (circa 2013) with a more modern modem, e.g. Netgear CM600 modem,
Splitting WiFi traffic across (2) separate routers, one (1) for Corporate use and one (1) for Guest (External Device) use, limiting connection bandwidth on the Guest router and giving it lower traffic priority (still researching the best way to do this),
Reducing the rendering resolution of the workstation playing YouTube videos from (Auto) 1440p to 720p, which shows a definite reduction in Ingress bandwidth: from 20 Mbps to 10-15 Mbps,
Possibly reducing the resolution of videos taken / being uploaded to YouTube from 1440p (newest iPhone) to 1080p or 720p, although this option is being met with some resistance, which is understandable; it is not a preferred 'solution' for staff or management on-site.
Networking / connectivity issues have been occurring at this site for many months - I don't expect to solve all issues in a few days.
The LEDE router swap, cabling maintenance, etc. are being done as due diligence before contacting Time Warner / Spectrum for further diagnostics, as recommended by their 3rd-tier service techs on various forums (e.g. dslreports.com).
The Linksys LEDE router, when working properly, is providing a much better client experience, based on feedback. When it 'chokes' though, it's taking down all users, typically at the busiest times of day, when online application updates & credit card processing are occurring, and needed.
I would appreciate any insight, observations, or recommendations you may have to offer.