[Solved] How to track down and identify high use machines?

I am running OpenWRT 18.06.1. I am by no means a networking expert so I'm a little lost on how to do something.

Every now and then our internet speed takes a hit. By that I mean everything is really slow. I'll do a speedtest with a wired device and it'll come back with a 10000 ms ping and a speed of 1.78 when I should be getting 200 Mbps.

I am not sure if it's our ISP or some device on the network that's hogging the connection.

Before I accuse my ISP I thought maybe there is some way to monitor all the device traffic to find if any device is causing the problem.

Any advice/recommendations on what I should do and/or look for?

Model: TP-Link Archer C7 v2
Architecture: Qualcomm Atheros QCA9558 ver 1 rev 0
Firmware Version: OpenWrt 18.06.1 r7258-5eb055306f / LuCI openwrt-18.06 branch (git-18.228.31946-f64b152)
Kernel Version: 4.9.120

First off, an Archer C7 is unlikely to do 200 Mbps with SQM, but you should be running SQM.

Second, you can look at the traffic graphs in LuCI to see where your bandwidth is being used.

Third, you can run a packet capture with tcpdump to see what kind of traffic you have.

Edit:
Fourth, are you using WiFi for these tests? Broadcast or multicast traffic can saturate the airtime relatively easily. Perhaps someone is running a multicast stream over WiFi?
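
For the tcpdump suggestion, here is a minimal sketch of how a capture could be turned into a per-device tally. It assumes the text output of `tcpdump -n -q` for IPv4 looks like the sample lines in the comments, and that the LAN uses the 192.168.1.x range; both are assumptions, so adjust the pattern and prefix to what your capture actually prints. The function name `bytes_per_lan_host` is made up for this example.

```python
import re
from collections import Counter

# Assumed line shape from `tcpdump -n -q` for IPv4 traffic, for example:
#   12:00:00.000001 IP 192.168.1.23.51514 > 151.101.1.69.443: tcp 1448
#   12:00:00.000003 IP 192.168.1.40.5353 > 224.0.0.251.5353: UDP, length 45
LINE_RE = re.compile(
    r"\bIP (\d+\.\d+\.\d+\.\d+)(?:\.\d+)? > "   # source address, optional port
    r"(\d+\.\d+\.\d+\.\d+)(?:\.\d+)?: "         # destination address, optional port
    r".*?(?:tcp|length) (\d+)"                  # payload length reported by tcpdump
)


def bytes_per_lan_host(lines, lan_prefix="192.168.1."):
    """Sum the reported payload bytes per LAN address (sent plus received).

    lan_prefix is a hypothetical default; set it to your own LAN range.
    Lines that do not match the assumed format (ARP, IPv6, ...) are skipped.
    """
    totals = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue
        src, dst, size = match.group(1), match.group(2), int(match.group(3))
        for addr in (src, dst):
            if addr.startswith(lan_prefix):
                totals[addr] += size
    return totals
```

One way to use it would be to save some capture output to a text file on a PC, pass the open file to `bytes_per_lan_host`, and print `totals.most_common(5)` to list the busiest addresses. The counts are payload lengths as reported by tcpdump, not full frame sizes, so treat them as a rough ranking rather than exact bandwidth.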

Thanks. I am not sure what SQM is but I will read up on how to enable it.

I know right now I am getting 200 Mbps.

SQM (Smart Queue Management) is a system for distributing bandwidth fairly and keeping queue delays small even under load, but it requires CPU power proportional to bandwidth, so at 200 Mbps the C7 may not handle it.

Sounds like you are experiencing bufferbloat, so do install SQM and set it up per the guides. All those random latency spikes will go away.

The C7 will sort of handle 200Mbps with SQM, but you'll be out of CPU, so don't run a ton of other packages on that box.

SQM is totally worth it.

I should have added that I was noticing the network issues with the stock TP-Link firmware. That is why I installed OpenWRT -- so I could see if a) the problem continues and if it does then b) why.

Right now I just have OpenWrt with whatever packages it came with and the Dynamic DNS package. I am planning on installing and configuring Stubby per https://candrews.integralblue.com/2018/08/dns-over-tls-on-openwrt-18-06/ so I can use Cloudflare DNS over TLS.

Don't plan on running any other modules. I will give SQM a try. Even if I only get 100 Mbps that is fine with me. Our internet activities rarely require anything more than that.

FYI, I'm a C7 user, and have a 300/30mbit cable connection. I see C7s run out of CPU at about 100-120mbit. You can check by SSHing into the router and running top to watch the CPU and idle %.
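
If you would rather put a number on the idle figure than eyeball top, here is a rough sketch that computes idle % from two snapshots of `/proc/stat` (the standard Linux counters top is built on). It assumes you copy the text of `cat /proc/stat` twice, a few seconds apart while a speedtest is running, and feed both snapshots in; the function names are made up for this example.

```python
def read_cpu_times(stat_text):
    """Return the aggregate 'cpu' counters from /proc/stat text as ints.

    Only the first eight fields are kept (user, nice, system, idle, iowait,
    irq, softirq, steal); guest time is already included in user time.
    """
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def idle_percent(before_text, after_text):
    """Percentage of CPU time spent idle between two /proc/stat snapshots."""
    before = read_cpu_times(before_text)
    after = read_cpu_times(after_text)
    deltas = [b - a for a, b in zip(before, after)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("snapshots are identical or out of order")
    return 100.0 * deltas[3] / total  # index 3 is the idle counter
```

An idle figure near 0% during the test suggests the router CPU, not the line, is the limit at that shaper setting.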

Quick notes:
Running SQM only on the egress (upload) side gets me a lot of the benefit, and in my case is an easy 30mbit load. You select 0mbit for the ingress (download). This is normally how I've been running my router.

If I run my ingress at 100mbit, I get stellar performance, but I'm sacrificing 200mbit of speed.

I've also noticed that the same settings on 18.06.1 seem to load the router more, as if that version uses more CPU for some reason. I have to go down to 60-80mbit to avoid hitting 0% idle. I would recommend using the 17.01.4 version instead on a C7; it still performs as described above. I have a link out on this, but it drew few comments.

If you are able to run an ath79 image (the C7 v2 can), there is some throughput to be bought with flow offloading.

While switching to ath79 is the way forward either way, ar71xx snapshots are on kernel 4.14 as well (and therefore support flow offloading), so performance (among ath79 and ar71xx master snapshots) should be virtually the same.

  • You want something like ntop, but it's way too resource hungry for your device. My guess would be that you need at least a quad-core ARM or better to achieve acceptable performance.
    https://www.ntop.org/products/traffic-analysis/ntop/

  • nlbwmon (preferably using the LuCI frontend) will give you an idea, but it's a bit unstable in my experience: after a while it eats all available CPU time, or at least one core. Still, I think it'll give you a good idea.

  • Doing custom QoS using HFSC on upload only shouldn't be that taxing if you have around 30mbit upload, but it will be a bit tricky. If I'm not mistaken the old QoS scripts use HFSC, but I haven't looked in detail.

  • 100-120mbit sounds a bit low. I'd expect around 150-200mbit+ without SQM and acceleration if you disable mssfix and build with -O2, which at least in the past gave a small performance boost. Disabling MIPS16 on top of -O2 will probably give another small boost (at least it did in the past) at the expense of size.

This is all very helpful information, guys. Thanks so much. I won't have time to play with SQM until next weekend, so I'll report back my findings then.

Sorry for not making clear that I meant 100-120mbit with SQM. Without SQM, just doing NAT routing for the home AP, I see all the way up to 350mbit, which is as high as the peak I ever see out of my "300" DL cable speed.

And thanks to diizzzy for bringing the thread back with a selection of traffic monitoring methods.