Extreme lags and timeouts on luci on WRT1200AC out of the blue

doa · February 18, 2021, 8:20am

Hi there,

I hope someone can point me in the right direction. I have a problem with OpenWrt on a WRT1200AC. It worked perfectly for half a year, and now it has the problem described below. I have reset the router and reconfigured it manually from scratch on a newer OpenWrt version, with the same symptoms.

LuCI is extremely laggy, and most page loads hit timeouts (screenshot attached). Sometimes a new ssh session cannot be started (there is no password prompt and the ssh client times out), but an existing ssh session keeps working fine. Clients do not see any problem: I have no trouble reaching 800Mbps downloads, and no ping has been lost.

The problem occurs on both LAN and wifi, with no difference between them.

I was watching top output during the lags: there is plenty of free memory and the CPU is 99% idle. I have literally no idea what is going on. It usually fails with ubus or rpc timeouts, but that's not a rule; it has also hung on cgi-exec.

Here is what I have configured:

dnsmasq forwarded to 127.0.0.1#5453 (stubby)
/etc/config/network wan section contains option dns '127.0.0.1'
vpn client
vpn server
vpn-policy-routing configured
adblock

Note that turning off any or all of the things above had no effect on the behaviour.

I'm not an expert in networking, but I managed to get everything working half a year ago. I have never experienced this before on this router.

Is there anything to dig into? I believe this is not a problem with uhttpd; it is something a bit deeper imho, but I don't know where to look. Is there any logging that could be turned on?

Could anybody give me a direction, please?

![Screenshot 2021-02-18 at 09.01.33|677x500]

doa · February 18, 2021, 11:43am

I'm not using HTTPS. Here is the output of uci show uhttpd:

root@routar:~# uci show uhttpd
uhttpd.main=uhttpd
uhttpd.main.home='/www'
uhttpd.main.rfc1918_filter='1'
uhttpd.main.cert='/etc/uhttpd.crt'
uhttpd.main.key='/etc/uhttpd.key'
uhttpd.main.cgi_prefix='/cgi-bin'
uhttpd.main.lua_prefix='/cgi-bin/luci=/usr/lib/lua/luci/sgi/uhttpd.lua'
uhttpd.main.http_keepalive='0'
uhttpd.main.tcp_keepalive='1'
uhttpd.main.listen_http='0.0.0.0:80'
uhttpd.main.redirect_https='0'
uhttpd.main.max_connections='200'
uhttpd.main.no_ubusauth='1'
uhttpd.main.script_timeout='15'
uhttpd.main.network_timeout='5'
uhttpd.main.max_requests='30'
uhttpd.defaults=cert
uhttpd.defaults.days='730'
uhttpd.defaults.key_type='rsa'
uhttpd.defaults.bits='2048'
uhttpd.defaults.ec_curve='P-256'
uhttpd.defaults.country='ZZ'
uhttpd.defaults.state='Somewhere'
uhttpd.defaults.location='Unknown'
uhttpd.defaults.commonname='OpenWrt'

I have made some changes for testing, like disabling ubusauth, raising the timeouts and max_requests, and setting http_keepalive to zero, but none of it helped in any way.

anon50098793 · February 18, 2021, 1:00pm

if you've had this firmware running snappily in the past... reset to defaults and disabled the above as mentioned...

the only feasible explanations are browser/cache/ssl drama or config formatting issues... ( firewall? or something half stale like switched wan from pppoe to dhcp and left some remnants ) although typically I would not expect them to manifest purely/within luci... ( edit: failing flash / hardware is possible though less likely culprit )

if you'd told me the client symptoms and not mentioned openwrt... from a networking point of view there are feasible reasons for such symptomatology...

otherwise... you need to reassess what exactly you've done and provide more information... odds on, you're overlooking some changes you've made...

in all honesty though... flashing a new build ( with all fresh config! ) is a lot easier than trying to work out why something that was great 6 months ago is no longer great...

doa · February 19, 2021, 7:36am

Sorry, it must be my English. Let me explain:

the router worked perfectly for 6 months (I bought it second hand)
luci and ssh started misbehaving about a week ago
the day before I opened this thread I reset the router, flashed the newest build and reconfigured the router manually

These things are exactly why I'm confused. I'll try to reset it again and re-assess where the misbehaviour starts; I have no other choice.

anon50098793 · February 19, 2021, 8:30am

just to be perfectly clear...

you have not restored any backups or configs?

doa · February 20, 2021, 3:24pm

Nope, nothing. Pure manual setup from scratch. Even the 60 static DHCP leases were entered manually.

anon50098793 · February 20, 2021, 3:34pm

if the following is true...

then you need to be checking local resolution ( /etc/hosts, ethers, interface server settings )... otherwise you can tcpdump the connection from the router side... or maybe try ubus monitor...

one of them is bound to show something... starting with tcpdump would probably be wisest...
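
A rough sketch of what such a router-side capture could look like, assuming tcpdump is installed on the router, the LAN bridge is called br-lan and the router is 10.0.0.1 (both appear later in this thread). The helper name capture_cmd and the output path are made up for illustration. It only prints the command, so nothing runs until you paste it yourself:

```shell
#!/bin/sh
# Hypothetical helper: build a tcpdump command line for a router-side capture.
# There is deliberately no port filter, so traffic to every service on the
# router ends up in the file, not just HTTP.
capture_cmd() {
    iface="$1"      # e.g. br-lan (assumed LAN bridge name)
    router_ip="$2"  # e.g. 10.0.0.1
    outfile="$3"    # e.g. /tmp/router_all.pcap (illustrative path)
    # -n: skip name resolution, -i: interface, -w: write raw packets to a file
    echo "tcpdump -n -i $iface -w $outfile host $router_ip"
}

# Print the command so it can be reviewed and run by hand on the router.
capture_cmd br-lan 10.0.0.1 /tmp/router_all.pcap
```

While the capture is running, `ubus monitor` in a second ssh session should show the ubus traffic at the same time, which may help line up a slow page load with what is on the wire.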

doa · February 20, 2021, 7:02pm

Yeah, it sounds weird, doesn't it? I cannot understand why this would happen on a freshly built server, or why it started after half a year without anything being changed.

Heh, I was about to do the tcpdump trace and start ubus monitor, but luci responds lightning fast now... I have to wait until the problem starts again. I cannot find any pattern in this behaviour.

There is most probably a troll biting the cable when I need to work with it

I'll get back to this thread when I have something in hand. Thanks for letting me know about ubus monitor; now that I know ubus exists, I can get a bit closer to the problem.
Thanks a lot, I'll reply.

eduperez · February 21, 2021, 10:15am

This could be caused by a conflicting IP address, too.

doa · February 22, 2021, 9:52am

Do you mean a conflicting router address, or any duplicate IP on the network?

eduperez · February 22, 2021, 1:17pm

If it happens out of the blue, I would rule out an issue on the router itself. And re-reading your post, it does not look like a duplicate address in your network, either. Sorry.

doa · February 22, 2021, 1:33pm

Would you check the capture, please? I don't see much because I don't know what to look for. The file is stored at https://www.dropbox.com/s/wmwu0farc8tq699/laggy_luci.pcap.zip?dl=0

10.0.0.1 is the router and 10.0.0.7 is the web browser client.

I browsed first to Status > Overview, then switched to Status > Firewall. Both took too long to load.

Setting the wireshark filter to "tcp.stream eq 12", I can see several retransmissions, but I don't know why they had to be sent so many times or why the router did not ACK right away.

anon50098793 · February 22, 2021, 2:20pm

it's a start... but you really need to capture from the router... and all traffic (well sorting for router ip/mac in srcdst and probably just br-lan would be ok for a start) not just that session...

good news is you now have something to search for in the larger capture... and see what's around those tcp-retransmissions... ( windows update is in the capture and it can chew some serious cpu and some bandwidth... if it's wifi go into the ap connection settings and set it to metered )

you should probably also check client+router interfaces for drops/error rates... split the issue between upper and lower level issues...
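
For the router side, one way to read the drop and error counters is straight from sysfs. This is only a sketch: it assumes a Linux system that exposes /sys/class/net/<iface>/statistics, and the function name iface_counters is made up for illustration:

```shell
#!/bin/sh
# Print drop and error counters for one interface, read from sysfs.
# The optional second argument overrides the sysfs root; it exists only so
# the function can be tried against a scratch directory.
iface_counters() {
    iface="$1"
    root="${2:-/sys/class/net}"
    for c in rx_dropped tx_dropped rx_errors tx_errors; do
        f="$root/$iface/statistics/$c"
        if [ -r "$f" ]; then
            echo "$c=$(cat "$f")"
        else
            # Counter file missing or unreadable on this system
            echo "$c=unavailable"
        fi
    done
}

# Example on the router (br-lan is the LAN bridge named in this thread):
#   iface_counters br-lan
```

Running it twice a few minutes apart and comparing the numbers says more than a single reading, since the counters only ever grow.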

doa · February 22, 2021, 2:49pm

Client drops are 0 and router drops on br-lan are 0 as well. I have checked that already; it's always zero.

This capture was created on the router, with these arguments: tcpdump -i br-lan host 10.0.0.1 and port 80 -w laggy_luci.pcap

Is this what you mean?

anon50098793 · February 22, 2021, 3:41pm

bowing out of this one now... given several good clues on how to follow up... but my hypothesis..

client issue ( likely high upper layer )
low layer network problem or surrounding/overlapping/nefarious/misconfigured traffic related

in other words... nothing to do with luci/openwrt...

Stefan1 · February 22, 2021, 7:43pm

So how many client machines are in the network? Would it be possible to disconnect them one by one when the problem occurs next time?
Or at what times of day does the problem occur?
Does it also happen at times when there is no load on the system?
What about pings to the openwrt host? Are they laggy, too?
Could you run a ping job on another machine that regularly measures the ping response times, to see whether there is a pattern?
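
A ping job like that could be as simple as filtering ping output for slow replies. A minimal sketch, assuming the ping on your machine prints a `time=<ms>` field on each reply line (both busybox and iputils ping do); the function name slow_pings and the 50 ms threshold are just examples:

```shell
#!/bin/sh
# Read ping output on stdin and print only the replies slower than a
# threshold given in milliseconds. Lost pings produce no reply line, so they
# will not show up here and have to be spotted from gaps in the sequence.
slow_pings() {
    threshold_ms="$1"
    awk -v limit="$threshold_ms" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^time=/) {
                    rtt = substr($i, 6) + 0   # strip "time=" and force numeric
                    if (rtt > limit) print
                }
            }
        }'
}

# Example (10.0.0.1 is the router in this thread):
#   ping -i 1 10.0.0.1 | slow_pings 50
```

Depending on how ping and awk buffer their output, slow replies may show up in batches rather than one by one.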

doa · February 22, 2021, 7:59pm

Thanks for the suggestions, Stefan.

There are about 20-30 active devices on the network, approximately 50% wired and 50% wireless. I will try unplugging the cables tomorrow when I see the lag; good idea.

I have node-red running on my network and have just set up a ping every second; I'll look at the graph tomorrow. The time of day does not seem to be relevant. Likewise, the system load does not have an effect on this.

I have turned off an rclone job that was pushing my "sirq" to 20%. Now the CPU is at almost 0%, but luci is just as laggy as before.

I have uploaded another example: https://www.dropbox.com/s/xxcb2c2xb8h1if9/luci.pcap?dl=0
The PCAP was taken on the router using the same arguments (-i br-lan host 10.0.0.1 and port 80), but this time the HTTP client was 10.0.0.4.

Looking at the packet trace, it looks to me like the http server did not manage to respond with an ACK. There are many TCP retransmissions, but I'm no expert in this, so I don't know what they mean.

I'll come back tomorrow with more answers. The router is quite powerful, but I don't think it's ok for it to reboot faster than I can log in to luci. It reboots so fast that Netflix clients don't notice, but logging into luci takes a minute.

Thanks a lot, Stefan. This is a really strange thing, I cannot imagine what it is ...

doa · February 23, 2021, 8:23am

Found the culprit. I was taking a capture and saw thousands of connections to the MQTT broker on my openwrt router. The source was a failed script that kept creating new connections, hundreds per second. The script has been taken down, the mosquitto broker restarted, and everything seems to work properly again.

Pity I did not take a capture without the port filter earlier; I could have saved everyone's time. Thanks for helping, guys.
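
For anyone landing here with similar symptoms, a connection flood like this can also be spotted without a capture by counting TCP connections per local port. A minimal sketch, assuming netstat is available on the router; the function name conns_per_port is made up, and port 1883 in the comment is the default MQTT port, which is an assumption since the port is not stated in this thread:

```shell
#!/bin/sh
# Read `netstat -tn` output on stdin and print "<count> <local port>" pairs,
# busiest port first. A port with thousands of entries (for example 1883 if
# the MQTT broker uses its default port) is a strong hint.
conns_per_port() {
    awk '
        $1 ~ /^tcp/ {
            # Local address is column 4; the port follows the last colon
            n = split($4, parts, ":")
            count[parts[n]]++
        }
        END {
            for (p in count) print count[p], p
        }' | sort -rn
}

# Example on the router:
#   netstat -tn | conns_per_port | head
```

Connections the router opens itself are counted under their ephemeral local ports, so the interesting lines are the ones for ports where a service is listening.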