Asus RT-N16 unstable with LEDE

Hi there,

Since the latest WPA2 vulnerability, I decided it's time to upgrade the firmware of my Asus RT-N16. Don't hit me, but I was running a DD-WRT firmware from 2010 with some Optware packages.

So this morning I switched to LEDE 17.01.4 and was quite happy for a few minutes, until I realized that my router was rebooting occasionally. Further investigation revealed that it reboots whenever I generate some traffic, like watching a YouTube video, running a speed test or downloading a file.

So I reset to default settings, flashed the firmware again and reset to defaults again. Changing absolutely nothing (WiFi is off by default), I started a download and the router immediately rebooted again.

The next step was "going back" to OpenWrt "Chaos Calmer" 15.05.1, but I had exactly the same issue. So it's not just a LEDE problem I have here.

To make sure my router still works flawlessly, I flashed back my DD-WRT image from 2010, which runs like a champ: no reboots, and it uses the full 100 Mbit/s of my Internet connection without any hiccups.

The big question is: how can I debug this? Using telnet and watching the logs doesn't really help when the device reboots.
The image I used was lede-17.01.4-brcm47xx-mips74k-asus-rt-n16-squashfs.trx.

Is there anything I can do? I really want to upgrade from my 7-year-old image to something more modern and secure.
Any help is appreciated.

Kind regards!

buedi

I'm troubleshooting the same (or a very similar) problem. I'm running an Asus RT-N16 with the latest LEDE (I updated from 17.01.3 to LEDE Reboot 17.01.4 as part of the troubleshooting) and did a clean install.

What I've figured out so far is that the router fails under high load. If you're getting a reboot, you're better off than me: my router simply stalls, stops routing traffic, and is no longer reachable over HTTPS or SSH.

Running top in an SSH session shows close to 100% CPU utilization under network load, all of it in SIRQ (software interrupts).

Note also that I'm testing over Ethernet, not WiFi. I'm not sure the old OpenWrt bug 7356 is relevant (and yes, Chrome reports "Cert Revoked" when you go there).

OK, and here's a potentially useful description of "interrupt thrashing", which suggests that a straightforward Ethernet driver could produce this SIRQ behavior if it didn't use some form of interrupt pacing, such as NAPI.

Maybe the problem is that the RT-N16's Ethernet driver isn't using interrupt pacing properly? That's beyond my depth, but it seems plausible given the recent work on LEDE Ethernet drivers to use napi_complete_done.
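One rough way to probe the interrupt-pacing theory, sketched under assumptions: under sustained load a NAPI-style driver masks its interrupt while it polls, so hardware interrupts per received packet should fall well below 1, whereas a driver taking an interrupt for nearly every packet stays close to 1. High SIRQ alone does not distinguish the two, because NAPI polling itself runs in softirq context. The interface name `eth0`, and the assumption that the Ethernet IRQ's name is the last column of /proc/interrupts, are guesses to check against `cat /proc/interrupts` on the actual router; all function names are hypothetical.

```shell
#!/bin/sh
# irq_total FILE NAME
# Sums the per-CPU counts on the /proc/interrupts-style line whose last
# column equals NAME (assumed to be the Ethernet driver's IRQ name).
irq_total() {
    awk -v name="$2" '
        $NF == name {
            s = 0
            # Per-CPU counters follow the "NN:" column; stop at the first non-number.
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^[0-9]+$/) s += $i
                else break
            }
            print s
            exit
        }' "$1"
}

# irqs_per_packet IRQ_BEFORE IRQ_AFTER PKT_BEFORE PKT_AFTER
# Prints interrupts per received packet with two decimals, or n/a if no packets arrived.
irqs_per_packet() {
    awk -v i0="$1" -v i1="$2" -v p0="$3" -v p1="$4" 'BEGIN {
        dp = p1 - p0
        if (dp <= 0) { print "n/a"; exit }
        printf "%.2f\n", (i1 - i0) / dp
    }'
}

# measure_irq_ratio IFACE IRQNAME SECONDS (hypothetical helper; run during a download)
measure_irq_ratio() {
    rx=/sys/class/net/$1/statistics/rx_packets
    i0=$(irq_total /proc/interrupts "$2"); p0=$(cat "$rx")
    sleep "$3"
    i1=$(irq_total /proc/interrupts "$2"); p1=$(cat "$rx")
    irqs_per_packet "$i0" "$i1" "$p0" "$p1"
}
```

For example, run `measure_irq_ratio eth0 eth0 10` while a download is in progress and compare it with an idle reading.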

Hi @mrbene,

you're much further than I am :slight_smile:
It's been a few days now, but I think I ran top during one of my tests and didn't see CPU spikes. That may be because the router reboots almost the moment I put network load on it, and top only refreshes every second or so. Or I simply missed it. I'll need to try it again.
If I'm lucky, I'll find some time over the weekend to flash LEDE again and test it. Then at least we'll know whether we have the same issue.

And yes, all my testing was done over Ethernet too, with WiFi completely deactivated to rule out problems in that department.

@buedi - I have been using Steam to generate predictable bandwidth usage (there's a throttle in "Settings - Downloads - Limit Bandwidth To...", and their servers reliably hit the specified limit). Set at 1 MB/s I see ~15% SIRQ; at 1.5 MB/s it's closer to 40%.
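Steam's limit is in megabytes per second, while line speeds elsewhere in this thread are in megabits per second. Assuming Steam's MB means 10^6 bytes (if it means 2^20 bytes, the figures are about 5% higher), the conversion is a factor of 8, so the ~40% SIRQ reading corresponds to roughly 12 Mbit/s and the ~15% reading to roughly 8 Mbit/s. A tiny helper (hypothetical name) for converting further readings:

```shell
#!/bin/sh
# mbyte_to_mbit RATE_MB_PER_S
# Converts megabytes/s to megabits/s, assuming 1 MB = 10^6 bytes (8 bits per byte).
mbyte_to_mbit() {
    awk -v r="$1" 'BEGIN { printf "%g\n", r * 8 }'
}
```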

My next steps are to see whether I can reproduce it on 17.01.2 or 17.01.1. If not, I'll need to move some things around, maybe revert to an even older firmware. I was running an older OpenWrt without issues until recently.

It might not be LEDE - it's probably the RT-N16 itself.

I've had two with the same symptoms: flakiness, random reboots, and eventually a router that won't boot at all. The problem is sometimes described as random WiFi dropouts, but usually the router is rebooting.

The problem is caused by a poor-quality power supply capacitor in the unit itself. Over time, it leaks and goes bad.

The fix is simple and inexpensive, but it requires opening the router and soldering in a replacement capacitor.

Here's a link with some decent pictures of the process: http://www.nerdybynature.com/2013/10/26/fix-a-fried-asus-rt-n16/

A search for 'rt-N16 capacitor' will pop up plenty of other examples.

I also recommend checking the power supply with a voltmeter. The capacitor problem fried one of my supplies.

Hi @mikeyp,

Thanks for your suggestion. I ran across the capacitor issue last weekend when I was trying to find out why my router was acting so strangely. Before opening it up, I decided to go back to my very old DD-WRT build, and since then it has run flawlessly. It pushes the full 100 Mbit/s my ISP provides, and on the LAN I have no trouble reaching 750 Mbit/s to my NAS without any hiccups.
So I concluded it can't be a hardware issue if it runs great with different software. Then again, maybe a marginal hardware problem just shows up sooner with newer firmware than with the very old build I'm using now. So I'm not ruling out the capacitors yet, but since time is short and I need my Internet connection during the week, further investigation will have to wait a bit :slight_smile:

The problem probably correlates directly with power consumption and load.

I fully understand needing the router during the week. My first failure was the hard failure, and I didn't have time to mess with it, so I immediately bought a replacement. I didn't find the capacitor problem until the second one got flaky months later.

A quick visual check will usually identify the failed caps. Here's what my two looked like. The one on the left has a rust-colored stain from the leakage (that was the hard failure). The one on the right has a small black mark on top, but the bulging top is the dead giveaway.

[Photo: both_caps]

Good Luck !


@mikeyp are you currently running latest LEDE on RT-N16? Would love to get extra data points.

@mrbene, not yet. If all goes well, I will be this weekend. I landed here looking for firmware that's actively maintained, and noticed this thread while checking on RT-N16 install details. DD-WRT mega was on the routers when they were flaky, and it has been stable since the hardware repair, so I'm optimistic about LEDE.

@buedi @mikeyp - I didn't have a chance to investigate this further over the weekend; instead I learned about larger-scale systems thanks to a failed water heater at home. One thing I did notice, though, is that sometimes LEDE fails and recovers (a brief period of Internet unavailability, then an "Uptime" of <1 minute on the main status page), and sometimes it fails and doesn't recover (no further routing, no response on the HTTP or SSH ports; it has to be hard rebooted by unplugging it to regain functionality).

I had started down the path of writing logs to a USB drive so that they survive across reboots, but USB drives aren't supported out of the box. Y'all have any strategy for logging events?
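One possible strategy, sketched under assumptions: append a heartbeat line with the current uptime to a file on persistent storage every few seconds, and sync it so it reaches the disk before a crash. If entries keep advancing through an outage, the system was alive and only networking stalled; if they stop, the whole box froze or rebooted, and the first entry after the gap (with a smaller uptime) shows when it came back. Uptime is logged next to the clock because it does not depend on NTP. The log path, interval and function names are hypothetical.

```shell
#!/bin/sh
# heartbeat LOGFILE [INTERVAL]
# Hypothetical helper: appends "date time up=SECONDS" every INTERVAL seconds
# (default 5) and syncs so the line survives a sudden crash or power loss.
heartbeat() {
    while :; do
        echo "$(date '+%Y-%m-%d %H:%M:%S') up=$(cut -d ' ' -f 1 /proc/uptime)" >> "$1"
        sync
        sleep "${2:-5}"
    done
}

# find_reboots LOGFILE
# Prints the timestamp of every entry whose uptime is lower than the previous one.
find_reboots() {
    awk '{
        split($3, kv, "=")
        up = kv[2] + 0
        # Uptime going backwards means the router restarted since the previous entry.
        if (NR > 1 && up < prev) print "restart before: " $1 " " $2
        prev = up
    }' "$1"
}
```

For example, start `heartbeat /mnt/usb/heartbeat.log &` after mounting the drive, then run `find_reboots /mnt/usb/heartbeat.log` after the next outage.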

@mrbene - Me neither, haha! But when I had LEDE on my router, I added USB support and wrote down the steps needed to get my thumb drive working. Maybe that helps you (although I don't know how to redirect the logs to another directory):

  • opkg update && opkg install usbutils (this gives you lsusb; lsusb -t was very useful for getting my thumb drive up and running)
  • opkg install e2fsprogs (because my drive had an existing ext2 filesystem)
  • opkg install kmod-fs-ext4 (to get ext2/ext4 support into the kernel)
  • opkg install kmod-usb-storage (for USB storage support in the kernel)

With that I was able to mount my thumb drive and access it.
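On redirecting the logs: LEDE's logd can write the system log to a file through the `log_file` option in /etc/config/system, which can point at the USB drive. A minimal sketch, assuming the drive is already mounted (here at the hypothetical /mnt/usb, e.g. via `mount /dev/sda1 /mnt/usb` if the first partition shows up as sda1). The function name is hypothetical, and it refuses to continue if the path is not a mount point, so the log doesn't end up on the router's internal flash.

```shell
#!/bin/sh
# log_to_usb MOUNTPOINT
# Points logd's file output at MOUNTPOINT/lede.log via UCI and restarts logging.
log_to_usb() {
    mnt=${1%/}
    # Only continue if MOUNTPOINT appears as a mount point in /proc/mounts.
    if ! grep -qs " $mnt " /proc/mounts; then
        echo "log_to_usb: $mnt is not a mounted filesystem" >&2
        return 1
    fi
    uci set "system.@system[0].log_file=$mnt/lede.log"
    uci commit system
    /etc/init.d/log restart
}
```

After `log_to_usb /mnt/usb`, `uci show system` should list the new `log_file` value; `uci delete system.@system[0].log_file`, `uci commit system` and restarting the log service undo it.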

Not helpful for this particular issue, but I did rule out one potential problem: I opened up my router and the capacitors look fantastic, so it shouldn't be the fairly common capacitor problem. It still has the 470 µF caps, though; I've seen that newer units have 680 µF.
I wonder whether that could really make a difference. But even if it does, we still have the very high CPU utilization @mrbene noticed :frowning:

Last night, after the kiddos passed out from sugar overdoses, I had a chance to poke around with the RT-N16. I switched the firmware to OpenWrt Chaos Calmer 15.05.1 (March 2016). High CPU usage (especially SIRQ) under Ethernet load reproduced there too, but without triggering the failure I'm seeing with LEDE 17.01.4.

That might be because my line maxes out at ~60 Mbps up/down rather than your 100 Mbps, @buedi. It definitely supports the theory that an Ethernet driver in OpenWrt has a problem.

Oh, and here's additional data. Specifically:

Unfortunately CPUs on most of SoCs are too slow to provide 1000 Mb/s routing or NAT. It results in NAT being limited to something around 130Mb/s on BCM4706 and even less on slower units (like ~50Mb/s on BCM4718A1).

RT-N16 runs BCM4718.

So, synopsis:

  • The RT-N16's CPU isn't expected to handle NAT at transfer rates above ~50 Mb/s
  • High CPU utilization seems to be the expected result of this, and early attempts to reverse-engineer ctf.ko (Broadcom's proprietary cut-through forwarding module) don't appear to have gone anywhere.

Now I'm definitely interested in which version of DD-WRT you're using, so I can test transfer rates on my network with it.

It would be nice to try those custom patches; they should improve WAN speeds by up to 300%.

Hey @mrbene

everything you found makes sense, in the sense that it's backed by facts and sounds valid.
But I'm easily reaching 100 Mbit/s on the WAN side through NAT on my RT-N16. On the LAN side (just switching), I get up to 750 Mbit/s reading from my NAS and around 450 Mbit/s writing to it, and I'm not even sure whether the NAS is the limiting factor here.

My ancient DD-WRT image is this one:
DD-WRT v24-sp2 (08/07/10) mega - build 14896
I had build 14949 on it before my flashing adventure, but I couldn't find that one again, so I'm on 14896 now.

It runs like a champ, but I'd really like something more modern with the latest security patches. A friend will help me replace the capacitors on mine. They look good from the outside, but maybe they've dried out after all these years, and new ones might make it run more stably. Still, that CPU load will probably keep me from using a newer firmware version.

What blows my mind is: why is there such a huge difference in load between a 2010 firmware and a newer one when performing basic tasks like switching and NAT?
If we were using tons of additional fancy stuff, I could understand it. But doing the same thing with firmware that's 7 years newer gives such a major difference? That's really strange.

There's not much RT-N16 discussion around here; I'm just wondering whether you guys find the WiFi driver included by default very useful.
I've found that I need to use the kmod-brcm-wl driver as detailed here: https://wiki.openwrt.org/toh/asus/rt-n16
I also find I have to use the command line mostly or risk having settings get stepped on. But I'm interested in your experiences.
thanks

Sorry to revive an old thread here. I wanted to mention that I have an RT-N16 and installed LEDE over the weekend (coming from DD-WRT). I ran the original b43 driver, kmod-brcmsmac and the kmod-brcm-wl driver, using the instructions here: https://wiki.openwrt.org/toh/asus/rt-n16. I landed on kmod-brcm-wl because it was fastest after tweaks, BUT I may have botched the earlier drivers and had them running in G-only mode. It has been very solid so far, streaming Sling, YouTube, etc. It has been up for a little over a day with 1G+ of transmission and no drops or reboots. I did notice some slight oddness with the UI and committing changes after installing kmod-brcm-wl, but it's definitely no worse than DD-WRT.

Do you get a proper working list of wireless clients that are connected?