If you have v1.001 you may safely upload the dlf version of OpenWRT 21.02 to this device. It will most likely run perfectly, as mine does. The v1.002 version of this router behaves differently in my case: download, both wired and wireless, never exceeds 10Mb/s, while upload is on par with v1.001. I still have to figure out whether it's just my router or whether all WLR-7100 v1.002 units are affected.
I have both versions, and have been using the v1.001 with the v1.002 on the shelf for the past two years or so. To upgrade my router, I configured the v1.002 with a newer firmware and replaced the entire router, but ran into poor performance problems. I first suspected firmware changes (not realizing I had two different versions), but I downgraded the v1.002 to the same firmware as v1.001, and the problem persisted. I have not yet tried upgrading the v1.001 to the latest firmware (since that is currently running my organization's network, so I'm reluctant to risk breaking my only working setup), but it seems indeed that v1.002 is slightly different and thus OpenWRT has bad performance on it, just like @Dock also described.
@Dock Do you happen to have pictures of the v1.001 for comparison? I put pictures of the V1.002 on the wiki, but cannot easily open up my v1.001 now.
As for further debugging:
I have found that the root cause seems to be dropped incoming packets. I've seen 1% to 15% loss depending on circumstances.
I've seen this happen on both WAN and LAN.
Setting up port mirroring on the internal switch suggests that the packets pass through the switch properly, but get lost before or in the CPU ethernet interface.
I've seen this happen with HTTP and scp, but also with simple pings. The easiest way to reproduce this is to just ping the router: ping 192.168.2.1 -s 1400 -i .01. This uses a short interval (so a high packet rate), so you can see the problem without having to wait for a long time, and bigger packets, since that also seems to increase the packet loss percentage (but not linearly). However, the problem is also visible with default 56-byte packets and a 1-second interval, so it does not seem to be a packet size or packet rate issue.
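For anyone who wants to repeat this measurement on their own unit, here is a minimal sketch that pulls the loss percentage out of ping's summary line. The helper name loss_pct is made up for illustration, and the -c and -q flags are additions that were not part of the original command; very short intervals may also need root, depending on the ping implementation.

```shell
#!/bin/sh
# Hypothetical helper: read ping output on stdin and print only the number
# from the "N% packet loss" summary line (printed by iputils and busybox ping).
loss_pct() {
    sed -n 's/.* \([0-9.]*\)% packet loss.*/\1/p'
}

# Example, run from a machine on the LAN side (address, size and interval
# are the ones from the post above; -c 1000 and -q are assumptions):
#   ping 192.168.2.1 -s 1400 -i .01 -c 1000 -q | loss_pct
```

Running it once with -s 1400 and once with the default size gives two numbers that can be compared directly between firmware versions or boards.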
On the recent 23.04 release, I get about 1.3% with 56-byte pings, and 13% with 1400-byte pings. However, on some old snapshot version (r15249-85caf21ade, kernel 5.4.83) I get about 0.2-0.4% with 56-byte pings, and around 2% with 1400-byte packets, so it seems the problem did get worse in recent versions.
I suspected that only incoming packets were affected, because e.g. outgoing file transfers were fast while incoming were slow (outgoing file transfers would then still have dropped ACK packets, but those are smaller so less problematic). If this were true, however, then a ping sent from the router would see the same amount of packet loss (replies would get lost instead of the pings), but that does not seem to be the case - ping 192.168.2.191 -s 1400 gives me 0 lost packets (out of 400 tries, busybox ping doesn't support < 1s interval). Using -A (next ping as soon as the previous reply is received), I see 10 out of 4000 packets dropped, so maybe the problem still exists but is somehow even less likely to occur on reply packets? Running longer with 1s interval I see 2 out of 500 packets dropped and later 3 out of 4000 packets.
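To put those counts next to the percentages quoted for incoming pings, the arithmetic can be sketched as below. The function name loss_ratio is only illustrative; awk is used because shell arithmetic is integer-only.

```shell
#!/bin/sh
# Hypothetical helper: percentage of lost packets from "lost" and "sent" counts.
loss_ratio() {
    awk -v lost="$1" -v sent="$2" 'BEGIN { printf "%.3f\n", 100 * lost / sent }'
}

loss_ratio 10 4000   # the -A run
loss_ratio 2 500     # first run at 1 s interval
loss_ratio 3 4000    # later run at 1 s interval
```

That works out to 0.25%, 0.4% and 0.075%, all well below the loss seen when pinging the router with 1400-byte packets.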
The stock firmware is the same for both hardware versions (both versions have their own download page and different filenames, but are byte-for-byte identical). Maybe the firmware source has a hint about what is different between hw versions, but I have not investigated yet.
One other interesting detail: on v1.001, /proc/cpuinfo says system type: Atheros AR9342 rev 3, and on v1.002 it says system type: Atheros AR9342 rev 1. So this confirms a previous suspicion that these versions use a different revision of at least the SoC (interesting that the newer router version has a lower SoC revision, but maybe there's some revision register weirdness going on there), though it is unclear whether the problem is related to the change in SoC revision, or whether there is also a PCB change.
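If others want to report which revision their unit has, a small sketch like the following prints just the revision number. The helper name soc_rev is made up, and the pattern assumes the "system type ... rev N" layout quoted above.

```shell
#!/bin/sh
# Hypothetical helper: print the SoC revision number from /proc/cpuinfo-style
# input, assuming a line of the form "system type : <name> rev <N>".
soc_rev() {
    sed -n 's/^system type[[:space:]]*:.* rev \([0-9][0-9]*\).*/\1/p'
}

# Usage on the router:
#   soc_rev < /proc/cpuinfo
```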
I've tried to look up the revision history for this SoC, but I couldn't find a datasheet for the AR9342 at all (I could find 41 and 44 here). Also, the actual chip is marked AR1022, but I suspect this is the old Atheros name for the Qualcomm-Atheros AR9342 (suggested by this wiki page as well). It's a bit weird that I can find hardly any mention of the AR1022 chip at all, though, so certainly no datasheet either...
These are very nice findings, the differences in SoC revision could explain the problem. Could You test whether packet loss also occurs when 100Mb/s or 10Mb/s is used? If not, then probably the 1Gb value from pll-data is wrong. But if the losses are also there, maybe we need to add delays in the dts gmac-config node (https://git.openwrt.org/f3ffac90bc). The values can be from 0 to 3. For a detailed description check the datasheet AR9344_May_2012.pdf (copies float around on GitHub), section 9.7 and tables 11-3 and 11-4.
Do You still have vendor firmware on v1.001? Could You check if ethreg or devmem (or, failing that, /dev/mem) is present?
I will take pictures of both v1.001 and v1.002 this week.
I also see another strange behaviour on the v1.002, which made me decide to stop using the device: it seems to run out of memory when trying to upgrade with Sysupgrade. It just hangs while uploading the file... reaching 100% only once in a lot of attempts. The v1.001 always uploads new firmware at a steady pace.
Any of you also using the WLR-8100 (both versions)?
@tmn505 Thanks for your debug suggestions, that was exactly what I needed to continue this investigation. I'll check - I know where to find the DTS file and can figure out how to test changes, just not what values to check.
As for the stock firmware - I do not have it running, but I can flash it again (did that this week to recover already). I'll check for /dev/mem and ethreg and come back to you with the result.
I've also seen this when upgrading through the webui - I can imagine that the packet loss just breaks the upgrade?
Furthermore: I have one new observation: it turns out I have two v1 002 boxes, and they have a different SoC in them (AR1022 and AR9342). I previously noticed a different revision number in /proc/cpuinfo and assumed that the other one (without problems) would be a v1 001, but today I removed it from our network and saw that it was also v1 002. Looking more closely, it seems that the PCB is fully identical (the only difference I could find is a bit of silkscreen on the right upside down that looks like it might be a timestamp and "P2-1" vs "P4-2").
The 1 and 3 in AL1A and DL3A match the revision shown in /proc/cpuinfo, so that might be related.
That's interesting - you have different versions, both with AR1022, and I have two of the same version but with different chips, heh... I'm not sure if/how this helps yet, but it does suggest that we should not focus on the difference between AR1022/AR9342 too much, since there is an AR1022 chip that also works (or maybe something is operating on the edge of specifications and it just happens to work on some units but not on others...).
It also seems that your v1 001 version has a slightly different PCB color, my PCBs look more like your v1 002 one.
Otherwise your PCB looks completely identical (including silkscreen and which components are left out) for the part I can see on your picture. Could you maybe post one more picture of both entire PCBs, to compare the silkscreen version numbers (which fall off the bottom of your v1 001 picture)?
Nope, though a friend of mine mentioned they have a WLR-5100 (which is apparently identical to the WLR-4100 but with 5GHz wifi added to an identical PCB).
Anyway, thanks for the input. I'm out of time for today, but will try to do some more tests soon (but I should also be doing other things, so might take me a bit more time).
Yeah, I already found that. I couldn't find the AR9342 datasheet, though. Do you know if that just wasn't leaked, or are the AR9342 and AR9344 both covered by that datasheet? If not, I assume they are similar enough that most of the contents still apply, of course (though I noticed the docs for RST_REVISION_ID did not match the kernel sources - datasheet documents 0x011c1 for the AR9344, but the kernel sources use a different value - maybe the value was changed later or something...).
These are the same cores; they differ in CPU speed (MHz) and supported wifi band. They could also differ in integrated peripherals, like the integrated switch (not relevant in our case).
I want to check whether register values differ somewhere between the revisions. If neither tool is present in the vendor firmware, I'll attach a busybox binary with ethreg compiled in.
Can You also check if switch chip differs between all Your devices?
I've taken my "working" AR9342 version, which I had so far only tested with a years-old snapshot version, upgraded it to the latest snapshot version and confirmed it is still working. Then I did the same with my "broken" AR1022 version, and confirmed it is still broken. This is another confirmation that the problem is indeed caused by hardware differences.
The loss also happens on 100Mbit (my home switch does not do gbit).
The switch chip is the same - QCA8337N-AL3C on both my boards, and @Dock's v1 001 (I do not know about their v1 002 board).
I installed the stock firmware again, it has both ethreg and /dev/mem:
I also tried installing ethreg inside OpenWRT, but found that it does not seem to be available anywhere, I just found one github repo with a busybox version that still had ethreg.c. I guess it was never part of upstream busybox and is no longer used nowadays (I also couldn't find any reference to the IOCTL used in kernel sources, so I suspect that might have been WRT-specific patches in the past?). In any case - compiling it with prefix=mips-linux-gnu- worked, but then failed to run on OpenWRT with invalid instruction - probably missing some compiler option...
In any case - ethreg is available, but I couldn't quite figure out what the base register address for the tool is. I thought (looking at the AR9344 datasheet) maybe 0x18070000 (base address of GMAC registers), or maybe no offset (just full register addresses), but in both cases trying to read 0x18070004 (LUTs_AGER_INT) returns an incorrect value (bits 31:4 are reserved and should read 0):
I also considered using /dev/mem, but could not find any tool in the stock firmware to make a usable dump. There is tail, but not head, so that might be usable to make a binary dump, but then I do not have an easy way to get the binary file off the board through serial (stock firmware does not have telnet or ssh or netcat):
/bin/sh: dd: not found
/bin/sh: devmem: not found
/bin/sh: hd: not found
/bin/sh: hexdump: not found
# ls /bin/
* cp egrep ls netstat ps sleep umount
ash date grep mkdir nice pwd stty uname
busybox df ip mknod pidof rm sync vi
cat dmesg kill mount ping sed touch
chmod echo ln mv ping6 sh true
# ls /sbin/
KC_SMB hotplug2 radvd
apcfg_init httpd raether
app_agentd ifconfig rdisc6
app_agentsd ifdown reboot
apps_init ifup rmgmt
apps_init_ver.txt igmpproxy rmmod
appscore_init_ver.txt init route
arp insmod rpcbind
autoFWupgrade ip setconfig
brctl ip6tables ssdk_sh
burn ippoold starteth.sh
config_init iptables sysconf_cli
config_term iwconfig sysconfd
dhcp6c iwlist sysctl
dhcp6s iwpriv syslogd
dl.sh links.sh tc
dnsget llmnrd udhcpc
dnsmasq logserver udhcpd
dr.sh lsmod umount
dumpleases md updatedd
eraseall minidlna usbmgr
ethreg miniupnpd usbmgr_cli
ez-ipupdate mkfs.jffs2 utmconfig
factory_apps_init mm utmproxy
flashw modprobe uuidgen
gendoclist.sh mount vconfig
genmuslist.sh ntfs-3g wandetector
genpiclist.sh ntfslabel wanmanager
genvidlist.sh ntpclient wanmanager_host_get
halt openl2tpd watchdog
header opmode.sh wget
hostapd poweroff wlanconfig
hostapd1 pppoe wolmanager
# ls /usr/bin/
* burnA config_init ipcs tail wc
[ burna config_term killall test which
[[ burnb ether-wake logger tftp
awk burnf expr md5sum time
basename burnk find printf tty
burn cmp id sort uptime
# ls /usr/sbin
chat poff pon pppd pppd0 pppd1 pppstats
@tmn505 If you have suggestions on what offsets to pass to ethreg to inspect relevant registers, let me know.
I've also just soldered on a UART header on the AR9342 working board, so we can compare (see if the stock firmware uses different settings on both boards, or maybe just uses the same settings that work on both).
w00t! I went ahead and compiled a custom image with this patch (just to see if it would change anything - I copied these settings blindly from some other board), and it fixes the packet loss (tested just with ping so far) on both 100Mbps and 1Gbps:
Of course, there is no indication that these are really the best settings (In git history I see other boards where they reverted these delays in favor of changing the phy-mode or pll-data), but at least it confirms that we're looking in the right place. So I guess still worth looking at the settings used by the stock firmware.
I also enabled /dev/mem in this image, so I can see the four delay fields set to 3 in memory (the 0x3FC part of the value, bits 21:14):
# devmem 0x18070000
I can also change them back to the default 0, which causes packet loss again:
# devmem 0x18070000 32 0x00000001
Fiddling around with these settings, I can see that only the ETH_RXD_DELAY seems to influence the packet loss seen with ping, and setting it to 1 is sufficient to fix it completely (at both 100Mbps and 1Gbps).
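To make this kind of fiddling less error-prone, here is a rough sketch that decodes the four 2-bit delay fields from a value read at 0x18070000 and builds a new value with only the RX data delay changed. The function names are made up, and the field layout (RXD at bits 15:14, RDV at 17:16, TXD at 19:18, TXE at 21:20) is an assumption based on the AR9344 datasheet and the OpenWrt ath79 register defines, so please double-check it before writing anything.

```shell
#!/bin/sh
# Hypothetical helper: decode the delay fields in bits 21:14 of the register
# value given as $1. Field order is an ASSUMPTION (see text above).
decode_eth_cfg() {
    v=$(( $1 ))
    echo "rxd=$(( (v >> 14) & 3 )) rdv=$(( (v >> 16) & 3 )) txd=$(( (v >> 18) & 3 )) txe=$(( (v >> 20) & 3 ))"
}

# Hypothetical helper: return register value $1 with only the RX data delay
# field (assumed bits 15:14) replaced by $2 (0..3); other bits are untouched.
set_rxd_delay() {
    printf '0x%08X\n' $(( ($1 & ~(3 << 14)) | (($2 & 3) << 14) ))
}

# Possible use on an image with devmem available (not run here):
#   decode_eth_cfg "$(devmem 0x18070000)"
#   devmem 0x18070000 32 "$(set_rxd_delay "$(devmem 0x18070000)" 1)"
```

With all four fields at 3 and bit 0 set the value would be 0x003FC001, and starting from 0x00000001 a delay of 1 on RX data only gives 0x00004001.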
I haven't tested the other board yet.
I also noticed an RX_DELAY setting in the ETH_XMII_CONTROL register (which is what pll-data sets AFAICS). In the default ar934x.dtsi, it is set to 1, but changed to 0 for the wlr-7100. However, if I change it back to 1, connectivity seems to break entirely, but maybe that's because it shouldn't be changed while the NIC is running.
I also noticed that the phy-mode can also be used to introduce a delay, which then configures the PHY (or in this case, the AR8337, which is sort of a virtual PHY since it is connected to the CPU using RGMII) to introduce a delay (which, looking at upstream linux qca8k-8xxx.c, sets a delay of "2", which I presume is 2ns). This would be triggered by setting phy-mode = "rgmii-rxid". I haven't tried this yet, since that would involve updating the DTS, so regenerating the image (or at least the kernel), which takes a while.
What is interesting is that AFAIU both RX and TX need a delay somewhere (by default CLK and DATA are output synchronously, resulting in instabilities). So it seems that TX actually does have a delay somewhere, or is just lucky that it always works? I see that sometimes PCB traces can be laid out to introduce this delay (there are some squiggles in the traces between CPU and switch, but maybe that's just length matching), but then I would be surprised if they did that for TX but not for RX.
One question I had: How is this really relevant? The link between the switch and the CPU is always 1Gbps AFAICS, only the link between the switch and the outside world is different. And AFAICS both the pll-data as well as the delays only affect this link, right? Or does the pll-data also somehow drive the switch chip?