Vectoring on Lantiq VRX200 / VR9 - missing callback for sending error samples

Is it only about line stability? In that case I can't test it, I'm afraid. My line is rock stable. In recent years, my line uptime has equaled the time between two OpenWrt releases.

This is very unlikely to decrease stability. But if your line is already entirely stable, it won't improve anything, either. So it is understandable if you don't want to try it.

It would make the most sense to try this on a vectored line where a VR9-based modem running OpenWrt is currently unstable (where unstable could also mean a resync just every few days), while stock firmware or other modems are working fine.

I guess my link would qualify: a VDSL2 100/40 link with known stability issues ever since the switch from plain VDSL2 50/10 to VDSL2-Vectoring+G.INP 100/40. I am just not sure when I can inflict this on my user base/family (my holiday is just over) ...

As a note to anyone who wants to try this: There seems to be a bug in the DSA switch driver, so that packets larger than 1496 bytes are dropped when VLAN is used (FS#3990). This can be worked around by decreasing the MTU by 4 bytes (or patching the driver).

I also noticed that after a few hours (between 8 and 24 so far), a small fraction (<5%) of packets transmitted over the DSL interface get delayed by up to 4 seconds. I am currently trying to find out what causes this, and if it could be related to my changes.

The latency issue is actually related to the vectoring driver. The problem seems to be that the vectoring driver directly calls the ndo_start_xmit method of the PTM driver to send the data. I think the reason this works on non-OpenWrt firmware is that instead of the PTM driver they use the PPA driver where calculation of the tx descriptor is protected by a lock.

I found out that AVM's current version of vectoring driver does not call the ndo_start_xmit method, but instead queues the data normally using dev_queue_xmit. It is available from https://osp.avm.de/fritzbox/fritzbox-7430/source-files-FRITZ.Box_7430-07.27.tar.gz (the vectoring driver is in "GPL/GPL-kernel.tar.gz" under "linux/drivers/net/avm_cpmac/switch/ifx/vectoring/").
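To illustrate the kind of problem suspected here, a toy model in Python. This is not code from the PTM, PPA, or AVM drivers, and all class and function names are made up. If two callers both read the next free tx descriptor index before either one commits, they end up using the same descriptor. Feeding the packets through a single queue avoids that by construction.

```python
from collections import deque


class ToyTxRing:
    """Toy transmit ring; 'next_desc' stands in for the tx descriptor index."""

    def __init__(self):
        self.next_desc = 0
        self.used = []  # list of (descriptor index, packet)

    def read_index(self):
        return self.next_desc

    def commit(self, index, packet):
        self.used.append((index, packet))
        self.next_desc = index + 1


def direct_calls_interleaved(ring, pkt_a, pkt_b):
    """Two callers enter the transmit path at once, without any lock."""
    idx_a = ring.read_index()  # caller A reads the index
    idx_b = ring.read_index()  # caller B reads it before A has committed
    ring.commit(idx_a, pkt_a)
    ring.commit(idx_b, pkt_b)  # reuses the descriptor A just filled


def queued_calls(ring, packets):
    """All packets pass through one queue and are sent one at a time."""
    queue = deque(packets)
    while queue:
        pkt = queue.popleft()
        ring.commit(ring.read_index(), pkt)


racy = ToyTxRing()
direct_calls_interleaved(racy, "data", "error-report")

serialized = ToyTxRing()
queued_calls(serialized, ["data", "error-report"])
```

In the first case both packets land on descriptor 0; in the second they get descriptors 0 and 1.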

Currently I have been using that driver for more than 24 hours and it seems to work. The only thing I noticed is that the n_processed parameter increases faster while the upstream is saturated. I think this is because some error reports get dropped due to missing prioritization, and as a result the VCE requests more to be sent.

As I have done tests for some time now both with and without the vectoring driver, I think that it actually fixes downstream vectoring. During 3 days without the vectoring driver, the downstream SNR margin dropped by 1.3 dB (from 15.9 dB). While this is not that much, it has always stayed within 16-16.4 dB with the vectoring driver so far (except for the first 5 minutes after synchronization where it is often a bit lower). Also, most of the decrease happened one night between 2 and 3 AM, which is when ASSIA/DLM usually reconfigures lines. And connection/disconnection of other lines is exactly when error reports are needed the most.

I really hope this is working out. The proof is in a longer uptime, though. I distinctly remember that, when I had my vectoring line on a FritzBox 3370 (VR9), my line wasn't completely craptastic -- it sometimes took more than two or three weeks to significantly deteriorate. The real difference is that it never recovered from a downgrade such as the one you describe, and ended up slowly but steadily creeping down to ~6 dB SNR.

Truth be told, I will probably never switch back to an XRX200 system again, if only because I got quite fond of AC wireless and the increased SoC capability of my current MT7621 device (the VR9 can only "almost" handle a 100 Mbit line). But I would actually consider replacing my external modem with a bridged VR9 device because my current modem only has 100 Mbit Ethernet, and my downstream currently provides ~115 Mbit.

I guess I also qualify for testing, since I suffer from the same line deterioration problem. What kind of information, besides testing for a couple of days/weeks, do you need?

As a side note, your branch does not build with the testing kernel (5.10).

I think it is useful to monitor the change over time of the following parameters: DSM statistics (dsmstatg), downstream SNR margin (g997lsg 1 1) and downstream net data rate if the line uses SRA (g997csg 0 1).
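For anyone logging these values, here is a minimal Python sketch for turning the command output into numbers. It assumes the commands print space-separated key=value pairs and that the SNR field is given in units of 0.1 dB. The sample line is made up for illustration, so check it against what your device actually prints.

```python
def parse_pipe_output(line):
    """Split 'key=value key=value ...' output into a dict of strings."""
    fields = {}
    for token in line.split():
        key, sep, value = token.partition("=")
        if sep:  # ignore tokens that are not key=value pairs
            fields[key] = value
    return fields


def snr_margin_db(fields):
    """Assumption: the 'SNR' field is reported in units of 0.1 dB."""
    return int(fields["SNR"]) / 10.0


# Hypothetical sample line, not captured from a real device.
sample = "nReturn=0 nDirection=1 nDeltDataType=1 SNR=161 ATTNDR=120000000"

fields = parse_pipe_output(sample)
margin = snr_margin_db(fields)
```

Storing the parsed value together with a timestamp on every run is enough to see slow drifts in the SNR margin over days.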

I would strongly recommend using the ltq-vectoring-avm branch. The current version in that branch also sets a priority for the error reports.

I think using AVM's version of the vectoring driver makes more sense than trying to get the original version from Lantiq working without concurrency issues. (I am wondering if the current PTM driver is even correct without the vectoring driver, as the PPA driver uses locking in lots of places while the PTM doesn't use any locks at all.)

I am going to look into this.

(cough) I might have written something that could help you there, but it's been a while since I wrote this. I'm not entirely sure if it will still work with 21.02 or the current snapshot. I have been loosely following development on Lantiq VDSL stuff, and some things might have changed (I believe VDSL stats are now available via ubus or something; also, I'm not entirely sure rrdtool definitions are still using Lua).

I built this branch and so far, everything seems to run fine *knocking on wood*. With the lantiq-branch I had several random crashes/reboots.

I collect them every hour. Is this fine-grained enough?

AFAIK, this needs to be rewritten/updated for 21.02/master. When I updated from 19.04.x to 21.02-rc on another device, no data was collected any more.

Both branches are now updated to build with the testing kernel.

I only had such severe problems with the version that was briefly in the branch last night. But it's also possible that the concurrency issue just leads to different kinds of issues.

For monitoring line stability this should be enough.

Personally, I monitor the output of all useful dsl_cpe_pipe.sh commands every half hour. Currently, I am also monitoring dsmstatg every minute to check if there are any irregularities.

By the way, an interesting effect of AVM's driver is that you can capture the error reports using tcpdump (SNAP OUI 0x0019a7 and protocol ID 0x0003, this filter works: llc and ether[14:4]=0xaaaa0300 and ether[18:4]=0x19a70003). One thing that still needs fixing is the source MAC address, which is currently all-zeroes. It seems to work regardless, but the spec says to use the proper device MAC address.
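To spell out what that filter matches, here is a small Python sketch that applies the same byte comparisons to a raw Ethernet frame. The function name is made up, and the length check only approximates what the llc keyword does in tcpdump. Bytes 14 to 16 are the LLC header (DSAP 0xaa, SSAP 0xaa, control 0x03), bytes 17 to 19 the SNAP OUI, and bytes 20 to 21 the protocol ID.

```python
LLC_SNAP = bytes.fromhex("aaaa03")   # DSAP, SSAP, control
OUI = bytes.fromhex("0019a7")        # SNAP OUI from the filter
PROTOCOL_ID = bytes.fromhex("0003")  # SNAP protocol ID from the filter


def is_error_report(frame):
    """Return True if a raw Ethernet frame matches the tcpdump filter."""
    if len(frame) < 22:
        return False
    # 802.3 frames carry a length (<= 1500) instead of an EtherType.
    if int.from_bytes(frame[12:14], "big") > 1500:
        return False
    return (
        frame[14:17] == LLC_SNAP
        and frame[17:20] == OUI
        and frame[20:22] == PROTOCOL_ID
    )


# Hand-built example frame: broadcast destination, all-zero source MAC
# (as currently sent), length field, LLC/SNAP header, dummy payload.
example = (
    bytes.fromhex("ffffffffffff")
    + bytes(6)
    + (64).to_bytes(2, "big")
    + LLC_SNAP + OUI + PROTOCOL_ID
    + bytes(56)
)
matched = is_error_report(example)
```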

I thought so. I believe it is not that hard to collect values in 21.02, maybe even easier (since the root requirement may not apply anymore, what with the line stats exposed through ubus.)

Thank you. I just built it with the testing kernel, and it is working.

Pointer: I rewrote the scripts to collect Lantiq DSL values in collectd for 21.02. (Unfortunately, ltq-vdsl-app does not expose DSM statistics through ubus.)

Thanks. Already installed. I'll give you feedback in the other thread, if I encounter any problems.

I had an uptime of 7 days. The line had to be disconnected a day ago, since I switched from DTAG to o2.

During the 7 days, I had a stable SNR of 11.5 dB +/- 0.5 dB and the line did not suffer from deterioration. Let's see how it develops with my new ISP; since it is basically the same VPE, I doubt that there are any significant changes. But I'll keep you posted.

Are there any plans to get this work merged into OpenWrt?

Yes, I plan to get this merged eventually. However, there are a few reasons for not yet doing that:

  • The source MAC address is not yet set correctly.

  • Testing to make sure this actually works (results from @stonerl and myself look good so far; my own line has been synced for over 12 days now and the SNR margin is also absolutely stable).

  • I am still waiting on a license clarification from AVM. While the code must be licensed under GPLv2 (it is built into the kernel image in AVM firmware), this is not clearly marked in the source code. There is only the MODULE_LICENSE macro, and the header refers to a separate file named "LICENSE" which doesn't actually exist.

Quick question: if I wanted to join the testing fun, are there simple instructions somewhere on how I can quickly build my own instance from your sources? (I have built OpenWrt in the past, but only by leveraging @hnyman's really nice scripts, so I am a bit rusty with proper OpenWrt builds.)