Vectoring on Lantiq VRX200 / VR9 - missing callback for sending error samples

I noticed that usage of mei_dsm_cb_func_hook is patched out in the Lantiq VDSL MEI driver in OpenWrt:
https://git.openwrt.org/?p=openwrt/openwrt.git;a=blob;f=package/kernel/lantiq/ltq-vdsl-mei/patches/100-compat.patch;h=75e1500171a748c7b616758e635e0a2ded5871e2;hb=HEAD#l385

This is a callback function that would be provided by another kernel module, which is currently missing from OpenWrt. The purpose of this callback is to send error samples to the VCE (Vectoring Control Entity).

Without those error samples, vectoring obviously can't work properly, so that would explain reports about instability with vectoring.

Copies of the missing driver are available at https://github.com/brunompena/dwr-966/tree/master/target/linux/lantiq/files/drivers/net/lantiq_ppa/vectoring and https://gitlab.com/gplmirror/fritzbox-7560-v1/-/tree/master/source-files-FRITZ.Box_7560-06.51/GPL-release_kernel/linux-3.10/drivers/net/lantiq_ppa/vectoring

It looks like the callback in the driver just sends the data on ptm0 (this would be dsl0 on OpenWrt).

Looking at the vectoring specification (ITU-T G.993.5), this implements the L2 Ethernet encapsulation of the backchannel. Alternatively, the error samples could also be transmitted via the eoc (embedded operations channel). I don't know where that is handled, but I suspect it is done entirely within the DSL firmware. In any case, the actual encapsulation to be used is selected by the VCE, so both methods need to be supported.
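
To make this more concrete, the missing module roughly has to look like the sketch below. The hook prototype, the buffer layout (length in the first 32-bit word) and the placeholder destination address are just my reading of the driver sources and of G.993.5, not verified code, so take it as an illustration only:

#include <linux/module.h>
#include <linux/netdevice.h>
#include <linux/skbuff.h>
#include <linux/etherdevice.h>
#include <linux/string.h>

/* Hook exported by the MEI driver; this prototype is how I read the driver
   sources, so treat it as an assumption. */
extern int (*mei_dsm_cb_func_hook)(unsigned int *p_error_vector);

static struct net_device *g_ptm_net_dev;  /* "dsl0" on OpenWrt, "ptm0" elsewhere */

static int mei_dsm_cb_func(unsigned int *p_error_vector)
{
    /* LLC header AA AA 03, SNAP OUI 00:19:A7 (ITU-T), protocol ID 0x0003 */
    static const u8 llc_snap[8] = { 0xaa, 0xaa, 0x03, 0x00, 0x19, 0xa7, 0x00, 0x03 };
    unsigned int len = p_error_vector[0];  /* assumption: first word = payload length */
    struct sk_buff *skb;
    struct ethhdr *eth;

    if (!g_ptm_net_dev)
        return -ENODEV;

    skb = netdev_alloc_skb(g_ptm_net_dev, ETH_HLEN + sizeof(llc_snap) + len);
    if (!skb)
        return -ENOMEM;

    eth = (struct ethhdr *)skb_put(skb, ETH_HLEN);
    eth_broadcast_addr(eth->h_dest);               /* placeholder destination */
    memcpy(eth->h_source, g_ptm_net_dev->dev_addr, ETH_ALEN);
    eth->h_proto = htons(sizeof(llc_snap) + len);  /* 802.3 length field */

    memcpy(skb_put(skb, sizeof(llc_snap)), llc_snap, sizeof(llc_snap));
    memcpy(skb_put(skb, len), &p_error_vector[1], len);

    /* hand the frame to the PTM interface */
    skb->dev = g_ptm_net_dev;
    dev_queue_xmit(skb);

    /* signal the firmware that the buffer has been processed */
    p_error_vector[0] = 0;
    return 0;
}

static int __init ltq_vectoring_init(void)
{
    mei_dsm_cb_func_hook = mei_dsm_cb_func;  /* register with the MEI driver */
    return 0;
}
module_init(ltq_vectoring_init);
MODULE_LICENSE("GPL");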

If anyone here is using a VR9 device with OpenWrt on a vectoring line, the output of dsl_cpe_pipe.sh dsmstatg would be interesting (and also dsl_cpe_pipe.sh dsmsg to verify that vectoring is enabled). If the value n_mei_dropped_no_pp_cb is non-zero, this means that error samples were dropped due to the missing callback.

I will probably test this myself soon, but I will have to use my actual DSL connection for that, and want to limit unnecessary interruptions of the line (I am not using an OpenWrt modem normally).

5 Likes

Same here, I'm not using an OpenWrt VR9 internal modem anymore because I was tired of the constant line deterioration. As much as I would like to help out, I really don't want to rock the boat at the moment, since I am getting excellent and reliable line quality and stability from the external Broadcom-based Zyxel VMG1312. However, I have previously seen this slow but steady deterioration with VR9 modems on non-vectoring (ADSL!) lines as well. Is this backchannel also used in other profiles?

3 Likes

I haven't actually used an OpenWrt modem myself so far (well, the stock firmware on my current Broadcom modem is OpenWrt-based, but that doesn't count). But I recently got a supported device for another reason and happened to stumble upon this patch. Since I had read about these issues before, I decided to look into it a bit more.

I don't think this is used for anything else (this is basically using the normal data channel for management traffic). This code in the driver is definitely specific to vectoring error samples.

root@Router:~# dsl_cpe_pipe.sh dsmstatg
nReturn=0 n_processed=0 n_fw_dropped_size=1782447 n_mei_dropped_size=0 n_mei_dropped_no_pp_cb=9670 n_pp_dropped=0

root@Router:~# dsl_cpe_pipe.sh dsmsg
nReturn=0 eVectorStatus=2 eVectorFriendlyStatus=0

That is at an uptime of 6 days, with 170 GB down and 150 GB up.

1 Like

Thanks! As expected, this shows that no error samples were successfully sent (n_processed is 0), and that some were dropped due to the missing callback (n_mei_dropped_no_pp_cb is non-zero).

The value n_fw_dropped_size (number of error vectors that were dropped in the firmware) is very high, but that could be a result of the missing callback (the callback sets the first 4 bytes of the buffer to 0 to signal the firmware that the data was processed).

If anyone is feeling adventurous, they can try this branch: https://github.com/janh/openwrt/commits/ltq-vectoring

The driver builds and it doesn't seem to cause any adverse effects when used with a non-vectored line. But I haven't tested it on an actual vectored line so far.

I have now connected the modem to my DSL line. The result without the vectoring driver is as expected (this is after 30 minutes):

# dsl_cpe_pipe.sh DSM_STATisticsGet
nReturn=0 n_processed=0 n_fw_dropped_size=6112 n_mei_dropped_size=0 n_mei_dropped_no_pp_cb=6145 n_pp_dropped=0

# dsl_cpe_pipe.sh DSM_StatusGet
nReturn=0 eVectorStatus=2 eVectorFriendlyStatus=0

With the driver this changes to:

# dsl_cpe_pipe.sh DSM_STATisticsGet
nReturn=0 n_processed=0 n_fw_dropped_size=183 n_mei_dropped_size=0 n_mei_dropped_no_pp_cb=0 n_pp_dropped=184

So, that's a partial success: The callback is called, but it does not successfully send the data. After enabling error output in the driver with echo enable err > /proc/vectoring, dmesg shows the following:
target-mips_24kc_musl/linux-lantiq_xrx200/ltq-vectoring/ifxmips_vectoring.c:149:mei_dsm_cb_func: g_ptm_net_dev == NULL

Now the question is why the driver does not detect the network device.

Edit: I think I found the issue. Linux 3.11 changed how the netdev can be accessed in the event handler. The updated version is on GitHub, but still untested, as I need to interrupt the connection for that.
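
For reference, the change in question: since 3.11 the netdevice notifier hands the handler a netdev_notifier_info instead of the net_device itself, so the device has to be extracted roughly like this (simplified, not the exact code in the branch):

static int vectoring_netdev_event(struct notifier_block *nb,
                                  unsigned long event, void *ptr)
{
    /* since Linux 3.11, ptr points to a struct netdev_notifier_info,
       not to the struct net_device itself */
    struct net_device *dev = netdev_notifier_info_to_dev(ptr);

    if (event == NETDEV_REGISTER && !strcmp(dev->name, "dsl0"))
        g_ptm_net_dev = dev;          /* PTM interface found (dsl0 on OpenWrt) */
    else if (event == NETDEV_UNREGISTER && dev == g_ptm_net_dev)
        g_ptm_net_dev = NULL;

    return NOTIFY_OK;
}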

1 Like

It seems to work! This is after 4 hours:

# dsl_cpe_pipe.sh DSM_STATisticsGet
nReturn=0 n_processed=19884 n_fw_dropped_size=0 n_mei_dropped_size=0 n_mei_dropped_no_pp_cb=0 n_pp_dropped=0

Of course, this doesn't tell us whether the error samples are actually received at the other end. The only realistic way to test this is probably to do a long-term test and monitor stability.

However, my line is not well suited for that, as it is very short and stability issues are likely to be masked by that (the current downstream SNR margin is 16 dB). Also, the line normally uses profile 35b, and with the VR9 modem it can only run in a fallback mode at a significantly reduced data rate.

If anyone wants to try this out, the current code is available from the GitHub repository linked above.

1 Like

Is it only about line stability? In that case I can't test it, I'm afraid. My line is rock stable. In recent years, my line uptime has been equal to the time between two OpenWrt releases.

This is very unlikely to decrease stability. But if your line is already entirely stable, it won't improve anything, either. So it is understandable if you don't want to try it.

It would make the most sense to try this on a vectored line where a VR9-based modem running OpenWrt is currently unstable (where unstable could also mean a resync just every few days), while stock firmware or other modems are working fine.

I guess my link would qualify: a VDSL2 100/40 link with known stability issues ever since the switch from plain VDSL2 50/10 to VDSL2-Vectoring+G.INP 100/40. I am just not sure when I can inflict this on my user base/family (my holiday is just over) ...

2 Likes

As a note to anyone who wants to try this: There seems to be a bug in the DSA switch driver that causes packets larger than 1496 bytes to be dropped when VLAN is used (FS#3990). This can be worked around by decreasing the MTU by 4 bytes (or by patching the driver).

I also noticed that after a few hours (between 8 and 24 so far), a small fraction (<5%) of packets transmitted over the DSL interface gets delayed by up to 4 seconds. I am currently trying to find out what causes this, and whether it could be related to my changes.

1 Like

The latency issue is actually related to the vectoring driver. The problem seems to be that the vectoring driver directly calls the ndo_start_xmit method of the PTM driver to send the data. I think the reason this works on non-OpenWrt firmware is that they use the PPA driver instead of the PTM driver, and there the calculation of the TX descriptor is protected by a lock.

I found out that AVM's current version of the vectoring driver does not call the ndo_start_xmit method, but instead queues the data normally using dev_queue_xmit. It is available from https://osp.avm.de/fritzbox/fritzbox-7430/source-files-FRITZ.Box_7430-07.27.tar.gz (the vectoring driver is in "GPL/GPL-kernel.tar.gz" under "linux/drivers/net/avm_cpmac/switch/ifx/vectoring/").
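
To illustrate the difference (simplified, not the exact code from either driver):

/* what the Lantiq driver effectively does: call straight into the PTM driver,
   bypassing the qdisc layer and any transmit locking in the stack */
g_ptm_net_dev->netdev_ops->ndo_start_xmit(skb, g_ptm_net_dev);

/* what AVM's driver does instead: queue the skb on the normal transmit path,
   which serializes access to the PTM driver */
skb->dev = g_ptm_net_dev;
dev_queue_xmit(skb);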

I have now been using that driver for more than 24 hours, and it seems to work. The only thing I noticed is that the n_processed parameter increases faster while the upstream is saturated. I think this is because some error reports get dropped due to missing prioritization, and as a result the VCE requests more to be sent.

As I have now done tests for some time both with and without the vectoring driver, I think it actually fixes downstream vectoring. During 3 days without the vectoring driver, the downstream SNR margin dropped by 1.3 dB (from 15.9 dB). While this is not that much, it has so far always stayed within 16-16.4 dB with the vectoring driver (except for the first 5 minutes after synchronization, where it is often a bit lower). Also, most of the decrease happened one night between 2 and 3 AM, which is when ASSIA/DLM usually reconfigures lines. And the connection/disconnection of other lines is exactly when error reports are needed the most.

I really hope this works out. The proof is in a longer uptime, though. I distinctly remember that, when I had my vectoring line on a FritzBox 3370 (VR9), my line wasn't completely craptastic; it sometimes took more than two or three weeks to deteriorate significantly. The real difference is that it never recovered from a downgrade such as the one you describe, and it ended up slowly but steadily creeping down to ~6 dB SNR.

Truth be told, I will probably never switch back to an XRX200 system, if only because I have grown quite fond of AC wireless and the increased SoC capability of my current MT7621 device (the VR9 can only "almost" handle a 100 Mbit line). But I would actually consider replacing my external modem with a bridged VR9 device, because my current modem only has 100 Mbit Ethernet and my downstream currently provides ~115 Mbit.

1 Like

I guess I also qualify for testing, since I suffer from the same line deterioration problem. What kind of information, besides testing for a couple of days/weeks, do you need?

As a side note, your branch does not build with the testing kernel (5.10).

I think it is useful to monitor the change over time of the following parameters: DSM statistics (dsmstatg), downstream SNR margin (g997lsg 1 1) and downstream net data rate if the line uses SRA (g997csg 0 1).

I would strongly recommend using the ltq-vectoring-avm branch. The current version in that branch also sets a priority for the error reports.
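
(By "sets a priority" I mean marking the skb before it is queued, roughly as shown below; the exact priority value used in the branch may differ, TC_PRIO_CONTROL is just for illustration:)

/* mark the error report before it is queued so it is less likely to be
   dropped behind bulk traffic while the upstream is saturated
   (TC_PRIO_CONTROL is from linux/pkt_sched.h; the value actually used in
   the branch may differ) */
skb->priority = TC_PRIO_CONTROL;
dev_queue_xmit(skb);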

I think using AVM's version of the vectoring driver makes more sense than trying to get the original version from Lantiq working without concurrency issues. (I am wondering whether the current PTM driver is even correct without the vectoring driver, as the PPA driver uses locking in lots of places while the PTM driver doesn't use any locks at all.)

I am going to look into this.

(cough) I might have written something that could help you there, but it's been a while since I wrote it. I'm not entirely sure whether it will still work with 21.02 or the current snapshot. I have only been loosely following development of the Lantiq VDSL stuff, and some things might have changed (I believe VDSL stats are now available via ubus or something, and I'm also not entirely sure whether the rrdtool definitions still use Lua).

I built this branch and so far everything seems to run fine *knocking on wood*. With the Lantiq branch I had several random crashes/reboots.

I collect them every hour. Is this fine-grained enough?

AFAIK, this needs to be rewritten/updated for 21.02/master. When I updated from 19.07.x to 21.02-rc on another device, no data was collected anymore.

Both branches are now updated to build with the testing kernel.

I only had such severe problems with the version that was briefly in the branch last night. But it's also possible that the concurrency issue just leads to different kinds of issues.

For monitoring line stability this should be enough.

Personally, I monitor the output of all useful dsl_cpe_pipe.sh commands every half hour. Currently, I am also monitoring dsmstatg every minute to check if there are any irregularities.

By the way, an interesting effect of AVM's driver is that you can capture the error reports using tcpdump (SNAP OUI 0x0019a7 and protocol ID 0x0003; this filter works: llc and ether[14:4]=0xaaaa0300 and ether[18:4]=0x19a70003). One thing that still needs fixing is the source MAC address, which is currently all zeros. It seems to work regardless, but the spec says to use the proper device MAC address.
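
(The fix should just be a matter of copying the interface's own address into the header before the frame is sent, something along these lines; whether the header is accessed exactly like this in the driver is an assumption on my part:)

/* fill in the PTM interface's MAC address as the source instead of leaving it
   all zeros; assumes the Ethernet header sits at the start of the skb data,
   as it does right after the frame has been built */
struct ethhdr *eth = (struct ethhdr *)skb->data;
memcpy(eth->h_source, g_ptm_net_dev->dev_addr, ETH_ALEN);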