Vectoring on Lantiq VRX200 / VR9 - missing callback for sending error samples

IMHO the nicest solution would be to simply package these as part of at least the xrx200 builds (or any lantiq builds with DSL functionality). They appear small enough not to matter much for size and are clearly quite useful...

Since neither collectd nor collectd-mod-exec are in the default packages (of course they aren't), I don't see why they would (or even should) include files building on those packages.

It's always an option to include the three (four) files when building one's own custom image (which is what I do using the ImageBuilder).

The modified rrdtool.js, however, could maybe go into luci-app-statistics. Interesting sidenote: One of the things I modified is the ability to (un)set the line width. The current rrdtool.js already has a "width" parameter for setting the line width, but it's not picked up from the configuration file. Someone must have forgotten.

Fair point.

That's what I did with my last build: dump the files/folders into ${openwrt_build_root}/files. But that does not help consumers of normal stable builds, who might naively expect something like your nice visualization tools to be there by default, no?

+1. If that means one less file for your GitHub project to carry, even better :wink:

Of course not, but that expectation will never be met. Even if my files were included the user would still have to install luci-app-statistics and collectd-mod-exec and configure the latter. At that point they would have to look up what to configure anyway, and from there it's just a minor inconvenience having to install the files themselves. I don't think there's a win to be had here.

I accept your position. Just for completeness: luci-app-statistics and collectd-mod-exec can easily be installed from the default repository via the GUI or CLI. Your scripts, however, require a bit more involvement from potential users and cannot be discovered by looking at the package sources :wink:.

I see your point, but that's not going to change. For a convenient "install package" experience, my package would have to go into the default repository, and I don't see that happening. I'm not trying to be difficult or save myself from work here: A major part of my scripts is the exec.js "configuration" for rrdtool/luci-app-statistics, and we don't have exclusive rights to that file.

Edit: It may be worth looking into luci-app-statistics; maybe it's possible to load not only a single exec.js, but multiple exec-<execplugin>.js files if available. As is, this is a general problem: If you have multiple exec scripts, you still have only one exec.js that has to visualize all data from all exec plugins. I believe it's only because almost no one ever uses multiple exec scripts that this hasn't led to problems before.
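For anyone wondering how several exec scripts end up in the same place: a script run by collectd-mod-exec reports values by printing PUTVAL lines in collectd's plain-text protocol. Here is a minimal sketch of how such a line is put together. The host name, the instance name "dsl" and the value name "snr_down" are made up for illustration, and the sketch assumes the scripts report under the plugin name "exec" (which is why they all land on one exec.js).

```python
def putval_line(host, plugin_instance, type_name, type_instance, value, interval=10):
    """Format one PUTVAL line of collectd's plain-text protocol.

    Identifier layout: host/plugin-plugin_instance/type-type_instance.
    The plugin name "exec" is an assumption made for this sketch.
    """
    identifier = f"{host}/exec-{plugin_instance}/{type_name}-{type_instance}"
    # "N" tells collectd to use the current time as the timestamp.
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'


if __name__ == "__main__":
    # Hypothetical names; "gauge" is a generic type from collectd's types.db.
    print(putval_line("router", "dsl", "gauge", "snr_down", 7.1))
```

Two different scripts would only differ in the plugin instance part ("dsl" here), which is the piece a per-script exec-<execplugin>.js could key on.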

For this to be done in an upstreamable way, one would have to create a collectd plugin. And while that's probably not too difficult, I'm not up to that task, my language proficiency in C/C++ is entirely passive.

Here is another sample of data:

Model	AVM FRITZ!Box 3370 Rev. 2 (Micron NAND)
Architecture	xRX200 rev 1.2
Target Platform	lantiq/xrx200
Firmware Version	OpenWrt SNAPSHOT r19454-8084ec8061 / LuCI Master git-22.089.43958-7110635
Kernel Version	5.10.110
Line State: Showtime with TC-Layer sync
Line Mode: G.993.2 (VDSL2, Profile 17a, with down- and upstream vectoring)
Line Uptime: 20d 4h 28m 13s
Annex: B
Data Rate: 105.400 Mb/s / 36.999 Mb/s
Max. Attainable Data Rate (ATTNDR): 110.159 Mb/s / 43.181 Mb/s
Latency: 0.15 ms / 0.00 ms
Line Attenuation (LATN): 10.9 dB / 10.9 dB
Signal Attenuation (SATN): 10.9 dB / 10.8 dB
Noise Margin (SNR): 7.1 dB / 7.9 dB
Aggregate Transmit Power (ACTATP): -1.5 dB / 13.1 dB
Forward Error Correction Seconds (FECS): 0 / 13295
Errored seconds (ES): 8 / 117
Severely Errored Seconds (SES): 2 / 59
Loss of Signal Seconds (LOSS): 1 / 0
Unavailable Seconds (UAS): 140 / 140
Header Error Code Errors (HEC): 0 / 0
Non Pre-emptive CRC errors (CRC_P): 0 / 0
Pre-emptive CRC errors (CRCP_P): 0 / 0
ATU-C System Vendor ID: Broadcom 194.127
Power Management Mode: L0 - Synchronized

Not sure what all those numbers mean, but I hope it's useful.
I'm using the blob vr9-B-dsl.bin with a sha1sum of 539162673a99a05d8343da06bebe7e6c356f0a7d and the bare development snapshot (no patches applied).
The ISP is 1&1 and I'm in Germany.
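In case someone wants to compare such status dumps side by side, here is a rough sketch of a parser for the "label: value / value" lines. It only assumes the layout visible in the dumps in this thread (two values separated by " / "); which of the two values belongs to which direction or line end is deliberately left open, and the function name is made up.

```python
import re

# Matches the first integer or decimal number in a value such as "7.1 dB".
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_pair(line):
    """Return (label, first_value, second_value) or None for other lines."""
    label, sep, rest = line.partition(":")
    # Split on " / " with spaces so that units like "Mb/s" stay intact.
    parts = rest.split(" / ")
    if not sep or len(parts) != 2:
        return None
    numbers = [NUMBER.search(part) for part in parts]
    if not all(numbers):
        return None
    return label.strip(), float(numbers[0].group()), float(numbers[1].group())


if __name__ == "__main__":
    for text in ("Noise Margin (SNR): 7.1 dB / 7.9 dB",
                 "Data Rate: 105.400 Mb/s / 36.999 Mb/s",
                 "Line State: Showtime with TC-Layer sync"):
        print(parse_pair(text))
```

Lines without a value pair (line state, line mode, vendor ID and so on) simply come back as None.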

Same here, absolutely stable for 68 days. :slightly_smiling_face:
DSL Firmware 5.9.1.4.0.7-5.9.0.D.0.2

Model	AVM FRITZ!Box 3370 Rev. 2 (Micron NAND)
Architecture	xRX200 rev 1.2
Target Platform	lantiq/xrx200
Firmware Version	OpenWrt SNAPSHOT r19034+4-ba6a48366f / LuCI Master git-22.058.70382-d29400e
Kernel Version	5.10.100
Line State:Showtime with TC-Layer sync
Line Mode:G.993.2 (VDSL2, Profile 17a, with down- and upstream vectoring)
Line Uptime:68d 17h 12m 42s
Annex:B
Data Rate:63.679 Mb/s / 12.736 Mb/s
Max. Attainable Data Rate (ATTNDR):124.273 Mb/s / 38.710 Mb/s
Latency:0.13 ms / 0.00 ms
Line Attenuation (LATN):11.8 dB / 9.8 dB
Signal Attenuation (SATN):11.8 dB / 9.9 dB
Noise Margin (SNR):23.2 dB / 23.1 dB
Aggregate Transmit Power (ACTATP):-2.4 dB / 13.4 dB
Forward Error Correction Seconds (FECS):0 / 151
Errored seconds (ES):0 / 91
Severely Errored Seconds (SES):0 / 40
Loss of Signal Seconds (LOSS):6 / 0
Unavailable Seconds (UAS):396 / 396
Header Error Code Errors (HEC):0 / 0
Non Pre-emptive CRC errors (CRC_P):0 / 0
Pre-emptive CRC errors (CRCP_P):0 / 0
ATU-C System Vendor ID:Broadcom 194.127
Power Management Mode:L0 - Synchronized

I have been running OpenWrt on a ZyXEL P-2812HNU-F1 for some weeks now, using snapshot OpenWrt SNAPSHOT r19597-65258f5d60 / LuCI Master git-22.123.46500-111c551 and firmware version 5.7.11.5.0.7.

Although DSL transmission (SNR, bitrates) looks very stable, I have the issue that after some days all data transport suddenly stops.

DSL Status
Line State:Showtime with TC-Layer sync
Line Mode:G.993.2 (VDSL2, Profile 17a, with down- and upstream vectoring)
Line Uptime:4h 31m 47s
Annex:B
Data Rate:106.449 Mb/s / 32.952 Mb/s
Max. Attainable Data Rate (ATTNDR):106.378 Mb/s / 33.376 Mb/s
Latency:0.13 ms / 0.00 ms
Line Attenuation (LATN):14.2 dB / 15.0 dB
Signal Attenuation (SATN):14.3 dB / 14.9 dB
Noise Margin (SNR):5.3 dB / 5.0 dB
Aggregate Transmit Power (ACTATP):11.8 dB / 6.4 dB
Forward Error Correction Seconds (FECS):0 / 6745305
Errored seconds (ES):70575 / 48
Severely Errored Seconds (SES):16 / 22
Loss of Signal Seconds (LOSS):0 / 0
Unavailable Seconds (UAS):597 / 597
Header Error Code Errors (HEC):0 / 0
Non Pre-emptive CRC errors (CRC_P):14413060 / 0
Pre-emptive CRC errors (CRCP_P):0 / 0
ATU-C System Vendor ID:Broadcom 178.30
Power Management Mode:L0 - Synchronized

The interrupts are also a bit too quiet:

Restarting the DSL drivers (/etc/init.d/dsl_control restart) will bring everything back.

Any idea what might happen here and how to solve this?

I don't have any technical insight to contribute, but what I find interesting:

Your line seems to dip below the minimum SNR of 6 dB on a regular basis, and judging by your graph never really manages to get back up to 6 dB afterwards.

That is a metric ton of line errors. I am not sure if I would call that "very stable", it looks more like your line is barely hanging on by a thread.

And I mean "hanging on by a single thread" quite literally: I have seen these SNR and error numbers before, on what later turned out to be a DSL line running on a single(!) wire (the other was mistakenly pulled from the in-house panel when another tenant's line was connected). The Lantiq modem is shockingly error tolerant, it could establish and maintain a connection on the obviously faulty line when the technician's multi-thousand-euro diagnostic modem could not.

OK, let me check the cabling. Thnx for your initial feedback.

To be perfectly honest: With these numbers, after making sure that my local wiring is not at fault (i.e. doesn't improve with a different cable), I would open a ticket with my ISP. That line is not well.

Checked and replaced the cable (just in case), which didn't change anything. But thinking this over, the 5.3 dB Noise Margin really is the margin between required and actual SNR (not the absolute SNR), unless the name is wrong. With a bad signal I would have expected many retrains, and that's something I don't see happening so far.

In a perfect world with no noise (and frankly no resistance/impedance), an SNR margin of 0 would potentially work. In the real world, all ISPs have noted that halfway reliable operation requires at the very least 3 dB, and often a bit more. So they enforce an SNR margin at sync, so that the link has enough reserves to deal with the expected level of noise.
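To put numbers on that, here is a small sketch of the arithmetic. The margin is simply the actual SNR minus the SNR required for the current bit loading, and the check compares a reported margin against a target. The 6 dB default is the minimum mentioned earlier in the thread; the actual target for a given line is set by the ISP, so treat it as an assumption, and the function names are made up.

```python
TARGET_MARGIN_DB = 6.0  # assumption: the 6 dB minimum mentioned in this thread


def snr_margin(actual_snr_db, required_snr_db):
    """Margin = what the line delivers minus what the bit loading needs."""
    return actual_snr_db - required_snr_db


def margin_shortfall(reported_margin_db, target_db=TARGET_MARGIN_DB):
    """How many dB the reported margin is below the target (0.0 if met)."""
    return max(0.0, round(target_db - reported_margin_db, 1))


if __name__ == "__main__":
    # Reported noise margins from the status dumps posted in this thread.
    for margin in (5.3, 5.0, 23.2):
        print(margin, "dB reported, short by", margin_shortfall(margin), "dB")
```

A line reporting 5.3 dB is therefore not at an SNR of 5.3 dB; it just has less reserve than the target it was supposed to sync with.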

Very large counter values from the remote end can also mean that the remote end DSLAM has not been reset/rebooted in ages, no?

In your graphs the DSL metrics continue to be reported while the issue occurs. So it looks like the modem, its drivers and the userspace application are working the entire time. (It is a bit weird though that the upstream stats are sometimes missing.)

As far as I can see from the interrupt graph, the rate of the mei_cpe interrupt (63) stays roughly the same the entire time (it is hard to see though with the stacked graph). Only the PTM (96) and Ethernet (72/73) interrupts drop to nearly zero. This doesn't necessarily point to the modem as the issue; you would generally see the same thing whenever no data is transferred for any reason.
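Since a stacked graph makes this hard to read, a rough alternative is to take two snapshots of /proc/interrupts and compare per-IRQ rates directly. The sketch below parses the text of such a snapshot and sums the per-CPU columns; the sample text in it is invented for illustration (only the IRQ numbers 63 and 96 and the name mei_cpe come from this thread), and the function names are hypothetical.

```python
def irq_counts(text):
    """Map IRQ number -> total count from /proc/interrupts-style text."""
    counts = {}
    for line in text.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # header line or named rows such as "ERR:"
        total = 0
        for field in rest.split():
            if not field.isdigit():
                break  # per-CPU counters end where the chip name starts
            total += int(field)
        counts[int(head)] = total
    return counts


def irq_rates(before, after, seconds):
    """Interrupts per second for every IRQ present in both snapshots."""
    return {irq: (after[irq] - before[irq]) / seconds
            for irq in after if irq in before}


if __name__ == "__main__":
    # Invented sample data, 60 seconds apart.
    first = "           CPU0\n 63:      1200  icu  mei_cpe\n 96:        40  icu  ptm\n"
    second = "           CPU0\n 63:      1800  icu  mei_cpe\n 96:        40  icu  ptm\n"
    print(irq_rates(irq_counts(first), irq_counts(second), 60))
```

A steady rate on 63 combined with a rate of zero on 96 would match what the graph seems to show.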

Have you made sure that this is actually a problem with the modem itself? Maybe there is some other issue that just happens to go away with a reconnection (due to the dsl0 interface link going down and up again).

To try to narrow down possible causes you could check if restarting the control daemon is actually necessary to fix the connection. For example, try reloading the PTM driver (rmmod ltq-vr9-ptm and modprobe ltq-vr9-ptm) or restarting the DSL connection via dsl_cpe_pipe.sh acs 2 instead.

Regarding the error counters, it probably makes sense to look at when these errors occur (include the error counters in your monitoring if not already the case).
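The counters in the status output are cumulative, so what matters for "when" is the increase between two samples. A minimal sketch of that, assuming you already have (timestamp, counter) samples from your monitoring; the function name and the reset handling are my own choices, not anything the DSL tools provide.

```python
def counter_increments(samples):
    """Turn cumulative counter samples into per-interval increments.

    samples: list of (timestamp_seconds, counter_value) in time order.
    Returns a list of (timestamp_of_later_sample, increment).
    """
    increments = []
    for (_, previous), (timestamp, current) in zip(samples, samples[1:]):
        # A counter that goes backwards was most likely reset in between
        # (assumption), so count from zero for that interval.
        delta = current - previous if current >= previous else current
        increments.append((timestamp, delta))
    return increments


if __name__ == "__main__":
    # Invented samples, one per minute.
    print(counter_increments([(0, 100), (60, 100), (120, 160), (180, 5)]))
```

Plotting these increments next to the times when traffic stops should show whether the errors cluster around the outages or accumulate evenly.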

Just another thing ...

... is a pretty old firmware. I am aware that it was common practice to try different firmware versions in the hope of more stability before @janh found the actual root cause, but from my own experience lots of the older firmware versions are unstable and/or unnecessarily aggressive. Now that the actual problem behind the Vectoring instability has been fixed, the current best practice seems to be to get the latest available version, currently 5.9.1.4.0.7-5.9.0.D.0.2. Maybe you want to try a different/the newest firmware, just to rule out that potential cause?

Sure, let's try that as well. Thnx.

OK, done changing for today:

  • switched from the Zyxel to a Fritzbox 7360v2 I still had around, flashed it with the latest OpenWrt master (of today)
  • switched firmware to 5.9.1.4.0.7-5.9.0.D.0.2 (guess I need the B version, never considered using an A version)

So far stable, DS SNR margin increased to 6 dB. Let's see what happens in the coming hours and days.

Hi @janh, thnx for your support. As stated below, I changed to a Fritzbox to explore whether that gives the same result. Regarding reloading the PTM driver, I did that once and immediately lost the Ethernet link, so checking anything after that became kind of impossible (reboot required). What I want to check when it happens again is whether the sending of error samples continues (which is kind of expected) or whether this stops as well.

I thought I remembered that reloading the PTM driver worked without any issues. But testing again, it seems to break the DSL connection when the driver is loaded again. After that, no packets are received over DSL anymore (sending works). This seems to be fixed only by unloading and reloading the CPE API and MEI drivers (the PTM driver will then be loaded again automatically during the initial connection).

Unfortunately, it is not really possible to know for sure if the error samples are actually transmitted without having access to the other end of the line. Of course, one can look at the counters and check using tcpdump if the packets are passed to the PTM driver. But in this case I'm not sure if the results from that should be trusted.