Vectoring on Lantiq VRX200 / VR9 - missing callback for sending error samples

Same here, absolutely stable for 68 days. :slightly_smiling_face:
DSL Firmware 5.9.1.4.0.7-5.9.0.D.0.2

Model	AVM FRITZ!Box 3370 Rev. 2 (Micron NAND)
Architecture	xRX200 rev 1.2
Target Platform	lantiq/xrx200
Firmware Version	OpenWrt SNAPSHOT r19034+4-ba6a48366f / LuCI Master git-22.058.70382-d29400e
Kernel Version	5.10.100
Line State:Showtime with TC-Layer sync
Line Mode:G.993.2 (VDSL2, Profile 17a, with down- and upstream vectoring)
Line Uptime:68d 17h 12m 42s
Annex:B
Data Rate:63.679 Mb/s / 12.736 Mb/s
Max. Attainable Data Rate (ATTNDR):124.273 Mb/s / 38.710 Mb/s
Latency:0.13 ms / 0.00 ms
Line Attenuation (LATN):11.8 dB / 9.8 dB
Signal Attenuation (SATN):11.8 dB / 9.9 dB
Noise Margin (SNR):23.2 dB / 23.1 dB
Aggregate Transmit Power (ACTATP):-2.4 dB / 13.4 dB
Forward Error Correction Seconds (FECS):0 / 151
Errored seconds (ES):0 / 91
Severely Errored Seconds (SES):0 / 40
Loss of Signal Seconds (LOSS):6 / 0
Unavailable Seconds (UAS):396 / 396
Header Error Code Errors (HEC):0 / 0
Non Pre-emptive CRC errors (CRC_P):0 / 0
Pre-emptive CRC errors (CRCP_P):0 / 0
ATU-C System Vendor ID:Broadcom 194.127
Power Management Mode:L0 - Synchronized

I have been running OpenWrt on a ZyXEL P-2812HNU-F1 for some weeks now, using snapshot OpenWrt SNAPSHOT r19597-65258f5d60 / LuCI Master git-22.123.46500-111c551 and firmware version 5.7.11.5.0.7.

Although the DSL transmission (SNR, bitrates) looks very stable, I have the issue that after some days all traffic suddenly stops.

DSL Status
Line State:Showtime with TC-Layer sync
Line Mode:G.993.2 (VDSL2, Profile 17a, with down- and upstream vectoring)
Line Uptime:4h 31m 47s
Annex:B
Data Rate:106.449 Mb/s / 32.952 Mb/s
Max. Attainable Data Rate (ATTNDR):106.378 Mb/s / 33.376 Mb/s
Latency:0.13 ms / 0.00 ms
Line Attenuation (LATN):14.2 dB / 15.0 dB
Signal Attenuation (SATN):14.3 dB / 14.9 dB
Noise Margin (SNR):5.3 dB / 5.0 dB
Aggregate Transmit Power (ACTATP):11.8 dB / 6.4 dB
Forward Error Correction Seconds (FECS):0 / 6745305
Errored seconds (ES):70575 / 48
Severely Errored Seconds (SES):16 / 22
Loss of Signal Seconds (LOSS):0 / 0
Unavailable Seconds (UAS):597 / 597
Header Error Code Errors (HEC):0 / 0
Non Pre-emptive CRC errors (CRC_P):14413060 / 0
Pre-emptive CRC errors (CRCP_P):0 / 0
ATU-C System Vendor ID:Broadcom 178.30
Power Management Mode:L0 - Synchronized

Interrupts as well are a bit too silent:

Restarting the DSL drivers (/etc/init.d/dsl_control restart) brings everything back.

Any idea what might happen here and how to solve this?

I don't have any technical insight to contribute, but what I find interesting:

Your line seems to dip below the minimum SNR margin of 6 dB on a regular basis, and judging by your graph it never really manages to get back up to 6 dB afterwards.

That is a metric ton of line errors. I am not sure if I would call that "very stable", it looks more like your line is barely hanging on by a thread.

And I mean "hanging on by a single thread" quite literally: I have seen these SNR and error numbers before, on what later turned out to be a DSL line running on a single(!) wire (the other was mistakenly pulled from the in-house panel when another tenant's line was connected). The Lantiq modem is shockingly error tolerant, it could establish and maintain a connection on the obviously faulty line when the technician's multi-thousand-euro diagnostic modem could not.

OK, let me check the cabling. Thanks for your initial feedback.

To be perfectly honest: With these numbers, after making sure that my local wiring is not at fault (i.e. doesn't improve with a different cable), I would open a ticket with my ISP. That line is not well.

I checked and replaced the cable (just in case), which didn't change anything. But thinking this over, the 5.3 dB Noise Margin really is the margin between required and actual SNR (not an absolute SNR), unless that name is wrong. With a bad signal I would have expected many retrains, and that's something I don't see happening so far.

In a perfect world with no noise (and frankly no resistance/impedance) an SNR margin of 0 would potentially work. In the real world, all ISPs have found that halfway reliable operation requires at the very least 3 dB, and often a bit more. So they enforce an SNR margin at sync, so that the link has enough reserves to deal with the expected level of noise.
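To make the margin argument concrete, here is a minimal sketch that classifies a reported noise margin against a target. The function name `snr_margin_status` is made up for illustration, and the 6 dB default is simply the figure mentioned in this thread, not a value read from the DSLAM.

```shell
#!/bin/sh
# Classify a reported noise margin (dB) against a target margin (dB).
# The 6 dB default is an assumption taken from the discussion above.
snr_margin_status() {
    margin="$1"
    target="${2:-6}"
    # awk does the decimal comparison, which plain sh arithmetic cannot.
    awk -v m="$margin" -v t="$target" 'BEGIN {
        if (m + 0 < t + 0) print "below"; else print "ok"
    }'
}

# Example: snr_margin_status 5.3   prints "below"
#          snr_margin_status 23.2  prints "ok"
```

With the two status pastes above, the first line (23.2 dB) would come out as "ok" and the second (5.3 dB) as "below".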

Very large counter values from the remote end can also mean that the remote end DSLAM has not been reset/rebooted in ages, no?

In your graphs the DSL metrics continue to be reported while the issue occurs. So it looks like the modem, its drivers and the userspace application are working the entire time. (It is a bit weird though that the upstream stats are sometimes missing.)

As far as I can see from the interrupt graph, the rate of the mei_cpe interrupt (63) stays roughly the same the entire time (it is hard to see, though, with the stacked graph). Only the PTM (96) and Ethernet (72/73) interrupts drop to nearly zero. This doesn't necessarily point to the modem as the issue; you would generally see the same thing whenever no data is transferred for any reason.

Have you made sure that this is actually a problem with the modem itself? Maybe there is some other issue that just happens to go away with a reconnection (due to the dsl0 interface link going down and up again).

To try to narrow down possible causes you could check if restarting the control daemon is actually necessary to fix the connection. For example, try reloading the PTM driver (rmmod ltq-vr9-ptm and modprobe ltq-vr9-ptm) or restarting the DSL connection via dsl_cpe_pipe.sh acs 2 instead.
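The recovery options named above could be tried one at a time, for example with a small dry-run wrapper like the sketch below. The helper names `run` and `recover_step` and the `DRY_RUN` variable are hypothetical, and the ordering of the steps is an assumption; only the commands themselves come from this thread. Note that reloading the PTM driver may itself disturb the link on some setups.

```shell
#!/bin/sh
# Recovery steps named in this thread, selectable one at a time.
# DRY_RUN=1 (the default here) only prints what would be executed.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_step() {
    case "$1" in
        1)  # restart only the DSL connection
            run dsl_cpe_pipe.sh acs 2 ;;
        2)  # reload the PTM driver
            run rmmod ltq-vr9-ptm
            run modprobe ltq-vr9-ptm ;;
        3)  # restart the control daemon
            run /etc/init.d/dsl_control restart ;;
        *)  echo "usage: recover_step 1|2|3" >&2
            return 1 ;;
    esac
}
```

Calling `recover_step 1` after the traffic has stopped, then checking whether packets flow again before moving on to the next step, shows which component actually needs the restart.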

Regarding the error counters, it probably makes sense to look at when these errors occur (include the error counters in your monitoring if not already the case).
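As a rough sketch of what logging the counters over time could look like: the function below (the name `log_dsl_errors` is made up) reads status text in the same format as the pastes in this thread and prints one timestamped line per error counter. How that text is obtained on the router is left open here, and the two values are simply reported in the order they appear.

```shell
#!/bin/sh
# Print "timestamp counter first second" for each error counter on stdin.
# Input lines are expected to look like the status paste in this thread:
#   Errored seconds (ES):70575 / 48
log_dsl_errors() {
    ts="${1:-$(date +%s)}"
    awk -v ts="$ts" '
        # Find the counter abbreviation in parentheses, followed by ":".
        match($0, /\((FECS|ES|SES|LOSS|UAS|HEC|CRC_P|CRCP_P)\):/) {
            name = substr($0, RSTART + 1, RLENGTH - 3)
            rest = substr($0, RSTART + RLENGTH)
            split(rest, v, " / ")
            print ts, name, v[1] + 0, v[2] + 0
        }'
}
```

Appending the output to a file at regular intervals makes it possible to see whether the counters jump right before the traffic stops or grow steadily.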

Just another thing: firmware version 5.7.11.5.0.7 is pretty old. I am aware that it was common practice to try different firmware versions in the hope of more stability before @janh found the actual root cause, but from my own experience many of the older firmware versions are unstable and/or unnecessarily aggressive. Now that the actual cause of the vectoring instability has been fixed, the current best practice seems to be to use the latest available version, currently 5.9.1.4.0.7-5.9.0.D.0.2. Maybe you want to try a different or the newest firmware, just to rule out that potential cause?

Sure, let's try that as well. Thanks.

OK, done changing for today:

  • switched from the ZyXEL to a Fritzbox 7360v2 I still had around, and flashed it with the latest OpenWrt master (as of today)
  • switched firmware to 5.9.1.4.0.7-5.9.0.D.0.2 (I guess I need the B version; I never considered using an A version)

So far stable, DS SNR margin increased to 6 dB. Let's see what happens the coming hours and days.

Hi @janh, thanks for your support. As stated below, I changed to a Fritzbox to explore whether that gives the same result. Regarding reloading the PTM driver: I did that once and immediately lost the Ethernet link, so checking anything after that became more or less impossible (reboot required). What I want to check when it happens again is whether the sending of error samples continues (which is more or less expected) or whether this stops as well.

I thought I remembered reloading the PTM driver to work without any issues. But testing again, it seems to break the DSL connection when the driver is loaded again. And after that no packets are received over DSL anymore (sending works). Seems to be fixed only by unloading and reloading the CPE API and MEI drivers (the PTM driver will then be loaded again automatically during the initial connection).

Unfortunately, it is not really possible to know for sure if the error samples are actually transmitted without having access to the other end of the line. Of course, one can look at the counters and check using tcpdump if the packets are passed to the PTM driver. But in this case I'm not sure if the results from that should be trusted.

@janh

Indeed, the only way I got it working each time is by restarting DSL (/etc/init.d/dsl_control restart). After checking a long tcpdump, this is what happens:

  1. traffic is handled as normal for a long time (send and receive)
  2. at some point packet reception stops (observed with tcpdump -i dsl0); for a while, packets to be sent still appear at dsl0
  3. LCP (as I am using PPPoE) no longer receives any reply and terminates the connection
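The pattern in step 2 (receive stops while transmit continues) could also be watched for without a long tcpdump, by sampling the standard Linux interface counters under /sys/class/net. The following is only a sketch: the helper names `read_counter` and `rx_stalled` are made up, the 30 s interval is arbitrary, and an idle line could trigger a false positive.

```shell
#!/bin/sh
# Read one statistics counter of a network interface from sysfs.
read_counter() {
    cat "/sys/class/net/$1/statistics/$2"
}

# Compare two samples: succeeds if the receive counter did not move
# while the transmit counter still advanced.
rx_stalled() {
    prev_rx="$1"; cur_rx="$2"; prev_tx="$3"; cur_tx="$4"
    [ "$cur_rx" -eq "$prev_rx" ] && [ "$cur_tx" -gt "$prev_tx" ]
}

# Example loop (not executed here):
#   prx=$(read_counter dsl0 rx_packets); ptx=$(read_counter dsl0 tx_packets)
#   while sleep 30; do
#       rx=$(read_counter dsl0 rx_packets); tx=$(read_counter dsl0 tx_packets)
#       rx_stalled "$prx" "$rx" "$ptx" "$tx" && logger "dsl0: receive stalled"
#       prx=$rx; ptx=$tx
#   done
```

A log entry with a timestamp makes it easier to line up the moment reception stops with the DSL error counters and the LCP timeout.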

What might be helpful is a way to check the interface at the bottom side of the PTM-TC layer for packets being sent and received. Would that be possible?

ICYMI: The patches have (finally) been merged into 22.03-master. Thanks to everyone involved!

BTW#1, I noticed that the EBRs are sent with an Ethernet source address of 00:00:00:00:00:00. Not sure whether this has any serious impact. I tried to change this in a clean way, but LuCI apparently has no option for it, so for now I have hardcoded it in the DSL startup script until I find a better way.

BTW#2, the transmission of EBRs continues when sending and especially receiving of normal packets has stopped. EBRs are still visible on dsl0.

The dsl_control init script already sets the source MAC address based on the one configured for the dsl0 device. Can you check which part of this is not working for you?

For your issue I don't really have any further ideas. You could try enabling debug messages to see if anything interesting comes up (I described how to do this earlier in this thread).

Using ifconfig dsl0 does show a MAC-address

dsl0      Link encap:Ethernet  HWaddr 00:20:DA:86:23:75

however, this address is not used for some reason, which results in 00:00:00:00:00:00 being used as the default.

Looking at the script, it tries to find the MAC address via get_config ...., which apparently does not return anything useful.

Will also check the debugging options.

Is there a MAC address configured for dsl0 in /etc/config/network? Normally, there should be a section like this by default:

config device
	option name 'dsl0'
	option macaddr '12:34:56:ab:cd:ef'
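A quick way to check for such a section from the shell might look like the sketch below. The function name `configured_macaddr` is hypothetical, and the parser is deliberately simple: it assumes the `name` option comes before `macaddr` inside the `config device` section, as in the example above.

```shell
#!/bin/sh
# Print the macaddr configured for a device section in a UCI network file.
# Returns non-zero if no matching option is found.
configured_macaddr() {
    dev="$1"
    file="${2:-/etc/config/network}"
    awk -v dev="$dev" -v sq="'" '
        # A new section starts; remember whether it is a device section.
        $1 == "config" { in_dev = ($2 == "device"); name = "" }
        in_dev && $1 == "option" && $2 == "name" {
            gsub("[" sq "\"]", "", $3); name = $3
        }
        in_dev && $1 == "option" && $2 == "macaddr" && name == dev {
            gsub("[" sq "\"]", "", $3); print $3; found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}

# Example: configured_macaddr dsl0 || echo "no MAC address configured for dsl0"
```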

So no, that section wasn't there. I added it and the issue has been solved.

More worrying is that stability seems to be getting worse: the line no longer reaches 24 hours without the issue recurring.
