This reminds me of an issue in the VRX518 driver which caused the mei_cpe interrupt to be masked. As a result, the DSL connection dropped after a short time. So if there is actually an issue here that causes interrupts to be missed or not generated at all, it makes sense that it could also cause the connection to be interrupted.

The drops in the rate of vectoring error reports make me wonder if switching to polling really fixes the issue, or just reduces its impact enough that the connection doesn't drop. In that case, the question remains what actually causes the issue.

You don't happen to have data for the mei_cpe interrupt from before switching to polling, do you? It would be interesting to see whether there were any changes there before any connection drops.

@moeller0
In the UK we have two DSLAM cabinet types. One is the ECI M41 with a Lantiq chipset, and the other is the Huawei MA5603 with a Broadcom chipset. However, I have no idea what they are going to replace the Huawei ones with, since Huawei has to say goodbye to us.
Our G.fast pods are Nokia, so I'm not sure what chipsets they run on.

From a Huawei MA5603 (Broadcom chipset) I get the values below, taken from a FRITZ!Box info page on a UK cabinet, if that helps.

Profile 17a
G.INP enabled on downstream only, not on upstream
INP 53, downstream only
Carrier record J43 on both downstream and upstream
G.Vector No / No (not used)
Bitswap On/On

From what I read, they tried G.INP on the ECI cabinets but turned it off because of errors on the CPE router side. That was back in August 2020, and there has been nothing about re-enabling it since. To this day, they only have G.INP on the downstream on the Huawei cabinets.

I am currently doing another test myself, and this time I also had an unexpected reconnection after a bit more than a day. Interestingly, this happened while error reports were being handled.

The data also shows another strange result. At one point, the discarded ERB count briefly increased by 1133712, and the next second it decreased again by 1133711. This also happened while error reports were being handled.

The only difference to my previous tests is that I am now monitoring the ERB and IRQ counters every second. I also did some stress testing on the device to see if that causes any issues, but it didn't have any immediate effects.

@jekkos, how do you actually measure the ERB data to get that graph? This seems to be error reports per some time period, but what is the actual interval?

It happened again about 19 hours later. I'm wondering why this didn't affect me before. It does seem to be vectoring-related though, as it was again while error reports were being handled.

Hi @janh, here are two screencaps of a reconnect. There is a short spike of IRQ 63 before and after, but maybe that is to be expected? I'm using the 1m derivative in Grafana to visualize the error counter rate. I guess it re-renders a bit once you zoom out (one of the screenshots is after I zoomed in on a time interval).

EDIT: I should also add that since the polling patch my uptime is 10d+ and counting


Averaging over 1 minute produces the typical Moiré effect that is visible in your zoomed-in screenshot. It's really hard to visualize this data properly without averaging over a much longer period (or matching the interval of the error reports, but that is going to be difficult). But if the zoomed-out view averages over a longer period (and it seems it does), those graphs are probably realistic (the correlation with the slight SNR wobble in your previous graph is also a hint that they may be correct).

So if there are actually around 130-140 error reports per minute on your line, that could explain why this is more of an issue for you. I am only seeing about 80 error reports per minute here (batches of 65 roughly every 50 seconds). Could you confirm that this is the case? By looking at the raw data with a sampling rate of 1 second it should be easy to determine the batch size and interval (just looking at the data for 5 minutes or so should be enough, if there is actually a regular interval).

This reminds me that the vectoring driver from AVM actually has code for a rate limit for error reports. It seems to be built without it though, and from the support data it also looks like they are using interrupts, so there should be some way to make this work without resorting to polling…

This spike seems to be quite some time before the connection was interrupted, so I don't think this is related. What is the actual unit for the interrupts graph? The spike on reconnection is at roughly 1.5 million in the graph and this is definitely not realistic.

I have the impression that the number of interrupts is also related to the amount of traffic that goes through. Yesterday I noticed that the rate of PTM interrupts halved after my Ethereum node dropped most of its peers.

I estimate it currently processes around 1000 callbacks every 30s, judging from the graph. The unit on the interrupts graph is a short type, so it might be scaled wrongly as you say.

The interrupt ptm_mailbox_isr notifies the PTM driver of received and transmitted packets, so the count definitely depends on the amount of traffic. The interrupts xrx200_net_rx and xrx200_net_tx do the same thing, but for ethernet.
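
A quick way to watch how these counters develop on the device is a loop like the one below (the interrupt names are the ones mentioned above; adjust the pattern if they differ on your build):

while sleep 1; do grep -E 'mei_cpe|ptm_mailbox|xrx200_net' /proc/interrupts; echo; done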

What I am really interested in is the frequency of just the error reports (interval and batch size if there is a regular pattern). If you don't want to analyze that yourself, could you just run this command for around 5 minutes and send me the results?

while true; do ubus call dsl metrics | jsonfilter -e '@.erb'; sleep 1; done

It would also be interesting to get this information from anyone else who has the resync issue when not using the polling patch, to see if there is a pattern.
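
If you prefer to read the batch size and interval directly off the device instead of post-processing the raw output, a small variation of the loop above should also do (this assumes the counter is exposed as erb.sent; adjust the jsonfilter expression if the field name differs on your version):

prev=$(ubus call dsl metrics | jsonfilter -e '@.erb.sent')
while sleep 1; do
  cur=$(ubus call dsl metrics | jsonfilter -e '@.erb.sent')
  # only print when the counter changed, i.e. when a batch of ERBs was sent
  # (a counter reset on resync will show up as a negative delta)
  [ "$cur" != "$prev" ] && echo "$(date +%T) batch of $((cur - prev))"
  prev=$cur
done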

@janh, here are the script results. I can't attach them as a file, so I'll paste them here. Also another example of such an IRQ 'spike' I saw yesterday. If you know of anything extra I can monitor to find out the cause here, let me know.

So it looks like most of the time, error reports are sent in batches of 132 roughly every 45 seconds. Sometimes the pattern is broken and there are smaller and more irregular batches, so either the vectoring engine requests them irregularly, or the modem fails to keep up. In any case, this confirms that the number of error reports is actually significantly higher on your line compared to mine (as your graphs already suggested).

It looks like the data is missing entirely for roughly 3 minutes. Can you rule out that there is an issue with your monitoring? Otherwise this would suggest a very serious problem…

OK no, the collectd daemon on the other side was down for that interval, so it's the monitoring itself. I suppose from time to time there might be a spike due to higher traffic.

If there is anything else I can try to gain more insight into the issue, let me know.

After my modem ran fine for a few days there was another reconnect yesterday. I'm currently trying the vectoring callback profiling in the MEI driver to see if anything turns up (this can be enabled by building with MEI_DBG_DSM_PROFILING=1, but it needs a few changes to work). So far, only the lowest bin has a non-zero value, i.e. all callbacks took less than 4ms to execute.

In the meantime there were another two resyncs. The profiling shows that the vectoring callback always took less than 4ms to execute, so I think at least in my case the callback taking too long can be ruled out as cause of the issue.


Mmmh, so the good thing is that I am apparently not the only one affected by the VRX200 retraining a few times per week (occasionally more often, but mostly every few days) without any smoking gun as to why it does so?
Any idea what else to test?

The actual source of the issue isn't clear, but at least we have some pointers. For me, all resyncs (including the latest one today) happened while transmitting error samples. And others have reported that polling fixes the issue. Assuming that these are actually not separate issues, my next idea would be to look closer at the interrupts.

What is the quickest way to monitor this? I have so far not bothered to build my own images for the HH5A, so I guess I should start looking into this over the holidays (starting next Friday)?

If you monitor the stats every second, this is visible from the ERB counters. For example here is an excerpt from the data during the last resync. The first two data columns are the ERB counters from ubus call dsl metrics, and the next two columns are interrupt counters from /proc/interrupts. I am only storing the difference to the previous value.
You can clearly see that the interruption of the connection happened at 13:51:37 while ERBs were being sent, as the counter is non-zero just before (the big jumps are because the values don't seem to be reported via ubus while the connection is down).

                     erb sent, discarded, irq mei, irq ptm
2021-12-17T13:51:32,        0,         0,     240,      80
2021-12-17T13:51:33,        0,         0,      76,      68
2021-12-17T13:51:34,        5,         0,     104,      84
2021-12-17T13:51:35,       16,         0,      80,      68
2021-12-17T13:51:36,       15,         0,      90,      54
2021-12-17T13:51:37,  -871683,        -9,      80,      50
2021-12-17T13:51:38,        0,         0,      10,       8
2021-12-17T13:51:39,        0,         0,      80,       0
2021-12-17T13:51:40,        0,         0,      32,       0
…
2021-12-17T13:53:16,        0,         0,      22,       0
2021-12-17T13:53:17,        0,         0,      32,       0
2021-12-17T13:53:18,   871704,         6,      78,       6
2021-12-17T13:53:19,        0,         0,      74,       
2021-12-17T13:53:20,        0,         0,      90,      18
2021-12-17T13:53:21,        0,         0,      74,       0
2021-12-17T13:53:22,        0,         0,      74,       0
2021-12-17T13:53:23,        0,         0,      90,       0
2021-12-17T13:53:24,        0,         0,      74,       0
2021-12-17T13:53:25,        0,         0,      74,      40
2021-12-17T13:53:26,        0,         0,      90,    1206
2021-12-17T13:53:27,        0,         0,      74,     228
2021-12-17T13:53:28,        0,         0,      74,     218
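
The collection loop itself is nothing special; something along these lines produces data in this format (a sketch rather than the exact script, and the interrupt names and erb.sent/erb.discarded field names are assumptions that may need adjusting):

#!/bin/sh
# log per-second deltas of the ERB counters and two interrupt counters as CSV
# (the very first output line contains absolute values rather than deltas)
irq_count() {
    # sum the per-CPU count columns of the /proc/interrupts line matching $1
    awk -v name="$1" '$0 ~ name { s = 0; for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i; print s; exit }' /proc/interrupts
}
p_sent=0; p_disc=0; p_mei=0; p_ptm=0
while sleep 1; do
    sent=$(ubus call dsl metrics | jsonfilter -e '@.erb.sent')
    disc=$(ubus call dsl metrics | jsonfilter -e '@.erb.discarded')
    mei=$(irq_count mei_cpe)
    ptm=$(irq_count ptm_mailbox_isr)
    echo "$(date +%Y-%m-%dT%H:%M:%S),$((sent - p_sent)),$((disc - p_disc)),$((mei - p_mei)),$((ptm - p_ptm))"
    p_sent=$sent; p_disc=$disc; p_mei=$mei; p_ptm=$ptm
done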

I looked into the logging which is built into the MEI driver. Enabling it is easy via /proc/driver/mei_cpe/config by writing LOG 1 and TRACE 1 (this specifies the log level, 1 means to log everything).

However, this generates a lot of log output, and writing all of this to the kernel log slows down the device significantly. The driver does include a way to log via netlink multicast instead, which can be enabled by writing DBG_LOGGER 1.
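
Putting that together, enabling the verbose netlink logging from a shell looks roughly like this (a sketch based on my reading of the driver; the exact syntax accepted by the config file may differ):

# switch MEI driver logging to the netlink multicast logger
echo "DBG_LOGGER 1" > /proc/driver/mei_cpe/config
# set log and trace level to 1, i.e. log everything
echo "LOG 1" > /proc/driver/mei_cpe/config
echo "TRACE 1" > /proc/driver/mei_cpe/config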

There doesn't seem to be any existing client available, but luckily only a few lines of C code are necessary. I put the code on GitHub. Unfortunately, the MEI driver has a bug that causes kernel crashes when logging data from an interrupt handler, so a small patch is needed to make this work.

It also seems that enabling this can sometimes lead to resyncs, so use it carefully. I still have to look at the results to see if the logged information actually allows narrowing down the source of the original resync issue.

The full log output at log level 1 is quite large, and it is hard to find the actually relevant information in there. Thus, I decided to hack in some custom logging at level 3.

For the most part, the log looks something like this:

[  300.038] <MEI_IoctlCmdMsgWrite start classifier 0000 id 0010>
[  300.038] MEI_WriteMailbox start
[  300.038] (MEI_DisableDeviceInt)
[  300.038] (MEI_EnableDeviceInt)
[  300.038] MEI_WriteMailbox end
[  300.038] <MEI_IoctlCmdMsgWrite end>
[  300.038] <MEI_InterruptLinux start>
[  300.038] MEI_ReadMailbox MEI_MBOX_CODE_MSG_ACK
[  300.038] <MEI_InterruptLinux end 1>
[  300.038] <MEI_InterruptLinux start>
[  300.038] <MEI_InterruptLinux end 0>

So, the driver writes a message to the mailbox, and in the interrupt handler the driver reads the ACK from the mailbox. For some reason, the interrupt handler also always runs another time without actually handling anything. In this example, all of this happens nicely sequentially.

After a bit more than 1.5 hours I got a resync, as shown in the kernel log:

[ 5921.123130] leave showtime

Shortly before, the log shows the following:

[ 5921.041] <MEI_IoctlCmdMsgWrite start classifier 0000 id 0002>
[ 5921.041] <MEI_InterruptLinux start>
[ 5921.041] MEI_WriteMailbox start
[ 5921.041] (MEI_DisableDeviceInt)
[ 5921.041] MEI_ReadMailbox MEI_MBOX_CODE_(NFC|EVT|ALM|DBG)_REQ
[ 5921.042] (MEI_EnableDeviceInt)
[ 5921.042] MEI_RecvAutonomMailboxMsg default
[ 5921.042] MEI_WriteMailbox end
[ 5921.042] <MEI_InterruptLinux end 1>
[ 5921.042] <MEI_IoctlCmdMsgWrite end>
[ 5921.042] <MEI_InterruptLinux start>
[ 5921.042] <MEI_InterruptLinux end 0>
[ 5921.042] <MEI_InterruptLinux start>
[ 5921.042] MEI_ReadMailbox MEI_MBOX_CODE_MSG_ACK
MEI_DRV[00]: ARC2ME MB, 0..3 - 0x0000, 0x4010, 0x0009, 0x4503
MEI_DRV[00]: ARC2ME MB, 4..7 - 0x0000, 0x0009, 0x0000, 0x0034
[ 5921.042] <MEI_InterruptLinux end 1>
[ 5921.042] <MEI_InterruptLinux start>
[ 5921.042] <MEI_InterruptLinux end 0>

What we can see here is that the interrupt handler runs while a write to the mailbox is in progress. The call to MEI_ReadMailbox happens while MEI_WriteMailbox is in its critical section (interrupts are also disabled during this). The switch in MEI_RecvAutonomMailboxMsg ends up in its default section instead of one of the cases where different message types are handled. Shortly afterwards, the interrupt handler is called another time, this time for an ACK. However the ARC2ME MB messages show that the ACK does not match the expected message ID.

My suspicion at this point is that this could be a concurrency issue. Looking at MEI_PollIntPerVrxLine (which is called when using polling instead of interrupts), the call to MEI_ProcessIntPerVrxLine is protected by a lock there. The same function is also called in the interrupt handler, but without any locking. There is some code in MEI_ProcessIntPerVrxLine itself that checks the interrupt mask and skips reading the mailbox if it is masked, but I don't think this is enough on a system with SMP.

Just adding locking to the interrupt handler is not a solution, though. The lock used is a semaphore, which cannot be taken in an interrupt handler, as acquiring it may sleep. So, assuming that my conclusion here is correct, switching to polling may actually be the most realistic fix for the issue (except that now there is an explanation for why it helps).

Cool to see how you got that sorted out. This kernel debugging stuff is a bit over my head, honestly, but it's interesting to read how you tackled it.

I also wonder whether this issue will occur on the VR11 you were working on? And do you have any idea how AVM implemented this functionality, do they use polling as well?