As expected, there weren't any resyncs so far with polling (running for more than 22 hours at the moment). I updated my ltq-vectoring branch with a polling patch.

One thing I noticed is that with polling the reported line uptime is even more inaccurate than before. The reported value is now a bit more than half of what it should be. Reading status information (via dsl_cpe_pipe.sh or ubus) is also much slower now.

By the way, anyone who wants to do some testing without patching the MEI driver, using log level 2 should be safe but could still provide some useful information (for example, an ACK mismatch error should be logged then). You would just have to write LOG 2 and TRACE 2 to /proc/driver/mei_cpe/config for that, without setting DBG_LOGGER it will just print to the kernel log.

Looking at the changes in newer versions of the MEI driver I couldn't find any relevant changes, so I guess the same issue exists there. It could be that the issue just isn't triggered there in practice, as error reports are handled by the firmware (so there shouldn't be any vectoring-specific interrupts).

The support data from a Fritzbox 7362 SL running firmware 07.12 shows that they are using interrupts. There isn't any [VrxCtrl] kernel thread in the process list, and the kernel log shows this:

[   27.920000] activating IRQ mode
[   27.920000] requesting IRQ
[   27.920000] request_irq 
[   27.920000] usedIrq: 57 | usedIsrHandler: 811adb00 | usedFlags: 0x100 | pUsedDevName: mei_vr9 | pUsedDevId: -2035560832
[   27.920000] IRQ requested ok
[   27.920000] MEI_DRV: MEI_IfxRequestIrq(IRQ = 57, .., ), lock = 1

So, I have no idea why this it works on the vendor firmware. In the system log their MEI driver claims to be version 1.4.4, but then there is the question how different it actually is and how much they patched it.

I would guess this issue doesn't exist on devices without SMP, i.e. when the second VPE is used for voice. However, AVM uses their own voice solution, so they probably also use both VPEs.

For me the issue only occurred after I started monitoring stats every second (but I'm not sure if that is actually what triggered it). But it makes sense that the amount of messages that are sent to the device to query data would also change the likelihood of the issue actually being triggered.

I think you need a double check. The interrupt service routine checks if the write is busy, and the write checks if an interrupt is busy. Something like this:

int status = 0;
enum{
	statusInterrupt = 1,
	statusWrite = 2,
	flagInterruptDropped = 4
};

// interrupt service routine
void MEI_Interrupt()
{
	int oldvalue, newvalue;
	while( true )
	{
		oldvalue = status;
		if( oldvalue & statusWrite )
		{
			// Write is busy. Set interrupt dropped flag
			newvalue = oldvalue | flagInterruptDropped;
		}
		else if( oldvalue & statusInterrupt )
		{
			// Interrupt being handled on other core
			return;
		}
		else
		{
			// Lock 
			newvalue = oldvalue | statusInterrupt;
		}
			
		if( atomic_compare_exchange( &status, oldvalue, newvalue ) ) 
			break;
	}
	if( newvalue & statusWrite )
		return;
	
	/* Handle interrupt */

	newvalue &= ~statusInterrupt;
	atomic_exchange( &status, newvalue );
}

void MEI_WriteMailbox()
{
	int oldvalue, newvalue;
	while( true )
	{
		oldvalue = status;
		if( oldvalue & statusInterrupt )
		{
			// interrupt busy
			futex_wait( &status, 0 );
			continue;
		}
		newvalue = oldvalue | statusWrite;
		if( atomic_compare_exchange( &status, oldvalue, newvalue ) )
			break;
	}
	
	/* Payload code */

	while( true )
	{
		oldvalue = status;
		newvalue = 0;
		if( atomic_compare_exchange( &status, oldvalue, newvalue ) )
			break;
	}
	
	if( oldvalue & flagInterruptDropped )
	{
		/* handle dropped interrupt if possible/necessary */
	}
}

Thanks:

root@OpenWrt:~# cat /proc/driver/mei_cpe/config
 0: LOG 	=  4
 1: TRACE 	=  4
 2: L_MEI 	=  4
 3: T_MEI 	=  4
 4: MB_ME2ARC 	= 0x00040014
 5: MB_ARC2ME 	= 0x00040120
 6: DL_BLOCK_TOUT 	= 0
 7: W_MODEM_ONLINE 	= 10000 [ms]
 8: W_DFE_RESP_MS 	= 2000 [ms]
 9: FSM_PRE 	= 0x00 
10: MDMP_ENABLE 	= 0x00 
11: MDMP_ID 	= 0x86 
12: GDL 	T = 1 L = 1 	Cnt = 0x0, Mask[0..3] = 0x0 0x0 0x0 0x0
13: DBG_FLAGS 	= 0
	Bit0:  0: ARC HALT
	Bit1..31: Reserved
14: DBG_LOGGER 	= 0
	0: Logger is not used
	1: Logger is used
echo "LOG 2" > /proc/driver/mei_cpe/config
echo "Trace 2" > /proc/driver/mei_cpe/config
root@OpenWrt:~# cat /proc/driver/mei_cpe/config
 0: LOG 	=  2
 1: TRACE 	=  2
 2: L_MEI 	=  4
 3: T_MEI 	=  4
 4: MB_ME2ARC 	= 0x00040014
 5: MB_ARC2ME 	= 0x00040120
 6: DL_BLOCK_TOUT 	= 0
 7: W_MODEM_ONLINE 	= 10000 [ms]
 8: W_DFE_RESP_MS 	= 2000 [ms]
 9: FSM_PRE 	= 0x00 
10: MDMP_ENABLE 	= 0x00 
11: MDMP_ID 	= 0x86 
12: GDL 	T = 1 L = 1 	Cnt = 0x0, Mask[0..3] = 0x0 0x0 0x0 0x0
13: DBG_FLAGS 	= 0
	Bit0:  0: ARC HALT
	Bit1..31: Reserved
14: DBG_LOGGER 	= 0
	0: Logger is not used
	1: Logger is used

Let's see whether this will yield something interesting....

I note that this is uper chatty, repeating:

Fri Dec 24 23:17:55 2021 kern.warn kernel: [6658315.867677] MEI_DRV[0x00]: ERB processed successfully by PP driver
Fri Dec 24 23:17:55 2021 kern.warn kernel: [6658315.867677]
Fri Dec 24 23:17:55 2021 kern.warn kernel: [6658315.924949] MEI_DRV[00]: DSM EVT ErrorVectorReady - ErrVecProcResult = 0, ErrVecSize = 1371

in logread, maybe I should just enable this when I expect a retrain....

Compared to log level 1, this is definitely not super chatty (as mentioned before that would log so much that the system is slowed down very noticeably). But you are right, it may be a better idea to use log level 3 instead, unless you are sending the logs to an external device. Otherwise you would have to be quick to fetch the log after a resync.

1 Like

Thanks switched to level 3, let's see whether I can extract anything helpful around the spurious retrains (I say spurious as these are not correlated with increased FEC, CRC, or Retransmission error counters, albeit it is possible that a burst of errors drives the link into retrain before those counters are updated, but I would expect the local counters in the modem to show something, unless the problem only exists in the other direction....)

Sat Dec 25 19:27:01 2021 kern.warn kernel: [6660398.312232] MEI_DRV[00]: DSM EVT ErrorVectorReady - ErrVecProcResult = 0, ErrVecSize = 1371
Sat Dec 25 19:27:01 2021 kern.warn kernel: [6660398.312232]
Sat Dec 25 19:27:01 2021 kern.warn kernel: [6730861.860521] MEI_DRV[00]: ARC2ME MB, 0..3 - 0x0000, 0x4010, 0x2001, 0x0107
Sat Dec 25 19:27:01 2021 kern.warn kernel: [6730861.860521]
Sat Dec 25 19:27:02 2021 kern.warn kernel: [6730861.869354] MEI_DRV[00]: ARC2ME MB, 4..7 - 0x0000, 0x0001, 0x0000, 0x055B
Sat Dec 25 19:27:02 2021 kern.warn kernel: [6730861.869354]
Sat Dec 25 19:27:02 2021 kern.warn kernel: [6730862.740055] leave showtime
Sat Dec 25 19:27:02 2021 kern.info kernel: [6730862.753525] br-lan: port 7(dsl0.7) entered disabled state
Sat Dec 25 19:27:02 2021 daemon.notice netifd: Network device 'dsl0' link is down
Sat Dec 25 19:27:02 2021 daemon.notice netifd: Network alias 'dsl0' link is down
Sat Dec 25 19:27:02 2021 daemon.notice netifd: Interface 'wan6' has link connectivity loss
Sat Dec 25 19:27:02 2021 daemon.notice netifd: Interface 'wan' has link connectivity loss
Sat Dec 25 19:27:02 2021 daemon.notice netifd: Interface 'wan' is now down
Sat Dec 25 19:27:02 2021 daemon.notice netifd: 8021q 'dsl0.7' link is down
Sat Dec 25 19:27:02 2021 daemon.notice netifd: Interface 'wan6' is now down
Sat Dec 25 19:27:02 2021 daemon.notice netifd: Interface 'wan6' is disabled
Sat Dec 25 19:27:36 2021 daemon.err collectd[1919]: exec plugin: Program `/usr/lib/collectd/exec/exec-lantiqdsl.sh' has closed STDERR.
Sat Dec 25 19:28:51 2021 daemon.notice netifd: Network device 'dsl0' link is up
Sat Dec 25 19:28:51 2021 daemon.notice netifd: Interface 'wan' has link connectivity
Sat Dec 25 19:28:51 2021 daemon.notice netifd: Interface 'wan' is setting up now
Sat Dec 25 19:28:51 2021 kern.warn kernel: [6730964.533114] MEI_DRV[00]: EVT_TC_StatusGet_t - TC indicated is 1 (PTM)
Sat Dec 25 19:28:51 2021 kern.warn kernel: [6730964.533114]
Sat Dec 25 19:28:51 2021 kern.warn kernel: [6730972.423903] enter showtime

Now, what is ARC2ME? ;), loglevel 3 seems fully acceptable noise-wise but does this actually help?

Mmmh, seems similar to yours above, so nothing new.... but it might indicate that my retrains/resyncs come from the same root cause you just described, no?

It's clear that it is possible to do this correctly with interrupts. I want to avoid a larger patch to the MEI driver though, and I'm not sure I know the driver enough to do any deeper changes anyway. And just looking at MEI_ReadMailbox and the interrupt handler may not be enough, as I think the driver expects mutual exclusion with the interrupt handler also in other places.

However, here (and also in my ltq-vectoring branch) is a small patch that seems to work (stable for over 16 hours now, from the log I can see that there are no longer any calls to MEI_ReadMailbox while MEI_WriteMailbox is active).

--- a/src/drv_mei_cpe_common.c
+++ b/src/drv_mei_cpe_common.c
@@ -104,6 +104,8 @@ IFX_uint32_t MEI_FsmStateSetMsgPreAction
 MEI_DEVCFG_DATA_T MEI_DevCfgData;
 #endif
 
+static DEFINE_SPINLOCK(MEI_InterruptLock);
+
 /* ============================================================================
    Proc-FS and debug variable definitions
    ========================================================================= */
@@ -2134,6 +2136,9 @@ IFX_int32_t MEI_ProcessIntPerIrq(MEIX_CN
 #if (MEI_SUPPORT_DEBUG_STREAMS == 1)
    IFX_int_t   extraDbgStreamLoop = 0;
 #endif
+   unsigned long flags;
+
+   spin_lock_irqsave(&MEI_InterruptLock, flags);
 
    /* get the actual chip device from the list and step through the VRX devices */
    while(pNextXCntrl)
@@ -2167,6 +2172,8 @@ IFX_int32_t MEI_ProcessIntPerIrq(MEIX_CN
    }
 #endif
 
+   spin_unlock_irqrestore(&MEI_InterruptLock, flags);
+
    return meiIntCnt;
 }
 
@@ -2639,9 +2646,14 @@ IFX_int32_t MEI_MsgSendPreAction(
 */
 IFX_void_t MEI_DisableDeviceInt(MEI_DEV_T *pMeiDev)
 {
+   unsigned long flags;
+   spin_lock_irqsave(&MEI_InterruptLock, flags);
+
    MEI_MaskInterrupts( &pMeiDev->meiDrvCntrl,
                        ME_ARC2ME_INTERRUPT_MASK_ALL);
 
+   spin_unlock_irqrestore(&MEI_InterruptLock, flags);
+
    return;
 }
 

The assumption here is that the driver calls MEI_DisableDeviceInt whenever mutual exclusion with the interrupt handler is needed. This masks interrupts, so no further interrupts should happen, and there is also another check for masked interrupts in MEI_ProcessIntPerVrxLine. But as mentioned before, this is not enough on systems with SMP, as this code may execute while the interrupt handler is already running and past the interrupt mask check.

This adds a lock around the actual work including the interrupt mask check in the interrupt handler. This lock is also acquired in MEI_DisableDeviceInt, which should make sure that the interrupt handler is currently not running, or it is before the interrupt mask check, and is thus going to notice the change and not actually do anything.

I guess a cleaner solution would be to acquire the lock for the entire section between MEI_DisableDeviceInt and MEI_EnableDeviceInt. I would have to look into this a bit further for that, as not every call to these functions are actually for mutual exclusion with the interrupt handler, so this would require additional changes at some of the places these are called.

Yes, it seems that your issue may actually be the same one I'm also seeing here. Maybe you could try this patch to check if it also fixes the issue for you?

I have applied your patch now (and reverted the polling). I'll report back on stability in a couple of days.

I had two retrains since I flashed the patch. I'll monitor a bit longer to see if anything interesting comes up if this happens again. I've set the loglevels to what @moeller0 has suggested here.

Btw there are lock that can sleep or we can set the interrupt to not sleepable. Pooling is a way but it depends. Most of the time pooling is not that good if a locking mechanism can be implemented.

Mmmh, this morning it seems especially flaky:

Tue Dec 28 11:06:58 2021 kern.warn kernel: [6960058.740560] MEI_DRV[00]: ARC2ME MB, 0..3 - 0x0000, 0x4310, 0x0002, 0x4000
Tue Dec 28 11:06:58 2021 kern.warn kernel: [6960058.740560]
Tue Dec 28 11:06:58 2021 kern.warn kernel: [6960058.740578] MEI_DRV[00]: ARC2ME MB, 4..7 - 0x6000, 0x0002, 0xBF88, 0x003A
Tue Dec 28 11:06:58 2021 kern.warn kernel: [6960058.740578]
Tue Dec 28 11:06:58 2021 kern.warn kernel: [6960058.747902] MEI_DRV[00 - 01]: ReadAck[0x4000] - FctOP ERROR 0x31!
Tue Dec 28 11:06:58 2021 kern.warn kernel: [6960058.747902]
Tue Dec 28 11:06:59 2021 kern.warn kernel: [6960058.756618] MEI_DRV[00 - 01]: Send Msg[0x0002] read ACK failed (cnt: 188340383, t: 1911700 - 1911710 = 10)
Tue Dec 28 11:06:59 2021 kern.warn kernel: [6960058.756618]
Tue Dec 28 11:06:59 2021 kern.warn kernel: [6960059.800120] leave showtime
Tue Dec 28 11:06:59 2021 kern.err kernel: [6960059.813064] target-mips_24kc_musl/linux-lantiq_xrx200/ltq-ptm-vr9/ltq-ptm/ifxmips_ptm_vdsl.c:293:ptm_hard_start_xmit: not in showtime

I guess i need to start building a new version with your patch, as I seem to be on a good position to test its efficacy right now :wink:

Embarrassing question, to build this I basically need to clone https://github.com/janh/openwrt/tree/ltq-vectoring and then drop-in/modify the .config and that should be it?

You can either clone my fork of the OpenWrt repository and check out the ltq-vectoring branch, or you could apply my changes on top of a clone of the upstream repository (GitHub lets you download individual commits if you add .patch to the end of the URL). At the moment, you will get the latest version in any case, as I just rebased my branch while I added the missing package version bump in the latest commit.

Other than that, just follow the usual OpenWrt build instructions (there is a short guide the README in the root directory). And don't forget to run one of make menuconfig / oldconfig / defconfig even if you copy in an existing config (but you'll get a warning about that anyway).

Thanks, I am just trying my luck...

First goal getting the thing to compile, then add a few tools/features and test it out.

Mmmh, I got it to built with your patches included (and all those packages I missed before) but I seem to be unable to get it to sync with my vectoring DSLAM.... when I switch back to a non-vectoring firmware blob I get a sync in the <= 2.2 MHz fall-back profile (without vectoring). I was using a second BTHH5A for this test which might be subtly broken, so I will need to re-test that device with my old firmware version to see whether I get reliable vectoring sync... (I hesitate to use my production HH5A for testing, as it is quite nice to have a drop-in replacement sitting on the shelf to switch the internet back to the known and accepted level of flakiness :wink: )

That sounds like a good idea. Have you also checked that the DSL configuration in /etc/config/network is identical on both devices?

I copied the old working config into the new build, but I have not yet diff'd them.... but the rest, including my DSA configuration to bridge DSL0.7 with wan.7 worked immediately....

Mmmh, I reflashed the dubious HH5A to the older OpenWrt SNAPSHOT, r17498-07203cb253 (kernel 5.4) and it seems to be much more robust... will need to try it on the DSL link, but that will need to wait until all users are done for the day/evening.

Oh, well, uhm. Yes. You might not want to use that one for testing. It is very much not-subtly broken in the DSL department I think. :wink:

1 Like

Oops, my plan was to use that as a canary and if it survived/worked for an hours, reflash the other one. I guess not going to fly, that idea; so I will reflash the main modem.....

Not that bad, I still have thw zyxel as backup, only I did not manage to configure that the same way as the hh5a so it is not a drop-in replacement....

Mmmh, so with your patches on the known working router I get a flood of"

[  564.406439] MEI_DRV[0x00]: ERB dropped by PP driver (err code -19)!
               
[  564.470149] MEI_DRV[0x00]: ERB dropped by PP driver (err code -19)!
               
[  564.533862] MEI_DRV[0x00]: ERB dropped by PP driver (err code -19)!
               
[  564.597563] MEI_DRV[0x00]: ERB dropped by PP driver (err code -19)!
               
[  564.661299] MEI_DRV[0x00]: ERB dropped by PP driver (err code -19)!
               
[  564.725240] MEI_DRV[0x00]: ERB dropped by PP driver (err code -19)!
               
[  564.788990] MEI_DRV[0x00]: ERB dropped by PP driver (err code -19)!

in dmesg, an apparently acceptable sync, but no data passes over the link. I reverted back to the old snapshot....

This is all with a 5.9.1.4.0.7 firmwareblob extracted from FritzOS 7.28 for the xrx2000 based 7490.

Any idea what I should test next? Or do I need to fully power cycle the modem when changing kernels?