As expected, there haven't been any resyncs so far with polling (it has been running for more than 22 hours at the moment). I updated my ltq-vectoring branch with a polling patch.
One thing I noticed is that with polling the reported line uptime is even more inaccurate than before. The reported value is now a bit more than half of what it should be. Reading status information (via dsl_cpe_pipe.sh or ubus) is also much slower now.
By the way, for anyone who wants to do some testing without patching the MEI driver: using log level 2 should be safe, but could still provide some useful information (for example, an ACK mismatch error should be logged then). You would just have to write LOG 2 and TRACE 2 to /proc/driver/mei_cpe/config for that. Without setting DBG_LOGGER, it will just print to the kernel log.
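The two writes mentioned above can be wrapped in a small helper. This is only a sketch: the function name set_mei_log_level and the MEI_CONFIG variable are made up for illustration, and it assumes the driver takes one command per write to the config file.

```sh
# Path taken from the post; overridable so the helper can be tried on a scratch file.
MEI_CONFIG="${MEI_CONFIG:-/proc/driver/mei_cpe/config}"

# Hypothetical helper: set both the LOG and TRACE level of the MEI driver.
set_mei_log_level() {
    level="$1"
    # Assumption: one command per write, so LOG and TRACE are sent separately.
    echo "LOG $level" > "$MEI_CONFIG" || return 1
    echo "TRACE $level" > "$MEI_CONFIG" || return 1
}
```

Calling set_mei_log_level 2 as root should then be enough, and any resulting messages should show up in the kernel log (dmesg).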
Looking through newer versions of the MEI driver, I couldn't find any relevant changes, so I guess the same issue exists there. It could be that the issue just isn't triggered there in practice, as error reports are handled by the firmware (so there shouldn't be any vectoring-specific interrupts).
The support data from a Fritzbox 7362 SL running firmware 07.12 shows that they are using interrupts. There isn't any [VrxCtrl] kernel thread in the process list, and the kernel log shows this:
[ 27.920000] activating IRQ mode
[ 27.920000] requesting IRQ
[ 27.920000] request_irq
[ 27.920000] usedIrq: 57 | usedIsrHandler: 811adb00 | usedFlags: 0x100 | pUsedDevName: mei_vr9 | pUsedDevId: -2035560832
[ 27.920000] IRQ requested ok
[ 27.920000] MEI_DRV: MEI_IfxRequestIrq(IRQ = 57, .., ), lock = 1
So, I have no idea why it works on the vendor firmware. In the system log their MEI driver claims to be version 1.4.4, but then the question is how different it actually is and how much they patched it.
I would guess this issue doesn't exist on devices without SMP, i.e. when the second VPE is used for voice. However, AVM uses their own voice solution, so they probably also use both VPEs.
For me the issue only occurred after I started monitoring stats every second (but I'm not sure if that is actually what triggered it). It would make sense, though, that the number of messages sent to the device to query data also changes the likelihood of the issue actually being triggered.
