It's clear that it is possible to do this correctly with interrupts. I want to avoid a larger patch to the MEI driver though, and I'm not sure I know the driver enough to do any deeper changes anyway. And just looking at MEI_ReadMailbox and the interrupt handler may not be enough, as I think the driver expects mutual exclusion with the interrupt handler also in other places.
However, here (and also in my ltq-vectoring branch) is a small patch that seems to work (stable for over 16 hours now, from the log I can see that there are no longer any calls to MEI_ReadMailbox while MEI_WriteMailbox is active).
--- a/src/drv_mei_cpe_common.c
+++ b/src/drv_mei_cpe_common.c
@@ -104,6 +104,8 @@ IFX_uint32_t MEI_FsmStateSetMsgPreAction
MEI_DEVCFG_DATA_T MEI_DevCfgData;
#endif
+static DEFINE_SPINLOCK(MEI_InterruptLock);
+
/* ============================================================================
Proc-FS and debug variable definitions
========================================================================= */
@@ -2134,6 +2136,9 @@ IFX_int32_t MEI_ProcessIntPerIrq(MEIX_CN
#if (MEI_SUPPORT_DEBUG_STREAMS == 1)
IFX_int_t extraDbgStreamLoop = 0;
#endif
+ unsigned long flags;
+
+ spin_lock_irqsave(&MEI_InterruptLock, flags);
/* get the actual chip device from the list and step through the VRX devices */
while(pNextXCntrl)
@@ -2167,6 +2172,8 @@ IFX_int32_t MEI_ProcessIntPerIrq(MEIX_CN
}
#endif
+ spin_unlock_irqrestore(&MEI_InterruptLock, flags);
+
return meiIntCnt;
}
@@ -2639,9 +2646,14 @@ IFX_int32_t MEI_MsgSendPreAction(
*/
IFX_void_t MEI_DisableDeviceInt(MEI_DEV_T *pMeiDev)
{
+ unsigned long flags;
+ spin_lock_irqsave(&MEI_InterruptLock, flags);
+
MEI_MaskInterrupts( &pMeiDev->meiDrvCntrl,
ME_ARC2ME_INTERRUPT_MASK_ALL);
+ spin_unlock_irqrestore(&MEI_InterruptLock, flags);
+
return;
}
The assumption here is that the driver calls MEI_DisableDeviceInt whenever mutual exclusion with the interrupt handler is needed. This masks interrupts, so no further interrupts should happen, and there is also another check for masked interrupts in MEI_ProcessIntPerVrxLine. But as mentioned before, this is not enough on systems with SMP, as this code may execute while the interrupt handler is already running and past the interrupt mask check.
This adds a lock around the actual work including the interrupt mask check in the interrupt handler. This lock is also acquired in MEI_DisableDeviceInt, which should make sure that the interrupt handler is currently not running, or it is before the interrupt mask check, and is thus going to notice the change and not actually do anything.
I guess a cleaner solution would be to acquire the lock for the entire section between MEI_DisableDeviceInt and MEI_EnableDeviceInt. I would have to look into this a bit further for that, as not every call to these functions are actually for mutual exclusion with the interrupt handler, so this would require additional changes at some of the places these are called.
Yes, it seems that your issue may actually be the same one I'm also seeing here. Maybe you could try this patch to check if it also fixes the issue for you?