Lantiq DSL Modem - Line Uptime drift (lagging)

Handyman · August 5, 2020, 5:28pm

I hope one of the developers will implement this fix so we don't have to redo it with every firmware update.

mbo2o · August 6, 2020, 4:19am

Good Work

Although the drift factor you calculated looks like a combination of a systematic drift, caused by a count of 1024 vs 1000 per second, and an ordinary time drift.

A multiplier of 1.024 should be enough to correct the systematic drift.

Handyman · August 6, 2020, 5:38am

The 1.024 multiplier was the first value I tested, but I was getting +1 second drift every few hours.
So I started experimenting with other values.
Most people won't care about +1 second drift every few hours, so 1.024 should work fine for them.
The only concern now is to have this fix committed so we don't have to edit the lantiq_dsl.sh file with every update.
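
For scale, the gap between the two multipliers can be worked out directly. This is only a rough sketch: it assumes the finer value is the 1.024062 quoted elsewhere in this thread, and the names are made up for illustration.

```python
# Multipliers discussed in the thread; the constant names are illustrative only.
COARSE_FACTOR = 1.024
FINE_FACTOR = 1.024062


def seconds_until_one_second_gap(coarse: float, fine: float) -> float:
    """Reported uptime (in seconds) after which the two corrections differ by 1 s."""
    return 1.0 / abs(fine - coarse)


gap_seconds = seconds_until_one_second_gap(COARSE_FACTOR, FINE_FACTOR)
gap_hours = gap_seconds / 3600
print(f"1 s of difference every {gap_hours:.1f} hours of reported uptime")
```

That works out to roughly one second every 4.5 hours, which matches the "+1 second every few hours" observation.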

takimata · August 6, 2020, 11:12am

I applaud the strive for accuracy, but honest question: What does actually depend on this value being exact, down to the second no less?

Again, the uptime value is generated by the modem firmware, and why it drifts and for which firmware versions and for which modem chipsets and for which line conditions and under which modem load and at which room temperature is anyone's guess. If "multiplying by 1.024062" works for you, great, more power to you.

But this is not a fix, it is a workaround based on your private observations, and IMNSHO should not be picked up by OpenWrt. Even if your empirical approach is accurate, it is a "sensor value" and should be presented, not interpreted. If a fix is necessary, it needs to be done in the modem/DSLAM firmware (whichever entity calculates it wrong in the first place).

moeller0 · August 6, 2020, 11:23am

I concur. Unless, say, the division by 1024 is caused by a DSL standard and hence theoretically correct, I would always also present the actually returned value...

Handyman · August 6, 2020, 1:24pm

Extract and display statistics and diagnostic data from Lantiq modem

This problem is specific to the Lantiq modem family as far as I know, and from what I can tell I am not the only one complaining. The fix is applied to the lantiq_dsl.sh file, not to every DSL modem out there. And like I said, the multiplier 1.024 should work for everyone. Don't worry about my number; my curiosity took me far, but 1.024 should be fine for the masses.

I totally agree with you that it should be fixed at the Lantiq modem firmware level. But we don't have the luxury to do that, do we? So we work with what we have, even if it's a patch job to keep things working nice and neat until by a miracle it gets fixed in the Lantiq modem firmware itself.

takimata · August 6, 2020, 2:05pm

How can you possibly claim that? You have a sample size of one device, one chipset and one DSL line.

We don't know the reason for the drift, so how can you possibly know your "fix" will work for everyone? Different firmwares can be tested, yes. But if it's a modem chipset issue, how do you know different chipsets from the same family behave the same? If it's an issue with values from the DSLAM, how do you know all DSLAMs behave the same?

"Trust me, I play a doctor on TV."

Handyman · August 6, 2020, 2:32pm

Well then, play the devil's advocate and prove me wrong. It's as simple as that.
I left a step by step guide on how to apply the fix in one of my previous posts for others to test it as well and they are more than welcome to post feedback here.
It's a community forum people.
It's a group effort.
We share our knowledge and help each other. That's how progress is made.

altuntepe · August 6, 2020, 7:30pm

Thanks. It worked! I tried on zyxel P2812 modem.

Handyman · August 6, 2020, 8:42pm

So we now know the fix works for

TP-Link	TD-W8970	Lantiq XWAY VRX268
ZyXEL	P-2812HNU-F1	Lantiq XWAY VRX288

I will keep updating the list with every confirmed report.
Thanks all for your support.

takimata · August 6, 2020, 11:02pm

Alright, besides the fact that this is not how any of this works, I'll play along. I restarted my router this morning, so I can't show a huge uptime, but it's already enough:

Fritz!Box 3370 (Lantiq XWAY VRX288), Firmware 5.7.8.9.1.4
Uptime according to system clock: 14h08m55s = 50935 seconds
Uptime according to modem: 13h46m14s = 49574 secs
Which means a drift of 1.0274 for me.

The uptime corrected by your 1.024062, while admittedly an improvement, would still be off by 3 minutes.
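
For anyone who wants to re-run the numbers above, here is a small sketch. The figures are the ones from this post; the helper name is just for illustration.

```python
def hms_to_seconds(hours: int, minutes: int, seconds: int) -> int:
    """Convert an h/m/s uptime into plain seconds."""
    return hours * 3600 + minutes * 60 + seconds


system_uptime = hms_to_seconds(14, 8, 55)   # uptime according to system clock
modem_uptime = hms_to_seconds(13, 46, 14)   # uptime according to modem

# Ratio between the two clocks for this one sample.
observed_factor = system_uptime / modem_uptime

# What is left over after applying the proposed 1.024062 correction.
corrected = modem_uptime * 1.024062
residual = system_uptime - corrected

print(f"observed factor: {observed_factor:.5f}")
print(f"residual after correction: {residual:.0f} s")
```

The ratio comes out at about 1.02745, and the residual is about 168 seconds, i.e. just under 3 minutes.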

Look, I'm not trying to be confrontational, or argue that there is not an issue here. What I am saying is that we don't know the cause, so we don't know the solution. Trying to find a "correction factor" by experimentation/observation is a valiant effort, but it's only a band-aid, and apparently one that isn't sticking all too well.

Edit: In the end, though, I am not the arbiter here. Submit it to the devs, hear their feedback. Personally, I don't give enough expletives about the uptime the modem reports, and if it's off by 2 or 3% it's no skin off my back.

Handyman · August 7, 2020, 3:09am

Thank you, you actually proved it for me.
Your calculations are wrong.
You expected that the system uptime should equal the line uptime (hence the 1.0274 drift factor) and didn't factor in the time it takes to load other modules and the time it takes the line to sync, so 2-3 min between system uptime and line uptime is an acceptable variance.
That's why I said 1.024 is an acceptable median and should work for everyone.

moeller0 · August 7, 2020, 3:23am

Not to put too fine a point on it, but you really need a better theory/model of what is going on and why, please. That is quite a level of adhocery. But this should also be easy to test: if, say, after a few more days of continuous uptime the difference stays at a fixed ~3 minutes, your hypothesis might have merit, but if the delta increases linearly with uptime, then it might be time to rethink the theory, no?
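
The test described above could be sketched roughly as follows. Everything here is an assumption for illustration: the function name, the idea of logging (system uptime, modem uptime) pairs, and above all the two sample series, which are synthetic and not measurements from any real line.

```python
def delta_growth_rate(samples, factor=1.024):
    """Least-squares slope of (system - factor * modem) against system uptime.

    samples: list of (system_seconds, modem_seconds) pairs.
    A slope near zero means the leftover difference is a fixed offset;
    a clearly non-zero slope means it keeps growing with uptime.
    """
    xs = [system for system, _ in samples]
    deltas = [system - factor * modem for system, modem in samples]
    n = len(samples)
    mean_x = sum(xs) / n
    mean_d = sum(deltas) / n
    numerator = sum((x - mean_x) * (d - mean_d) for x, d in zip(xs, deltas))
    denominator = sum((x - mean_x) ** 2 for x in xs)
    return numerator / denominator


checkpoints = (50_000, 100_000, 200_000, 400_000)

# Synthetic case A: true factor 1.024 plus a fixed 170 s startup offset.
fixed_offset = [(t, (t - 170) / 1.024) for t in checkpoints]
# Synthetic case B: no offset, but the true factor is 1.0274.
growing = [(t, t / 1.0274) for t in checkpoints]

slope_fixed = delta_growth_rate(fixed_offset)
slope_growing = delta_growth_rate(growing)
print(f"fixed offset case: {slope_fixed:.6f} s per second of uptime")
print(f"growing case:      {slope_growing:.6f} s per second of uptime")
```

With real logs in place of the synthetic series, a slope that stays near zero over several days would support the fixed-offset explanation, while a slope around 0.003 would point at a different factor.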

Handyman · August 7, 2020, 3:33am

moeller0 · August 7, 2020, 3:55am

You are missing the point I was making, by a bit. This needs to hold for @takimata's line as well.
But I understand from the terseness of your reply that you are done discussing this issue. Fine with me, although it is a bit disappointing that we are seemingly completely missing a proper theory of why there seems to be an intriguing base-10 to base-2 mismatch somewhere between the DSL chips (CPE or DSLAM) and the Linux side of Lantiq SoCs.
Also, if we actually trust the system walltime more, why are we not simply using that and ignoring the DSL time?

Handyman · August 7, 2020, 4:44am

I am not missing the point and I am not done discussing the issue.
The reason I made this thread is to discuss this issue and hear everyone's input.

You asked for proof that my hypothesis has merit and I gave it to you.
The delta holds: it remains constant over a long period of time and across reboots.
The only variables are the module loading time and the line sync time, which have no bearing on the drift.

Maybe a developer can answer that for you.

mbo2o · August 7, 2020, 9:34am

Some timer designs use 1000 ticks per second
Other timer designs use 1024 ticks per second

This fix is just correcting a wrong assumption about how many ticks are used by the driver.

It's not rocket science and is very common.

Not having source or documentation for the driver is why it took so long to find.

The variance in the drift around 1.024 factor is due to natural drift between any two timers.

That cannot be guessed or hard-coded. That is the reason 1.024 will be the best that can be done to correct the drift.
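
To illustrate the idea, here is a toy model of a tick-rate mismatch. It is only a sketch of the conjecture in this post: which side actually counts 1000 ticks and which side assumes 1024 is not known, and the function and parameter names are invented.

```python
def reported_seconds(real_seconds: float,
                     actual_ticks_per_second: int,
                     assumed_ticks_per_second: int) -> float:
    """Seconds shown when ticks are counted at one rate but converted at another."""
    ticks = real_seconds * actual_ticks_per_second
    return ticks / assumed_ticks_per_second


one_day = 86_400  # real seconds
shown = reported_seconds(one_day, actual_ticks_per_second=1000,
                         assumed_ticks_per_second=1024)
lag = one_day - shown
factor = one_day / shown

print(f"shown after one real day: {shown:.0f} s (lagging by {lag:.0f} s)")
print(f"correction factor: {factor:.3f}")
```

In this model the counter shows 84375 s after one real day, lagging by 2025 s, and multiplying by 1.024 recovers the real value. Any extra drift between the two clocks would sit on top of that.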

moeller0 · August 7, 2020, 9:57am

This is conjecture at this point, a compelling story, but really just that IMHO a story.

This is also what makes this a story/heuristic. IMHO the question really is, what is the rationale for correcting the returned number, based on our theory. A look at the lantiq driver source shows:

"/**
   PM synchronization modes.
*/
typedef enum
{
   /**
   Free synchronization.
   PM will be synchronized to its startup time. After start of the PM no further
   synchronization to an external clock source is done. */
   DSL_PM_SYNC_MODE_FREE = 0,
   /**
   System time synchronization.
   PM will be synchronized to the system time. The time base is derived from the
   DSL_SysTimeGet function (OS specific). */
   DSL_PM_SYNC_MODE_SYS_TIME = 1,
   /**
   External synchronization.
   PM will be synchronized to the external time network time.
   The host application should call the the function DSL_PM_15MinElapsedExtTrigger
   each 15 minutes. In addition the bOneDayElapsed parameter should be set
   accordingly. */
   DSL_PM_SYNC_MODE_EXTERNAL = 2
} DSL_PM_SyncModeType_t;"

If you would ask me, which you did not, I would say the 1024 timer story we came up with is not as clear cut as we seem to think...

I am not saying, that I know better, and ultimately the 1024 "correction" might be the best we can do. BUT I am sure that the least we should do is to report both the veridical counter as maintained by the dsl subsystem as well as the estimated corrected value.
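
Something along these lines is what I have in mind, shown here in Python rather than in the shell of lantiq_dsl.sh. It is purely a sketch: the labels, function names and the default factor are assumptions, not what the script prints today.

```python
def format_hms(total_seconds: int) -> str:
    """Render a number of seconds as hours, minutes and seconds."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes}m {seconds}s"


def line_uptime_report(raw_seconds: int, factor: float = 1.024) -> str:
    """Show the counter exactly as returned, plus a clearly labelled estimate."""
    estimated = round(raw_seconds * factor)
    return (
        f"Line Uptime (as reported by modem): {format_hms(raw_seconds)}\n"
        f"Line Uptime (estimated, x{factor}): {format_hms(estimated)}"
    )


report = line_uptime_report(49574)
print(report)
```

That way the raw value stays visible and the corrected one is marked as an estimate.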

That said, I only use the output of lantiq_dsl.sh for quick on-line checking of the status, so I do not really care about the truthfulness of the returned values (at least not, if the error is in the single digit percent range).

takimata · August 7, 2020, 10:58am

Nope, I measured the time since the dsl0 interface came up, according to the system logfile. The system itself was actually running for a day before that. Also, my system doesn't take 3 minutes to boot, and my line doesn't need 3 minutes to synchronize, neither would explain the discrepancy.

(I would appreciate if you didn't work off the assumption that I'm completely thick and just measuring anything to disprove you. I am not your enemy, even if you by now probably think I am.)

Handyman · August 7, 2020, 11:08am

Never crossed my mind that you are thick.
I am actually enjoying having an intelligent conversation with everyone who participated in this topic and getting the feeling that we are going somewhere.

I ran the numbers that you provided, line uptime multiplied by a drift factor of 1.0274 resulted in the system uptime that you provided.