20.02.2-rc2 on BT Home Hub 5a (Lantiq) initial report

This is a very short initial summary.

1: No big problems. There are signs of a speed increase but, ADSL or VDSL, re-training may reduce the speed. Most of the change is to upload speed.

2: Lantiq kit has had problems with timer drift. The Connection and Line Uptime readings in the Overview increase at different rates. This is an old problem, but there was a fix from Handyman: an edit to a config file which has since either moved or been removed. So the drift problem is back.

3: This is a lot better than the rc1 version.

It is still a test version, and I may yet regret trying it.


I did some checking. There have been some big internal changes which completely alter how the familiar LuCI status display is generated. So the old fix for the timing drift can't work, but the root cause of the drift is still there, and the clock for Line Uptime is running fast.

I remember some strange arguments against Handyman's fix, but the lack of an accessible explanation doesn't stop a fix being science or engineering. Think of calibrating a lab instrument.

Losing the option is a backward step.

Sorry, I objected to it on the basis of there being no precise explanation. The theory, that a counter which should increment at 1000 is incrementing at 1024 or the other way around, should have been easy to verify. Short of such a verification, I would recommend reporting both the actual number and the estimated corrected number. If the actual cause were an (unlikely) thermal difference between two clock sources (quartzes), the magnitude of the drift might not be constant, and hence the "fix" might make things worse under certain conditions...
But I am not in any way involved in ACKing, let alone NACKing, patches, so this is just my personal opinion (which I am still prepared to argue for and defend).

Two key things here. The 1024/1000 ratio is all too common. You can find it in the reporting of disk space for hard drives. And, compared to the clock errors from such things as thermal effects, it's a gigantic error.

I'm old enough to remember the days before quartz clocks. A minute a day was tolerable. This is over 30 minutes a day. And mechanical clocks had a rate adjustment as well as the time-setting knob. My wristwatch needed winding every day anyway. Same knob, getting the time signal or Big Ben on the radio was daily routine.

And the current situation for this version of OpenWRT is that we don't have a way of applying any correction.

If you don't want to fix it on your system, that's your privilege.

This is a different issue: for storage memory 2^10 was quite common, while for networking and time base 2 never was common...

That clearly depends on the temperature divergence from the design temperature; see e.g. https://ieee-uffc.org/frequency-control/educational-resources/introduction-to-quartz-frequency-standards-by-john-r-vig/introduction-to-quartz-frequency-standards-static-frequency-versus-temperature-stability/ for an example of multi-second variation over a day caused by temperature...

What stops you from taking the reported number of seconds and multiplying it by your desired correction factor of 1000/1024?

I clearly am miscommunicating here. I advocate reporting something like:
"reported uptime N days; M hours; O minutes; P seconds ; estimated wall clock time: Q days; R hours; S minutes; T seconds"
My point is, unless we know a correction is actually improving things unconditionally, it seems cautious to also report the original number....
But hey, as I said, I am not in a position to ACK or NACK any such change anyway.
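The dual report proposed above could be sketched roughly as follows. This is only an illustration in C: the names `split_dhms`, `scale_seconds` and `format_uptime_report` are made up for this example, and the correction ratio is passed in as `num/den` because the thread has not settled which way round it should go.

```c
#include <assert.h>
#include <stdio.h>

struct dhms {
    unsigned long long days;
    unsigned hours, minutes, seconds;
};

/* Split a count of seconds into days, hours, minutes and seconds. */
struct dhms split_dhms(unsigned long long secs)
{
    struct dhms t;
    t.days = secs / 86400;
    t.hours = (unsigned)((secs / 3600) % 24);
    t.minutes = (unsigned)((secs / 60) % 60);
    t.seconds = (unsigned)(secs % 60);
    return t;
}

/* Scale a reported count by num/den (hypothetical correction ratio). */
unsigned long long scale_seconds(unsigned long long reported,
                                 unsigned long long num,
                                 unsigned long long den)
{
    assert(den != 0);
    return reported * num / den;
}

/* Write the raw value and the estimate side by side.
 * Returns whatever snprintf returns. */
int format_uptime_report(char *buf, size_t len, unsigned long long reported,
                         unsigned long long num, unsigned long long den)
{
    struct dhms r = split_dhms(reported);
    struct dhms e = split_dhms(scale_seconds(reported, num, den));
    return snprintf(buf, len,
        "reported uptime %llu days; %u hours; %u minutes; %u seconds; "
        "estimated wall clock time: %llu days; %u hours; %u minutes; %u seconds",
        r.days, r.hours, r.minutes, r.seconds,
        e.days, e.hours, e.minutes, e.seconds);
}
```

Keeping the raw number in the output means a reader can still undo the estimate if the assumed ratio turns out to be wrong for their hardware.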

You're comparing 20 seconds per day from temperature with over 30 minutes per day.

It's such a gigantic difference that it makes me wonder if I can trust your observations on anything.

I guess it is time to retire this sub-thread; we both stated our positions but apparently failed to get through to each other. Luckily, you do not need to trust my observations on anything, nor does my stated position have any effect on what OpenWrt will or will not do with the uptime counter; so all should be fine.

The difference between Line Uptime and the other timers on the Overview page still matches the Handyman calculation. That is, the time in seconds shown by that timer needs to be multiplied by 1.024 to match the other timers. The System and Network uptime values have a different start time, by a few minutes, System first, then the DSL connection is established, and only then does the PPPoE for the Network start up.

I don't know for sure all three timers depend on the same internal "tick", the same quartz crystal, or if the Line Uptime uses a "tick" in the DSL signalling, but in my experience, the Handyman-observed ratio has worked on my hardware. He gave us 6 decimal places, which has been reported as being correct by several independent observers.

This is the sort of precision that you would expect from the clock on your wall, maybe not quite at the level of a clock synced to NTP, but adequate for human perception. You can look at the three figures and tell the difference between a power cut and the DSL dropping.

I have, incidentally, known the PPPoE go down without the DSL being affected. It's unusual, but they are different things.
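Reading the three figures together, as described above, could be sketched like this. It is a rough illustration only: the enum, the function name and the `slack_s` allowance for the normal few minutes between System, DSL and PPPoE start-up are all assumptions made for this example, and the Line Uptime passed in is assumed to have been corrected already.

```c
/* Which layer appears to have restarted most recently. */
enum last_event {
    EVENT_REBOOT,           /* all three timers started together, e.g. power cut */
    EVENT_DSL_RETRAIN,      /* line came up long after the system did */
    EVENT_PPPOE_RECONNECT   /* line is old, only the PPPoE session is young */
};

/* All arguments are in seconds. slack_s is the gap tolerated between
 * consecutive timers during a normal start-up sequence. */
enum last_event classify_uptimes(long system_s, long line_s,
                                 long network_s, long slack_s)
{
    if (system_s - line_s > slack_s)
        return EVENT_DSL_RETRAIN;
    if (line_s - network_s > slack_s)
        return EVENT_PPPOE_RECONNECT;
    return EVENT_REBOOT;
}
```

With an uncorrected Line Uptime the first test would eventually trip on its own, since the gap to System uptime keeps growing, which is one reason the comparison only works once the timers agree.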

And let me repeat this: thermal effects on quartz crystals need an extreme temperature variation to reach 20 seconds per day. We're talking halt-and-catch-fire levels as the maximum. The 1.024 factor is over 30 minutes per day.

These are two very distinct effects.
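The sizes of the two effects can be checked with a little arithmetic. A minimal sketch in C, assuming the 1024/1000 ratio holds exactly; the function names are invented for this example.

```c
#include <stdint.h>

#define SECONDS_PER_DAY 86400

/* Milliseconds of disagreement that build up over one day when one clock
 * runs at num/den times the rate of the other (num >= den assumed). */
int64_t drift_ms_per_day(int64_t num, int64_t den)
{
    return (int64_t)SECONDS_PER_DAY * 1000 * (num - den) / den;
}

/* The same rate error expressed in parts per million. */
int64_t drift_ppm(int64_t num, int64_t den)
{
    return 1000000 * (num - den) / den;
}
```

For 1024/1000 this gives 2,073,600 ms, that is about 34.6 minutes per day, or 24,000 ppm. A 20 seconds per day error is roughly 231 ppm, so the two differ by a factor of about a hundred.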

The changes between v19.07 and v20.02 change how that 1.024 factor can be applied. The new system is more efficient but, somewhere between the Lantiq chip and the LuCI page displayed, a calculation is done, taking timer "ticks" and converting them to days, hours, minutes, and seconds.

I know this problem was noticed about a year ago. v20.02 has been in development a long time. I am surprised that nobody seems to know where a correction factor could be applied (this is not a question of whether it should be applied at user level or pre-installed).

I have doubts whether any firmware-blob will be found that fixes this problem. We are talking rather old hardware now. But if one is found, DOCUMENT IT! Tell everyone that they don't need to correct the drift problem.

Remember, I am using a version of OpenWRT which is specifically built for this hardware. The drift problem is not subtle (30+ minutes per day!) and it's not new.

If there is some magic firmware-blob that corrects the drift, but doesn't work on a Home Hub (assume that's possible), the correction factor, clearly documented, can be applied just to the Home Hub. It could be a conditional, or it could be a clearly labelled line that is normally commented out; there are several ways of doing such things.

If I had a clock on my wall that drifted over 30 minutes per day, it had better have serious individual value.

I have decided to revert to OpenWRT v19.07.07, the current stable version.

Lantiq has been an Intel subsidiary for a long time now, and the Home Hub 5A is a very old piece of hardware. I doubt there is ever going to be any official "fix" to the time drift problem.

Until some way of fixing the drift can be found in the 20.02 series, it doesn't matter whether it is something built into the release, or some mod that a user implements, as Handyman has done for 19.07, it's effectively broken for this hardware.

It's a pity, it's good hardware, even after all these years.

Is the uptime counter the only reason for you to downgrade to 19.07 or are there other reasons as well?
If it is the former, what use case do you have where veridical uptime logging is so important that you cannot work around it by alternative means of uptime sampling (like using a cron job every X minutes to test whether reported uptime increased monotonically and then just adding 15 minutes to your robust uptime estimate)?
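The sampling workaround suggested here could look something like the following. This is a sketch only: the struct and function names are hypothetical, and it assumes something (a cron job, say) calls `uptime_sample` once per fixed interval with the counter value it has just read.

```c
struct uptime_tracker {
    long last_reported;   /* counter value seen at the previous sample */
    long robust_seconds;  /* estimate built from the number of good samples */
};

/* Call once per sampling interval. The reported counter is used only to
 * detect a restart; elapsed time is counted in whole intervals. */
void uptime_sample(struct uptime_tracker *t, long reported, long interval_s)
{
    if (reported > t->last_reported)
        t->robust_seconds += interval_s;  /* still up: credit one interval */
    else
        t->robust_seconds = 0;            /* not increasing: assume a restart */
    t->last_reported = reported;
}
```

The estimate is only as fine-grained as the sampling interval, but it does not depend on the rate at which the reported counter advances.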

This division was sold to MaxLinear some time ago....

OK, so we now have rc3

I doubt I shall bother. Since I switched back to 19.07 things have just been working. The only interruption I had was a power cut.

There's no sign of any way of fixing the Lantiq time drift. Does it matter? It's over half an hour per day. It's not trivial. But it seems nobody knows how to fix it, and that worries me.

While I am of a different opinion on the severity of the uptime reporting issue, I think it is worth mentioning that a post in the forum is no replacement for either a real bug report or, even better, a patch that actually solves the problem (it is much easier on the core developers to just bless a decent patch than to come up with a solution from just a problem report).
P.S.: If you truly believe you understand the root cause of the issue, the right place to fix it would be the lantiq driver code that converts the blob's reported values into wall clock time... at least that would squelch the issue for good....

I am totally unsurprised.

So our so helpful correspondent suggests switching to some alien dialect, which even at its closest to English is describable as jargon, or maybe argot. As for writing the patch myself, I would have to find the code, understand the language it is written in (a big assumption there), and pin down what amounts to the correct verse in a long piece of formal poetry.

No, not poetry. That, at least, was mostly written to be memorable. I do know why the plums were in the icebox.

Look, you are the one claiming that a cosmetic uptime reporting issue is a show stopper that is making you switch back to an older release; I just pointed out that repeatedly mentioning that issue in this thread is not likely to result in any change.

Maybe have a look at https://dev.iopsys.eu/intel/drv_dsl_cpe_api/-/blob/master/src/include/drv_dsl_cpe_pm_core.h and you will find:
#define DSL_PM_MSEC (1000)

Which seems relevant because of https://dev.iopsys.eu/intel/drv_dsl_cpe_api/-/blob/master/src/pm/drv_dsl_cpe_api_pm.c doing:

/* Fill Total Counters elapsed time*/
   pCounters->total.nElapsedTime = DSL_DRV_PM_CONTEXT(pContext)->nPmTotalElapsedTime/DSL_PM_MSEC;

and https://dev.iopsys.eu/intel/drv_dsl_cpe_api/-/blob/master/src/pm/drv_dsl_cpe_pm_core.c doing:

         /* Update current showtime elapsed time*/
         DSL_DRV_PM_CONTEXT(pContext)->nCurrShowtimeTime   += (msecTimeFrame/DSL_PM_MSEC);

If I was concerned about the reporting, that is the parameter I would look at modifying (assuming I knew for a fact that the value of 1000 is wrong).

The language is C... that should give you enough information to come up with at least a proof of principle patch you can test and then post on the developer list for discussion.
In case you wonder why I do not offer to do this: I stopped using my HH5A/lantiq xrx200 modem, so I am not set up for any meaningful testing (and I consider the uptime reporting to be cosmetic in nature, not worth my time to create, test and try to upstream a patch; others might disagree).
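As a starting point for such a proof of principle, the effect of a scaled conversion can be tried outside the driver first. The sketch below is standalone C, not a patch: only `DSL_PM_MSEC` and its value come from the quoted header, while `PM_SCALE_NUM`, `PM_SCALE_DEN`, both function names and the integer types are assumptions made for illustration. Whether the ratio is right, and which way round it goes, would still have to be verified on real hardware.

```c
#include <stdint.h>

#define DSL_PM_MSEC (1000)   /* value quoted from the driver header */

/* Hypothetical correction ratio; unverified, including its direction. */
#define PM_SCALE_NUM 1024
#define PM_SCALE_DEN 1000

/* Convert a millisecond counter to seconds the way the quoted lines do. */
uint32_t pm_elapsed_plain(uint64_t msec)
{
    return (uint32_t)(msec / DSL_PM_MSEC);
}

/* Same conversion with the ratio applied before the division, so the
 * fractional part is not thrown away too early. */
uint32_t pm_elapsed_scaled(uint64_t msec)
{
    return (uint32_t)(msec * PM_SCALE_NUM /
                      ((uint64_t)DSL_PM_MSEC * PM_SCALE_DEN));
}
```

For one nominal day of milliseconds (86,400,000) the plain conversion gives 86,400 s and the scaled one 88,473 s. If arithmetic like this were moved into kernel-side code on a 32-bit target, the 64-bit division would probably need one of the kernel's division helpers rather than a plain `/`.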
