Collecting statistics from LEDE devices (telemetry)

makro · February 7, 2017, 8:57pm

Collecting statistics was brought up on the mailing list earlier, primarily to track successful flashes/upgrades per board/device. I had been thinking about collecting statistics before that, for other purposes that do not conflict with tracking successful flashes. Communities tend to generate a lot of popularity questions, and seeing countless threads asking which device is the most popular (in both the LEDE and OpenWrt forums over time) led me on to this. I'll start with two polls covering the basic questions, and elaborate on my further thoughts afterwards.

LEDE should collect statistics/telemetry
LEDE should not collect statistics/telemetry at all


LEDE should collect statistics/telemetry by default
LEDE should not collect any statistics/telemetry by default


What data to collect and why

This is a list of data points I have considered and what makes them relevant; it is not meant to be exhaustive.

Board/device name and release/revision information - Tracks successful software/hardware combinations and shows which devices are most used; could potentially track use of community builds if they identify themselves
Whether the image is dirty (LEDE source has been modified, look for rXXXX+Y in the revision) - Closely related to the previous point: is the device running vanilla LEDE?
List of installed packages and potentially versions - Shows popular packages, and helps discover widespread use of outdated packages as well as little-used packages
Flags/points derived from the package list for packages of particular interest - Shows which wireless drivers are popular (e.g. kmod-ath9k, kmod-mt76...), LuCI or no LuCI etc.
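
As a rough sketch of the last two points, the dirty check and the package flags could be derived on the device roughly like this. The function names and the flag names are made up for illustration, and the input format assumes the "name - version" lines that opkg list-installed prints.

```shell
# Sketch only: is_dirty and package_flags are hypothetical helper names.

# Succeeds when the revision carries a "+Y" suffix (rXXXX+Y), which marks
# local changes on top of the upstream revision rXXXX.
is_dirty() {
    case "$1" in
        r[0-9]*+[0-9]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Reads "name - version" lines on stdin and prints one line of 0/1 flags
# for a few packages of interest.
package_flags() {
    awk '
        $1 == "kmod-ath9k" { ath9k = 1 }
        $1 ~ /^kmod-mt76/  { mt76 = 1 }
        $1 == "luci"       { luci = 1 }
        END { printf "ath9k=%d mt76=%d luci=%d\n", ath9k, mt76, luci }
    '
}
```

On a device this would be fed with opkg list-installed | package_flags. Which packages deserve a flag is an open question; the three above are only the examples named in the list.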

How to collect

My idea is to use a script on LEDE devices that will collect the data of interest, and submit it to a server using HTTP POST. uclient-fetch is included and supports sending POST data, as well as HTTPS if one of the libustream variants is installed. A script on a web server receives this data and saves it in a database. Periodically, another script runs through the database and generates pretty web pages with aggregated statistics for users to browse, search and filter (my thought was that generating statistics on the fly per request could be fairly slow).
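
A minimal sketch of the client side, under several assumptions: the server URL is a placeholder (no such server exists yet), the field names board/release/revision are invented, and the encoder only handles the few characters likely to show up in these values rather than being a complete URL encoder. Reading /etc/openwrt_release and /tmp/sysinfo/board_name is one plausible source for the values, not a decided design.

```shell
# Placeholder endpoint, purely hypothetical.
STATS_URL="https://stats.example.org/submit"

# Minimal percent-encoding for form data. '%' is handled first so that the
# escapes produced by the later rules are not encoded a second time.
urlencode() {
    printf '%s' "$1" | sed -e 's/%/%25/g' -e 's/+/%2B/g' -e 's/&/%26/g' \
        -e 's/=/%3D/g' -e 's/ /%20/g'
}

# build_payload BOARD RELEASE REVISION -> form-encoded POST body
build_payload() {
    printf 'board=%s&release=%s&revision=%s' \
        "$(urlencode "$1")" "$(urlencode "$2")" "$(urlencode "$3")"
}

# Collects the values on the device and posts them. Not called here.
submit_report() {
    . /etc/openwrt_release
    board=$(cat /tmp/sysinfo/board_name)
    uclient-fetch -q -O /dev/null \
        --post-data="$(build_payload "$board" "$DISTRIB_RELEASE" "$DISTRIB_REVISION")" \
        "$STATS_URL"
}
```

The '+' in a dirty revision such as r3205+12 has to be escaped, because a form decoder on the server would otherwise read it as a space.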

Conceptual issues

There will be concerns about any kind of data collection or dialing home, so we need to consider what data we really want (if any), and inform users appropriately. We also need to decide whether this package should be included by default, or optional. Some filtering of data could also be useful: we could, for example, only show packages that exist in official feeds, so that a custom package in someone's image would be ignored by the server collecting statistics.

Another issue is keeping the statistics relevant (in my mind the goal is up-to-date statistics, not historical statistics). My plan is to have the device generate a random device ID on firstboot, and include this ID on each submission (let's say the default is to submit an updated report every 48 hours) so the server knows whether it's a new device or one that has submitted before. If a device ID has not submitted any reports for, say, 30 days, the data about it is deleted and no longer counted. As tying each submission to a unique ID will likely cause further concern from some users, I suggest some data points could be optional - opt-in or opt-out remains to be decided. The UCI configuration could look like this (the device has generated an ID, and the user does not want to submit the package list):

config statistics
    option 'device_id' '5dcb530e6cb90b63c75ab8792a0176792bb1bae433312f6d7891b108653f7db0'
    option 'submit_release' 1
    option 'submit_revision' 1
    option 'submit_board' 1
    option 'submit_packages' 0
    option 'submit_dirty' 1
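
A sketch of how the ID could be generated and stored on first use. It assumes the configuration above lives in /etc/config/statistics (the file name is not decided anywhere in this post) and that sha256sum is available on the device; the function names are hypothetical.

```shell
# Hashes 32 random bytes into a 64-character hex string. Nothing from the
# device itself (MAC address, serial number) goes into the ID.
generate_device_id() {
    dd if=/dev/urandom bs=32 count=1 2>/dev/null | sha256sum | cut -d' ' -f1
}

# Returns the stored ID, creating and saving one if none exists yet.
# Assumes a "config statistics" section in /etc/config/statistics.
ensure_device_id() {
    id=$(uci -q get "statistics.@statistics[0].device_id")
    if [ -z "$id" ]; then
        id=$(generate_device_id)
        uci set "statistics.@statistics[0].device_id=$id"
        uci commit statistics
    fi
    printf '%s\n' "$id"
}
```

A user who wants a fresh identity could simply delete the option, and the next report would then look like a new device to the server.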

Reliability is another issue - there is little that stops spammers or trolls from filling this database with garbage data. Rate limiting submissions based on IP addresses and accepting only expected input (release names that are real, revision numbers that look like actual revision numbers etc.) are measures I can think of.
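
To make "accepting only expected input" a bit more concrete, here is a sketch of format checks for three fields. The patterns are assumptions about what release names, revisions and device IDs look like (snapshot builds, for instance, would need their own rule), and a real server would more likely do this in whatever language it is written in.

```shell
# A literal newline, used to reject multi-line values up front because
# grep matches line by line.
NL='
'

# matches VALUE ERE: succeeds only for a non-empty, single-line VALUE
# that matches the extended regular expression ERE in full.
matches() {
    case "$1" in ''|*"$NL"*) return 1 ;; esac
    printf '%s\n' "$1" | grep -Eq -e "$2"
}

# e.g. 17.01.0 or 17.01.0-rc2
valid_release() {
    matches "$1" '^[0-9]{2}\.[0-9]{2}\.[0-9]+(-rc[0-9]+)?$'
}

# e.g. r3205, r3205+12 or r3205+12-59508e3
valid_revision() {
    matches "$1" '^r[0-9]+(\+[0-9]+)?(-[0-9a-f]{7,40})?$'
}

# 64 lowercase hex characters, as in the UCI example
valid_device_id() {
    matches "$1" '^[0-9a-f]{64}$'
}
```

Format checks like these only keep obvious garbage out; a well-formed but fake report still passes, which is why rate limiting is needed as well.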

Implementation issues

I don't know if this kind of volume justifies using a message broker like RabbitMQ, or if that's overkill. My main concern, which a message broker would solve, is that the database could be overloaded with many simultaneous device reports if it's a synchronous operation (the script writes to the database as soon as it receives data). The web server could probably handle it, but especially with package lists it could lead to complex/large INSERT queries depending on the data model.

What I could contribute

Writing the client part (seems easy)
Implementing the server side parts (medium+ - I understand the logic, but will need time to get things right)
Writing a script to generate simple HTML reports
Wiki documentation

What I would need help with

Making the browseable reports pretty and functional (e.g. filtering)
Making a LuCI application
Infrastructure - for development and testing free AWS/other cloud things will work, but I don't have infrastructure to provide for long term deployment

stangri · February 7, 2017, 9:37pm

If it's all uci-based, I could totally use my newfound knowledge to write a luci app.

As much as I personally love the idea, I don't think many will agree to having this reporting enabled by default. It would still be a win for the project if the code were included in default images and instructions on how to enable reporting (along with a detailed description of what information is collected and reported) were included in the "first boot" document.

jimzhong · February 8, 2017, 8:19am

I agree that reporting should be disabled by default. One idea, I think, is to prompt the user to enable telemetry the first time they visit the LuCI web interface.

CereS · February 8, 2017, 10:46am

The user could also be prompted to allow telemetry just ONCE to support LEDE statistics.
Something like: Report telemetry data?
Yes | Once | No

charcoal · February 8, 2017, 11:28pm

I have not voted yet.

I do like to see stats, but I also find privacy very important.

This leads to my question: is it possible to do telemetry without damaging privacy?
Say, for instance, by using TOR (.onion) as a tool for this, even though this would make all LEDE users appear on the TOR network... Other options?

If there is a way to preserve privacy then I would say YES to telemetry.
If telemetry hurts privacy then I would say NO to telemetry.

makro · February 9, 2017, 7:53pm

That's a possibility, although we would need to consider how to make it useful. As described in my initial post, my idea is to make the statistics an up to date representation of the LEDE user base (that wish to report their data, at least). Your suggestion is well suited for tracking "revision X works with board Y", which is also useful information.

Privacy isn't a binary choice, where you have either complete or no privacy. Any kind of data collection directly or indirectly related to a person will have some impact on their privacy (strictly speaking this isn't about persons but devices, but someone owns and uses it, so let's go with it anyway). In some cases the impact is purely theoretical, in other cases the impact is quite severe. What information you collect and how you collect, retain, use and protect the information can affect the privacy impact positively or negatively. Only collecting information voluntarily is an example of a measure that will reduce the privacy impact. Ultimately you need to find a solution where the privacy impact is acceptable, for both users and the LEDE project itself.

As for using TOR, that does one very specific thing: it masks the IP address of the device submitting information. It does not change the fact that information about the device and installed software is tied to a random, unique ID in the database. Identifying who owns the device based on this alone would be difficult, but the tie between an ID and the rest of the information reported remains - and that in itself affects the privacy impact. I see no reason to save the IP address of a submitting device, it would only be saved in generated logs, firewall state tables etc., not as part of reports.

Using TOR is impractical though. It is ~800 kB on its own, and it pulls in OpenSSL adding another ~720 kB on top. Reporting via TOR would be possible and transparent for those who install the package manually, but adding 1.5 MB of packages to make all reports anonymous isn't worth the cost. It would eliminate many devices (all devices that would be interesting to know about in the 4 MB flash debate) and users who would rather use the space for other things (it would eat a lot of usable space on 8 MB devices too).

At this point 1/3 is against collecting any statistics at all, and a large majority is against collecting statistics by default. Roughly 10% of users who have posted something on the LEDE forums at all (counted all with at least 1 reply) have voted. I'd say that is enough interest to give it a try and see where it goes, both in terms of how many decide to submit their data and how useful the results are. I'll post a new thread in the community projects category when I have something worth showing - might take a few weeks due to moving houses.

tmomas · February 9, 2017, 9:57pm

@makro The option LEDE should not collect any statistics by default was set a bit too narrow IMHO.

To get a clearer picture, I would propose for the next voting (when you have time and the idea has matured a bit):

Board/device name and release/revision information -> default? [default | optional | not at all]
People are posting in the forum with a nickname/real name which device they use with which release. How could the same information without the name affect the privacy of the user?
List of installed packages -> optional? [default | optional | not at all]

stangri · February 10, 2017, 1:24am

I'd say default on both (however while the board info can be collected on first boot, the list of packages is more fluid -- do you want to collect it frequently and match it to previous reports to avoid duplicates?).

I'd be more interested in how the logs/IP addresses are handled (to make sure they are fully purged). I'd be concerned that the telemetry data (with non-purged IP addresses) + a hack into LEDE/OpenWrt infrastructure + a zero-day exploit could be used to instantly break into a great number of routers which sent telemetry (where the IP is static or hasn't changed yet).

I also think it's a great idea to send telemetry with a delay (up to 72 hours after first boot) and the reasons are four-fold:

To escape "garbage" data where the image is flashed, but then the user quickly changes their mind and flashes back to stock.
To escape "garbage" data when the image is flashed, user messes something up and reflashes.
To let VPN (if used) connection establish before sending telemetry to anonymize the source.
If you are collecting package list only once, it lets the user install their packages.
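
The delay itself is simple to sketch. This assumes od is available on the device and spreads the first report randomly over the 72-hour window, so that devices flashed at the same time do not all report at once; random_delay is a made-up name.

```shell
# random_delay [MAX_SECONDS]: prints a random number of seconds in the
# range 0 .. MAX_SECONDS-1. The default is 72 hours (72 * 3600 = 259200).
random_delay() {
    max=${1:-259200}
    # Read 4 random bytes as one unsigned integer and keep only the digits.
    n=$(od -An -N4 -tu4 /dev/urandom | tr -dc '0-9')
    echo $((n % max))
}
```

A first-boot hook could then run something like sleep "$(random_delay)" before the first submission. A sleep does not survive a reboot, so a real implementation would more likely record a "not before" timestamp and check it from cron.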

tmomas · February 10, 2017, 6:58pm

Because people are making themselves publicly naked on facebook already. The most intimate details get posted. Billions (!) of people do not care the least.

In addition to this: As already mentioned before, users are already posting on the forum, which device they use with which release, and they do this with a name and other information attached to this.

But then, when it comes to anonymous "Board/device name and release/revision" collection by default, without IP, without name, without MAC address, without anything that could help to correlate "username" -> "device/release", without any commercial interest behind this data collection, then suddenly this draws people away from LEDE.

Sorry, I don't understand this logic.

Yes, I use LEDE, and I don't pay a single cent for it! And even better: The forum that solves my problems is also for free! What??? Giving something back? ANONYMOUSLY??? No, never ever, because you never know what someone will do with this information:

callhome: DIR-505 17.01.0-rc2

(Sorry for any exaggeration and sarcasm.)

What reasons could one possibly have to not contribute the above information to LEDE?
How could this information be abused?

Answers to these questions would certainly help to increase my understanding, and a vote will show what others think about callhome: DIR-505 17.01.0-rc2.

richb-hanover · February 12, 2017, 1:23am

+1. I also would be curious to read a serious response to this question.

I would also like to make a perhaps somewhat more difficult request... I would like to see the data collection include a hashed unique ID, so we can filter out duplicates. For example:

callhome: DIR-505 17.01.0-rc2 md5(LAN MAC Address)

Thanks.

makro · February 12, 2017, 1:11pm

That's probably true, I'll consider a second round of polls at some point. I was trying to avoid making it a full-on multi-page questionnaire, settling for go/no-go and default/not-default for now to make it quick to answer for people.

The latest posts made me think of splitting it into two reports: install report submitted on install and sysupgrade (device, LEDE version/revision/dirty), and usage report submitted regularly (device/target platform, package list).

For the record, I agree with you that many people have irrational concerns about data collection (as in: use Facebook, despise collecting the data we're discussing here). But to appear level-headed and attempt to represent both sides, I'll try to answer the questions anyway (assuming we're discussing someone who doesn't use Facebook, and is consistently conscious about data collection).

1. A user living in an oppressive country installs LEDE to use VPN in order to bypass filters, eavesdropping and such from the government. Submitting statistics (prior to VPN setup) will indicate to an eavesdropper that "this guy uses custom router firmware, that is suspicious" (VPN would be a red flag anyway if detected, though).
2. A user's ISP only allows use of ISP-issued equipment. If the ISP detects that LEDE statistics are submitted, they may apply sanctions towards the user.
3. A user's ISP listens for statistics submissions and sells this info to ad-trackers (Verizon has already been caught injecting headers for ad-trackers into users' HTTP traffic, so this one isn't far-fetched IMO).
4. The client part of LEDE (uclient-fetch at the idea stage, and the script to call it and submit data) has a security vulnerability. When the device makes the outbound connection to report statistics, someone in a position to man-in-the-middle the connection (malicious ISP/government, attacker hijacking the domain name or IP address of the report server) can inject malicious data exploiting the vulnerability on the device.

What about HTTPS/encryption? That negates only scenario 4. It hides the details in scenarios 1-3, but "device contacts LEDE statistics server" is still in the clear for the attacker - and that is all the attacker needs in those cases.

Whether it is LEDE's or the user's responsibility to avoid these scenarios is of course another discussion, and goes back to the point in my previous post: we can't avoid risk/impact completely, we have to find a risk/impact level that is acceptable.

That is my plan (far down in my initial post, easy to miss), although I'd make it random, e.g.: dd if=/dev/urandom bs=1M count=1 | sha256sum. Collecting the hash of a MAC address is as good as collecting the MAC address itself: it takes little effort to hash all possible MAC addresses and reverse a reported hash. We (well, I) don't want the MAC address, so we shouldn't collect it.
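
To illustrate why a hashed MAC address gives little protection, here is a toy version of the reversal. It assumes the MAC was hashed as a lowercase, colon-separated string with md5sum, and it only searches the last octet (256 candidates) to stay fast. With a known vendor prefix the real search space is three octets, 2^24 = 16,777,216 candidates, which is still a small job. The MAC address used is a made-up example.

```shell
# Hash a MAC address string the way a client might (assumed format).
mac_hash() {
    printf '%s' "$1" | md5sum | cut -d' ' -f1
}

# crack_last_octet PREFIX TARGET_HASH
# Tries every value of the final octet for a known five-octet prefix and
# prints the MAC address whose hash equals TARGET_HASH.
crack_last_octet() {
    prefix=$1
    target=$2
    i=0
    while [ "$i" -lt 256 ]; do
        candidate=$(printf '%s:%02x' "$prefix" "$i")
        if [ "$(mac_hash "$candidate")" = "$target" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
        i=$((i + 1))
    done
    return 1
}
```

For example, crack_last_octet "00:11:22:33:44" "$(mac_hash "00:11:22:33:44:a7")" prints the original address back. A random ID has no such structure to search.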

When it eventually is up and running, I intend to document these things. For one, IP addresses won't be tied to reports - for rate-limiting we only need to record "[IP address] submitted 3 reports in last 60 minutes", we don't need "[IP address] sent report ID X, Y and Z".
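
A sketch of rate limiting that stores only "timestamp IP" pairs and nothing about which report was sent. The log path, the function name and the defaults (3 reports per 3600 seconds, taken from the example above) are assumptions, and a real server would also prune old lines.

```shell
# Hypothetical location of the rate-limit log: one "epoch ip" pair per line.
RATE_LOG=${RATE_LOG:-/tmp/lede-stats-rate.log}

# allow_submission IP NOW [LIMIT] [WINDOW_SECONDS]
# Succeeds and records the attempt if IP has made fewer than LIMIT
# submissions in the last WINDOW_SECONDS; fails without recording otherwise.
allow_submission() {
    ip=$1
    now=$2
    limit=${3:-3}
    window=${4:-3600}
    touch "$RATE_LOG"
    count=$(awk -v ip="$ip" -v since="$((now - window))" \
        '$2 == ip && $1 > since { n++ } END { print n + 0 }' "$RATE_LOG")
    [ "$count" -lt "$limit" ] || return 1
    printf '%s %s\n' "$now" "$ip" >> "$RATE_LOG"
}
```

The receiving script would call allow_submission "$REMOTE_ADDR" "$(date +%s)" before touching the database. The log never contains a device ID or report ID, so it cannot be joined against the reports.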

slh · February 12, 2017, 6:05pm

Wouldn't it be much less controversial -and let's face it, any kind of automatic phoning-home capability is controversial- to just drop a script for this into the filesystem and advertise its use via /etc/banner (for those without luci) and the first-login page (as in when no password is set), with some explanations and the full contents of the intended submission before it actually gets sent over the wire? This could even be extended with a simple (multiple-choice) questionnaire asking if all major features of said device are indeed working.

makro · February 12, 2017, 6:38pm

Yes, it certainly would. And that's my plan, for now. Make something that works, put the client-side part in feeds/packages, put up a random server somewhere, document things in the wiki, request help for improvements, encourage community builders to consider including the package in their build.

When we have a working proof of concept, we can move on to discuss making it more "official" - move it to base, include it in images, put the server on LEDE-controlled infrastructure etc. And adjust the behavior as the community sees fit.

Thanks for the /etc/banner suggestion by the way. I was thinking of how to reach non-LuCI users, the solution was literally right in front of me at every shell login.

deuteragenie · February 13, 2017, 10:19am

Two things:

Could the statistics gathered be sent over TOR? (and if it does not work, well, too bad, stats are not sent...)
Upon initialization/first use, it would be good to gather the results of some internal speed tests (say, OpenSSL speed benchmarks) and send them to the LEDE eye of Sauron.

tmomas · February 13, 2017, 11:22am

See the answer above:

CereS · February 17, 2017, 2:08pm

The information there isn't complete; there is metadata missing:

callhome: DIR-505 17.01.0-rc2
: IPv4 a.b.c.d or IPv6 ..

On a private internet line, this address is as good as your physical address.
And there it stops being anonymous.
Where will the data be stored, who has access to it and how secure would the database be?
Imagine an attacker getting access to that DB and knowing a possible remote exploit: they could instantly take over all LEDE devices listed in it.
LEDE should be secure by default, and keeping these statistics optional is a part of that in my opinion.

Don't get me wrong, I surely would opt in to that, but for me using open source software is about having a choice.

tmomas · February 17, 2017, 7:20pm

There's nothing missing. That's all that should later be in the database:

<date+time> DIR-505 17.01.0-rc2

But as we have seen above already, simply the connection to LEDE servers, without any further data, could lead to $something_horrible, while transmitting GB via VPN or downloading the VPN packages from LEDE servers for installation (i.e. VPN not up yet) will have no negative consequences.

(Sorry for slightly drifting into sarcasm again, but to me, most of the reasons against are highly theoretical)

But obviously, not transmitting any data and not connecting to any LEDE server will completely eliminate any additional risk. No doubt.

stangri · February 19, 2017, 2:54am

Can telemetry be submitted thru a proxy service which will strip the source IP address?

tmomas · February 19, 2017, 5:55pm

Which of the countries listed in the LEDE website statistics would you categorize as oppressive?

stangri · February 19, 2017, 9:05pm

I'd say 3 of the top-10 entries qualify, as use of the internet is (or, in the case of Unknown, can be) restricted by the government there.