I'd like to get a better picture of the number of installed Fedora systems, without compromising privacy. Right now, we count IP addresses per day, but this is unreliable due to network topology. Let's do what openSUSE does (https://en.opensuse.org/openSUSE:Statistics) and give each system a UUID.
I don't think this needs to be sent for all requests, just the metalink/mirrorlist ones. And to be really sure for privacy concerns, this ID could be regenerated every month — my goal is to *count*, not to track.
Possibly related: https://bugzilla.redhat.com/show_bug.cgi?id=1222415 (although that links to a libhif upstream ticket which suggests using `/etc/machine-id` — we definitely *don't* want that).
So, in order to cut down the amount of data that might be tracked, maybe look at making things less unique. The goal is to get an idea of how many systems are behind a NAT per day.
If we take the system UUID and mod it by 2^13-1 (or some other small Mersenne prime), that gives a rotation of 8191 values; so while there will be 'duplicates', we are likely to see the same number at different IPs per day. The system UUID can be controlled by the user and/or regenerated daily, or whatever is needed.
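As a minimal sketch of the reduction described above (the names here are illustrative, not an actual DNF patch):

```python
import uuid

MERSENNE = 2**13 - 1  # 8191, a small Mersenne prime

def bucketed_id() -> int:
    """Reduce a fresh random UUID to one of 8191 buckets.

    Many systems share each bucket, so the value is only useful for
    *counting* a population, not for tracking an individual host.
    """
    system_uuid = uuid.uuid4()  # could be regenerated daily/monthly
    return system_uuid.int % MERSENNE
```

Since the UUID itself is random and short-lived, the reported value carries at most 13 bits of information about any one machine.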
I expect that this has flaws, but hopefully ones which aren't completely demasking.
Specifically, what I want is:
* The randomized UUID
* The CPU architecture
* ID, VERSION_ID, and VARIANT_ID from /etc/os-release
I think using a completely random UUID is better than a modification of system ID or other persistent data.
Note also that we need this at a level that works in the DNF command line, microdnf, and PackageKit.
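To make the requested fields concrete, here is a hedged sketch of collecting them in Python; the function names and dict keys are my own invention, and the actual wire format (user-agent header vs. URL parameter) is not specified here:

```python
import platform
import uuid

def read_os_release(path="/etc/os-release"):
    """Parse /etc/os-release into a dict (values may be double-quoted)."""
    data = {}
    try:
        with open(path) as f:
            for line in f:
                line = line.strip()
                if "=" in line and not line.startswith("#"):
                    key, _, value = line.partition("=")
                    data[key] = value.strip('"')
    except FileNotFoundError:
        pass  # e.g. minimal chroots may lack the file
    return data

def counting_fields():
    """Assemble the fields proposed in this thread for counting."""
    osr = read_os_release()
    return {
        "uuid": str(uuid.uuid4()),            # random, periodically regenerated
        "arch": platform.machine(),           # e.g. x86_64, aarch64
        "os": osr.get("ID", ""),              # e.g. fedora
        "version": osr.get("VERSION_ID", ""), # e.g. 31
        "variant": osr.get("VARIANT_ID", ""), # e.g. workstation
    }
```

Whatever the transport, each of DNF, microdnf, and PackageKit would need to emit the same fields for the counts to be comparable.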
This seems like it's thinking about dnf on host systems - what about in containers?
(In reply to Colin Walters from comment #3)
> This seems like it's thinking about dnf on host systems - what about in containers?
I'm thinking about both. Right now, we only set VARIANT_ID for a few flavors (the Fedora Editions, and Cloud Base for historical reasons). I'd like to change that to be different in our container image, too — maybe just "container".
Is there something more I should be thinking about? This approach, of course, wouldn't link the containers to their host in any way and would count each one uniquely; I can't think of a privacy-preserving way to do that offhand.
> and would count each one uniquely;
Right, what I'm getting at is that this is going to pick up a lot of `docker build` type scenarios, and *not* pick up each time that image is deployed in an e.g. Kube cluster.
That's of course true for "classic host images" like using Packer or whatever to make AMIs, and for custom rpm-ostree imaging. We could perhaps teach rpm-ostree to use different user agents for client vs build/compose side.
Of course on the infrastructure server side we could just have a clear separation between containers versus hosts, but the fact that we don't differentiate the container base image today is exactly what I'm getting at here.
Also, AFAIK today nothing in the dnf/libdnf stack really tries to determine "am I in a container" in a reliable way. That's always been an interesting topic... Changing the base image OS variant makes sense to me offhand, I'm not sure why we didn't do that a long time ago.
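For reference, the usual heuristics look something like the sketch below; none of these markers is authoritative, which is exactly why relying on VARIANT_ID in the base image is more robust:

```python
import os

def probably_in_container() -> bool:
    """Best-effort container check using common runtime markers.

    These are heuristics, not guarantees: marker files can be absent
    (stripped images) or present for other reasons.
    """
    if os.path.exists("/run/.containerenv"):  # created by podman
        return True
    if os.path.exists("/.dockerenv"):         # created by docker
        return True
    if os.environ.get("container"):           # set by systemd-nspawn et al.
        return True
    return False
```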
We have discussed that we're not going to send UUID as it might allow user tracking.
There's user-agent and 'countme' work in progress in bug #1156007
and PR https://github.com/rpm-software-management/libdnf/pull/684
*** This bug has been marked as a duplicate of bug 1156007 ***