1672504 – DNF Better Counting

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	Changes Tracking
Sub Component:
Version:	32
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Matthew Miller
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:	1737516
Blocks:

Reported:	2019-02-05 07:46 UTC by Ben Cotton
Modified:	2021-01-02 16:54 UTC
CC List:	7 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2021-01-02 16:21:27 UTC
Type:	---
Embargoed:
Dependent Products:

Description Ben Cotton 2019-02-05 07:46:14 UTC

This is a tracking bug for Change: DNF Better Counting
For more details, see: https://fedoraproject.org/wiki/Changes/DNF_Better_Counting

Right now, we estimate installed Fedora systems by counting unique IP addresses which show up in our updates mirror statistics. We need better data than that. There are some proposals for more complicated systems, but there is a quick thing we can do now to greatly improve what we have without a gigantic new infrastructure.

Comment 1 Ben Cotton 2019-02-19 20:30:22 UTC

According to the Fedora 30 schedule[1], today is the deadline for changes to be in a testable state. If your change is ready to be tested, please set the status to MODIFIED. If you know your change will not be ready for Fedora 30, you can set the version to rawhide and notify bcotton. For more information about this milestone, see the Changes Policy[2].

[1] https://fedoraproject.org/wiki/Releases/30/Schedule
[2] https://fedoraproject.org/wiki/Changes/Policy#Change_Checkpoint:_Completion_deadline

Comment 2 Matthew Miller 2019-02-19 20:32:28 UTC

Daniel, can you update on the status from the DNF team? Thanks!

Comment 3 Ben Cotton 2019-03-05 21:49:59 UTC

We have reached the Code Complete (100%) milestone in the Fedora 30 development cycle. At this point, all Changes should be fully code complete and ready for testing during the beta freeze. If your Change has reached this milestone, please set the status to ON_QA. If it has not, this Change will be submitted to FESCo to evaluate the contingency plan and decide if the Change will continue in the Fedora 30 cycle.

Comment 4 Stephen John Smoogen 2019-03-09 01:12:32 UTC

I believe we are now at the point of ON_QA. The specifics are met: we can see the version info and we get count info. It is now a matter of integration into the product.

Comment 5 Ben Cotton 2019-04-10 14:43:06 UTC

The implementation of this feature is delayed until Fedora 31, so I am setting the version to Rawhide and changing the status to ASSIGNED, since the client side still does not fully support this.

Comment 6 Michal Domonkos 2019-05-13 07:03:49 UTC

For reference, these are the relevant PRs used to deliver this feature (currently still in progress):

https://github.com/rpm-software-management/libdnf/pull/684
https://github.com/rpm-software-management/dnf/pull/1324

Comment 7 Michal Domonkos 2019-05-13 07:56:47 UTC

The technical BZ tracking this:
https://bugzilla.redhat.com/show_bug.cgi?id=1647454

Comment 8 Ben Cotton 2019-08-13 16:54:04 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to '31'.

Comment 9 Ben Cotton 2019-08-13 19:20:13 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 31 development cycle.
Changing version to 31.

Comment 10 Ben Cotton 2019-08-14 17:56:14 UTC

We have reached the 'Code Complete (testable)' milestone in the Fedora 31 release cycle. If your Change is in a testable state, please set the status to MODIFIED. If this Change will not be ready for Fedora 31, please set the version to rawhide.

The 100% code complete deadline is Tue 2019-08-27.

Comment 11 Ben Cotton 2019-08-27 17:17:58 UTC

We have reached the '100% Code Complete' milestone in the Fedora 31 release cycle. If your Change is complete, please set the status to ON_QA. The Beta Freeze is underway. If you need a freeze exception, see https://fedoraproject.org/wiki/QA:SOP_freeze_exception_bug_process

If this Change will not be ready for Fedora 31, please set the version to rawhide.

Comment 12 Michal Domonkos 2019-08-29 15:00:28 UTC

# SUMMARY #

The first version of this feature implemented in Fedora 31 consists of the following two parts:

(1) A new User-Agent HTTP header that replaces the bare "libdnf" string with:
    libdnf/VERSION (NAME VERSION_ID; VARIANT_ID; OS.BASEARCH)

    e.g.

    libdnf/0.35.2 (Fedora 31; server; Linux.x86_64)
    
    This can be overridden with the newly added "user_agent" config option (see dnf.conf(5) for details).

(2) The "countme=1" flag that is added to a single metalink query once per week (non-incremental at the moment).  This is configurable with the newly added "countme" config option.  The global default is "false".  Since the original proposal talks about this being an opt-out, I have requested enablement in the default Fedora repo configs here:
https://bugzilla.redhat.com/show_bug.cgi?id=1737516

The business logic is also captured in plain English in the following Behave scenario:
https://github.com/dmnks/ci-dnf-stack/blob/countme/dnf-behave-tests/features/countme.feature

## IMPLEMENTATION DETAILS ##

We made privacy-awareness our *priority number 1*, since this is a very sensitive topic and even the appearance of tracking could severely damage the reputation of Fedora (or make the users disable the feature at best).  So these are some measures we took to prevent that:

(1)
The new OS information in the User-Agent header is checked against whitelisted values.  If any of the values is not known, the string will not contain the OS part (in parentheses) at all. Currently, we only allow "Fedora" in, plus all of the official variants ("generic" being a placeholder for any unknown variant or unset VARIANT_ID).  This is to prevent reporting on some rare strings that could potentially turn the User-Agent field into a unique identifier.

(2)
The exact metalink query to which we attach countme=1 is picked randomly out of N queries, where N is the estimated number of metalink requests per machine per week.  This is to avoid "marking" the very first query in a given week, the act of which could be regarded as extra information that we didn't reveal previously.  This way, a countme-flagged request is no different from a regular request (bounded by N).

With an N too small, we would send the flag very close to the first query (N=1 being equivalent to not doing this randomization at all).  With an N too big, we would have a high probability of missing the current week.  For now, we went with N=4, based on the default value of metadata_expire which is 48h, meaning we should see around 3-4 metalink requests per week.  Note that this is just a rough estimate; we can adjust this value in a future DNF update as needed.
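
As a rough illustration of the randomization described in (2), the Python sketch below picks one request slot uniformly out of N and flags only that one. It also reproduces the back-of-the-envelope estimate behind N=4. The names `COUNTME_N`, `pick_flagged_slot` and `weekly_requests` are made up for this example; the real logic lives in libdnf and may well differ in detail.

```python
import random

METADATA_EXPIRE_HOURS = 48   # default metadata_expire mentioned above
HOURS_PER_WEEK = 7 * 24
COUNTME_N = 4                # N from this comment (constant name is hypothetical)


def expected_requests_per_week(expire_hours=METADATA_EXPIRE_HOURS):
    """Rough number of metalink refreshes per week for an always-on machine."""
    return HOURS_PER_WEEK / expire_hours


def pick_flagged_slot(n=COUNTME_N, rng=random):
    """Choose which of the next n metalink requests (1-based) carries the flag."""
    return rng.randint(1, n)


def weekly_requests(n=COUNTME_N, rng=random):
    """Return n booleans, exactly one of which is True (the flagged request)."""
    slot = pick_flagged_slot(n, rng)
    return [i == slot for i in range(1, n + 1)]


if __name__ == "__main__":
    print(expected_requests_per_week())   # 168 / 48 = 3.5
    print(weekly_requests())
```

With n=1 the only slot is always the flagged one, which is the "no randomization" case mentioned above.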

## FUTURE PLANS ##

(1)
The original proposal suggests (as an optional improvement) that we increment the countme flag every week.  While this is trivial to code, I suggest that we re-evaluate the consequences of doing that.  My concern is that, even with the proposed week 60 cut-off, this could still facilitate some tracking.  Given how many systems would eventually fall into those 60 buckets, we could consider that an OK compromise, but it could still be used by the hypothetical adversary to at least narrow down their candidate list, which is a departure from "no privacy loss occurs when you enable countme on your machine".

So, what I would suggest is to consider employing a differentially-private [1][2] mechanism to collect these longevity numbers.  One such example is a method called RAPPOR [3] proposed by Google and implemented in Chromium to safely gather usage statistics (on virtually any strings, even unique ones). It's basically just a more optimized version of the Randomized Response (RR) method [4] and is formally proven to be differentially-private.  I think there is a simple way we could do something similar here (RAPPOR also has a general-purpose open-source implementation [5] but that would be overkill).  In a nutshell, we would just replace the value in "countme=1" with a randomized bitmask where each bit represents a property of the system and is perturbed using the RR scheme [6].

In theory, we could apply a RAPPOR-like method to the User-Agent string as well and perhaps collect even more statistical information about the systems in a differentially private way.

As a simpler (but less private) alternative, we could just reduce the number of buckets from 60 to something much lower (i.e. incrementing every month or so), without doing any kind of randomization.

(2)
Currently, we don't support the PackageKit code path.  That means both the User-Agent and countme features apply only to the DNF CLI at the moment. We are looking into this and are hoping to enable that later, before the F31 code freeze.
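
To make the Randomized Response idea from item (1) more concrete, here is a minimal Python sketch of plain per-bit RR, not RAPPOR itself. Each bit is reported truthfully with probability `keep_prob` and flipped otherwise, and the collector inverts that perturbation on the aggregate. The function names, the 0.75 probability and the simulated population are all assumptions made for illustration; nothing like this exists in libdnf.

```python
import random


def perturb_bits(bits, keep_prob, rng=random):
    """Report each bit truthfully with probability keep_prob, else flip it."""
    return [b if rng.random() < keep_prob else 1 - b for b in bits]


def estimate_true_rate(observed_rate, keep_prob):
    """Invert the perturbation: observed = p*f + (1 - p)*(1 - f), solved for f."""
    return (observed_rate - (1 - keep_prob)) / (2 * keep_prob - 1)


if __name__ == "__main__":
    rng = random.Random(42)
    keep_prob = 0.75
    # Simulated population: 30% of systems have the (hypothetical) property set.
    population = [[1] if i % 10 < 3 else [0] for i in range(20000)]
    reports = [perturb_bits(bits, keep_prob, rng) for bits in population]
    observed = sum(r[0] for r in reports) / len(reports)
    # The estimate should land close to 0.3 even though no single report is reliable.
    print(round(estimate_true_rate(observed, keep_prob), 3))
```

No individual report can be trusted, yet the population-level rate is still recoverable, which is the property that makes the scheme attractive for this kind of counting.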

[1] https://desfontain.es/privacy/differential-privacy-awesomeness.html
[2] https://en.wikipedia.org/wiki/Differential_privacy
[3] http://arxiv.org/abs/1407.6981
[4] https://en.wikipedia.org/wiki/Randomized_response
[5] https://github.com/google/rappor
[6] https://www.usenix.org/sites/default/files/conference/protected-files/usenixsecurity17_slides_tianhao_wang.pdf (see slide 20)

Comment 13 Matthew Miller 2019-08-29 16:57:04 UTC

I definitely appreciate the attention to privacy. You're right: this is an important community priority.

The most important thing we need to distinguish is between short-lived systems and longer installs. I think we should save the complicated RR ideas for a more detailed system census program and keep this simple. With those two things in mind, how about some larger "order of magnitude" buckets: 1 = first week (7 days), 2 = first month (8-30 days), 3 = six months (31-180 days), 4 = older (181+ days). This lets me do interesting analysis without fine-grained breakdown. (And these numbers wouldn't be reset on system upgrade, presumably.)
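
The proposed buckets boil down to a small lookup. The Python sketch below encodes exactly the day ranges listed above; the function name is hypothetical, and treating day 0 as part of the first week is an assumption.

```python
def longevity_bucket_days(age_days):
    """Map system age in days to the coarse buckets proposed above."""
    if age_days < 0:
        raise ValueError("age_days must be non-negative")
    if age_days <= 7:
        return 1   # first week
    if age_days <= 30:
        return 2   # first month (8-30 days)
    if age_days <= 180:
        return 3   # six months (31-180 days)
    return 4       # older (181+ days)


if __name__ == "__main__":
    for days in (1, 7, 8, 30, 31, 180, 181):
        print(days, longevity_bucket_days(days))
```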

One other concern I have is the list of official variants. Where is this list maintained?

Comment 14 Michal Domonkos 2019-08-30 11:35:40 UTC

I agree with you, let's keep it simple.  There are easy ways to balance utility and privacy, just by sanely choosing our buckets, without the need to resort to elaborate RR setups.

These buckets you propose look good to me!  I'll see if I can make this little enhancement in an upcoming libdnf update in the F31 branch.  I second the need to distinguish the short-lived versus long-lived instances and this should be enough.

Regarding the variants list, it's actually dead simple (and ugly at the same time).  We hard-coded it in libdnf:
https://github.com/rpm-software-management/libdnf/blob/4365be61b9eeef54df0dc848d238f99953e284e5/libdnf/utils/os-release.cpp#L39

Where did the list come from?  From a simple
  $ git grep VARIANT_ID fedora-release.spec
in the fedora-release dist-git repo, which yielded what looked like an exhaustive list of official Fedora variants.

This obviously doesn't scale well.  And maybe I was just overly paranoid and we don't have to do the whitelisting and should scrap that thing.  My concern was that if a new Fedora variant comes up or someone makes their own, such a "bucket" of users wouldn't be big enough to ensure anonymity and prevent tracking based on the User-Agent header (esp. given that the header is bundled with each and every HTTP request libdnf makes to any repo, not just those with "countme=true").
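
For illustration, the whitelisting idea could look roughly like the Python sketch below. This is not the libdnf code (which is C++), and it follows the User-Agent format from comment 12. The variant set is a small illustrative subset rather than the real hard-coded list, and the choices to drop the OS part when VERSION_ID is missing and to map unknown variants to "generic" are one reading of the behaviour described in this bug.

```python
ALLOWED_NAMES = {"Fedora"}
# Illustrative subset only; the real list is the one hard-coded in libdnf.
ALLOWED_VARIANTS = {"generic", "server", "workstation"}


def parse_os_release(text):
    """Parse KEY=value lines in os-release format into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"').strip("'")
    return fields


def build_user_agent(libdnf_version, os_release, os_name, basearch):
    """Build the header value, omitting the OS part for non-whitelisted names."""
    base = f"libdnf/{libdnf_version}"
    name = os_release.get("NAME")
    version_id = os_release.get("VERSION_ID")
    if name not in ALLOWED_NAMES or not version_id:
        return base                       # unknown distro: reveal nothing extra
    variant = os_release.get("VARIANT_ID")
    if variant not in ALLOWED_VARIANTS:
        variant = "generic"               # unknown or unset variant
    return f"{base} ({name} {version_id}; {variant}; {os_name}.{basearch})"


if __name__ == "__main__":
    sample = 'NAME=Fedora\nVERSION_ID=31\nVARIANT_ID=server\n'
    print(build_user_agent("0.35.2", parse_os_release(sample), "Linux", "x86_64"))
    # libdnf/0.35.2 (Fedora 31; server; Linux.x86_64)
```

A custom spin with its own VARIANT_ID would be reported as "generic" here, so it blends into a large bucket instead of forming a tiny one of its own.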

Comment 15 Michal Domonkos 2019-10-18 13:03:43 UTC

We just made a few improvements to this feature (currently available in our DNF nightlies[1] and to be released in Fedora 31 after GA):

* The countme flag is now incremented!

    From dnf.conf(5):
    [...]
    The flag is a simple "countme=N" parameter appended to the metalink and
    mirrorlist URL, where N is an integer representing the "longevity" bucket
    this system belongs to.

    The following 4 buckets are defined, based on how many full weeks have
    passed since the beginning of the week when this system was installed: 1 =
    first week, 2 = first month (2-4 weeks), 3 = six months (5-24 weeks) and 4
    = more than six months (> 24 weeks).

    This information is meant to help distinguish short-lived installs from
    long-term ones, and to gather other statistics about system lifecycle.
    [...]

* PackageKit and microdnf now both support countme and the new extended User-Agent format
* Repos using "mirrorlist" instead of "metalink" are now also supported by countme
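
As a sketch of the bucket logic quoted from dnf.conf(5) above, the Python below derives the bucket from an install time and appends it to a URL. It assumes weeks start on Monday 00:00 UTC and that "week 1" is the install week itself. The helper names and the example.org URL are placeholders, not anything from libdnf or the Fedora infrastructure.

```python
from bisect import bisect_right
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# 1-based week numbers at which buckets 2, 3 and 4 begin.
BUCKET_STARTS = [2, 5, 25]


def week_start(moment):
    """Monday 00:00 UTC of the week containing `moment` (alignment is assumed)."""
    moment = moment.astimezone(timezone.utc)
    midnight = moment.replace(hour=0, minute=0, second=0, microsecond=0)
    return midnight - timedelta(days=midnight.weekday())


def countme_bucket(installed, now):
    """Bucket 1-4 from full weeks since the start of the install week."""
    full_weeks = (now - week_start(installed)) // timedelta(weeks=1)
    week_number = full_weeks + 1          # week 1 is the install week
    return bisect_right(BUCKET_STARTS, week_number) + 1


def add_countme(url, bucket):
    """Append countme=N to the query string of a metalink or mirrorlist URL."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append(("countme", str(bucket)))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    installed = datetime(2019, 10, 16, 12, 0, tzinfo=timezone.utc)
    now = installed + timedelta(weeks=6)
    url = "https://example.org/metalink?repo=fedora-31&arch=x86_64"
    print(add_countme(url, countme_bucket(installed, now)))
```

Both datetimes are expected to be timezone-aware; mixing in naive ones would make the week alignment depend on the local timezone.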

Other fixes:
* The countme flag is not resent on download failures anymore
* The libdnf version was dropped from the User-Agent due to privacy concerns

Note that we currently don't include the application name (PackageKit, microdnf etc.) in the User-Agent.  In order to implement this without introducing technical debt, we would need to figure out some technical details, such as: should libdnf detect the application automatically, or should applications extend the User-Agent field themselves?

And as always, for the up-to-date feature specification, please see:
https://github.com/rpm-software-management/ci-dnf-stack/blob/master/dnf-behave-tests/features/countme.feature

[1] dnf copr enable rpmsoftwaremanagement/dnf-nightly

Comment 16 Michal Domonkos 2019-10-18 13:08:10 UTC

Upstream PRs for reference:

libdnf:  https://github.com/rpm-software-management/libdnf/pull/807
librepo: https://github.com/rpm-software-management/librepo/pull/171

Comment 17 Ben Cotton 2020-02-11 15:48:01 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 32 development cycle.
Changing version to 32.

Comment 18 Ben Cotton 2020-02-28 16:31:46 UTC

The Code Complete (100% Complete) deadline has passed. If your Change is 100% complete, please set the status of this bug to ON_QA. If you need to defer to Fedora 33, please set the version to rawhide. A list of incomplete changes is being submitted to FESCo for review.

Comment 19 Matthew Miller 2020-02-28 20:37:48 UTC

We're waiting on the backend implementation, which is work the CPE team has to do.

Comment 20 Zbigniew Jędrzejewski-Szmek 2020-07-01 11:12:58 UTC

Any updates?

Comment 21 Zbigniew Jędrzejewski-Szmek 2021-01-02 12:25:10 UTC

Any updates for 2021?

Comment 22 Matthew Miller 2021-01-02 16:21:27 UTC

(In reply to Zbigniew Jędrzejewski-Szmek from comment #21)
> Any updates for 2021?

Yes! The backend is done and outputting weekly to https://data-analysis.fedoraproject.org/csv-reports/countme/. Will has a Jupyter notebook with analysis examples at https://github.com/wgwoods/fedora-countme-data. I'm working on some scripts to run automatically every week. I think we can call this feature done and move tracking to somewhere else.

Comment 23 Zbigniew Jędrzejewski-Szmek 2021-01-02 16:54:01 UTC

Thanks!
https://github.com/wgwoods/fedora-countme-data/blob/master/jupyter/dnf-countme-pandas-demo.ipynb
is the direct link if anyone wants to look at Will's plots.
