Bug 1376004

Summary: Interrupting PackageKit update then retrying it, incorrectly claims "no updates available"
Product: [Fedora] Fedora Reporter: Alan Jenkins <alan.christopher.jenkins>
Component: PackageKitAssignee: Richard Hughes <rhughes>
Status: CLOSED EOL QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 24CC: jonathan, klember, rdieter, rhughes, smparrish
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-08 17:21:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Description Flags
`packagekitd --verbose` output, from performing steps to reproduce none

Description Alan Jenkins 2016-09-14 12:43:12 UTC
Created attachment 1200822 [details]
`packagekitd --verbose` output, from performing steps to reproduce

Interrupting PackageKit update then retrying, incorrectly claims "no updates available"


Description of problem:

When a PackageKit update is interrupted - by temporarily disconnecting the network - retrying the update doesn't help, and only returns "no updates available".

Additionally, `pkcon -c -1 get-updates` doesn't work either.

I observed PackageKit in a similar state several times.  Now I have found a way that at least reproduces similar behaviour.


Version-Release number of selected component (if applicable):
PackageKit-1.1.3-2.fc24.x86_64

How reproducible:
always

Steps to Reproduce:


1. System with pending updates.  (This can be reproduced e.g. by installing a LiveCD with no network access.  Or perhaps downgrading a package, or enabling a repo with newer packages like fedora-updates-testing.  Packages on your OS which can be downgraded can be found on the fedora package-announce mailing list archive, e.g. http://osdir.com/ml/fedora-package-announce/2016-08/threads.html).

2. PackageKit installed
Interrupting PackageKit update then retrying, incorrectly claims "no updates available"
3. systemctl mask --now packagekit  # For debugging, packagekitd must be run manually

2. Purge PackageKit database:

2a. dnf remove PackageKit  # WARNING this will remove e.g. gnome-software; take note of packages you will want to reinstall.

2b. rm -r /var/cache/PackageKit/*

3. dnf install PackageKit

4. /usr/libexec/packagekitd --verbose  # run packagekit daemon in a terminal window

5. `pkcon get-updates`, wait for "Downloading repository information", then immediately disable your network connection.  pkcon does not time out after a few seconds, so use ctrl+C to cancel the transaction.

6. Re-enable your network connection.

7. pkcon get-updates

CLEANUP: remember to reinstall gnome-software etc. as desired.

Actual results:
Output from step 7:

Getting updates               [=========================]         
Finished                      [=========================]         
There are no updates available at this time.

Expected results:

User request for an upgrade should not be prevented from working, just because a previous upgrade has failed.

I.e. output should be identical to `pkcon refresh force >/dev/null && pkcon get-updates`.  In my current testing:

Getting updates               [=========================]         
Starting                      [=========================]         
Finished                      [=========================]         
Available   	google-chrome-stable-53.0.2785.113-1.x86_64 (google-chrome) 	Google Chrome


Additional info:

Unexpectedly, `pkcon refresh` is sufficient to resolve this state.

On some previous occasions I had to resort to `pkcon refresh force`; neither `pkcon -c -1 refresh` nor `pkcon -c 1 get-updates` were effective.

It seems there's still something I'm missing.  Maybe this reproduction is a different edge case.  I guess if we can fix this it might be instructive and/or help eliminate possibilities if I observe the problem again :).

Comment 1 Alan Jenkins 2016-09-14 15:30:36 UTC
I notice I should add step 7 completes immediately.  I.e. having cancelled step 6 presumably during one of the large repo downloads, step 7 doesn't appear to try downloading any other repo.  There are two large repos: `fedora`, i.e. at time of release, and `updates`.


Initial log analysis:

12:55:35	Hif                 Attempting to update fedora
12:55:35	PackageKit-Hif      got state download-repository with hint (null)
12:55:35	Hif                 update state 509160/50816387
12:55:35	Hif                 update state 1017408/50816387
12:55:35	Hif                 update state 1531448/50816387
12:55:36	Hif                 update state 2036800/50816387
12:55:36	Hif                 update state 2542152/50816387

Last "update state" - network disconnected just after this point.

12:55:39	PackageKit          CreateTransaction method called

5 other get-updates transactions get created, but by a different "sender" (?)
There are exactly 5 repos in total - coincidence?

12:55:42	PackageKit          Cancel method called on /17_bcecccbc
12:55:42	PackageKit          same sender, no need to check uid
12:55:42	PackageKit          emitting allow-cancel 0
12:55:42	PackageKit          opened logind fd 34
12:55:42	PackageKit          6 transactions in list, 6 committed but not finished
12:55:42	Hif                 releasing lock 1
12:55:42	Hif                 Skipping refresh of fedora: cannot update repo 'fedora': Yum repo downloading error: Downloading error(s): repodata/8dcc989396ed27fadd252ba9b655019934bc3d9915f186f1f2f27e71eba7b42f-primary.xml.gz - Download failed: Curl error (42): Operation was aborted by an application callback ...

Ctrl-C of pkcon feeds through

12:55:42	PackageKit          transaction now finished
12:55:42	PackageKit          running /18_baaacebb as previous one finished
12:55:42	PackageKit          transaction now running
12:55:42	PackageKit          6 transactions in list, 5 committed but not finished

next transaction (18) starts

12:55:42	Hif                 Skipping refresh of fedora: cannot update repo 'fedora': Cannot download repomd.xml: Cannot download repodata/repomd.xml: All mirrors were tried

probably TCP connect() failure; DNS resolution was cached from earlier

12:55:42	Hif                 Skipping refresh of rpmfusion-free: cannot update repo 'rpmfusion-free': Cannot prepare internal mirrorlist: Curl error (6): Couldn't resolve host name for http://mirrors.rpmfusion.org/mirrorlist?repo=free-fedora-24&arch=x86_64 [Could not resolve host: mirrors.rpmfusion.org]

Now failing DNS resolution, transaction 18 continues burning through the repos as expected.

12:55:42	PackageKit          transaction now finished
12:55:42	PackageKit          running /19_bccaebad as previous one finished
12:55:42	PackageKit          transaction now running
12:55:42	PackageKit          6 transactions in list, 4 committed but not finished
12:55:42	PackageKit          install uid now 1001
12:55:42	PackageKit          install cmdline now PackageKit: get-updates
12:55:42	PackageKit          setting role for /19_bccaebad to get-updates
12:55:42	PackageKit-Hif      using cached sack HySack::release_ver[24]::filelists|remote
12:55:42	PackageKit          backend was running for 12 ms
12:55:42	PackageKit          emitting finished 'success', 12
12:55:42	PackageKit          transaction now finished

We see all subsequent transactions (including from step 7) answered from some higher-level cache - this is what causes the problem

12:55:52	PackageKit          CreateTransaction method called
12:55:52	PackageKit          job count now 23
12:55:52	PackageKit          trying to open database '/var/lib/PackageKit/transactions.db'
12:55:52	PackageKit          job count is now at 22
12:55:52	PackageKit          transaction now new
12:55:52	PackageKit          setting sender to :1.63
12:55:52	PackageKit          adding transaction 0x55c7b473d640
12:55:52	PackageKit          sending object path: '/23_deaceeaa'

another 5 transactions created by the same mysterious sender

12:55:59	PackageKit          setting sender to :1.382

then this last one must be my step 7 `pkcon get-updates`.

Looking at the sender numbers, our 1.63 is plausibly gnome-software.  The pattern of requests (5 in a row) is also consistent, having looked at the system logs for packagekitd.  Can't be packagekitd sending requests to itself, because the UID is wrong.

----
Aside: the last debug message below is buggy, uses "%i" where it should be "%u":

12:55:35	PackageKit          SetHints method called: locale=en_GB.UTF-8, background=false, interactive=true, cache-age=4294967295
12:55:35	PackageKit          locale changed to en_GB.UTF-8
12:55:35	PackageKit          cache-age changed to -1

Comment 2 Alan Jenkins 2016-09-14 19:02:53 UTC
So I think the gnome-software transaction is actually the key here.

Normally (while online) canceling `pkcon get-updates` doesn't cause a problem.  I don't think canceling it after we go offline is really causing a problem.

The problem is gnome-software queues up a transaction, which gets released when we're offline (once I cancel my pkcon transaction).

Submitting a transaction while we're offline, dumps failed results into the "higher level" cache.  "PackageKit-Hif: using cached sack".  Restarting packagekitd seems to be enough to clear this cache.

If gnome-software canceled transactions when we went offline, it wouldn't be a problem.   (I assume it doesn't *create* transactions when offline, instead waiting until we go online).

I think it's still wrong if other software (like pkcon) could pollute the cache while we're offline.  But it's more complex to work out how to fix it on the packagekitd side.

We could avoid caching failures while offline.  That's not exactly what this argument implies however.  It's asking for a tight coupling with gnome-software - it should use exactly the same criteria.  I don't know if gnome-software retries when the network connection is changed (while remaining on).  Or if a second network connection is added.  Or if there's a separate flag that flips after a captive portal login...

The simplest solution is that failures don't need to be cached.  The problem is, if the path to your repo is unreliable, every query is going to go out to the network.  PK aims to make queries light weight so they can be integrated into the gui shell etc...  It's not great for power consumption.  And I don't  want to add massive timeouts in every call to command-not-found while offline.

So I think we need to make sure all the online/offline logic is in packagekitd, not gnome-software, and just have gnome-software subscribe to a signal that tells it the network status has changed & it might want to retry.  (And invalidate the cached failures at that point).  IMO if NM happens to have a signal that does exactly what clients would want, it would make a better abstraction if PK relayed the information.

There's a question whether we want to assume the repos are unreachable when PK thinks we're offline.  I mean, if PK only ever did the dumbest thing, looking at online/offline, that would almost be a fix on it's own.  Except for some race conditions.  And I'd call this bug report a race condition...  I don't like this idea.

Another question is whether we need to avoid caching failures indefinitely if we don't see changes in network connection status.  I believe GNOME Software does something similar at a higher level - which will mitigate this bug report given that `pkcon -c -1` is working fine.  So I'm not majorly concerned, it's just a bit of a worry.

I'm also curious about when you ask for a refresh when we already have a cache, whether the cache is kept on failure.  I still don't understand what the difference is between `pkcon -c -1 refresh` and `pkcon refresh force`.

Comment 3 Alan Jenkins 2016-09-16 13:27:11 UTC
> need to avoid caching failures indefinitely if we don't see changes in network connection status.  I believe GNOME Software does something similar at a higher level - which will mitigate this bug report

Wups, I don't think that's right.  Whatever retry logic GNOME Software has, I think PK is returning "success" (but not updates).  PK "hides" the failure, therefore GNOME Software would have no reason to increase the frequency it checks at.

E.g. running `pkcon refresh force` while offline does not show any error (and `pkmon` doesn't elaborate on this).

Comment 4 Fedora End Of Life 2017-07-25 23:02:00 UTC
This message is a reminder that Fedora 24 is nearing its end of life.
Approximately 2 (two) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 24. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '24'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version'
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not
able to fix it before Fedora 24 is end of life. If you would still like
to see this bug fixed and are able to reproduce it against a later version
of Fedora, you are encouraged  change the 'version' to a later Fedora
version prior this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's
lifetime, sometimes those efforts are overtaken by events. Often a
more recent Fedora release includes newer upstream software that fixes
bugs or makes them obsolete.

Comment 5 Alan Jenkins 2017-07-26 08:06:15 UTC
I can't seem to reproduce this on Fedora 26.  Yay!

It looks like it changed to "failures don't need to be cached".  As predicted, if the configured repos are unreachable, then every single search e.g. in the Gnome app launcher is sending fresh requests out onto the network.

Comment 6 Fedora End Of Life 2017-08-08 17:21:18 UTC
Fedora 24 changed to end-of-life (EOL) status on 2017-08-08. Fedora 24 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.