Bug 1620275 - package gets removed on failed update
Summary: package gets removed on failed update
Keywords:
Status: CLOSED RAWHIDE
Alias: None
Product: Fedora
Classification: Fedora
Component: rpm
Version: rawhide
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Panu Matilainen
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-08-22 21:01 UTC by Valdis Kletnieks
Modified: 2019-06-13 12:50 UTC (History)

Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-13 12:50:51 UTC
Type: Bug
Embargoed:


Attachments
dnf failure salvaged from scrollback buffer (78.74 KB, text/plain)
2018-08-22 21:01 UTC, Valdis Kletnieks

Description Valdis Kletnieks 2018-08-22 21:01:56 UTC
Created attachment 1477967 [details]
dnf failure salvaged from scrollback buffer

Description of problem:
I ran 'dnf update' with about 130 RPMs to process. dnf hit a cpio error while updating gimp. At the end, gimp was listed in the 'Upgraded' list, when in fact the error caused gimp to be removed entirely.

error tossed during the update:

  Upgrading        : grubby-8.40-18.fc30.x86_64                                                                 108/263
  Upgrading        : gimp-2:2.10.6-1.fc30.x86_64                                                                109/263
Error unpacking rpm package gimp-2:2.10.6-1.fc30.x86_64
Error unpacking rpm package gimp-2:2.10.6-1.fc30.x86_64
error: unpacking of archive failed on file /usr/lib64/gimp/2.0/plug-ins/align-layers/align-layers;5b7d7374: cpio: open
  Upgrading        : mesa-libGLES-18.2.0~rc2-1.fc30.x86_64                                                      110/263
error: gimp-2:2.10.6-1.fc30.x86_64: install failed
  Upgrading        : grub2-tools-efi-1:2.02-51.fc30.x86_64                                                      111/263

Summary at the end:
Upgraded:
  annobin-8.25-1.fc30.x86_64                                autocorr-en-1:6.1.0.3-1.fc30.noarch
  babl-0.1.56-2.fc30.x86_64                                 babl-devel-0.1.56-2.fc30.x86_64
(...)
  gdb-doc-8.1.90.20180727-44.fc30.noarch                    gdb-headless-8.1.90.20180727-44.fc30.x86_64
  gegl04-0.4.8-1.fc30.x86_64                                gegl04-devel-0.4.8-1.fc30.x86_64
  gimp-2:2.10.6-1.fc30.x86_64                               gimp-devel-2:2.10.6-1.fc30.x86_64
  gimp-devel-tools-2:2.10.6-1.fc30.x86_64                   gimp-libs-2:2.10.6-1.fc30.x86_64
  grub2-common-1:2.02-51.fc30.noarch                        grub2-efi-x64-1:2.02-51.fc30.x86_64
(...)


(I have no idea why cpio died)

Full copy-paste rescued from the scrollback buffer attached...

Version-Release number of selected component (if applicable):
dnf-3.2.0-2.fc29.noarch

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Panu Matilainen 2018-08-23 06:17:10 UTC
Dnf has no say in what happens in the transaction, this is an rpm bug. I'll look into it, thanks for the report.

Comment 2 Panu Matilainen 2018-08-23 09:04:22 UTC
That said, I can't see how this could actually happen. The "cpio: open" error is a failure to exclusively open a file for writing, which for root requires some fairly special circumstances. And while there certainly *are* ways for that to happen, I don't see how it could result in the old package getting removed despite the error. Not unless there were two or more packages wanting to remove that older gimp version (through obsoletes) in the same transaction, and getting really "lucky" with ordering and all. And I don't see the necessary elements for that in this transaction either.

All of which goes to say that this is likely going to be hard to reproduce and thus fix.

Is this a physical host, vm, container or...? Old install upgraded through the ages or a new one? Anything else special that may be worth mentioning?

Comment 3 Valdis Kletnieks 2018-08-23 16:14:30 UTC
(In reply to Panu Matilainen from comment #1)
> Dnf has no say in what happens in the transaction, this is an rpm bug. I'll
> look into it, thanks for the report.

Well... if dnf is invoking rpm under the covers, there's *still* a dnf bug there.  It's not checking the exit code of rpm and noticing that the invocation of rpm failed.

Comment 4 Valdis Kletnieks 2018-08-23 16:31:54 UTC
(In reply to Panu Matilainen from comment #2)
> That said, I can't see how this could actually happen. The "cpio: open" error
> is a failure to exclusively open a file for writing, which for root requires

Yeah, I was mystified by the cpio failure as well.  Was ready to blame it on a kernel bug, but not seeing any other signs it's buggy that way...

> All of which goes to say that this is likely going to be hard to reproduce
> and thus fix.

Yeah, finding the cpio bug is hopeless.  The thing that's chasable here is that dnf fails to notice that an update failed.

Easy way to test would probably be 'chattr +i' a file in some random rpm, and then use 'dnf reinstall', which should complain when it tries to re-install that file.
 
> Is this a physical host, vm, container or...? Old install upgraded through
> the ages or a new one? Anything else special that may be worth mentioning?

It's a 5 year old Dell Latitude laptop, updated constantly.  Also running a linux-next kernel from last week.

Comment 5 Panu Matilainen 2018-08-24 05:57:24 UTC
No, the only relevant thing here is that *rpm* failed to notice, or at least record, the failure.

Dnf isn't invoking rpm, it's using the librpm API. Rpm should've marked the gimp update element as failed, which would cause the erasure of the old package to be skipped and errors to be logged, and the verify stage should show the failure. The rub is that dnf IS relying on rpm-provided data here:

        # Post-transaction verification is no longer needed,
        # because DNF trusts error codes returned by RPM.
        # Verification banner is displayed to preserve UX.
        # TODO: drop in future DNF
        for tsi in self._transaction:
            count = display_banner(tsi.pkg, count)

There's no actual *cpio* bug here (rpm has its own routines to handle cpio); it's a failure to open a file for writing, and while chattr +i will cause a similar failure, it's not the same. Which might or might not matter. Maybe I'll need to simulate the open failure with strace or something to see if there's a path that escapes visual inspection. Beyond that, like I said, there *is* a known path where an error can escape like this, but it just seems unlikely to be the case here.

Also the logged error is totally useless for understanding what caused the failure.
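
To make the expected behaviour above concrete (a failed install element should cause the erasure of the old package to be skipped), here is a toy model in Python. It is only a sketch and not librpm code: `Element`, `run_transaction`, `upgrade_steps` and the `broken` parameter are hypothetical names invented for illustration, and the package names are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One step of a toy transaction (hypothetical, not an rpm type)."""
    name: str
    kind: str           # "install" or "erase"
    replaces: str = ""  # for installs: the old package this one upgrades
    failed: bool = False


def run_transaction(elements, installed, broken=()):
    """Apply elements in order; return (installed set, names of failed elements).

    `broken` lists installs that are assumed to fail while unpacking.
    """
    installed = set(installed)
    blocked_erasures = set()
    for el in elements:
        if el.kind == "install":
            if el.name in broken:
                # Record the failure and remember which erasure must not run.
                el.failed = True
                if el.replaces:
                    blocked_erasures.add(el.replaces)
                continue
            installed.add(el.name)
        elif el.kind == "erase":
            if el.name in blocked_erasures:
                # The replacement never made it to disk: keep the old package.
                el.failed = True
                continue
            installed.discard(el.name)
    failed = [el.name for el in elements if el.failed]
    return installed, failed


def upgrade_steps():
    """An upgrade modelled as install-new followed by erase-old."""
    return [
        Element("gimp-new", "install", replaces="gimp-old"),
        Element("gimp-old", "erase"),
    ]


if __name__ == "__main__":
    print(run_transaction(upgrade_steps(), {"gimp-old"}, broken={"gimp-new"}))
```

In this model a failed unpack leaves the old package installed and reports both elements as failed. The behaviour reported in this bug corresponds to the erase step running anyway.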

Comment 6 Panu Matilainen 2018-08-24 13:38:47 UTC
As usually happens with such things, the same issue found me via another route as well. The cpio-open error turns out to be quite simple and reproducible after all: it occurs when a regular file is to be replaced by an unowned directory.

So there's the first bug: rpm can handle an owned directory replacing a regular file, but unowned directories are created differently. This can be fixed in packaging, but since rpm allows unowned directories, it should either handle this correctly or detect it as a conflict. Failing mid-transaction is not OK.
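
The open failure itself can be sketched outside rpm. The Python snippet below assumes a POSIX system and assumes that the failure comes down to an exclusive create under a path whose parent component is still a regular file. That matches the path in the logged error, but it is an illustration, not rpm's actual code path. The helper names `try_exclusive_create` and `demo` are made up for this sketch, and the temporary file name is borrowed from the log above.

```python
import errno
import os
import tempfile


def try_exclusive_create(path):
    """Exclusively create `path` for writing.

    Returns None on success, or the symbolic errno name on failure.
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o644)
    except OSError as exc:
        return errno.errorcode[exc.errno]
    os.close(fd)
    return None


def demo():
    with tempfile.TemporaryDirectory() as root:
        plugins = os.path.join(root, "plug-ins")
        os.mkdir(plugins)

        # Old layout: plug-ins/align-layers is a regular file.
        old_path = os.path.join(plugins, "align-layers")
        with open(old_path, "w") as f:
            f.write("old plug-in\n")

        # New layout: plug-ins/align-layers/ is a directory, and the new
        # file is first written under a temporary name inside it.
        tmp_name = os.path.join(old_path, "align-layers;5b7d7374")
        blocked = try_exclusive_create(tmp_name)

        # Once the old file is replaced by a directory, the create works.
        os.remove(old_path)
        os.mkdir(old_path)
        unblocked = try_exclusive_create(tmp_name)
        return blocked, unblocked


if __name__ == "__main__":
    print(demo())
```

On Linux the first attempt is expected to fail with ENOTDIR, and the second one succeeds once the directory exists.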

Comment 7 Panu Matilainen 2018-08-27 08:33:18 UTC
FWIW, this fixes the particular case of gimp update failing:
https://src.fedoraproject.org/rpms/gimp/c/5806be946febc58b68ec28a3af16eab0e9207435?branch=master

Comment 8 Panu Matilainen 2019-05-15 11:16:53 UTC
For the record, I just saw this myself. On an f29 -> f30 distro-sync, the filesystem update failed (probably due to an ISO image mounted on /mnt) and resulted in the filesystem package getting removed entirely during the update. And while there is a known case with multiple obsoletes where this could happen, there shouldn't be any package obsoleting the filesystem package, let alone multiple packages doing so.

Turns out this is trivially reproducible with dnf, because of a somewhat unexpected (but entirely legitimate) API use. And what is easily reproduced is usually easily fixed too. A fix is proposed upstream now:
https://github.com/rpm-software-management/rpm/pull/706

Thanks for reporting this and thus helping find and fix the nasty bug that this is!

Comment 9 Valdis Kletnieks 2019-05-15 15:49:44 UTC
Glad to hear the other piece of the bug finally got squashed! :)

Things like that are why I run Rawhide on top of a linux-next kernel - every
bug I find before release is one that doesn't ship. ;)

Comment 10 Panu Matilainen 2019-06-13 12:50:51 UTC
Fixed in rawhide as of rpm >= 4.14.90. Backports to other Fedora versions to follow later.

Thanks again for the report. We found two bugs and fixed one (the unowned directory thing is a good deal harder to fix).

