Bug 1318223

Summary: dnf -y upgrade fails without error message
Product: [Fedora] Fedora
Reporter: Marius Vollmer <mvollmer>
Component: rpm
Assignee: Packaging Maintenance Team <packaging-team-maint>
Status: CLOSED CURRENTRELEASE
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: 23
CC: dpeschman, ffesti, ignatenko, jsilhan, jzeleny, lkardos, mls, mluscon, novyjindrich, packaging-team-maint, pknirsch, pnemade, stefw, vmukhame
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-20 11:50:08 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Marius Vollmer 2016-03-16 10:10:33 UTC
Description of problem:

During our automated image creation, "dnf -y upgrade" sometimes fails after the "cleanup" phase without any apparent error message.

Using strace shows a write() that fails with EPIPE towards the end.
 
Version-Release number of selected component (if applicable):
dnf-1.1.7-2.fc23.noarch

How reproducible:
Only in a very specific setting, hard to reproduce elsewhere

Steps to Reproduce:
1. git clone https://github.com/mvollmer/cockpit.git
2. cd cockpit
3. git checkout 5bf3073f489e4b9438309e0dd2d6fa0da2f3e86d
4. cd test
5. sudo ./vm-prep
6. ./vm-create fedora-23 -v

Actual results:
dnf -y upgrade runs and then ends like this:

  Cleanup     : libnl3-3.2.27-0.1.fc23.x86_64                           268/275 
  Cleanup     : libpng-2:1.6.17-2.fc23.x86_64                           269/275 
  Cleanup     : bash-4.3.42-1.fc23.x86_64                               270/275 
  Cleanup     : glibc-common-2.22-3.fc23.x86_64                         271/275 
  Cleanup     : nss-softokn-freebl-3.20.0-1.0.fc23.x86_64               272/275 
  Cleanup     : glibc-2.22-3.fc23.x86_64                                273/275 
  Cleanup     : tzdata-2015g-1.fc23.noarch                              274/275 
  Cleanup     : libgcc-5.1.1-4.fc23.x86_64                              275/275 
1039 blocks
[1458118831.96] EVENT: Domain 'fedora-23-iqwr' (ID 1075) Shutdown Finished
vm-create: setup failed with code 1


Expected results:
dnf enters the "verify" phase and completes successfully

Additional info:
I have captured a strace log: https://fedorapeople.org/~mvo/dnf.log (739 MB) 

It has an EPIPE in it, but I have no idea whether that causes the failure.  Some lines from the log:

1005  pipe([50, 51])                    = 0
1005  clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fd8231cf9d0) = 1832
...
1005  write(51, "hare/man/cy/man1\n/usr/share/man/"..., 4096 <unfinished ...>
...
1832  exit_group(0)                     = ?
1005  <... write resumed> )             = -1 EPIPE (Broken pipe)
...
1005  exit_group(1)                     = ?
1005  +++ exited with 1 +++
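
The pattern in the excerpt above (a pipe, a child that exits without reading, and a parent write that fails with EPIPE) can be reproduced outside of dnf. The following is a minimal sketch and not the code that dnf or rpm actually runs: `feed_child` is a hypothetical helper, `sh -c true` stands in for a child that never reads its stdin, and the 4 MiB payload is an assumption chosen to exceed the usual pipe buffer so the write cannot complete before the child exits.

```python
import errno
import subprocess


def feed_child(command, payload):
    """Write payload to the child's stdin.

    Returns (errno of the failed write or None, child exit status).
    """
    child = subprocess.Popen(command, stdin=subprocess.PIPE)
    error = None
    try:
        child.stdin.write(payload)
        child.stdin.flush()
    except BrokenPipeError as exc:
        # The read end of the pipe is gone: write() failed with EPIPE.
        error = exc.errno
    try:
        child.stdin.close()
    except BrokenPipeError:
        # close() flushes leftover buffered data and can hit the same error.
        error = errno.EPIPE
    return error, child.wait()


if __name__ == "__main__":
    # Stand-in for a scriptlet that ignores its stdin and exits cleanly.
    err, status = feed_child(["sh", "-c", "true"], b"x" * (4 * 1024 * 1024))
    print("write error:", errno.errorcode.get(err), "child status:", status)
```

Note that the child exits with status 0, as process 1832 does in the log; the failure is only visible on the writer's side.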

Comment 1 Stef Walter 2016-03-16 12:17:12 UTC
I've seen this several times in a row in a test VM.

Comment 2 Marius Vollmer 2016-03-16 12:40:07 UTC
It looks like dnf runs "mandb -q" (via sh) and writes a lot of stuff into its stdin, but mandb -q doesn't read stdin.

Comment 3 Honza Silhan 2016-03-29 14:36:36 UTC
Thanks for the report. It could happen because of new initializing of rpmdb cache through libsolv unofficial way. This should fix it: https://github.com/openSUSE/libsolv/pull/123

Comment 4 Marius Vollmer 2016-03-30 07:21:04 UTC
(In reply to Jan Silhan from comment #3)
> Thanks for the report. It could happen because of new initializing of rpmdb
> cache through libsolv unofficial way. This should fix it:
> https://github.com/openSUSE/libsolv/pull/123

Note that EPIPE happens with "mandb", not "rpmdb".

Comment 5 Michael Schröder 2016-05-20 10:55:29 UTC
This is not a bug in libsolv or dnf, but a bug in the man-db package. It uses the new filetrigger mechanism to update the mandb if a man page is installed. But the filetrigger scriptlet is supposed to read the trigger files from stdin, which it currently doesn't do. The resulting SIGPIPE terminates the complete software stack, as the transaction is done by the main dnf process.

So, two things IMHO need to be done:
1) the man-db package needs to read (and discard) the file list from stdin (i.e. cat > /dev/null)
2) rpm should also be changed so that it does not die when a scriptlet misbehaves.
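
Both suggestions can be illustrated with a small sketch. It is not rpm's implementation: `run_trigger` is a hypothetical helper, the file list is made-up data, and the scripts are stand-ins for the man-db scriptlet. The first call drains stdin with `cat > /dev/null` as proposed in 1); the second uses a script that ignores stdin, and the caller survives the broken pipe as proposed in 2).

```python
import subprocess

# Made-up file list, large enough to overflow the pipe buffer.
FILE_LIST = b"".join(
    b"/usr/share/man/man1/page%d.1.gz\n" % i for i in range(100000)
)


def run_trigger(script, file_list):
    """Run script via sh and feed it file_list on stdin.

    A broken pipe is recorded instead of being treated as fatal.
    Returns (delivered, exit_status).
    """
    child = subprocess.Popen(["sh", "-c", script], stdin=subprocess.PIPE)
    delivered = True
    try:
        child.stdin.write(file_list)
        child.stdin.flush()
    except BrokenPipeError:
        # The script stopped reading; keep going instead of aborting.
        delivered = False
    try:
        child.stdin.close()
    except BrokenPipeError:
        delivered = False
    return delivered, child.wait()


if __name__ == "__main__":
    # 1) A well-behaved scriptlet reads and discards the file list.
    print(run_trigger("cat > /dev/null", FILE_LIST))
    # 2) A misbehaving scriptlet ignores stdin; the caller still finishes.
    print(run_trigger("true", FILE_LIST))
```

In both cases the script's own exit status is 0, so a caller that wants to judge the scriptlet by its exit code can still do so after a failed write.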

Comment 6 Michael Schröder 2016-05-20 11:45:47 UTC
This is already fixed in newer rpm versions, but the installed rpm-4.13.0-0.rc1.3.fc23 package does not have the fix.

Comment 7 Ľuboš Kardoš 2016-05-20 11:50:08 UTC
Yes, this is fixed since 4.13.0-0.rc1.6.fc23.