Description of problem: it happens quite randomly i think, but repeatedly Version-Release number of selected component: dnf-0.4.11-1.fc20 Additional info: reporter: libreport-2.1.11 backtrace_rating: 4 cmdline: /usr/bin/python /usr/bin/dnf -v makecache timer crash_function: addfilelist executable: /usr/bin/python2.7 kernel: 3.12.6-300.fc20.x86_64 runlevel: N 5 type: CCpp uid: 0 Truncated backtrace: Thread no. 1 (10 frames) #0 addfilelist at /usr/src/debug/libsolv/ext/repo_rpmdb.c:723 #1 rpm2solv at /usr/src/debug/libsolv/ext/repo_rpmdb.c:951 #2 repo_add_rpmdb at /usr/src/debug/libsolv/ext/repo_rpmdb.c:1676 #3 repo_add_rpmdb_reffp at /usr/src/debug/libsolv/ext/repo_rpmdb.c:1872 #4 hy_sack_load_system_repo at /usr/src/debug/hawkey/py3/src/sack.c:788 #5 load_system_repo at /usr/src/debug/hawkey/src/python/sack-py.c:386 #6 do_call at /usr/src/debug/Python-2.7.5/Python/ceval.c:4313 #7 call_function at /usr/src/debug/Python-2.7.5/Python/ceval.c:4121 #8 PyEval_EvalFrameEx at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740 #9 PyEval_EvalCodeEx at /usr/src/debug/Python-2.7.5/Python/ceval.c:3330
Created attachment 849011 [details] File: backtrace
Created attachment 849012 [details] File: cgroup
Created attachment 849013 [details] File: core_backtrace
Created attachment 849014 [details] File: dso_list
Created attachment 849015 [details] File: environ
Created attachment 849016 [details] File: exploitable
Created attachment 849017 [details] File: limits
Created attachment 849018 [details] File: maps
Created attachment 849019 [details] File: open_fds
Created attachment 849020 [details] File: proc_pid_status
Created attachment 849021 [details] File: var_log_messages
Torbjørn, is it possible for you to update the full core dump somewhere? (I've described how to do that in https://bugzilla.redhat.com/show_bug.cgi?id=1047087#c17.) Also if you could attach the @System.solv somewhere, we'd really appreciate that. What is the installed libsolv version?
Also related: bug 1051812.
Hmm, it seems to crash when accessing dn[di[i]], which means that the rpm header seems to be very weird (the dirindex being out of range). The core dump should really help here. bug 1051812 is a crash in unchecked_decompress_buf, i.e. a corrupt solv file. Maybe you should change hawkey to first write into a tmp file with a unique name and then rename to the final destination so that you don't get a broken file if two processes refresh the cache or you run out of disk space.
"rpm -qal >/dev/null" works fine? If not, please make your /var/lib/rpm/Packages available somewhere.
random idea: in case it's a certain RPM causing this, can you please also attach /var/log/dnf.log to this? Thank you!
Tried and failed to reproduce with a script that modifies slightly rpmdb and then calls makecache. not reproducible for me even with MALLOC_PERTURB_. We're also getting bitten by bug 990285 on this one.
reproduced! http://paste.fedoraproject.org/70576/79837139 Michael, note that fp=0x0 in repo_add_rpmdb_reffp() which, I conclude, means this is not a corrupted old .solv file but a particular rpmdb state.
Weird. The repeated runs fail again, DNF displays BDB2053 Freeing read locks for locker 0x57: 9821/139736579651392 BDB2053 Freeing read locks for locker 0x58: 9821/139736579651392 before silently aborting. If I then run RPM, I get: [akozumpl@localhost ~/dnf (dev)]$ sudo rpm -qa >allpackages error: rpmdbNextIterator: skipping h# 3423 region trailer: BAD, tag -633323979 type 524303918 offset 2017444319 count 1488993822 [akozumpl@localhost ~/dnf (dev)]$ sudo rpm -qa >allpackages with the second run producing the desired output in 'allpackages'. Panu, do you know what we might be looking at here?
(and after that DNF runs normally again of course)
uploaded the DNF coredump, but I am afraid it's the wrong place to look at http://akozumpl.fedorapeople.org/core.9039
looking at the logs, 'dnf makecache' was running at the same time as the transaction just before the read lock errors started to pop up. So RPM transaction runs, something tries to read RPMDB concurrently, running into an rpm locking/bdb locking bug.
(In reply to Ales Kozumplik from comment #19) > Weird. The repeated runs fail again, DNF displays > > BDB2053 Freeing read locks for locker 0x57: 9821/139736579651392 > BDB2053 Freeing read locks for locker 0x58: 9821/139736579651392 This means a former process whose pid was 9821 exited/died without closing rpmdb properly, such as crashing with read-only locks held. Which would be the previous, crashing run of dnf in this case perhaps. > before silently aborting. If I then run RPM, I get: > > [akozumpl@localhost ~/dnf (dev)]$ sudo rpm -qa >allpackages > error: rpmdbNextIterator: skipping h# 3423 region trailer: BAD, tag > -633323979 type 524303918 offset 2017444319 count 1488993822 > [akozumpl@localhost ~/dnf (dev)]$ sudo rpm -qa >allpackages > > with the second run producing the desired output in 'allpackages'. > > Panu, do you know what we might be looking at here? The rpmdbNextIterator region trailer error comes from a failing headerCheck(), ie rpm's internal header verification thinks there's a corrupted header in the rpmdb. More often than actual rpmdb corruption, such errors are corruption in the BDB environment, ie 'rm -f /var/lib/rpm/__db.*' is enough to fix it. That a second run doesn't show that error is a bit strange though, BDB environment corruption usually doesn't go away by itself, and rpm doesn't automatically remove the environment except when --root is used (or when doing --rebuilddb). IIRC libsolv's repo_rpmdb doesn't know about (and thus doesn't validate either) the header region stuff, so its possible that repo_rpmdb code reads something that rpm itself would skip in case like above.
(In reply to Ales Kozumplik from comment #22) > looking at the logs, 'dnf makecache' was running at the same time as the > transaction just before the read lock errors started to pop up. So RPM > transaction runs, something tries to read RPMDB concurrently, running into > an rpm locking/bdb locking bug. Right, libsolv's repo_rpmdb also doesn't know about this newish rpmdb open/close serialization lock, without which BDB shared environment create/open/close can race: http://rpm.org/gitweb?p=rpm.git;a=commitdiff;h=ad874d60e3804f1bcd64f3510e1e2dfbf81456cd
libsolv-0.4.1-1.gitbcedc98.fc20 has been submitted as an update for Fedora 20. https://admin.fedoraproject.org/updates/libsolv-0.4.1-1.gitbcedc98.fc20
Package libsolv-0.4.1-1.gitbcedc98.fc20: * should fix your issue, * was pushed to the Fedora 20 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing libsolv-0.4.1-1.gitbcedc98.fc20' as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/FEDORA-2014-1410/libsolv-0.4.1-1.gitbcedc98.fc20 then log in and leave karma (feedback).
libsolv-0.4.1-1.gitbcedc98.fc20 has been pushed to the Fedora 20 stable repository. If problems still persist, please make note of it in this bug report.