1051917 – [abrt] dnf: addfilelist(): python2.7 killed by SIGSEGV

Bug 1051917 - [abrt] dnf: addfilelist(): python2.7 killed by SIGSEGV

Summary: [abrt] dnf: addfilelist(): python2.7 killed by SIGSEGV

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	dnf
Sub Component:
Version:	20
Hardware:	x86_64
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Ales Kozumplik
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:	https://retrace.fedoraproject.org/faf...
Whiteboard:	abrt_hash:b453a2fa16e02e35e556ba948a1...
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2014-01-12 21:41 UTC by Torbjørn Lindahl
Modified:	2014-02-07 03:15 UTC (History)
CC List:	7 users (show)
Fixed In Version:	libsolv-0.4.1-1.gitbcedc98.fc20
Clone Of:
Environment:
Last Closed:	2014-02-07 03:15:12 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
File: backtrace (83.82 KB, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: cgroup (164 bytes, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: core_backtrace (9.73 KB, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: dso_list (9.35 KB, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: environ (72 bytes, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: exploitable (82 bytes, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: limits (1.29 KB, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: maps (47.95 KB, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: open_fds (517 bytes, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: proc_pid_status (890 bytes, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
File: var_log_messages (2.32 KB, text/plain) 2014-01-12 21:41 UTC, Torbjørn Lindahl	no flags	Details
View All

Description Torbjørn Lindahl 2014-01-12 21:41:18 UTC

Description of problem:
it happens quite randomly i think, but repeatedly

Version-Release number of selected component:
dnf-0.4.11-1.fc20

Additional info:
reporter:       libreport-2.1.11
backtrace_rating: 4
cmdline:        /usr/bin/python /usr/bin/dnf -v makecache timer
crash_function: addfilelist
executable:     /usr/bin/python2.7
kernel:         3.12.6-300.fc20.x86_64
runlevel:       N 5
type:           CCpp
uid:            0

Truncated backtrace:
Thread no. 1 (10 frames)
 #0 addfilelist at /usr/src/debug/libsolv/ext/repo_rpmdb.c:723
 #1 rpm2solv at /usr/src/debug/libsolv/ext/repo_rpmdb.c:951
 #2 repo_add_rpmdb at /usr/src/debug/libsolv/ext/repo_rpmdb.c:1676
 #3 repo_add_rpmdb_reffp at /usr/src/debug/libsolv/ext/repo_rpmdb.c:1872
 #4 hy_sack_load_system_repo at /usr/src/debug/hawkey/py3/src/sack.c:788
 #5 load_system_repo at /usr/src/debug/hawkey/src/python/sack-py.c:386
 #6 do_call at /usr/src/debug/Python-2.7.5/Python/ceval.c:4313
 #7 call_function at /usr/src/debug/Python-2.7.5/Python/ceval.c:4121
 #8 PyEval_EvalFrameEx at /usr/src/debug/Python-2.7.5/Python/ceval.c:2740
 #9 PyEval_EvalCodeEx at /usr/src/debug/Python-2.7.5/Python/ceval.c:3330

Comment 1 Torbjørn Lindahl 2014-01-12 21:41:23 UTC

Created attachment 849011 [details]
File: backtrace

Comment 2 Torbjørn Lindahl 2014-01-12 21:41:26 UTC

Created attachment 849012 [details]
File: cgroup

Comment 3 Torbjørn Lindahl 2014-01-12 21:41:28 UTC

Created attachment 849013 [details]
File: core_backtrace

Comment 4 Torbjørn Lindahl 2014-01-12 21:41:30 UTC

Created attachment 849014 [details]
File: dso_list

Comment 5 Torbjørn Lindahl 2014-01-12 21:41:32 UTC

Created attachment 849015 [details]
File: environ

Comment 6 Torbjørn Lindahl 2014-01-12 21:41:34 UTC

Created attachment 849016 [details]
File: exploitable

Comment 7 Torbjørn Lindahl 2014-01-12 21:41:36 UTC

Created attachment 849017 [details]
File: limits

Comment 8 Torbjørn Lindahl 2014-01-12 21:41:38 UTC

Created attachment 849018 [details]
File: maps

Comment 9 Torbjørn Lindahl 2014-01-12 21:41:41 UTC

Created attachment 849019 [details]
File: open_fds

Comment 10 Torbjørn Lindahl 2014-01-12 21:41:43 UTC

Created attachment 849020 [details]
File: proc_pid_status

Comment 11 Torbjørn Lindahl 2014-01-12 21:41:45 UTC

Created attachment 849021 [details]
File: var_log_messages

Comment 12 Ales Kozumplik 2014-01-13 07:13:28 UTC

Torbjørn,

is it possible for you to update the full core dump somewhere? (I've described how to do that in https://bugzilla.redhat.com/show_bug.cgi?id=1047087#c17.) Also if you could attach the @System.solv somewhere, we'd really appreciate that.

What is the installed libsolv version?

Comment 13 Ales Kozumplik 2014-01-13 07:16:33 UTC

Also related: bug 1051812.

Comment 14 Michael Schröder 2014-01-13 10:25:24 UTC

Hmm, it seems to crash when accessing dn[di[i]], which means that the rpm header seems to be very weird (the dirindex being out of range). The core dump should really help here.

bug 1051812 is a crash in unchecked_decompress_buf, i.e. a corrupt solv file. Maybe you should change hawkey to first write into a tmp file with a unique name and then rename to the final destination so that you don't get a broken file if two processes refresh the cache or you run out of disk space.

Comment 15 Zdeněk Pavlas 2014-01-13 12:51:50 UTC

"rpm -qal >/dev/null" works fine?  If not, please make your /var/lib/rpm/Packages available somewhere.

Comment 16 Ales Kozumplik 2014-01-13 17:37:22 UTC

random idea: in case it's a certain RPM causing this, can you please also attach /var/log/dnf.log to this? Thank you!

Comment 17 Ales Kozumplik 2014-01-14 08:00:21 UTC

Tried and failed to reproduce with a script that modifies slightly rpmdb and then calls makecache. not reproducible for me even with MALLOC_PERTURB_.

We're also getting bitten by bug 990285 on this one.

Comment 18 Ales Kozumplik 2014-01-22 08:38:47 UTC

reproduced!

http://paste.fedoraproject.org/70576/79837139

Michael, note that fp=0x0 in repo_add_rpmdb_reffp() which, I conclude, means this is not a corrupted old .solv file but a particular rpmdb state.

Comment 19 Ales Kozumplik 2014-01-22 09:07:43 UTC

Weird. The repeated runs fail again, DNF displays

BDB2053 Freeing read locks for locker 0x57: 9821/139736579651392
BDB2053 Freeing read locks for locker 0x58: 9821/139736579651392

before silently aborting. If I then run RPM, I get:

[akozumpl@localhost ~/dnf (dev)]$ sudo rpm -qa >allpackages
error: rpmdbNextIterator: skipping h#    3423 region trailer: BAD, tag -633323979 type 524303918 offset 2017444319 count 1488993822
[akozumpl@localhost ~/dnf (dev)]$ sudo rpm -qa >allpackages

with the second run producing the desired output in 'allpackages'.

Panu, do you know what we might be looking at here?

Comment 20 Ales Kozumplik 2014-01-22 09:08:13 UTC

(and after that DNF runs normally again of course)

Comment 21 Ales Kozumplik 2014-01-22 09:19:28 UTC

uploaded the DNF coredump, but I am afraid it's the wrong place to look at http://akozumpl.fedorapeople.org/core.9039

Comment 22 Ales Kozumplik 2014-01-22 09:41:38 UTC

looking at the logs, 'dnf makecache' was running at the same time as the transaction just before the read lock errors started to pop up. So RPM transaction runs, something tries to read RPMDB concurrently, running into an rpm locking/bdb locking bug.

Comment 23 Panu Matilainen 2014-01-22 11:03:35 UTC

(In reply to Ales Kozumplik from comment #19)
> Weird. The repeated runs fail again, DNF displays
> 
> BDB2053 Freeing read locks for locker 0x57: 9821/139736579651392
> BDB2053 Freeing read locks for locker 0x58: 9821/139736579651392

This means a former process whose pid was 9821 exited/died without closing rpmdb properly, such as crashing with read-only locks held. Which would be the previous, crashing run of dnf in this case perhaps.

> before silently aborting. If I then run RPM, I get:
> 
> [akozumpl@localhost ~/dnf (dev)]$ sudo rpm -qa >allpackages
> error: rpmdbNextIterator: skipping h#    3423 region trailer: BAD, tag
> -633323979 type 524303918 offset 2017444319 count 1488993822
> [akozumpl@localhost ~/dnf (dev)]$ sudo rpm -qa >allpackages
> 
> with the second run producing the desired output in 'allpackages'.
> 
> Panu, do you know what we might be looking at here?

The rpmdbNextIterator region trailer error comes from a failing headerCheck(), ie rpm's internal header verification thinks there's a corrupted header in the rpmdb. More often than actual rpmdb corruption, such errors are corruption in the BDB environment, ie 'rm -f /var/lib/rpm/__db.*' is enough to fix it. That a second run doesn't show that error is a bit strange though, BDB environment corruption usually doesn't go away by itself, and rpm doesn't automatically remove the environment except when --root is used (or when doing --rebuilddb).

IIRC libsolv's repo_rpmdb doesn't know about (and thus doesn't validate either) the header region stuff, so its possible that repo_rpmdb code reads something that rpm itself would skip in case like above.

Comment 24 Panu Matilainen 2014-01-22 11:11:30 UTC

(In reply to Ales Kozumplik from comment #22)
> looking at the logs, 'dnf makecache' was running at the same time as the
> transaction just before the read lock errors started to pop up. So RPM
> transaction runs, something tries to read RPMDB concurrently, running into
> an rpm locking/bdb locking bug.

Right, libsolv's repo_rpmdb also doesn't know about this newish rpmdb open/close serialization lock, without which BDB shared environment create/open/close can race:
http://rpm.org/gitweb?p=rpm.git;a=commitdiff;h=ad874d60e3804f1bcd64f3510e1e2dfbf81456cd

Comment 25 Fedora Update System 2014-01-23 13:46:42 UTC

libsolv-0.4.1-1.gitbcedc98.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/libsolv-0.4.1-1.gitbcedc98.fc20

Comment 26 Fedora Update System 2014-01-24 07:37:25 UTC

Package libsolv-0.4.1-1.gitbcedc98.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing libsolv-0.4.1-1.gitbcedc98.fc20'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-1410/libsolv-0.4.1-1.gitbcedc98.fc20
then log in and leave karma (feedback).

Comment 27 Fedora Update System 2014-02-07 03:15:12 UTC

libsolv-0.4.1-1.gitbcedc98.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.

Note You need to log in before you can comment on or make changes to this bug.