Bug 860500
Summary: yum gets stuck during cleanup/verification, but running another yum process unsticks it
Product: [Fedora] Fedora
Component: rpm
Version: 18
Hardware: x86_64
OS: All
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Reporter: Adam Williamson <awilliam>
Assignee: Fedora Packaging Toolset Team <packaging-team>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: collura, ffesti, john.ellson, jzeleny, maciek.borzecki, mads, maxamillion, naipaul_ojar, novyjindrich, packaging-team, packaging-team-maint, pknirsch, pmatilai, rjones, tim.lauridsen, tmraz, zpavlas
Fixed In Version: rpm-4.10.3.1-2.fc18
Doc Type: Bug Fix
Type: Bug
Last Closed: 2013-09-01 23:01:29 UTC
Description (Adam Williamson, 2012-09-26 01:49:47 UTC)
Might be deadlocking on rpmdb access; yum occasionally holds more than one handle to the rpmdb, and that has caused problems before (bug 678644, where there was also a bug in rpm itself making things worse). The next time it happens, try getting a bit more info while it's hung:

- get a backtrace of its whereabouts (attach gdb to the hung-up process by its pid)
- strace the hung pid (probably hanging on a futex, but still)
- collect 'db_stat -Cl -h /var/lib/rpm' output

Another way to get a bit more data is to have rpmdb opens audited:

```
# echo "-w /var/lib/rpm/Packages -p war -k rpmdb" >> /etc/audit/audit.rules
# systemctl restart auditd.service
```

Now if/when it gets stuck, the last few results from 'ausearch -k rpmdb' might provide further clues.

Any idea when this started happening, roughly? I haven't seen such behavior on my f16/f17 systems running rpm-4.10.0 (or in-development versions of it), and those have seen quite a number of updates. It could of course still be an rpm bug, or yum doing something subtly different now, or some yum plugin misbehaving, or...

Saw this problem, or something very similar, on an x86_64 rawhide vhost:

```
[root@rawhide-64 ~]# yum update --skip-broken -y
BDB2053 Freeing read locks for locker 0x19: 595/139829807343552
BDB2053 Freeing read locks for locker 0x1a: 595/139829807343552
BDB2053 Freeing read locks for locker 0x1b: 595/139829807343552
BDB2053 Freeing read locks for locker 0x1c: 595/139829807343552
Loaded plugins: fastestmirror, langpacks, presto, refresh-packagekit
Loading mirror speeds from cached hostfile
 * rawhide: mirror.cogentco.com
No Packages marked for Update
[root@rawhide-64 ~]#
```

This was shortly after a reboot, although I had run one build on the system before this that might have presented some load. This is the same system that sees the kswapd0 lockups most often (more than a similarly configured i686 rawhide vhost).
Bug #866988: no specific evidence that the two bugs are related, but yum does seem to be one of the triggers for the kswapd issue.

I'm being hit by this quite regularly since I upgraded to F18. At first I thought something was wrong in the anaconda package, since it got stuck installing that twice, but then it got stuck a couple of times during an update and during installation of @kde-desktop. So far I've seen yum getting stuck in the install, verification, or cleanup phases. Each time, strace showed that yum was stuck in some synchronization operation (futex(.., FUTEX_WAIT, ..)). It did not react to any signals (possibly they were already blocked in the process' sigmask) except KILL.

Since it happens quite often (especially when installing/updating a large number of packages), I've started running yum with debugging enabled (-d10), but magically the issue did not appear in any of the with-debug runs, possibly suggesting some race/deadlock condition. If you happen to have any suggestion on how to debug the problem further without strace/debugs, I'd very much like to hear about that.

Relevant packages installed on my system:

```
rpm-4.10.2-1.fc18.x86_64
rpm-build-4.10.2-1.fc18.x86_64
rpm-build-libs-4.10.2-1.fc18.x86_64
rpmdevtools-8.3-1.fc18.noarch
rpmfusion-free-release-18-0.4.noarch
rpmfusion-nonfree-release-18-0.4.noarch
rpm-libs-4.10.2-1.fc18.x86_64
rpmlint-1.4-11.fc18.noarch
rpm-python-4.10.2-1.fc18.x86_64
yum-3.4.3-47.fc18.noarch
yumex-3.0.10-1.fc18.noarch
yum-langpacks-0.3.0-4.fc18.noarch
yum-metadata-parser-1.1.4-7.fc18.x86_64
yum-plugin-fastestmirror-1.1.31-6.fc18.noarch
yum-plugin-show-leaves-1.1.31-6.fc18.noarch
yum-presto-0.9.0-1.fc18.noarch
yum-utils-1.1.31-6.fc18.noarch
```

(In reply to comment #3)
> If you happen to have any suggestion on how to debug the problem further
> without strace/debugs I'd very much like to hear about that.

See comment #1 for further ways to collect information from the hangs. Especially the audit trail of rpmdb accesses has proven useful in tracing odd cases.
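As an aside, lock dumps in the format `db_stat -Cl` prints (as captured in the next comment) can be summarized mechanically. Below is a hedged sketch of a convenience helper, not part of BDB or rpm tooling; the regex only understands the simplified one-lock-per-line shape embedded in the sample, and all names here are made up for illustration:

```python
import re

# Sample lines in the shape of a `db_stat -Cl` "Locks grouped by lockers"
# section (trimmed; locker header lines are deliberately not handled).
SAMPLE = """\
28f WRITE 1 WAIT 0x1bee0 len: 20 data: a701020001080000c6382bd2d00e000000000000
29d READ 1 HELD 0x1bee0 len: 20 data: a701020001080000c6382bd2d00e000000000000
29e READ 1 HELD (201a8 801 b540fd89 19570 0) handle 0
"""

LOCK_RE = re.compile(r"^(\S+)\s+(READ|WRITE|IWRITE)\s+\d+\s+(HELD|WAIT)\s+(\S+)")

def find_conflicts(dump):
    """For each WAITing lock request, list who already HOLDs that object."""
    held = {}      # object -> [(locker, mode), ...]
    waiting = []   # (locker, object)
    for line in dump.splitlines():
        m = LOCK_RE.match(line.strip())
        if not m:
            continue
        locker, mode, status, obj = m.groups()
        if status == "WAIT":
            waiting.append((locker, obj))
        else:
            held.setdefault(obj, []).append((locker, mode))
    return [(locker, obj, held.get(obj, [])) for locker, obj in waiting]

print(find_conflicts(SAMPLE))   # [('28f', '0x1bee0', [('29d', 'READ')])]
```

On the sample above it reports that locker 28f's write request on object 0x1bee0 is blocked by locker 29d's held read lock, which is exactly the pattern the maintainer points out further down.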
strace output, as expected, shows a wait on a futex:

```
futex(0x7f7da99b2884, FUTEX_WAIT, 1, NULL^CProcess 2738 detached
 <detached ...>
```

```
[maciek@corsair:~]$ sudo db_stat -Cl -h /var/lib/rpm
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Lock REGINFO information:
Environment             Region type
1                       Region ID
/var/lib/rpm/__db.001   Region name
0x7f7017508000          Region address
0x7f70175080a0          Region allocation head
0x7f70175086a8          Region primary address
0                       Region maximum allocation
0                       Region allocated
Region allocations: 101 allocations, 0 failures, 0 frees, 1 longest
REGION_SHARED           Region flags
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Locks grouped by lockers:
Locker   Mode      Count Status  ----------------- Object ---------------
28f dd= 0 locks held 1  write locks 1  pid/thread 2738/140177905407808 flags 0  priority 100
28f WRITE  1 WAIT  0x1bee0 len: 20 data: a701020001080000c6382bd2d00e000000000000
28f IWRITE 1 HELD  0x1bee0 len: 20 data: a701020001080000c6382bd2d00e000000000000
290 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
290 READ   1 HELD  (201a8 801 b540fd89 19570 0) handle 0
291 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
291 READ   1 HELD  (201a9 801 cd725ff4 31c10 0) handle 0
292 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
292 READ   1 HELD  (28fc0 801 b9c12a6b 4a2b0 0) handle 0
293 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
293 READ   1 HELD  (29000 801 b5730d2e 62950 0) handle 0
294 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
294 READ   1 HELD  (29001 801 e9e7d0c2 7aff0 0) handle 0
295 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
295 READ   1 HELD  (29002 801 d322e3b2 93690 0) handle 0
296 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
296 READ   1 HELD  (29003 801 92577cdd abd30 0) handle 0
297 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
297 READ   1 HELD  (29004 801 a2e6ff7b c43d0 0) handle 0
298 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
298 READ   1 HELD  (29005 801 89054ac9 dca70 0) handle 0
299 dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
299 READ   1 HELD  (29006 801 b567d240 f5110 0) handle 0
29a dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
29a READ   1 HELD  (29007 801 b7780930 10d7b0 0) handle 0
29b dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
29b READ   1 HELD  (29008 801 f9d71355 125e50 0) handle 0
29c dd= 0 locks held 1  write locks 0  pid/thread 3569/140274069424064 flags 0  priority 100
29c READ   1 HELD  (201a7 801 d22b38c6 ed0 0) handle 0
29d dd= 0 locks held 1  write locks 0  pid/thread 3569/140274069424064 flags 0  priority 100
29d READ   1 HELD  0x1bee0 len: 20 data: a701020001080000c6382bd2d00e000000000000
29e dd= 0 locks held 1  write locks 0  pid/thread 3569/140274069424064 flags 0  priority 100
29e READ   1 HELD  (201a8 801 b540fd89 19570 0) handle 0
29f dd= 0 locks held 1  write locks 0  pid/thread 3569/140274069424064 flags 0  priority 100
29f READ   1 HELD  (201a9 801 cd725ff4 31c10 0) handle 0
19  dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
19  READ   1 HELD  (201a7 801 d22b38c6 ed0 0) handle 0
1a  dd= 0 locks held 0  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
27f dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
27f READ   1 HELD  (201a8 801 b540fd89 19570 0) handle 0
28e dd= 0 locks held 1  write locks 0  pid/thread 2738/140177905407808 flags 0  priority 100
28e READ   1 HELD  (201a7 801 d22b38c6 ed0 0) handle 0
```

Stack trace from pstack below (not sure if I can convince pstack to show a proper Python stack trace). The Thread 1 stack seems to make some sense.
It seems that after SIGSTOP/SIGCONT yum moves forward and continues running the transaction.

```
[root@corsair]/home/maciek# pstack 2738
Thread 3 (Thread 0x7f7da9038700 (LWP 2755)):
#0  0x000000364f00d6a0 in sem_wait () from /lib64/libpthread.so.0
#1  0x00000037c9707945 in PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
#2  0x00000037c970b454 in ?? () from /lib64/libpython2.7.so.1.0
#3  0x00000037c96dd281 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#4  0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#5  0x00000037c96dc2b3 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#6  0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#7  0x00000037c96dc2b3 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#8  0x00000037c96dcef1 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9  0x00000037c96dcef1 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#11 0x00000037c966d926 in ?? () from /lib64/libpython2.7.so.1.0
#12 0x00000037c9649c0e in PyObject_Call () from /lib64/libpython2.7.so.1.0
#13 0x00000037c9658200 in ?? () from /lib64/libpython2.7.so.1.0
#14 0x00000037c9649c0e in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00000037c96d7367 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#16 0x00000037c970b802 in ?? () from /lib64/libpython2.7.so.1.0
#17 0x000000364f007d15 in start_thread () from /lib64/libpthread.so.0
#18 0x000000364ecf246d in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f7d9bfff700 (LWP 2756)):
#0  0x000000364f00d6a0 in sem_wait () from /lib64/libpthread.so.0
#1  0x00000037c9707945 in PyThread_acquire_lock () from /lib64/libpython2.7.so.1.0
#2  0x00000037c970b454 in ?? () from /lib64/libpython2.7.so.1.0
#3  0x00000037c96dd281 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#4  0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#5  0x00000037c96dc2b3 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#6  0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#7  0x00000037c96dc2b3 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#8  0x00000037c96dcef1 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#9  0x00000037c96dcef1 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#10 0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#11 0x00000037c966d926 in ?? () from /lib64/libpython2.7.so.1.0
#12 0x00000037c9649c0e in PyObject_Call () from /lib64/libpython2.7.so.1.0
#13 0x00000037c9658200 in ?? () from /lib64/libpython2.7.so.1.0
#14 0x00000037c9649c0e in PyObject_Call () from /lib64/libpython2.7.so.1.0
#15 0x00000037c96d7367 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#16 0x00000037c970b802 in ?? () from /lib64/libpython2.7.so.1.0
#17 0x000000364f007d15 in start_thread () from /lib64/libpthread.so.0
#18 0x000000364ecf246d in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f7db641d740 (LWP 2738)):
#0  0x0000003667c3178b in __db_tas_mutex_readlock () from /lib64/libdb-5.3.so
#1  0x0000003667d49355 in __memp_fget () from /lib64/libdb-5.3.so
#2  0x0000003667c4dc32 in __bam_get_root () from /lib64/libdb-5.3.so
#3  0x0000003667c4e3f2 in __bam_search () from /lib64/libdb-5.3.so
#4  0x0000003667c39fe9 in __bamc_search () from /lib64/libdb-5.3.so
#5  0x0000003667c3e457 in __bamc_put () from /lib64/libdb-5.3.so
#6  0x0000003667cf2827 in __dbc_iput () from /lib64/libdb-5.3.so
#7  0x0000003667d02271 in __dbc_put_pp () from /lib64/libdb-5.3.so
#8  0x000000310b0130ba in dbiCursorPut () from /lib64/librpm.so.3
#9  0x000000310b019a47 in addToIndex () from /lib64/librpm.so.3
#10 0x000000310b01d63f in rpmdbAdd () from /lib64/librpm.so.3
#11 0x000000310b02edbb in rpmpsmStage () from /lib64/librpm.so.3
#12 0x000000310b02eff1 in rpmpsmStage () from /lib64/librpm.so.3
#13 0x000000310b02fb6b in rpmpsmRun () from /lib64/librpm.so.3
#14 0x000000310b03f002 in rpmteProcess () from /lib64/librpm.so.3
#15 0x000000310b045bb9 in rpmtsRun () from /lib64/librpm.so.3
#16 0x00007f7dafd2e15e in rpmts_Run () from /usr/lib64/python2.7/site-packages/rpm/_rpmmodule.so
#17 0x00000037c9649c0e in PyObject_Call () from /lib64/libpython2.7.so.1.0
#18 0x00000037c96d7367 in PyEval_CallObjectWithKeywords () from /lib64/libpython2.7.so.1.0
#19 0x00000037c965e199 in ?? () from /lib64/libpython2.7.so.1.0
#20 0x00000037c9649c0e in PyObject_Call () from /lib64/libpython2.7.so.1.0
#21 0x00000037c96d9583 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#22 0x00000037c96dcef1 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#23 0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#24 0x00000037c96dc2b3 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#25 0x00000037c96dcef1 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#26 0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#27 0x00000037c96dc2b3 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#28 0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#29 0x00000037c96dc2b3 in PyEval_EvalFrameEx () from /lib64/libpython2.7.so.1.0
#30 0x00000037c96ddcbf in PyEval_EvalCodeEx () from /lib64/libpython2.7.so.1.0
#31 0x00000037c96ddd92 in PyEval_EvalCode () from /lib64/libpython2.7.so.1.0
#32 0x00000037c96f6f7a in ?? () from /lib64/libpython2.7.so.1.0
#33 0x00000037c96f7d72 in PyRun_FileExFlags () from /lib64/libpython2.7.so.1.0
#34 0x00000037c96f878b in PyRun_SimpleFileExFlags () from /lib64/libpython2.7.so.1.0
#35 0x00000037c9709c52 in Py_Main () from /lib64/libpython2.7.so.1.0
#36 0x000000364ec21a05 in __libc_start_main () from /lib64/libc.so.6
#37 0x0000000000400721 in _start ()
```

Thanks Maciek, that's useful. Note the two different pids in the lock dump; if I'm reading it correctly, both are trying to access the same piece of data:

```
28f dd= 0 locks held 1  write locks 1  pid/thread 2738/140177905407808 flags 0  priority 100
28f WRITE  1 WAIT  0x1bee0 len: 20 data: a701020001080000c6382bd2d00e000000000000
28f IWRITE 1 HELD  0x1bee0 len: 20 data: a701020001080000c6382bd2d00e000000000000
[...]
```
```
29d dd= 0 locks held 1  write locks 0  pid/thread 3569/140274069424064 flags 0  priority 100
29d READ   1 HELD  0x1bee0 len: 20 data: a701020001080000c6382bd2d00e000000000000
```

Pid 2738 is yum here; the question is who the other fella is. It's something running as root, since it's participating in locking and is (obviously) reading the rpm database. That something is probably forgetting to close a db iterator, causing a hang-up until a new yum (or rpm) invocation spots and purges the (by now) stale lock, letting yum proceed.

Again, please enable auditing of rpmdb access. That will make it easy to see who has been doing what, even after the fact:

```
# echo "-w /var/lib/rpm/Packages -p war -k rpmdb" >> /etc/audit/audit.rules
# systemctl restart auditd.service
```

(Of course, if auditd is not installed or enabled, you need to do that first.)

Okay, one suspect found. In bug 907846 (likely a dupe, but not sure yet) the hang quite clearly occurs due to abrtd running while a transaction is in progress. Whether it's a bug in abrt or just abrt triggering a bug in rpm remains to be seen. Everybody seeing these hangs, please check your system logs around the times of the hangs and see if there are interleaved messages from yum and abrtd. Or not: that's valuable information too.

Come to think of it, your suspicion might make sense. It turns out that emacs has a tendency to crash frequently during upgrades due to bug 825098. The emacs bug occurs especially when a font package is changed/updated. Now, looking at the transaction history, I see that some fonts were updated when the problem happened; see the snippet below.

```
...
Updated    vlgothic-fonts-20121230-1.fc18.noarch        @updates
Obsoleting vlgothic-fonts-20121230-2.fc18.noarch        @updates
Obsoleted  vlgothic-fonts-common-20121230-1.fc18.noarch @updates
Obsoleting vlgothic-p-fonts-20121230-2.fc18.noarch      @updates
Updated    xemacs-filesystem-21.5.33-1.fc18.noarch      @updates
...
```
At this point abrt would normally step in once some process has crashed and try to figure out the package it belongs to. Miraculously, I do have a logged emacs crash that happened quite close (time-wise) to when the yum problem was observed and I added the comment. I guess there still might be a race condition involved: yum would get stuck only when a large update/install was performed. I also have not observed the problem on my other PC, which has a regular HDD (the one the problem was found on has an SSD). Still, correlation does not necessarily imply causation ;)

Created attachment 731324 [details]
snippet of /var/log/messages near a yum GUI freeze / yum command-line completion
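The blocked-writer situation seen in the lock dumps above can be sketched with a toy readers-writer lock. This is a conceptual analogy only, not BDB's actual lock implementation; the `RWLock` class and its method names are invented for illustration. A stale read lock held by one party keeps the writer's request in a WAIT state, which looks like a hang, until the read lock is released:

```python
import threading

# Toy readers-writer lock: readers share access, a writer must wait
# until no read locks remain (analogous to the WRITE ... WAIT entry
# blocked by a READ ... HELD entry in the db_stat dump).
class RWLock:
    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0

    def acquire_read(self):
        with self._cond:
            self._readers += 1

    def release_read(self):
        with self._cond:
            self._readers -= 1
            if self._readers == 0:
                self._cond.notify_all()

    def try_acquire_write(self, timeout):
        # True once no readers remain, False on timeout; the timeout
        # stands in for "the process appears hung".
        with self._cond:
            return self._cond.wait_for(lambda: self._readers == 0, timeout)

lock = RWLock()
lock.acquire_read()                      # e.g. a db iterator opened and never closed
stuck = not lock.try_acquire_write(0.1)  # writer waits: looks like a hang
lock.release_read()                      # stale read lock goes away
unstuck = lock.try_acquire_write(0.1)    # writer proceeds immediately
print(stuck, unstuck)                    # True True
```

Releasing the read lock here plays the role of the second yum/rpm invocation purging the stale locker ("BDB2053 Freeing read locks for locker ..."), which is why running another yum process unsticks the first one.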
*** Bug 907846 has been marked as a duplicate of this bug. ***

The bad boy here is indeed abrt (see bug 918184), but rpm should be more tolerant of such API user mistakes. Rpm in Fedora >= 19 can handle the situation without getting stuck, but keeping this open as a reminder to backport the fix to F18 too. Also reassigning to rpm; yum is nothing but a messenger here.

rpm-4.10.3.1-2.fc18 has been submitted as an update for Fedora 18. https://admin.fedoraproject.org/updates/rpm-4.10.3.1-2.fc18

Package rpm-4.10.3.1-2.fc18:
* should fix your issue,
* was pushed to the Fedora 18 testing repository,
* should be available at your local mirror within two days.

Update it with:

```
# su -c 'yum update --enablerepo=updates-testing rpm-4.10.3.1-2.fc18'
```

as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/FEDORA-2013-15331/rpm-4.10.3.1-2.fc18 then log in and leave karma (feedback).

rpm-4.10.3.1-2.fc18 has been pushed to the Fedora 18 stable repository. If problems still persist, please make note of it in this bug report.