Bug 1255762

Summary: abrt hooks are stuck due to broken lock link
Product: Red Hat Enterprise Linux 6
Component: abrt
Version: 6.7
Status: CLOSED WONTFIX
Severity: medium
Priority: unspecified
Reporter: Marek Mahut <mmahut>
Assignee: abrt <abrt-devel-list>
QA Contact: BaseOS QE - Apps <qe-baseos-apps>
CC: cww, jberan, jgoulding, joelsmith, mhabrnal, mmahut
Target Milestone: rc
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2017-05-31 22:39:41 UTC
Cloned As: 1457986
Bug Blocks: 1277547
Attachments: console log from troubleshooting

Description Marek Mahut 2015-08-21 13:37:37 UTC
Description of problem:

On one of our servers, abrt often accumulates a large number of hook processes that are all waiting on what appears to be a broken '.lock' symlink.


Version-Release number of selected component (if applicable):

abrt-addon-kerneloops-2.0.8-34.el6.x86_64
abrt-2.0.8-34.el6.x86_64
abrt-tui-2.0.8-34.el6.x86_64
abrt-cli-2.0.8-34.el6.x86_64
abrt-libs-2.0.8-34.el6.x86_64
abrt-addon-python-2.0.8-34.el6.x86_64
abrt-addon-ccpp-2.0.8-34.el6.x86_64
abrt-python-2.0.8-34.el6.x86_64


How reproducible:

It happens on the server once every few days; I have no way to reproduce it step by step. After I remove the broken '.lock' symlink, the pending hooks are processed.

Comment 1 Marek Mahut 2015-08-21 13:38:01 UTC
Created attachment 1065618 [details]
console log from troubleshooting

Comment 3 Jakub Filak 2015-08-24 11:05:49 UTC
Thank you for the report. The lock symlink is not supposed to point to an existing file; its target is the PID of the process that created it. ABRT stores information about each detected problem as regular files inside a dedicated directory, so each directory represents a single detected problem. When an abrt tool wants to modify a problem directory, it must first lock the directory by creating the '.lock' symlink. If the lock symlink already exists but the process that created it no longer exists, the tool trying to lock the directory removes the dangling lock symlink and creates its own '.lock' symlink. However, it can happen that the lock symlink holds the PID of some unrelated process running on the system, because PID values may have wrapped around.
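
For illustration, here is a minimal sketch of that locking scheme; this is not abrt's actual code, and the helper name try_lock_dir() and its return convention are assumptions made for the example:

#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns 1 if the lock was acquired, 0 if the caller should retry,
 * -1 on an unexpected error. */
static int try_lock_dir(const char *lock_path)
{
    char pid_buf[32];
    snprintf(pid_buf, sizeof(pid_buf), "%lu", (unsigned long)getpid());

    /* The symlink target is just the PID string; it is not meant to
     * point at an existing file, so the link is always dangling. */
    if (symlink(pid_buf, lock_path) == 0)
        return 1;                       /* lock acquired */
    if (errno != EEXIST)
        return -1;                      /* unexpected error */

    /* Someone already holds the lock: read the PID it recorded. */
    char target[32];
    ssize_t len = readlink(lock_path, target, sizeof(target) - 1);
    if (len < 0)
        return 0;                       /* lock vanished; retry */
    target[len] = '\0';

    pid_t owner = (pid_t)atol(target);
    if (owner > 0 && kill(owner, 0) != 0 && errno == ESRCH) {
        /* The recorded owner no longer exists: remove the dangling
         * link and let the caller retry.  Note the PID-reuse hazard
         * described above: an unrelated process may have been given
         * this PID in the meantime. */
        unlink(lock_path);
        return 0;
    }
    return 0;                           /* still locked; retry later */
}

A caller would typically loop over try_lock_dir() with a short delay until it returns 1, which is exactly the kind of wait that never ends when the lock is held by a process that will not release it (see comment 5).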

In your case the lock symlink was pointing to the process with PID=377833. Unfortunately it is not possible to find the process in the attached logs.

If the PID=377833 was an abrt process, we need to find out why the process was hanging.

If the PID=377833 was not an abrt process, we need to find out why the process that created the link did not also remove it. Maybe the process crashed, maybe the process was killed by some other mechanism.

Full system logs could help us to investigate the root cause of your issue.

Comment 4 Andy Grimm 2015-11-03 14:17:58 UTC
*** Bug 1277254 has been marked as a duplicate of this bug. ***

Comment 5 Joel Smith 2016-01-26 17:13:43 UTC
I've been looking at this too and here's what I've found. It looks like abrtd is stuck waiting for a lock that it holds itself.

> [root log]# ps -p 6294 -o lstart,cmd
>                  STARTED CMD
> Thu Dec 17 22:56:24 2015 /usr/sbin/abrtd
> [root log]# date
> Tue Jan 26 11:52:12 EST 2016

So from Dec 17 until today, PID 6294 has been abrtd. Everything was working fine (I'll include one set of entries from the previous successful abrt collection) until we ran into a dump that was too large.

> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Generating core_backtrace
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Duplicate: core backtrace
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2015-12-30-08:05:46-469255
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Deleting problem directory ccpp-2015-12-30-22:37:37-70562 (dup of ccpp-2015-12-30-08:05:46-469255)
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Sending an email...
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Email was sent to: root@localhost
> Dec 30 22:42:01 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-22:41:53-79117' creation detected
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-22:42:13-79732' creation detected
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-30-22:41:53-79117'
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove 'init.d': Directory not empty
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove 'rc.d': Directory not empty
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove 'etc': Directory not empty
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove 'sosreport-ex-std-node818.prod.rhcloud.com-20151230224204': Directory not empty
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove contents of directory '/var/spool/abrt/ccpp-2015-12-30-22:41:53-79117': Directory not empty
> Dec 30 22:42:18 ex-std-node818.prod.rhcloud.com abrtd: Lock file '.lock' is locked by process 6294

and then this message is repeated 64,748 times. Interspersed among those "Lock file" messages are a few like these:

> Dec 30 22:56:40 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-22:56:37-133085' creation detected
> Dec 30 23:00:34 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-23:00:30-137159' creation detected
> Dec 30 23:00:34 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-30-20:52:59-207339'
> Dec 30 23:02:01 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-23:01:56-137040' creation detected
> Dec 30 23:02:01 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-30-21:10:57-388324'
> Dec 30 23:06:55 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-23:06:52-142114' creation detected
> Dec 30 23:06:55 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-30-15:17:02-447720'

Although the daemon is still running, we haven't had any log messages from it since this final one:

> Dec 31 07:42:06 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-31-07:42:04-299520' creation detected
> Dec 31 07:42:06 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-31-06:59:06-193063'

And now the other abrt hooks are piling up.

> [root ~]# ps -ef --sort=start_time | grep abrt | grep -v grep
> root       6294      1  0  2015 ?        00:07:15 /usr/sbin/abrtd
> root       6304      1  0  2015 ?        00:00:36 abrt-dump-oops -d /var/spool/abrt -rwx /var/log/messages
> root      82842   6294  0  2015 ?        00:00:06 /usr/libexec/abrt-handle-event -e post-create -- /var/spool/abrt/ccpp-2015-12-30-22:41:53-79117
> root      87111     36  0  2015 ?        00:03:31 /usr/libexec/abrt-hook-ccpp 11 0 78468 1409 1409 1451574470 httpd
> root      90428     36  0  2015 ?        00:03:32 /usr/libexec/abrt-hook-ccpp 11 0 88289 1409 1409 1451574526 httpd
> root     129748     36  0  2015 ?        00:03:32 /usr/libexec/abrt-hook-ccpp 11 0 129125 1409 1409 1451575148 httpd
> root     427286     36  0  2015 ?        00:03:33 /usr/libexec/abrt-hook-ccpp 11 0 426202 1409 1409 1451581635 httpd
> root     139218     36  0  2015 ?        00:03:31 /usr/libexec/abrt-hook-ccpp 11 0 136141 1409 1409 1451599667 httpd
> root     332045     36  0  2015 ?        00:03:29 /usr/libexec/abrt-hook-ccpp 11 0 329204 1409 1409 1451603172 httpd
> root     344639     36  0  2015 ?        00:03:31 /usr/libexec/abrt-hook-ccpp 11 0 329828 1409 1409 1451603394 httpd
> root     373826     36  0  2015 ?        00:03:32 /usr/libexec/abrt-hook-ccpp 11 0 372914 1409 1409 1451603528 httpd
> root     302027     36  0 Jan06 ?        00:02:44 /usr/libexec/abrt-hook-ccpp 11 0 253200 1546 1546 1452097238 httpd
> root     174669     36  0 Jan07 ?        00:02:36 /usr/libexec/abrt-hook-ccpp 11 0 140973 1546 1546 1452183639 httpd
> root     447824     36  0 Jan12 ?        00:01:58 /usr/libexec/abrt-hook-ccpp 11 0 421065 1546 1546 1452615638 httpd
> root     273303     36  0 Jan25 ?        00:00:10 /usr/libexec/abrt-hook-ccpp 6 0 273234 1987 1987 1453724037 cc

If I strace the abrt-handle-event process, it shows that it's waiting on a lock held by abrtd:

> [root ~]# timeout 1 strace -s 4096 -p 82842
> Process 82842 attached
> write(2, "Lock file '.lock' is locked by process 6294\n", 44Process 82842 detached
> <detached ...>

I haven't looked at the code, but I suspect a bug in the code path that handles "Size of '/var/spool/abrt' >= 1000 MB, deleting".
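
To illustrate the suspicion, here is a hypothetical sketch of a blocking wrapper around the try_lock_dir() sketch from comment 3 (lock_dir_blocking() is an assumed name, not abrt's API). If the waiting loop never checks whether the lock's recorded owner is the current process, a code path that tries to lock a directory it already holds spins forever, which would match the endlessly repeated "Lock file '.lock' is locked by process 6294" messages:

#include <unistd.h>

/* Hypothetical blocking lock built on try_lock_dir() (see comment 3).
 * Not abrt's actual code. */
static void lock_dir_blocking(const char *lock_path)
{
    for (;;) {
        if (try_lock_dir(lock_path) == 1)
            return;                     /* acquired */
        /* Missing here: a check whether the lock's recorded owner is
         * getpid().  Without it, a code path that re-enters locking
         * while the same process already holds the '.lock' symlink
         * waits on itself forever. */
        sleep(1);
    }
}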

I'm happy to provide more data points, a core dump of such an abrtd, or whatever else you might find helpful.

Comment 7 Jakub Filak 2016-02-08 17:43:40 UTC
I believe this bug has the same root cause as bug #1245893. I will try to backport the synchronization fix from upstream and provide you with a scratch build to verify my assumption.

Comment 8 Matej Habrnal 2016-02-19 14:29:12 UTC
Hi,

There is a fix for bug #1245893, and it should also fix this issue.
I've created a scratch build which ships the fix:
https://brewweb.devel.redhat.com/taskinfo?taskID=10527510

Could you please verify whether it also fixes this bug?

Thanks

Comment 11 Chris Williams 2017-05-31 22:39:41 UTC
Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017.  During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not appear to meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com

Comment 13 Red Hat Bugzilla 2023-09-14 03:04:05 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.