Bug 1255762 - abrt hooks are stuck due to broken lock link [NEEDINFO]
Status: CLOSED WONTFIX
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: abrt
Version: 6.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: abrt
QA Contact: BaseOS QE - Apps
Docs Contact:
Duplicates: 1277254 (view as bug list)
Depends On:
Blocks: 1277547 1457986
Reported: 2015-08-21 09:37 EDT by Marek Mahut
Modified: 2017-06-01 12:59 EDT
CC List: 6 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1457986 (view as bug list)
Environment:
Last Closed: 2017-05-31 18:39:41 EDT
Type: Bug
Regression: ---
mhabrnal: needinfo? (mmahut)


Attachments
console log from troubleshooting (15.83 KB, text/plain), 2015-08-21 09:38 EDT, Marek Mahut

Description Marek Mahut 2015-08-21 09:37:37 EDT
Description of problem:

On one of our servers, abrt often spawns a large number of hook processes, all of which are stuck waiting on what appears to be a dangling '.lock' symlink.


Version-Release number of selected component (if applicable):

abrt-addon-kerneloops-2.0.8-34.el6.x86_64
abrt-2.0.8-34.el6.x86_64
abrt-tui-2.0.8-34.el6.x86_64
abrt-cli-2.0.8-34.el6.x86_64
abrt-libs-2.0.8-34.el6.x86_64
abrt-addon-python-2.0.8-34.el6.x86_64
abrt-addon-ccpp-2.0.8-34.el6.x86_64
abrt-python-2.0.8-34.el6.x86_64


How reproducible:

It happens on the server once every few days. I have no way to reproduce it step by step. After I remove the dangling '.lock' symlink, the hooks are processed.
Comment 1 Marek Mahut 2015-08-21 09:38:01 EDT
Created attachment 1065618 [details]
console log from troubleshooting
Comment 3 Jakub Filak 2015-08-24 07:05:49 EDT
Thank you for the report. The lock symlink is not supposed to point to an existing file; its target is the PID of the process that created it. ABRT stores the information about each detected problem as regular files in a single directory, so each directory represents a single detected problem. When an abrt tool wants to modify a problem directory, it must lock the directory by creating the '.lock' symlink. If the lock symlink already exists but the process that created it no longer exists, the tool trying to lock the directory removes the dangling lock symlink and creates its own '.lock' symlink. However, it can happen that the lock symlink holds the PID of some other process running on the system, because PID values may have wrapped around.

In your case, the lock symlink was pointing to the process with PID=377833. Unfortunately, it is not possible to identify that process from the attached logs.

If the PID=377833 was an abrt process, we need to find out why the process was hanging.

If the PID=377833 was not an abrt process, we need to find out why the process that created the link did not also remove it. Maybe the process crashed, or maybe it was killed by some other mechanism.

Full system logs could help us to investigate the root cause of your issue.
Comment 4 Andy Grimm 2015-11-03 09:17:58 EST
*** Bug 1277254 has been marked as a duplicate of this bug. ***
Comment 5 Joel Smith 2016-01-26 12:13:43 EST
I've been looking at this too, and here's what I've found. It looks like abrtd is stuck waiting for a lock that it holds itself.

> [root@ex-std-node818.prod log]# ps -p 6294 -o lstart,cmd
>                  STARTED CMD
> Thu Dec 17 22:56:24 2015 /usr/sbin/abrtd
> [root@ex-std-node818.prod log]# date
> Tue Jan 26 11:52:12 EST 2016

So from Dec 17 until today, PID 6294 has been abrtd. Everything was working fine (and I'll include one set of entries from the previous successful abrt collection) until we ran into a too-large dump.

> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Generating core_backtrace
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Duplicate: core backtrace
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: DUP_OF_DIR: /var/spool/abrt/ccpp-2015-12-30-08:05:46-469255
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Deleting problem directory ccpp-2015-12-30-22:37:37-70562 (dup of ccpp-2015-12-30-08:05:46-469255)
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Sending an email...
> Dec 30 22:38:00 ex-std-node818.prod.rhcloud.com abrtd: Email was sent to: root@localhost
> Dec 30 22:42:01 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-22:41:53-79117' creation detected
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-22:42:13-79732' creation detected
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-30-22:41:53-79117'
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove 'init.d': Directory not empty
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove 'rc.d': Directory not empty
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove 'etc': Directory not empty
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove 'sosreport-ex-std-node818.prod.rhcloud.com-20151230224204': Directory not empty
> Dec 30 22:42:16 ex-std-node818.prod.rhcloud.com abrtd: Can't remove contents of directory '/var/spool/abrt/ccpp-2015-12-30-22:41:53-79117': Directory not empty
> Dec 30 22:42:18 ex-std-node818.prod.rhcloud.com abrtd: Lock file '.lock' is locked by process 6294

and then this message is repeated 64,748 times. Interspersed among those "Lock file" messages are a few like these:

> Dec 30 22:56:40 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-22:56:37-133085' creation detected
> Dec 30 23:00:34 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-23:00:30-137159' creation detected
> Dec 30 23:00:34 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-30-20:52:59-207339'
> Dec 30 23:02:01 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-23:01:56-137040' creation detected
> Dec 30 23:02:01 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-30-21:10:57-388324'
> Dec 30 23:06:55 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-30-23:06:52-142114' creation detected
> Dec 30 23:06:55 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-30-15:17:02-447720'

Although the daemon is still running, we haven't had any log messages from it since this final one:

> Dec 31 07:42:06 ex-std-node818.prod.rhcloud.com abrtd: Directory 'ccpp-2015-12-31-07:42:04-299520' creation detected
> Dec 31 07:42:06 ex-std-node818.prod.rhcloud.com abrtd: Size of '/var/spool/abrt' >= 1000 MB, deleting 'ccpp-2015-12-31-06:59:06-193063'

And now the other aborts are piling up.

> [root@ex-std-node818.prod ~]# ps -ef --sort=start_time | grep abrt | grep -v grep
> root       6294      1  0  2015 ?        00:07:15 /usr/sbin/abrtd
> root       6304      1  0  2015 ?        00:00:36 abrt-dump-oops -d /var/spool/abrt -rwx /var/log/messages
> root      82842   6294  0  2015 ?        00:00:06 /usr/libexec/abrt-handle-event -e post-create -- /var/spool/abrt/ccpp-2015-12-30-22:41:53-79117
> root      87111     36  0  2015 ?        00:03:31 /usr/libexec/abrt-hook-ccpp 11 0 78468 1409 1409 1451574470 httpd
> root      90428     36  0  2015 ?        00:03:32 /usr/libexec/abrt-hook-ccpp 11 0 88289 1409 1409 1451574526 httpd
> root     129748     36  0  2015 ?        00:03:32 /usr/libexec/abrt-hook-ccpp 11 0 129125 1409 1409 1451575148 httpd
> root     427286     36  0  2015 ?        00:03:33 /usr/libexec/abrt-hook-ccpp 11 0 426202 1409 1409 1451581635 httpd
> root     139218     36  0  2015 ?        00:03:31 /usr/libexec/abrt-hook-ccpp 11 0 136141 1409 1409 1451599667 httpd
> root     332045     36  0  2015 ?        00:03:29 /usr/libexec/abrt-hook-ccpp 11 0 329204 1409 1409 1451603172 httpd
> root     344639     36  0  2015 ?        00:03:31 /usr/libexec/abrt-hook-ccpp 11 0 329828 1409 1409 1451603394 httpd
> root     373826     36  0  2015 ?        00:03:32 /usr/libexec/abrt-hook-ccpp 11 0 372914 1409 1409 1451603528 httpd
> root     302027     36  0 Jan06 ?        00:02:44 /usr/libexec/abrt-hook-ccpp 11 0 253200 1546 1546 1452097238 httpd
> root     174669     36  0 Jan07 ?        00:02:36 /usr/libexec/abrt-hook-ccpp 11 0 140973 1546 1546 1452183639 httpd
> root     447824     36  0 Jan12 ?        00:01:58 /usr/libexec/abrt-hook-ccpp 11 0 421065 1546 1546 1452615638 httpd
> root     273303     36  0 Jan25 ?        00:00:10 /usr/libexec/abrt-hook-ccpp 6 0 273234 1987 1987 1453724037 cc

If I strace the abrt-handle-event process, it shows that it's waiting on a lock held by abrtd:

> [root@ex-std-node818.prod ~]# timeout 1 strace -s 4096 -p 82842
> Process 82842 attached
> write(2, "Lock file '.lock' is locked by process 6294\n", 44Process 82842 detached
> <detached ...>

I haven't looked at the code, but I suspect a bug in the code path that handles "Size of '/var/spool/abrt' >= 1000 MB, deleting".

I'm happy to provide more data points, a core dump of such an abrtd, or whatever else you might find helpful.
Comment 7 Jakub Filak 2016-02-08 12:43:40 EST
I believe this bug has the same root cause as bug #1245893. I will try to backport the synchronization fix from upstream and provide you with a scratch build so that you can verify my assumption.
Comment 8 Matej Habrnal 2016-02-19 09:29:12 EST
Hi,

there is a fix for bug #1245893, and it should also fix this issue.
I've created a scratch build which ships the fix:
https://brewweb.devel.redhat.com/taskinfo?taskID=10527510

Could you please verify whether it also fixes this bug?

Thanks
Comment 11 Chris Williams 2017-05-31 18:39:41 EDT
Red Hat Enterprise Linux 6 transitioned to the Production 3 Phase on May 10, 2017.  During the Production 3 Phase, Critical impact Security Advisories (RHSAs) and selected Urgent Priority Bug Fix Advisories (RHBAs) may be released as they become available.

The official life cycle policy can be reviewed here:

http://redhat.com/rhel/lifecycle

This issue does not appear to meet the inclusion criteria for the Production 3 Phase and will be marked as CLOSED/WONTFIX. If this remains a critical requirement, please contact Red Hat Customer Support to request a re-evaluation of the issue, citing a clear business justification. Red Hat Customer Support can be contacted via the Red Hat Customer Portal at the following URL:

https://access.redhat.com
