Bug 1130245 - libdaemon: daemon_close() function doesn't actually close open connection to daemon-server
Summary: libdaemon: daemon_close() function doesn't actually close open connection to ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: lvm2
Version: 6.6
Hardware: Unspecified
OS: Unspecified
urgent
high
Target Milestone: pre-dev-freeze
: 6.7
Assignee: Petr Rockai
QA Contact: cluster-qe@redhat.com
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2014-08-14 15:42 UTC by Marian Csontos
Modified: 2015-07-22 07:35 UTC (History)
11 users (show)

Fixed In Version: lvm2-2.02.118-1.el6
Doc Type: Bug Fix
Doc Text:
Cause: Missing lvmetad socket release after lvmetad use in LVM2 if global/use_lvmetad=1 is configured. Consequence: LVM leaking open sockets and lvmetad keeping threads for existing sockets. Fix: Lvmetad sockets are always closed properly now, releasing unused resources both on LVM client side and lvmetad side. Result: LVM2 is not leaking open lvmetad sockets and lvmetad frees unused threads which previously listened on the other side of the socket.
Clone Of:
Environment:
Last Closed: 2015-07-22 07:35:06 UTC


Attachments (Terms of Use)


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:1411 normal SHIPPED_LIVE lvm2 bug fix and enhancement update 2015-07-20 18:06:52 UTC
Red Hat Bugzilla 1164234 None None None Never

Internal Links: 1164234

Description Marian Csontos 2014-08-14 15:42:24 UTC
Description of problem:
Resource leaks in dmeventd and lvmetad - looks like:

1. sockets (to lvmetad) in dmeventd are not closed when thread is done
2. resulting in many threads (and sockets) in lvmetad left behind.

Version-Release number of selected component (if applicable):
kernel-2.6.32-494.el6.x86_64
lvm2-2.02.109-0.30.el6.x86_64 (mostly the same as 2.02.109 in RHEL6.6)
udev-147-2.57.el6.x86_64

How reproducible:
???

Steps to Reproduce:
???

RAID repair policy is set to "warn".

This were series of tests creating RAID LVs, offlining leg(s), `pvscan --cache`
them and manually repairing.

Additional info:
After a while there are dozens of threads in lvmetad each waiting on a socket:

    (gdb) bt full
    #0  0x00007f7dde67c75d in read () at ../sysdeps/unix/syscall-template.S:82
    No locals.
    #1  0x00007f7ddf11de9d in buffer_read (fd=5, buffer=0x7f7ddde7cde8) at daemon-io.c:38
            result = 32637
    #2  0x00007f7ddf11c9df in client_thread (state=0x7f7de07dd990) at daemon-server.c:421
            ts = 0x7f7de07dd990
            req = {cft = 0x7f7dd8015970, buffer = {allocated = 32, used = 0, mem = 0x7f7dd80094d0 ""}}
            res = {error = 0, cft = 0x0, buffer = {allocated = 0, used = 0, mem = 0x0}}
    #3  0x00007f7dde6759d1 in start_thread (arg=0x7f7ddde7d700) at pthread_create.c:301
            __res = <value optimized out>
            pd = 0x7f7ddde7d700
            now = <value optimized out>
            unwind_buf = {cancel_jmp_buf = {{jmp_buf = {140178570598144, 8824222711490556402, 140178581136224, 140178570598848, 0, 3, -8895079845552967182, -8895080941217369614}, mask_was_saved = 0}}, priv = {pad = {0x0, 0x0, 
                  0x0, 0x0}, data = {prev = 0x0, cleanup = 0x0, canceltype = 0}}}
            not_first_call = <value optimized out>
            pagesize_m1 = <value optimized out>
            sp = <value optimized out>
            freesize = <value optimized out>

dmeventd has many sockets opened, but only few threads and I assume it is not cleaning up.

Comment 1 Ondrej Kozina 2014-11-28 13:38:34 UTC
Well, I just found damon_close() function in libdaemon actually doesn't close open connections to daemons based on libdaemon. Perhaps this is not bug in dmeventd but rather in other layer...

Comment 2 Alasdair Kergon 2014-11-28 17:46:09 UTC
Many tests.  Can you sort out some easier and specific way to reproduce this and perhaps work out just where this is going wrong?  Attach to dmeventd too and work out exactly where its threads are 'stuck'?

"many sockets opened" - what does this look like?  Are there other processes still present?

In other words, how exactly is "daemon_close failing to close the file descriptor" connected to these threads in lvmetad?  Which code - in a dmeventd plugin? - is actually leading to this?

I don't think the comments on this bug are giving us the full analysis of this problem yet.  (My attempts to reproduce it based purely on the statement that 'daemon_close does not close the socket' have failed - there must be a bit more to this.)

Comment 3 Alasdair Kergon 2014-11-28 21:52:07 UTC
The fix in bug 1164234 deals with fixing daemon_close and is likely to be a necessary patch for this, but we still need to determine precisely how that is connected to this bug which is about dmeventd/lvmetad and whether there is a further problem or not.

Comment 5 Petr Rockai 2015-03-03 09:11:37 UTC
As far as I can tell, this has been caused by the (trivial) omission of a close() call in daemon_close() and fixed by Alasdair's commit 6521c4b2156597a36a986def85449be734336c72 back in November. I think all we are waiting for is a confirmation from Marian that the bug is really gone?

Comment 6 Marian Csontos 2015-03-04 00:31:44 UTC
The bug is not gone yet. I have seen a process with 200+ FDs. No details yet. I will post a reproducer.

Comment 7 Marian Csontos 2015-03-04 16:54:11 UTC
Sorry folks, I am under time pressure today, so I will check the details tomorrow.

The following test results in 200+ FDs open in dmeventd:

revolution_9 -o MASTERNODE -l STS_LROOT -r STS_RROOT -j 1 -i 1 -w EXT -F -O $COLLIE_OUTPUTDIR

I will check it once I have a permanent connection.

Comment 8 Petr Rockai 2015-03-18 10:06:28 UTC
A few things:
1) how do you check for open file descriptors?
2) are there any active monitoring threads in dmeventd at the time of your check?
3) which dmeventd plugin(s) do you use?

Because: I definitely cannot reproduce this with mirrors. I create a mirror, start/stop monitoring it a few times, then start monitoring a bunch of mirrors at the same time, then fail some, wait for the actions. I have also tried creating a mirror, and in a loop, setting up monitoring, causing a fault, waiting for repair, re-extending the mirror, shutting down monitoring and starting over. None of those cause any file descriptor leaks.

The only thing that I noticed that might be confusing is that lsof lists each FD again for each active thread, so if you have 20 dmeventd threads active, you will see the same socket FD 20 times in the output. Check the FD column of lsof output, not the number of lines, if lsof is how you check for FDs.

It is possible that you are seeing an entirely different bug, though. Do you still also see the lvmetad threads as in the original description? Could you provide a dump of dmeventd's state from gdb? ("thread apply all bt full" or such)

Comment 9 Marian Csontos 2015-03-24 09:25:16 UTC
I run STS with lvmetad and though there is one scenario where number of open FDs in dmeventd is larger than expected, there are no dangling threads/FDs anymore. So the resource leak is fixed.

The false alarm was caused by the fact dmeventd continues to monitor LVs even after the PVs are removed, which may be a bug too.

Comment 10 Petr Rockai 2015-03-24 09:43:48 UTC
So this bug was indeed fixed by 6521c4b2156597a36a986def85449be734336c72. Moving to POST, please open new bugs for other issues.

Comment 12 Nenad Peric 2015-04-14 10:57:13 UTC
Tested this as suggested by running a simple run of revolution tests.
There are no lvm sockets/FDs hanging after the tests are done. 
So I will mark this patch verified. 

However, there are hundreds of FDs left for dmeventd

<---snip--->

dmeventd  19103      root  390u      CHR              10,58      0t0       6206 /dev/mapper/control
dmeventd  19103      root  391u      CHR              10,58      0t0       6206 /dev/mapper/control
dmeventd  19103      root  392u      CHR              10,58      0t0       6206 /dev/mapper/control
dmeventd  19103      root  393u      CHR              10,58      0t0       6206 /dev/mapper/control
dmeventd  19103      root  394u      CHR              10,58      0t0       6206 /dev/mapper/control
dmeventd  19103      root  395u      CHR              10,58      0t0       6206 /dev/mapper/control
dmeventd  19103      root  397u      CHR              10,58      0t0       6206 /dev/mapper/control
dmeventd  19103      root  398u      CHR              10,58      0t0       6206 /dev/mapper/control
dmeventd  19103      root  399u      CHR              10,58      0t0       6206 /dev/mapper/control
dmeventd  19103      root  400u      CHR              10,58      0t0       6206 /dev/mapper/control

<---snip--->


Will open a new bug for this one, and will post a comment here with the new bug number.

Marking this one VERIFIED with:

lvm2-2.02.118-1.el6.x86_64

Comment 13 Nenad Peric 2015-04-14 11:04:30 UTC
Opened a bug for the new dmeventd problem.

It is under Bug 1211575.

Comment 14 errata-xmlrpc 2015-07-22 07:35:06 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-1411.html


Note You need to log in before you can comment on or make changes to this bug.