Bug 1095585 - Two instances each, of brick processes, glusterfs-nfs and quotad seen after glusterd restart
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: glusterd
Version: mainline
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Assignee: Kaushal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1092429 1105188
 
Reported: 2014-05-08 05:16 UTC by Kaushal
Modified: 2014-11-11 08:32 UTC
CC: 9 users

Fixed In Version: glusterfs-3.6.0beta1
Doc Type: Bug Fix
Doc Text:
Clone Of: 1092429
: 1105188 (view as bug list)
Environment:
Last Closed: 2014-11-11 08:32:00 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kaushal 2014-05-08 05:16:01 UTC
+++ This bug was initially created as a clone of Bug #1092429 +++

Description of problem:
------------------------

Following a glusterd restart, ps reports two instances each of the brick processes, glusterfs-nfs and quotad. For example:

-------------------------------------------------------------------------

[root@rhs-client25 gluster]# pgrep gluster -fl
16074 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
16284 /usr/sbin/glusterfsd -s rhs-client25 --volfile-id dis_vol.rhs-client25.rhs-brick1-b1 -p /var/lib/glusterd/vols/dis_vol/run/rhs-client25-rhs-brick1-b1.pid -S /var/run/3df964b178012d04c3a29c339b2465a3.socket --brick-name /rhs/brick1/b1 -l /var/log/glusterfs/bricks/rhs-brick1-b1.log --xlator-option *-posix.glusterd-uuid=6e95ce9b-3453-4d21-9510-64a50dd6cf12 --brick-port 49152 --xlator-option dis_vol-server.listen-port=49152
16285 /usr/sbin/glusterfsd -s rhs-client25 --volfile-id dis_vol.rhs-client25.rhs-brick1-b1 -p /var/lib/glusterd/vols/dis_vol/run/rhs-client25-rhs-brick1-b1.pid -S /var/run/3df964b178012d04c3a29c339b2465a3.socket --brick-name /rhs/brick1/b1 -l /var/log/glusterfs/bricks/rhs-brick1-b1.log --xlator-option *-posix.glusterd-uuid=6e95ce9b-3453-4d21-9510-64a50dd6cf12 --brick-port 49152 --xlator-option dis_vol-server.listen-port=49152
16290 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/8800fd4a270a082e891be2336fcb0e7f.socket
16291 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/8800fd4a270a082e891be2336fcb0e7f.socket
16295 /usr/sbin/glusterfs -s localhost --volfile-id gluster/quotad -p /var/lib/glusterd/quotad/run/quotad.pid -l /var/log/glusterfs/quotad.log -S /var/run/d12f45551ebf86f8dbd7a1a5977970e9.socket --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off
16297 /usr/sbin/glusterfs -s localhost --volfile-id gluster/quotad -p /var/lib/glusterd/quotad/run/quotad.pid -l /var/log/glusterfs/quotad.log -S /var/run/d12f45551ebf86f8dbd7a1a5977970e9.socket --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off


The gdb backtrace for the two brick processes is below - 

(gdb) bt
#0  0x0000003a7e80e740 in __read_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000404c0d in read (ctx=0x13bd010) at /usr/include/bits/unistd.h:45
#2  daemonize (ctx=0x13bd010) at glusterfsd.c:1810
#3  0x0000000000407287 in main (argc=19, argv=0x7fff2f4be4c8) at glusterfsd.c:1979
(gdb) q

--------------------------------------------------------------------------

(gdb) bt
#0  0x0000003a7e4e9163 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003a7f86adb7 in event_dispatch_epoll (event_pool=0x13d8ee0) at event-epoll.c:428
#2  0x00000000004072ca in main (argc=19, argv=0x7fff2f4be4c8) at glusterfsd.c:1994
(gdb) q


Version-Release number of selected component (if applicable):
glusterfs-3.5qa2-0.294.git00802b3.el6rhs.x86_64

How reproducible:
Intermittent.

Steps to Reproduce:
Cannot give clear steps for reproducing the issue.

Actual results:
Following a glusterd restart, two instances of each of the above-mentioned processes are seen.

Expected results:
There should only be one instance of each of these processes.

Additional info:


--- Additional comment from RamaKasturi on 2014-05-06 15:32:51 IST ---

Hi,

   I saw a similar issue in my setup as well. Here are the steps to reproduce:

1) Create a distributed volume.

2) Enable quota on the volume by running the command "gluster vol quota <volname> enable".

3) Now stop glusterd and start it again.

The gluster CLI now hangs while trying to run the command "gluster v i" (i.e. "gluster volume info").

Attaching the sos reports for the same.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1092429/

--- Additional comment from RamaKasturi on 2014-05-06 15:35:47 IST ---

1) Not able to test the Quota process plugin in Nagios.

2) When quota is enabled on a volume, not able to test any other process plugins, i.e. glusterd, shd and nfs.

--- Additional comment from Kaushal on 2014-05-06 16:18:35 IST ---

This happens because glusterd enters a deadlock, caused by quotad being started with 'runner_run', which blocks the calling thread. When glusterd is starting up, this happens on the epoll thread, so the epoll thread gets blocked. With the epoll thread blocked, glusterd is not able to serve any requests, including cli commands and volfile fetch requests.

Daemonization of the glusterfs daemons happens in the following way:
* glusterd starts the parent process
* the parent process forks and creates a child daemon process
* the daemon process fetches the volfile, initializes, and reports back to the parent on success
* the parent process exits once the child has reported back

In this case the child process cannot fetch the volfile, so it never reports back to the parent process. This leaves two instances of each process, neither of which works. In the case of quotad it becomes a deadlock: runner_run blocks the epoll thread, but because the quotad daemon process cannot get the volfile from the blocked glusterd, it keeps the parent process waiting, which in turn keeps runner_run blocked in glusterd.
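
For illustration, below is a minimal, self-contained sketch of the fork-and-handshake pattern described above. This is not the actual daemonize() code in glusterfsd.c, and the helper name fetch_volfile_and_init is made up for the example; it only shows why a child that never manages to fetch its volfile leaves the parent stuck in read(), matching the read()/daemonize() frames in the backtrace earlier in this report. Running it leaves two lingering processes: the parent blocked in read() and the child blocked waiting for a volfile that never arrives.

--------------------------------------------------------------------------

/* Sketch only: fork-and-handshake daemonization.  The parent blocks in
 * read() on a pipe until the child reports that initialization (e.g. the
 * volfile fetch) succeeded.  If the child never reports back, the parent
 * stays in read() and two processes linger. */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

static int fetch_volfile_and_init(void)
{
    /* Hypothetical stand-in for "fetch the volfile from glusterd and
     * initialize".  To mimic this bug, pretend glusterd never answers:
     * block forever, the way the real child blocks on the fetch. */
    for (;;)
        pause();
    return 0; /* not reached */
}

int main(void)
{
    int pipefd[2];
    char status = 1;

    if (pipe(pipefd) == -1) {
        perror("pipe");
        return 1;
    }

    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }

    if (pid > 0) {
        /* Parent: block until the child daemon reports success. */
        close(pipefd[1]);
        ssize_t n = read(pipefd[0], &status, 1);  /* parent is stuck here */
        close(pipefd[0]);
        return (n == 1 && status == 0) ? 0 : 1;
    }

    /* Child (the actual daemon): initialize, then signal the parent. */
    close(pipefd[0]);
    if (fetch_volfile_and_init() == 0) {
        status = 0;
        if (write(pipefd[1], &status, 1) != 1)
            perror("write");
    }
    close(pipefd[1]);
    /* ... the daemon's main loop would continue here ... */
    return 0;
}

--------------------------------------------------------------------------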

Comment 1 Anand Avati 2014-05-08 05:16:34 UTC
REVIEW: http://review.gluster.org/7703 (glusterd: On gaining spawn_daemons using a synctask) posted (#1) for review on master by Kaushal M (kaushal)

Comment 2 Anand Avati 2014-05-08 05:32:53 UTC
REVIEW: http://review.gluster.org/7703 (glusterd: On gaining quorum spawn_daemons in new thread) posted (#2) for review on master by Kaushal M (kaushal)

Comment 3 Anand Avati 2014-05-09 08:51:26 UTC
REVIEW: http://review.gluster.org/7703 (glusterd: On gaining quorum spawn_daemons in new thread) posted (#3) for review on master by Kaushal M (kaushal)

Comment 4 Anand Avati 2014-05-12 04:26:32 UTC
REVIEW: http://review.gluster.org/7703 (glusterd: On gaining quorum spawn_daemons in new thread) posted (#4) for review on master by Kaushal M (kaushal)

Comment 5 Anand Avati 2014-05-12 10:33:45 UTC
COMMIT: http://review.gluster.org/7703 committed in master by Krishnan Parthasarathi (kparthas) 
------
commit 4f905163211f8d439c6e102d3ffd1bffb34f5c26
Author: Kaushal M <kaushal>
Date:   Wed May 7 18:17:11 2014 +0530

    glusterd: On gaining quorum spawn_daemons in new thread
    
    During startup, if a glusterd has peers, it waits till quorum is
    obtained to spawn bricks and other services. If peers are not present,
    the daemons are started during glusterd's startup itself.
    
    The spawning of daemons as a quorum action was done without using a
    separate thread, unlike the spawn on startup. Since quotad was launched
    using the blocking runner_run API, this led to the calling thread being
    blocked. The calling thread is almost always the epoll thread, and this
    leads to a deadlock: the runner_run call blocks the epoll thread waiting
    for quotad to start, so glusterd cannot serve any requests, while the
    startup of quotad is itself blocked as it cannot fetch the volfile from
    glusterd.
    
    The fix for this is to launch the spawn-daemons task in a separate
    thread. This frees up the epoll thread and prevents the above deadlock
    from happening.
    
    Change-Id: Ife47b3591223cdfdfb2b4ea8dcd73e63f18e8749
    BUG: 1095585
    Signed-off-by: Kaushal M <kaushal>
    Reviewed-on: http://review.gluster.org/7703
    Reviewed-by: Krishnan Parthasarathi <kparthas>
    Tested-by: Gluster Build System <jenkins.com>
    Tested-by: Krishnan Parthasarathi <kparthas>
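
The sketch below illustrates only the shape of that fix; it is not the actual change merged in review 7703 (which goes through glusterd's own service-management code), and the names spawn_daemons_task and launch_spawn_daemons are invented for the example. The point it demonstrates is that the blocking spawn work runs on its own detached thread, so the epoll thread keeps dispatching events and can still serve quotad's volfile fetch.

--------------------------------------------------------------------------

/* Sketch only: run blocking daemon-spawn work on a separate thread
 * instead of on the epoll/event thread. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static void *spawn_daemons_task(void *arg)
{
    (void)arg;
    /* The blocking work lives here, e.g. a runner_run()-style call that
     * waits for quotad to come up.  Because this runs on its own thread,
     * the epoll thread is never blocked. */
    printf("spawning bricks, glusterfs-nfs and quotad...\n");
    sleep(1); /* stand-in for the blocking wait */
    printf("daemons spawned\n");
    return NULL;
}

/* Called from the quorum-gained code path instead of invoking the
 * blocking spawn logic directly on the calling (epoll) thread. */
static int launch_spawn_daemons(void)
{
    pthread_t tid;
    int ret = pthread_create(&tid, NULL, spawn_daemons_task, NULL);
    if (ret != 0)
        return ret;
    return pthread_detach(tid);
}

int main(void)
{
    if (launch_spawn_daemons() != 0)
        fprintf(stderr, "failed to launch spawn-daemons thread\n");
    /* In glusterd the event loop would keep dispatching epoll events
     * here; this stand-in simply lets the worker thread finish. */
    pthread_exit(NULL);
}

--------------------------------------------------------------------------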

Comment 6 Niels de Vos 2014-09-22 12:39:53 UTC
A beta release for GlusterFS 3.6.0 has been made available [1]. Please verify whether this release solves this bug report for you. In case the glusterfs-3.6.0beta1 release does not have a resolution for this issue, leave a comment in this bug and move the status to ASSIGNED. If this release fixes the problem for you, leave a note and change the status to VERIFIED.

Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure (possibly an "updates-testing" repository) for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-September/018836.html
[2] http://supercolony.gluster.org/pipermail/gluster-users/

Comment 7 Niels de Vos 2014-11-11 08:32:00 UTC
This bug is being closed because a release has been made available that should address the reported issue. If the problem is still not fixed with glusterfs-3.6.1, please reopen this bug report.

glusterfs-3.6.1 has been announced [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://supercolony.gluster.org/pipermail/gluster-users/2014-November/019410.html
[2] http://supercolony.gluster.org/mailman/listinfo/gluster-users

