Bug 1092429 - Two instances each, of brick processes, glusterfs-nfs and quotad seen after glusterd restart
Summary: Two instances each, of brick processes, glusterfs-nfs and quotad seen after glusterd restart
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: quota
Version: rhgs-3.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: RHGS 3.0.0
Assignee: Kaushal
QA Contact: Saurabh
URL:
Whiteboard:
Depends On: 1095585
Blocks:
 
Reported: 2014-04-29 09:53 UTC by Shruti Sampat
Modified: 2016-09-17 12:38 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.6.0-4.0.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, quotad was started in a way that blocked the epoll thread while glusterd was starting up. This led to glusterd becoming deadlocked during startup, so the daemon processes could not start and daemonize correctly, and two instances of each daemon process were observed. With this fix, quotad is started separately, leaving the epoll thread free to serve other requests. All the daemon processes now start and daemonize properly, and only a single instance of each process is displayed.
Clone Of:
: 1095585 (view as bug list)
Environment:
Last Closed: 2014-09-22 19:36:16 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2014:1278 0 normal SHIPPED_LIVE Red Hat Storage Server 3.0 bug fix and enhancement update 2014-09-22 23:26:55 UTC

Description Shruti Sampat 2014-04-29 09:53:41 UTC
Description of problem:
------------------------

Following glusterd restart, the ps command reports two instances each of the brick processes, glusterfs-nfs and quotad. For example -

-------------------------------------------------------------------------

[root@rhs-client25 gluster]# pgrep gluster -fl
16074 /usr/sbin/glusterd --pid-file=/var/run/glusterd.pid
16284 /usr/sbin/glusterfsd -s rhs-client25 --volfile-id dis_vol.rhs-client25.rhs-brick1-b1 -p /var/lib/glusterd/vols/dis_vol/run/rhs-client25-rhs-brick1-b1.pid -S /var/run/3df964b178012d04c3a29c339b2465a3.socket --brick-name /rhs/brick1/b1 -l /var/log/glusterfs/bricks/rhs-brick1-b1.log --xlator-option *-posix.glusterd-uuid=6e95ce9b-3453-4d21-9510-64a50dd6cf12 --brick-port 49152 --xlator-option dis_vol-server.listen-port=49152
16285 /usr/sbin/glusterfsd -s rhs-client25 --volfile-id dis_vol.rhs-client25.rhs-brick1-b1 -p /var/lib/glusterd/vols/dis_vol/run/rhs-client25-rhs-brick1-b1.pid -S /var/run/3df964b178012d04c3a29c339b2465a3.socket --brick-name /rhs/brick1/b1 -l /var/log/glusterfs/bricks/rhs-brick1-b1.log --xlator-option *-posix.glusterd-uuid=6e95ce9b-3453-4d21-9510-64a50dd6cf12 --brick-port 49152 --xlator-option dis_vol-server.listen-port=49152
16290 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/8800fd4a270a082e891be2336fcb0e7f.socket
16291 /usr/sbin/glusterfs -s localhost --volfile-id gluster/nfs -p /var/lib/glusterd/nfs/run/nfs.pid -l /var/log/glusterfs/nfs.log -S /var/run/8800fd4a270a082e891be2336fcb0e7f.socket
16295 /usr/sbin/glusterfs -s localhost --volfile-id gluster/quotad -p /var/lib/glusterd/quotad/run/quotad.pid -l /var/log/glusterfs/quotad.log -S /var/run/d12f45551ebf86f8dbd7a1a5977970e9.socket --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off
16297 /usr/sbin/glusterfs -s localhost --volfile-id gluster/quotad -p /var/lib/glusterd/quotad/run/quotad.pid -l /var/log/glusterfs/quotad.log -S /var/run/d12f45551ebf86f8dbd7a1a5977970e9.socket --xlator-option *replicate*.data-self-heal=off --xlator-option *replicate*.metadata-self-heal=off --xlator-option *replicate*.entry-self-heal=off


The gdb backtraces for the two brick processes are below -

(gdb) bt
#0  0x0000003a7e80e740 in __read_nocancel () from /lib64/libpthread.so.0
#1  0x0000000000404c0d in read (ctx=0x13bd010) at /usr/include/bits/unistd.h:45
#2  daemonize (ctx=0x13bd010) at glusterfsd.c:1810
#3  0x0000000000407287 in main (argc=19, argv=0x7fff2f4be4c8) at glusterfsd.c:1979
(gdb) q

--------------------------------------------------------------------------

(gdb) bt
#0  0x0000003a7e4e9163 in epoll_wait () from /lib64/libc.so.6
#1  0x0000003a7f86adb7 in event_dispatch_epoll (event_pool=0x13d8ee0) at event-epoll.c:428
#2  0x00000000004072ca in main (argc=19, argv=0x7fff2f4be4c8) at glusterfsd.c:1994
(gdb) q


Version-Release number of selected component (if applicable):
glusterfs-3.5qa2-0.294.git00802b3.el6rhs.x86_64

How reproducible:
Intermittent.

Steps to Reproduce:
Cannot give clear steps for reproducing the issue.

Actual results:
Following glusterd restart, two instances of the above mentioned processes are seen.

Expected results:
There should only be one instance of each of these processes.

Additional info:

Comment 4 RamaKasturi 2014-05-06 10:02:51 UTC
Hi,

   Saw a similar issue in my setup as well. Here are the steps to reproduce.

1) Create a distributed volume.

2) Enable quota on the volume by running the command "gluster vol quota <volname> enable".

3) Now stop glusterd and start it again.

Now the gluster CLI hangs while trying to run the command "gluster v i".

Attaching the sos reports for the same.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1092429/

Comment 5 RamaKasturi 2014-05-06 10:05:47 UTC
1) Not able to test the Quota process plugin in Nagios.

2) When quota is enabled on a volume, not able to test any other process plugins, i.e. glusterd, shd and nfs.

Comment 6 Kaushal 2014-05-06 10:48:35 UTC
This happens because glusterd enters a deadlock, caused by quotad being started using 'runner_run', which blocks the calling thread. When glusterd is starting up, this happens in the epoll thread, so the epoll thread gets blocked. With the epoll thread blocked, glusterd is not able to serve any requests, including CLI commands and volfile fetch requests.
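
To illustrate, here is a minimal, hypothetical C sketch (not the actual glusterd/runner code; the helper name spawn_and_wait is made up) of what happens when a "run and wait for exit" helper like runner_run is invoked from the thread that also runs the epoll loop: while it waits for the child, that thread cannot dispatch any other events, so requests the child itself depends on (such as a volfile fetch) can never be served.

/* Minimal sketch -- assumption: illustrative only, not glusterd code. */
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Hypothetical stand-in for runner_run(): fork, exec, wait for exit. */
static int spawn_and_wait(char *const argv[])
{
        pid_t pid = fork();
        if (pid < 0)
                return -1;
        if (pid == 0) {
                execvp(argv[0], argv);
                _exit(127);             /* exec failed */
        }
        int status = 0;
        waitpid(pid, &status, 0);       /* blocks the calling thread until the child exits */
        return status;
}

int main(void)
{
        /* Imagine this call being made from glusterd's epoll dispatch thread
         * during startup: until the child exits, that thread cannot serve CLI
         * commands or volfile fetch requests.  If the child is itself waiting
         * for a volfile from this same glusterd, neither side can make progress. */
        char *args[] = { "sleep", "60", NULL };
        return spawn_and_wait(args);
}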

Daemonization for the glusterfs daemons happens in the following way:
* glusterd starts the parent process
* the parent process forks and creates a child daemon process
* the daemon process does the volfile fetching and initializing, and signals the parent on success
* the parent process exits after the child signals success

In this case, the child process cannot fetch the volfile and so never signals the parent process. This leaves two instances of each process, neither of which works. In the case of quotad, we get into a deadlock: using runner_run blocks the epoll thread, but since the quotad daemon process cannot get the volfile, it blocks the parent process, which in turn keeps runner_run blocked in glusterd.
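
For illustration, here is a minimal, hypothetical C sketch of the parent/child handshake described above (not glusterfsd's actual daemonize() code): the parent blocks in read() on a pipe, matching the read() frame in the backtrace, waiting for the child to report that it has fetched its volfile and initialized. If the child never gets the volfile, the byte is never written, the parent never exits, and ps shows two instances of the process.

/* Minimal sketch -- assumption: illustrative only, not glusterfsd's daemonize(). */
#include <unistd.h>
#include <sys/types.h>

int main(void)
{
        int pipefd[2];
        if (pipe(pipefd) < 0)
                return 1;

        pid_t pid = fork();
        if (pid < 0)
                return 1;

        if (pid > 0) {
                /* Parent: block in read() until the child reports success,
                 * then exit.  This is the read() frame seen in the backtrace. */
                char ok = 0;
                close(pipefd[1]);
                if (read(pipefd[0], &ok, 1) != 1)
                        return 1;
                return 0;
        }

        /* Child (the daemon): the volfile fetch and initialization would
         * happen here.  If that hangs -- e.g. because glusterd's epoll thread
         * is blocked -- the byte below is never written, the parent stays
         * blocked forever, and two processes remain visible in ps. */
        char ok = 1;
        (void) write(pipefd[1], &ok, 1);
        /* ... the daemon's main loop would run here ... */
        return 0;
}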

Comment 8 Nagaprasad Sathyanarayana 2014-05-19 10:56:35 UTC
Setting flags required to add BZs to RHS 3.0 Errata

Comment 12 Pavithra 2014-07-23 06:37:30 UTC
Kaushal,

Can you please review the edited doc text for technical accuracy and sign off?

Comment 14 errata-xmlrpc 2014-09-22 19:36:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-1278.html

