Bug 1647506

Summary: glusterd_brick_start wrongly discovers already-running brick
Product: [Community] GlusterFS
Component: glusterd
Version: 4.1
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: medium
Reporter: patrice
Assignee: Sanju <srakonde>
CC: amukherj, bugs, moagrawa, pasik, patrice, srakonde
Type: Bug
Last Closed: 2020-02-24 11:02:26 UTC

Attachments:
- /var/log/glusterfs files
- glusterfs source rpm

Description patrice 2018-11-07 16:03:27 UTC
Description of problem:
After a gluster node restart, a brick process is not started


Version-Release number of selected component (if applicable):
glusterfs 4.1.1


How reproducible:
The gluster node runs in a docker container
We have 2 volumes (vol1 and vol2)
At the first start the bricks process pid are
 - 27 for vol1
 - 28 for vol2
After the restart, the pid files are not removed 
Gluster checks if the brick for vol1 is started, and because there is no process with pid 27, it starts it. Unfortunately this new process has the pid 28
Then gluster checks if the brick for vol2 is started, and read the old pid file which contains 28. And unfortunately pid 28 is the pid of the brick of vol1
-> and gluster doesn't start the brick of vol2

Gluster has to check the process found in pid file is really a process for the good brick (not a process started for a previous brick)
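
For illustration, here is the kind of check we mean, sketched in shell (the real check would of course live in glusterd itself; $pidfile and $brick_path are placeholders, and we assume the brick path appears on the glusterfsd command line after --brick-name):

# Illustration only, not gluster code.
pid=$(cat "$pidfile")
if [ -r "/proc/$pid/cmdline" ] &&
   tr '\0' ' ' < "/proc/$pid/cmdline" | grep -q "glusterfsd.*--brick-name $brick_path"; then
    echo "pid $pid really is the brick for $brick_path"
else
    echo "stale pid file: pid $pid is gone or belongs to another brick"
fi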


Steps to Reproduce:
1. Run a gluster node with two volumes in a Docker container.
2. Restart the container so that the old pid files are left behind.
3. Check the brick processes after the restart.

Actual results:
The brick for vol2 is not started (its pid file points at the vol1 brick process).

Expected results:
Both brick processes are started.

Additional info:

Comment 1 Atin Mukherjee 2018-11-08 04:24:27 UTC
Mohit - can you please check it?

Comment 2 Atin Mukherjee 2018-11-08 08:37:36 UTC
I think this is addressed through https://bugzilla.redhat.com/show_bug.cgi?id=1595320, which is fixed in glusterfs-5.

Mohit - would you check if this can be backported to 4.1 branch?

Comment 3 Mohit Agrawal 2018-11-08 09:08:40 UTC
Hi,

Can you please share the dump of /var/log/glusterfs, along with the following information and a timestamp of when the issue was reproduced?
1) ps -aef | grep gluster
2) gluster v info


Thanks,
Mohit Agrawal

Comment 4 patrice 2018-11-12 12:37:44 UTC
Here are some traces:
[root@pb-gluster-ope-0 ~]# docker exec -it gluster bash
[root@pb-gluster-ope-0 /]# ps -eaf
UID        PID  PPID  C STIME TTY          TIME CMD
root         1     0  0 13:31 ?        00:00:00 /usr/bin/python2 /usr/bin/supervisord -c /etc/supervisord.conf
root         8     1  0 13:31 ?        00:00:00 /usr/sbin/glusterd -N -l /var/log/glusterfs/glusterfs.log --log-level INFO
root         9     1  0 13:31 ?        00:00:00 /usr/bin/python /usr/bin/gmanager.py
root        18     0  0 13:31 pts/1    00:00:00 bash
root        33     1  0 13:32 ?        00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/d7c25a447b41ab7d.socket --xlator-option *replicate*.nod
root        42     1  0 13:32 ?        00:00:00 /usr/sbin/glusterfsd -s pilot-0 --volfile-id glusterPGSQL.pilot-0.mnt-glusterPGSQL-1 -p /var/run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid -S /var/run/gluster/8ae1d08240e9da74.socket --brick-name /mnt/gluster
root        97    18  0 13:32 pts/1    00:00:00 ps -eaf
[root@pb-gluster-ope-0 /]# find /run -name *.pid
/run/supervisor/supervisord.pid
/run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid
/run/gluster/vols/glustervol1/pilot-0-local-glusterV0-1.pid
/run/gluster/glustershd/glustershd.pid
[root@pb-gluster-ope-0 /]# cat /run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid
42
[root@pb-gluster-ope-0 /]# cat /run/gluster/vols/glustervol1/pilot-0-local-glusterV0-1.pid
42


[root@pb-gluster-ope-0 /]# ls -l /run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid /run/gluster/vols/glustervol1/pilot-0-local-glusterV0-1.pid
-rw-r--r--. 1 root root 3 Nov 12 13:32 /run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid
-rw-r--r--. 1 root root 3 Nov 12 13:31 /run/gluster/vols/glustervol1/pilot-0-local-glusterV0-1.pid
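
For completeness, here is an illustrative loop that maps every brick pid file to the command line of the process it names; on this node both pid files resolve to pid 42, the glusterPGSQL brick:

for f in /run/gluster/vols/*/*.pid; do
    pid=$(cat "$f")
    if [ -r "/proc/$pid/cmdline" ]; then
        # /proc/<pid>/cmdline is NUL-separated; make it readable
        echo "$f -> pid $pid -> $(tr '\0' ' ' < /proc/$pid/cmdline)"
    else
        echo "$f -> pid $pid -> <no such process>"
    fi
done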


and the /var/log/glusterfs files are in the attached file

BR

P.

Comment 5 patrice 2018-11-12 12:38:57 UTC
Created attachment 1504697 [details]
/var/log/glusterfs files

Comment 6 patrice 2018-11-12 12:45:59 UTC
And the gluster v info result:

[root@pb-gluster-ope-0 /]# gluster v info

Volume Name: glusterPGSQL
Type: Replicate
Volume ID: 322d4e1f-483a-4dae-8a3c-8f6fbc51dd57
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: pilot-0:/mnt/glusterPGSQL/1
Brick2: pilot-1:/mnt/glusterPGSQL/1
Brick3: pilot-2:/mnt/glusterPGSQL/1
Options Reconfigured:
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol1
Type: Replicate
Volume ID: 10dca74f-2cf6-476c-89e1-69e6a67a7bde
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: pilot-0:/local/glusterV0/1
Brick2: pilot-1:/local/glusterV0/1
Brick3: pilot-2:/local/glusterV0/1
Options Reconfigured:
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Comment 7 Mohit Agrawal 2018-11-12 13:17:42 UTC
Hi,

Thanks for sharing the info. After analyzing the logs, it is difficult to map the pid to the running brick process. Usually we see this type of situation only in a container environment where the user has configured the brick multiplex feature (multiple bricks attached to the same brick process), but in this case you are hitting the issue even without enabling brick multiplexing.
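
(For context, brick multiplexing is the mode in which multiple bricks share a single glusterfsd process; it is enabled cluster-wide with a command along these lines:)

gluster volume set all cluster.brick-multiplex on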

Would it be possible for you to test a patch if I share it with you? Or how can I share a test build (rpm) with you?

Thanks,
Mohit Agrawal

Comment 8 patrice 2018-11-12 14:19:42 UTC
Hi,

Yes, I can test an rpm if you provide it! (But it could take a few days to test.)

BR

P.

Comment 9 Mohit Agrawal 2018-11-13 03:18:06 UTC
Created attachment 1505089 [details]
glusterfs source rpm

Comment 10 Mohit Agrawal 2018-11-13 03:23:10 UTC
Hi,

I have attached the glusterfs source rpm. Please build the binary rpms from it and share the result.

To build the rpms, follow these instructions:
1) rpm -hiv <source_rpm>
2) cd rpmbuild/SPECS; rpmbuild -ba glusterfs.spec

The commands will place the built rpms in rpmbuild/RPMS/x86_64.


Thanks,
Mohit Agrawal

Comment 11 Atin Mukherjee 2019-07-17 08:36:34 UTC
Have you tested out the fix?

Comment 12 Sanju 2019-09-19 11:34:54 UTC
Did you get a chance to test the rpm provided?

Comment 13 patrice 2019-12-03 08:26:35 UTC
Hi,
Sorry, I have no way to test the rpm (no more labs!)

BR

Comment 14 Sanju 2020-02-24 11:02:26 UTC
We are not seeing this issue with the latest master, so closing the bug.