Description of problem:
After a gluster node restart, a brick process is not started.

Version-Release number of selected component (if applicable):
glusterfs 4.1.1

How reproducible:
The gluster node runs in a Docker container. We have two volumes (vol1 and vol2). At the first start, the brick process PIDs are:
- 27 for vol1
- 28 for vol2

After the restart, the pid files are not removed. Gluster checks whether the brick for vol1 is started and, because there is no process with PID 27, it starts the brick. Unfortunately, this new process gets PID 28. Gluster then checks whether the brick for vol2 is started and reads the stale pid file, which contains 28. But PID 28 is now the PID of the brick for vol1, so gluster does not start the brick for vol2.

Gluster should verify that the process found in a pid file really belongs to the corresponding brick, and is not a process started for a different brick.

Steps to Reproduce:

Actual results:

Expected results:

Additional info:
Mohit - can you please check it?
I think this is addressed through https://bugzilla.redhat.com/show_bug.cgi?id=1595320, which is fixed in glusterfs-5. Mohit - would you check whether this can be backported to the 4.1 branch?
Hi,

Can you please share the contents of /var/log/glusterfs, along with the output of the commands below and a timestamp of when the issue was reproduced?

1) ps -aef | grep gluster
2) gluster v info

Thanks,
Mohit Agrawal
Here are some traces:

[root@pb-gluster-ope-0 ~]# docker exec -it gluster bash
[root@pb-gluster-ope-0 /]# ps -eaf
UID    PID  PPID  C STIME TTY    TIME     CMD
root     1     0  0 13:31 ?      00:00:00 /usr/bin/python2 /usr/bin/supervisord -c /etc/supervisord.conf
root     8     1  0 13:31 ?      00:00:00 /usr/sbin/glusterd -N -l /var/log/glusterfs/glusterfs.log --log-level INFO
root     9     1  0 13:31 ?      00:00:00 /usr/bin/python /usr/bin/gmanager.py
root    18     0  0 13:31 pts/1  00:00:00 bash
root    33     1  0 13:32 ?      00:00:00 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/d7c25a447b41ab7d.socket --xlator-option *replicate*.nod
root    42     1  0 13:32 ?      00:00:00 /usr/sbin/glusterfsd -s pilot-0 --volfile-id glusterPGSQL.pilot-0.mnt-glusterPGSQL-1 -p /var/run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid -S /var/run/gluster/8ae1d08240e9da74.socket --brick-name /mnt/gluster
root    97    18  0 13:32 pts/1  00:00:00 ps -eaf

[root@pb-gluster-ope-0 /]# find /run -name *.pid
/run/supervisor/supervisord.pid
/run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid
/run/gluster/vols/glustervol1/pilot-0-local-glusterV0-1.pid
/run/gluster/glustershd/glustershd.pid

[root@pb-gluster-ope-0 /]# cat /run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid
42
[root@pb-gluster-ope-0 /]# cat /run/gluster/vols/glustervol1/pilot-0-local-glusterV0-1.pid
42

[root@pb-gluster-ope-0 /]# ls -l /run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid /run/gluster/vols/glustervol1/pilot-0-local-glusterV0-1.pid
-rw-r--r--. 1 root root 3 Nov 12 13:32 /run/gluster/vols/glusterPGSQL/pilot-0-mnt-glusterPGSQL-1.pid
-rw-r--r--. 1 root root 3 Nov 12 13:31 /run/gluster/vols/glustervol1/pilot-0-local-glusterV0-1.pid

The /var/log/glusterfs files are in the attached file.

BR
P.
Created attachment 1504697 [details] /var/log/glusterfs files
And the gluster v info result:

[root@pb-gluster-ope-0 /]# gluster v info

Volume Name: glusterPGSQL
Type: Replicate
Volume ID: 322d4e1f-483a-4dae-8a3c-8f6fbc51dd57
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: pilot-0:/mnt/glusterPGSQL/1
Brick2: pilot-1:/mnt/glusterPGSQL/1
Brick3: pilot-2:/mnt/glusterPGSQL/1
Options Reconfigured:
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: glustervol1
Type: Replicate
Volume ID: 10dca74f-2cf6-476c-89e1-69e6a67a7bde
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: pilot-0:/local/glusterV0/1
Brick2: pilot-1:/local/glusterV0/1
Brick3: pilot-2:/local/glusterV0/1
Options Reconfigured:
cluster.self-heal-daemon: enable
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
Hi,

Thanks for sharing the info. After analyzing the logs, it is difficult to map the PID of the running brick process. Usually we see this kind of situation only in container environments when the user has enabled the brick-multiplex feature (multiple bricks attached to the same brick process), but in this case you are hitting the issue even without brick multiplexing enabled.

Would it be possible for you to test a patch if I share one with you, or how can I share a test build (rpm) with you?

Thanks,
Mohit Agrawal
Hi,

Yes, I can test an rpm if you provide it (but it could take a few days to test).

BR
P.
Created attachment 1505089 [details] glusterfs source rpm
Hi,

I have attached the gluster source rpm. Please build the other rpms from this source rpm and share the result. To build the rpms, follow these steps:

1) rpm -hiv <source_rpm>
2) cd rpmbuild/SPECS; rpmbuild -ba glusterfs.spec

The second command will build the rpms in rpmbuild/RPMS/x86_64.

Thanks,
Mohit Agrawal
Have you tested out the fix?
Did you get a chance to test the rpm provided?
Hi,

Sorry, I have no way to test the rpm (no more lab environment!).

BR
We are not seeing this issue with the latest master; closing the bug.