Description of problem:
=========================
With brick multiplexing it is quite easy to end up with two brick processes (glusterfsd) pointing to the same socket and volfile-id. While I still need to understand the implications (which I suspect can be severe), this problem is consistently reproducible.

Version-Release number of selected component (if applicable):
=====================
3.8.4-22

How reproducible:
======
2/2

Steps to Reproduce:
1. Have a gluster setup (I have 6 nodes) with brick multiplexing enabled.
2. Create a volume, say v1, which is 1x3 spanning n1, n2, n3.
   The glusterfsd process will look something like this when checked on node n1:

   root 20014 1 0 19:22 ? 00:00:00 /usr/sbin/glusterfsd -s 10.70.35.23 --volfile-id cross31.10.70.35.23.rhs-brick10-cross31 -p /var/lib/glusterd/vols/cross31/run/10.70.35.23-rhs-brick10-cross31.pid -S /var/lib/glusterd/vols/cross31/run/daemon-10.70.35.23.socket --brick-name /rhs/brick10/cross31 -l /var/log/glusterfs/bricks/rhs-brick10-cross31.log --xlator-option *-posix.glusterd-uuid=2b1a4ca7-5c9b-4169-add4-23530cea101a --brick-port 49153 --xlator-option cross31-server.listen-port=49153

3. Now create another 1x3 volume, say v2; the bricks of this volume will be attached to the same PIDs.
4. Now enable USS on v1 (or any other option change that will cause v1's bricks to get new PIDs on a restart).
5. Now stop and start v1. (A sketch of the CLI commands for these steps is given after this comment.)

Actual results:
==========
A new PID for the bricks is spawned (because a volume option changed, the bricks avoid being attached to the first PID), but the new PID is connected to the same socket as the first PID, and so are the volfile-id and log file, as shown below:

[root@dhcp35-23 3.8.4-22]# ps -ef|grep glusterfsd

===> old PID
root 20014 1 0 19:22 ? 00:00:00 /usr/sbin/glusterfsd -s 10.70.35.23 --volfile-id cross31.10.70.35.23.rhs-brick10-cross31 -p /var/lib/glusterd/vols/cross31/run/10.70.35.23-rhs-brick10-cross31.pid -S /var/lib/glusterd/vols/cross31/run/daemon-10.70.35.23.socket --brick-name /rhs/brick10/cross31 -l /var/log/glusterfs/bricks/rhs-brick10-cross31.log --xlator-option *-posix.glusterd-uuid=2b1a4ca7-5c9b-4169-add4-23530cea101a --brick-port 49153 --xlator-option cross31-server.listen-port=49153

===> new PID
root 20320 1 0 19:27 ? 00:00:00 /usr/sbin/glusterfsd -s 10.70.35.23 --volfile-id cross31.10.70.35.23.rhs-brick10-cross31 -p /var/lib/glusterd/vols/cross31/run/10.70.35.23-rhs-brick10-cross31.pid -S /var/lib/glusterd/vols/cross31/run/daemon-10.70.35.23.socket --brick-name /rhs/brick10/cross31 -l /var/log/glusterfs/bricks/rhs-brick10-cross31.log --xlator-option *-posix.glusterd-uuid=2b1a4ca7-5c9b-4169-add4-23530cea101a --brick-port 49152 --xlator-option cross31-server.listen-port=49152

root 20340 1 0 19:27 ? 00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/cross31 -p /var/lib/glusterd/vols/cross31/run/cross31-snapd.pid -l /var/log/glusterfs/snaps/cross31/snapd.log --brick-name snapd-cross31 -S /var/run/gluster/d451ea3d83a68af025cee105cafdd8a2.socket --brick-port 49154 --xlator-option cross31-server.listen-port=49154 --no-mem-accounting
root 20472 30155 0 19:38 pts/0 00:00:00 grep --color=auto glusterfsd
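For reference, a minimal, hedged sketch of the CLI commands behind the steps above. Host names (n1, n2, n3), volume names and brick paths are placeholders, not the ones from the logs:

    # enable brick multiplexing cluster-wide
    gluster volume set all cluster.brick-multiplex on

    # step 2: create and start the first 1x3 volume
    gluster volume create v1 replica 3 n1:/rhs/brick10/v1 n2:/rhs/brick10/v1 n3:/rhs/brick10/v1
    gluster volume start v1

    # step 3: create and start a second 1x3 volume; its bricks get multiplexed
    # into the same glusterfsd processes as v1's bricks
    gluster volume create v2 replica 3 n1:/rhs/brick11/v2 n2:/rhs/brick11/v2 n3:/rhs/brick11/v2
    gluster volume start v2

    # step 4: change an option that forces v1's bricks into a new process on restart
    gluster volume set v1 features.uss enable

    # step 5: restart v1 and compare the glusterfsd command lines
    gluster --mode=script volume stop v1
    gluster volume start v1
    ps -ef | grep glusterfsd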
As discussed later, this is a bug, hence moving it to the right state.
Upstream patches:
https://review.gluster.org/#/q/topic:bug-1444596

Downstream patches:
https://code.engineering.redhat.com/gerrit/#/c/105595/
https://code.engineering.redhat.com/gerrit/#/c/105596/
On 3.8.4-25 the problem still exists, hence this will have to be moved to failed_qa. Performed the same steps:

[root@dhcp35-45 ~]# ps -ef|grep glusterfsd
root 29765 1 11 13:07 ? 00:00:22 /usr/sbin/glusterfsd -s 10.70.35.45 --volfile-id bali-1.10.70.35.45.rhs-brick1-bali-1 -p /var/lib/glusterd/vols/bali-1/run/10.70.35.45-rhs-brick1-bali-1.pid -S /var/run/gluster/0725de02cc65e3bd10bbccdbf07631e6.socket --brick-name /rhs/brick1/bali-1 -l /var/log/glusterfs/bricks/rhs-brick1-bali-1.log --xlator-option *-posix.glusterd-uuid=e4f737cd-59a2-4392-aa3d-4230f698f128 --brick-port 49152 --xlator-option bali-1-server.listen-port=49152
root 30083 1 0 13:10 ? 00:00:00 /usr/sbin/glusterfsd -s 10.70.35.45 --volfile-id bali-1.10.70.35.45.rhs-brick1-bali-1 -p /var/lib/glusterd/vols/bali-1/run/10.70.35.45-rhs-brick1-bali-1.pid -S /var/run/gluster/0725de02cc65e3bd10bbccdbf07631e6.socket --brick-name /rhs/brick1/bali-1 -l /var/log/glusterfs/bricks/rhs-brick1-bali-1.log --xlator-option *-posix.glusterd-uuid=e4f737cd-59a2-4392-aa3d-4230f698f128 --brick-port 49153 --xlator-option bali-1-server.listen-port=49153
root 30103 1 0 13:10 ? 00:00:00 /usr/sbin/glusterfsd -s localhost --volfile-id snapd/bali-1 -p /var/lib/glusterd/vols/bali-1/run/bali-1-snapd.pid -l /var/log/glusterfs/snaps/bali-1/snapd.log --brick-name snapd-bali-1 -S /var/run/gluster/b0c28a9b87c703e8435212615395783b.socket --brick-port 49154 --xlator-option bali-1-server.listen-port=49154 --no-mem-accounting
root 30157 28566 0 13:10 pts/0 00:00:00 grep --color=auto glusterfsd

[root@dhcp35-45 ~]# gluster v status
Status of volume: bali-1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick1/bali-1        49153     0          Y       30083
Brick 10.70.35.130:/rhs/brick1/bali-1       49153     0          Y       2132
Brick 10.70.35.122:/rhs/brick1/bali-1       49153     0          Y       1441
Snapshot Daemon on localhost                49154     0          Y       30103
Self-heal Daemon on localhost               N/A       N/A        Y       30112
Snapshot Daemon on 10.70.35.23              49152     0          Y       30666
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       30675
Snapshot Daemon on 10.70.35.122             49154     0          Y       1461
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       1470
Snapshot Daemon on 10.70.35.130             49154     0          Y       2173
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       2199

Task Status of Volume bali-1
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: bali-2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick2/bali-2        49152     0          Y       29765
Brick 10.70.35.130:/rhs/brick2/bali-2       49152     0          Y       1794
Brick 10.70.35.122:/rhs/brick2/bali-2       49152     0          Y       1225
Self-heal Daemon on localhost               N/A       N/A        Y       30112
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       30675
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       1470
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       2199

Task Status of Volume bali-2
------------------------------------------------------------------------------
There are no active volume tasks

[root@dhcp35-45 ~]#
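As an aside, a quick hedged way to confirm the duplication (not part of the original verification) is to count how many running glusterfsd processes share the same -S socket argument:

    # list the -S <socket> argument of every glusterfsd and count duplicates;
    # any count greater than 1 means two brick processes point at the same socket
    ps -eo args | grep '[g]lusterfsd' | grep -o -- '-S [^ ]*' | sort | uniq -c | sort -rn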
On_qa validation: While the problem mentioned in the description is still there, i.e. two different brick processes (glusterfsd) pointing to the same socket and volfile, I am moving this to verified only because I see no IO impact (as suggested in comment#12 and comment#11): when I stop vol_1 (where USS was enabled later), IOs to vol_2 are still in progress. Hence moving to verified.

Verified on version: 3.8.4-33
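For context, the IO-impact check described above could look roughly like the following sketch; the mount point, server address and file name are hypothetical:

    # keep IO running on vol_2 via a FUSE mount (placeholder paths)
    mount -t glusterfs 10.70.35.45:/vol_2 /mnt/vol_2
    while true; do dd if=/dev/zero of=/mnt/vol_2/io_probe bs=1M count=10 oflag=sync; done &

    # stop vol_1 (the volume where USS was enabled later)
    gluster --mode=script volume stop vol_1

    # if there is no IO impact, the dd loop on /mnt/vol_2 keeps succeeding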
Also, I am changing the title and creating a new bug separately to track the same-socket-file issue (per my comment in comment#13).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774