Description of problem:
=======================
I can reproducibly hit a scenario where the bricks of one volume go down because a second volume still points to the same socket file and pidfile: bringing that second volume down takes the first volume's bricks with it, even after the first volume has been given a different PID by a volume configuration change. The steps to reproduce below make the situation clearer.

Version-Release number of selected component (if applicable):
====
3.8.4-23

How reproducible:
==============
2/2

Steps to Reproduce:
=====================
1. Have a 6-node setup with multiple bricks; DO NOT enable brick multiplexing yet.
2. Create a 2x2 volume, say v1, on n3..n6 and start it.
3. Create another volume with the same layout, say v3, and start it (note: v3 is deliberately created before v2).
4. Create one more volume with the same layout, say v2, and start it.
5. Stop v1 and delete it.
6. Enable brick multiplexing.
7. Create a volume v4 with the same configuration as the previous volumes and start it.
8. Enable USS on v2.
9. Stop v2, then start it again.
10. Stop v4; it can be seen that the v2 bricks also go down (possibly because the pidfile and socket file of v4 were still pointing to the old v2 details).
11. Check the status of all volumes: v2 is still down, so the volume cannot be mounted.

################### pasting exact commands ############
 1134  gluster peer status
 1135  history
 1136  '
 1137  cd ~
 1138  ls
 1139  gluster v create v1 rep 2 10.70.35.122:/rhs/brick1/v1 10.70.35.23:/rhs/brick1/v1 10.70.35.112:/rhs/brick1/v1 10.70.35.138:/rhs/brick1/v1
 1140  gluster v start v1
 1141  gluster v get all all
 1142  gluster v status v1
 1143  gluster v create v2 rep 2 10.70.35.122:/rhs/brick2/v2 10.70.35.23:/rhs/brick2/v2 10.70.35.112:/rhs/brick2/v2 10.70.35.138:/rhs/brick2/v2
 1144  gluster v create v3 rep 2 10.70.35.122:/rhs/brick3/v3 10.70.35.23:/rhs/brick3/v3 10.70.35.112:/rhs/brick3/v3 10.70.35.138:/rhs/brick3/v3
 1145  gluster v start v3
 1146  gluster v start v2
 1147  gluster v status
 1148  clear
 1149  gluster v status
 1150  gluster v stop v1
 1151  gluster v dele v1
 1152  gluster v status
 1153  gluster v set all cluster.brick-multiplex enable
 1154  gluster v create v4 rep 2 10.70.35.122:/rhs/brick4/v4 10.70.35.23:/rhs/brick4/v4 10.70.35.112:/rhs/brick4/v4 10.70.35.138:/rhs/brick4/v4
 1155  gluster v start v4
 1156  gluster v status
 1157  gluster v set v2 features.uss enable
 1158  gluster v stop v2
 1159  gluster v start v2
 1160  gluster v status v2
 1161  gluster v status v4
 1162  gluster v status v2
 1163  gluster v status v4
 1164  gluster v status v2
 1165  gluster v status v1
 1166  gluster v status v2
 1167  gluster v status v3
 1168  gluster v status v4
 1169  gluster v stop v4
 1170  gluster v status v2
 1171  history
 1172  gluster v status v4
 1173  gluster v status v2
 1174  gluster v start v4
 1175  gluster v status v4
 1176  gluster v status v2
 1177  history|grep gluster
 1178  history

[root@dhcp35-45 ~]# gluster v info

Volume Name: v2
Type: Distributed-Replicate
Volume ID: 02261f5c-b7df-4dbb-86ce-6419efd93152
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.122:/rhs/brick2/v2
Brick2: 10.70.35.23:/rhs/brick2/v2
Brick3: 10.70.35.112:/rhs/brick2/v2
Brick4: 10.70.35.138:/rhs/brick2/v2
Options Reconfigured:
features.uss: enable
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable

Volume Name: v3
Type: Distributed-Replicate
Volume ID: 8fb3daca-03ff-4022-ba2f-b475231fdcce
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.122:/rhs/brick3/v3
Brick2: 10.70.35.23:/rhs/brick3/v3
Brick3: 10.70.35.112:/rhs/brick3/v3
Brick4: 10.70.35.138:/rhs/brick3/v3
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable

Volume Name: v4
Type: Distributed-Replicate
Volume ID: c5477eda-eaea-474a-b1ee-a55dee58c461
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.70.35.122:/rhs/brick4/v4
Brick2: 10.70.35.23:/rhs/brick4/v4
Brick3: 10.70.35.112:/rhs/brick4/v4
Brick4: 10.70.35.138:/rhs/brick4/v4
Options Reconfigured:
transport.address-family: inet
nfs.disable: on
cluster.brick-multiplex: enable

[root@dhcp35-45 ~]# gluster v status
Status of volume: v2
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.122:/rhs/brick2/v2           N/A       N/A        N       N/A
Brick 10.70.35.23:/rhs/brick2/v2            N/A       N/A        N       N/A
Brick 10.70.35.112:/rhs/brick2/v2           N/A       N/A        N       N/A
Brick 10.70.35.138:/rhs/brick2/v2           N/A       N/A        N       N/A
Snapshot Daemon on localhost                49152     0          Y       23875
Self-heal Daemon on localhost               N/A       N/A        Y       24303
Snapshot Daemon on 10.70.35.130             49152     0          Y       12063
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       12312
Snapshot Daemon on 10.70.35.112             49155     0          Y       31066
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       31328
Snapshot Daemon on 10.70.35.23              49155     0          Y       31262
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       31523
Snapshot Daemon on 10.70.35.138             49155     0          Y       11405
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       11667
Snapshot Daemon on 10.70.35.122             49155     0          Y       13063
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       13324

Task Status of Volume v2
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: v3
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.122:/rhs/brick3/v3           49153     0          Y       12743
Brick 10.70.35.23:/rhs/brick3/v3            49153     0          Y       30943
Brick 10.70.35.112:/rhs/brick3/v3           49153     0          Y       30745
Brick 10.70.35.138:/rhs/brick3/v3           49153     0          Y       11084
Self-heal Daemon on localhost               N/A       N/A        Y       24303
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       12312
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       31523
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       13324
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       31328
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       11667

Task Status of Volume v3
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: v4
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.122:/rhs/brick4/v4           49153     0          Y       12743
Brick 10.70.35.23:/rhs/brick4/v4            49153     0          Y       30943
Brick 10.70.35.112:/rhs/brick4/v4           49153     0          Y       30745
Brick 10.70.35.138:/rhs/brick4/v4           49153     0          Y       11084
Self-heal Daemon on localhost               N/A       N/A        Y       24303
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       12312
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       31328
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       31523
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       13324
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       11667

Task Status of Volume v4
------------------------------------------------------------------------------
There are no active volume tasks
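Step 10's hypothesis -- v4's pidfile and socket file still pointing at the old v2 brick process -- can be checked directly on a brick node. A minimal sketch (not part of the original report), assuming the 3.8.x layout that keeps per-brick pidfiles under /var/lib/glusterd/vols/<VOLNAME>/run/ (the exact path varies by version):

# Dump the pidfile glusterd keeps for every brick of v2, v3 and v4.
# With multiplexing, bricks sharing one glusterfsd share one PID; a v4
# pidfile still holding the old v2 brick PID would explain why stopping
# v4 took the v2 bricks down with it.
for vol in v2 v3 v4; do
    echo "== $vol =="
    for pf in /var/lib/glusterd/vols/$vol/run/*.pid; do
        printf '%s -> PID %s\n' "$pf" "$(cat "$pf")"
    done
done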
sosreports: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1444861/
Upstream patch: https://review.gluster.org/#/c/17101/
Upstream patches: https://review.gluster.org/#/q/topic:bug-1444596

Downstream patches:
https://code.engineering.redhat.com/gerrit/#/c/105595/
https://code.engineering.redhat.com/gerrit/#/c/105596/
QA validation: moving to failed_qa.

If I bring down a brick of one volume, it still ends up disconnecting all the bricks hosted by the same glusterfsd. Steps followed:
1) Have a cluster with brick multiplexing enabled.
2) Create 10 volumes of type 1x3.
3) Bring down b1 of vol7 (by unmounting its LV).
4) Mount vol7, vol1 (the base volume) and vol3 (any other volume).
5) Run I/O against all of the above volumes.
==> All the bricks associated with the same glusterfsd as b1 of vol7 stop receiving I/O, effectively losing brick availability (see the sketch after this comment for mapping bricks to their shared glusterfsd). Heal info for the volumes also shows files pending heal, which was confirmed against the backend bricks.

Test version:
====
3.8.4-25
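For reference when re-running this check, a minimal sketch (not part of the original comment) that groups bricks by the glusterfsd PID shown in the last column of `gluster v status`, making it obvious which bricks share a process with b1 of vol7:

# On Brick lines the fields are: "Brick" host:path tcp-port rdma-port
# online pid, so $2 is the brick and $NF is its glusterfsd PID.
gluster v status | awk '/^Brick/ {bricks[$NF] = bricks[$NF] "\n  " $2}
    END {for (pid in bricks) printf "glusterfsd PID %s serves:%s\n\n", pid, bricks[pid]}'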
Nag,

This is a known issue, and the scenario is not handled completely yet. The issue only occurs when the brick goes down in an ungraceful manner; per this bugzilla, the brick was originally brought down gracefully (through the CLI). So please verify this bugzilla by following the same procedure as mentioned in comment 1.

A fix to handle this specific scenario is in progress in the patch below:
https://review.gluster.org/17287

Regards
Mohit Agrawal
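To make the graceful/ungraceful distinction concrete, a minimal sketch of the two ways a brick was taken down in this bug (the LV device path in the ungraceful case is illustrative only):

# Graceful: glusterd tears the brick down itself and updates its
# pidfile/socket bookkeeping (this is what comment 1 did).
gluster volume stop v4

# Ungraceful: pull the backing filesystem out from under the brick, so
# the brick dies without glusterd's involvement (comment 8's test).
umount -l /dev/vg_bricks/lv_brick7   # hypothetical device; use the real LV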
I agree with Mohit. The steps that were followed to file this bug and the steps that were followed to verify it are different. Please follow the same steps and reconfirm.
I cannot verify this until BZ#1450630 is fixed.
The patch for BZ#1450630 has already been merged downstream via this bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1450806.

Regards
Mohit Agrawal
Mohit - the current build doesn't have the fix, so Nag's comment is valid. Nag - as this bug has been moved to MODIFIED state, expect this fix to land in the next build.
ON_QA validation:
3.8.4-33 is the test build.

Ran both the cases mentioned in:
1) the description
2) comment#8

Not seeing the issue anymore; hence moving to VERIFIED.
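For completeness, a minimal sketch of the quick sanity check for the description's case after the fix (stopping v4 must now leave the v2 bricks online):

gluster v stop v4
# The Online column (second-to-last field on the Brick lines) should
# read Y for all four v2 bricks even after v4 is stopped.
gluster v status v2 | awk '/^Brick/ {print $2, "Online=" $(NF-1)}'
gluster v start v4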
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774