Description of problem:
While testing block PVC creation/deletion followed by node shutdown (done in order to reproduce a separate issue), one of the heketidbstorage bricks went offline.

Steps performed yesterday and today during this testing:
1. Create 20 block PVCs and assign them to 20 cirros pods.
2. Power off 1 gluster node (choose a node that hosts one brick each of the heketidbstorage volume and the block-hosting volume) so that one of the gluster pods goes down.
3. While that node is down, create 20 more block PVCs.
4. Power the gluster node back on and wait for all pods to reach the 1/1 state. Check the gluster bricks from inside one of the pods; all bricks were up.
5. Power down 1 node again (same node-selection criteria as in step 2).
6. Create 20 more PVCs and assign them to 20 cirros pods. Wait for the pods to come up.
7. Power the node back on. At this point 2 volumes are present in total (1 block-hosting volume and 1 heketidbstorage volume), with 60 block devices present and 60 cirros pods consuming those 60 block volumes.
8. Delete all the PVCs and cirros pods.
9. Create 20 more block PVCs (5 GB each), hosted on 2 block-hosting volumes.

When the node came back up at step 7, all bricks were up and running. After some time, however, one of the heketidbstorage bricks went offline.
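For reference, the PVC/pod creation in steps 1, 3, and 6 was driven with a loop along the following lines. This is a minimal sketch only: the storage class name "glusterfs-storage-block", the "cirros" image reference, the 5Gi size, and the pvc/pod name prefixes are stand-ins, not necessarily the exact values used in this run.

==============
#!/bin/bash
# Create N block PVCs against the gluster-block storage class and
# attach each one to a cirros pod so the block device stays in use.
N=20
for i in $(seq 1 "$N"); do
  oc create -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: block-pvc-$i
spec:
  storageClassName: glusterfs-storage-block
  accessModes: [ "ReadWriteOnce" ]
  resources:
    requests:
      storage: 5Gi
EOF
  oc create -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: cirros-$i
spec:
  containers:
  - name: cirros
    image: cirros
    command: [ "sleep", "3600000" ]
    volumeMounts:
    - name: data
      mountPath: /mnt
  volumes:
  - name: data
    persistentVolumeClaim:
      claimName: block-pvc-$i
EOF
done
==============

The node to power off in steps 2 and 5 can be picked by listing the brick hosts of heketidbstorage and the block-hosting volume (gluster volume info <volname> from inside any gluster pod) and choosing a host that appears in both lists.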
gluster v status output after the brick went offline:

==============
# oc rsh glusterfs-storage-2zr8j
sh-4.2# gluster v status
Status of volume: heketidbstorage
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.141:/var/lib/heketi/mounts/vg_ede1e924e59132dadfe79c5515986f70/brick_d5f28da6882efa9fde21055004e08c35/brick   N/A    N/A    N    N/A
Brick 10.70.46.38:/var/lib/heketi/mounts/vg_d2108b159f0ef9188ef8c63ec7295540/brick_01035b4a8dc955f63c06274610b0476c/brick    49152  0      Y    401
Brick 10.70.47.190:/var/lib/heketi/mounts/vg_10350b0a86976eaef1350b375c24abb4/brick_72f9013ae459a3d868be53df72342616/brick   49152  0      Y    406
Self-heal Daemon on localhost                            N/A    N/A    Y    42600
Self-heal Daemon on 10.70.47.115                         N/A    N/A    Y    56933
Self-heal Daemon on dhcp47-190.lab.eng.blr.redhat.com    N/A    N/A    Y    92848
Self-heal Daemon on 10.70.46.38                          N/A    N/A    Y    46806

Task Status of Volume heketidbstorage
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vol_1269c6d46399c7dab9f88fe1f4efcb02
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.141:/var/lib/heketi/mounts/vg_ede1e924e59132dadfe79c5515986f70/brick_e4f058aecea07ec8c7c517ecf56de002/brick   49152  0      Y    41946
Brick 10.70.46.38:/var/lib/heketi/mounts/vg_d2108b159f0ef9188ef8c63ec7295540/brick_597e233138a5b245c9e8fc5845d0e882/brick    49153  0      Y    46785
Brick 10.70.47.190:/var/lib/heketi/mounts/vg_10350b0a86976eaef1350b375c24abb4/brick_9dda7996e6d03862c26a391ec805cc4a/brick   49153  0      Y    91813
Self-heal Daemon on localhost                            N/A    N/A    Y    42600
Self-heal Daemon on 10.70.47.115                         N/A    N/A    Y    56933
Self-heal Daemon on dhcp47-190.lab.eng.blr.redhat.com    N/A    N/A    Y    92848
Self-heal Daemon on 10.70.46.38                          N/A    N/A    Y    46806

Task Status of Volume vol_1269c6d46399c7dab9f88fe1f4efcb02
------------------------------------------------------------------------------
There are no active volume tasks

Status of volume: vol_c6100fe5fc5eae87bdd4972aa79853c7
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.190:/var/lib/heketi/mounts/vg_e68a5ba833ae758bcb4e0e52c21126dc/brick_0508a00d31f9426d3a4541cc43c77c5a/brick   49153  0      Y    91813
Brick 10.70.47.115:/var/lib/heketi/mounts/vg_bd92a54d28d1ad649c0cb3e349e58e5d/brick_5a8043c95bff8c5951ebd1905c3a2025/brick   49152  0      Y    56377
Brick 10.70.47.141:/var/lib/heketi/mounts/vg_ede1e924e59132dadfe79c5515986f70/brick_0e348632c4fcbd91bcaa16f728b2695d/brick   49152  0      Y    41946
Self-heal Daemon on localhost                            N/A    N/A    Y    42600
Self-heal Daemon on 10.70.46.38                          N/A    N/A    Y    46806
Self-heal Daemon on 10.70.47.115                         N/A    N/A    Y    56933
Self-heal Daemon on dhcp47-190.lab.eng.blr.redhat.com    N/A    N/A    Y    92848

Task Status of Volume vol_c6100fe5fc5eae87bdd4972aa79853c7
------------------------------------------------------------------------------
There are no active volume tasks

# gluster peer status
Number of Peers: 3

Hostname: dhcp47-190.lab.eng.blr.redhat.com
Uuid: 6fce4225-bca3-4e04-98f4-f3e82a364566
State: Peer in Cluster (Connected)

Hostname: 10.70.47.115
Uuid: a07581ec-4a32-4da0-ab71-3d2612bdf100
State: Peer in Cluster (Connected)

Hostname: 10.70.46.38
Uuid: 5f14ee56-6a79-4e1f-975a-4f9c7aea1721
State: Peer in Cluster (Connected)
================

Version-Release number of selected component (if applicable):
sh-4.2# rpm -qa|grep gluster
glusterfs-fuse-3.12.2-27.el7rhgs.x86_64
python2-gluster-3.12.2-27.el7rhgs.x86_64
glusterfs-server-3.12.2-27.el7rhgs.x86_64
gluster-block-0.2.1-29.el7rhgs.x86_64
glusterfs-api-3.12.2-27.el7rhgs.x86_64
glusterfs-cli-3.12.2-27.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-27.el7rhgs.x86_64
glusterfs-libs-3.12.2-27.el7rhgs.x86_64
glusterfs-3.12.2-27.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-27.el7rhgs.x86_64

# rpm -qa|grep tcmu-runner
tcmu-runner-1.2.0-27.el7rhgs.x86_64

# rpm -qa|grep heketi
heketi-client-8.0.0-1.el7rhgs.x86_64
heketi-8.0.0-1.el7rhgs.x86_64

How reproducible:
Hit while testing PV creation/deletion followed by node power-off.

Steps to Reproduce:
Mentioned above.

Actual results:
One of the heketidbstorage bricks went offline.

Expected results:
No brick should be offline.
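For quick verification of the expected result, offline bricks can be listed with something like the following (a sketch, run from inside any gluster pod; it assumes the line layout of "gluster volume status ... detail" output, where each brick block carries one "Brick" line and one "Online" line):

==============
# Pair each "Brick" line with its "Online" line and print any brick
# whose Online field is "N".
gluster volume status all detail | grep -E '^(Brick|Online)' | paste - - | awk '$NF == "N"'
==============

An offline brick can usually be respawned with "gluster volume start <volname> force", but the expectation here is that the brick should not drop offline at all.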
Additional info:
# oc get nodes -o wide
NAME                                STATUS    ROLES     AGE   VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION          CONTAINER-RUNTIME
dhcp46-120.lab.eng.blr.redhat.com   Ready     infra     2d    v1.11.0+d4cacc0   10.70.46.120   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp46-159.lab.eng.blr.redhat.com   Ready     compute   2d    v1.11.0+d4cacc0   10.70.46.159   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp46-168.lab.eng.blr.redhat.com   Ready     infra     2d    v1.11.0+d4cacc0   10.70.46.168   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp46-238.lab.eng.blr.redhat.com   Ready     compute   2d    v1.11.0+d4cacc0   10.70.46.238   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp46-38.lab.eng.blr.redhat.com    Ready     compute   2d    v1.11.0+d4cacc0   10.70.46.38    <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-115.lab.eng.blr.redhat.com   Ready     compute   2d    v1.11.0+d4cacc0   10.70.47.115   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-141.lab.eng.blr.redhat.com   Ready     compute   2d    v1.11.0+d4cacc0   10.70.47.141   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-148.lab.eng.blr.redhat.com   Ready     infra     2d    v1.11.0+d4cacc0   10.70.47.148   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-190.lab.eng.blr.redhat.com   Ready     compute   2d    v1.11.0+d4cacc0   10.70.47.190   <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-31.lab.eng.blr.redhat.com    Ready     compute   2d    v1.11.0+d4cacc0   10.70.47.31    <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1
dhcp47-89.lab.eng.blr.redhat.com    Ready     master    2d    v1.11.0+d4cacc0   10.70.47.89    <none>        Employee SKU   3.10.0-957.el7.x86_64   docker://1.13.1

Attaching sosreports shortly.
After performing the above test case, all heketi and gluster block-hosting volumes were online. No issue was observed.