Description of problem:
++++++++++++++++++++++++
While creating PVCs, a glusterfs pod was re-spun, so the PVC creation stayed in Pending state until the pod resumed (expected). Heketi had already created the brick for the PVC before the pod was brought down. Since the pod was re-spun, the attempt to create the volume (vol_796de4d1aa78be36cf0bb10f8c97969d) errored out as expected, and heketi then tried to delete the 3 bricks. The brick delete on the pod (glusterfs-storage-mgp4l) failed while the pod was still rebooting, but even after the pod was up, the brick was not removed. It stayed behind as a "dangling" brick. Thus, even though there are 14 volumes in total, df -kh on glusterfs-storage-mgp4l lists 15 mounts.

Issues:
==============

1. /var/lib/heketi/fstab has 14 entries but df -kh shows 15 mounts

# oc rsh glusterfs-storage-mgp4l cat /var/lib/heketi/fstab | wc -l
14
# oc rsh glusterfs-storage-mgp4l df -kh | grep heketi | wc -l
15

Extra mounted brick:
-------------------------
/dev/mapper/vg_60ecbe6973055ed0c2dad42f49970964-brick_80548ae35a7269c74c3cc67408c70e12  4.0G  33M  4.0G  1%  /var/lib/heketi/mounts/vg_60ecbe6973055ed0c2dad42f49970964/brick_80548ae35a7269c74c3cc67408c70e12

2.
Because of #1, heketi-cli topology info shows the Used size of /dev/sdd as 133 GiB instead of 129 GiB (the bricks listed on this device add up to 129):

Id:60ecbe6973055ed0c2dad42f49970964   Name:/dev/sdd   State:online   Size (GiB):1023   Used (GiB):133   Free (GiB):890

Some logs from the heketi side
===========================

[kubeexec] ERROR 2018/10/04 15:47:33 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:242: Failed to run command [gluster --mode=script volume create vol_796de4d1aa78be36cf0bb10f8c97969d replica 3 10.70.46.89:/var/lib/heketi/mounts/vg_4c9c906a913170789d2c6650c5992720/brick_03117942d3ed6bf559a6aa27759c95cb/brick 10.70.42.218:/var/lib/heketi/mounts/vg_60ecbe6973055ed0c2dad42f49970964/brick_80548ae35a7269c74c3cc67408c70e12/brick 10.70.46.146:/var/lib/heketi/mounts/vg_8c0b9044f791cc2f7f1e4758f62baaf2/brick_a2ed089e25e169bf00e49f5e3820c5d3/brick] on glusterfs-storage-pfpsn: Err[command terminated with exit code 1]: Stdout []: Stderr [volume create: vol_796de4d1aa78be36cf0bb10f8c97969d: failed: Host 10.70.42.218 not connected

[heketi] INFO 2018/10/04 15:47:37 Deleting brick 80548ae35a7269c74c3cc67408c70e12

[kubeexec] DEBUG 2018/10/04 15:47:38 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:246: Host: dhcp42-218.lab.eng.blr.redhat.com Pod: glusterfs-storage-mgp4l Command: umount /var/lib/heketi/mounts/vg_60ecbe6973055ed0c2dad42f49970964/brick_80548ae35a7269c74c3cc67408c70e12

[kubeexec] DEBUG 2018/10/04 15:47:39 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:246: Host: dhcp42-218.lab.eng.blr.redhat.com Pod: glusterfs-storage-mgp4l Command: sed -i.save "/brick_80548ae35a7269c74c3cc67408c70e12/d" /var/lib/heketi/fstab

[kubeexec] ERROR 2018/10/04 15:47:41 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:242: Failed to run command [lvremove --autobackup=n -f vg_60ecbe6973055ed0c2dad42f49970964/brick_80548ae35a7269c74c3cc67408c70e12] on glusterfs-storage-mgp4l: Err[command terminated with exit code 5]: Stdout []: Stderr [
/run/lvm/lvmetad.socket: connect failed: Connection refused
  Logical volume vg_60ecbe6973055ed0c2dad42f49970964/brick_80548ae35a7269c74c3cc67408c70e12 contains a filesystem in use.

[heketi] ERROR 2018/10/04 15:47:41 /src/github.com/heketi/heketi/apps/glusterfs/brick_create.go:60: error destroying brick 80548ae35a7269c74c3cc67408c70e12: /run/lvm/lvmetad.socket: connect failed: Connection refused
  Logical volume vg_60ecbe6973055ed0c2dad42f49970964/brick_80548ae35a7269c74c3cc67408c70e12 contains a filesystem in use.
  Logical volume vg_60ecbe6973055ed0c2dad42f49970964/brick_80548ae35a7269c74c3cc67408c70e12 contains a filesystem in use.
  Logical volume vg_60ecbe6973055ed0c2dad42f49970964/brick_80548ae35a7269c74c3cc67408c70e12 contains a filesystem in use.

[root@dhcp47-135 bz1]# oc rsh glusterfs-storage-mgp4l
sh-4.2# vgs
  /run/lvm/lvmetad.socket: connect failed: Connection refused
  WARNING: Failed to connect to lvmetad. Falling back to device scanning.
  VG                                  #PV #LV #SN Attr   VSize    VFree
  docker-vg                             1   1   0 wz--n- <100.00g   60.00g
  rhel_dhcp47-20                        1   3   0 wz--n-   95.00g    4.00m
  vg_60ecbe6973055ed0c2dad42f49970964   1  20   0 wz--n- 1023.87g <889.68g
  vg_7957c71c22126246a74f74810f1878bb   1  10   0 wz--n- 1023.87g 1003.73g
sh-4.2#

# oc get pods | grep storage
glusterblock-storage-provisioner-dc-1-w66dm   1/1   Running   0   17h
glusterfs-storage-4fv4c                       1/1   Running   0   14h   <--- this node is disabled, hence no brick exists
glusterfs-storage-lbnv9                       1/1   Running   1   18h
glusterfs-storage-mgp4l                       1/1   Running   0   14h
glusterfs-storage-pfpsn                       1/1   Running   0   18h
heketi-storage-1-kp5kq                        1/1   Running   0   17h

Version-Release number of selected component (if applicable):
++++++++++++++++++++++++
# oc rsh glusterfs-storage-4fv4c rpm -qa | grep gluster
glusterfs-libs-3.12.2-18.1.el7rhgs.x86_64
glusterfs-3.12.2-18.1.el7rhgs.x86_64
glusterfs-api-3.12.2-18.1.el7rhgs.x86_64
python2-gluster-3.12.2-18.1.el7rhgs.x86_64
glusterfs-fuse-3.12.2-18.1.el7rhgs.x86_64
glusterfs-server-3.12.2-18.1.el7rhgs.x86_64
gluster-block-0.2.1-27.el7rhgs.x86_64
glusterfs-client-xlators-3.12.2-18.1.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.1.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-18.1.el7rhgs.x86_64

# oc rsh heketi-storage-1-kp5kq rpm -qa | grep heketi
heketi-client-7.0.0-14.el7rhgs.x86_64
heketi-7.0.0-14.el7rhgs.x86_64

# oc version
oc v3.11.16

How reproducible:
++++++++++++++
This use case was tested only once, when we hit the issue.

Steps to Reproduce:
+++++++++++++++++++++
Note: This was a 4-node setup, but one node was permanently in a disabled state, so all bricks were created on the same set of 3 nodes.

1. The system already had 10 file volumes created, 1 BHV, and 1 heketidb volume.

2. Started a for loop to create 2 more file volumes:

# for i in {11..12}; do ./pvc-create_file.sh file$i 4; sleep 5; done
persistentvolumeclaim/file11 created
persistentvolumeclaim/file12 created

3. Immediately re-spun the glusterfs pod glusterfs-storage-mgp4l:

# oc delete pod glusterfs-storage-mgp4l

4. As expected, the PVCs went into Pending state. Heketi had already created the brick, hence it tried deleting it once the volume create failed.

# oc get pvc
NAME     STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS        AGE
file11   Pending                                      glusterfs-storage   33s
file12   Pending                                      glusterfs-storage   28s

5. Once the pod came up, checked the PVC status. It was Bound successfully after some time.

6. Checked heketi-cli topology info and df -kh on each glusterfs pod to look for any dangling brick. One extra brick was found on the pod that was re-spun.

# oc rsh glusterfs-storage-mgp4l df -kh | grep heketi | wc -l
15
# oc rsh glusterfs-storage-mgp4l cat /var/lib/heketi/fstab | wc -l
14

Actual results:
++++++++++++
Once the pod came back up, the number of bricks in df -kh was one more than the actual brick count on the pod. There was a dangling brick which could not be deleted from the VG.

Expected results:
+++++++++++++++++
There should not be any dangling brick left behind in the df -kh output of any gluster pod.
Also, the brick should have been deleted, since the volume was not created successfully and a cleanup was being performed by heketi (one of the pods was re-spun in that time frame).
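As a verification aid, here is a minimal shell sketch of the dangling-brick check described above, followed by a dry-run of the cleanup heketi attempted (umount, then lvremove). The brick lists below are sample data standing in for the real /var/lib/heketi/fstab entries and df -kh output from the pod; drop the echo prefixes only after confirming the filesystem is no longer busy.

```shell
# Sample data: bricks listed in /var/lib/heketi/fstab vs. bricks actually
# mounted (from df -kh). In reality these would come from
# 'oc rsh glusterfs-storage-mgp4l'.
fstab_bricks='brick_03117942d3ed6bf559a6aa27759c95cb
brick_a2ed089e25e169bf00e49f5e3820c5d3'
mounted_bricks='brick_03117942d3ed6bf559a6aa27759c95cb
brick_80548ae35a7269c74c3cc67408c70e12
brick_a2ed089e25e169bf00e49f5e3820c5d3'

# Keep mounted bricks that have no exact match in the fstab list,
# i.e. the dangling ones (grep treats each pattern line separately).
dangling=$(printf '%s\n' "$mounted_bricks" | grep -Fxv "$fstab_bricks")
echo "dangling: $dangling"

# Dry-run of the cleanup heketi tried; remove the 'echo's to execute
# the commands inside the pod.
vg=vg_60ecbe6973055ed0c2dad42f49970964
for brick in $dangling; do
  echo "umount /var/lib/heketi/mounts/$vg/$brick"
  echo "lvremove --autobackup=n -f $vg/$brick"
done
```

The lvremove flags mirror the command heketi itself ran in the logs above, so a successful manual run confirms the failure was transient (filesystem busy during the reboot) rather than permanent.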
Even though it is not critical, this warning is not expected either:

/run/lvm/lvmetad.socket: connect failed: Connection refused
WARNING: Failed to connect to lvmetad. Falling back to device scanning.

The ocs/rhgs-server-rhel7:3.11.0-2 image should have /run/lvm bind-mounted so that the LVM commands inside the pod can use lvmetad from the host. Please check the version of the glusterfs-storage-mgp4l pod and verify that there is no problem with lvmetad on the host where it is running.
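That bind-mount can be checked from the pod's mount table; a minimal sketch, where a sample /proc/mounts line stands in for the real output of 'oc rsh glusterfs-storage-mgp4l cat /proc/mounts':

```shell
# Sample mount-table line; in reality read /proc/mounts inside the pod.
pod_mounts='tmpfs /run/lvm tmpfs rw,nosuid 0 0'

# If /run/lvm is mounted into the pod, lvm commands can reach the host's
# lvmetad socket; otherwise they fall back to device scanning (the
# warning seen above).
if printf '%s\n' "$pod_mounts" | grep -q ' /run/lvm '; then
  lvm_check="/run/lvm present: lvmetad reachable from the pod"
else
  lvm_check="/run/lvm missing: lvmetad falls back to device scanning"
fi
echo "$lvm_check"
```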
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2019:0286