Description of problem:
In a 4-node setup, while creating and deleting 1 GB PVs in a loop, cleanup of heketi volumes failed. The heketi volume list and the gluster volume list both show 6 volumes, but heketi topology info shows the entire 1 TB of space as occupied.

Version-Release number of selected component (if applicable):
# oc rsh heketi-storage-1-55bw4
sh-4.2# rpm -qa | grep heketi
python-heketi-6.0.0-7.4.el7rhgs.x86_64
heketi-client-6.0.0-7.4.el7rhgs.x86_64
heketi-6.0.0-7.4.el7rhgs.x86_64

# rpm -qa | grep openshift
openshift-ansible-roles-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-master-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-sdn-ovs-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-docs-3.9.31-1.git.34.154617d.el7.noarch
openshift-ansible-playbooks-3.9.31-1.git.34.154617d.el7.noarch
atomic-openshift-docker-excluder-3.9.31-1.git.0.ef9737b.el7.noarch
atomic-openshift-node-3.9.31-1.git.0.ef9737b.el7.x86_64
atomic-openshift-clients-3.9.31-1.git.0.ef9737b.el7.x86_64
openshift-ansible-3.9.31-1.git.34.154617d.el7.noarch

sh-4.2# rpm -qa | grep gluster
glusterfs-client-xlators-3.8.4-54.8.el7rhgs.x86_64
glusterfs-cli-3.8.4-54.8.el7rhgs.x86_64
glusterfs-fuse-3.8.4-54.8.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-54.8.el7rhgs.x86_64
gluster-block-0.2.1-14.1.el7rhgs.x86_64
glusterfs-libs-3.8.4-54.8.el7rhgs.x86_64
glusterfs-3.8.4-54.8.el7rhgs.x86_64
glusterfs-api-3.8.4-54.8.el7rhgs.x86_64
glusterfs-server-3.8.4-54.8.el7rhgs.x86_64

How reproducible:
1:1

Steps to Reproduce:
1. On a CNS 3.9 setup, initiated PVC creation and deletion simultaneously with the loop below (a sketch of pvc_create.sh follows the steps). gluster v heal was also running in all 4 gluster pods.

while true
do
  for i in {101..150}
  do
    ./pvc_create.sh c$i 1
    sleep 10
  done
  sleep 30
  for i in {101..150}
  do
    oc delete pvc c$i
    sleep 5
  done
done

2. After running this for some time, observed that the number of volumes in heketi is 6 and in gluster it is 7, but heketi-cli topology info shows the entire 999 GB used, with 1 GB bricks on all nodes. Some nodes have free space = 2 GB.
3. Faced the shd crash issue reported separately.
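For reference, pvc_create.sh is not attached to this bug. The following is a minimal sketch of what such a helper typically does, assuming it takes a PVC name and a size in GiB; the script body, access mode, and the storage class name "glusterfs-storage" are assumptions, not copied from this setup.

#!/bin/bash
# pvc_create.sh <pvc-name> <size-in-GiB>
# NOTE: reconstructed sketch; the actual script used for this bug was not attached.
# The storage class name and access mode below are assumptions.
NAME=$1
SIZE=$2

cat <<EOF | oc create -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ${NAME}
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: ${SIZE}Gi
  storageClassName: glusterfs-storage
EOF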
----- snip of heketi log --------------------------
Result: [kubeexec] ERROR 2018/07/13 21:26:24 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:240: Failed to run command [gluster --mode=script volume create cns-vol_glusterfs_c115_60892937-86e3-11e8-aca0-005056a5a62b replica 3 10.70.46.29:/var/lib/heketi/mounts/vg_258cfffb4ca720953c6224286ce775a3/brick_5d0f13c27863545da3dc705aa5c1225b/brick 10.70.46.124:/var/lib/heketi/mounts/vg_7ad51cc79e80230702aebe4a1f67da7e/brick_1dff5c85227341dbaa83a5afcd8e8b4d/brick 10.70.46.210:/var/lib/heketi/mounts/vg_8b5aa693c98fbca5d4f666886869cdad/brick_afa7fd041a4a42ba1ff7534220576680/brick] on glusterfs-storage-pq7l9: Err[command terminated with exit code 1]: Stdout []: Stderr [volume create: cns-vol_glusterfs_c115_60892937-86e3-11e8-aca0-005056a5a62b: failed: Brick: 10.70.46.29:/var/lib/heketi/mounts/vg_258cfffb4ca720953c6224286ce775a3/brick_5d0f13c27863545da3dc705aa5c1225b/brick not available. Brick may be containing or be contained by an existing brick.
]
[kubeexec] DEBUG 2018/07/13 21:26:24 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:244: Host: dhcp46-124.lab.eng.blr.redhat.com Pod: glusterfs-storage-kfhp6 Command: mount -o rw,inode64,noatime,nouuid /dev/mapper/vg_7ad51cc79e80230702aebe4a1f67da7e-brick_4e3173d5f96d48661c2b2bbaee089717 /var/lib/heketi/mounts/vg_7ad51cc79e80230702aebe4a1f67da7e/brick_4e3173d5f96d48661c2b2bbaee089717
----- snip of heketi log --------------------------

---- snip of heketi topology ----------------
Node Id: f57fd045067b4f11685b36c7c812fa2f
State: online
Cluster Id: 0b33bcaf4015ceace8d12202aac4883a
Zone: 1
Management Hostnames: dhcp47-70.lab.eng.blr.redhat.com
Storage Hostnames: 10.70.47.70
Devices:
    Id:a45588231a0f181cf9c5066c1b2d906e  Name:/dev/sdd  State:online  Size (GiB):999  Used (GiB):999  Free (GiB):0
        Bricks:
            Id:002f0e6048291f5625ca4ec9966b0a96  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_002f0e6048291f5625ca4ec9966b0a96/brick
            Id:011681334874a3d431e7cf005cb0226d  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_011681334874a3d431e7cf005cb0226d/brick
            Id:011ecfa848996351e072df1c4201bbf7  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_011ecfa848996351e072df1c4201bbf7/brick
            Id:013635c4eb0542dda91fa2f6af7115e7  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_013635c4eb0542dda91fa2f6af7115e7/brick
            Id:017bfb8e83fc10c49436298c73221d42  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_017bfb8e83fc10c49436298c73221d42/brick
            Id:01a1b253b024592a2e719e5706c2efe4  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_01a1b253b024592a2e719e5706c2efe4/brick
            Id:01e46ab2d35cc0fec3b5ee36d4b1e6dd  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_01e46ab2d35cc0fec3b5ee36d4b1e6dd/brick
            Id:02918cf4ce08a503964aa008083ed9cc  Size (GiB):1  Path: /var/lib/heketi/mounts/vg_a45588231a0f181cf9c5066c1b2d906e/brick_02918cf4ce08a503964aa008083ed9cc/brick
---- snip of heketi topology ----------------

Actual results:
The volume list and topology output do not match: only 6 volumes exist, yet the device reports 999 GiB used out of 999 GiB, with leftover 1 GiB bricks on all nodes.

Expected results:
The volume list and topology output should match, and space from deleted volumes should be freed on the devices.

Additional info:
Will attach logs.
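For illustration only (not output captured from this setup), a rough way to compare heketi's view against gluster's and to look for leftover brick LVs, assuming heketi-cli is pointed at the heketi service (HEKETI_CLI_SERVER, user, and key exported) and reusing the pod and VG names from the log snippet above:

# Compare volume counts as heketi and gluster see them
heketi-cli volume list | wc -l
oc rsh glusterfs-storage-pq7l9 gluster volume list | wc -l

# Device usage and per-brick detail, to compare against the volumes that actually exist
heketi-cli topology info

# Inside a gluster pod: check for brick LVs and mounts left behind after PVC deletion
# (VG name taken from the failed volume create in the log above)
oc rsh glusterfs-storage-pq7l9 lvs vg_258cfffb4ca720953c6224286ce775a3
oc rsh glusterfs-storage-pq7l9 df -h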
Version-Release number of selected component (if applicable):
# oc rsh heketi-storage-1-55bw4
sh-4.2# rpm -qa | grep heketi
python-heketi-6.0.0-7.4.el7rhgs.x86_64
heketi-client-6.0.0-7.4.el7rhgs.x86_64
heketi-6.0.0-7.4.el7rhgs.x86_64

Why is the CNS 3.9 build being used?
(In reply to Raghavendra Talur, comment #4) This setup was created as part of Experian hotfix testing, so the CNS 3.9 builds were used before upgrading the setup to the hotfix build.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:2686