Description of problem:

On a CNS setup, when volume provisioning and device remove operations are run in parallel, all of the operations succeed, but device info does not list all of the bricks: only 6 out of 8 bricks are listed. A device remove operation on a device with such inconsistent device info could lead to a situation where bricks are not cleaned up completely.

[root@dhcp46-202 ~]# heketi-cli device info 11b2fd1b238afb0b62dbd4a7d3d42263
Device Id: 11b2fd1b238afb0b62dbd4a7d3d42263
Name: /dev/sde
State: online
Size (GiB): 299
Used (GiB): 10
Free (GiB): 289
Bricks:
Id:0092b14fbdf176ead4339881527a07a8   Size (GiB):2   Path: /var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_0092b14fbdf176ead4339881527a07a8/brick
Id:b66a520fddb08b8632c22aabe9319e24   Size (GiB):2   Path: /var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_b66a520fddb08b8632c22aabe9319e24/brick
Id:bb1c1eb34ed64c8894adb2392711e9ec   Size (GiB):2   Path: /var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_bb1c1eb34ed64c8894adb2392711e9ec/brick
Id:cf16d487542871c69b0abc4a6adf6bd3   Size (GiB):1   Path: /var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_cf16d487542871c69b0abc4a6adf6bd3/brick
Id:e7d45280d809f5fa41cb47e042d3eb19   Size (GiB):1   Path: /var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_e7d45280d809f5fa41cb47e042d3eb19/brick
Id:fb7950d913ab25f360709183741fabc0   Size (GiB):2   Path: /var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_fb7950d913ab25f360709183741fabc0/brick

[please refer to the attached topology info for detailed output]

[root@dhcp47-78 ~]# pvs
  PV         VG                                  Fmt  Attr PSize   PFree
  /dev/sda2  rhel_dhcp47-183                     lvm2 a--  100.00g      0
  /dev/sdd   vg_0ab42999bff09fb8519983c08747a5ae lvm2 a--  299.87g 299.87g
  /dev/sde   vg_11b2fd1b238afb0b62dbd4a7d3d42263 lvm2 a--  299.87g 287.78g
  /dev/sdf   vg_a0d0eb24659c776d974d7f9ade26c425 lvm2 a--   99.87g  99.87g

[root@dhcp47-78 ~]# lvs
  LV                                     VG                                  Attr       LSize  Pool                                Origin Data%  Meta%  Move Log Cpy%Sync Convert
  home                                   rhel_dhcp47-183                     -wi-ao---- 50.00g
  root                                   rhel_dhcp47-183                     -wi-ao---- 50.00g
  brick_0092b14fbdf176ead4339881527a07a8 vg_11b2fd1b238afb0b62dbd4a7d3d42263 Vwi-aotz--  2.00g tp_0092b14fbdf176ead4339881527a07a8         0.74
  brick_59712ce42fe1412fb2259be73e164e56 vg_11b2fd1b238afb0b62dbd4a7d3d42263 Vwi-aotz--  1.00g tp_59712ce42fe1412fb2259be73e164e56         1.12
  brick_b66a520fddb08b8632c22aabe9319e24 vg_11b2fd1b238afb0b62dbd4a7d3d42263 Vwi-aotz--  2.00g tp_b66a520fddb08b8632c22aabe9319e24        10.47
  brick_bb1c1eb34ed64c8894adb2392711e9ec vg_11b2fd1b238afb0b62dbd4a7d3d42263 Vwi-aotz--  2.00g tp_bb1c1eb34ed64c8894adb2392711e9ec        10.47
  brick_bd202a73e2e54924bd2dfc94927569ad vg_11b2fd1b238afb0b62dbd4a7d3d42263 Vwi-aotz--  1.00g tp_bd202a73e2e54924bd2dfc94927569ad         1.12
  brick_cf16d487542871c69b0abc4a6adf6bd3 vg_11b2fd1b238afb0b62dbd4a7d3d42263 Vwi-aotz--  1.00g tp_cf16d487542871c69b0abc4a6adf6bd3        20.65
  brick_e7d45280d809f5fa41cb47e042d3eb19 vg_11b2fd1b238afb0b62dbd4a7d3d42263 Vwi-aotz--  1.00g tp_e7d45280d809f5fa41cb47e042d3eb19         1.15
  brick_fb7950d913ab25f360709183741fabc0 vg_11b2fd1b238afb0b62dbd4a7d3d42263 Vwi-aotz--  2.00g tp_fb7950d913ab25f360709183741fabc0        10.56
  tp_0092b14fbdf176ead4339881527a07a8    vg_11b2fd1b238afb0b62dbd4a7d3d42263 twi-aotz--  2.00g                                             0.74   0.33
  tp_59712ce42fe1412fb2259be73e164e56    vg_11b2fd1b238afb0b62dbd4a7d3d42263 twi-aotz--  1.00g                                             1.12   0.49
  tp_b66a520fddb08b8632c22aabe9319e24    vg_11b2fd1b238afb0b62dbd4a7d3d42263 twi-aotz--  2.00g                                            10.47   0.52
  tp_bb1c1eb34ed64c8894adb2392711e9ec    vg_11b2fd1b238afb0b62dbd4a7d3d42263 twi-aotz--  2.00g                                            10.47   0.52
  tp_bd202a73e2e54924bd2dfc94927569ad    vg_11b2fd1b238afb0b62dbd4a7d3d42263 twi-aotz--  1.00g                                             1.12   0.49
  tp_cf16d487542871c69b0abc4a6adf6bd3    vg_11b2fd1b238afb0b62dbd4a7d3d42263 twi-aotz--  1.00g                                            20.65   0.78
  tp_e7d45280d809f5fa41cb47e042d3eb19    vg_11b2fd1b238afb0b62dbd4a7d3d42263 twi-aotz--  1.00g                                             1.15   0.49
  tp_fb7950d913ab25f360709183741fabc0    vg_11b2fd1b238afb0b62dbd4a7d3d42263 twi-aotz--  2.00g                                            10.56   0.52

Brick paths for this device as seen on the gluster side ("--> Missing" marks bricks absent from heketi device info):
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_fb7950d913ab25f360709183741fabc0/brick
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_cf16d487542871c69b0abc4a6adf6bd3/brick
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_b66a520fddb08b8632c22aabe9319e24/brick
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_59712ce42fe1412fb2259be73e164e56/brick   --> Missing
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_0092b14fbdf176ead4339881527a07a8/brick
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_e7d45280d809f5fa41cb47e042d3eb19/brick
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_bb1c1eb34ed64c8894adb2392711e9ec/brick
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_bd202a73e2e54924bd2dfc94927569ad/brick   --> Missing

Information for the following two bricks is missing under device info:
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_59712ce42fe1412fb2259be73e164e56/brick
10.70.47.78:/var/lib/heketi/mounts/vg_11b2fd1b238afb0b62dbd4a7d3d42263/brick_bd202a73e2e54924bd2dfc94927569ad/brick

However, all volumes were created successfully [please refer to the attached gluster vol info output].

heketi-cli node info 65deb4410d8147343febf3bb5643e176
Node Id: 65deb4410d8147343febf3bb5643e176
State: online
Cluster Id: 27ac988a3e7bf097c3c289f402c8cf24
Zone: 2
Management Hostname: dhcp47-78.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.78
Devices:
Id:11b2fd1b238afb0b62dbd4a7d3d42263   Name:/dev/sde   State:online   Size (GiB):299   Used (GiB):10   Free (GiB):289
Id:a0d0eb24659c776d974d7f9ade26c425   Name:/dev/sdf   State:failed   Size (GiB):99    Used (GiB):0    Free (GiB):99

Corresponding volume and PVC details:
vol_f767bea3158ccc5f4f80a1d27cc93b22 --> glusterfs-dynamic-mongodb-19
vol_a1f30eda3cb826ab34c3d8cad35cfa23 --> glusterfs-dynamic-mongodb-6

Version-Release number of selected component (if applicable):
heketi-client-4.0.0-4.el7rhgs.x86_64

How reproducible:
1/1

Steps to Reproduce:
1. Have a CNS setup with node-{1,2,3} and device-{1,2} on each node.
2. Keep creating volumes in the background until the test is over (a shell sketch of steps 2-4 appears at the end of this report).
3. Run device disable and device remove on node-1's device-1 and device-2 one by one. The device remove proceeds for device-1 while device-2 goes offline, so the device remove on device-2 fails.
4. Once the device remove on node-1's device-1 completes, stop volume creation.
5. Check whether the brick count on node-1's device-2 corresponds to the number of volumes created in the background.

Actual results:
Brick information for 2 of the volumes is missing from device info.

Expected results:
Brick information for all volumes should be present.

Additional info:
All logs and CLI outputs shall be attached shortly.
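For reference, a minimal shell sketch of steps 2-4 above. This is an approximation of the parallel flow, not the exact test harness: it assumes HEKETI_CLI_SERVER (and any user/secret options) are already set for heketi-cli, and DEVICE1_ID, DEVICE2_ID, and the /tmp/stop-creates marker file are hypothetical placeholders (the real device IDs come from "heketi-cli topology info"):

  # Step 2: keep creating 1 GiB volumes in the background until told to stop.
  while [ ! -e /tmp/stop-creates ]; do
      heketi-cli volume create --size=1
  done &

  # Step 3: disable and remove node-1's devices one by one.
  heketi-cli device disable $DEVICE1_ID
  heketi-cli device remove $DEVICE1_ID    # proceeds while volumes are being created

  heketi-cli device disable $DEVICE2_ID
  heketi-cli device remove $DEVICE2_ID    # fails in this scenario (device offline)

  # Step 4: once the remove of device-1 completes, stop volume creation.
  touch /tmp/stop-creates
  wait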
oc get pods -o wide
NAME                             READY     STATUS    RESTARTS   AGE       IP             NODE
glusterfs-1rksf                  1/1       Running   1          1d        10.70.47.180   dhcp47-180.lab.eng.blr.redhat.com
glusterfs-3t02m                  1/1       Running   4          1d        10.70.47.51    dhcp47-51.lab.eng.blr.redhat.com
glusterfs-ks6zl                  1/1       Running   1          1d        10.70.47.65    dhcp47-65.lab.eng.blr.redhat.com
glusterfs-nh11g                  1/1       Running   1          1d        10.70.47.21    dhcp47-21.lab.eng.blr.redhat.com
glusterfs-qzrdm                  1/1       Running   1          1d        10.70.47.78    dhcp47-78.lab.eng.blr.redhat.com
glusterfs-z89sm                  1/1       Running   1          1d        10.70.46.165   dhcp46-165.lab.eng.blr.redhat.com
heketi-1-j6b9n                   1/1       Running   1          1d        10.130.2.10    dhcp46-165.lab.eng.blr.redhat.com
mongodb-1-1-tmrc5                1/1       Running   1          1d        10.129.2.5     dhcp47-65.lab.eng.blr.redhat.com
mongodb-19-1-tgpd1               1/1       Running   0          56m       10.130.2.12    dhcp46-165.lab.eng.blr.redhat.com
mongodb-2-1-cgcqv                1/1       Running   1          1d        10.128.2.12    dhcp47-21.lab.eng.blr.redhat.com
mongodb-20-1-0l4b8               1/1       Running   0          57m       10.129.2.7     dhcp47-65.lab.eng.blr.redhat.com
mongodb-3-1-0518g                1/1       Running   2          1h        10.129.0.8     dhcp47-51.lab.eng.blr.redhat.com
mongodb-4-1-wlprf                1/1       Running   0          1h        10.128.2.13    dhcp47-21.lab.eng.blr.redhat.com
mongodb-5-1-cxf1h                1/1       Running   0          57m       10.130.0.6     dhcp47-78.lab.eng.blr.redhat.com
mongodb-6-1-p7nw1                1/1       Running   0          56m       10.131.0.7     dhcp47-180.lab.eng.blr.redhat.com
storage-project-router-1-l68vj   1/1       Running   1          2d        10.70.47.78    dhcp47-78.lab.eng.blr.redhat.com
Created attachment 1267412 [details]
topology info, heketi logs, gluster vol info outputs
We have an RCA, and a patch is underway. This is critical to fix even though it is not a regression; hence giving devel-ack.
Patch posted and merged upstream: https://github.com/heketi/heketi/pull/736
The reported issue is no longer seen with heketi-4.0.0-6.el7rhgs. Ran the following tests:
1) volume create + device remove - 3 iterations
2) volume delete + device remove

Moving the bug to verified.
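For reference, a rough way to script the consistency check from step 5 of the reproduction (comparing the bricks heketi reports on a device against the brick LVs actually present on the node). DEVICE_ID is a hypothetical placeholder, and this assumes heketi's usual vg_<device-id> volume group naming, as seen in the pvs/lvs outputs above; the two counts should match on a healthy setup:

  # Bricks heketi lists for the device:
  heketi-cli device info $DEVICE_ID | grep -c 'Path:'

  # Brick LVs actually present in the device's VG on the node
  # (thin pools are named tp_<id>, so they are not counted here):
  lvs --noheadings -o lv_name vg_$DEVICE_ID | grep -c 'brick_'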
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1111