Description of problem:
If we disable a device and then remove it, heketi shows that the bricks were moved to another existing device, BUT the removed device still reports "used" space equal to the size of the bricks that were on it. We are then able to delete such a device. At that point we cannot add it back, even after running the "wipefs" command in the corresponding glusterfs pod.

Version-Release number of selected component (if applicable):
Heketi server: heketi-7.0.0-11.el7rhgs.x86_64
Heketi client: heketi-client-7.0.0-8.el7rhgs.x86_64
Storage release version: Red Hat Gluster Volume Manager 3.4.0 (Container)
Image: brew-pulp-docker01.web.prod.ext.phx2.redhat.com:8888/rhgs3/rhgs-volmanager-rhel7:v3.10

How reproducible:
On one node it failed for both of its 2 devices; overall it failed on only 1 of the 2 nodes.

Steps to Reproduce:
1. Create a heketi topology using a couple of devices
2. Create a couple of volumes
3. Add one more device to heketi
4. Disable one of the devices that contains some bricks
5. Remove the disabled device
6. Delete the removed device
7. Add the deleted device back

Actual results:
The "re-add device" attempt fails with the following error:
"""
Error: Can't open /dev/sdd exclusively.  Mounted filesystem?
"""
Prior to that, "Used space" did not change after the "remove device" operation.

Expected results:
All bricks are evacuated by the "remove device" operation and "Used space" is set to 0. Adding the device back succeeds.
Created attachment 1484065 [details] Heketi DB dump Adding Heketi DB dump
Created attachment 1484066 [details] Heketi server logs Adding Heketi server logs
Also, after seeing the error, I was told to try running the "wipefs" command and retry. It didn't help; results:

[root@vp-ansible-v310-ga-master-0 ~]# heketi-cli device add --name /dev/sdd --node 8966bb38131b92389d55e09862ca9cce
Error: Can't open /dev/sdd exclusively.  Mounted filesystem?
[root@vp-ansible-v310-ga-master-0 ~]# ssh root@vp-ansible-v310-ga-app-cns-1
Last login: Mon Sep 17 08:06:55 2018 from vp-ansible-v310-ga-master-0
[root@vp-ansible-v310-ga-app-cns-1 ~]# wipefs /dev/sdd
offset               type
----------------------------------------------------------------
0x218                LVM2_member   [raid]
                     UUID:  gIk6eH-yHuU-py77-8PCo-lbuk-98fI-K9OveQ
====================================================
[root@vp-ansible-v310-ga-app-cns-1 ~]# wipefs -a /dev/sdd
wipefs: error: /dev/sdd: probing initialization failed: Device or resource busy
====================================================
[root@vp-ansible-v310-ga-app-cns-1 ~]# pvscan
  PV /dev/sdd    VG vg_68c9ef3f5a2e31d2976565f9f187a6cf   lvm2 [99.87 GiB / <97.85 GiB free]
  PV /dev/sda2   VG rhel_dhcp46-210                       lvm2 [<39.00 GiB / 0    free]
  PV /dev/sdb1   VG docker-vol                            lvm2 [<40.00 GiB / 0    free]
  PV /dev/sdf    VG vg_1517317e9a1fc96c830c9493c49833f9   lvm2 [99.87 GiB / <97.85 GiB free]
  PV /dev/sde    VG vg_ca1fd84ddfbd19783537ea7d61e39f9b   lvm2 [199.87 GiB / <196.84 GiB free]
  Total: 5 [<478.61 GiB] / in use: 5 [<478.61 GiB] / in no VG: 0 [0   ]
====================================================
[root@vp-ansible-v310-ga-app-cns-1 ~]# wipefs --force --all /dev/sdf
/dev/sdf: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
[root@vp-ansible-v310-ga-app-cns-1 ~]# wipefs --force --all /dev/sdd
/dev/sdd: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
====================================================
[root@vp-ansible-v310-ga-app-cns-1 ~]# pvscan
  PV /dev/sda2   VG rhel_dhcp46-210                       lvm2 [<39.00 GiB / 0    free]
  PV /dev/sdb1   VG docker-vol                            lvm2 [<40.00 GiB / 0    free]
  PV /dev/sde    VG vg_ca1fd84ddfbd19783537ea7d61e39f9b   lvm2 [199.87 GiB / <196.84 GiB free]
  Total: 3 [278.86 GiB] / in use: 3 [278.86 GiB] / in no VG: 0 [0   ]
===================================================
[root@vp-ansible-v310-ga-master-0 ~]# heketi-cli device add --name /dev/sdd --node 8966bb38131b92389d55e09862ca9cce
Error: Can't open /dev/sdd exclusively.  Mounted filesystem?
[root@vp-ansible-v310-ga-master-0 ~]# heketi-cli device add --name /dev/sdf --node 8966bb38131b92389d55e09862ca9cce
Error: Can't open /dev/sdf exclusively.  Mounted filesystem?
Do you have the cluster in this state such that I could log on to the cluster and reproduce the error condition myself? If not, does the problem go away after you either (a) restart the gluster pod or (b) reboot the node?
John, I rebooted neither the pod nor the node. I still have the cluster I used for this. Will send you the cluster creds in email.
It's not so much that Heketi is failing to clean up the node; rather, the actual resources are still in use by the kernel (device mapper). You can see that certain VGs are missing from the lvm output but are present in the device mapper list.

== vgs inside the container ==
sh-4.2# vgs
  VG                                  #PV #LV #SN Attr   VSize   VFree
  docker-vol                            1   1   0 wz--n- <40.00g       0
  rhel_dhcp46-210                       1   2   0 wz--n- <39.00g       0
  vg_ca1fd84ddfbd19783537ea7d61e39f9b   1   4   0 wz--n- 199.87g <196.84g

== lvs inside the container ==
sh-4.2# lvs
  LV                                     VG                                  Attr       LSize   Pool                                Origin Data%  Meta%  Move Log Cpy%Sync Convert
  dockerlv                               docker-vol                          -wi-ao---- <40.00g
  root                                   rhel_dhcp46-210                     -wi-ao---- <35.00g
  swap                                   rhel_dhcp46-210                     -wi-a-----   4.00g
  brick_83a6d5e8b1072640197ae231f52a48af vg_ca1fd84ddfbd19783537ea7d61e39f9b Vwi-aotz--   1.00g tp_5c46b2de672ea8f555d523b679a4c2d6        1.39
  brick_aea6efb6d0301b9b35b66e3b7821606e vg_ca1fd84ddfbd19783537ea7d61e39f9b Vwi-aotz--   2.00g tp_534b36a164af13c58dd8d4f14368e7f0        0.70
  tp_534b36a164af13c58dd8d4f14368e7f0    vg_ca1fd84ddfbd19783537ea7d61e39f9b twi-aotz--   2.00g                                            0.70   0.33
  tp_5c46b2de672ea8f555d523b679a4c2d6    vg_ca1fd84ddfbd19783537ea7d61e39f9b twi-aotz--   1.00g                                            1.39   0.49

== lvs on the node ==
[root@vp-ansible-v310-ga-app-cns-1 ~]# lvs
  LV                                     VG                                  Attr       LSize   Pool                                Origin Data%  Meta%  Move Log Cpy%Sync Convert
  dockerlv                               docker-vol                          -wi-ao---- <40.00g
  root                                   rhel_dhcp46-210                     -wi-ao---- <35.00g
  swap                                   rhel_dhcp46-210                     -wi-a-----   4.00g
  brick_83a6d5e8b1072640197ae231f52a48af vg_ca1fd84ddfbd19783537ea7d61e39f9b Vwi-aotz--   1.00g tp_5c46b2de672ea8f555d523b679a4c2d6        1.39
  brick_aea6efb6d0301b9b35b66e3b7821606e vg_ca1fd84ddfbd19783537ea7d61e39f9b Vwi-aotz--   2.00g tp_534b36a164af13c58dd8d4f14368e7f0        0.70
  tp_534b36a164af13c58dd8d4f14368e7f0    vg_ca1fd84ddfbd19783537ea7d61e39f9b twi-aotz--   2.00g                                            0.70   0.33
  tp_5c46b2de672ea8f555d523b679a4c2d6    vg_ca1fd84ddfbd19783537ea7d61e39f9b twi-aotz--   1.00g                                            1.39   0.49

== dmsetup ls on the node ==
[root@vp-ansible-v310-ga-app-cns-1 ~]# dmsetup ls
vg_1517317e9a1fc96c830c9493c49833f9-tp_ff12ee2104892099503b087db5b8aefd-tpool	(253:10)
vg_1517317e9a1fc96c830c9493c49833f9-tp_ff12ee2104892099503b087db5b8aefd_tdata	(253:9)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-tp_5c46b2de672ea8f555d523b679a4c2d6-tpool	(253:15)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-tp_5c46b2de672ea8f555d523b679a4c2d6_tdata	(253:14)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-brick_83a6d5e8b1072640197ae231f52a48af	(253:17)
vg_1517317e9a1fc96c830c9493c49833f9-tp_ff12ee2104892099503b087db5b8aefd_tmeta	(253:8)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-tp_5c46b2de672ea8f555d523b679a4c2d6_tmeta	(253:13)
vg_68c9ef3f5a2e31d2976565f9f187a6cf-tp_e04cc4729e00e4ec21aa1b028651d599-tpool	(253:5)
vg_68c9ef3f5a2e31d2976565f9f187a6cf-tp_e04cc4729e00e4ec21aa1b028651d599_tdata	(253:4)
vg_68c9ef3f5a2e31d2976565f9f187a6cf-tp_e04cc4729e00e4ec21aa1b028651d599_tmeta	(253:3)
docker--vol-dockerlv	(253:2)
vg_68c9ef3f5a2e31d2976565f9f187a6cf-brick_e04cc4729e00e4ec21aa1b028651d599	(253:7)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-tp_534b36a164af13c58dd8d4f14368e7f0-tpool	(253:20)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-tp_534b36a164af13c58dd8d4f14368e7f0_tdata	(253:19)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-tp_534b36a164af13c58dd8d4f14368e7f0_tmeta	(253:18)
vg_1517317e9a1fc96c830c9493c49833f9-tp_ff12ee2104892099503b087db5b8aefd	(253:11)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-tp_5c46b2de672ea8f555d523b679a4c2d6	(253:16)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-tp_534b36a164af13c58dd8d4f14368e7f0	(253:21)
rhel_dhcp46--210-swap	(253:1)
rhel_dhcp46--210-root	(253:0)
vg_ca1fd84ddfbd19783537ea7d61e39f9b-brick_aea6efb6d0301b9b35b66e3b7821606e	(253:22)
vg_1517317e9a1fc96c830c9493c49833f9-brick_651d564bd98fb500018dadbe35ee60b0	(253:12)

I think this is not directly related to Heketi but rather to the way we're running lvm on the gluster pods. More to come...
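The mismatch described above can be spotted mechanically by comparing the VG names lvm still knows about with the VG prefixes of the device-mapper names. The sketch below hard-codes the three lvm-known VGs and the three VG prefixes from the listings above purely for illustration; on a real node you would feed in `vgs --noheadings -o vg_name` and the output of `dmsetup ls` instead (this uses bash process substitution):

```shell
#!/bin/bash
# VGs that lvm still reports (sample data from the `vgs` output above)
known_vgs='docker-vol
rhel_dhcp46-210
vg_ca1fd84ddfbd19783537ea7d61e39f9b'

# VG prefixes seen in `dmsetup ls` (sample data from the listing above)
dm_vgs='vg_1517317e9a1fc96c830c9493c49833f9
vg_ca1fd84ddfbd19783537ea7d61e39f9b
vg_68c9ef3f5a2e31d2976565f9f187a6cf'

# Lines unique to the dmsetup side = VGs the kernel still holds
# but lvm no longer knows about (i.e. the stale ones)
comm -13 <(printf '%s\n' "$known_vgs" | sort) \
         <(printf '%s\n' "$dm_vgs" | sort)
```

With the sample data this prints `vg_1517317e9a1fc96c830c9493c49833f9` and `vg_68c9ef3f5a2e31d2976565f9f187a6cf`, exactly the two VGs whose devices could not be re-added.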
The problem is that wipefs does not completely remove the LVM metadata from the block devices. When a device is scanned after wipefs, the VolumeGroup and LogicalVolumes may reappear. I have seen this in my test environments on occasion too. This is what I do to completely remove everything:

  # lvremove -y $(cd /dev/mapper ; ls vg_* | sed s,-,/,)
  # pvremove --force --force -y /dev/vdb /dev/vdc
  # wipefs --force --all /dev/vdb /dev/vdc
  # sync
  # reboot

It might be overdoing it a bit, but it works for me :)
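For clarity on the `ls vg_* | sed s,-,/,` part of the lvremove line above: device-mapper names join the VG and LV with a single '-', while lvremove wants a `VG/LV` path, so replacing the first '-' does the conversion. A quick illustration using one of the dm names from the earlier dmsetup listing (note this simple sed would mangle VG or LV names that themselves contain a dash, which device mapper encodes as a doubled '--'):

```shell
# Convert a /dev/mapper name to the VG/LV form lvremove expects
# by replacing only the first '-' with '/'
name='vg_68c9ef3f5a2e31d2976565f9f187a6cf-brick_e04cc4729e00e4ec21aa1b028651d599'
echo "$name" | sed s,-,/,
# -> vg_68c9ef3f5a2e31d2976565f9f187a6cf/brick_e04cc4729e00e4ec21aa1b028651d599
```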
I can confirm that this is a bug in heketi: it did not successfully delete the VG but removed the device from the DB anyway. It should not have allowed this.
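The missing guard can be sketched in shell pseudologic. Everything here is illustrative, not heketi's actual code: `vg_exists` is a stub that pretends the VG is still on the node, whereas a real check would run `vgs <vgname>` there before dropping the device record.

```shell
#!/bin/sh
# Illustrative sketch: refuse to drop a device from the DB while its VG
# still exists on the node. vg_exists is stubbed for demonstration.
vg_exists() {
  # stub: pretend exactly this VG is still present on the node
  [ "$1" = "vg_68c9ef3f5a2e31d2976565f9f187a6cf" ]
}

delete_device() {
  vg="$1"
  if vg_exists "$vg"; then
    echo "refusing to remove device from DB: $vg still exists on the node"
    return 1
  fi
  echo "device removed from DB"
}

delete_device vg_68c9ef3f5a2e31d2976565f9f187a6cf || true  # refused
delete_device vg_already_gone                              # succeeds
```

With such a check, step 6 of the reproducer would have failed loudly instead of leaving the kernel holding a VG that heketi had forgotten about.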
Updated the Doc Text field; kindly verify.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2986