Description of problem:

When the heketi device remove command is run on a device which is not reachable (hardware failure, device physically removed from the server, etc.), the command fails:

```
[root@dhcp46-202 ~]# heketi-cli device remove 301cb7b4373bb3a5efbd32c087015054
Error: Failed to remove device, error: Unable to replace brick 10.70.47.180:/var/lib/heketi/mounts/vg_301cb7b4373bb3a5efbd32c087015054/brick_b5fb46645e458fe92299702bd46fbd91/brick with 10.70.47.78:/var/lib/heketi/mounts/vg_2d34e9b4df49e05d81aa76c4cc9a5904/brick_dab32f0fa4d0ced7f7e3e3b75d6a9955/brick for volume vol_8031ba884c76a70a186974a6a461a65f
```

heketi device remove should work even for devices which are not reachable.

Version-Release number of selected component (if applicable):
heketi-client-4.0.0-3.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Have node{1..3}, device{1..2} in a CNS setup.
2. Create a volume from node 1 device 1.
3. Remove the hardware disk backing node 1 device 1 (I mimicked this by removing the virtual disk from the VM used as the node).
4. Run heketi device remove on node 1 device 1.

Ideally, node 1 device 1 should be replaced by node 1 device 2. Instead, device remove failed.

Additional info:

Test setup information:

```
# oc get pods -o wide
NAME              READY     STATUS    RESTARTS   AGE   IP             NODE
glusterfs-4k47s   1/1       Running   5          2d    10.70.47.180   dhcp47-180.lab.eng.blr.redhat.com
glusterfs-60dvm   1/1       Running   0          2d    10.70.47.65    dhcp47-65.lab.eng.blr.redhat.com
glusterfs-hcp7j   1/1       Running   0          2d    10.70.46.165   dhcp46-165.lab.eng.blr.redhat.com
glusterfs-jg4kw   1/1       Running   0          2d    10.70.47.21    dhcp47-21.lab.eng.blr.redhat.com
glusterfs-nxnk1   1/1       Running   0          2d    10.70.47.78    dhcp47-78.lab.eng.blr.redhat.com
glusterfs-vx1s0   1/1       Running   0          2d    10.70.47.51    dhcp47-51.lab.eng.blr.redhat.com
heketi-1-93lgh    1/1       Running   1          1d    10.130.0.11    dhcp47-78.lab.eng.blr.redhat.com
```

```
# heketi-cli node list
Id:0caf00da1c9dd2cfa275589eee5a3e2c  Cluster:ee0be395eee24de0af625fb70b598342
Id:1bf58eba8401828a90223c45f753b607  Cluster:ee0be395eee24de0af625fb70b598342
Id:21438725a596e7a26203244a73c93e41  Cluster:ee0be395eee24de0af625fb70b598342
Id:76c04cd33916422802b3d14e6088ef75  Cluster:ee0be395eee24de0af625fb70b598342
Id:b5bb6e7ca6a8b74e2ccf776b79d121a8  Cluster:ee0be395eee24de0af625fb70b598342
```

```
# heketi-cli node info 0caf00da1c9dd2cfa275589eee5a3e2c
Node Id: 0caf00da1c9dd2cfa275589eee5a3e2c
State: online
Cluster Id: ee0be395eee24de0af625fb70b598342
Zone: 1
Management Hostname: dhcp47-180.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.180
Devices:
Id:301cb7b4373bb3a5efbd32c087015054  Name:/dev/sdd  State:offline  Size (GiB):199  Used (GiB):9  Free (GiB):190
```

```
# heketi-cli node info 21438725a596e7a26203244a73c93e41
Node Id: 21438725a596e7a26203244a73c93e41
State: online
Cluster Id: ee0be395eee24de0af625fb70b598342
Zone: 3
Management Hostname: dhcp47-78.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.78
Devices:
Id:2d34e9b4df49e05d81aa76c4cc9a5904  Name:/dev/sdd  State:online  Size (GiB):299  Used (GiB):100  Free (GiB):199
```

```
[root@dhcp46-202 ~]# heketi-cli device remove 301cb7b4373bb3a5efbd32c087015054
Error: Failed to remove device, error: Unable to replace brick 10.70.47.180:/var/lib/heketi/mounts/vg_301cb7b4373bb3a5efbd32c087015054/brick_b5fb46645e458fe92299702bd46fbd91/brick with 10.70.47.78:/var/lib/heketi/mounts/vg_2d34e9b4df49e05d81aa76c4cc9a5904/brick_dab32f0fa4d0ced7f7e3e3b75d6a9955/brick for volume vol_8031ba884c76a70a186974a6a461a65f
```

```
# oc rsh glusterfs-4k47s
sh-4.2# gluster vol status vol_8031ba884c76a70a186974a6a461a65f
Status of volume: vol_8031ba884c76a70a186974a6a461a65f
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.51:/var/lib/heketi/mounts/vg_bb05ed5f5dde8bb6446468a7dad56552/brick_2e41e432e0ccb6a30d25c57ee46132a2/brick    49164  0    Y  36351
Brick 10.70.46.165:/var/lib/heketi/mounts/vg_8b3ab766c0146956f00d5d928c60fa50/brick_e99c95bace2b213f031922e889db9524/brick   49165  0    Y  36943
Brick 10.70.47.180:/var/lib/heketi/mounts/vg_301cb7b4373bb3a5efbd32c087015054/brick_b5fb46645e458fe92299702bd46fbd91/brick   N/A    N/A  N  N/A
Self-heal Daemon on localhost                            N/A       N/A        Y       405
Self-heal Daemon on 10.70.47.78                          N/A       N/A        Y       115904
Self-heal Daemon on 10.70.47.21                          N/A       N/A        Y       57937
Self-heal Daemon on 10.70.47.51                          N/A       N/A        Y       41510
Self-heal Daemon on dhcp46-165.lab.eng.blr.redhat.com    N/A       N/A        Y       42098

Task Status of Volume vol_8031ba884c76a70a186974a6a461a65f
------------------------------------------------------------------------------
There are no active volume tasks
```

- The brick process on 10.70.47.180 is down because the underlying disk was removed from the server.
- device remove was run on device 301cb7b4373bb3a5efbd32c087015054.
- The only volume carved out of the above device is vol_8031ba884c76a70a186974a6a461a65f.
- sosreports from both these nodes and the heketi logs will be attached.
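For anyone re-running step 3 of the reproducer on a libvirt-based setup, a minimal sketch of the disk pull is below. The domain name node1, the disk target sdd, and DEVICE_ID are placeholders, not values from this setup:

```
# Hot-unplug the virtual disk backing node 1 device 1 from the running VM
# (placeholder domain and target names).
virsh detach-disk node1 sdd --live

# Disable the device, then attempt the removal that fails in this report.
heketi-cli device disable DEVICE_ID
heketi-cli device remove DEVICE_ID
```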
I tried this on my setup. When the disk is removed from the VM, we get these logs from glusterfsd (the brick process):

```
Broadcast message from systemd-journald@node1 (Mon 2017-03-27 12:14:29 UTC):

var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:29.751829] M [MSGID: 113075] [posix-helpers.c:1841:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: health-check failed, going down

Message from syslogd@localhost at Mar 27 12:14:29 ...
var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:29.751829] M [MSGID: 113075] [posix-helpers.c:1841:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: health-check failed, going down
```

When the kill signal is sent to the same brick process as part of replace-brick, we get:

```
Broadcast message from systemd-journald@node1 (Mon 2017-03-27 12:14:59 UTC):

var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:59.752367] M [MSGID: 113075] [posix-helpers.c:1847:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: still alive! -> SIGTERM

Message from syslogd@localhost at Mar 27 12:14:59 ...
var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:59.752367] M [MSGID: 113075] [posix-helpers.c:1847:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: still alive! -> SIGTERM

Shared connection to 192.168.21.14 closed.
```

It turned out that glusterd had crashed on the system. I have filed a bug against glusterd and made this bug depend on it. A better way to test this would be to use systemtap to fail all writes to the disk, instead of removing the disk from the system.
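One concrete way to implement the "fail all writes" idea without systemtap is the device-mapper flakey target. The sketch below is only illustrative: it assumes the brick's LVM stack was built on top of the mapper device rather than the raw disk, and /dev/sdd is a placeholder:

```
# Map the whole disk through dm-flakey: up for 0 s, erroring I/O for 3600 s,
# so effectively all I/O to the mapping fails. Table format:
#   <start> <length> flakey <device> <offset> <up_interval> <down_interval>
SIZE=$(blockdev --getsz /dev/sdd)
dmsetup create sdd-flakey --table "0 $SIZE flakey /dev/sdd 0 0 3600"

# The heketi device (LVM PV) must have been created on
# /dev/mapper/sdd-flakey for the injected failures to hit the brick;
# then exercise heketi-cli device remove against it.

# Tear the mapping down when finished.
dmsetup remove sdd-flakey
```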
Patch upstream: https://github.com/heketi/heketi/pull/735
heketi device remove now hangs when run on a device which is inaccessible. /dev/sdd was made inaccessible by running 'echo offline > /sys/block/sdd/device/state' on node 10.70.47.176.

```
# heketi-cli node info a4a3353715414fa78778865fd873f554
Node Id: a4a3353715414fa78778865fd873f554
State: online
Cluster Id: f19f0be52fa5147aad0071491b0f8da7
Zone: 1
Management Hostname: dhcp47-176.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.176
Devices:
Id:29f67ead9c4daf3dea14e8cf2010ab9a  Name:/dev/sde  State:offline  Size (GiB):99  Used (GiB):0   Free (GiB):99
Id:724d4c878d4f406cfeb4bca3bcc15bb0  Name:/dev/sdd  State:online   Size (GiB):99  Used (GiB):10  Free (GiB):89
```

```
[root@dhcp47-175 ~]# heketi-cli device enable 29f67ead9c4daf3dea14e8cf2010ab9a
Device 29f67ead9c4daf3dea14e8cf2010ab9a is now online
[root@dhcp47-175 ~]# heketi-cli device disable 724d4c878d4f406cfeb4bca3bcc15bb0
Device 724d4c878d4f406cfeb4bca3bcc15bb0 is now offline
[root@dhcp47-175 ~]# heketi-cli device remove 724d4c878d4f406cfeb4bca3bcc15bb0
```

The remove command never returns. Node state afterwards:

```
# heketi-cli node info a4a3353715414fa78778865fd873f554
Node Id: a4a3353715414fa78778865fd873f554
State: online
Cluster Id: f19f0be52fa5147aad0071491b0f8da7
Zone: 1
Management Hostname: dhcp47-176.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.176
Devices:
Id:29f67ead9c4daf3dea14e8cf2010ab9a  Name:/dev/sde  State:online   Size (GiB):99  Used (GiB):6  Free (GiB):93
Id:724d4c878d4f406cfeb4bca3bcc15bb0  Name:/dev/sdd  State:offline  Size (GiB):99  Used (GiB):8  Free (GiB):91
```

```
# rpm -qa | grep heketi
heketi-client-4.0.0-6.el7rhgs.x86_64
```

There also seems to be some inconsistency in the used space: /dev/sdd went from 10 GiB used to 8 GiB, while /dev/sde went from 0 to 6 GiB. gluster logs and heketi logs will be attached shortly.
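The sysfs trick used above is reversible, which makes the test repeatable. A minimal sketch, assuming a SCSI disk whose state file accepts the values "offline" and "running":

```
# Take the disk offline: the kernel rejects all further I/O to it.
echo offline > /sys/block/sdd/device/state

# ... run heketi-cli device remove and observe the hang ...

# Bring the disk back online for the next test iteration.
echo running > /sys/block/sdd/device/state
```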
device remove does work, however, when run on a device whose node is down; the hang above occurs when the node is up but the device is inaccessible.
As seen in the heketi logs, we have:

```
[kubeexec] ERROR 2017/04/09 10:22:21 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [lvremove -f vg_724d4c878d4f406cfeb4bca3bcc15bb0/tp_3c78af6a861180417f8763a1fbbaf8e6] on glusterfs-mm42d: Err[command terminated with exit code 5]: Stdout []: Stderr [ /dev/sdd: open failed: No such device or address
```

I am not sure the LVM commands behave the same way when the device goes corrupt; here the kernel is rejecting the commands on a disabled device. I will try the dd method and update the bug tomorrow.
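The "dd method" is not spelled out in this comment; presumably it means overwriting the start of the disk so that LVM sees corrupt metadata instead of a device that rejects all I/O. A rough sketch under that assumption, with /dev/sdd as a placeholder (this destroys the data on the disk):

```
# DANGER: destructive. Wipe the first 10 MiB, where the LVM PV label and
# VG metadata live, to simulate on-disk corruption.
dd if=/dev/zero of=/dev/sdd bs=1M count=10 oflag=direct

# LVM should now complain about missing/corrupt PV metadata rather than
# "open failed: No such device or address".
pvs /dev/sdd
```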
As finding the correct reproducer for this bug is difficult, and because we are at the very end of this release cycle, I am deferring this bug from this release; we will continue the analysis in the next release cycle.