Bug 1435613
Summary: | heketi remove device fails when the source disk being removed is down | |
---|---|---|---
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | krishnaram Karthick <kramdoss>
Component: | heketi | Assignee: | Raghavendra Talur <rtalur>
Status: | CLOSED CURRENTRELEASE | QA Contact: | krishnaram Karthick <kramdoss>
Severity: | urgent | Docs Contact: |
Priority: | unspecified | |
Version: | cns-3.5 | CC: | hchiramm, jmulligan, madam, rcyriac, rhs-bugs, rtalur, storage-qa-internal
Target Milestone: | --- | |
Target Release: | CNS 3.5 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2019-03-12 19:59:23 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | 1436197 | |
Bug Blocks: | 1415762, 1641915 | |
Description
krishnaram Karthick
2017-03-24 11:35:03 UTC
I tried this on my setup. When the disk is removed from the VM, we get these logs from glusterfsd (the brick process):

```
Broadcast message from systemd-journald@node1 (Mon 2017-03-27 12:14:29 UTC):

var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:29.751829] M [MSGID: 113075] [posix-helpers.c:1841:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: health-check failed, going down

Message from syslogd@localhost at Mar 27 12:14:29 ...
var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:29.751829] M [MSGID: 113075] [posix-helpers.c:1841:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: health-check failed, going down
```

When the kill signal is sent to the same brick process as part of replace-brick, we get:

```
Broadcast message from systemd-journald@node1 (Mon 2017-03-27 12:14:59 UTC):

var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:59.752367] M [MSGID: 113075] [posix-helpers.c:1847:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: still alive! -> SIGTERM

Message from syslogd@localhost at Mar 27 12:14:59 ...
var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:59.752367] M [MSGID: 113075] [posix-helpers.c:1847:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: still alive! -> SIGTERM

Shared connection to 192.168.21.14 closed.
```

It turns out glusterd has crashed on the system. I have filed a bug against glusterd and made this bug depend on it.

A better way to test this would be to use SystemTap to fail all writes to the disk instead of removing it from the system (a rough alternative sketch follows this comment).

Patch upstream: https://github.com/heketi/heketi/pull/735

Heketi remove device now hangs when run on a device which is inaccessible.
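The SystemTap idea above is one option; a possible alternative (my sketch, not something from this report) is to wrap the disk in a device-mapper flakey table so the block device stays present while all I/O to it fails. The device name, the mapping name, and the step of pointing heketi at the mapper device are assumptions for illustration:

```bash
# Sketch only: simulate a failing disk without pulling it from the system,
# using dm-flakey instead of SystemTap. /dev/sdd and the mapping name are
# assumptions; adapt to the actual test setup.

DISK=/dev/sdd
SECTORS=$(blockdev --getsz "$DISK")   # disk size in 512-byte sectors

# Start with a pass-through (linear) mapping and add /dev/mapper/flaky_sdd
# to heketi instead of the raw disk, then provision volumes as usual.
dmsetup create flaky_sdd --table "0 $SECTORS linear $DISK 0"

# When ready to inject the failure, swap in a flakey table:
# up_interval=0, down_interval=3600 should make the mapped device error
# all I/O for the next hour while /dev/sdd itself remains visible.
dmsetup suspend flaky_sdd
dmsetup reload  flaky_sdd --table "0 $SECTORS flakey $DISK 0 0 3600"
dmsetup resume  flaky_sdd

# Now run "heketi-cli device disable/remove" against this device and
# observe how heketi and the brick process react.
```

Unlike offlining the disk through sysfs, this keeps the device node in place, so LVM and the brick process see I/O errors rather than a missing device.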
'/dev/sdd' was made inaccessible by running 'echo offline > /sys/block/sdd/device/state' on node 10.70.47.176.

```
# heketi-cli node info a4a3353715414fa78778865fd873f554
Node Id: a4a3353715414fa78778865fd873f554
State: online
Cluster Id: f19f0be52fa5147aad0071491b0f8da7
Zone: 1
Management Hostname: dhcp47-176.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.176
Devices:
Id:29f67ead9c4daf3dea14e8cf2010ab9a   Name:/dev/sde   State:offline   Size (GiB):99   Used (GiB):0    Free (GiB):99
Id:724d4c878d4f406cfeb4bca3bcc15bb0   Name:/dev/sdd   State:online    Size (GiB):99   Used (GiB):10   Free (GiB):89

[root@dhcp47-175 ~]# heketi-cli device enable 29f67ead9c4daf3dea14e8cf2010ab9a
Device 29f67ead9c4daf3dea14e8cf2010ab9a is now online

[root@dhcp47-175 ~]# heketi-cli device disable 724d4c878d4f406cfeb4bca3bcc15bb0
Device 724d4c878d4f406cfeb4bca3bcc15bb0 is now offline

[root@dhcp47-175 ~]# heketi-cli device remove 724d4c878d4f406cfeb4bca3bcc15bb0

# heketi-cli node info a4a3353715414fa78778865fd873f554
Node Id: a4a3353715414fa78778865fd873f554
State: online
Cluster Id: f19f0be52fa5147aad0071491b0f8da7
Zone: 1
Management Hostname: dhcp47-176.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.176
Devices:
Id:29f67ead9c4daf3dea14e8cf2010ab9a   Name:/dev/sde   State:online    Size (GiB):99   Used (GiB):6   Free (GiB):93
Id:724d4c878d4f406cfeb4bca3bcc15bb0   Name:/dev/sdd   State:offline   Size (GiB):99   Used (GiB):8   Free (GiB):91

# rpm -qa | grep 'heketi'
heketi-client-4.0.0-6.el7rhgs.x86_64
```

There seems to be some inconsistency with the used space as well. gluster logs and heketi logs will be attached shortly. Device remove on a device whose node is down does work, though.

As seen in the heketi logs, we have:

```
[kubeexec] ERROR 2017/04/09 10:22:21 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [lvremove -f vg_724d4c878d4f406cfeb4bca3bcc15bb0/tp_3c78af6a861180417f8763a1fbbaf8e6] on glusterfs-mm42d: Err[command terminated with exit code 5]: Stdout []: Stderr [ /dev/sdd: open failed: No such device or address
```

I am not sure whether LV commands behave the same way when the device goes corrupt; here the kernel is rejecting the commands on a disabled device (a quick check illustrating this follows this comment). I will try the dd method and update the bug tomorrow.

As finding the correct reproducer for this bug is difficult, and because we are at the edge of this release, I am deferring this bug from this release; we will continue the analysis in the next release cycle.
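For context on the lvremove failure quoted above, here is a minimal sketch of how to confirm that the kernel itself is rejecting I/O on the offlined disk. It reuses the node and device names from this report; the exact error strings may differ:

```bash
# Run on the node where /dev/sdd was offlined (10.70.47.176 in this report).
cat /sys/block/sdd/device/state        # expected to print "offline"

# Direct I/O against the disk should now fail with an error such as
# "No such device or address", matching what lvremove reported.
dd if=/dev/sdd of=/dev/null bs=4096 count=1

# LVM commands that need to open the PV hit the same kernel-level rejection.
pvs /dev/sdd

# To bring the device back once the test is done, reverse the offline step.
echo running > /sys/block/sdd/device/state
```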