Description of problem:

When the heketi device remove command is run on a device which is not reachable (hardware failure, device physically removed from the server, etc.), the command fails:

```
[root@dhcp46-202 ~]# heketi-cli device remove 301cb7b4373bb3a5efbd32c087015054
Error: Failed to remove device, error: Unable to replace brick 10.70.47.180:/var/lib/heketi/mounts/vg_301cb7b4373bb3a5efbd32c087015054/brick_b5fb46645e458fe92299702bd46fbd91/brick with 10.70.47.78:/var/lib/heketi/mounts/vg_2d34e9b4df49e05d81aa76c4cc9a5904/brick_dab32f0fa4d0ced7f7e3e3b75d6a9955/brick for volume vol_8031ba884c76a70a186974a6a461a65f
```

heketi device remove should work even for devices which are not reachable.

Version-Release number of selected component (if applicable):
heketi-client-4.0.0-3.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Have node{1..3}, device{1..2} in a CNS setup.
2. Create a volume from node 1 device 1.
3. Remove the hardware disk backing node 1 device 1 (I mimicked this by removing the virtual disk from the VM used as the node).
4. Run heketi device remove on node 1 device 1.

Ideally, node 1 device 1 should be replaced by node 1 device 2. Instead, device remove failed.

Additional info:

Test setup information:

```
# oc get pods -o wide
NAME              READY     STATUS    RESTARTS   AGE   IP             NODE
glusterfs-4k47s   1/1       Running   5          2d    10.70.47.180   dhcp47-180.lab.eng.blr.redhat.com
glusterfs-60dvm   1/1       Running   0          2d    10.70.47.65    dhcp47-65.lab.eng.blr.redhat.com
glusterfs-hcp7j   1/1       Running   0          2d    10.70.46.165   dhcp46-165.lab.eng.blr.redhat.com
glusterfs-jg4kw   1/1       Running   0          2d    10.70.47.21    dhcp47-21.lab.eng.blr.redhat.com
glusterfs-nxnk1   1/1       Running   0          2d    10.70.47.78    dhcp47-78.lab.eng.blr.redhat.com
glusterfs-vx1s0   1/1       Running   0          2d    10.70.47.51    dhcp47-51.lab.eng.blr.redhat.com
heketi-1-93lgh    1/1       Running   1          1d    10.130.0.11    dhcp47-78.lab.eng.blr.redhat.com
```

```
# heketi-cli node list
Id:0caf00da1c9dd2cfa275589eee5a3e2c  Cluster:ee0be395eee24de0af625fb70b598342
Id:1bf58eba8401828a90223c45f753b607  Cluster:ee0be395eee24de0af625fb70b598342
Id:21438725a596e7a26203244a73c93e41  Cluster:ee0be395eee24de0af625fb70b598342
Id:76c04cd33916422802b3d14e6088ef75  Cluster:ee0be395eee24de0af625fb70b598342
Id:b5bb6e7ca6a8b74e2ccf776b79d121a8  Cluster:ee0be395eee24de0af625fb70b598342
```

```
# heketi-cli node info 0caf00da1c9dd2cfa275589eee5a3e2c
Node Id: 0caf00da1c9dd2cfa275589eee5a3e2c
State: online
Cluster Id: ee0be395eee24de0af625fb70b598342
Zone: 1
Management Hostname: dhcp47-180.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.180
Devices:
Id:301cb7b4373bb3a5efbd32c087015054  Name:/dev/sdd  State:offline  Size (GiB):199  Used (GiB):9  Free (GiB):190
```

```
# heketi-cli node info 21438725a596e7a26203244a73c93e41
Node Id: 21438725a596e7a26203244a73c93e41
State: online
Cluster Id: ee0be395eee24de0af625fb70b598342
Zone: 3
Management Hostname: dhcp47-78.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.78
Devices:
Id:2d34e9b4df49e05d81aa76c4cc9a5904  Name:/dev/sdd  State:online  Size (GiB):299  Used (GiB):100  Free (GiB):199
```

```
[root@dhcp46-202 ~]# heketi-cli device remove 301cb7b4373bb3a5efbd32c087015054
Error: Failed to remove device, error: Unable to replace brick 10.70.47.180:/var/lib/heketi/mounts/vg_301cb7b4373bb3a5efbd32c087015054/brick_b5fb46645e458fe92299702bd46fbd91/brick with 10.70.47.78:/var/lib/heketi/mounts/vg_2d34e9b4df49e05d81aa76c4cc9a5904/brick_dab32f0fa4d0ced7f7e3e3b75d6a9955/brick for volume vol_8031ba884c76a70a186974a6a461a65f
```

```
# oc rsh glusterfs-4k47s
sh-4.2# gluster vol status vol_8031ba884c76a70a186974a6a461a65f
Status of volume: vol_8031ba884c76a70a186974a6a461a65f
Gluster process                                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.47.51:/var/lib/heketi/mounts/vg_bb05ed5f5dde8bb6446468a7dad56552/brick_2e41e432e0ccb6a30d25c57ee46132a2/brick    49164  0    Y  36351
Brick 10.70.46.165:/var/lib/heketi/mounts/vg_8b3ab766c0146956f00d5d928c60fa50/brick_e99c95bace2b213f031922e889db9524/brick   49165  0    Y  36943
Brick 10.70.47.180:/var/lib/heketi/mounts/vg_301cb7b4373bb3a5efbd32c087015054/brick_b5fb46645e458fe92299702bd46fbd91/brick   N/A    N/A  N  N/A
Self-heal Daemon on localhost                            N/A       N/A        Y       405
Self-heal Daemon on 10.70.47.78                          N/A       N/A        Y       115904
Self-heal Daemon on 10.70.47.21                          N/A       N/A        Y       57937
Self-heal Daemon on 10.70.47.51                          N/A       N/A        Y       41510
Self-heal Daemon on dhcp46-165.lab.eng.blr.redhat.com    N/A       N/A        Y       42098

Task Status of Volume vol_8031ba884c76a70a186974a6a461a65f
------------------------------------------------------------------------------
There are no active volume tasks
```

- The brick process on 10.70.47.180 is down because the underlying disk was removed from the server.
- device remove was run on device 301cb7b4373bb3a5efbd32c087015054.
- The only volume carved out of the above device is vol_8031ba884c76a70a186974a6a461a65f.
- sosreports from both these nodes and the heketi logs will be attached.
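For anyone re-running step 3 of the reproducer on a libvirt-based setup, a minimal sketch of the disk pull is below. The domain name node1, the disk target sdd, and DEVICE_ID are placeholders, not values from this setup:

```
# Hot-unplug the virtual disk backing node 1 device 1 from the running VM
# (placeholder domain and target names).
virsh detach-disk node1 sdd --live

# Disable the device, then attempt the removal that fails in this report.
heketi-cli device disable DEVICE_ID
heketi-cli device remove DEVICE_ID
```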
I tried this on my setup. When the disk is removed from the VM, we get these logs from glusterfsd (the brick process):

```
Broadcast message from systemd-journald@node1 (Mon 2017-03-27 12:14:29 UTC):

var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:29.751829] M [MSGID: 113075] [posix-helpers.c:1841:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: health-check failed, going down

Message from syslogd@localhost at Mar 27 12:14:29 ...
var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:29.751829] M [MSGID: 113075] [posix-helpers.c:1841:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: health-check failed, going down
```

When the kill signal is sent to the same brick process as part of replace-brick, we get:

```
Broadcast message from systemd-journald@node1 (Mon 2017-03-27 12:14:59 UTC):

var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:59.752367] M [MSGID: 113075] [posix-helpers.c:1847:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: still alive! -> SIGTERM

Message from syslogd@localhost at Mar 27 12:14:59 ...
var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:59.752367] M [MSGID: 113075] [posix-helpers.c:1847:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: still alive! -> SIGTERM

Shared connection to 192.168.21.14 closed.
```

It turned out that glusterd had crashed on the system. I have filed a bug against glusterd and made this bug depend on it. A better way to test this would be to use systemtap to fail all writes to the disk, instead of removing the disk from the system.
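One concrete way to implement the "fail all writes" idea without systemtap is the device-mapper flakey target. The sketch below is only illustrative: it assumes the brick's LVM stack was built on top of the mapper device rather than the raw disk, and /dev/sdd is a placeholder:

```
# Map the whole disk through dm-flakey: up for 0 s, erroring I/O for 3600 s,
# so effectively all I/O to the mapping fails. Table format:
#   <start> <length> flakey <device> <offset> <up_interval> <down_interval>
SIZE=$(blockdev --getsz /dev/sdd)
dmsetup create sdd-flakey --table "0 $SIZE flakey /dev/sdd 0 0 3600"

# The heketi device (LVM PV) must have been created on
# /dev/mapper/sdd-flakey for the injected failures to hit the brick;
# then exercise heketi-cli device remove against it.

# Tear the mapping down when finished.
dmsetup remove sdd-flakey
```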
Patch upstream: https://github.com/heketi/heketi/pull/735
heketi device remove now hangs when run on a device which is inaccessible. /dev/sdd was made inaccessible by running 'echo offline > /sys/block/sdd/device/state' on node 10.70.47.176.

```
# heketi-cli node info a4a3353715414fa78778865fd873f554
Node Id: a4a3353715414fa78778865fd873f554
State: online
Cluster Id: f19f0be52fa5147aad0071491b0f8da7
Zone: 1
Management Hostname: dhcp47-176.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.176
Devices:
Id:29f67ead9c4daf3dea14e8cf2010ab9a  Name:/dev/sde  State:offline  Size (GiB):99  Used (GiB):0   Free (GiB):99
Id:724d4c878d4f406cfeb4bca3bcc15bb0  Name:/dev/sdd  State:online   Size (GiB):99  Used (GiB):10  Free (GiB):89
```

```
[root@dhcp47-175 ~]# heketi-cli device enable 29f67ead9c4daf3dea14e8cf2010ab9a
Device 29f67ead9c4daf3dea14e8cf2010ab9a is now online
[root@dhcp47-175 ~]# heketi-cli device disable 724d4c878d4f406cfeb4bca3bcc15bb0
Device 724d4c878d4f406cfeb4bca3bcc15bb0 is now offline
[root@dhcp47-175 ~]# heketi-cli device remove 724d4c878d4f406cfeb4bca3bcc15bb0
```

The remove command never returns. Node state afterwards:

```
# heketi-cli node info a4a3353715414fa78778865fd873f554
Node Id: a4a3353715414fa78778865fd873f554
State: online
Cluster Id: f19f0be52fa5147aad0071491b0f8da7
Zone: 1
Management Hostname: dhcp47-176.lab.eng.blr.redhat.com
Storage Hostname: 10.70.47.176
Devices:
Id:29f67ead9c4daf3dea14e8cf2010ab9a  Name:/dev/sde  State:online   Size (GiB):99  Used (GiB):6  Free (GiB):93
Id:724d4c878d4f406cfeb4bca3bcc15bb0  Name:/dev/sdd  State:offline  Size (GiB):99  Used (GiB):8  Free (GiB):91
```

```
# rpm -qa | grep heketi
heketi-client-4.0.0-6.el7rhgs.x86_64
```

There also seems to be some inconsistency in the used space: /dev/sdd went from 10 GiB used to 8 GiB, while /dev/sde went from 0 to 6 GiB. gluster logs and heketi logs will be attached shortly.
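The sysfs trick used above is reversible, which makes the test repeatable. A minimal sketch, assuming a SCSI disk whose state file accepts the values "offline" and "running":

```
# Take the disk offline: the kernel rejects all further I/O to it.
echo offline > /sys/block/sdd/device/state

# ... run heketi-cli device remove and observe the hang ...

# Bring the disk back online for the next test iteration.
echo running > /sys/block/sdd/device/state
```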
device remove does work, however, when run on a device whose node is down; the hang above occurs when the node is up but the device is inaccessible.
As seen in the heketi logs, we have:

```
[kubeexec] ERROR 2017/04/09 10:22:21 /src/github.com/heketi/heketi/executors/kubeexec/kubeexec.go:247: Failed to run command [lvremove -f vg_724d4c878d4f406cfeb4bca3bcc15bb0/tp_3c78af6a861180417f8763a1fbbaf8e6] on glusterfs-mm42d: Err[command terminated with exit code 5]: Stdout []: Stderr [ /dev/sdd: open failed: No such device or address
```

I am not sure the LVM commands behave the same way when the device goes corrupt; here the kernel is rejecting the commands on a disabled device. I will try the dd method and update the bug tomorrow.
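The "dd method" is not spelled out in this comment; presumably it means overwriting the start of the disk so that LVM sees corrupt metadata instead of a device that rejects all I/O. A rough sketch under that assumption, with /dev/sdd as a placeholder (this destroys the data on the disk):

```
# DANGER: destructive. Wipe the first 10 MiB, where the LVM PV label and
# VG metadata live, to simulate on-disk corruption.
dd if=/dev/zero of=/dev/sdd bs=1M count=10 oflag=direct

# LVM should now complain about missing/corrupt PV metadata rather than
# "open failed: No such device or address".
pvs /dev/sdd
```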
As finding the correct reproducer for this bug is difficult, and because we are at the very end of this release cycle, I am deferring this bug from this release; we will continue the analysis in the next release cycle.