Bug 1436197

Summary:	glusterd crashes when disk hosting a brick is removed from system
Product:	[Red Hat Storage] Red Hat Gluster Storage	Reporter:	Raghavendra Talur <rtalur>
Component:	glusterd	Assignee:	Atin Mukherjee <amukherj>
Status:	CLOSED INSUFFICIENT_DATA	QA Contact:	Byreddy <bsrirama>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	rhgs-3.3	CC:	rhs-bugs, rtalur, sasundar, sbairagy, storage-qa-internal, vbellur
Target Milestone:	---	Keywords:	ZStream
Target Release:	---
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2017-07-26 10:49:50 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks:	1435613

Description Raghavendra Talur 2017-03-27 12:29:31 UTC

Description of problem:

When the disk is removed from the VM, we get these logs from glusterfsd (the brick process):

```
Broadcast message from systemd-journald@node1 (Mon 2017-03-27 12:14:29 UTC):

var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:29.751829] M [MSGID: 113075] [posix-helpers.c:1841:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: health-check failed, going down


Message from syslogd@localhost at Mar 27 12:14:29 ...
 var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]:[2017-03-27 12:14:29.751829] M [MSGID: 113075] [posix-helpers.c:1841:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: health-check failed, going down
```

When the kill signal is sent to the same brick process as part of replace-brick, we get:

```
Broadcast message from systemd-journald@node1 (Mon 2017-03-27 12:14:59 UTC):

var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]: [2017-03-27 12:14:59.752367] M [MSGID: 113075] [posix-helpers.c:1847:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: still alive! -> SIGTERM


Message from syslogd@localhost at Mar 27 12:14:59 ...
 var-lib-heketi-mounts-vg_3dc59eba73cb2070a96c0fec5a2b5e82-brick_455ecb9f475cf6bd7dd201457799736c-brick[14090]:[2017-03-27 12:14:59.752367] M [MSGID: 113075] [posix-helpers.c:1847:posix_health_check_thread_proc] 0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: still alive! -> SIGTERM
Shared connection to 192.168.21.14 closed.
```
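For triage, it may help to pull the timestamps and source locations out of the two brick messages above. The following is a minimal sketch in Python that parses the message bodies exactly as they appear in the excerpts (with the brick path prefix omitted) and computes the gap between them. The regular expression and the `parse_line` helper are illustrative assumptions, not part of any gluster tooling. Both messages are logged from `posix_health_check_thread_proc`, roughly 30 seconds apart; whether the SIGTERM comes from the health-check thread itself or from the replace-brick operation is not confirmed in this report.

```python
import re
from datetime import datetime

# Message bodies copied from the two log excerpts above (brick path prefix omitted).
LOG_LINES = [
    "[2017-03-27 12:14:29.751829] M [MSGID: 113075] "
    "[posix-helpers.c:1841:posix_health_check_thread_proc] "
    "0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: health-check failed, going down",
    "[2017-03-27 12:14:59.752367] M [MSGID: 113075] "
    "[posix-helpers.c:1847:posix_health_check_thread_proc] "
    "0-vol_71bda80b7a159f08ad795e4f4f244bd4-posix: still alive! -> SIGTERM",
]

# Hypothetical pattern, written only to match the layout seen in this report.
PATTERN = re.compile(
    r"\[(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6})\] "
    r"(?P<level>\w) \[MSGID: (?P<msgid>\d+)\] "
    r"\[(?P<file>[^:\]]+):(?P<line>\d+):(?P<func>\w+)\] "
    r"(?P<xlator>\S+): (?P<msg>.*)"
)


def parse_line(line):
    """Return the fields of one brick log message as a dict."""
    match = PATTERN.search(line)
    if match is None:
        raise ValueError("unrecognised log line: " + line)
    fields = match.groupdict()
    fields["ts"] = datetime.strptime(fields["ts"], "%Y-%m-%d %H:%M:%S.%f")
    return fields


events = [parse_line(line) for line in LOG_LINES]
gap_seconds = (events[1]["ts"] - events[0]["ts"]).total_seconds()

for event in events:
    print(event["ts"], event["func"], event["line"], event["msg"])
print("seconds between the two messages:", round(gap_seconds, 3))
```

The same helper could be run over the full brick log to check whether any further messages follow the SIGTERM line before glusterd goes down.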

After this, glusterd was found to have crashed on the system, and the system went into emergency mode.


This might not be the best way to test. If a better test shows that gluster is resilient to disk failures, this bug can be closed.

Comment 5 Samikshan Bairagya 2017-07-26 10:49:50 UTC

Closing this bug since the needinfo request hasn't been addressed. Please reopen this bug if the issue persists.