Bug 1633318 - health check fails on restart from crash
Summary: health check fails on restart from crash
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: GlusterFS
Classification: Community
Component: posix
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-26 16:22 UTC by Joe Julian
Modified: 2019-06-17 14:49 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-17 14:49:11 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Joe Julian 2018-09-26 16:22:02 UTC
Description of problem:
If a brick process exits without cleaning up .glusterfs/health_check, the next start causes the health check to fail - apparently because the health_check file exists.

Version-Release number of selected component (if applicable):
4.1.2

How reproducible:
always

Steps to Reproduce:
1. pkill -9 -f glusterfs
2. systemctl start glusterd

Actual results:
[2018-09-26 16:04:14.154211] W [MSGID: 113117] [posix-metadata.c:671:posix_set_parent_ctime] 0-vminstances-posix: posix parent set mdata failed on file [File exists]
[2018-09-26 16:04:14.262538] W [MSGID: 113117] [posix-metadata.c:671:posix_set_parent_ctime] 0-vminstances-posix: posix parent set mdata failed on file [Invalid argument]
[2018-09-26 16:04:20.169870] W [MSGID: 113075] [posix-helpers.c:1895:posix_fs_health_check] 0-vminstances-posix: aio_read_cmp_buf() on /data/gluster/instances/brick/.glusterfs/health_check returned
[2018-09-26 16:04:20.169993] M [MSGID: 113075] [posix-helpers.c:1962:posix_health_check_thread_proc] 0-vminstances-posix: health-check failed, going down
[2018-09-26 16:04:20.186505] M [MSGID: 113075] [posix-helpers.c:1981:posix_health_check_thread_proc] 0-vminstances-posix: still alive! -> SIGTERM


Expected results:
health check succeeds

Additional info:

Comment 1 Mohit Agrawal 2019-06-17 04:18:54 UTC
Hi,

As per the health-check code, I don't think the existence of the health-check file (.glusterfs/health_check) can be the reason for the brick failure, but I will try to reproduce it.

The health-check thread always opens the health_check file with (O_CREAT|O_WRONLY|O_TRUNC, 0644), so even if the file is already present, the open truncates its contents and the health check always writes the latest timestamp into it. The logs show the error occurring while comparing the timestamp against the health_check file, which means the timestamp written to the file did not match what was read back.
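The write-then-read-back cycle described above can be sketched roughly as follows. This is a minimal sketch using plain synchronous I/O, not the actual GlusterFS implementation: the real code in posix-helpers.c uses POSIX AIO (the log mentions aio_read_cmp_buf()), and the function and buffer handling here are hypothetical. The point is that O_TRUNC discards any stale contents, so a leftover health_check file from a crash should not by itself cause a mismatch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical single pass of the health check: write the current
 * timestamp into the file, read it back, and compare. Returns 0 on
 * success, -1 on any failure or mismatch. */
static int health_check_once(const char *path)
{
    char wbuf[64], rbuf[64];
    time_t now = time(NULL);
    snprintf(wbuf, sizeof(wbuf), "%ld\n", (long)now);

    /* O_TRUNC discards any stale contents left by a crashed brick,
     * so a pre-existing file starts empty before the write. */
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, wbuf, strlen(wbuf)) != (ssize_t)strlen(wbuf)) {
        close(fd);
        return -1;
    }

    /* Read back from offset 0 and compare with what was written;
     * a mismatch here is what makes the health check fail. */
    memset(rbuf, 0, sizeof(rbuf));
    if (pread(fd, rbuf, strlen(wbuf), 0) != (ssize_t)strlen(wbuf)) {
        close(fd);
        return -1;
    }
    close(fd);

    return strcmp(wbuf, rbuf) == 0 ? 0 : -1;
}
```

Under this model, a second run against the same file truncates the stale timestamp and succeeds, which is why the mere existence of the file after a crash should be harmless.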

Are you sure the brick was actually stopped after you sent the kill signal? If more than one brick instance is running, this type of scenario can arise.

1) Please check the ps output to confirm the brick was stopped completely.
2) If the brick was stopped completely, kindly share the volume configuration and I will try to reproduce the issue.


Regards,
Mohit Agrawal

Comment 2 Joe Julian 2019-06-17 14:49:11 UTC
I'll just close this. I filed this 10 months ago and have turned off health checking and upgraded several times since then. I am quite sure that no more than one brick instance was running at the time.

