Bug 1633318 - health check fails on restart from crash
Summary: health check fails on restart from crash
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: GlusterFS
Classification: Community
Component: posix
Version: 4.1
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Assignee: Mohit Agrawal
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-09-26 16:22 UTC by Joe Julian
Modified: 2019-06-17 14:49 UTC
CC: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-17 14:49:11 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Joe Julian 2018-09-26 16:22:02 UTC
Description of problem:
If a brick process exits without cleaning up .glusterfs/health_check, the next start causes the health check to fail - apparently because the health_check file exists.

Version-Release number of selected component (if applicable):
4.1.2

How reproducible:
always

Steps to Reproduce:
1. pkill -9 -f glusterfs
2. systemctl start glusterd

Actual results:
[2018-09-26 16:04:14.154211] W [MSGID: 113117] [posix-metadata.c:671:posix_set_parent_ctime] 0-vminstances-posix: posix parent set mdata failed on file [File exists]
[2018-09-26 16:04:14.262538] W [MSGID: 113117] [posix-metadata.c:671:posix_set_parent_ctime] 0-vminstances-posix: posix parent set mdata failed on file [Invalid argument]
[2018-09-26 16:04:20.169870] W [MSGID: 113075] [posix-helpers.c:1895:posix_fs_health_check] 0-vminstances-posix: aio_read_cmp_buf() on /data/gluster/instances/brick/.glusterfs/health_check returned
[2018-09-26 16:04:20.169993] M [MSGID: 113075] [posix-helpers.c:1962:posix_health_check_thread_proc] 0-vminstances-posix: health-check failed, going down
[2018-09-26 16:04:20.186505] M [MSGID: 113075] [posix-helpers.c:1981:posix_health_check_thread_proc] 0-vminstances-posix: still alive! -> SIGTERM


Expected results:
health check succeeds

Additional info:

Comment 1 Mohit Agrawal 2019-06-17 04:18:54 UTC
Hi,

As per the health-check code, I don't think the existence of the health-check file (.glusterfs/health_check) can be the reason for the brick failure, but I will try to reproduce it.

The health-check thread always opens the health_check file with (O_CREAT|O_WRONLY|O_TRUNC, 0644), so even if the file is already present, the open truncates its contents and the health check always writes the latest timestamp into it. The logs show the error occurring while comparing the timestamp against the health_check file, which means the timestamp written to the file did not match what was read back.
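The write-then-read-back cycle described above can be sketched roughly as follows. This is a minimal sketch using plain synchronous I/O, not the actual GlusterFS implementation: the real code in posix-helpers.c uses POSIX AIO (the log mentions aio_read_cmp_buf()), and the function and buffer handling here are hypothetical. The point is that O_TRUNC discards any stale contents, so a leftover health_check file from a crash should not by itself cause a mismatch.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical single pass of the health check: write the current
 * timestamp into the file, read it back, and compare. Returns 0 on
 * success, -1 on any failure or mismatch. */
static int health_check_once(const char *path)
{
    char wbuf[64], rbuf[64];
    time_t now = time(NULL);
    snprintf(wbuf, sizeof(wbuf), "%ld\n", (long)now);

    /* O_TRUNC discards any stale contents left by a crashed brick,
     * so a pre-existing file starts empty before the write. */
    int fd = open(path, O_CREAT | O_RDWR | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, wbuf, strlen(wbuf)) != (ssize_t)strlen(wbuf)) {
        close(fd);
        return -1;
    }

    /* Read back from offset 0 and compare with what was written;
     * a mismatch here is what makes the health check fail. */
    memset(rbuf, 0, sizeof(rbuf));
    if (pread(fd, rbuf, strlen(wbuf), 0) != (ssize_t)strlen(wbuf)) {
        close(fd);
        return -1;
    }
    close(fd);

    return strcmp(wbuf, rbuf) == 0 ? 0 : -1;
}
```

Under this model, a second run against the same file truncates the stale timestamp and succeeds, which is why the mere existence of the file after a crash should be harmless.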

Are you sure the brick was actually stopped after you sent the kill signal? If more than one brick instance is running, this type of scenario can arise.

1) Please check the ps output to confirm the brick was stopped completely.
2) If the brick was stopped completely, kindly share the volume configuration and I will try to reproduce the issue.


Regards,
Mohit Agrawal

Comment 2 Joe Julian 2019-06-17 14:49:11 UTC
I'll just close this. I filed this 10 months ago and have turned off health checking and upgraded several times since then. I am quite sure that no more than one brick instance was running at the time.

