Bug 971774
Summary: | [RFE] Improve handling of failure of a disk, raid array or raid controller | ||||||
---|---|---|---|---|---|---|---|
Product: | [Community] GlusterFS | Reporter: | Niels de Vos <ndevos> | ||||
Component: | posix | Assignee: | Niels de Vos <ndevos> | ||||
Status: | CLOSED CURRENTRELEASE | QA Contact: | |||||
Severity: | medium | Docs Contact: | |||||
Priority: | medium | ||||||
Version: | mainline | CC: | gluster-bugs, primeroznl | ||||
Target Milestone: | --- | Keywords: | FutureFeature, Patch, Triaged | ||||
Target Release: | --- | ||||||
Hardware: | All | ||||||
OS: | Linux | ||||||
Whiteboard: | |||||||
Fixed In Version: | glusterfs-3.4.0 | Doc Type: | Enhancement | ||||
Doc Text: | Story Points: | --- | |||||
Clone Of: | Environment: | ||||||
Last Closed: | 2013-07-24 17:24:05 UTC | Type: | Bug | ||||
Regression: | --- | Mount Type: | --- | ||||
Documentation: | --- | CRM: | |||||
Verified Versions: | Category: | --- | |||||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
Cloudforms Team: | --- | Target Upstream Version: | |||||
Embargoed: | |||||||
Bug Depends On: | |||||||
Bug Blocks: | 970960 | ||||||
Attachments: |
|
Description
Niels de Vos
2013-06-07 09:06:04 UTC
REVIEW: http://review.gluster.org/5176 (posix: add a simple health-checker based on timers) posted (#1) for review on master by Niels de Vos (ndevos) REVIEW: http://review.gluster.org/5176 (posix: add a simple health-checker based on timers) posted (#2) for review on master by Niels de Vos (ndevos) REVIEW: http://review.gluster.org/5176 (posix: add a simple health-checker) posted (#3) for review on master by Niels de Vos (ndevos) REVIEW: http://review.gluster.org/5176 (posix: add a simple health-checker) posted (#4) for review on master by Niels de Vos (ndevos) Created attachment 767432 [details]
Test script
REVIEW: http://review.gluster.org/5176 (posix: add a simple health-checker) posted (#5) for review on master by Niels de Vos (ndevos) REVIEW: http://review.gluster.org/5176 (posix: add a simple health-checker) posted (#6) for review on master by Niels de Vos (ndevos) COMMIT: http://review.gluster.org/5176 committed in master by Vijay Bellur (vbellur) ------ commit 98f62a731ca13296b937bfff14d0a2f8dfc49a54 Author: Niels de Vos <ndevos> Date: Mon Jun 24 14:05:58 2013 +0200 posix: add a simple health-checker Goal of this health-checker is to detect fatal issues of the underlying storage that is used for exporting a brick. The current implementation requires the filesystem to detect the storage error, after which it will notify the parent xlators and exit the glusterfsd (brick) process to prevent further troubles. The interval the health-check runs can be configured per volume with the storage.health-check-interval option. The default interval is 30 seconds. It is not trivial to write an automated test-case with the current prove-framework. These are the manual steps that can be done to verify the functionality: - setup a Logical Volume (/dev/bz970960/xfs) and format is as XFS for brick usage - create a volume with the one brick # gluster volume create failing_xfs glufs1:/bricks/failing_xfs/data # gluster volume start failing_xfs - mount the volume and verify the functionality - make the storage fail (use device-mapper, or pull disks) # dmsetup table .. bz970960-xfs: 0 196608 linear 7:0 2048 # echo 0 196608 error > dmsetup-error-target # dmsetup load bz970960-xfs dmsetup-error-target # dmsetup resume bz970960-xfs # dmsetup table ... bz970960-xfs: 0 196608 error - notice the errors caught by syslog: Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): metadata I/O error: block 0x0 ("xfs_buf_iodone_callbacks") error 5 buf count 512 Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): I/O Error Detected. Shutting down filesystem Jun 24 11:31:49 vm130-32 kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s) Jun 24 11:31:49 vm130-32 kernel: VFS:Filesystem freeze failed Jun 24 11:31:50 vm130-32 GlusterFS[1969]: [2013-06-24 10:31:50.500674] M [posix-helpers.c:1114:posix_health_check_thread_proc] 0-failing_xfs-posix: health-check failed, going down Jun 24 11:32:09 vm130-32 kernel: XFS (dm-2): xfs_log_force: error 5 returned. Jun 24 11:32:20 vm130-32 GlusterFS[1969]: [2013-06-24 10:32:20.508690] M [posix-helpers.c:1119:posix_health_check_thread_proc] 0-failing_xfs-posix: still alive! -> SIGTERM - these errors are in the log of the brick as well: [2013-06-24 10:31:50.500607] W [posix-helpers.c:1102:posix_health_check_thread_proc] 0-failing_xfs-posix: stat() on /bricks/failing_xfs/data returned: Input/output error [2013-06-24 10:31:50.500674] M [posix-helpers.c:1114:posix_health_check_thread_proc] 0-failing_xfs-posix: health-check failed, going down [2013-06-24 10:32:20.508690] M [posix-helpers.c:1119:posix_health_check_thread_proc] 0-failing_xfs-posix: still alive! -> SIGTERM - the glusterfsd process has exited correctly: # gluster volume status Status of volume: failing_xfs Gluster process Port Online Pid ------------------------------------------------------------------------------ Brick glufs1:/bricks/failing_xfs/data N/A N N/A NFS Server on localhost 2049 Y 1897 Change-Id: Ic247fbefb97f7e861307a5998a9a7a3ecc80aa07 BUG: 971774 Signed-off-by: Niels de Vos <ndevos> Reviewed-on: http://review.gluster.org/5176 Tested-by: Gluster Build System <jenkins.com> Reviewed-by: Vijay Bellur <vbellur> |