Bug 964038
Summary: | longevity: glusterfsd hung for more than 120s, making the node unreachable. | ||
---|---|---|---|
Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Vijaykumar Koppad <vkoppad> |
Component: | glusterd | Assignee: | Bug Updates Notification Mailing List <rhs-bugs> |
Status: | CLOSED WONTFIX | QA Contact: | amainkar |
Severity: | high | Docs Contact: | |
Priority: | medium | ||
Version: | 2.0 | CC: | bfoster, david.macdonald, dchinner, rhs-bugs, rwheeler, vbellur |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2013-05-28 07:31:42 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Vijaykumar Koppad
2013-05-17 06:27:05 UTC
Vijay, keeping it in your name to follow up with the RHEL team.

```
[2013-05-13 16:20:03.294622] E [posix-helpers.c:420:posix_pstat] 0-long1-posix: lstat failed on /bricks/long1_brick0/ and return value is 5 instead of -1. Please see dmesg output to check whether the failure is due to backend filesystem issue
```

*** This bug has been marked as a duplicate of bug 908158 ***

Perhaps I'm missing something, but what suggests this bug is a duplicate of 908158? I'm going from memory a bit here, so I could be wrong, but IIRC 908158 was a filesystem shutdown and wasn't associated with any kind of hangs or stalls on its own. Is the following error observed in the system logs for this issue?

```
xfs_iunlink_remove: xfs_inotobp() returned error 22.
```

The report doesn't indicate which server had the issue, but from the messages files in the sosreports it appears to be rhs01:

```
May 17 11:04:10 longevity-rhs01 kernel: INFO: task glusterfsd:1730 blocked for more than 120 seconds.
May 17 11:04:10 longevity-rhs01 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
May 17 11:04:10 longevity-rhs01 kernel: glusterfsd    D 0000000000000002     0  1730      1 0x00000000
May 17 11:04:10 longevity-rhs01 kernel: ffff880119bcd698 0000000000000082 0000000000000000 0000000000000000
May 17 11:04:10 longevity-rhs01 kernel: 0000000001000000 ffff880028313b00 0000000000000001 ffff880028315f80
May 17 11:04:10 longevity-rhs01 kernel: ffff88011877fab8 ffff880119bcdfd8 000000000000f4e8 ffff88011877fab8
May 17 11:04:10 longevity-rhs01 kernel: Call Trace:
May 17 11:04:10 longevity-rhs01 kernel: [<ffffffff814eea15>] schedule_timeout+0x215/0x2e0
May 17 11:04:10 longevity-rhs01 kernel: [<ffffffff814ef932>] __down+0x72/0xb0
May 17 11:04:10 longevity-rhs01 kernel: [<ffffffffa01ab402>] ? _xfs_buf_find+0x102/0x280 [xfs]
May 17 11:04:10 longevity-rhs01 kernel: [<ffffffff81096e01>] down+0x41/0x50
...
```
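As an aside for anyone triaging similar sosreports: the blocked task name and PID can be pulled out of these hung-task warnings mechanically. A minimal sketch, with the sample log line inlined (against a real sosreport you would instead pipe in the messages file, path assumed):

```shell
# Extract "<task> <pid>" from a kernel hung-task warning line.
# The sample line is copied from the sosreport quoted above; a real run
# would read /var/log/messages (hypothetical path) instead.
line='May 17 11:04:10 longevity-rhs01 kernel: INFO: task glusterfsd:1730 blocked for more than 120 seconds.'
printf '%s\n' "$line" |
  sed -n 's/.*INFO: task \([^:]*\):\([0-9][0-9]*\) blocked.*/\1 \2/p'
```

For the trace above this prints `glusterfsd 1730`, which matches the task/PID reported in the INFO line.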
There are hangs prior to this in the log, e.g.:

```
May 17 10:34:23 longevity-rhs01 kernel: INFO: task glusterfsd:1802 blocked for more than 120 seconds.
...
```

... and others, going back to the following (which looks similar to bug 967593, except at unlink rather than log recovery, and with slightly different free-space conditions):

```
May 12 13:53:05 longevity-rhs01 kernel: XFS: Internal error XFS_WANT_CORRUPTED_GOTO at line 1638 of file fs/xfs/xfs_alloc.c. Caller 0xffffffffa0151801
May 12 13:53:05 longevity-rhs01 kernel:
May 12 13:53:05 longevity-rhs01 kernel: Pid: 1727, comm: glusterfsd Not tainted 2.6.32-220.32.1.el6.x86_64 #1
May 12 13:53:05 longevity-rhs01 kernel: Call Trace:
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffffa0178f1f>] ? xfs_error_report+0x3f/0x50 [xfs]
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffffa0151801>] ? xfs_free_extent+0x101/0x130 [xfs]
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffffa014e709>] ? xfs_alloc_lookup_eq+0x19/0x20 [xfs]
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffffa014f946>] ? xfs_free_ag_extent+0x626/0x750 [xfs]
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffffa0151801>] ? xfs_free_extent+0x101/0x130 [xfs]
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffffa015ae4d>] ? xfs_bmap_finish+0x15d/0x1a0 [xfs]
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffffa019adaf>] ? xfs_remove+0x2af/0x3a0 [xfs]
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffffa01a8718>] ? xfs_vn_unlink+0x48/0x90 [xfs]
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffff8118481f>] ? vfs_unlink+0x9f/0xe0
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffff8118356a>] ? lookup_hash+0x3a/0x50
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffff81186da3>] ? do_unlinkat+0x183/0x1c0
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffff8117c1a6>] ? sys_newlstat+0x36/0x50
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffff81186df6>] ? sys_unlink+0x16/0x20
May 12 13:53:05 longevity-rhs01 kernel: [<ffffffff8100b0f2>] ? system_call_fastpath+0x16/0x1b
May 12 13:53:05 longevity-rhs01 kernel: XFS (dm-2): xfs_do_force_shutdown(0x8) called from line 3859 of file fs/xfs/xfs_bmap.c. Return address = 0xffffffffa015ae86
May 12 13:53:05 longevity-rhs01 kernel: XFS (dm-2): Corruption of in-memory data detected. Shutting down filesystem
May 12 13:53:05 longevity-rhs01 kernel: XFS (dm-2): Please umount the filesystem and rectify the problem(s)
```

The hung-task messages start right around this time and are interleaved with xfs_log_force() errors due to the shutdown until May 13 17:16:00, at which point the system resets, the log is replayed (no errors), and the same restart/log-replay sequence repeats a few times until the original hung-task message reported above.

Given all that, I suspect the initial issue here is the aforementioned corruption, and this should probably be marked as a duplicate of bug 967593. Vijaykumar or Sudhir, was there any recovery (i.e., xfs_repair) attempt or reformat after the corruption but before the more recent hangs? If not, and nobody else objects, I'll plan to dup this to 967593.

Brian, there was no recovery attempt or reformat.

The product version of Red Hat Storage on which this issue was reported has reached End Of Life (EOL) [1]; hence this bug report is being closed. If the issue is still observed on a current version of Red Hat Storage, please file a new bug report against the current version.

[1] https://rhn.redhat.com/errata/RHSA-2014-0821.html
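For reference, the device affected by the shutdown (the target of any xfs_repair attempt) can be read straight out of the corruption messages. A sketch over one of the lines quoted above, inlined here rather than grepped from the sosreport:

```shell
# Pull the block-device name out of an XFS shutdown/corruption message.
# Sample line copied from the log above; a real run would scan
# /var/log/messages (hypothetical path) instead of a here-string.
msg='May 12 13:53:05 longevity-rhs01 kernel: XFS (dm-2): Corruption of in-memory data detected. Shutting down filesystem'
printf '%s\n' "$msg" | sed -n 's/.*XFS (\([^)]*\)): Corruption.*/\1/p'
```

This prints `dm-2` for the message above. The corresponding filesystem would then be unmounted and inspected with `xfs_repair -n` (no-modify mode) before any actual repair, since xfs_repair requires the filesystem to be unmounted.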