Bug 639025 - [RHEL6] a GFS2 filesystem mount operation hangs after a successful power fence operation
Summary: [RHEL6] a GFS2 filesystem mount operation hangs after a successful power fence operation
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: cluster
Version: 6.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: David Teigland
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-09-30 16:22 UTC by Debbie Johnson
Modified: 2018-11-14 17:00 UTC
CC List: 13 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-05-25 13:51:25 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch two of two for a test kernel (765 bytes, patch)
2011-02-08 12:42 UTC, Steve Whitehouse
no flags

Description Debbie Johnson 2010-09-30 16:22:46 UTC
Description of problem:
In a 2-node RHEL6 cluster we observed the following problem:

NodeA and NodeB concurrently mount a gfs2 filesystem. During an I/O operation to the filesystem on NodeB, we force a kernel panic on NodeB.
After NodeB has been successfully fenced, we try to reboot it. During this reboot, the mount operation for the gfs2 filesystem hangs.

Our expectation would be that the reboot is successful and all gfs2 filesystems can be mounted.
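
A minimal sketch of the sequence on a standard two-node setup; the device path, mount point, and the sysrq panic trigger below are assumptions for illustration, not the exact commands used in the test:

# On both nodes: mount the shared GFS2 filesystem (device and mount point are placeholders)
mount -t gfs2 /dev/clustervg/lt_products /mnt/lt_products

# On NodeB, while I/O to the filesystem is in progress: force a kernel panic via sysrq
echo c > /proc/sysrq-trigger

# After NodeB has been fenced and rebooted: remount on NodeB -- this is the mount that hangs
mount -t gfs2 /dev/clustervg/lt_products /mnt/lt_products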

Version-Release number of selected component (if applicable):
RHEL 6.0 BETA

How reproducible:


Steps to Reproduce:

2 node cluster (rhel6) 
lilc052a-ics0 192.168.51.1 
lilc052b-ics0 192.168.51.2 

Date of the problem: Sep 28 09:09:07 

Quoting customer: 
""" In a 2-node RHEL6 cluster we observed the following problem: NodeA and NodeB concurrently mount a gfs2 filesystem. During I/O operation to the filesystem on NodeB, we force a kernel panic on node NodeB. After NodeB has been successfully fenced, we try to reboot it. During this reboot, the mount operation for the gfs2 filesystem hangs. Our expectation would be, that the reboot is successful and all gfs2 filesystems can be mounted. """ 

nodeB is lilc052b 

Looks like some fencing failed (maybe it's normal, maybe it happened while the node was rebooting after the manual panic?):

Sep 28 09:11:58 lilc052a fenced[7465]: fence lilc052b-ics0 dev 0.0 agent fence_apc result: error from agent 
Sep 28 09:11:58 lilc052a fenced[7465]: fence lilc052b-ics0 dev 1.0 agent fence_ilo result: error from agent 


However, the customer says the fencing did work: "After NodeB has been successfully fenced, we try to reboot it." 

Another entry that may be related: 
Sep 28 09:15:27 lilc052a kernel: INFO: task gfs2_quotad:7836 blocked for more than 120 seconds. 
Sep 28 09:15:27 lilc052a kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Sep 28 09:15:27 lilc052a kernel: gfs2_quotad D ffff88047fc24300 0 7836 2 0x00000080 
Sep 28 09:15:27 lilc052a kernel: ffff88085239da88 0000000000000046 0000000000000000 ffff88085239db00 
Sep 28 09:15:27 lilc052a kernel: ffff88085239da70 ffff88048e4115e8 ffff88085239dbd8 0000000104189514 
Sep 28 09:15:27 lilc052a kernel: ffff88086ac74678 ffff88085239dfd8 0000000000010518 ffff88086ac74678 

lilc052b has a call trace in /var/log/messages for 

Sep 28 09:55:25, followed by:
Sep 28 10:13:25 lilc052b gfs_controld[7791]: mount_done: lt_products not found
Sep 28 10:13:25 lilc052b gfs_controld[7791]: do_leave: lt_products not found
...

and:

Sep 28 10:13:33 lilc052b acpid: waiting for events: event logging is off 
Sep 28 10:13:33 lilc052b kernel: GFS2: fsid=: Trying to join cluster "lock_dlm", "lilc052:lt_products" 
Sep 28 10:13:34 lilc052b kernel: GFS2: fsid=lilc052:lt_products.1: Joined cluster. Now mounting FS... 
Sep 28 10:15:38 lilc052b kernel: INFO: task mount.gfs2:8388 blocked for more than 120 seconds. 
Sep 28 10:15:38 lilc052b kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. 
Sep 28 10:15:38 lilc052b kernel: mount.gfs2 D ffff88047fc24400 0 8388 8387 0x00000080 
Sep 28 10:15:38 lilc052b kernel: ffff881065f018d0 0000000000000082 0000000000000000 0000000000000000 
Sep 28 10:15:38 lilc052b kernel: ffffffff81bf8af0 0000000000000800 0000000000000003 00000001002037b7 
Sep 28 10:15:38 lilc052b kernel: ffff88106c671a58 ffff881065f01fd8 0000000000010518 ffff88106c671a58 
Sep 28 10:15:38 lilc052b kernel: Call Trace:
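
The 120-second hung-task messages above are truncated; a hedged way to capture the complete stacks of all blocked tasks when mount.gfs2 wedges like this (assumes console or syslog access and that sysrq is enabled) is:

# enable all sysrq functions, then dump stacks of tasks stuck in uninterruptible (D) state
echo 1 > /proc/sys/kernel/sysrq
echo w > /proc/sysrq-trigger

# the traces land in the kernel log
dmesg | tail -n 200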

Comment 1 RHEL Program Management 2010-09-30 16:27:39 UTC
Thank you for your bug report. This issue was evaluated for inclusion
in the current release of Red Hat Enterprise Linux. Unfortunately, we
are unable to address this request in the current release. Because we
are in the final stage of Red Hat Enterprise Linux 6 development, only
significant, release-blocking issues involving serious regressions and
data corruption can be considered.

If you believe this issue meets the release blocking criteria as
defined and communicated to you by your Red Hat Support representative,
please ask your representative to file this issue as a blocker for the
current release. Otherwise, ask that it be evaluated for inclusion in
the next minor release of Red Hat Enterprise Linux.

Comment 3 Lon Hohberger 2010-10-27 15:50:22 UTC
We need to know if this happens with:

a) firewall disabled, and
b) selinux in permissive mode.
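
On RHEL 6 that amounts to roughly the following on both nodes before retrying the mount (a sketch, not the exact test procedure):

# stop the firewall for the duration of the test
service iptables stop
service ip6tables stop

# switch SELinux to permissive mode (takes effect immediately, no reboot needed)
setenforce 0
getenforce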

Comment 5 Steve Whitehouse 2010-11-12 16:45:21 UTC
I don't understand this part of the report:

> However, the customer says the fencing did work: "After NodeB has been
> successfully fenced, we try to reboot it." 

The fencing appears to be ilo and power fencing, both of which should be able to reboot the node, so if fencing was successful, why did the node not reboot automatically?

Also, with respect to the log messages:

> Sep 28 10:13:25 lilc052b gfs_controld[7791]: mount_done: lt_products not found
> Sep 28 10:13:25 lilc052b gfs_controld[7791]: do_leave: lt_products not found

The mount done message is sent to gfs_controld by mount.gfs2 after the mount syscall has returned in order to inform it of the result. gfs_controld is complaining that it has no record of that particular mount group, so chances are that something has failed after the original join of the mount group, or maybe the join failed but was not reported for some reason.

Output from gfs_control dump etc. from a stuck cluster would help us figure out what is going on here.
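
For instance, something like the following from each node while the mount is stuck; gfs_control dump is the important one, and the dlm and fence commands are the usual companions, assuming the cluster3 userland tools are installed:

# gfs_controld daemon state and mount-group history
gfs_control dump
gfs_control ls

# dlm lockspace membership and dlm_controld debug buffer
dlm_tool ls
dlm_tool dump

# fence domain membership
fence_tool ls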

Comment 6 Ric Wheeler 2011-01-31 18:19:00 UTC
Can we please get answers to the questions in https://bugzilla.redhat.com/show_bug.cgi?id=639025#c5 ?

If this is still an issue, we would like to try and fix it. If no update or current information is posted, we will likely close this BZ this week.

Thanks!

Comment 10 Mark Hlawatschek 2011-02-08 08:26:34 UTC
The issue occurred during an internal QA test of RHEL 6 and RHCS including GFS2.
I'm sorry, but I cannot provide any more information, as this project is already closed.
Nevertheless, in my opinion this issue needs further investigation. In our tests the filesystem on the surviving node was unresponsive for more than 30 minutes.

Comment 11 Steve Whitehouse 2011-02-08 08:43:21 UTC
Mark, I have a hunch about the cause of this which you might be able to help me answer. When you saw this happen, did the nodes in question have a large amount of RAM (say >32G), and is it likely that the nodes were caching a large number of inodes (based on the number of inodes on the fs, and whether each node is likely to have accessed a large number of them just before the issue occurred)?

Assuming that both of these things are true, we have another report which may have provided the clue required to fix this.

I'd still be rather wary of saying that particular problem was the one originally reported without a bit more information from the original reporter, but it does seem likely.
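
As a rough check of the second condition, something like this on each node gives an idea of how many inodes are cached (the gfs2 slab cache name is an assumption and may differ between kernel builds):

# inodes known to the VFS: allocated vs. free
cat /proc/sys/fs/inode-nr

# per-filesystem slab caches, e.g. gfs2_inode
grep -i gfs2 /proc/slabinfo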

Comment 12 Mark Hlawatschek 2011-02-08 10:02:34 UTC
Steve,
I can confirm your guesses. The nodes have a large amount of RAM (>32G) and they were both caching a large number of inodes on the fs, as the failover test was done during a file creation test on both nodes.
Best Regards,
Mark

Comment 13 Steve Whitehouse 2011-02-08 12:38:46 UTC
Mark, if we give you a test kernel, are you in a position to run a test to see whether it resolves the issue?

Comment 16 Mark Hlawatschek 2011-02-08 13:00:29 UTC
Steve,
I'm sorry, but I don't have the required test setup available at the moment.

Comment 17 Steve Whitehouse 2011-02-08 13:10:48 UTC
OK, that's not a problem. We are definitely working on this and we'll be testing internally anyway to try to ensure that we resolve the issue. I believe we have a fix; we just need to verify that there are no side-effects to it.

Comment 18 David Teigland 2011-02-08 14:40:39 UTC
Be advised that there is not even a remote possibility that the patch in comment 14 is acceptable.

Comment 19 Ric Wheeler 2011-02-08 15:18:01 UTC
Dave, do you have an alternative patch in mind or revisions to the patch proposed in https://bugzilla.redhat.com/show_bug.cgi?id=639025#c14?

Comment 20 David Teigland 2011-02-08 16:45:24 UTC
I just found out from Steve what the specific performance issue was.  Now that I know that, I can develop a patch targeted at the specific issue.  Steve's patch fundamentally changes the entire design of the dlm -- it's like destroying your house and building a new one because a light bulb burned out.

Comment 23 David Teigland 2011-02-14 15:56:59 UTC
I do not believe Steve's comments (and my replies) are related to whatever problem the customer was having.  Steve's concern is about slow lock recovery.
That issue belongs in a different bug.

The reported problems seem to have been related to gfs_controld, but without more information we cannot say what the problem was, so this will have to wait for more information.

Comment 25 Steve Whitehouse 2011-05-25 13:51:25 UTC
We have had no further updates from the customer, and we've not got a reproducer. I think we have little option but to close this. If it happens again, or we have some further insight into what happened, then please reopen.

