Bug 128179 - (IT_48483) Kernel Panic after failover using RHCM
Product: Red Hat Enterprise Linux 2.1
Classification: Red Hat
Component: kernel
Hardware: i686 Linux
Priority: high  Severity: high
: ---
: ---
Assigned To: Jason Baron
QA Contact: Brian Brock
Reported: 2004-07-19 15:12 EDT by Paul Hansen
Modified: 2013-03-06 00:57 EST (History)
8 users

Doc Type: Bug Fix
Last Closed: 2007-10-19 15:22:07 EDT

Attachments (Terms of Use)
Traceback from e49 kernel panic (20.90 KB, text/plain)
2004-08-23 09:52 EDT, Paul Hansen
/var/log/messages sample showing failed umounts of shared partition (577.28 KB, text/plain)
2004-08-26 16:54 EDT, Paul Hansen
Cluster file from same period as messages file. (401.94 KB, text/plain)
2004-08-27 16:31 EDT, Paul Hansen

Description Paul Hansen 2004-07-19 15:12:36 EDT
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.1)

Description of problem:
I'm using RHCM and testing cluster failover.  After a host is running
for a few minutes, an at-job triggers a reboot on the current host
such that the other host mounts the shared drive and resumes processing. 

If there is heavy I/O activity (copying a 6 GB file) on the shared
drive before the failover, then after a few iterations of this test I
eventually encounter a kernel panic when the new host uses the shared
drive.  This problem is reproducible, but apparently only with this
heavy I/O activity.  The shared drive is mounted as ext3, as indicated
by the stack trace.

Here's what I copied from the screen.  At the top, the values are:
f884b077 f5393c60 f5ca1200 f3e765a0 f3e36000 f3e37000 f5411400 20736e61

Call trace:
ext_free_data [ext3] 0xed
__wait_on_buffer [kernel] 0x8e
ext3_free_branches [ext3] 0x1b7
ext3_free_branches [ext3] 0xcc
ext3_free_branches [ext3] 0xcc
ext3_truncate [ext3] 0x2bc
start_this_handle [jbd] 0x125
journal_start_Rsmp_ec53be73 [jbd] 0xbf
start_transaction [jbd] 0x4e
ext3_delete_inode [ext3] 0xe7
ext3_sops [ext3] 0x0
iput_free [kernel] 0x14c
in_group_p [kernel] 0x1e
dentry_iput [kernel] 0x4b
vfs_unlock [kernel] 0x1eb
lookup_hash [kernel] 0x4a
sys_unlink [kernel] 0x9a
do_page_fault [kernel] 0x0
system_call [kernel] 0x33

Code: 8b 43 14 85 c0 75 37 68 60 ef 83 f8 68 5b 04 00 00 68 b0 0f
<0> Kernel panic: not continuing

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. On both systems in the cluster, edit /etc/rc.d/rc.local to set up
an at-job to start copying a big file on the shared drive 3 minutes
from now.  The script run by the at-job should use a sleep loop to
wait for the shared drive to be mounted before starting to copy
the big file.
2. Set up an at-job to reboot one system 7 minutes from now and
another 13 minutes from now.
3. After the systems have bounced a few times, file corruption will
develop on the shared drive and eventually result in a kernel panic.
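The at-job setup in steps 1 and 2 can be sketched roughly as follows. All paths, file names, and timings here are assumptions for illustration, not taken from the actual test scripts.

```shell
#!/bin/sh
# Illustrative sketch of the reproduction harness (hypothetical paths).

SHARED_MNT=/mnt/shared         # assumed mount point of the shared drive
BIG_FILE=/var/tmp/bigfile      # assumed ~6 GB file to copy

# Sleep loop: wait until the shared drive shows up in /proc/mounts.
wait_for_mount() {
    while ! grep -q " $1 " /proc/mounts; do
        sleep 5
    done
}

# The at-jobs themselves (scheduled from rc.local) would look roughly like:
#   echo "/usr/local/bin/copy-big-file.sh" | at now + 3 minutes
#   echo "reboot" | at now + 7 minutes
```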

Actual Results:  Kernel Panic as described earlier.

Expected Results:  No disk corruption (and no panic), or non-fatal
detection of filesystem problems on the shared drive.

Additional info:
Comment 1 Michael Waite 2004-08-04 18:54:39 EDT
The customer filed this on the 19th of July and has asked me what
the status is. I still see it listed as "NEW". I have changed the
priority to "high" in hopes of having it looked at a bit faster.
Comment 2 Paul Hansen 2004-08-05 10:37:16 EDT
I'm wondering if this is a result of the cluster manager switching
over to the other scsi port before the cache is flushed on the 
rebooting host.  The drive can be repaired with fsck and the inode
with the most errors is the one representing the file receiving the
large copy.  Not being familiar with the RHCM code that moves the
shared drives over to the other system, is this movement coordinated
with the shutdown?  From the behavior of the system, it almost looks
like a race condition is occurring between the flush and the port
switchover.
Comment 3 Lon Hohberger 2004-08-11 18:55:35 EDT
Cluster Manager shuts down the services and unmounts the file systems
before exiting; if you're doing a 'shutdown -r now' or 'reboot' call
from your at job, it won't finish rebooting until the file systems are
unmounted.
A few seconds later, the other node will notice that the first is
down, and take over the services and mount the disks -- there's a
fairly large gap between cluster shutdown and subsequent relocation.

If you're calling 'reboot -fn', there really isn't much in the way of
buffers to worry about.

Comment 4 Paul Hansen 2004-08-12 10:21:22 EDT
I see your point, but wonder if the surviving node sees that the
cluster manager has gone down, and based on that info mounts the
shared drives.  This may happen before the 'rebooting' host has
actually stopped or, more importantly, flushed its buffers.  If
failure detection is determined by the cluster manager's status
on a node (and I believe that it is) doesn't the possibility of
this race exist?  Is a few seconds adequate time for the initial
node to shutdown?  Given what we've seen, I can't help but feel
that the few seconds aren't enough time, and that the remount 
of the shared drives should only occur after the rebooting machine
has fully shutdown.
Comment 5 Lon Hohberger 2004-08-12 10:58:06 EDT
It does mount the shared drives based on that, yes.  In the event of a
'reboot' or 'shutdown -r now' (but not 'reboot -fn') command,
clumanager will stop all services before stopping the quorum daemon -
which means services won't relocate until long after the device has
been unmounted.

IIRC, the 2.4 kernel flushes dirty buffers to a file system's device
at unmount time.

Just in case it doesn't (e.g., the 2.2 kernel didn't in some cases),
there is a program to force any remaining buffers to be flushed which
is run by the service scripts immediately after the file system is
unmounted.  This happens in the 'stop' phase of the service, so the
quorum daemon won't exit until after this program has run.
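The post-umount flush pass described here could, in principle, look like the following. The actual helper program isn't named in this report, so sync(8) and blockdev(8) are stand-ins.

```shell
# Stand-in for the post-umount buffer flush (actual helper unnamed here).
sync                               # flush dirty buffers system-wide
# blockdev --flushbufs /dev/sdc1   # per-device flush; device is hypothetical
```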

If buffers somehow survived both the umount and buf-flush passes, then
we have a fairly serious problem.

BTW, What exactly are you triggering with atd?
Comment 6 Paul Hansen 2004-08-12 12:21:21 EDT
A reboot command is triggered with atd, with no parameters.

To stress the system, as noted above, we are also copying a large
file to the shared drive.  This copy is independent of the cluster
manager- just a cp from a terminal.  When I've run fsck, this is
the inode that always comes up corrupted, although others have been
found to be bad.

Could this transfer induce a non-zero open count on the shared
partition that causes the umount to fail or not behave correctly
such that i/o's are still being accepted- I'm just thinking out
loud.  The file system thinks that the shared partition has been
umounted cleanly, so the superblock flags are set correctly.

This can only be reproduced with the cluster manager doing the 
remount on the other host.  Reboots (and power cycles without 
shutdowns) on a single host running the cluster manager all work
fine.  It's only when a standby host is available that the problem
appears.
Don't know if i've helped, but it's more information.
Comment 7 Lon Hohberger 2004-08-12 12:51:50 EDT
If 'force-unmount' support is enabled for that file system in the
cluster configuration, any PIDs with open files on the partition will
be killed with prejudice (first by SIGTERM, then by SIGKILL if
necessary) by the service scripts.
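The SIGTERM-then-SIGKILL pass Lon describes could be approximated with fuser(1) as below. This is a sketch of the idea, not the actual clumanager service-script code, and the mount point is hypothetical.

```shell
# Sketch of a force-unmount pass: TERM first, then KILL, then umount.
force_umount() {
    mnt=$1
    fuser -k -TERM -m "$mnt" 2>/dev/null   # polite: SIGTERM to users of the fs
    sleep 5
    fuser -k -KILL -m "$mnt" 2>/dev/null   # with prejudice: SIGKILL stragglers
    umount "$mnt"
}
# Usage: force_umount /mnt/shared
```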

Now, if something in the service depending on the file system wasn't
getting stopped (e.g. a user script returning failure, preventing the
umount [including force-unmount] from happening), there's a fix for
that waiting to go out:


However, I don't think this is the problem, as in the bugzilla above,
the superblock certainly was not clean.  The journal gets replayed, etc.

Have you considered introducing power switches to the mix if you have
not already done so?
Comment 8 Jason Baron 2004-08-12 13:05:02 EDT
A full trace of the oops would be very helpful as well. thanks.
Comment 9 Paul Hansen 2004-08-13 13:22:08 EDT
I've talked with another engineer here that has done some of the
testing on this problem and have come up with the following matrix
regarding the conditions under which the filesystem on the shared
drives is corrupted.  The first list is the results from the case
where the operational system is rebooted.  The condition of the
standby host is next to each case:

running normally: Panic, resulting from disk corruption
running in single user mode:  Panic, resulting from disk corruption
powered off: Operational host reboots without problem.

In the 'running normally' case, the cluster manager is up and running.
The next set of cases is when the operational host is suddenly 
powered off (via the power switch).  Once again, the condition of
the standby host is given for each case to the left of the colon-

running normally: Panic, resulting from disk corruption
running in single user mode:  Panic, resulting from disk corruption
powered off: Operational host reboots without problem.

As can be seen, the results are the same.

I didn't run the single user mode tests, so I need more info about
what was going on in that state; it seems as though it should behave
the same as when the standby host is powered off.  I'll look into
this situation.

The panics occur when the system is rebooting- the shared partitions
are mounted and a panic occurs.  If the system is brought up in
single user mode and fsck is attempted without the force flag, fsck
reports that the partition is fine.

We will attempt to get a full oops message for you-  the problem
is that the stack is so large, it scrolls off the screen of the
kvm and we can't scroll back.  I'm going to hook a serial console
to the machine and see if I can get a complete dump that way.  If
you know of any other tricks for getting dump info, let me know.
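For capturing an oops that scrolls off the console, a serial console is the standard trick. The settings below are illustrative; the device and baud rate are assumptions for this hardware.

```
# Append to the kernel line in grub.conf or lilo.conf (illustrative):
#   console=ttyS0,9600n8 console=tty0
#
# On a second machine attached via null-modem cable, capture the output:
#   minicom -C oops.log        (or simply: cat /dev/ttyS0 > oops.log)
```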
Comment 10 Paul Hansen 2004-08-23 09:51:41 EDT
We have reproduced the panic with e49 (the general release version)
and I am attaching the traceback.  I have something of a workaround:
running fsck, via fstab, whenever the system comes up.  This usually
prevents the panic (although not always, as per the enclosed trace),
and frequently fsck needs manual intervention to complete.
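Running fsck at boot via fstab is controlled by the sixth field (fs_passno); a non-zero value makes the boot-time fsck -A pass check the filesystem. An illustrative entry for the shared partition (device and mount point are assumptions) might be:

```
# /etc/fstab -- sixth field > 0 makes fsck -A check this fs at boot
/dev/sdb1   /mnt/shared   ext3   defaults,noauto   0   2
```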
Comment 11 Paul Hansen 2004-08-23 09:52:46 EDT
Created attachment 102983 [details]
Traceback from e49 kernel panic
Comment 12 Paul Hansen 2004-08-24 14:28:23 EDT
As an additional data point, I reproduced this problem using LAS3.0ES
on a pair of HP DL380G3s with 9GB of memory (only 8GB available due
to ES config) and an MSA500 storage array.  This combo is sold by HP
as a package- their Proliant DL380 G3 Packaged Cluster with MSA500.
All instances of this problem have been seen on this hardware config.
The problem reproduced exactly with LAS3.0 and the 3.0 cluster manager.

This leads me to wonder if there is an issue with the MSA500, where
the shared file system lives.  Is a scsi reset being issued by the
standby host before the MSA500 can flush its buffers?  I've noticed
on 3.0 that the mount is transferred very quickly from host to host-
pretty much as soon as the cluster manager shuts down on the rebooting
host.  I'm assuming that a reset is done as part of the mount.  Am
looking into this; any insight you could provide would be welcome.
Comment 13 Lon Hohberger 2004-08-24 15:05:12 EDT
Mount operations don't generally perform SCSI resets; if that's what
you were asking.  SCSI resets typically happen at boot time.

At any rate, if you are seeing SCSI resets from the other member
(ever), then the shared storage configuration you're using won't work
very well, if at all.
Comment 14 Paul Hansen 2004-08-24 16:04:05 EDT
No, we haven't seen scsi resets- just a poor choice of hypotheticals.
I guess my question becomes whether something occurs during the mount 
that would disrupt the writing of buffers on the array. 
Comment 15 Lon Hohberger 2004-08-24 16:43:21 EDT
I don't suspect so.  Is the cache on the MSA configured in
write-through mode or disabled?
Comment 18 Paul Hansen 2004-08-26 16:52:42 EDT
Am looking into the MSA configuration; having trouble locating
info on how to configure the MSA- we don't change anything during
system assembly.

Just taking another tack, I decided to see if the forced unmounts
were working.  Another engineer has been doing the testing, so I
asked if we had been seeing failed umounts, and he said "All the
time".  I'm enclosing a piece of /var/log/messages from earlier
in the week with the messages.  It looks like the cluster shuts
down with the mounts still up.  This can't be good; the other 
system could then mount them before the rebooting system has a
chance to flush its buffers.  Does this sound plausible?
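A quick way to confirm the failed umounts is to scan syslog for them. The exact clumanager message text is an assumption here, so the patterns are deliberately loose.

```shell
# Sketch: pull failed-unmount lines out of a syslog file.
scan_umount_failures() {
    grep -i "umount" "$1" | grep -i "fail"
}
# Usage: scan_umount_failures /var/log/messages
```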
Comment 19 Paul Hansen 2004-08-26 16:54:23 EDT
Created attachment 103141 [details]
/var/log/messages sample showing failed umounts of shared partition
Comment 20 Lon Hohberger 2004-08-26 17:22:02 EDT
Please set the service manager's log level to 7, reproduce, and show
the resulting logs.

My guess is that there's interaction between what looks like
fabric-level fencing and clumanager.  You can't unmount a file system
if processes are using it -- and you can't kill a process using a file
system if that process is in disk-wait, which would happen if you
fenced a device (or set of devices) while one or more processes was
using them (e.g. copying a file).

So, to be clear:
- Process using file system
- File system fenced
- Process in disk wait.
- clumanager stop
- clumanager tries to kill process, process won't die
- Service stop fails
- clumanager exits, complaining

The below bugzilla has a link to an errata which will prevent the
service from restarting (by disabling the failed-to-stop service), but
that's probably not what you're after.  However, it's a start.

Comment 21 Paul Hansen 2004-08-27 10:49:59 EDT
Is that link correct?  It seems to point back to this bug (and
that's where I went when I clicked on it ;^)
Comment 22 Lon Hohberger 2004-08-27 11:21:48 EDT
No. No, it's not.


The above is, and here's the errata:

Comment 23 Paul Hansen 2004-08-27 16:31:33 EDT
Created attachment 103184 [details]
Cluster file from same period as messages file.
Comment 24 Paul Hansen 2004-08-27 16:33:44 EDT
Sent a cluster log from the same period as the messages file.  The
system had been at log level 7.  If more logs are needed, just ask.
Comment 25 Paul Hansen 2004-09-01 09:50:17 EDT
Have done some more investigation of the problem and found that
the copy of the large file to the shared partition causes the copy
process to spend a large amount of time in the uninterruptible
task state, making it unkillable and causing the umount of the
shared partition to fail while the cluster manager is shutting down
in response to the reboot command.
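Processes stuck this way show up with a "D" (uninterruptible sleep) code in ps output, which is one way to confirm the diagnosis:

```shell
# List processes in uninterruptible sleep ("D" state); these cannot be
# signalled and will hold the shared partition open across the umount.
list_dstate() {
    ps -eo pid,stat,comm | awk 'NR > 1 && $2 ~ /^D/ {print}'
}
list_dstate
```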

We've done some rough measurements showing that it can take in
excess of 3 minutes to completely shut down the machine after the
cluster manager has gone down, so there is ample time for the
dual mount to occur.

I'm also looking at solutions:  the first is making umount -f work
for locally mounted file systems.  I found a patch on the net which
looks credible and am investigating its potential.

In parallel, I'm trying to put a sleep in the SVC_START switch in
handle_svc_request in the service manager daemon, figuring if I can
delay the start up of the service on the surviving node long enough,
the umounts at system shutdown will release the shared drive.  I'm
a little uncomfortable with a 3+ minute sleep, but I'm not certain
what other alternatives there are.

Any thoughts on the above?  Thanks.
Comment 26 Paul Hansen 2004-09-03 23:06:59 EDT
Tried the sleep in the SVC_START in the service manager, but at
300 seconds, it still wasn't long enough to allow the rebooting
node to shutdown, so that approach didn't work.

Then hit on the idea to prevent the cluster from being shutdown
as part of the /etc/init.d scripts.  So I commented out the 'touch
/var/lock/subsys/cluster' in init.d/cluster, and renamed the 
CLUSTER_LOCKFILE to /var/lock/clulog.  The init doesn't stop
daemons without a lock file. 

Now the cluster stops at the end of the shutdown, so the relocate
to the standby host occurs well after the active host is down, so
no dual mounts.  We've been running the fix for 24 hours and it 
seems to work.  The lock file in its original location doesn't
seem to be used by the init script, and relocating it preserves
the locking functionality.
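The change described above amounts to two edits in /etc/init.d/cluster; a sketch follows (the exact line and variable name may differ slightly in your version of the script):

```
# In /etc/init.d/cluster -- comment out the subsys lock in the start)
# branch, so init's shutdown pass no longer stops the cluster early:
#   # touch /var/lock/subsys/cluster
#
# And relocate the cluster's own lock file:
#   CLUSTER_LOCKFILE=/var/lock/clulog
```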
Comment 29 Brent Fox 2004-09-10 16:27:22 EDT
Paul, just to make sure, are you still seeing the double mount problem
with the Errata that Lon mentioned in Comment #22?
Comment 30 Lon Hohberger 2004-09-10 16:32:41 EDT
The 1.0.27 version of clumanager should prevent the service from
starting on the other node if force-unmount fails.

Another way to handle it is to reboot the node (the Un-pretty way,
e.g. "reboot -fn") if it can't safely umount the file system (instead
of disabling the service).
Comment 31 Paul Hansen 2004-09-10 17:15:07 EDT
I looked at the changelog for RHBA-2004-223.html and didn't see
a fix that looked like it addressed this issue, but I'm removing
my change from clumanager and we'll run our test on 1.0.27 and
see if it reproduces.  Should know the result sometime on Monday.
Comment 32 Paul Hansen 2004-09-13 14:37:00 EDT
We tried using a vanilla 1.0.27 over the weekend (without the
fix described in #26) and the file system corruption occurred.  I
don't understand your comment in #30 regarding the service not
starting on the other node if the force-unmount fails.  Does this
apply to a "cluadmin -- service relocate..." or a system shutdown?
On a system shutdown, the cluster tasks are stopped, but the mount
is left behind when umount fails.  The standby system sees the 
original system as down and relocates the service, which includes 
any associated mounts.  Am I missing something?  It would seem to
lead to the same situation described in the next paragraph if 
umount failure prevented the cluster from shutting down.

An interesting side effect of the #26 fix (and, I suspect, of the
reboot -fn approach) is that the node is never marked as down on the
quorum disk.
This causes the surviving node to power cycle the shutdown/rebooting
node in situations where you really want to shutdown or reboot.  Am
looking at potential fixes, but may be hacking a hack.

Comment 33 Lon Hohberger 2004-09-13 15:58:01 EDT
Clarification of #30: 

If a service fails to stop (e.g. file system can't unmount), the
service should be marked as 'disabled', not 'stopped' or 'started',
etc.  This, in turn, should prevent it from failing over when the node
is declared down, which should, in turn, prevent the file system
corruption at a cost of service down time (which the administrator
will need to fix).

You should see logs similar to the following on 1.0.27 if any service
fails to stop during shutdown (for any reason; disk/script/etc.):

May 12 13:50:00 blue clusvcmgrd[17244]: <warning> Service 1 failed to
stop - disabling
May 12 13:50:02 blue clusvcmgrd[16826]: <crit> Not all services
stopped cleanly, reboot needed
May 12 13:50:11 blue cluster[17459]: <notice> Completed shutdown of
Cluster Manager

(All of that said, there's still a kernel panic which needs to be
addressed.)
Comment 34 RHEL Product and Program Management 2007-10-19 15:22:07 EDT
This bug is filed against RHEL2.1, which is in maintenance phase.
During the maintenance phase, only security errata and select mission
critical bug fixes will be released for enterprise products.  Since
this bug does not meet those criteria, it is now being closed.

For more information on the RHEL errata support policy, please visit:

If you feel this bug is indeed mission critical, please contact your
support representative.  You may be asked to provide detailed
information on how this bug is affecting you.
