From Bugzilla Helper:
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.7.1) Gecko/20040707

Description of problem:
I'm using RHCM and testing cluster failover. After a host has been running for a few minutes, an at-job triggers a reboot on the current host so that the other host mounts the shared drive and resumes processing. If there is heavy I/O activity (copying a 6 GB file) on the shared drive before the failover, then after a few iterations of this test I eventually encounter a kernel panic when the new host uses the shared drive. The problem is reproducible, but apparently only with this heavy I/O activity. The shared drive is mounted as ext3, as indicated by the stack trace.

Here's what I copied from the screen. At the top, the values are:
f884b077 f5393c60 f5ca1200 f3e765a0 f3e36000 f3e37000 f5411400 20736e61

Call trace:
ext_free_data [ext3] 0xed
__wait_on_buffer [kernel] 0x8e
ext3_free_branches [ext3] 0x1b7
ext3_free_branches [ext3] 0xcc
ext3_free_branches [ext3] 0xcc
ext3_truncate [ext3] 0x2bc
start_this_handle [jbd] 0x125
journal_start_Rsmp_ec53be73 [jbd] 0xbf
start_transaction [jbd] 0x4e
ext3_delete_inode [ext3] 0xe7
ext3_sops [ext3] 0x0
iput_free [kernel] 0x14c
in_group_p [kernel] 0x1e
dentry_iput [kernel] 0x4b
vfs_unlock [kernel] 0x1eb
lookup_hash [kernel] 0x4a
sys_unlink [kernel] 0x9a
do_page_fault [kernel] 0x0
system_call [kernel] 0x33

Code: 8b 43 14 85 c0 75 37 68 60 ef 83 f8 68 5b 04 00 00 68 b0 0f <0>
Kernel panic: not continuing

Version-Release number of selected component (if applicable):
kernel-2.4.9-e.42

How reproducible:
Always

Steps to Reproduce:
1. On both systems in the cluster, edit /etc/rc.d/rc.local to set up an at-job that starts copying a big file to the shared drive 3 minutes from now. The script run by the at-job should use a sleep loop to wait for the shared drive to be mounted before starting to copy the big file.
2. Set up an at-job to reboot one system 7 minutes from now and the other 13 minutes from now.
3. After the systems have bounced a few times, file corruption will develop on the shared drive and eventually result in a kernel panic.

Actual Results: Kernel panic as described earlier.

Expected Results: No disk corruption (and no panic), or non-fatal detection of file system problems on the shared drive.

Additional info:
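The sleep loop used by the at-job script can be sketched like this (a minimal illustration, not the script we actually ran; the /shared mount point and the bigfile path are hypothetical, and the mounts-table path is a parameter so the helper can be exercised without real shared storage):

```shell
#!/bin/sh
# wait_for_mount MOUNTPOINT [MOUNTS_FILE]
# Polls the mounts table (by default /proc/mounts) until MOUNTPOINT
# appears as a mounted file system, then returns 0. Gives up after
# 60 tries (roughly 5 minutes) and returns 1.
wait_for_mount() {
    mp=$1
    tab=${2:-/proc/mounts}
    i=0
    while [ "$i" -lt 60 ]; do
        # Field 2 of each mounts line is the mount point; exact field
        # comparison avoids matching "/" against "/shared".
        if awk -v mp="$mp" '$2 == mp { found = 1 } END { exit !found }' "$tab"; then
            return 0
        fi
        sleep 5
        i=$((i + 1))
    done
    return 1
}

# Hypothetical at-job body: wait for the shared drive, then start the copy.
# wait_for_mount /shared && cp /var/tmp/bigfile /shared/bigfile
```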
The customer filed this on the 19th of July and has asked me what the status is. I still see it listed as "NEW". I have changed the priority to "high" in hopes of having it looked at a bit faster.
I'm wondering if this is a result of the cluster manager switching over to the other scsi port before the cache is flushed on the rebooting host. The drive can be repaired with fsck, and the inode with the most errors is the one representing the file receiving the large copy. I'm not familiar with the RHCM code that moves the shared drives over to the other system; is this movement coordinated with the shutdown? From the behavior of the system, it almost looks like a race condition between the flush and the port transfer.
Cluster Manager shuts down the services and unmounts the file systems before exiting; if you're doing a 'shutdown -r now' or 'reboot' call from your at-job, it won't finish rebooting until the file systems are unmounted. A few seconds later, the other node will notice that the first is down, and take over the services and mount the disks -- there's a fairly large gap between cluster shutdown and subsequent relocation. If you're calling 'reboot -fn', there really isn't much in the way of buffers to worry about.
I see your point, but I wonder if the surviving node sees that the cluster manager has gone down and, based on that info, mounts the shared drives. This may happen before the 'rebooting' host has actually stopped or, more importantly, flushed its buffers. If failure detection is based on the cluster manager's status on a node (and I believe that it is), doesn't the possibility of this race exist? Are a few seconds adequate time for the initial node to shut down? Given what we've seen, I can't help but feel that a few seconds aren't enough time, and that the remount of the shared drives should only occur after the rebooting machine has fully shut down.
It does mount the shared drives based on that, yes. In the event of a 'reboot' or 'shutdown -r now' (but not 'reboot -fn') command, clumanager will stop all services before stopping the quorum daemon - which means services won't relocate until long after the device has been unmounted. IIRC, the 2.4 kernel flushes dirty buffers to a file system's device at unmount time. Just in case it doesn't (e.g., the 2.2 kernel didn't in some cases), there is a program to force any remaining buffers to be flushed, which is run by the service scripts immediately after the file system is unmounted. This happens in the 'stop' phase of the service, so the quorum daemon won't exit until after this program has run. If buffers somehow survived both the umount and buf-flush passes, then we have a fairly serious problem. BTW, what exactly are you triggering with atd?
A reboot command is triggered with atd - no parameters. To stress the system, as noted above, we are also copying a large file to the shared drive. This copy is independent of the cluster manager - just a cp from a terminal. When I've run fsck, this is the inode that always comes up corrupted, although others have been found to be bad. Could this transfer induce a non-zero open count on the shared partition that causes the umount to fail or misbehave such that I/Os are still being accepted? I'm just thinking out loud. The file system thinks that the shared partition has been umounted cleanly, so the superblock flags are set correctly. This can only be reproduced with the cluster manager doing the remount on the other host. Reboots (and power cycles without shutdowns) on a single host running the cluster manager all work fine. It's only when a standby host is available that the problem occurs. Don't know if I've helped, but it's more information.
If 'force-unmount' support is enabled for that file system in the cluster configuration, any PIDs with open files on the partition will be killed with prejudice (first by SIGTERM, then by SIGKILL if necessary) by the service scripts. Now, if something in the service depending on the file system wasn't getting stopped (e.g. a user script returned failure, preventing the umount [including force-unmount] from happening), there's a fix for that waiting to go out: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=118886 https://bugzilla.redhat.com/bugzilla/attachment.cgi?id=100182&action=view However, I don't think this is the problem, as in the bugzilla above, the superblock certainly was not clean. The journal gets replayed, etc. Have you considered introducing power switches to the mix if you have not already done so?
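The SIGTERM-then-SIGKILL sequence described above can be sketched as follows (a minimal illustration only, not the actual clumanager service-script code; the helper takes an explicit PID list rather than querying the mount with something like fuser, so it can be tried without a mounted shared partition):

```shell
#!/bin/sh
# force_kill PID...
# Ask each process to exit with SIGTERM, then escalate to SIGKILL for
# any process that is still alive after a short grace period.
force_kill() {
    for pid in "$@"; do
        kill -TERM "$pid" 2>/dev/null
    done
    sleep 2   # grace period for clean exits
    for pid in "$@"; do
        # kill -0 only tests whether the process still exists.
        if kill -0 "$pid" 2>/dev/null; then
            kill -KILL "$pid" 2>/dev/null
        fi
    done
}

# In real service scripts the PID list would come from the processes
# holding files open on the partition, before retrying the umount.
```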
A full trace of the oops would be very helpful as well. thanks.
I've talked with another engineer here who has done some of the testing on this problem, and we have come up with the following matrix of the conditions under which the file system on the shared drives is corrupted.

First, the results when the operational system is rebooted, with the condition of the standby host given for each case:
- running normally: panic, resulting from disk corruption
- running in single user mode: panic, resulting from disk corruption
- powered off: operational host reboots without problem

In the 'running normally' case, the cluster manager is up and running.

Next, the results when the operational host is suddenly powered off (via the power switch), again with the condition of the standby host given for each case:
- running normally: panic, resulting from disk corruption
- running in single user mode: panic, resulting from disk corruption
- powered off: operational host reboots without problem

As can be seen, the results are the same. I didn't run the single user mode tests, so I need more info about what was going on in that state; it seems as though it should behave the same as when the standby host is powered off. I'll look into this situation. The panics occur when the system is rebooting: the shared partitions are mounted and a panic occurs. If the system is brought up in single user mode and fsck is attempted without the force flag, fsck reports that the partition is fine. We will attempt to get a full oops message for you; the problem is that the stack trace is so large it scrolls off the screen of the KVM and we can't scroll back. I'm going to hook a serial console to the machine and see if I can get a complete dump that way. If you know of any other tricks for getting dump info, let me know.
We have reproduced the panic with e49 (the general release version) and I am attaching the traceback. I have something of a workaround: running fsck via fstab whenever the system comes up. This usually prevents the panic (although not always, as per the enclosed trace), and fsck frequently needs manual intervention to complete.
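For reference, the boot-time fsck is driven by the sixth (fs_passno) field of the partition's fstab entry; a non-zero value makes the boot-time fsck pass check the partition. The device and mount point below are hypothetical, and this is only a sketch of the workaround, not a recommended cluster configuration:

```
# /etc/fstab: 'noauto' keeps the boot sequence from mounting the shared
# partition (clumanager does that), while the pass number 2 in the last
# field still makes the boot-time fsck pass check it.
/dev/sdb1   /shared   ext3   noauto   0 2
```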
Created attachment 102983 [details] Traceback from e49 kernel panic
As an additional data point, I reproduced this problem using LAS3.0ES on a pair of HP DL380G3s with 9GB of memory (only 8GB available due to ES config) and an MSA500 storage array. This combo is sold by HP as a package- their Proliant DL380 G3 Packaged Cluster with MSA500. All instances of this problem have been seen on this hardware config. The problem reproduced exactly with LAS3.0 and the 3.0 cluster manager (1.2.16-1). This leads me to wonder if there is an issue with the MSA500, where the shared file system lives. Is a scsi reset being issued by the standby host before the MSA500 can flush its buffers? I've noticed on 3.0 that the mount is transferred very quickly from host to host- pretty much as soon as the cluster manager shuts down on the rebooting host. I'm assuming that a reset is done as part of the mount. Am looking into this; any insight you could provide would be welcome.
Mount operations don't generally perform SCSI resets; if that's what you were asking. SCSI resets typically happen at boot time. At any rate, if you are seeing SCSI resets from the other member (ever), then the shared storage configuration you're using won't work very well, if at all.
No, we haven't seen scsi resets - just a poor choice of hypotheticals. I guess my question becomes whether something occurs during the mount that would disrupt the writing of buffers on the array.
I don't suspect so. Is the cache on the MSA configured in write-through mode or disabled?
I'm looking into the MSA configuration, but am having trouble locating info on how to configure the MSA; we don't change anything during system assembly. Taking another tack, I decided to see if the forced unmounts were working. Another engineer has been doing the testing, so I asked if we had been seeing failed umounts, and he said "All the time". I'm enclosing a piece of /var/log/messages from earlier in the week with the messages. It looks like the cluster shuts down with the mounts still up. This can't be good; the other system could then mount them before the rebooting system has a chance to flush its buffers. Does this sound plausible?
Created attachment 103141 [details] /var/log/messages sample showing failed umounts of shared partition
Please set the service manager's log level to 7, reproduce, and show the resulting logs. My guess is that there's interaction between what looks like fabric-level fencing and clumanager. You can't unmount a file system if processes are using it -- and you can't kill a process using a file system if that process is in disk-wait, which would happen if you fenced a device (or set of devices) while one or more processes was using them (e.g. copying a file). So, to be clear:
- Process using file system
- File system fenced
- Process in disk-wait
- clumanager stop
- clumanager tries to kill process, process won't die
- Service stop fails
- clumanager exits, complaining

The bugzilla below has a link to an errata which will prevent the service from restarting (by disabling the failed-to-stop service), but that's probably not what you're after. However, it's a start.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=128179
Is that link correct? It seems to point back to this bug (and that's where I went when I clicked on it ;^)
No. No, it's not. http://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=118886 The above is, and here's the errata: http://rhn.redhat.com/errata/RHBA-2004-223.html
Created attachment 103184 [details] Cluster file from same period as messages file.
Sent a cluster log from the same period as the messages file. The system had been at log level 7. If more logs are needed, just ask.
Have done some more investigation of the problem and found that the copy of the large file to the shared partition causes the copy process to spend a large amount of time in the task-uninterruptible state, making it unkillable, which in turn makes the umount of the shared partition fail as the cluster manager shuts down in response to the reboot command. We've done some rough measurements showing that it can take in excess of 3 minutes to completely shut down the machine after the cluster manager has gone down, so there is ample time for the dual mount to occur. I'm also looking at solutions: the first is making umount -f work for locally mounted file systems. I found a patch on the net which looks credible and am investigating its potential. In parallel, I'm trying to put a sleep in the SVC_START switch in handle_svc_request in the service manager daemon, figuring that if I can delay the startup of the service on the surviving node long enough, the umounts at system shutdown will release the shared drive. I'm a little uncomfortable with a 3+ minute sleep, but I'm not certain what other alternatives there are. Any thoughts on the above? Thanks.
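For reference, the uninterruptible state can be spotted from /proc: the field after the parenthesized command name in /proc/<pid>/stat is the scheduler state, and 'D' means uninterruptible disk sleep. A minimal sketch (the helper and its name are ours, not part of any Red Hat tooling):

```shell
#!/bin/sh
# proc_state PID
# Prints the one-letter scheduler state of PID from /proc/PID/stat
# (R = running, S = sleeping, D = uninterruptible disk sleep, Z = zombie).
proc_state() {
    # Strip everything through the last ") " so a command name that
    # contains spaces can't shift the fields, then take the state field.
    sed 's/^.*) //' "/proc/$1/stat" | cut -d' ' -f1
}

# Example: list processes currently stuck in uninterruptible sleep.
# for p in /proc/[0-9]*; do
#     [ "$(proc_state "${p#/proc/}")" = D ] && echo "${p#/proc/} in D state"
# done
```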
Tried the sleep in the SVC_START in the service manager, but at 300 seconds it still wasn't long enough to allow the rebooting node to shut down, so that approach didn't work. Then I hit on the idea of preventing the cluster from being shut down as part of the /etc/init.d scripts. So I commented out the 'touch /var/lock/subsys/cluster' in init.d/cluster, and renamed the CLUSTER_LOCKFILE to /var/lock/clulog. The init scripts don't stop daemons without a lock file. Now the cluster stops at the end of the shutdown, so the relocation to the standby host occurs well after the active host is down, and there are no dual mounts. We've been running the fix for 24 hours and it seems to work. The lock file in its original location doesn't seem to be used by the init script, and relocating it preserves the lock.c functionality.
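The workaround relies on the SysV init convention that the shutdown sequence only runs '<script> stop' for services whose /var/lock/subsys/<name> file exists. A minimal sketch of that gating (illustrative only; not the actual rc or clumanager code, and the lock directory is a parameter so the check can be tried outside /var/lock):

```shell
#!/bin/sh
# should_stop NAME [LOCKDIR]
# Mimics the rc shutdown check: a service's stop method is invoked at
# shutdown only if its subsystem lock file exists (by default under
# /var/lock/subsys).
should_stop() {
    name=$1
    lockdir=${2:-/var/lock/subsys}
    [ -f "$lockdir/$name" ]
}

# With the 'touch /var/lock/subsys/cluster' commented out, this check
# fails, rc never runs '/etc/init.d/cluster stop' during shutdown, and
# clumanager keeps running until the very end of the shutdown.
```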
Paul, just to make sure, are you still seeing the double mount problem with the Errata that Lon mentioned in Comment #22 (http://rhn.redhat.com/errata/RHBA-2004-223.html)?
The 1.0.27 version of clumanager should prevent the service from starting on the other node if force-unmount fails. Another way to handle it is to reboot the node (the Un-pretty way, e.g. "reboot -fn") if it can't safely umount the file system (instead of disabling the service).
I looked at the changelog for RHBA-2004-223.html and didn't see a fix that looked like it addressed this issue, but I'm removing my change from clumanager and we'll run our test on 1.0.27 and see if it reproduces. Should know the result sometime on Monday.
We tried using a vanilla 1.0.27 over the weekend (without the fix described in #26) and the file system corruption occurred. I don't understand your comment in #30 regarding the service not starting on the other node if the force-unmount fails. Does this apply to a "cluadmin -- service relocate..." or a system shutdown? On a system shutdown, the cluster tasks are stopped, but the mount is left behind when umount fails. The standby system sees the original system as down and relocates the service, which includes any associated mounts. Am I missing something? It would seem to lead to the same situation described in the next paragraph if umount failure prevented the cluster from shutting down. An interesting side effect of the #26 fix and I suspect the reboot -fn approach is that the node is never marked as down on the quorum disk. This causes the surviving node to power cycle the shutdown/rebooting node in situations where you really want to shutdown or reboot. Am looking at potential fixes, but may be hacking a hack.
Clarification of #30: If a service fails to stop (e.g. the file system can't unmount), the service should be marked as 'disabled', not 'stopped' or 'started', etc. This, in turn, should prevent it from failing over when the node is declared down, which should, in turn, prevent the file system corruption at the cost of service down time (which the administrator will need to fix). You should see logs similar to the following on 1.0.27 if any service fails to stop during shutdown (for any reason; disk/script/etc.):

May 12 13:50:00 blue clusvcmgrd[17244]: <warning> Service 1 failed to stop - disabling
May 12 13:50:02 blue clusvcmgrd[16826]: <crit> Not all services stopped cleanly, reboot needed
May 12 13:50:11 blue cluster[17459]: <notice> Completed shutdown of Cluster Manager

(All of that said, there's still a kernel panic which needs to be fixed...!)
This bug is filed against RHEL2.1, which is in maintenance phase. During the maintenance phase, only security errata and select mission critical bug fixes will be released for enterprise products. Since this bug does not meet that criteria, it is now being closed. For more information of the RHEL errata support policy, please visit: http://www.redhat.com/security/updates/errata/ If you feel this bug is indeed mission critical, please contact your support representative. You may be asked to provide detailed information on how this bug is affecting you.