From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7) Gecko/20040803 Firefox/0.8

Description of problem:
While shutting down or rebooting via init 0 or 6, I get the messages:

    Unmounting filesystems: umount2: Device or resource busy
    umount: /var: device is busy

The error occurs three more times (the retry while loop in /etc/init.d/halt). I placed an lsof call in /etc/init.d/halt:

    ...
    while [ -n "$remaining" -a "$retry" -gt 0 ]
    do
        if [ "$retry" -lt 3 ]; then
            LANG=C runcmd $"Unmounting file systems (retry): " umount -f $remaining
        else
            LANG=C runcmd $"Unmounting file systems: " umount -f $remaining
        fi
        sleep 2
        remaining=`halt_get_remaining | sort -r`
        [ -z "$remaining" ] && break
        /sbin/fuser -k -m $sig $remaining >/dev/null
        # added for debugging: list any files still open on /var
        lsof /var
        sleep 5
        retry=$(($retry-1))
        sig=-9
    done
    ...

There are no open files, and all services stop without failing beforehand. The machine acts as a Samba server; only standard services such as ssh and cron are active besides that. The problem is hardware-independent and occurs in about 90% of all shutdowns or reboots - not always.

Version-Release number of selected component (if applicable):
initscripts-7.31.16.EL-1

How reproducible:
Sometimes

Steps to Reproduce:
1. Install RHEL3 ES with Update 3 on a machine.
2. Let the server run for a while (a few minutes seem to be enough).
3. Do an init 0 or init 6.

Actual Results:
The /var filesystem could not be unmounted.

Expected Results:
The /var filesystem should be unmounted without errors.

Additional info:
Stopping all running services and unmounting the /var filesystem by hand is no problem. The /var filesystem is a separate LVM volume, formatted ext3. After rebooting the filesystem appears clean; no fsck runs.
Are you using the audit daemon, or is it disabled?
Yes, the audit daemon is up and running. But I can stop and start this daemon properly by hand. Is there any configuration issue for the audit daemon?
Just testing a theory... I'm assuming stopping the audit daemon beforehand doesn't affect this.
I have been having the same issue on several servers as well. The /var filesystems on the affected servers are just normal ext3 filesystems - no LVM or RAID. RHEL 3 Update 3.
That's right, I've seen the problem since using update 3, but never before.
I'm going to go out on a limb and assume that backing down to the U2 initscripts doesn't help?
I think this is a vixie-cron-3.0.1-75.1 problem. vixie-cron-3.0.1-75.1 supports LAuS logging. When audit is off, as in Bugzilla #131860, this problem does not occur.
Well, since I did a "chkconfig audit off" the problem doesn't occur anymore.
*** Bug 143694 has been marked as a duplicate of this bug. ***
"Just testing a theory... I'm assuming stopping the audit daemon beforehand doesn't affect this." That is correct, if the audit daemon has been started and you stop it (even killing the process with kill), the problem still occurs. -Darrian
Once a cron script (e.g. /etc/cron.d/sysstat) has been executed while the audit daemon is running, this problem occurs. You need to stop the audit daemon before records like the following appear:

    ---- /var/log/cron ----
    CROND[3063]: (root) CMD (/usr/bin/mrtg /etc/mrtg/mrtg.cfg)
    CROND[3065]: (root) CMD (/usr/lib/sa/sa1 1 1)
    ---- /var/log/cron ----

This problem does not occur with vixie-cron-3.0.1-74. (Sorry for my poor English.)
A note: As described in Issue 44127, lsof will not show open files if they belong to detached threads created by a parent thread which has exited. This may explain why lsof is not showing files held by daemons.
This is from HP: I did some further testing. Here are my results:

1) As we both noticed, /var must be on its own partition.

2) I can reproduce the umount failure on shutdown when vixie-cron-3.0.1-75.1, which comes with RHEL 3.0 U3, is installed. It is timing-related, and you have to wait until you see audit messages indicating that cron has loaded its entries:

    2005-01-13T11:23:36 2 1523 -1 cron: crontab entry loaded - uid=0, gid=0, cmd=run-parts /etc/cron.hourly
    2005-01-13T11:23:36 3 1523 -1 cron: crontab entry loaded - uid=0, gid=0, cmd=run-parts /etc/cron.daily
    2005-01-13T11:23:36 4 1523 -1 cron: crontab entry loaded - uid=0, gid=0, cmd=run-parts /etc/cron.weekly
    2005-01-13T11:23:36 5 1523 -1 cron: crontab entry loaded - uid=0, gid=0, cmd=run-parts /etc/cron.monthly
    2005-01-13T11:23:36 6 1523 -1 cron: crontab loaded - crontab=/etc/crontab, user=*system*

After that, I can see the umount /var failures on shutdown.

3) If I install vixie-cron-3.0.1-74, which came with RHEL 3.0 U2 (rpm -Uvh --force vixie-cron-3.0.1-74), I can't get it to fail, even after letting it sit for hours and then doing a shutdown.

4) If I go back to vixie-cron-3.0.1-75.1, I can eventually get it to fail again, given the conditions in #2 above.
When the problem occurs, what are the last few lines of /var/log/cron? Please append the output of the command:

    # tail -100 /var/log/cron

to this bug report.

You might try installing the latest cron version for RHEL-3: vixie-cron-4.1-1_EL3. It may not be in the RHEL-3 update channels yet - if not, it can be downloaded from:

    http://people.redhat.com/~jvdias/cron/RHEL-3
Created attachment 110954 [details] tail -100 of cron log as requested.
We have a customer with the same issue. The vixie-cron patch mentioned above did not work for him. Attached is his cron log (tail -100, anyway).
This is not a vixie-cron OR a LAuS issue.

/etc/init.d/halt is the script that does the umounts and prints the messages:

    Unmounting filesystems: umount2: Device or resource busy
    umount: /var: device is busy

If you add a listing of active processes with open files, using lsof, just before the code that prints the message, around line 157:

    ...
    while [ -n "$remaining" -a "$retry" -gt 0 ]
    do
        echo 'LSOF /var:'
        /sbin/lsof +D /var
        echo 'LSOF:'
        /sbin/lsof | /bin/egrep -v 'lsof|egrep'
        if [ "$retry" -lt 3 ]; then
    ...

(copy /usr/sbin/lsof to /sbin/lsof first) and then issue the 'halt' command, you can see all the processes that still have files open on /var or that are still running, using the <Shift>+<Page Up> / <Page Down> keys to scroll the process listing.

This shows NO instances of auditd or vixie-cron, nor any process with a file open under the /var partition. The only processes running at this point are:

    init migration keventd ksoftirqd kswapd bdflush kupdated
    mdrecover scsi_eh_0 scsi_eh_1 kjournald S01halt

Because only init and the kernel daemons are running at this point, and we've verified that init has no open files on /var, this appears to be a kernel issue, in that either:

A) the umount(2) system call is failing for no good reason, or
B) one of the kernel daemons (e.g. kjournald) is blocking the umount.

Kernel debugging needs to be enabled to get to the bottom of this. I'm moving this bug to the kernel.
Please install and run the kernel-smp-2.4.21-28.umount3.EL.i686.rpm kernel located in:

    http://people.redhat.com/anderson/.BZ_142532

It contains several debug statements that will be displayed in the failure path of a umount attempt. The output will be of this sort:

    do_umount: EBUSY: mnt_count: # flags: !MNT_DETACH
    do_umount: super_block: dirty: # locked: # open: #

where the mnt_count # will be 3 or more, and there may be non-zero counts for dirty, locked and open files (if we're lucky). It's just a start, but it will at least let us know whether there are file-related reasons for the umount failure. If not, we only have the mnt_count to work with: it should be 2 at umount time, and would be higher if there were either a process standing in the /var filesystem or a mntget() call not accompanied by an associated mntput().
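As background on what those counters mean: any outstanding reference to a mounted filesystem - an open file, a process whose cwd is inside it, or an unreleased kernel-internal reference - makes umount(2) fail with EBUSY. A minimal userspace sketch of that rule (the /mnt/test mount point and file name are made up for the example; run as root against a scratch filesystem):

    /* ebusy_demo.c - hold one reference, watch umount(2) fail */
    #include <stdio.h>
    #include <errno.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mount.h>

    int main(void)
    {
        /* hold an open file descriptor on the scratch filesystem ... */
        int fd = open("/mnt/test/holdme", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* ... so this umount must fail with "Device or resource busy" */
        if (umount("/mnt/test") < 0)
            printf("umount: %s\n", strerror(errno));

        close(fd);
        unlink("/mnt/test/holdme");

        /* with the last reference dropped, the umount succeeds */
        if (umount("/mnt/test") == 0)
            printf("umount: ok\n");
        return 0;
    }

The puzzle in this bug is that the debug output shows an elevated mnt_count with no corresponding open files at all, which points at a reference held inside the kernel rather than by a process.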
I've tested the problem with the debug kernel now - the output during the unmount phase of /etc/init.d/halt is:

    Unmounting file systems (retry): \
    do_umount: EBUSY: mnt_count: 9 flags: !MNT_DETACH
    do_umount: super_block: dirty: 426 locked: 0 open: 0
    sys_umount: do_umount returned EBUSY
    umount2: Device or resource busy
    umount: /var: device is busy

Again, 'lsof +D /var' showed NO open files on /var, and ONLY these processes were running:

    init migration keventd ksoftirqd kswapd bdflush kupdated
    mdrecover scsi_eh_0 scsi_eh_1 S01halt (/bin/bash)

I also tried inserting a 'sync' command into the retry loop, thinking that might clear the superblock dirty condition, but no joy. The last umount debug output from the kernel (the retry loop tries three times) differed from the above only in this line:

    do_umount: super_block: dirty: 0 locked: 0 open: 0

So the superblock is NOT dirty, and there are NO locked or open files, yet the umount still fails?

This problem is intermittent: it does not happen unless auditd has been running and audit messages have been emitted, but at the point where the umount fails, auditd is definitely not running - though the audit module is still loaded. Could this be something to do with the audit module?

I've never used netdump / crashdump before, but I'd be happy to try to generate a crash dump if that would help - some tips on setting it up would be appreciated.
I've also tried inserting:

    /sbin/rmmod audit
    /sbin/lsmod
    /bin/sync

into halt's retry loop, but the problem still occurs. Again, the umount debug says:

    do_umount: super_block: dirty: 0 locked: 0 open: 0

These are the modules still loaded at this point:

    nfs lockd sunrpc usbserial parport_pc lp parport tulip crc32 floppy
    sg microcode keybdev mousedev hid input ehci-hcd usb-uhci usbcore
    ext3 jbd ata_piix scsi_dump_register libata sd_mod scsi_mod

NO audit module is loaded. When the auditd process is not started and the machine is shut down immediately after booting, the umount sometimes succeeds and this problem does not occur - about once out of 10 tries.
Hey, Jason, sorry I didn't even realize you were in Westford! (We're getting too big here...) I'll definitely come over later and set up for a netdump. By the way, when the dirty/locked/open counts drop to zero, what was the "mnt_count" shown in the printk just above it at that point?
We've finally found the root cause of this bug: whenever auditd does a read() of the /dev/audit device, the kernel's mntget() reference count for the /var partition increases by 7 and never again decreases below 9 - and umount() will succeed only if that count is 2. This happens regardless of whether auditd keeps its log files on /var or not. We are currently working on a fix.
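To make the failure mode concrete, the bug pattern is an unbalanced mount reference count, roughly of the following shape. This is a hypothetical sketch in 2.4-era kernel style - mntget()/mntput() are the real reference-counting helpers, but the handler below is illustrative, not the actual LAuS driver code:

    /* sketch of a leaky character-device read handler */
    static ssize_t audit_dev_read(struct file *file, char *buf,
                                  size_t count, loff_t *ppos)
    {
            struct vfsmount *mnt = file->f_vfsmnt;

            mntget(mnt);    /* reference taken on the mount ...           */

            /* ... copy queued audit records out to userspace ... */

            return count;   /* BUG: no matching mntput(mnt) on this path,
                             * so mnt_count stays elevated and do_umount()
                             * keeps returning -EBUSY for the partition   */
    }

With the count never able to fall back to 2, every subsequent umount of /var is doomed to fail - which matches the "mnt_count: 9" seen in the debug kernel output above. The fix is simply to balance every mntget() with a mntput() on all return paths.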
Jason, I've put a new "umount4" test kernel in /usr/tmp on your machine, along with its debuginfo package. It will trace all mntget()'s done during the auditf_read() span. Because mntget() is a very commonly called function, it also might help if you boot the system with "maxcpus=1" so that we don't get any intermingling of trace output. Dave
Test kernels with a fix for this problem can be found in:

    http://people.redhat.com/anderson/.BZ_142532

There are 3 i686 kernels there:

    kernel-2.4.21-31.bz142532.EL.i686.rpm
    kernel-hugemem-2.4.21-31.bz142532.EL.i686.rpm
    kernel-smp-2.4.21-31.bz142532.EL.i686.rpm

If a non-i686 kernel is needed, please post a request for it here.
Dave,

Any idea when this fix will be implemented in an official kernel?

More important: is this a serious issue, or should I just ignore the error until a new kernel with this fix is available?

Ronald
That's up to the RHEL3 kernel maintainer, but my presumption is that it will be in RHEL3-U6. At this point in time, the question is whether it's going to be considered for a RHEL3-U5 re-spin, but again, that's not my call to make. It's a problem having to do with improper up-by-one reference counts on the files and their containing directories in /etc/audit/filesets.conf. The contents of those files should not be compromised by this bug.
Dave's fix will not be considered for a U5 respin, but it is on track for inclusion within the first couple of U6 builds.
A fix for this problem has just been committed to the RHEL3 U6 patch pool this evening (in kernel version 2.4.21-32.2.EL).
Super, thx. Ronald
> Dave,
>
> Any idea when this fix will be implemented in an official kernel?
>
> More important: is this a serious issue, or should I just ignore the error
> until a new kernel with this fix is available?
>
> Ronald

I am using an Intel server with built-in RAID 1 capability running 2x400GB NCQ SATA hard disks. The provided RHAS3.0U3 driver doesn't allow me to update to the U6 kernel! Should I just ignore the error and wait for an updated Intel driver? I am worried about both the /var and /var/spool partitions, which contain important mail.

Patrick
> I am using an Intel server with built-in RAID 1 capability running 2x400GB
> NCQ SATA hard disks. The provided RHAS3.0U3 driver doesn't allow me to
> update to the U6 kernel! Should I just ignore the error and wait for an
> updated Intel driver?

This question is outside the scope of this bugzilla. Please work with your support contact for resolution of that particular problem.

As I mentioned above in comment #34, the problem has to do with improper up-by-one reference counts on the files and their containing directories in /etc/audit/filesets.conf. But the contents of those files should not be compromised by this bug.
U6 is still under development. (It has not yet been released.)
OK, so if I just stop using the audit daemon, it will be fine. Thanks. PT
*** Bug 142960 has been marked as a duplicate of this bug. ***
To test this issue, simply run the system with LAuS and cron enabled.

1. Ensure LAuS is enabled (it is not by default):

    # chkconfig --level=2345 auditd on; chkconfig --level=016 auditd off

2. To speed things up, create a minute-ly cron job:

    # echo '* * * * * /bin/date>/dev/null' > /var/spool/cron/root

3. Reboot:

    # reboot

Once the rebooted system has been up for more than two minutes, audit log records will have been generated by cron. Now, to test:

    # halt

The system should shut down cleanly, with no messages like:

    Unmounting filesystems: umount2: Device or resource busy
    umount: /var: device is busy
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2005-663.html
We run kernel-2.4.21-37.0.1.EL and kernel-smp-2.4.21-37.0.1.EL with auditd turned off. We didn't see the error with the stock U6 kernel, but now see it again with the latest security-fix kernel!
Re: comment #56 Could you please confirm that the bug persists even with the latest updates? Thanks.
The bug reported here has been fixed in RHEL3 U8 (kernel version 2.4.21-47.EL). If you still encounter a problem with U8, please enter a new bugzilla report.