Bug 142532 - error unmounting /var filesystem while shutdown
Summary: error unmounting /var filesystem while shutdown
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 3
Classification: Red Hat
Component: kernel
Version: 3.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Dave Anderson
QA Contact: Brian Brock
URL:
Whiteboard:
Duplicates: 143694 (view as bug list)
Depends On:
Blocks: 156321
 
Reported: 2004-12-10 14:00 UTC by Stefan Kutzke
Modified: 2007-11-30 22:07 UTC (History)
CC List: 15 users

Fixed In Version: RHSA-2005-663
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2005-09-28 14:35:56 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
tail -100 of cron log as requested. (4.22 KB, text/plain)
2005-02-11 02:59 UTC, Geronimo A. Ordanza II


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2005:663 0 qe-ready SHIPPED_LIVE Important: Updated kernel packages available for Red Hat Enterprise Linux 3 Update 6 2005-09-28 04:00:00 UTC

Description Stefan Kutzke 2004-12-10 14:00:24 UTC
From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7)
Gecko/20040803 Firefox/0.8

Description of problem:
While shutting down or rebooting via init 0 or 6, I get the messages

Unmounting filesystems: umount2: Device or resource busy
umount: /var: device is busy

The error occurs 3 more times (the while loop in /etc/init.d/halt retries).

I have placed an lsof call in /etc/init.d/halt:

... 
while [ -n "$remaining" -a "$retry" -gt 0 ]
do
    if [ "$retry" -lt 3 ]; then
        LANG=C runcmd $"Unmounting file systems (retry): "  umount -f $remaining
    else
        LANG=C runcmd $"Unmounting file systems: "  umount -f $remaining
    fi
    sleep 2
    remaining=`halt_get_remaining | sort -r`
    [ -z "$remaining" ] && break
    /sbin/fuser -k -m $sig $remaining >/dev/null
    lsof /var
    sleep 5
    retry=$(($retry-1))
    sig=-9
done
...

There are no open files.

All services stop without failing beforehand.  The machine acts as a
Samba server; only standard services like ssh, cron and so on are
active as well.

The problem is hardware independent and occurs in about 90% of all
shutdown or reboot attempts, i.e. not always.

Version-Release number of selected component (if applicable):
initscripts-7.31.16.EL-1

How reproducible:
Sometimes

Steps to Reproduce:
1. Install RHEL3 ES with Update 3 on a machine.
2. Let the server run for a while (a few minutes seem to be enough).
3. Do an init 0 or init 6.


Actual Results:  /var filesystem could not be unmounted.

Expected Results:  /var filesystem should be unmounted without errors

Additional info:

Stopping all running services and unmounting the /var filesystem by hand
seems to be no problem.
The /var filesystem is a separate LVM volume and is ext3-formatted.
After rebooting, the filesystem seems to be clean; no fsck is run.
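
For reference, a minimal sketch of that manual check (the service names
below are illustrative assumptions, not taken from this report):

  #!/bin/sh
  # Sketch of the "by hand" test described above; the service list is an
  # assumption - adjust it to whatever is enabled on the machine.
  for svc in smb sshd crond syslog audit; do
      /sbin/service $svc stop
  done
  /usr/sbin/lsof /var          # expect no output
  /sbin/fuser -vm /var         # expect no processes listed
  umount /var && echo "/var unmounted cleanly"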

Comment 1 Bill Nottingham 2004-12-10 21:38:40 UTC
Are you using the audit daemon, or is it disabled?


Comment 2 Stefan Kutzke 2004-12-13 16:08:17 UTC
Yes, the audit daemon is up and running. But I can stop and start this
daemon properly by hand. Is there any configuration issue for the
audit daemon?

Comment 3 Bill Nottingham 2004-12-14 01:16:34 UTC
Just testing a theory... I'm assuming stopping the audit daemon
beforehand doesn't affect this.

Comment 4 Jimmy Dean 2004-12-14 01:28:21 UTC
I have been having the same issue on several servers as well.  The
/var filesystems on the servers having the issue are just normal ext3
filesystems, no LVM or RAID.  RHEL 3 Update 3.

Comment 5 Stefan Kutzke 2004-12-14 15:35:42 UTC
That's right, I've seen the problem since using update 3, but never
before.

Comment 6 Bill Nottingham 2004-12-14 16:51:51 UTC
I'm going to go out on a limb and assume that backing down to the U2
initscripts doesn't help?

Comment 7 Taichi Yanagiya 2004-12-15 03:13:38 UTC
I think this is a vixie-cron-3.0.1-75.1 problem.
vixie-cron-3.0.1-75.1 supports LAuS logging.

When audit is off, as in Bugzilla #131860, this problem does not occur.

Comment 8 Stefan Kutzke 2004-12-15 13:47:43 UTC
Well, since I did a "chkconfig audit off" the problem doesn't occur
anymore.

Comment 9 Bill Nottingham 2004-12-24 19:01:46 UTC
*** Bug 143694 has been marked as a duplicate of this bug. ***

Comment 10 Darrian Hale 2005-01-11 00:24:53 UTC
"Just testing a theory... I'm assuming stopping the audit daemon
beforehand doesn't affect this."

That is correct, if the audit daemon has been started and you stop it
(even killing the process with kill), the problem still occurs.

-Darrian

Comment 11 Taichi Yanagiya 2005-01-12 01:43:28 UTC
Even if a cron script (e.g. /etc/cron.d/sysstat) is executed only once
while the audit daemon is running, this problem occurs.

You need to stop the audit daemon before the next record appears:

  ---- /var/log/cron ----
  CROND[3063]: (root) CMD (/usr/bin/mrtg /etc/mrtg/mrtg.cfg)
  CROND[3065]: (root) CMD (/usr/lib/sa/sa1 1 1)
  ---- /var/log/cron ----

This problem does not occur with vixie-cron-3.0.1-74.

Sorry, my poor English.


Comment 12 Steve Conklin 2005-01-12 16:10:09 UTC
A note: As described in Issue 44127, lsof will not show open files if they
belong to detached threads created by a parent thread which has exited. This may
explain why lsof is not showing files held by daemons.
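
A possible cross-check that does not depend on lsof, assuming the 2.4-era
/proc layout: walk each process's fd, cwd, root and exe links and look for
anything under /var.  This is an illustration, not part of the original report:

  #!/bin/sh
  # Sketch: scan /proc directly for references to /var, as a cross-check
  # for the lsof limitation described above.  Illustration only.
  for pid in /proc/[0-9]*; do
      for link in "$pid"/fd/* "$pid"/cwd "$pid"/root "$pid"/exe; do
          target=`readlink "$link" 2>/dev/null` || continue
          case "$target" in
              /var|/var/*) echo "$pid: $link -> $target" ;;
          esac
      done
  done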

Comment 13 Chris Williams 2005-01-14 14:57:56 UTC
This is from HP:

Did some further testing.  Here are my results:

1) As we both noticed, /var must be on its own partition.
2) I can reproduce the umount failure on shutdown when vixie-cron-3.0.1-75.1,
which comes with RHEL 3.0 U3, is installed.  It is timing related, and you
have to wait until you see audit messages indicating that cron has loaded
its crontab entries:

2005-01-13T11:23:36      2   1523       -1 cron: crontab entry loaded - uid=0, gid=0, cmd=run-parts /etc/cron.hourly
2005-01-13T11:23:36      3   1523       -1 cron: crontab entry loaded - uid=0, gid=0, cmd=run-parts /etc/cron.daily
2005-01-13T11:23:36      4   1523       -1 cron: crontab entry loaded - uid=0, gid=0, cmd=run-parts /etc/cron.weekly
2005-01-13T11:23:36      5   1523       -1 cron: crontab entry loaded - uid=0, gid=0, cmd=run-parts /etc/cron.monthly
2005-01-13T11:23:36      6   1523       -1 cron: crontab loaded - crontab=/etc/crontab, user=*system*

After that, I can see the umount /var failures on shutdown.

3) If I install vixie-cron-3.0.1-74 which came with RHEL 3.0 U2:

rpm -Uvh --force vixie-cron-3.0.1-74

I can't get it to fail, even after letting it sit for hours and then doing a shutdown.

4) If I go back to vixie-cron-3.0.1-75.1, I can then eventually get
it to fail, given the conditions described in item 2 above.

Comment 15 Jason Vas Dias 2005-01-17 15:19:27 UTC
When the problem occurs, what are the last few lines of 
/var/log/cron ?
Please append the output of the command:
# tail -100 /var/log/cron 
to this bug report.

You might try installing the latest cron version for RHEL-3:
  vixie-cron-4.1-1_EL3 
It may not be in RHEL-3 update channels yet - if not, it can
be downloaded from:
  http://people.redhat.com/~jvdias/cron/RHEL-3 

 

Comment 18 Geronimo A. Ordanza II 2005-02-11 02:59:24 UTC
Created attachment 110954 [details]
tail -100 of cron log as requested.

Comment 19 Geronimo A. Ordanza II 2005-02-11 03:01:21 UTC
We have a customer having the same issue.  The vixie-cron patch
mentioned above did not work out for him.  Attached is his cron file
(tail -100, anyway).

Comment 21 Jason Vas Dias 2005-02-24 19:28:35 UTC
This is not a vixie-cron OR a laus issue.

If, in /etc/init.d/halt - the script that does the umounts and prints
the messages:
"
Unmounting filesystems: umount2: Device or resource busy
umount: /var: device is busy
"
- you do a listing of active processes with open files, using "lsof",
just before the code that prints the message, around line 157:

...
while [ -n "$remaining" -a "$retry" -gt 0 ]
do
        echo 'LSOF /var:'
        /sbin/lsof +D /var
        echo 'LSOF:'
        /sbin/lsof | /bin/egrep -v 'lsof|egrep'
	if [ "$retry" -lt 3 ]; then
...

(copy /usr/sbin/lsof to /sbin/lsof first)

If you then issue the 'halt' command, you can see all the processes
that have any files open on /var or that are still running, using the
<shift> + <page up> / <page down> keys to view the process listing.

This shows NO instances of auditd or vixie-cron, nor any process that
has a file open under the /var partition.

The only processes that are running at this point are:

   init
   migration
   keventd
   ksoftirqd
   kswapd
   bdflush
   kupdated
   mdrecover
   scsi_eh_0
   scsi_eh_1
   kjournald
   S01halt
   
Because only init and the kernel daemons are running at this point,
and we've verified that init has no open files on /var, this seems
to be a kernel issue, in that either:
   A) the umount(2) system call is failing for no good reason,
or B) one or other of the kernel daemons (e.g. kjournald) is blocking
      the umount.
Kernel debugging needs to be enabled to get to the bottom of this.
I'm moving this bug to the kernel component.
 


Comment 23 Dave Anderson 2005-03-08 22:28:15 UTC
Please install and run the kernel-smp-2.4.21-28.umount3.EL.i686.rpm
kernel located in:

  http://people.redhat.com/anderson/.BZ_142532

It contains several debug statements that will be displayed in
the failure path of a umount attempt.  The output will be of
the sort:

  do_umount: EBUSY: mnt_count: # flags: !MNT_DETACH
  do_umount: super_block: dirty: # locked: # open: #

where the mnt_count # will be 3 or more, and there may be non-zero
counts for dirty, locked and open files (if we're lucky).

It's just a start, but it will at least let us know whether there are
file-related reasons for the umount failure, or, if not, whether we
only have the mnt_count to work with (it should be 2 at umount time,
but would be higher if there were either a process standing in the
/var filesystem or a mntget() call not accompanied by an associated
mntput()).
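
For anyone reproducing this, a rough sketch of installing the debug kernel
and capturing its output; the URL and package name are from the comment
above, while the serial-console hint is an assumption (syslogd and /var are
already gone when these messages are printed):

  # Sketch only - rpm name and URL are from comment #23, the rest is assumed.
  cd /tmp
  wget http://people.redhat.com/anderson/.BZ_142532/kernel-smp-2.4.21-28.umount3.EL.i686.rpm
  rpm -ivh kernel-smp-2.4.21-28.umount3.EL.i686.rpm
  # Optionally log the console over a serial line, since the debug printk's
  # appear after syslogd has stopped; add to the new kernel's boot line:
  #   console=tty0 console=ttyS0,9600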



Comment 24 Jason Vas Dias 2005-03-08 23:23:01 UTC
I've tested the problem with the debug kernel now - the output during
the unmount phase of /etc/init.d/halt is:

Unmounting file systems (retry): \
 do_umount: EBUSY: mnt_count: 9 flags: !MNT_DETACH
 do_umount: super_block: dirty: 426 locked: 0 open: 0
 sys_umount: do_umount returned EBUSY
 umount2: Device or resource busy
 umount: /var: device is busy

Again, 'lsof +D /var' showed NO open files on /var, and ONLY
these processes were running:
    init  
    migration
    keventd
    ksoftirqd
    kswapd
    bdflush 
    kupdated
    mdrecover
    scsi_eh_0
    scsi_eh_1
    S01halt (/bin/bash)

and the lsof listing showed NO open files on /var.

I also tried inserting a 'sync' command into the retry loop, 
thinking that might clear the super-block dirty condition,
but no joy. 

The last umount debug output from the kernel (the retry loop
tries three times) differed from the above only in this line:

  do_umount: super_block: dirty: 0 locked: 0 open: 0

So this means that the superblock is NOT dirty and there are NO locked
or open files, yet the umount still fails?
   
This problem is intermittent: it does not happen unless auditd
has been running and audit messages have been emitted; but at the
point where the umount fails, auditd is definitely not running,
although the audit module is still loaded. Could this be something
to do with the audit module?

I've never used netdump / crashdump before, but would be happy
to try to generate a crash if that would help - some tips on 
setting it up would be helpful.


Comment 26 Jason Vas Dias 2005-03-08 23:48:13 UTC
I've also tried inserting:
     /sbin/rmmod audit
     /sbin/lsmod
     /bin/sync
in halt's retry loop, but again the problem occurs. 
Again the umount debug says:

  do_umount: super_block: dirty: 0 locked: 0 open: 0

These are the modules still loaded at this point:
  
  nfs lockd sunrpc usbserial parport_pc lp parport tulip
  crc32 floppy sg microcode keybdev mousedev hid input
  ehci-hcd usb-uhci usbcore ext3 jbd ata_piix scsi_dump_register
  libata sd_mod scsi_mod

NO audit module is loaded. 

When the auditd process is not started and the machine is shut down
immediately after starting, the umount sometimes succeeds and this
problem does not occur - about once out of 10 tries.


Comment 27 Dave Anderson 2005-03-09 16:07:47 UTC
Hey, Jason, sorry I didn't even realize you were in Westford!
(We're getting too big here...)  I'll definitely come over later
and set up for a netdump.

By the way, when the dirty/locked/open counts drop to zero,
what was the "mnt_count" shown in the printk just above it
at that point?

 

Comment 28 Jason Vas Dias 2005-03-10 16:53:46 UTC
We've finally found the root cause of this bug:
whenever auditd does a read() of the /dev/audit
device, the kernel's mnt_get count for the /var
partition increases by 7, and then never decreases
below 9 - umount() will succeed only if the mnt_get
count is 2 . This happens regardless of whether auditd 
keeps its log files on /var or not. We are currently 
working on a fix for this.  


Comment 29 Dave Anderson 2005-03-10 16:56:43 UTC
Jason,

I've put a new "umount4" test kernel in /usr/tmp on your machine,
along with its debuginfo package.  It will trace all mntget()'s
done during the auditf_read() span.  Because mntget is a very
commonly-called function, it also might help if you boot the system
with "maxcpus=1" so that we don't get any intermingling of trace
outputs.

Dave

Comment 30 Lee Whatley 2005-03-18 22:21:08 UTC
lwhatley.mil

Comment 31 Dave Anderson 2005-03-18 22:48:23 UTC
Test kernels with a fix for this problem can be found in:

  http://people.redhat.com/anderson/.BZ_142532

There are 3 i686 kernels there:

  kernel-2.4.21-31.bz142532.EL.i686.rpm
  kernel-hugemem-2.4.21-31.bz142532.EL.i686.rpm
  kernel-smp-2.4.21-31.bz142532.EL.i686.rpm

If a non-i686 kernel is needed, please post a request for it here.




Comment 33 Ronald 2005-04-11 11:19:30 UTC
Dave,

Any idea when this fix will be implemented in an official kernel? 

More importantly: is this an important issue, or should I just ignore the error
until a new kernel with this fix is available?

Ronald

 

Comment 34 Dave Anderson 2005-04-11 13:24:46 UTC
That's up to the RHEL3 kernel maintainer, but my presumption is
that it will be in RHEL3-U6.  At this point in time, the question
is whether it's going to be considered for a RHEL3-U5 re-spin,
but again, that's not my call to make.

It's a problem having to do with improper up-by-one reference
counts on the files and their containing directories in 
/etc/audit/filesets.conf.  The contents of those files should
not be compromised by this bug.
 


Comment 35 Ernie Petrides 2005-04-13 03:08:55 UTC
Dave's fix will not be considered for a U5 respin, but it is
on track for inclusion within the first couple of U6 builds.


Comment 36 Ernie Petrides 2005-04-23 00:37:57 UTC
A fix for this problem has just been committed to the RHEL3 U6
patch pool this evening (in kernel version 2.4.21-32.2.EL).


Comment 40 Ronald 2005-05-21 08:58:13 UTC
Super, thx. 

Ronald

Comment 41 Patrick Tsang 2005-05-30 09:41:04 UTC
>Dave,
>
>Any idea when this fix will be implemented in an official kernel?
>
>More importantly: is this an important issue, or should I just ignore the error
>until a new kernel with this fix is available?
>
>Ronald

I am using an Intel server with built-in RAID 1 capability to run 2x400G
NCQ SATA hard disks.
The provided RHAS3.0U3 driver doesn't allow me to update to the U6 kernel!!
Should I just ignore the error and wait for the updated Intel driver?
I am worried about both the /var and /var/spool partitions, which contain
important mail.

Patrick



Comment 43 Dave Anderson 2005-05-31 13:54:32 UTC
> I am using an Intel server with built-in RAID 1 capability to run 2x400G
> NCQ SATA hard disks.
> The provided RHAS3.0U3 driver doesn't allow me to update to the U6 kernel!!
> Should I just ignore the error and wait for the updated Intel driver?

This question is outside the scope of this bugzilla.  Please work with
your support contact for resolution of that particular problem.

As I mentioned above in comment #34, the problem has to do with
improper up-by-one reference counts on the files and their containing
directories in /etc/audit/filesets.conf.  But the contents of those files
should not be compromised by this bug.
 

Comment 44 Ernie Petrides 2005-05-31 19:43:05 UTC
U6 is still under development.  (It has not yet been released.)


Comment 45 Patrick Tsang 2005-06-01 04:50:40 UTC
OK, so if I just stop using the audit daemon it will be fine.
Thanks
PT


Comment 48 Peter Staubach 2005-06-09 14:49:57 UTC
*** Bug 142960 has been marked as a duplicate of this bug. ***

Comment 53 Jason Vas Dias 2005-08-16 18:06:55 UTC
To test this issue, simply run the system with LAuS and cron enabled.
1. Ensure LAuS is enabled (it is not by default):
   # chkconfig --level=2345 auditd on; chkconfig --level=016 auditd off
2. To speed things up, create a per-minute cron job:
   # echo '* * * * * /bin/date>/dev/null' > /var/spool/cron/root
3. # reboot

When the rebooted system has been up more than two minutes, audit log records
will have been generated by cron.

Now, to test:
   # halt

The system should shut down cleanly, with no messages like:
' Unmounting filesystems: umount2: Device or resource busy
  umount: /var: device is busy
'
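
The steps above, collected into a small script for convenience (a sketch;
the chkconfig commands and crontab path are taken from this comment):

  #!/bin/sh
  # Sketch automating the reproducer above.  Run as root on a RHEL3 box
  # where /var is a separate filesystem, then reboot, wait more than two
  # minutes, and run "halt" to check for "umount: /var: device is busy".
  chkconfig --level=2345 auditd on
  chkconfig --level=016  auditd off
  echo '* * * * * /bin/date>/dev/null' > /var/spool/cron/root
  reboot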


Comment 55 Red Hat Bugzilla 2005-09-28 14:35:56 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2005-663.html


Comment 56 Tom O'Brien 2006-03-27 17:00:25 UTC
We run kernel-2.4.21-37.0.1.EL and kernel-smp-2.4.21-37.0.1.EL with
auditd turned off.  We didn't see the error with the stock U6 kernel, but
now see it again with the latest security-fix kernel!

Comment 57 Demosthenes T. Mateo Jr. 2006-07-20 04:43:06 UTC
Re: comment #56

Could you please confirm that the bug persists even with the latest updates? Thanks.

Comment 58 Ernie Petrides 2006-07-20 20:22:40 UTC
The bug reported here has been fixed in RHEL3 U8 (kernel version 2.4.21-47.EL).

If you still encounter a problem with U8, please enter a new bugzilla report.

