Bug 121459

Summary: System running LAuS hanging regularly
Product: Red Hat Enterprise Linux 3
Reporter: Peggy Proffitt <peggy.proffitt>
Component: laus
Assignee: Charlie Bennett <ccb>
Status: CLOSED CURRENTRELEASE
QA Contact: Jay Turner <jturner>
Severity: high
Priority: medium
Version: 3.0
CC: fenlason, laroche, mdewand, shillman, srevivo
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2004-04-29 13:35:54 UTC
Bug Blocks: 119235    
Attachments:
Description                                   Flags
tar file - content documented in Bug Report   none

Description Peggy Proffitt 2004-04-21 19:36:06 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; T312461)

Description of problem:
Installed RHEL 3 ES Update 2 beta on IBM SMP Blade server and enabled 
auditing. The audit data is collecting as expected, but the system is 
hanging several times a day. There are no errors in /var/log/messages 
near the time of the hang. The system console is unresponsive. The 
system recovers after a reset. The auditing function is critical to 
our site. While this is a test platform, we plan to install RHEL 3 
Update 2 on all of our platforms once it is officially released.

I will be sending a tar file with the following supporting data:

messages
audit.conf
filter.conf
filesets.conf
info         (file containing output of uname -a and rpm -qa)
pam.d/*      (pam files modified for LAuS)


Version-Release number of selected component (if applicable):
laus-0.1-48RHEL3

How reproducible:
Didn't try


Additional info:

Comment 1 Peggy Proffitt 2004-04-21 19:40:59 UTC
Created attachment 99610 [details]
tar file - content documented in Bug Report

Comment 2 Mark DeWandel 2004-04-22 13:59:11 UTC
Although you mention that the console is unresponsive, I was wondering
if this includes alt-sysrq as well.  If alt-sysrq-t works, the output
would be very helpful.  Also, I see that your audit.conf file specifies
file mode logging instead of bin logging.  Are you running out of disk
space on the partition containing the log?
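
A minimal sketch of the disk space check asked about above, assuming POSIX `df -P` output is piped in. The function name `full_filesystems` and the 90% default threshold are illustrative and not part of LAuS.

```shell
#!/bin/sh
# Read `df -P` output on stdin and print the mount point of every
# filesystem whose Use% is at or above the given threshold.
# The threshold (default 90) is an illustrative value, not a LAuS setting.
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {                      # skip the df header line
            pct = $5
            sub(/%/, "", pct)         # "45%" -> "45"
            if (pct + 0 >= limit + 0) print $6
        }'
}
```

Typical use would be `df -P | full_filesystems 90`, then checking whether the partition holding the audit log appears in the output. Mount points containing spaces are not handled by this sketch.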


Comment 3 Suzanne Hillman 2004-04-22 14:32:47 UTC
It would also be very useful to have a sense of what partitions you
are using for what - can you give the output from 'df', as well as
what your /etc/fstab looks like?

Comment 4 Peggy Proffitt 2004-04-22 16:09:38 UTC
I set kernel.sysrq to 1 late yesterday. The next time the system 
hangs I'll gather what I can. Will the following be sufficient?
1) Alt-SysRq-p   (3-4 times to log processor information)
2) Alt-SysRq-m   (several times to log memory layout information)
3) Alt-SysRq-t   (once to log stack trace information)

The other output requested:
df output:
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda2             65383224   5777908  56230464  10% /
/dev/sda1               101089     14950     80920  16% /boot
/dev/sdb2             68547668   5882512  62665156   9% /fads/fdl4
none                   1289092         0   1289092   0% /dev/shm
/dev/sda3              1976268     32812   1841444   2% /temp
ipsnfs2:/fads/admin   40960000  18418820  22541180  45% /fads/admin
fdl1:/fads/fdl1       68547668   1417968  67129700   3% /fads/fdl1
fdl2:/fads/fdl2       68547668   5100984  63446684   8% /fads/fdl2
fdl3:/fads/fdl3       68547668    691788  67855880   2% /fads/fdl3

/etc/fstab:
LABEL=/              /            ext3         defaults         1 1
LABEL=/boot          /boot        ext3         defaults         1 2
none                 /dev/pts     devpts       gid=5,mode=620   0 0
LABEL=/fads/fdl4     /fads/fdl4   ext3         defaults         1 2
none                 /proc        proc         defaults         0 0
none                 /dev/shm     tmpfs        defaults         0 0
LABEL=/temp          /temp        ext3         defaults         1 2
/dev/sdb1            swap         swap         defaults         0 0
/dev/sda5            swap         swap         defaults         0 0
/dev/cdrom           /mnt/cdrom   udf,iso9660  noauto,owner,rw  0 0
/dev/sdc             /mnt/floppy  auto         noauto,owner     0 0
/dev/cdrom1          /mnt/cdrom1  udf,iso9660  noauto,owner,ro  0 0
ipsnfs2:/fads/admin  /fads/admin  nfs          bg,soft,intr,rsize=16384,wsize=16384,timeo=20
fdl1:/fads/fdl1      /fads/fdl1   nfs          bg,soft,intr,rsize=16384,wsize=16384,timeo=20
fdl2:/fads/fdl2      /fads/fdl2   nfs          bg,soft,intr,rsize=16384,wsize=16384,timeo=20
fdl3:/fads/fdl3      /fads/fdl3   nfs          bg,soft,intr,rsize=16384,wsize=16384,timeo=20


Comment 5 Mark DeWandel 2004-04-22 16:26:27 UTC
Alt-SysRq-t is likely to be the most useful but certainly processor
and memory information may be helpful too.
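
Before relying on the Alt-SysRq sequences, it may be worth confirming that magic SysRq is actually enabled. The following is a minimal sketch; the function name `sysrq_enabled` is hypothetical, and the optional file argument exists only so the check can be exercised against a test file.

```shell
#!/bin/sh
# Succeed if magic SysRq is enabled, i.e. kernel.sysrq is non-zero.
# Reads the standard procfs location unless another file is given.
# Returns 2 if the file cannot be read.
sysrq_enabled() {
    value=$(cat "${1:-/proc/sys/kernel/sysrq}" 2>/dev/null) || return 2
    [ -n "$value" ] && [ "$value" != "0" ]
}
```

One possible use, run as root, is `sysrq_enabled || sysctl -w kernel.sysrq=1`. Adding `kernel.sysrq = 1` to /etc/sysctl.conf should keep the setting across reboots.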




Comment 6 Suzanne Hillman 2004-04-22 17:15:08 UTC
Also, any information on what the machine is being asked to do would
be helpful.

Comment 7 Suzanne Hillman 2004-04-22 18:17:06 UTC
Yet another question - what blade is being used, exactly? By this, I
mean, what are the numbers which are part of the name?

Comment 9 Peggy Proffitt 2004-04-22 18:43:06 UTC
The system is not being heavily used at the time of hangs. The system 
has only had a couple of users access it. The only thing that we've 
noticed is that there may be a connection to the use of vim, but if 
so it is not reliably repeatable. On the first hang that occurred 
after the upgrade, I was the only user on the platform and had just 
begun to edit a file when the system stopped responding.

The blade server is an IBM BladeCenter (86771XX) with 4 HS20 model 
blades (867861X). Each system has two 2.8 GHz Xeon processors. Three 
of the systems are loaded with RHEL 3 Update 1. They are being used 
to port software. The fourth system is currently reserved for testing 
the beta RHEL 3 Update 2 release - this is the one that is hanging.

Comment 12 Peggy Proffitt 2004-04-23 14:18:11 UTC
The system was hung again this morning. The last audit record was 
logged around 5:34. The system did not respond to the sysrq 
sequences. After recovering, I upgraded the kernel to 2.4.21-14.ELsmp 
and laus to laus-0.1-54RHEL3.

I have not created the file /etc/sysconfig/audit. Do you recommend 
settings other than the default?

Comment 13 Mark DeWandel 2004-04-23 18:38:10 UTC
Peggy, thanks for trying alt-sysrq-t.  It looks like we need to move
on to a bigger club unfortunately.  What I would suggest to try next
is to add "nmi_watchdog=1" (without the quotes) to the appropriate
kernel line in /boot/grub/grub.conf.  For example,

title Red Hat Enterprise Linux AS (2.4.21-15.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-15.ELsmp ro root=LABEL=/ nmi_watchdog=1
        initrd /initrd-2.4.21-15.ELsmp.img

The intended purpose of this is to provoke a panic via the NMI
watchdog timer which will hopefully yield some hint as to where we
are in the kernel.  Note that it will take 30 seconds after the hang
before panic output will appear on the console.
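
After rebooting with the new kernel line, a quick check that the parameter took effect might look like the sketch below. The function name `nmi_watchdog_requested` is hypothetical; it only inspects a kernel command line passed on stdin.

```shell
#!/bin/sh
# Succeed if the kernel command line read from stdin contains
# nmi_watchdog=<non-zero value>.
nmi_watchdog_requested() {
    # Put each boot parameter on its own line, then look for the option.
    tr ' ' '\n' | grep -Eq '^nmi_watchdog=[1-9]'
}
```

For example, `nmi_watchdog_requested < /proc/cmdline && grep NMI /proc/interrupts`. If the watchdog is active, the NMI counts in /proc/interrupts should keep increasing between successive reads.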

Comment 14 Peggy Proffitt 2004-04-26 21:22:34 UTC
I added the nmi_watchdog parameter to my kernel entry on Friday 4/23. 
The system has not hung since the latest kernel and laus rpms were 
installed earlier that day. I'm encouraging users to work on the 
server, but it's been a slow day. I'll update this as soon as 
something happens.

Comment 15 Peggy Proffitt 2004-04-28 20:38:42 UTC
Now that the system is staying up, we've begun to do some testing. 
When we run a regression test for some of our software with auditing 
turned off the test completes in approximately 4 minutes. When we run 
the same test with auditing enabled, the test takes well over an hour 
to complete. The system is spending a lot of time in iowait. The top 
command shows the process state of auditd as "D", which the man page for 
top defines as uninterruptible sleep. Is this a known problem? If 
not, should I open a bug report?
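
To see at a glance which processes are in the "D" state while the test runs, something like the following sketch could be used. It assumes `ps -eo pid,stat,comm` output on stdin; the function name `blocked_processes` is illustrative.

```shell
#!/bin/sh
# Read `ps -eo pid,stat,comm` output on stdin and print the PID and
# command name of every process whose state starts with D
# (uninterruptible sleep, usually waiting on I/O).
blocked_processes() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}
```

Running `ps -eo pid,stat,comm | blocked_processes` repeatedly during the regression test would show whether auditd is consistently the process blocked on I/O. Only the first word of a command name is printed.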

Comment 16 Peggy Proffitt 2004-04-29 13:19:11 UTC
The updated kernel and laus packages appear to have resolved the 
stability problem. I've opened bug 121970 to document the performance 
problem. Thanks for your help.

Comment 17 Suzanne Hillman 2004-04-29 13:35:54 UTC
Closing, as per comment #16.