Bug 121459
Summary: | System running LAuS hanging regularly
---|---
Product: | Red Hat Enterprise Linux 3
Component: | laus
Version: | 3.0
Hardware: | i686
OS: | Linux
Status: | CLOSED CURRENTRELEASE
Severity: | high
Priority: | medium
Reporter: | Peggy Proffitt <peggy.proffitt>
Assignee: | Charlie Bennett <ccb>
QA Contact: | Jay Turner <jturner>
CC: | fenlason, laroche, mdewand, shillman, srevivo
Doc Type: | Bug Fix
Last Closed: | 2004-04-29 13:35:54 UTC
Bug Blocks: | 119235
Description
Peggy Proffitt
2004-04-21 19:36:06 UTC
Created attachment 99610: tar file - content documented in Bug Report
Although you mention that the console is unresponsive, I was wondering if this includes alt-sysrq as well. If alt-sysrq-t works, the output would be very helpful.

Also, I see that your audit.conf file specifies file mode logging instead of bin logging. Are you running out of disk space on the partition containing the log? It would also be very useful to have a sense of what partitions you are using for what - can you give the output from 'df', as well as what your /etc/fstab looks like?

I set kernel.sysrq to 1 late yesterday. The next time the system hangs I'll gather what I can. Will the following be sufficient?

1) Alt-SysRq-p (3-4 times to log processor information)
2) Alt-SysRq-m (several times to log memory layout information)
3) Alt-SysRq-t (once to log stack trace information)

The other output requested:

df output:

    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/sda2             65383224   5777908  56230464  10% /
    /dev/sda1               101089     14950     80920  16% /boot
    /dev/sdb2             68547668   5882512  62665156   9% /fads/fdl4
    none                   1289092         0   1289092   0% /dev/shm
    /dev/sda3              1976268     32812   1841444   2% /temp
    ipsnfs2:/fads/admin   40960000  18418820  22541180  45% /fads/admin
    fdl1:/fads/fdl1       68547668   1417968  67129700   3% /fads/fdl1
    fdl2:/fads/fdl2       68547668   5100984  63446684   8% /fads/fdl2
    fdl3:/fads/fdl3       68547668    691788  67855880   2% /fads/fdl3

/etc/fstab:

    LABEL=/              /            ext3         defaults            1 1
    LABEL=/boot          /boot        ext3         defaults            1 2
    none                 /dev/pts     devpts       gid=5,mode=620      0 0
    LABEL=/fads/fdl4     /fads/fdl4   ext3         defaults            1 2
    none                 /proc        proc         defaults            0 0
    none                 /dev/shm     tmpfs        defaults            0 0
    LABEL=/temp          /temp        ext3         defaults            1 2
    /dev/sdb1            swap         swap         defaults            0 0
    /dev/sda5            swap         swap         defaults            0 0
    /dev/cdrom           /mnt/cdrom   udf,iso9660  noauto,owner,rw     0 0
    /dev/sdc             /mnt/floppy  auto         noauto,owner        0 0
    /dev/cdrom1          /mnt/cdrom1  udf,iso9660  noauto,owner,ro     0 0
    ipsnfs2:/fads/admin  /fads/admin  nfs          bg,soft,intr,rsize=16384,wsize=16384,timeo=20
    fdl1:/fads/fdl1      /fads/fdl1   nfs          bg,soft,intr,rsize=16384,wsize=16384,timeo=20
    fdl2:/fads/fdl2      /fads/fdl2   nfs          bg,soft,intr,rsize=16384,wsize=16384,timeo=20
    fdl3:/fads/fdl3      /fads/fdl3   nfs          bg,soft,intr,rsize=16384,wsize=16384,timeo=20

Alt-SysRq-t is likely to be the most useful, but processor and memory information may certainly be helpful too. Also, any information on what the machine is being asked to do would be helpful.

Yet another question - what blade is being used, exactly? By this, I mean, what are the numbers which are part of the name?

The system is not being heavily used at the time of the hangs. Only a couple of users have accessed it. The only thing we've noticed is that there may be a connection to the use of vim, but if so it is not reliably repeatable. On the first hang that occurred after the upgrade, I was the only user on the platform and had just begun to edit a file when the system stopped responding.

The blade server is an IBM BladeCenter (86771XX) with 4 HS20 model blades (867861X). Each system has two 2.8 GHz Xeon processors. Three of the systems are loaded with RHEL 3 Update 1. They are being used to port software. The fourth system is currently reserved for testing the beta RHEL 3 Update 2 release - this is the one that is hanging.

The system was hung again this morning. The last audit record was logged around 5:34. The system did not respond to the sysrq sequences. After recovering, I upgraded the kernel to 2.4.21-14.ELsmp and laus to laus-0.1-54RHEL3. I have not created the file /etc/sysconfig/audit. Do you recommend settings other than the default?

Peggy, thanks for trying alt-sysrq-t. It looks like we need to move on to a bigger club, unfortunately. What I would suggest trying next is to add "nmi_watchdog=1" (without the quotes) to the appropriate kernel line in /boot/grub/grub.conf.
For example:

    title Red Hat Enterprise Linux AS (2.4.21-15.ELsmp)
        root (hd0,0)
        kernel /vmlinuz-2.4.21-15.ELsmp ro root=LABEL=/ nmi_watchdog=1
        initrd /initrd-2.4.21-15.ELsmp.img

The intended purpose of this is to provoke a panic via the NMI watchdog timer, which will hopefully yield some hint as to where we are in the kernel. Note that it will take 30 seconds after the hang before panic output appears on the console.

I added the nmi_watchdog parameter to my kernel entry on Friday 4/23. The system has not hung since the latest kernel and laus rpms were installed earlier that day. I'm encouraging users to work on the server, but it's been a slow day. I'll update this as soon as something happens.

Now that the system is staying up, we've begun to do some testing. When we run a regression test for some of our software with auditing turned off, the test completes in approximately 4 minutes. When we run the same test with auditing enabled, the test takes well over an hour to complete. The system is spending a lot of time in iowait. The top command shows the state of auditd as "D", which the man page for top defines as uninterruptible sleep. Is this a known problem? If not, should I open a bug report?

The updated kernel and laus packages appear to have resolved the stability problem. I've opened bug 121970 to document the performance problem. Thanks for your help.

Closing, as per comment #16.
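
As a footnote to the iowait observation above (top showing auditd in state "D"), the same check can be scripted rather than read off the top display. The following is only a minimal sketch, assuming a Linux /proc filesystem and Python 3, neither of which comes from this report; the function names parse_stat_line and uninterruptible_tasks are hypothetical helpers invented for illustration.

```python
import os


def parse_stat_line(line):
    """Return (pid, comm, state) parsed from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split on the first '(' and the last ')'.
    left, _, rest = line.partition("(")
    comm, _, tail = rest.rpartition(")")
    return int(left.strip()), comm, tail.split()[0]


def uninterruptible_tasks(proc_root="/proc"):
    """List (pid, comm) for every task currently in state 'D'."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # no procfs available (not Linux, or not mounted)
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the line was not parseable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Run repeatedly during a slow test, a process such as auditd that appears in most samples is spending most of its time blocked on I/O, which would match the iowait figure reported above.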