Bug 171021

Summary: Entire OS becomes unusable due to all mounted drives becoming read-only
Product: Red Hat Enterprise Linux 4 Reporter: Chris <cnd>
Component: kernel Assignee: Larry Woodman <lwoodman>
Status: CLOSED WORKSFORME QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: 4.0 CC: jbaron, sct
Target Milestone: ---   
Target Release: ---   
Hardware: i686   
OS: Linux   
URL: http://www.linuxquestions.org/questions/showthread.php?s=&postid=1770559
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2007-07-10 15:51:07 UTC Type: ---

Description Chris 2005-10-17 13:22:24 UTC
From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)

Description of problem:
Many other people and I can "crash" the kernel by running the VMware application, as described in this forum thread:

http://www.linuxquestions.org/questions/showthread.php?s=&postid=1770559

In my case, I power up a virtual PC with 2 GB of RAM (my host has 6 GB) and try to install Oracle 10g in the host PC (also ES4).

Version-Release number of selected component (if applicable):
Linux localhost 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT 2005 i686 i686 i386 GNU/Linux

How reproducible:
Always

Steps to Reproduce:
1. power up a virtual PC (ES4) with 2gigs RAM
2. install Oracle 10g
  

Actual Results:  kernel: journal_get_undo_access: No memory for committed data

(sda becomes read-only, OS becomes useless)

Expected Results:  No error

Additional info:

this error could probably be used as a nasty DoS attack

Comment 1 Chris 2005-10-17 13:55:06 UTC
FYI - upon reboot, my entire 300gig drive was totally hosed - fsck went nuts 
for 10 mins, forced a reboot, then I lost the lot.

IMHO1 - besides the fact I *should* have had 4gigs RAM spare, nothing should 
be allowed to exhaust kernel memory so much that it makes the kernel useless 
and ultimately destroys all data on all mounted disks.

IMHO2 - after something does screw up and memory gets exhausted, the kernel 
should *automatically* monitor the state of the system and resume normal 
operation when memory becomes free again, including re-mounting as RW whatever 
it turned into RO in order to prevent catastrophic disk destruction after 
rebooting. (I did kill vmware which should have freed things up before I 
typed "reboot")

Comment 2 Larry Woodman 2005-10-17 15:27:37 UTC
Chris, I really don't know how to reproduce this internally.  Can I ask you to
reproduce this problem and get me "vmstat 1" output as well as several
AltSysrq M, W and P outputs followed by one AltSysrq-T output?

Thanks, Larry Woodman
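
Editorial sketch: one way to capture the requested "vmstat 1" output so that it can later be lined up against the SysRq dumps in the system log is to timestamp every sample as it arrives. The Python below is only an illustration under stated assumptions: the helper names `stamp` and `capture_vmstat` and the output file are hypothetical, and it assumes `vmstat` is on the PATH.

```python
import subprocess
import time


def stamp(line, now=None):
    """Prefix one line of vmstat output with a wall-clock timestamp."""
    now = time.time() if now is None else now
    prefix = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    return f"{prefix} {line.rstrip()}"


def capture_vmstat(out_path, interval=1):
    """Run `vmstat <interval>` until interrupted, appending stamped lines.

    out_path is a hypothetical log file chosen by the caller.
    """
    proc = subprocess.Popen(["vmstat", str(interval)],
                            stdout=subprocess.PIPE, text=True)
    try:
        with open(out_path, "a") as out:
            for line in proc.stdout:
                out.write(stamp(line) + "\n")
                out.flush()  # keep the file current in case the host locks up
    except KeyboardInterrupt:
        pass  # Ctrl-C ends the capture
    finally:
        proc.terminate()
```

Calling `capture_vmstat("vmstat.log")` before starting the virtual machine would leave a timestamped record; writing it to a disk other than the one that goes read-only is probably wise.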


Comment 3 Chris 2005-10-17 16:15:16 UTC
OK Larry - I'll have a go - it'll take me some time to reinstall the OS etc 
tho.

I understand "vmstat 1" - but what's all that "AltSysrq" stuff?  I presume 
it's something relating to hitting "Alt" and the "SysRq" button, probably on 
the console, and probably only in a GUI (X) - is this correct?  (It did 
nothing in my vnc session, but I've got a DL360 so I can bring up a console on 
the iLo card without going in to the datacenter if that's the only way - 
assuming I can send an AltSysRq through to the iLo from my browser...)
Do I have to do anything to enable the AltSysrq stuff?


Comment 4 Larry Woodman 2005-10-17 17:51:21 UTC
1.) as root "echo 1 > /proc/sys/kernel/sysrq"

2.) at the console keyboard hold down the Alt and SysRq keys and press M, W, P and T in turn

3.) the results are written to /var/log/messages


Larry
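
Editorial sketch: for a remote session with no physical keyboard (the VNC/iLO situation raised in Comment 3), the same dumps can usually be requested by writing the key letters to /proc/sysrq-trigger as root. The Python below is a minimal illustration, assuming the running kernel provides that file; the function name `request_dumps` and its parameters are hypothetical, and the paths are arguments so the logic can be tried against ordinary files first.

```python
import time

SYSRQ_ENABLE = "/proc/sys/kernel/sysrq"
SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # assumed to exist on the target kernel


def request_dumps(keys="mwpt", enable_path=SYSRQ_ENABLE,
                  trigger_path=SYSRQ_TRIGGER, pause=0.0):
    """Mirror steps 1 and 2 above: enable SysRq, then send each key letter."""
    with open(enable_path, "w") as f:
        f.write("1\n")  # same effect as: echo 1 > /proc/sys/kernel/sysrq
    sent = []
    for key in keys:
        # One letter per write; each write requests one dump.
        with open(trigger_path, "w") as f:
            f.write(key)
        sent.append(key)
        time.sleep(pause)  # optional gap so the dumps do not interleave
    return sent
```

Run as root, `request_dumps(pause=2.0)` would ask for the M, W, P and T dumps in order; as in step 3, the output should then appear in /var/log/messages.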


Comment 5 Stephen Tweedie 2005-10-17 18:42:35 UTC
The error

kernel: journal_get_undo_access: No memory for committed data

indicates that the kernel is under serious memory pressure.  If the internal
journaling state machine can't make progress as a result then taking the journal
offline and going readonly is the only action ext3 can take, but it's a
defensive measure and not something that should cause any corruption.  Indeed,
I've got plenty of reports of kernel memory starvation causing ext3 to complain
like this without any corruption.

So there may well be something else going on --- some other component of the
kernel which is not reacting as gracefully to the memory starvation.  (And it's
low memory starvation that's happening in this case, so there's less than 1G of
that to go around no matter how much physical ram you have, unless you run the
hugemem kernel.)

Full kernel logs (not just the single line of ext3 error) may help to point to
the problem; serial or network console can be invaluable in trapping that.
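
Editorial sketch: the low-memory starvation described above can be watched from userspace by reading the LowTotal and LowFree fields of /proc/meminfo. The Python below is a rough illustration, assuming a 32-bit kernel with highmem support (other kernels may not report these fields at all); the function names `lowmem_status` and `read_lowmem` are hypothetical.

```python
def lowmem_status(meminfo_text):
    """Return (low_free_kb, low_total_kb) parsed from /proc/meminfo text.

    Either value is None when the kernel does not report that field.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        # Keep only "Name: <number> ..." lines; skip anything else.
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields.get("LowFree"), fields.get("LowTotal")


def read_lowmem(path="/proc/meminfo"):
    """Read the live values from the running system."""
    with open(path) as f:
        return lowmem_status(f.read())
```

Logging `read_lowmem()` once a second alongside vmstat would show whether LowFree collapses shortly before the ext3 error, independent of how much total RAM is free.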

Comment 6 Chris 2005-10-18 00:26:39 UTC
Thanks Stephen for that explanation (and Larry for those SysRq instructions).

Unfortunately - I've tried 3 times now and not been able to reproduce this 
problem; perhaps the actual /proc/sys/kernel/sysrq setting has an effect, or 
perhaps me running vmstat and periodically doing the AltSysRq stuff changed 
the conditions?

Double-unfortunately - after the install worked, I stopped logging stuff and 
created a new database inside my virtual machine, which ultimately locked up 
the host kernel completely.  The problem seems related not so much to memory 
usage, as to extreme disk usage (at least - that's my guess - during the 
oracle install, it's just some Java apps copying files around).

I don't have time left to experiment (sorry - gotta get this machine live 
ASAP) so please accept my apologies for not managing to get more info for you.

Comment 7 Larry Woodman 2007-07-10 15:51:07 UTC
No longer reproducible.