From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)
Description of problem:
Me and many other people can "crash" the kernel by running the vmware application, as per this internet posting:
http://www.linuxquestions.org/questions/showthread.php?s=&postid=1770559
In my case, I power up a virtual PC with 2gigs RAM (my host has 6gigs), and I try to install Oracle 10g in the host PC (also ES4)
Version-Release number of selected component (if applicable):
Linux localhost 2.6.9-22.ELsmp #1 SMP Mon Sep 19 18:32:14 EDT 2005 i686 i686 i386 GNU/Linux
How reproducible:
Always
Steps to Reproduce:
1. power up a virtual PC (ES4) with 2gigs RAM
2. install Oracle 10g
Actual Results:
kernel: journal_get_undo_access: No memory for committed data
(sda becomes read-only, OS becomes useless)
Expected Results:
No error
Additional info:
this error could probably be used as a nasty DoS attack
FYI - upon reboot, my entire 300gig drive was totally hosed - fsck went nuts for 10 mins, forced a reboot, then I lost the lot. IMHO1 - besides the fact I *should* have had 4gigs RAM spare, nothing should be allowed to exhaust kernel memory so much that it makes the kernel useless and ultimately destroys all data on all mounted disks. IMHO2 - after something does screw up and memory gets exhausted, the kernel should *automatically* monitor the state of the system and resume normal operation when memory becomes free again, including re-mounting as RW whatever it turned into RO in order to prevent catastrophic disk destruction after rebooting. (I did kill vmware, which should have freed things up, before I typed "reboot")
Chris, I really don't know how to reproduce this internally. Can I ask you to reproduce this problem and get me "vmstat 1" output as well as several AltSysrq M, W and P outputs followed by one AltSysrq-T output? Thanks, Larry Woodman
OK Larry - I'll have a go - it'll take me some time to reinstall the OS etc tho. I understand "vmstat 1" - but what's all that "AltSysrq" stuff? I presume it's something relating to hitting "Alt" and the "SysRq" button, probably on the console, and probably only in a GUI (X) - is this correct? (It did nothing in my vnc session, but I've got a DL360 so I can bring up a console on the iLo card without going in to the datacenter if that's the only way - assuming I can send an AltSysRq through to the iLo from my browser...) Do I have to do anything to enable the AltSysrq stuff?
1.) as root "echo 1 > /proc/sys/kernel/sysrq"
2.) at the console keyboard hold down the Alt and SysRq keys and press M W P and T
3.) the results are written to /var/log/messages
Larry
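Where no console keyboard is to hand (the VNC case mentioned above), the same key sequence can possibly be requested from a root shell instead. The sketch below is not part of Larry's instructions: it assumes the kernel exposes /proc/sysrq-trigger and that writing a key letter there has the same effect as the Alt-SysRq chord. The helper names capture_plan and send_sysrq are made up for illustration.

```python
"""Sketch: request SysRq dumps without a console keyboard.

Assumption (not from this thread): the kernel exposes /proc/sysrq-trigger,
and writing a single key letter to it as root behaves like pressing
Alt-SysRq plus that key.
"""
import time

SYSRQ_TRIGGER = "/proc/sysrq-trigger"  # assumed path, see note above


def capture_plan(rounds=3):
    """Return the key sequence: several rounds of m, w, p, then one t."""
    return ["m", "w", "p"] * rounds + ["t"]


def send_sysrq(keys, trigger_path=SYSRQ_TRIGGER, delay=0.0):
    """Write each key letter to the trigger file, one write per key."""
    for key in keys:
        with open(trigger_path, "w") as trigger:
            trigger.write(key)
        time.sleep(delay)  # optional pause between dumps
```

Run as root on the affected host, send_sysrq(capture_plan(), delay=1.0) would ask for three rounds of M, W and P followed by one T, with "vmstat 1" left running in another terminal. The dumps should land in the kernel log like the keyboard-triggered ones.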
The error "kernel: journal_get_undo_access: No memory for committed data" indicates that the kernel is under serious memory pressure. If the internal journaling state machine can't make progress as a result, then taking the journal offline and going read-only is the only action ext3 can take, but it's a defensive measure and not something that should cause any corruption. Indeed, I've got plenty of reports of kernel memory starvation causing ext3 to complain like this without any corruption. So there may well be something else going on: some other component of the kernel which is not reacting as gracefully to the memory starvation. (And it's low memory starvation that's happening in this case, so there's less than 1G of that to go around no matter how much physical RAM you have, unless you run the hugemem kernel.) Full kernel logs (not just the single line of ext3 error) may help to point to the problem; serial or network console can be invaluable in trapping that.
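To watch low memory specifically while reproducing, something like the following sketch could be used. It assumes that a 32-bit highmem kernel reports LowTotal and LowFree lines (in kB) in /proc/meminfo; that assumption, the helper names and the sample figures are illustrative and do not come from the reporter's machine.

```python
"""Sketch: report low-memory headroom from /proc/meminfo text.

Assumption (not from this thread): on a 32-bit highmem kernel,
/proc/meminfo carries LowTotal and LowFree lines in kB.
"""


def parse_meminfo(text):
    """Map each 'Name: value kB' line to an integer number of kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def lowmem_free_percent(fields):
    """Percentage of low memory still free, or None if not reported."""
    total = fields.get("LowTotal")
    free = fields.get("LowFree")
    if not total or free is None:
        return None
    return 100.0 * free / total


# Made-up sample values for illustration only.
SAMPLE = """MemTotal:      6226000 kB
LowTotal:       880000 kB
LowFree:         22000 kB
"""
```

On a live host the input would be the contents of /proc/meminfo rather than SAMPLE. A free percentage that collapses while MemFree stays large would be consistent with the low-memory starvation described above, though it would not by itself identify the component at fault.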
Thanks Stephen for that explanation (and Larry for those SysRq instructions). Unfortunately - I've tried 3 times now and not been able to reproduce this problem; perhaps the actual /proc/sys/kernel/sysrq setting has an effect, or perhaps my running vmstat and periodically doing the AltSysRq stuff changed the conditions? Double-unfortunately - after the install worked, I stopped logging stuff and created a new database inside my virtual machine, which ultimately locked up the host kernel completely. The problem seems related not so much to memory usage as to extreme disk usage (at least - that's my guess - during the Oracle install, it's just some Java apps copying files around). I don't have time left to experiment (sorry - gotta get this machine live ASAP) so please accept my apologies for not managing to get more info for you.
No longer reproducible.