Bug 171021
Summary: | Entire OS becomes unusable due to all mounted drives becoming read-only | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 4 | Reporter: | Chris <cnd> |
Component: | kernel | Assignee: | Larry Woodman <lwoodman> |
Status: | CLOSED WORKSFORME | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | 4.0 | CC: | jbaron, sct |
Hardware: | i686 | ||
OS: | Linux | ||
URL: | http://www.linuxquestions.org/questions/showthread.php?s=&postid=1770559 | ||
Doc Type: | Bug Fix | ||
Last Closed: | 2007-07-10 15:51:07 UTC | ||
Description
Chris
2005-10-17 13:22:24 UTC
FYI - upon reboot, my entire 300gig drive was totally hosed - fsck went nuts for 10 mins, forced a reboot, then I lost the lot.

IMHO1 - besides the fact I *should* have had 4gigs RAM spare, nothing should be allowed to exhaust kernel memory so much that it makes the kernel useless and ultimately destroys all data on all mounted disks.

IMHO2 - after something does screw up and memory gets exhausted, the kernel should *automatically* monitor the state of the system and resume normal operation when memory becomes free again, including re-mounting as RW whatever it turned into RO, in order to prevent catastrophic disk destruction after rebooting. (I did kill vmware, which should have freed things up, before I typed "reboot".)

Chris, I really don't know how to reproduce this internally. Can I ask you to reproduce this problem and get me "vmstat 1" output as well as several AltSysrq M, W and P outputs followed by one AltSysrq-T output?

Thanks, Larry Woodman

OK Larry - I'll have a go - it'll take me some time to reinstall the OS etc. though. I understand "vmstat 1" - but what's all that "AltSysrq" stuff? I presume it's something relating to hitting "Alt" and the "SysRq" button, probably on the console, and probably only in a GUI (X) - is this correct? (It did nothing in my VNC session, but I've got a DL360 so I can bring up a console on the iLO card without going in to the datacenter if that's the only way - assuming I can send an AltSysRq through to the iLO from my browser...) Do I have to do anything to enable the AltSysrq stuff?

1.) as root: "echo 1 > /proc/sys/kernel/sysrq"
2.) at the console keyboard, hold down the Alt and SysRq keys and press M, W, P and T
3.) the results are written to /var/log/messages

Larry

The error

kernel: journal_get_undo_access: No memory for committed data

indicates that the kernel is under serious memory pressure.
If the internal journaling state machine can't make progress as a result, then taking the journal offline and going read-only is the only action ext3 can take, but it's a defensive measure and not something that should cause any corruption. Indeed, I've got plenty of reports of kernel memory starvation causing ext3 to complain like this without any corruption.

So there may well be something else going on --- some other component of the kernel which is not reacting as gracefully to the memory starvation. (And it's low memory starvation that's happening in this case, so there's less than 1G of that to go around no matter how much physical RAM you have, unless you run the hugemem kernel.)

Full kernel logs (not just the single line of ext3 error) may help to point to the problem; a serial or network console can be invaluable in trapping that.

Thanks Stephen for that explanation (and Larry for those SysRq instructions).

Unfortunately - I've tried 3 times now and not been able to reproduce this problem; perhaps the actual /proc/sys/kernel/sysrq setting has an effect, or perhaps me running vmstat and periodically doing the AltSysRq stuff changed the conditions?

Double-unfortunately - after the install worked, I stopped logging stuff and created a new database inside my virtual machine, which ultimately locked up the host kernel completely. The problem seems related not so much to memory usage as to extreme disk usage (at least - that's my guess - during the Oracle install, it's just some Java apps copying files around).

I don't have time left to experiment (sorry - gotta get this machine live ASAP), so please accept my apologies for not managing to get more info for you.

No longer reproducible.
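
For anyone trying to capture the diagnostics requested above without a console keyboard (for example over VNC or a remote management card), here is a minimal sketch. It was not posted in this report; it is an illustration that assumes root access and a kernel that exposes /proc/sysrq-trigger, where writing a key letter has the same effect as pressing Alt+SysRq plus that key. The function names and the DRY_RUN switch are hypothetical. With DRY_RUN=1 (the default) nothing is written and the script only prints what it would do.

```shell
#!/bin/sh
# Sketch only: script the SysRq dumps requested in this report.
# Assumptions (not from the report): root access, /proc/sysrq-trigger present.
# DRY_RUN and all function names are illustrative.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = really run them

# Run a command line, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        sh -c "$*"
    fi
}

# Step 1 from the thread: enable SysRq.
enable_sysrq() {
    run "echo 1 > /proc/sys/kernel/sysrq"
}

# Stand-in for Alt+SysRq+<key>: write the lowercase key to the trigger file.
sysrq() {
    run "echo $1 > /proc/sysrq-trigger"
}

# Several M, W and P dumps (default 3 rounds), followed by one T dump.
collect_sysrq() {
    rounds="${1:-3}"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        for key in m w p; do
            sysrq "$key"
        done
        run "sleep 5"     # space the rounds out a little
        i=$((i + 1))
    done
    sysrq t
}
```

To use it for real, source the file as root, set DRY_RUN=0, then call enable_sysrq followed by collect_sysrq while "vmstat 1" runs in another terminal. The dumps go to the kernel log, which per step 3 above ends up in /var/log/messages.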