Bug 447909
Summary: | init scripts vulnerable to bad shutdown | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | stef <stephane.tranchemer>
Component: | e2fsprogs | Assignee: | Eric Sandeen <esandeen>
Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | medium | Priority: | low
Version: | 6 | CC: | kzak, mcepl, oliver, petrosyan, poelstra
Hardware: | All | OS: | Linux
Doc Type: | Bug Fix | Last Closed: | 2008-08-23 04:44:00 UTC
Description
stef
2008-05-22 12:43:21 UTC
This error is different - if you're at this point:
- you've already passed the filesystem check (which apparently succeeded)
- things are still corrupted enough that simple commands crash

Assigning to e2fsprogs.

Out of curiosity, what happens if you force a fsck of a system in this state?

I'd suggest modifying rc.sysinit so that the segfault leaves a corefile, if it doesn't already, and attaching that corefile along with information about which binary (from which package, and which version) produced it.

Are there any interesting kernel messages at the time of the segfault?

(In reply to comment #1)
Alas, I don't have any machine in this state at the moment. We made a fix-roll yesterday when we had a certain number of computers in this state; we couldn't, of course, leave machines in this state disrupting the classes. I'll try next time we get this error (it could be very soon).

(In reply to comment #2)
No kernel message... Of course I tried to see what happened but couldn't find the culprit. Can you tell me how to modify rc.sysinit so that the segfault leaves a corefile?

I'm sorry, I missed the last update.

ulimit -c unlimited

somewhere early in rc.sysinit should drop a core, as long as the fs is mounted read/write when the segfault happens, I think.

Thanks,
-Eric

I had the error again today. This time I tried the fsck suggested in comment #1: I booted a Fedora disc in rescue mode and tried to fsck. At first fsck told me the filesystem was clean and no check was needed, but then I issued a 'fsck -f', which found a handful of errors in each pass. After that the machine booted fine again. As you suggested, maybe something could be done at the e2fsprogs level to address this. Next time I'll try the ulimit suggestion from comment #2 and comment #4 and see what it has to say.

... but no segfault, eh.

Some of this is slightly odd; if fsck by itself says the fs is clean, then it thinks it was cleanly unmounted, i.e. from a clean shutdown. Was this after a hard poweroff? fsck -f tells it to check anyway regardless, and it found corruption; maybe from some other previous incident?

If you are cutting power, and if drive write caches are enabled and barriers are not on and enforcing, then you are at risk for corruption at power loss. I'm not terribly surprised or concerned that e2fsck finds corruption after power loss (unless you do have barriers enabled), but I am concerned about the segfault, and about any creeping corruption not related to power loss...

Also, saving the fsck output will generally be helpful.

Thanks,
-Eric

(In reply to comment #6)
> Some of this is slightly odd; if fsck by itself says the fs is clean, then it
> thinks it was cleanly unmounted, i.e. from a clean shutdown. Was this after a
> hard poweroff?

As I said in the report, the computers are in computer rooms and we're not behind every single student to see what he does... Most likely it's the result of a sudden power-off by some impatient person who wants to reboot quickly into Windows (our computers are dual-boot) without waiting for the shutdown sequence to complete...

> Also, saving the fsck output will generally be helpful.

Okay, I'll get the output next time.

Ok. At this point the only real bug I see here is the segfaulting fsck. Does this sound right to you?

Thanks,
-Eric

(In reply to comment #8)
> Ok. At this point the only real bug I see here is the segfaulting fsck.

Segfaulting fsck? I haven't seen anything like this; the segfault apparently happens on an 'rm' operation:

/etc/rc.d/rc.sysinit: line 821 : 163 Segmentation fault rm -f $afile/*

Instead, fsck was unable to see a problem on the filesystem: I had to force it with the '-f' parameter to check, and then it found errors. After the errors were corrected the system booted OK.

Oh, I'm sorry. I confused the two. So, the segfaulting rm is the only bug I see here ;)

-Eric

(In reply to comment #10)
> Oh, I'm sorry. I confused the two. So, the segfaulting rm is the only bug I
> see here ;)

Yes, the system is stuck when rc.sysinit calls 'rm -f' (if that's really what happens and not a byproduct of a larger procedure). At this point the only thing to do is to shut it down and reboot from a rescue disc to fsck.

The main argument here is that the filesystem is clearly not in a state that allows a correct boot, but it is not detected as such and the fsck of the partition is not started.

ok, I see. Sorry, I'm skimming too much.

So what we know:
- You have students who essentially pull the plug from time to time
- You have boxes which sometimes segfault on rm -f at boot time
- You often find corruption with manual e2fsck -f

So what I would like to know is:
- Does the rm -f segfault happen after the initscripts have done a fsck (or determined that a fsck is not necessary)? Which was it: was fsck skipped because the filesystem looked clean, or did fsck run?
- What sort of corruption does fsck find when you force it to run?

The two possibilities I see are that either fsck is not completely fixing the filesystem when it runs, or some other corruption is happening while the system is running, but a clean unmount means that fsck isn't run at boot time and the corruption isn't found until you manually e2fsck -f.

It would be interesting to see the fsck output in any case.

Thanks,
-Eric

Created attachment 308921
log of the 'fsck -f' command
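
For anyone repeating the forced check discussed above, here is a minimal sketch of capturing its output and exit status so both can be attached to the report. It assumes e2fsck is run from rescue media on an unmounted partition; the function names, the device and the log path are illustrative and do not come from this report. The -n flag keeps the pass read-only, so it only reports problems (the reporter ran 'fsck -f' interactively and let it repair). The exit-status bits are the ones documented in fsck(8).

```shell
#!/bin/sh
# Decode an fsck/e2fsck exit status into text.
# The status is a bit mask; the meanings below are from fsck(8).
describe_fsck_status() {
    code=$1
    if [ "$code" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    out=""
    for pair in \
        "1:errors corrected" \
        "2:reboot recommended" \
        "4:errors left uncorrected" \
        "8:operational error" \
        "16:usage or syntax error" \
        "32:canceled by user" \
        "128:shared library error"
    do
        bit=${pair%%:*}     # number before the first colon
        text=${pair#*:}     # description after the first colon
        if [ $(( $code & $bit )) -ne 0 ]; then
            out="${out:+$out, }$text"
        fi
    done
    echo "${out:-unknown status $code}"
}

# Hypothetical helper: forced, read-only check of one partition.
# Saves the console output and a decoded exit status to a log file.
forced_check() {
    dev=$1
    log=$2
    e2fsck -f -n "$dev" >"$log" 2>&1
    rc=$?
    echo "exit status $rc: $(describe_fsck_status "$rc")" >>"$log"
    return "$rc"
}
```

From a rescue shell this would be called as, for example, forced_check /dev/hda2 /tmp/fsck-hda2.log, where both arguments are placeholders. Nothing runs until a function is called, so the file can be sourced safely.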
Okay, here it goes. I found another computer in that state today, so I could use the procedure described above.

fsck /dev/hda* => no check, fsck says everything is okay
fsck -f /dev/hda* => multiple errors found

I attached the console output. After this 'fsck -f' the system boots OK again.

Fedora Core 6 and Fedora 7 are no longer maintained. Were you able to reproduce this bug in Fedora 9?

(In reply to comment #15)
I would like to... At our uni we decided to move to FC9 for the new semester starting in September/October; alas, due to Bug 442457 we got stuck very quickly in the kickstart creation process. Our hardware is HP workstations with hard drives affected by that bug, so it's impossible to make a working install on them. Since a fix for that bug wasn't, and still isn't, available, our IT department decided to use FC8 instead. I'll report on how FC8 performs during the next semester.

Switching the bug to NEEDINFO, because we expect information from the reporter.

Pass 1: Checking inodes, blocks, and sizes
Error reading block 3310586 (Attempt to read block from filesystem resulted in short read) while doing inode scan. Ignore error? yes

that's worrisome. Do you get kernel messages too? This looks like an IO error.

Also, getting the core from the segfaulting command, along with the subsequent fsck -f output, might shed some light on things.

I'd also be curious to know if mounting with barriers enabled makes the problem go away. I'd certainly suggest it in this use pattern :)

(In reply to comment #18)
* No more messages on the console than what is in the bug report:
/etc/rc.d/rc.sysinit: line 821 : 163 Segmentation fault rm -f $afile/*
* I don't know where the core file is supposed to be dumped (if it is)
* "mounting with barriers enabled"? What is that? Could you tell me more?

to mount ext3 with barriers, mount -o barrier=1 .....

What is the latest version of Fedora that you have seen the original problem on? It was originally filed against FC6; have you seen it since? If you haven't seen it since FC6 I may need to close, since FC6 is no longer maintained. If I do, and if you see it on F8 or later, please re-open with new info ...

If you see this again on recent (supported) Fedora - F8 or later, please re-open. Assuming the problem is unique to FC6 (for now...) so closing WONTFIX as F6 is no longer supported.
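
Two suggestions from the thread were not reported as tried before the bug was closed: enabling core dumps early in rc.sysinit, and mounting ext3 with barrier=1. Below is a minimal sketch of both for anyone re-opening with new info, assuming a POSIX shell. The function names and the sample fstab line are illustrative and are not taken from rc.sysinit or from the reporter's systems. On the question of where the core lands: the kernel names the file according to /proc/sys/kernel/core_pattern, and a relative pattern such as the default "core" is resolved against the crashing process's working directory, which must be writable.

```shell
#!/bin/sh
# Lift the core-size limit for this shell and every command it starts.
# In rc.sysinit this would go near the top, as suggested in the thread.
enable_core_dumps() {
    ulimit -c unlimited 2>/dev/null
}

# Print the kernel's core file naming pattern.
# Falls back to the traditional default name if /proc is unavailable.
core_pattern() {
    if [ -r /proc/sys/kernel/core_pattern ]; then
        cat /proc/sys/kernel/core_pattern
    else
        echo core
    fi
}

# Return success if a comma-separated mount option string
# (fourth field of /etc/fstab) contains barrier=1.
has_barrier_option() {
    case ",$1," in
        *,barrier=1,*) return 0 ;;
        *)             return 1 ;;
    esac
}

# Sample /etc/fstab line with barriers enabled.
# Device and mount point are placeholders:
#   /dev/hda2  /  ext3  defaults,barrier=1  1 1
```

has_barrier_option only inspects the string it is given; it does not verify that the drive and kernel actually honour barriers, which would still need checking in the kernel messages after mounting.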