Description of problem:
We use Fedora Core at my university for teaching. We are currently on FC6 and will move to FC9 soon. Our students are not very careful with the computers, and we often have to go to the computer rooms to fsck machines (most likely because of hard power-offs). Sometimes, though, a machine will not boot again at all. The message on the console is:

/etc/rc.d/rc.sysinit: line 821 : 163 Segmentation fault rm -f $afile/*

I think something is wrong in the way Linux handles some of the startup scripts: some files seem to be vulnerable for a certain period, and a hard shutdown during that period leaves the system unable to boot.

Version-Release number of selected component (if applicable):
Fedora Core 6, but certainly other versions too

How reproducible:
Sometimes

Steps to Reproduce:
1. Power off a system abruptly several times, at different moments
2. Restart and see what happens
3. Repeat until this behaviour occurs

Actual results:
Boot is stuck on a sysinit error

Expected results:
The system stops with a warning asking for fsck instead of being completely dead

Additional info:
This error is different. If you're at this point:
- you've already passed the filesystem check (which apparently succeeded)
- things are still corrupted enough that simple commands crash

Assigning to e2fsprogs.

Out of curiosity, what happens if you force a fsck of a system in this state?
I'd suggest modifying rc.sysinit so that the segfault leaves a corefile, if it doesn't already, and attach that corefile along with information about which binary (from which package, and which version) produced it. Are there any interesting kernel messages at the time of the segfault?
(In reply to comment #1)
Alas, I don't have any machine in this state at the moment. We did a round of fixes yesterday when a number of computers were in this state; we couldn't, of course, leave them like that, disrupting classes. I'll try it the next time we get this error (which could be very soon).

(In reply to comment #2)
No kernel messages. Of course I tried to see what happened, but I couldn't find the culprit. Can you tell me how to modify rc.sysinit so that the segfault leaves a corefile?
I'm sorry, I missed the last update. ulimit -c unlimited somewhere early in rc.sysinit should drop a core, as long as the fs is mounted read/write when the segfault happens, I think. Thanks, -Eric
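A minimal sketch of that change, for illustration only. The function name enable_core_dumps is made up here, and the fallback to the hard limit is an assumption for shells that are not allowed to raise it; in rc.sysinit, which runs as root, the plain `ulimit -c unlimited` suggested above should normally be enough.

```shell
# Sketch: allow core dumps for every command started after this point.
# enable_core_dumps is a hypothetical helper name, not part of rc.sysinit.
enable_core_dumps() {
    # Try the setting suggested in the comment above; if the shell may not
    # raise the limit, fall back to the highest value it is allowed (the hard limit).
    ulimit -c unlimited 2>/dev/null || ulimit -S -c "$(ulimit -H -c)"
}

enable_core_dumps
echo "core size limit: $(ulimit -S -c)"
```

With the default kernel core pattern the core file is written to the current directory of the crashing process, so that directory has to be on a filesystem that is mounted read/write when the segfault happens.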
I had the error again today. This time I tried the fsck as suggested in comment #1. I booted a Fedora disc in rescue mode and ran fsck. At first fsck told me the filesystem was clean and no check was needed, but then I ran 'fsck -f' and it found a handful of errors in each pass. After that the machine booted fine again. As you suggested, maybe something could be done at the e2fsprogs level to address this. Next time I'll try the ulimit change from comment #2 and comment #4 and see what it has to say.
... but no segfault, eh.

Some of this is slightly odd; if fsck by itself says the fs is clean, then it thinks it was cleanly unmounted, i.e. from a clean shutdown. Was this after a hard poweroff? fsck -f tells it to check regardless, and it found corruption; maybe from some other previous incident?

If you are cutting power, and if drive write caches are enabled and barriers are not on and enforcing, then you are at risk for corruption at power loss. I'm not terribly surprised or concerned that e2fsck finds corruption after power loss (unless you do have barriers enabled), but I am concerned about the segfault, and about any creeping corruption not related to power loss...

Also, saving the fsck output will generally be helpful.

Thanks,
-Eric
(In reply to comment #6)
> Some of this is slightly odd; if fsck by itself says the fs is clean, then it
> thinks it was cleanly unmounted, i.e. from a clean shutdown. Was this after a
> hard poweroff?

As I said in the report, the computers are in computer rooms and we're not standing behind every single student to see what they do. Most likely it's the result of a sudden power-off by some impatient person who wants to reboot quickly into Windows (our computers are dual-boot) without waiting for the shutdown sequence to complete.

> Also, saving the fsck output will generally be helpful.

Okay, I'll get the output next time.
Ok. At this point the only real bug I see here is the segfaulting fsck. Does this sound right to you? Thanks, -Eric
(In reply to comment #8)
> Ok. At this point the only real bug I see here is the segfaulting fsck.

Segfaulting fsck? I haven't seen anything like that. The segfault apparently happens on an 'rm' operation:

/etc/rc.d/rc.sysinit: line 821 : 163 Segmentation fault rm -f $afile/*

fsck, on the other hand, did not see any problem on the filesystem: I had to force the check with the '-f' parameter, and only then did it find errors. After the errors were corrected the system booted OK.
Oh, I'm sorry. I confused the two. So, the segfaulting rm is the only bug I see here ;) -Eric
(In reply to comment #10)
> Oh, I'm sorry. I confused the two. So, the segfaulting rm is the only bug I
> see here ;)

Yes, the system is stuck when rc.sysinit calls 'rm -f' (if that's really what happens and not a byproduct of a larger procedure). At this point the only thing to do is to shut it down and boot a rescue disc to fsck. My main point is that the filesystem is clearly not in a state that allows a correct boot, yet it is not detected as such and no fsck of the partition is started.
ok, I see. Sorry, I'm skimming too much.

So what we know:
- You have students who essentially pull the plug from time to time
- You have boxes which sometimes segfault on rm -f at boot time
- You often find corruption with manual e2fsck -f

So what I would like to know is:
- Does the rm -f segfault happen after the initscripts have done a fsck (or determined that a fsck is not necessary)? Which was it: was fsck skipped because the filesystem looked clean, or did fsck run?
- What sort of corruption does fsck find when you force it to run?

The two possibilities I see are that either fsck is not completely fixing the filesystem when it runs, or some other corruption is happening while the system is running, but a clean unmount means that fsck isn't run at boot time and the corruption isn't found until you manually e2fsck -f.

It would be interesting to see the fsck output in any case.

Thanks,
-Eric
Created attachment 308921: log of the 'fsck -f' command
Okay, here goes. Today I found another computer in this state, so I could use the procedure described above.

fsck /dev/hda* => no check, fsck says everything is okay
fsck -f /dev/hda* => multiple errors found

I attached the console output. After this 'fsck -f' the system boots OK again.
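For anyone repeating this procedure, a small sketch of how the forced check could be run so that the output is kept for the bug report. The function name forced_check, the FSCK_CMD override (useful for a dry run) and the device and log paths in the example are all hypothetical; only `e2fsck -f` itself comes from the thread.

```shell
# Sketch: force a full check and keep a copy of the output.
# FSCK_CMD is a hypothetical override for dry runs; the default is e2fsck.
forced_check() {
    dev=$1
    log=$2
    # -f forces the check even when the superblock is marked clean;
    # tee shows the output on the console and saves it to the log file.
    ${FSCK_CMD:-e2fsck} -f "$dev" 2>&1 | tee "$log"
}

# Example, from a rescue environment with the filesystem unmounted
# (device and log path are placeholders):
#   forced_check /dev/hda2 /tmp/fsck-hda2.log
```

In a rescue environment /tmp usually lives in memory, so the log would need to be copied somewhere persistent before rebooting.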
Fedora Core 6 and Fedora 7 are no longer maintained. Were you able to reproduce this bug in Fedora 9?
(In reply to comment #15)
I would like to... At our university we decided to move to FC9 for the new semester starting in September/October. Alas, because of Bug 442457 we got stuck very quickly while creating the kickstart. Our hardware is HP workstations with hard drives affected by that bug, so it's impossible to make a working install on them. Since a fix for the bug wasn't, and still isn't, available, our IT department decided to use FC8 instead. I'll report on how FC8 performs during the next semester.
Switching the bug to NEEDINFO, because we expect information from the reporter.
Pass 1: Checking inodes, blocks, and sizes
Error reading block 3310586 (Attempt to read block from filesystem resulted in short read) while doing inode scan. Ignore error? yes

That's worrisome. Do you get kernel messages too? This looks like an IO error.

Also, getting the core from the segfaulting command, along with the subsequent fsck -f output, might shed some light on things.

I'd also be curious to know if mounting with barriers enabled makes the problem go away. I'd certainly suggest it in this use pattern :)
(In reply to comment #18)
* No messages on the console other than the one in the bug report:
  /etc/rc.d/rc.sysinit: line 821 : 163 Segmentation fault rm -f $afile/*
* I don't know where the core file is supposed to be dumped (if it is dumped at all)
* "Mounting with barriers enabled"? What is that? Could you tell me more?
To mount ext3 with barriers: mount -o barrier=1 .....

What is the latest version of Fedora on which you have seen the original problem? It was originally filed against FC6; have you seen it since?

If you haven't seen it since FC6 I may need to close this, since FC6 is no longer maintained. If I do, and you see it on F8 or later, please re-open with new info ...
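As a rough way to see which ext3 filesystems are configured without that option, here is a sketch that scans a file in the /etc/fstab or /proc/mounts layout. The helper names has_barrier and list_ext3_without_barrier are made up for this example, it only matches the literal option barrier=1, and whether a given kernel actually lists that option in /proc/mounts is an assumption to verify on the machine in question.

```shell
# Sketch: list ext3 mount points whose options do not include barrier=1.
# Both function names are hypothetical helpers for this example.
has_barrier() {
    # Wrap the option list in commas so barrier=1 matches only as a whole option.
    case ",$1," in
        *,barrier=1,*) return 0 ;;
        *) return 1 ;;
    esac
}

list_ext3_without_barrier() {
    # Field layout: device, mount point, filesystem type, options, then the rest.
    while read -r dev mnt fstype opts rest; do
        [ "$fstype" = "ext3" ] || continue
        has_barrier "$opts" || echo "$mnt"
    done < "${1:-/proc/mounts}"
}

# Example: list_ext3_without_barrier /etc/fstab
```

Any mount point it prints is a candidate for adding barrier=1 to its options, as suggested above.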
If you see this again on a recent (supported) Fedora, F8 or later, please re-open.

Assuming for now that the problem is unique to FC6, so closing WONTFIX as FC6 is no longer supported.