Bug 447909

Summary: init scripts vulnerable to bad shutdown
Product: [Fedora] Fedora
Reporter: stef <stephane.tranchemer>
Component: e2fsprogs
Assignee: Eric Sandeen <esandeen>
Status: CLOSED WONTFIX
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: low
Version: 6
CC: kzak, mcepl, oliver, petrosyan, poelstra
Hardware: All   
OS: Linux   
Doc Type: Bug Fix
Last Closed: 2008-08-23 04:44:00 UTC
Attachments:
Description Flags
log of the 'fsck -f' command none

Description stef 2008-05-22 12:43:21 UTC
Description of problem:
We are using Fedora Core distros at my university for teaching purposes. Currently we
are on FC6, moving to FC9 soon.
Our students aren't very careful about what they do with our computers, and
we very often have to go into the computer rooms to fsck the machines (most likely due
to violent poweroffs), but sometimes a computer won't start again at all...

The message on the console is :

/etc/rc.d/rc.sysinit: line 821 :  163 Segmentation fault   rm -f $ afile/*


I think something is wrong here in the way Linux handles some startup scripts:
some files seem vulnerable for a certain period of time, making the system unable to
start again if a violent shutdown happens during that period.

Version-Release number of selected component (if applicable):
Fedora Core 6 but certainly other versions

How reproducible:
sometimes

Steps to Reproduce:
1. power off a system violently several times, at different moments
2. restart and see what happens
3. repeat until this behaviour happens
  
Actual results:

stuck on a sysinit error

Expected results:

system stops on a warning asking for fsck instead of being stone dead

Additional info:

Comment 1 Bill Nottingham 2008-05-22 15:57:59 UTC
This error is different - if you're at this point:
- you've already passed the filesystem check (which apparently succeeded)
- things are still corrupted enough that simple commands crash

Assigning to e2fsprogs. Out of curiosity, what happens if you force a fsck of a
system in this state?

Comment 2 Eric Sandeen 2008-05-22 19:35:02 UTC
I'd suggest modifying rc.sysinit so that the segfault leaves a corefile, if it
doesn't already, and attach that corefile along with information about which
binary (from which package, and which version) produced it.

Are there any interesting kernel messages at the time of the segfault?

Comment 3 stef 2008-05-23 08:03:02 UTC
(In reply to comment #1)
Alas I don't have any machine in this state at the moment; we did a round of fixes
yesterday when we had a certain number of computers in this state... we
couldn't, of course, leave machines in this state disrupting the classes.
I'll try next time we get this error (it can be very soon).

(In reply to comment #2)
No kernel message... Of course I tried to see what happened but couldn't find
the culprit.
Can you tell me how to modify rc.sysinit so that the segfault leaves a corefile?

Comment 4 Eric Sandeen 2008-06-03 04:51:22 UTC
I'm sorry, I missed the last update.

ulimit -c unlimited

somewhere early in rc.sysinit should drop a core, as long as the fs is mounted
read/write when the segfault happens, I think.

Thanks,
-Eric
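
A minimal sketch of that suggestion, assuming a POSIX shell. The function name enable_core_dumps is hypothetical and is not part of the stock rc.sysinit; the idea is only to raise the core-size limit early and report whether it worked. By default the kernel writes the core file into the crashing process's current working directory, unless /proc/sys/kernel/core_pattern says otherwise.

```shell
#!/bin/sh
# Hypothetical helper, not part of the stock rc.sysinit.
enable_core_dumps() {
    # ulimit is a shell builtin: the new limit applies to this shell
    # and to every command it starts afterwards (such as rm).
    if ulimit -c unlimited 2>/dev/null; then
        echo "core limit: $(ulimit -c)"
    else
        # Raising the limit can fail if the hard limit is lower.
        echo "core limit unchanged: $(ulimit -c)"
    fi
}

enable_core_dumps
```

The call would have to sit before the failing rm line, and the core can only be written once the filesystem holding the working directory is mounted read/write.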

Comment 5 stef 2008-06-10 08:16:18 UTC
I had the error again today.
This time I tried the fsck thing as suggested in comment #1

I booted a Fedora disc in rescue mode and tried to fsck.
At first fsck told me the filesystem was clean and no check was needed, but
then I issued a 'fsck -f' and actually found a handful of errors in each pass
of the fsck process.

After that the machine did boot fine again.

As you suggested, something could maybe be done at the e2fsprogs level to
address this.

Next time I'll try the ulimit thing from comment #2 and comment #4 and see what it
has to say.


Comment 6 Eric Sandeen 2008-06-10 13:29:41 UTC
... but no segfault, eh.

Some of this is slightly odd; if fsck by itself says the fs is clean, then it
thinks it was cleanly unmounted, i.e. from a clean shutdown.  Was this after a
hard poweroff?  fsck -f tells it to check anyway regardless, and it found
corruption; maybe from some other previous incident?

If you are cutting power, and if drive write caches are enabled and barriers are
not on and enforcing, then you are at risk for corruption at power loss.

I'm not terribly surprised or concerned that e2fsck finds corruption after power
loss (unless you do have barriers enabled), but I am concerned about the
segfault, and about any creeping corruption not related to power loss...

Also, saving the fsck output will generally be helpful.

Thanks,
-Eric
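
One possible way to keep the fsck result along with its output is sketched below, assuming a POSIX shell. The helper name describe_fsck_status, the device /dev/hda1 and the log path are illustrative assumptions; the bit values are the exit codes documented in the fsck(8) and e2fsck(8) man pages.

```shell
#!/bin/sh
# Hypothetical helper: translate an fsck/e2fsck exit status (a bit mask
# documented in fsck(8)) into readable text.
describe_fsck_status() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "no errors"
        return 0
    fi
    msg=""
    if [ $((status & 1)) -ne 0 ];   then msg="$msg errors-corrected"; fi
    if [ $((status & 2)) -ne 0 ];   then msg="$msg reboot-needed"; fi
    if [ $((status & 4)) -ne 0 ];   then msg="$msg errors-left-uncorrected"; fi
    if [ $((status & 8)) -ne 0 ];   then msg="$msg operational-error"; fi
    if [ $((status & 16)) -ne 0 ];  then msg="$msg usage-error"; fi
    if [ $((status & 32)) -ne 0 ];  then msg="$msg canceled"; fi
    if [ $((status & 128)) -ne 0 ]; then msg="$msg shared-library-error"; fi
    # Drop the leading space left by the first appended word.
    echo "${msg# }"
}

# Usage from a rescue shell (device and log path are examples only;
# -y answers "yes" to every repair question):
#   e2fsck -f -y /dev/hda1 > /tmp/fsck-hda1.log 2>&1
#   describe_fsck_status $?
```

Redirecting to a file rather than piping through tee keeps the exit status of e2fsck itself available in $?.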

Comment 7 stef 2008-06-10 13:38:11 UTC
(In reply to comment #6)
> Some of this is slightly odd; if fsck by itself says the fs is clean, then it
> thinks it was cleanly unmounted, i.e. from a clean shutdown.  Was this after a
> hard poweroff?  

As I said in the report, the computers are in computer rooms and we're not
standing behind every single student to see what he does... most likely it's the result
of a sudden power-off by some impatient person who wants to reboot quickly
into Windows (our computers are dual-boot) without waiting for the shutdown
sequence to complete...

> Also, saving the fsck output will generally be helpful.

Okay, I'll get the output next time

Comment 8 Eric Sandeen 2008-06-10 13:48:56 UTC
Ok.  At this point the only real bug I see here is the segfaulting fsck.  Does
this sound right to you?

Thanks,

-Eric

Comment 9 stef 2008-06-10 13:55:43 UTC
(In reply to comment #8)
> Ok.  At this point the only real bug I see here is the segfaulting fsck.

Segfaulting Fsck ?  I haven't seen anything like this, the segfault happens
apparently on a 'rm' operation :

/etc/rc.d/rc.sysinit: line 821 :  163 Segmentation fault   rm -f $ afile/*

Instead, fsck was unable to see a problem on the filesystem: I had to force
it to check with the '-f' parameter, and it then found errors. After the errors were
corrected the system booted OK.

Comment 10 Eric Sandeen 2008-06-10 14:21:48 UTC
Oh, I'm sorry.  I confused the two.  So, the segfaulting rm is the only bug I
see here ;)

-Eric

Comment 11 stef 2008-06-10 14:29:17 UTC
(In reply to comment #10)
> Oh, I'm sorry.  I confused the two.  So, the segfaulting rm is the only bug I
> see here ;)

Yes, the system is stuck when rc.sysinit calls 'rm -f' (if that's really
what happens and not a byproduct of a larger procedure). At this point the only
thing to do is to shut it down and reboot from a rescue disc to fsck.

The main point here is that the filesystem is clearly not in a
state that allows a correct boot, but this is not detected and no fsck of
the partition is started.

Comment 12 Eric Sandeen 2008-06-10 14:43:41 UTC
ok, I see.  Sorry, I'm skimming too much.

So what we know:

You have students who essentially pull the plug from time to time
You have boxes which sometimes segfault on rm -f at boot time
You often find corruption with manual e2fsck -f

So what I would like to know is:

Does the rm -f segfault happen after the initscripts have done a fsck (or
determined that a fsck is not necessary?)

Which was it, was fsck skipped because the filesystem looked clean or did fsck run?

What sort of corruption does fsck find when you force it to run?

The two possibilities I see are that either fsck is not completely fixing the
filesystem when it runs, or some other corruption is happening while the system
is running, but a clean unmount means that fsck isn't run at boot time and the
corruption isn't found until you manually e2fsck -f.

It would be interesting to see the fsck output in any case.

Thanks,
-Eric
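
The "clean" flag that decides whether a plain fsck is skipped can be read from the superblock with dumpe2fs -h. The sketch below assumes a POSIX shell; the helper name fs_state and the sample header values are illustrative, not taken from the affected machines.

```shell
#!/bin/sh
# Hypothetical helper: read `dumpe2fs -h` output on stdin and print the
# value of the "Filesystem state:" field.
fs_state() {
    sed -n 's/^Filesystem state:[[:space:]]*//p'
}

# Sample header lines in the dumpe2fs -h layout (values are illustrative).
sample='Filesystem volume name:   <none>
Filesystem state:         clean
Mount count:              12
Maximum mount count:      30'

printf '%s\n' "$sample" | fs_state

# Real usage from a rescue shell (device name is an example only):
#   dumpe2fs -h /dev/hda1 2>/dev/null | fs_state
```

If the state reads "clean" on a machine that fails to boot, a forced check is the only way to find the damage. As far as I know, the initscripts of this era also force a check at the next boot when the file /forcefsck exists, and tune2fs -c 1 on the device makes e2fsck run at every mount; treat both as things to verify on the installed release.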

Comment 13 stef 2008-06-11 11:40:29 UTC
Created attachment 308921 [details]
log of the 'fsck -f' command

Comment 14 stef 2008-06-11 11:43:20 UTC
Okay, here it goes. 
I found today another computer in the said state, so I could use the procedure
we described above.

fsck /dev/hda* => no check, fsck says everything is okay
fsck -f /dev/hda* => multiple errors found

I attached the console output.

After this 'fsck -f' the system boots OK again

Comment 15 petrosyan 2008-06-27 19:04:50 UTC
Fedora Core 6 and Fedora 7 are no longer maintained. Were you able to reproduce
this bug in Fedora 9?

Comment 16 stef 2008-07-01 11:35:51 UTC
(In reply to comment #15)
I would like to...

We decided at our uni to move on to FC9 for the new semester starting in
September/October; alas, due to Bug 442457 we got stuck very quickly in the
kickstart creation process.

Our hardware is HP workstations with hard drives affected by that bug, so it's
impossible to make a working install on them.
Since a fix for the bug wasn't, and still isn't, available, our IT department
decided to use FC8 instead.

I'll make reports about how FC8 performs during the next semester.

Comment 17 Matěj Cepl 2008-07-02 07:13:27 UTC
Switching the bug to NEEDINFO, because we expect information from the reporter.

Comment 18 Eric Sandeen 2008-07-07 22:52:10 UTC
Pass 1: Checking inodes, blocks, and sizes
Error reading block 3310586 (Attempt to read block from filesystem resulted in
short read) while doing inode scan.  Ignore error? yes


that's worrisome.  Do you get kernel messages too?  This looks like an IO error.

Also, getting the core from the segfaulting command, along with the subsequent
fsck -f output, might shed some light on things.

I'd also be curious to know if mounting with barriers enabled makes the problem
go away.  I'd certainly suggest it in this use pattern :)


Comment 19 stef 2008-07-08 11:50:01 UTC
(In reply to comment #18)

*No more messages on the console than what is in the bug report:
/etc/rc.d/rc.sysinit: line 821 :  163 Segmentation fault   rm -f $ afile/*

*I don't know where the core file is supposed to be dumped (if it is)

*"mounting with barriers enabled"? What is that? Could you tell me more?

Comment 20 Eric Sandeen 2008-07-15 21:26:14 UTC
to mount ext3 with barriers, mount -o barrier=1 .....
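
To make the option permanent it would go into the options column of /etc/fstab, for example "defaults,barrier=1" on the ext3 line. A small sketch, assuming a POSIX shell, for checking whether an option string asks for barriers; the helper name has_barrier and the example option strings are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: succeed if a comma-separated mount option string
# contains barrier=1.
has_barrier() {
    # Wrap the string in commas so the pattern matches whole options only.
    case ",$1," in
        *,barrier=1,*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example option strings (illustrative, not taken from the affected machines).
for opts in "rw,data=ordered" "rw,barrier=1,data=ordered"; do
    if has_barrier "$opts"; then
        echo "$opts: barriers requested"
    else
        echo "$opts: barriers not requested"
    fi
done
```

This only checks what was requested; whether barriers are actually enforced also depends on the drive and the kernel.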

What is the latest version of Fedora that you have seen the original problem on?
It was originally filed against FC6; have you seen it since?

If you haven't seen it since FC6 I may need to close since FC6 is no longer
maintained; If I do, and if you see it on F8 or later, please re-open with new
info ...

Comment 21 Eric Sandeen 2008-08-23 04:44:00 UTC
If you see this again on recent (supported) Fedora - F8 or later, please re-open.

Assuming for now that the problem is unique to FC6, so closing WONTFIX as FC6 is no longer supported.