435358 – iscsi root device never gets fscked when there's an i/o error

Summary: iscsi root device never gets fscked when there's an i/o error

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	initscripts
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	low
Severity:	low
Target Milestone:	beta
Target Release:	---
Assignee:	initscripts Maintenance Team
QA Contact:	Brock Organ
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	435716 435717 435722

Reported:	2008-02-28 20:26 UTC by James Laska
Modified:	2013-09-02 06:24 UTC
CC List:	4 users
Fixed In Version:	RHBA-2008-0443
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-21 17:24:15 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0443	0	normal	SHIPPED_LIVE	initscripts bug fix and enhancement update	2008-05-20 16:44:03 UTC

Description James Laska 2008-02-28 20:26:54 UTC

Description of problem:

When a system with iSCSI devices experiences a network glitch and is then
rebooted, rc.sysinit does not force an fsck for the iSCSI devices as it
normally would for locally attached disks.  As a result, the user must
manually run fsck after the system boots into read-only mode.

Version-Release number of selected component (if applicable):
RHEL5.2-Server-20080225.2 (initscripts-8.45.18.EL-1)

How reproducible:
Only tested once

Steps to Reproduce:
[Feb 28 14:50:30] <      pjones> | 1) i/o error happens for whatever reason
(broken cable, act of god, etc.)
[Feb 28 14:50:38] <      pjones> | 2) reboot happens
[Feb 28 14:50:50] <      pjones> | 3) initrd mounts fs read-only
[Feb 28 14:51:06] <      pjones> | 4) kernel sees that it's in an error state,
marks it as needing fsck, fixes the journal
[Feb 28 14:51:28] <      pjones> | 5) initscripts runs "fsck -A", which doesn't
touch it because of _netdev in the options (which we can't take out for other
reasons)
[Feb 28 14:51:53] <      pjones> | 6) initscripts remounts it read-write
[Feb 28 14:52:10] <      pjones> | 7) kernel sees the error still and forces it
back to read-only
[Feb 28 14:52:37] <      pjones> | 8) netfs doesn't fsck it because it can't
touch the lockfile in /var/run so it never actually runs correctly
[Feb 28 14:55:51] <      pjones> | (also lots of other stuff fails between 7 and 8)
[Feb 28 14:56:41] <      pjones> |  / *absolutely must* get fscked in
rc.sysinit, not a separate initscript.  

Actual results:
 - filesystem mounted in read-only mode, manual fsck required

Expected results:
 - rc.sysinit should trigger an fsck upon boot.

Additional info:

Comment 2 Bill Nottingham 2008-02-29 04:33:15 UTC

This isn't really anything new - this would be the case even in 5.0 GA if you
have a network root block device (iSCSI, GFS2, NBD) - they would all run into
this issue.

Possible solutions, of varying quality:
- Remove _netdev from fstab for /. This would, however, break shutdown.
(Obviously bad.)
- Remove the case from rc.sysinit so that _netdev devices are fscked. This
would, however, break booting with *any* non-root network block devices, as they
wouldn't be found when fsck runs, including existing installations. (Also,
obviously bad.)
- Run fsck on / from the initrd. *ducks*
- Introduce Yet Another Magic Flag, honored by shutdown as 'root is a network
device', but different from _netdev so it would be fscked. Would be a rather
ridiculous hack, but may work.

Comment 4 Bill Nottingham 2008-02-29 17:35:21 UTC

As for introducing Yet Another Magic flag - _netdev is handled specifically by
mount(8) - so if we went that route to fix, it would require changes to (at a
minimum) initscripts, anaconda, and util-linux.

Comment 5 Bill Nottingham 2008-02-29 18:37:57 UTC

Another alternative is root-causing why fsck from netfs fails for /, and fixing
that issue.

Comment 6 Peter Jones 2008-02-29 20:19:32 UTC

notting: I'm not sure how that'll help -- the fs is already failed to RO mode by
the kernel at that point, and can't ever be put back in RW mode.

At any rate, I'm pretty sure the reason netfs doesn't work is that
/var/lock/subsys/netfs is still present.

Comment 8 Bill Nottingham 2008-02-29 23:52:24 UTC

(In reply to comment #6)
> At any rate, I'm pretty sure the reason netfs doesn't work is that
> /var/lock/subsys/netfs is still present.

Oh, right. And at that point you can't get to the FS to fix it and behave
reliably. So we're back to the add-another-flag hack.

Comment 9 Eric Sandeen 2008-03-01 00:09:03 UTC

Just to document some of the stuff pjones & I looked at....

For the -t option to fsck, the "opt=" and "noopt=" specifiers are given as a
comma-delimited list, and each item carries its own "opt=" or "noopt=" prefix.

i.e. -t opt=foo,noopt=bar

And they are cumulative; a filesystem must match each option (or no-option)
specification to be checked.  Above, only filesystems with option foo *and*
without option bar will be checked.

... although at this point I guess it looks like we won't need to combine the
option specifiers in this way... but just in case.

-Eric
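
The matching rule described in comment 9 can be illustrated with a small
model. This is only a sketch of the rule as stated there, not fsck's actual
code: the function names and the sample mount options are hypothetical, plain
filesystem types in the -t list are ignored, and it assumes rc.sysinit
excludes network devices with a specifier like "noopt=_netdev" (the thread
does not show the exact command line).

```python
def parse_fsck_type_spec(spec):
    """Split a -t argument such as 'opt=foo,noopt=bar' into two sets.

    Hypothetical helper: models only the opt=/noopt= items.
    """
    required, forbidden = set(), set()
    for item in spec.split(","):
        if item.startswith("noopt="):
            forbidden.add(item[len("noopt="):])
        elif item.startswith("opt="):
            required.add(item[len("opt="):])
        # Plain filesystem types (e.g. "ext3") are ignored in this sketch.
    return required, forbidden


def would_check(mount_options, spec):
    """Return True if an fstab entry with these options matches the spec.

    The specifiers are cumulative: every opt= must be present and
    every noopt= must be absent.
    """
    opts = set(mount_options.split(","))
    required, forbidden = parse_fsck_type_spec(spec)
    return required <= opts and not (forbidden & opts)


# Hypothetical fstab entries: mount point -> options field.
sample_fstab = {
    "/": "defaults,_netdev",   # network root, as in this bug
    "/home": "defaults",       # locally attached disk
}

for mount_point, options in sample_fstab.items():
    checked = would_check(options, "noopt=_netdev")
    print(mount_point, "checked" if checked else "skipped")
```

Under these assumptions the root entry is skipped purely because its options
field contains _netdev, which matches step 5 of the reproduction steps.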

Comment 10 Bill Nottingham 2008-03-03 16:33:11 UTC

8.45.19.EL-1 built.

Comment 14 errata-xmlrpc 2008-05-21 17:24:15 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0443.html

Comment 15 Fedora Update System 2008-06-05 20:33:22 UTC

cmake-2.4.8-2.fc8 has been submitted as an update for Fedora 8
