Bug 465539

Summary: running mkinitrd causes md raid components to fail
Product: Fedora
Version: rawhide
Component: kernel
Status: CLOSED RAWHIDE
Severity: medium
Priority: medium
Hardware: All
OS: Linux
Reporter: Alexandre Oliva <oliva>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: dcantrell, kernel-maint, pjones, wtogami
Doc Type: Bug Fix
Last Closed: 2008-10-25 02:38:10 UTC

Description Alexandre Oliva 2008-10-03 19:27:13 UTC
It appears that all the O(n^n) probing in /sys done by mkinitrd exercises some condition within the kernel that causes (often unrelated) members of md arrays to be regarded as failed.

The configuration in which this was most visible was my notebook, with two external USB disks, just installed and syncing data heavily among them and the internal disk.  Leaving the system alone for a while always resulted in a complete sync, and using the system normally worked fine afterwards, which more or less rules out actual disk failures.  However, installing a new kernel was a sure way to get some device marked as failed.

After much investigation, I narrowed it down to running mkinitrd, more precisely to the portion of mkinitrd that tries to figure out which drivers it needs to bring up the components of the md array that makes up the volume group holding the root and swap/suspend devices.  Note that the failure is not always in the raid members of this volume group, nor even in the /boot partition; members of unrelated, stand-alone raid devices that are not needed by the initrd suffer from this problem as well.

This is probably a kernel issue, some kind of race condition between accessing /sys (which mkinitrd/nash/blkid do a lot to figure out what depends on what) and updating/maintaining md arrays in the kernel.  If you confirm that mkinitrd/nash/blkid contain no code that should intentionally mess with the status of an md raid device, please reassign this to kernel, and keep the bug handy for dupes :-)
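
To give an idea of the kind of sysfs walking involved, here is a minimal sketch; the exact set of files mkinitrd reads is a guess on my part (the strace output further down shows one such read):

for dev in /sys/block/*; do
  cat "$dev/device/model" 2>/dev/null    # one of the identification files read during probing
done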

Comment 1 Alexandre Oliva 2008-10-03 23:59:32 UTC
Some more information, which does indeed point at the kernel.

I narrowed the problem further to the first nash-resolveDevice command issued by mkinitrd, in my case:

echo nash-resolveDevice /dev/livre/test | /sbin/nash --forcequiet

Then, I performed a syscall binary search on it:

echo nash-resolveDevice /dev/livre/test | strace -f /sbin/nash --forcequiet 2>&1 | sed 436q | tail

I varied the 436 so as to figure out which syscall was the last one before the problem.  Once sed quit and closed the pipe, strace would get a pipe error and kill the child process, so I knew exactly what the last known good syscall was, regardless of how long the bad syscall took to take effect.

Turns out that at 436, I would reliably not get a faulty array member, but at 437, I would quite often (but not absolutely every time) get it.  And the printed lines didn't vary, so I knew I was getting consistent results.
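
FWIW, this manual bisection could be scripted along the following lines; this is only a rough sketch (the bounds are arbitrary, and it assumes the failed member is re-added with mdadm --re-add between runs):

lo=1; hi=1000
while [ $((hi - lo)) -gt 1 ]; do
  mid=$(( (lo + hi) / 2 ))
  echo nash-resolveDevice /dev/livre/test |
    strace -f /sbin/nash --forcequiet 2>&1 | sed ${mid}q > /dev/null
  sleep 2                               # give the failure time to show up
  if grep -q '(F)' /proc/mdstat; then   # a faulty member shows as (F)
    hi=$mid
  else
    lo=$mid
  fi
done
echo "first failing syscall is printed at line $hi"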

So, what were the last few lines?

open("/sys/block//sda/device/model", O_RDONLY) = 6
fcntl(6, F_GETFD)                       = 0
fcntl(6, F_SETFD, FD_CLOEXEC)           = 0
fstat(6, {st_mode=S_IFREG|0444, st_size=4096, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f128427c000
read(6, "00BEVT-60ZCT0   \n"..., 4096)  = 17
close(6)                                = 0 [line 436]
munmap(0x7f128427c000, 4096)            = 0 [line 437]

Up to this point, we're fine.  The fact that we get the error here means the next syscall strace would have printed (after issuing the call) is the one that triggers it.  Here it is:

ioctl(5, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[6]=[12, 00, 00, 00, fe, 00], mx_sb_len=32, iovec_count=0, dxfer_len=254, timeout=5000, flags=0, data[254]=["\0\0\2\1\37\0\0\0WDC WD3200BEVT-60ZCT0   \0"...], status=00, masked_status=00, sb[0]=[], host_status=0, driver_status=0, resid=214, duration=1, info=0}) = 0
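
That CDB is a plain SCSI INQUIRY (opcode 0x12, EVPD=0) with a 254-byte allocation length, so it should be reproducible without nash at all; something like this with sg_raw from sg3_utils ought to issue the very same command (an untested sketch on my part):

sg_raw -r 254 /dev/sda 12 00 00 00 fe 00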


Anyhow, sda is (in this session) one of the external USB drives, and sda3 is the member of the md RAID1 array that fails after this syscall.  And here's what gets logged to /var/log/messages:

Oct  3 20:47:47 livre kernel: sd 0:0:0:0: [sda] Result: hostbyte=DID_ERROR driverbyte=DRIVER_OK,SUGGEST_RETRY
Oct  3 20:47:47 livre kernel: end_request: I/O error, dev sda, sector 156296262
Oct  3 20:47:47 livre kernel: end_request: I/O error, dev sda, sector 156296262
Oct  3 20:47:47 livre kernel: md: super_written gets error=-5, uptodate=0
Oct  3 20:47:47 livre kernel: raid1: Disk failure on sda3, disabling device.
Oct  3 20:47:47 livre kernel: raid1: Operation continuing on 2 devices.

Hmm...  This is a new 2.5" USB-powered disk, unlike the other one, a 3.5" external enclosure that is also USB but draws power from a regular power outlet.

Right now I can't tell whether the error affects only this one (the order of the drives seems to change at every boot, so it's easy to be misled into believing it affects all disks), but I'll keep my eyes open.

Anyhow, this is looking like a kernel issue, and I don't quite see how mkinitrd could even work around it :-(  Reassigning, and marking as 'need info' for myself, so that I can come back with more details about this specific disk if it turns out to be the one that fails consistently, or with something else if the failure varies over all 3 disks.

Comment 2 Alexandre Oliva 2008-10-04 00:13:03 UTC
FTR, with tridge's scsi_io program (http://samba.org/~tridge/ctdb/utils/scsi_io/scsi_io.c), I managed to get sda to fail in a similar way with the usn (unit serial number) request.  It does return a serial number (unlike the other, older USB disk), but it fails afterwards.

That said, usn is not quite the same kind of request.  The exact syscall is:

ioctl(3, SG_IO, {'S', SG_DXFER_FROM_DEV, cmd[6]=[12, 01, 80, 00, ff, 00], mx_sb_len=32, iovec_count=0, dxfer_len=255, timeout=5000, flags=0, data[255]=["\0\200\0\f152D20329000\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...], status=00, masked_status=00, sb[0]=[], host_status=0, driver_status=0, resid=239, duration=2, info=0}) = 0

and identical errors get logged to /var/log/messages.

The other disks don't respond the same way (as in demanding a retry for the next operation).  So it's something specific to this particular device.  Ugh...  Any suggestions?
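
In the same sg_raw terms as above, that one is an INQUIRY with EVPD=1 asking for VPD page 0x80 (unit serial number), so presumably:

sg_raw -r 255 /dev/sda 12 01 80 00 ff 00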

Comment 3 Alexandre Oliva 2008-10-06 01:10:26 UTC
http://bugzilla.kernel.org/show_bug.cgi?id=9638 might be related in some way (if only because it's the same kind of USB enclosure, with references to underflow, which is what triggers the error I get).

Comment 4 Alexandre Oliva 2008-10-08 06:28:54 UTC
The kernel patch that fixes this bug is attached to the bug above: http://bugzilla.kernel.org/attachment.cgi?id=18207

Comment 5 Chuck Ebbert 2008-10-25 02:38:10 UTC
Fixed in 2.6.27.4-46.rc3