446279 – kdump initrd does not handle lun scanning race condition

Summary: kdump initrd does not handle lun scanning race condition

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kexec-tools
Sub Component:
Version:	5.2
Hardware:	All
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Neil Horman
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	391501

Reported:	2008-05-13 21:09 UTC by Doug Chapman
Modified:	2018-10-27 12:56 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-01-20 20:58:47 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
patch to delay boot while scsi drivers initalize (1.61 KB, patch) 2008-05-14 14:00 UTC, Neil Horman
updated patch with backticks and $ properly escaped (856 bytes, patch) 2008-05-14 20:24 UTC, Neil Horman
fix race condition on kdump boot (3.13 KB, patch) 2008-05-15 19:13 UTC, Doug Chapman
patch to provide initramfs with list of critical block devices (3.14 KB, patch) 2008-05-16 17:26 UTC, Neil Horman

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2009:0105	0	normal	SHIPPED_LIVE	kexec-tools bug fix and enhancement update	2009-01-20 16:04:36 UTC

Description Doug Chapman 2008-05-13 21:09:59 UTC

Description of problem:
This is a problem we used to hit a lot during normal bootup, but it was resolved
by recent versions of mkinitrd.  We need to do something similar in mkdumprd.

In RHEL5, when a driver module is loaded it starts scanning for LUNs in another
thread.  During bootup it is important to be sure that all the drivers are done
scanning before the initrd tries to mount the root filesystem or start LVM.
This is seen most often on large systems with lots of storage (e.g. a SAN
environment).

The hack in mkinitrd basically just looks at /proc/scsi/scsi and waits for it to
stop changing.
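
A minimal sketch of the "wait for it to stop changing" idea, written in Python
purely for illustration (the real initrd logic lives in nash/shell).  The
function name, its parameters and the default values are assumptions made for
this sketch and are not taken from mkinitrd:

```python
import time


def wait_until_stable(path="/proc/scsi/scsi", interval=1.0, stable_reads=5,
                      timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Poll `path` until its contents are identical for `stable_reads`
    consecutive polls.  Returns True if it settled, False on timeout.

    All defaults are illustrative guesses, not values used by mkinitrd.
    """
    def read():
        try:
            with open(path, "rb") as f:
                return f.read()
        except OSError:
            return None  # file not present yet; compare as "nothing seen"

    deadline = clock() + timeout
    last = read()
    unchanged = 0
    while unchanged < stable_reads:
        if clock() >= deadline:
            return False
        sleep(interval)
        current = read()
        if current == last:
            unchanged += 1
        else:
            # Something new showed up: restart the quiet-period count.
            unchanged = 0
            last = current
    return True
```

Note that a quiet period only suggests scanning has finished; a slow driver
that has not reported anything yet looks exactly the same as a finished one.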

I am working on a patch for mkdumprd now that mimics what mkinitrd does.  It
won't be the same fix, because they fixed it by adding a command to the nash
shell to handle this, but it will be the same idea.

I will attach it here once I test it out.

Version-Release number of selected component (if applicable):
kexec-tools-1.102pre-21.el5

How reproducible:
Difficult to reproduce unless you have a big system with a lot of storage.  On
big systems it can happen nearly 100% of the time.


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 Neil Horman 2008-05-13 23:53:43 UTC

I understand your pain here, but I have to ask:  Is waiting for /proc/scsi/scsi
to quiesce really that safe a way to make this work?  I've solved this problem
in what I think is a safer (if far less automatic) way.  In RHEL5.2 we added
kdump_pre and kdump_post directives, which allow you to specify scripts to run
immediately before and after core collection.  The kdump_pre directive in
particular is useful here, as it allows you to write a script that waits for a
specific drive to become available using whatever detection means are most
relevant to the environment at hand.  Granted, the implication here is that a
custom script has to be tailored for any environment that experiences this
problem, but I'm not entirely sure that polling /proc/scsi/scsi for a no-change
time threshold is much better.  If you would, please take a look at the
kdump_pre directive (I'm happy to write a script for you to target a particular
system if you like), and see if it solves your problem.  If you're not happy
with the results, we can look at the solution you are proposing above.  Thanks!

Comment 2 Doug Chapman 2008-05-14 12:19:26 UTC

As for whether the /proc/scsi/scsi method is safe: no, it is not the "best" way,
but this is how RHEL5.X _always_ boots.  More recent upstream kernels have a
better fix, which I suggested we backport to RHEL, but I forget the reasons that
was rejected.  We have not seen problems with this method yet, to the best of my
knowledge.  We had LOTS of problems before that hack was put into the initrd.

I suppose the kdump_pre solution would work, but are you saying we would rely on
the customer to do that?  That is not an acceptable solution.  If the system
boots OK in a normal situation without the user needing to hack it, it should
boot OK under kdump.

Comment 3 Neil Horman 2008-05-14 14:00:16 UTC

Created attachment 305363 [details]
patch to delay boot while scsi drivers initalize

Yeah, my solution requires customization for environments with this problem,
but I had decided that was ok, since all our docs tell the users they should
test their configuration for kdump before deploying (lest they reserve too
little memory and oom kill themselves on kdump boot).  If mkinitrd does this
for normal boot, I guess it would be alright to include in mkdumprd (hack
though it is).  The attached patch should mimic the stabilize functionality in
nash.  Can you test it out and confirm that it fixes the problem for you?
Thanks!

Comment 4 Neil Horman 2008-05-14 20:24:06 UTC

Created attachment 305402 [details]
updated patch with backticks and $ properly escaped

new patch with proper escape sequences

Comment 5 Doug Chapman 2008-05-14 22:29:16 UTC

Neil,

Just as I remember from when we had to fix this in mkinitrd, I am finding this
is ugly.  It is still not working with the latest patch; I will continue to
investigate and hopefully have a reliable solution soon.

In the meantime I am going to assign this to myself, since I have the hardware
to reproduce it.

- Doug

Comment 6 Doug Chapman 2008-05-15 19:13:28 UTC

Created attachment 305525 [details]
fix race condition on kdump boot

This solution seems to work.  I looked into mkinitrd some more and found the
biggest difference is mkinitrd probes for what driver it needs for the root
device _before_ it looks at /etc/modules.conf.

This patch moves the code that probes what is needed for the root filesystem
earlier so that the critical modules get loaded first.  This is what allows
mkinitrd to avoid the race we are seeing.
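
The reordering can be pictured with a small sketch.  This is Python used only
for illustration and is not the attached patch; the function name and the
module names in the usage note are hypothetical examples:

```python
def order_modules(critical, configured):
    """Return the modules to load with the critical (root device) ones
    first, followed by the remaining configured ones, duplicates dropped."""
    ordered = []
    seen = set()
    for name in list(critical) + list(configured):
        if name not in seen:
            seen.add(name)
            ordered.append(name)
    return ordered
```

For example, order_modules(["qla2xxx"], ["mptsas", "qla2xxx"]) puts the root
device driver ahead of the other configured driver, so its scan starts as
early as possible.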

I also noticed that it looks like mkdumprd started life out as just a modified
version of mkinitrd.  Has there been any thought put into splitting out the
common bits of both into a new file so that we can maintain it more easily?  I
found there were a lot of other little changes that have been made to mkinitrd
that probably should be included in mkdumprd as well.

I have tested this on my rx4640 set up to boot from a LUN off of a qlogic FC
card.  Without this patch it tries to mount "/" before it is probed; with the
patch it boots cleanly.

Comment 7 Neil Horman 2008-05-16 01:45:18 UTC

So, I'm looking at this patch, and it seems to me that, as you say, it just
loads the SCSI HBA modules earlier in the boot process.  The only thing I see
that doing is providing more time before we start checking /proc/scsi/scsi for
changes.  That just seems soooooo hackish, whether or not mkinitrd does it too.
If that's the case, I'd just as soon increase the polling interval on the last
chunk (which, unless I'm mistaken, will accomplish the same goal).  Honestly,
though, I feel like we need a better solution.  In fact, I think I have an idea
for one.  When we start the kdump service, we can record the devices in
/sys/block, filtered down to the ones we need in order to reach the configured
dump targets.  Then, on kdump boot, we can just poll until all of those devices
are present in /sys/block.  I'll put a patch together and post it shortly.

Comment 8 Neil Horman 2008-05-16 17:26:52 UTC

Created attachment 305717 [details]
patch to provide initramfs with list of critical block devices 

Ok, so here's a patch with the idea I had.  When mkdumprd runs and generates the
list of modules needed to access the drives required to complete the dump
capture process, it will (with this patch) also record the names of the
corresponding block devices, as they appear in /sys/block.  That list is stored
in a file in the initramfs and queried on bootup, after all the modules are
loaded.  Booting pauses and loops until all the named block devices appear in
sysfs.  While this doesn't wait for all devices to appear, it does wait until at
least the critical devices are present before attempting a dump capture.  I
tried it on a local test system, and it worked well.  If you could please test
it on your system, which we know to suffer from this problem in kdump, and
confirm the fix, I'll get it checked in ASAP.  Thanks!
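
The record-then-wait scheme described above could be sketched roughly as
follows.  This is Python for illustration only and is not the attached patch
(the initramfs itself runs shell); the function names, the list file format
(one device name per line) and the optional timeout are all assumptions of
this sketch:

```python
import os
import time


def record_critical_devices(devices, list_path):
    """Build-time step: write one block device name per line (e.g. 'sda')."""
    with open(list_path, "w") as f:
        for name in devices:
            f.write(name + "\n")


def wait_for_block_devices(list_path, sysfs_block="/sys/block", interval=1.0,
                           timeout=None, sleep=time.sleep,
                           clock=time.monotonic):
    """Boot-time step: loop until every recorded device has an entry under
    `sysfs_block`.  Returns [] once all are present.

    `timeout` is an addition of this sketch: with the default of None the
    loop waits indefinitely; otherwise the still-missing names are returned
    once the deadline passes.
    """
    with open(list_path) as f:
        wanted = [line.strip() for line in f if line.strip()]

    deadline = None if timeout is None else clock() + timeout
    while True:
        missing = [name for name in wanted
                   if not os.path.exists(os.path.join(sysfs_block, name))]
        if not missing:
            return []
        if deadline is not None and clock() >= deadline:
            return missing
        sleep(interval)
```

Waiting on named devices avoids guessing how long a quiet period must be: the
loop ends as soon as the devices that matter exist, however long other scans
take.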

Comment 9 Doug Chapman 2008-05-16 20:34:27 UTC

Works on my rx4640 booting using the qlogic card (the case that failed before).
I will test in some other cases but I think this is the only "nasty" one likely
to break.

thanks!

- Doug

Comment 10 Kevin Krafthefer 2008-05-22 16:06:27 UTC

This RFE has been reviewed during the RHEL RFE review with Red Hat product
management. This request has been *tentatively* approved for inclusion in the
next update. This decision is not final and still pends further technical
review and scoping by Red Hat development engineering.

Comment 11 Neil Horman 2008-05-22 16:59:15 UTC

Fixed in -22.el5.  Thanks!

Comment 16 errata-xmlrpc 2009-01-20 20:58:47 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2009-0105.html
