Bug 743599

Summary: I/O errors on boot
Product: Red Hat Enterprise Linux 6
Reporter: Joe Pope <pope_svr4>
Component: device-mapper-multipath
Assignee: Ben Marzinski <bmarzins>
Status: CLOSED NOTABUG
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: high
Docs Contact:
Priority: unspecified
Version: 6.1
CC: agk, bmarzins, dwysocha, heinzm, prajnoha, prockai, zkabelac
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-09-30 14:25:04 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments:
Description                    Flags
config files                   none
command output                 none
command output                 none
command output                 none
multipath -ll output           none
cat /proc/partitions output    none
dmesg output zip 1             none
dmesg output zip 2             none
dmesg output zip 3             none
dmesg output zip 4             none
dmesg output zip 5             none
dmesg output zip 6             none

Description Joe Pope 2011-10-05 13:41:35 UTC
Description of problem:
We are NOT booting from the SAN. During the boot a large number of I/O errors are displayed for the attached SAN storage and the server will never finish booting and continually scrolls I/O errors. I rebuilt the initramfs using the "--preload scsi_dh_rdac" option and now we still see some of the I/O errors but the server will at least completely boot. Once the server completely boots all LUNs are accessible and everything works fine. What is causing these I/O errors and are they something to be concerned about?

Version-Release number of selected component (if applicable):
RHEL6.1
Kernel - 2.6.32-131.0.15.el6.x86_64


How reproducible:
Simply reboot the server

Steps to Reproduce:
1. Reboot the server
  
Actual results:
Without the rebuilt initramfs using the "--preload scsi_dh_rdac" option the server never finishes booting. With the rebuilt initramfs some I/O errors are displayed but the server will at least finish booting.

Expected results:
Server should boot cleanly with attached SAN storage

Additional info:
device-mapper-multipath-0.4.9-41.el6.x86_64
QLogic HBAs
Storage - IBM (LSI) 1814

Comment 2 Ben Marzinski 2011-10-05 16:32:56 UTC
This looks like it's the same issue as Bug 690523.  If that's the case, there are two sources of your errors. The first is that the rdac SCSI device handler module was getting loaded after the QLogic driver, and this was keeping the devices from getting set up correctly when they were initially discovered, causing a lot of IO errors.

The second issue is likely that multipathd is getting started before all of the devices have been discovered.  When this happens, if multipath sees a passive path first, it will activate that path.  If there is IO going to the formerly active path, this will be failed back.

To make sure that this is what you are seeing, you could try disabling multipath and verifying that the errors stop.

The easiest way to do this is to run

# chkconfig multipathd off

and remove

/etc/multipath.conf

and then remake the initramfs. You should make a backup of /etc/multipath.conf and your current initramfs.  With this done, multipath will be disabled.  If you reboot and still see errors, they are being caused by something other than multipath.
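
The steps above can be sketched as a small script. This is only an illustration, not a tested procedure: the image path, the backup locations, and the helper names `run` and `disable_multipath_for_test` are assumptions, and the rebuild keeps the "--preload scsi_dh_rdac" option the reporter was already using. By default it only prints the commands it would run.

```shell
#!/bin/bash
# Sketch of the "disable multipath and rebuild" test described above.
# DRY_RUN=1 (the default) only prints each command; set DRY_RUN=0 to run as root.
DRY_RUN="${DRY_RUN:-1}"
KVER="${KVER:-$(uname -r)}"
IMG="${IMG:-/boot/initramfs-${KVER}.img}"                # assumed image path
CONF_BACKUP="${CONF_BACKUP:-/root/multipath.conf.bak}"   # hypothetical backup location

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

disable_multipath_for_test() {
    # Keep a copy of the current initramfs before rebuilding it.
    run cp -p "$IMG" "${IMG}.bak"
    # Stop multipathd from being started at boot.
    run chkconfig multipathd off
    # Moving the config away both backs it up and removes it.
    run mv /etc/multipath.conf "$CONF_BACKUP"
    # Rebuild the initramfs, still preloading the rdac device handler.
    run mkinitrd -f --preload scsi_dh_rdac "$IMG" "$KVER"
}
```

Calling `disable_multipath_for_test` with the default settings lists the four commands so they can be reviewed before anything is changed.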

Comment 3 Joe Pope 2011-10-05 18:28:48 UTC
When I remake the initramfs should I leave out the --preload of scsi_dh_rdac? What about the wwids file in /etc/multipath, should that be removed?

Comment 4 Joe Pope 2011-10-05 20:05:55 UTC
turned multipathd off, removed multipath.conf, rebuilt initramfs - on reboot the server never finished booting. It continuously scrolled I/O error messages as follows for all of the devices:
end_request: I/O error, dev sdcz, sector 0
end_request: I/O error, dev sdcz, sector 15619784576
Then udev started and the errors changed but continued as follows for all of the devices:
Buffer I/O error on device sdev, logical block 244059105
ERROR: pdc: reading /dev/sdev [Input/output error]
I finally had to force a restart after 20+ minutes.


To actually get the server to boot, I had to modify the kernel line from grub.conf and add the following: rdloaddriver=scsi_dh_rdac
Once I added that line the server still scrolled some I/O errors but not nearly as many and it booted. I also received the following errors during this boot:
udevd-work[562]: rename(/dev/disk/by-id/scsi-360080e500017f0e60000a5894baafeeb.udev-tmp, /dev/disk/by-id/scsi-360080e500017f0e60000a5894baafeeb) failed: No such file or directory

udevd-work[562]: rename(/dev/disk/by-id/wwn-360080e500017f0e60000a5894baafeeb.udev-tmp, /dev/disk/by-id/wwn-360080e500017f0e60000a5894baafeeb) failed: No such file or directory
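
After a boot like this one, it may help to confirm that the kernel-line parameter was actually seen and that the handler module is loaded. The following is a minimal sketch; the function names `has_rdloaddriver` and `module_loaded` are made up for illustration, and the match is a simple exact comparison against one `rdloaddriver=` word.

```shell
#!/bin/bash
# Succeeds if the kernel command line contains rdloaddriver=<module>.
# $1 = module name, $2 = command line text (defaults to /proc/cmdline).
has_rdloaddriver() {
    module="$1"
    cmdline="${2:-$(cat /proc/cmdline)}"
    for word in $cmdline; do
        if [ "$word" = "rdloaddriver=$module" ]; then
            return 0
        fi
    done
    return 1
}

# Succeeds if the module appears in a /proc/modules style listing.
# $1 = module name, $2 = listing file (defaults to /proc/modules).
module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}"
}
```

For example, `has_rdloaddriver scsi_dh_rdac && module_loaded scsi_dh_rdac && echo "rdac handler requested and loaded"` checks both conditions on the running system.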

Comment 5 Ben Marzinski 2011-10-05 20:49:25 UTC
(In reply to comment #3)
> When I remake the initramfs should I leave out the --preload of scsi_dh_rdac?
> What about the wwids file in /etc/multipath, should that be removed?

You should include "--preload scsi_dh_rdac", otherwise, you will hit those errors. You could also include "--nompath". But removing /etc/multipath.conf should be enough.

You don't need to remove the /etc/multipath directory.  Dracut and the init scripts only check if the config file is there, to determine if multipathing should be started.

Comment 6 Ben Marzinski 2011-10-05 20:54:24 UTC
So, did you see more IO errors with multipath enabled (and the scsi device handler preloaded) or was it the same?  Once you've booted up, can you verify that multipathd really isn't running, and that 

# multipath -l

shows no devices.  If multipath really was disabled, then I'm not sure where those errors are coming from.  It could be that without multipath there, LVM is probing the passive paths. Or it could be something else completely.
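
The two checks requested above can be combined into one sketch. The function names are hypothetical, and it assumes the `service` command returns 0 for a running daemon, as RHEL 6 init scripts normally do.

```shell
#!/bin/bash
# Succeeds when the given `multipath -l` output contains no non-blank lines,
# i.e. no multipath devices were reported.
no_multipath_devices() {
    ! printf '%s\n' "$1" | grep -q '[^[:space:]]'
}

# Reports whether multipath looks disabled on the running system.
check_multipath_disabled() {
    if service multipathd status >/dev/null 2>&1; then
        echo "multipathd is still running"
        return 1
    fi
    if no_multipath_devices "$(multipath -l 2>/dev/null)"; then
        echo "multipath appears to be disabled"
    else
        echo "multipath devices are still present"
        return 1
    fi
}
```

Running `check_multipath_disabled` as root after boot gives a one-line answer that can be pasted into the report.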

Comment 7 Joe Pope 2011-10-05 20:58:42 UTC
The --nompath is not a valid option to mkinitrd.

I am running the tests from comment #6 now and will post results shortly.

Comment 8 Joe Pope 2011-10-05 21:00:50 UTC
And yes multipath was disabled when I ran the tests in comment #4. just verified

Comment 9 Ben Marzinski 2011-10-05 21:26:46 UTC
(In reply to comment #7)
> The --nompath is not a valid option to mkinitrd.

Sorry. That should be "-o multipath", but it shouldn't be necessary if you removed /etc/multipath.conf before you remade the initramfs.

Comment 10 Joe Pope 2011-10-05 22:11:53 UTC
I removed the multipath.conf, turned off multipathd and rebuilt the initramfs with the "--preload scsi_dh_rdac" option. I got the same results as in comment #4.

However, if I rebuild the initramfs with the "--preload scsi_dh_rdac" option and enable multipath, the server boots in normal time. It still shows I/O errors and some of the udev errors, but there are considerably fewer I/O errors.

I will attach some docs this evening that have excerpts from dmesg to see if anything can be gleaned from that.

Comment 11 Joe Pope 2011-10-05 22:31:40 UTC
Interestingly enough... When I had multipathd off and "--preload scsi_dh_rdac" option built into the initramfs the system would not boot as stated in comment #10. The only way to get it to boot was to modify the kernel line with "rdloaddriver=scsi_dh_rdac".

Question:
If I built the initramfs to preload the scsi_dh_rdac module shouldn't it have already been loaded? Why did I have to add the rdloaddriver option as well to get it to boot?

Comment 12 Joe Pope 2011-10-06 10:15:07 UTC
Created attachment 526654 [details]
config files

Page 1 of config files and command output

Comment 13 Joe Pope 2011-10-06 10:16:11 UTC
Created attachment 526655 [details]
command output

Page 2 of command output

Comment 14 Joe Pope 2011-10-06 10:16:57 UTC
Created attachment 526657 [details]
command output

Page 3 of command output

Comment 15 Joe Pope 2011-10-06 10:17:34 UTC
Created attachment 526658 [details]
command output

Page 4 of command output

Comment 16 Suzanne Logcher 2011-10-06 18:43:23 UTC
Since RHEL 6.2 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.
               
Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 17 Joe Pope 2011-10-07 03:14:22 UTC
(In reply to comment #16)
> Since RHEL 6.2 External Beta has begun, and this bug remains
> unresolved, it has been rejected as it is not proposed as
> exception or blocker.
> 
> Red Hat invites you to ask your support representative to
> propose this request, if appropriate and relevant, in the
> next release of Red Hat Enterprise Linux.

So does this mean there will be no forthcoming help with this issue? The problem still exists. What can I do for support on this bug?

Comment 18 Ben Marzinski 2011-10-07 16:30:13 UTC
As far as having the default initramfs make sure that the scsi device handlers have been loaded, that is being dealt with in Bug 690523.  Since there is a workaround (manually running dracut with the --preload option) and the required code is not upstream yet, this was pushed back to 6.3.

I'm not sure what the difference is between the rdloaddriver and the --preload dracut option.  I can take a look.

As for the remaining error messages, they may still be caused by multipath as I speculated in Comment 2, but I noticed something odd while looking at those four pages of output.  All of the error messages on those pages were from scsi devices that weren't listed as multipath paths.

When you boot up, with multipath running, can you capture the error messages? Once you are booted up, run

# cat /proc/partitions
# multipath -ll


I'm interested in seeing which devices are actually causing the errors on bootup.
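
One way to do this cross-check is to pull the device names out of the kernel error lines and compare them with the path devices listed by multipath. The sketch below assumes the two message formats quoted in comment 4 and the usual `H:C:T:L sdX` layout of path lines in `multipath -ll` output; the function names are illustrative.

```shell
#!/bin/bash
# Print the unique sd devices named in I/O error lines read from stdin.
error_devs() {
    sed -n \
        -e 's|.*I/O error, dev \(sd[a-z]*\),.*|\1|p' \
        -e 's|.*Buffer I/O error on device \(sd[a-z]*\),.*|\1|p' | sort -u
}

# Print the sd devices that appear as paths in `multipath -ll` output on stdin.
multipath_paths() {
    grep -oE '[0-9]+:[0-9]+:[0-9]+:[0-9]+ +sd[a-z]+' | awk '{print $2}' | sort -u
}

# Print devices that logged errors but are not multipath paths.
# $1 = file with saved dmesg output, $2 = file with saved `multipath -ll` output.
error_devs_not_multipathed() {
    paths="$(multipath_paths < "$2")"
    error_devs < "$1" | while read -r dev; do
        printf '%s\n' "$paths" | grep -qxF "$dev" || echo "$dev"
    done
}
```

With `dmesg > dmesg.txt` and `multipath -ll > mpath.txt` saved after boot, `error_devs_not_multipathed dmesg.txt mpath.txt` lists the devices worth a closer look.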

Comment 19 Joe Pope 2011-10-11 23:18:46 UTC
I added the following option to the kernel line in grub.conf: "rdloaddriver=scsi_dh_rdac". I am using the original initramfs generated when the OS was installed. The initramfs does NOT have the "--preload=scsi_dh_rdac" option built in. The server boots in normal time and displays fewer I/O errors and udev errors, but they are still there. I will be attaching the output of the "multipath -ll" and "cat /proc/partitions" commands.

I will also be attaching a large portion of the "dmesg" command.

Comment 20 Joe Pope 2011-10-11 23:20:36 UTC
Created attachment 527559 [details]
multipath -ll output

Output: multipath -ll

Comment 21 Joe Pope 2011-10-11 23:29:08 UTC
Created attachment 527560 [details]
cat /proc/partitions output

cat /proc/partitions output

Comment 22 Joe Pope 2011-10-12 00:22:08 UTC
Created attachment 527568 [details]
dmesg output zip 1

dmesg output zip 1

Comment 23 Joe Pope 2011-10-12 00:25:51 UTC
Created attachment 527569 [details]
dmesg output zip 2

Comment 24 Joe Pope 2011-10-12 00:27:45 UTC
Created attachment 527570 [details]
dmesg output zip 3

Comment 25 Joe Pope 2011-10-12 00:29:28 UTC
Created attachment 527571 [details]
dmesg output zip 4

Comment 26 Joe Pope 2011-10-12 00:31:19 UTC
Created attachment 527572 [details]
dmesg output zip 5

Comment 27 Joe Pope 2011-10-12 00:33:05 UTC
Created attachment 527573 [details]
dmesg output zip 6

Comment 28 Ben Marzinski 2011-10-12 21:22:47 UTC
I'm digging through this information right now, but is there any way to repost this as text files instead of PDFs?  I would be able to sift through this much faster that way, and I'm not able to convert the PDFs back to text files accurately enough to be helpful.

Comment 29 Joe Pope 2011-10-12 23:09:49 UTC
The servers are on a closed network and that is the only way I have to send the files.

Comment 30 Ben Marzinski 2011-10-13 18:11:59 UTC
So, looking at the output, all of your errors are happening before multipath is started (it doesn't get started in the initramfs), and the I/O errors are happening on the passive paths when LVM is scanning the devices.   Could you try rebuilding the initramfs with multipath included?

First, make sure /etc/multipath exists, then run

dracut -a multipath

Then LVM should notice that the devices are multipathed and talk to the multipath devices instead of scsi devices.
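
After the rebuild, one way to confirm that multipath actually ended up in the image is to look for its binaries in the initramfs listing. This is a rough sketch: `listing_has_multipath` is a made-up helper, the image path is an assumption, and it relies on the `lsinitrd` tool that ships with dracut.

```shell
#!/bin/bash
# Reads an initramfs file listing on stdin and succeeds if it contains
# a path ending in /multipath or /multipathd.
listing_has_multipath() {
    grep -Eq '/multipathd?$'
}

# Example use on a live system (run as root, paths assumed):
#   IMG="/boot/initramfs-$(uname -r).img"
#   lsinitrd "$IMG" | listing_has_multipath && echo "multipath is in $IMG"
```

If the check fails after running the dracut command above, the multipath module was probably skipped during the rebuild, and the dracut output is the place to look for the reason.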

The errors like:
"unable to read partition table"
"unable to read RDB block 0"
"unknown partition table"

that happen when a passive path is initially discovered are probably unavoidable.
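
When reviewing a boot log, those three messages can be filtered out so that any remaining errors stand out. A minimal sketch, with a hypothetical function name and only the three strings quoted above:

```shell
#!/bin/bash
# Drop the passive-path discovery messages listed above; pass everything else through.
# Note: grep exits non-zero when no lines remain after filtering.
filter_expected_passive_path_noise() {
    grep -v \
        -e 'unable to read partition table' \
        -e 'unable to read RDB block 0' \
        -e 'unknown partition table'
}
```

For example, `dmesg | filter_expected_passive_path_noise | grep -i 'error'` shows only the error lines that are not covered by the list above.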

Comment 31 Joe Pope 2011-10-14 00:08:17 UTC
What is the difference between mkinitrd and dracut? They both create an initramfs file. Does dracut do things differently?

I will test the dracut command tomorrow and let you know.

Comment 32 Ben Marzinski 2011-10-14 15:14:29 UTC
mkinitrd is now just a bash script wrapper around dracut.

Comment 33 Joe Pope 2011-10-14 22:35:17 UTC
I ran "dracut -a multipath" and removed the "rdloaddriver=scsi_dh_rdac" parameter from grub.conf.

On reboot I get a bunch of "end_request: I/O error" messages and I see a bunch of "/dev/sd##: read failed after 0 of 4096 at 0: Input/output error" messages.

Once multipathd starts there are no more error messages. The server boots in normal time as well. If this is normal behaviour I am fine with that but I just wanted to make sure. I will use the dracut command for all my servers.

On shutdown I also see the "...read failed..." messages but the server reboots just fine.

Comment 34 Joe Pope 2011-10-18 12:05:02 UTC
Are the results I mentioned in comment #33 normal?

If you need more info or logs, please let me know.

Comment 35 Ben Marzinski 2011-10-19 18:10:39 UTC
Before multipath starts, there's no way for any programs to know that they aren't supposed to use the passive paths, so that's probably normal.

Comment 36 Joe Pope 2011-10-19 21:59:00 UTC
Works for me. I will just rebuild my initramfs files on each server using the dracut command you sent and call it a day. Thanks for the help.