Bug 555871 - Rebooting multiple LPARs can make vdisks disappear
Summary: Rebooting multiple LPARs can make vdisks disappear
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: powerpc
OS: Linux
Priority: high
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On: 524110
Blocks: 556938
 
Reported: 2010-01-15 19:33 UTC by Nate Straz
Modified: 2016-04-26 15:17 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of: 524110
Environment:
Last Closed: 2010-02-09 21:11:53 UTC
Target Upstream Version:
Embargoed:


Attachments
system dump from basic after hdisk1 errors (5.44 MB, application/x-bzip2)
2010-01-18 20:30 UTC, Nate Straz

Description Nate Straz 2010-01-15 19:33:49 UTC
+++ This bug was initially created as a clone of Bug #524110 +++

Description of problem:

After rebooting all Linux LPARs in a system, virtual disks are not always available when a Linux LPAR comes back up.  The missing disk can be the root vdisk, in which case our systems fall back to booting from the network, or it can be the shared disk.  In the latter case, rescanning the SCSI bus usually brings the LUN back.
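
For reference, the rescan workaround amounts to roughly the following on the affected LPAR (a sketch; host0 is a placeholder for the actual vscsi host number under /sys/class/scsi_host):

# rescan all channels/targets/LUNs on the virtual SCSI host
echo "- - -" > /sys/class/scsi_host/host0/scan
# confirm the LUN is visible again
cat /proc/scsi/scsi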



--- Additional comment from nstraz on 2010-01-05 17:54:10 EST ---

Created an attachment (id=381871)
system dump compressed with bzip2

After getting everything updated to the latest VIOS fix level 2.1.2.10-FP-22 I have not yet been able to recreate the hang.  I have hit another odd issue, though.  Sometimes when basic-p1 (one of the Linux LPARs) reboots, the shared disk is not available.  I see messages like this during boot:

scsi 0:0:3:0: aborting command. lun 0x8300000000000000, tag 0xc00000007faf2af8
scsi 0:0:3:0: aborted task tag 0xc00000007faf2af8 completed
scsi 0:0:3:0: timing out command, waited 22s

The shared disk is a Winchester FC-SATA array attached via Emulex LightPulse FC card and QLogic switches.  IVM still says the physical disk is available and there were no other messages.  After rebooting the LPAR again the disk came back.

After a few cycles of this I started getting errors in Linux:

INFO: task gfs_mkfs:3861 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gfs_mkfs      D 000000000ff111f8  6624  3861   3860                     (NOTLB)
Call Trace:
[C00000006795EF50] [C0000000679D2C30] 0xc0000000679d2c30 (unreliable)
[C00000006795F120] [C000000000010AA0] .__switch_to+0x124/0x148
[C00000006795F1B0] [C0000000003D8F38] .schedule+0xc08/0xdbc
[C00000006795F2C0] [C0000000003D9C3C] .io_schedule+0x58/0xa8
[C00000006795F350] [C0000000000FD2D8] .sync_buffer+0x68/0x80
[C00000006795F3C0] [C0000000003DA054] .__wait_on_bit+0xa0/0x114
[C00000006795F470] [C0000000003DA160] .out_of_line_wait_on_bit+0x98/0xc8
[C00000006795F570] [C0000000000FD19C] .__wait_on_buffer+0x30/0x48
[C00000006795F5F0] [C0000000000FDAC0] .__block_prepare_write+0x3ac/0x440
[C00000006795F720] [C0000000000FDB88] .block_prepare_write+0x34/0x64
[C00000006795F7A0] [C000000000104C4C] .blkdev_prepare_write+0x28/0x40
[C00000006795F820] [C0000000000C09C4] .generic_file_buffered_write+0x420/0x76c
[C00000006795F960] [C0000000000C10B4] .__generic_file_aio_write_nolock+0x3a4/0x448
[C00000006795FA60] [C0000000000C1524] .generic_file_aio_write_nolock+0x30/0xa4
[C00000006795FB00] [C0000000000C1AAC] .generic_file_write_nolock+0x78/0xb0
[C00000006795FC70] [C00000000010464C] .blkdev_file_write+0x20/0x34
[C00000006795FCF0] [C0000000000F93FC] .vfs_write+0x118/0x200
[C00000006795FD90] [C0000000000F9B6C] .sys_write+0x4c/0x8c
[C00000006795FE30] [C0000000000086A4] syscall_exit+0x0/0x40
ibmvscsi 30000002: Command timed out (1). Resetting connection
sd 0:0:3:0: abort bad SRP RSP type 1
sd 0:0:3:0: timing out command, waited 360s
end_request: I/O error, dev sdb, sector 104667554
Buffer I/O error on device dm-2, logical block 817712
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 817713
...
ibmvscsi 30000002: Command timed out (1). Resetting connection
printk: 46 messages suppressed.
sd 0:0:3:0: abort bad SRP RSP type 1

At this point I initiated a system dump which is attached.

--- Additional comment from bugproxy.com on 2010-01-06 11:36:22 EST ---

------- Comment From tpnoonan.com 2010-01-06 10:03 EDT-------
Hi Red Hat, In case IBM needs to do a firmware/software update to the machines
at Red Hat, can you please clarify if this problem is on the Red Hat owned PWR
systems in MN or on the other PWR systems loaned to Red Hat by IBM? Thanks.

--- Additional comment from nstraz on 2010-01-06 11:55:09 EST ---

These are the Red Hat owned systems in MN.  I already did all the firmware and software upgrades to the latest versions.

--- Additional comment from bugproxy.com on 2010-01-06 14:48:11 EST ---

------- Comment From kumarr.com 2010-01-06 14:39 EDT-------
Red Hat,

Thanks for the latest update and the dump. At this point, it looks like you have encountered a different bug.

We have some questions/requests for you:

1. Would it be possible for you to provide us with new error logs, such as the FSP error logs and other logs on the VIOS, related to the problem you are seeing?

2. Please provide the uname -a output for the kernel you ran these tests on.

3. If there is anything more you can add to the console output, that would be great.

Thanks!

--- Additional comment from nstraz on 2010-01-06 15:37:25 EST ---

Created an attachment (id=382071)
error and console logs from basic related to system dump

Attached are the Error/Event Log entries from around the time I initiated the system dump.  Following that in the attachment are the console logs from basic-p1.

The kernel I'm using is the latest development kernel for RHEL 5.5.
Linux version 2.6.18-183.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Mon Dec 21 18:42:39 EST 2009

I've continued running this test and I've hit this again on another pSeries box.  I initiated a system dump there too.  I also get into a situation where one partition can't see the shared disk and the other can't see its root disk.

--- Additional comment from bugproxy.com on 2010-01-06 16:55:00 EST ---

------- Comment From brking.com 2010-01-06 16:46 EDT-------
Are there any error logs in the VIOS? If you ssh to the VIOS as padmin and run the "errlog" command, it will list the errors. "errlog -ls" will list all of the additional error log data for all the errors.
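
For example (a sketch; the VIOS hostname is a placeholder and the output format varies by VIOS level):

ssh padmin@vios1   # log in to the VIOS restricted shell as padmin
errlog             # summary listing of logged errors
errlog -ls         # detailed listing of every entry, suitable for capturing and attaching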

--- Additional comment from nstraz on 2010-01-06 17:33:16 EST ---

Created an attachment (id=382106)
errlog -ls output from basic

Thank you for the errlog instructions.  There are lots of entries, so I've compressed them with bzip2 and attached the result.

--- Additional comment from bugproxy.com on 2010-01-07 11:25:14 EST ---

------- Comment From brking.com 2010-01-07 11:19 EDT-------
This is looking like it might be a VIOS issue. I discussed with some folks in VIOS and they need a PMR opened so they can work the problem. Can you open a software PMR against VIOS? Once you have the PMR opened, let me know the PMR number and I can make sure it gets escalated appropriately. They will need to know your configuration in the PMR and the service team will most likely request some logs as well.

--- Additional comment from nstraz on 2010-01-08 08:48:12 EST ---

I was able to open a PMR over the phone; it is 30158379.

While doing so we found out that our software support contract had expired in August 2009.  Someone from IBM Sales is supposed to contact me with pricing and I was given the tracking number 1572759.

--- Additional comment from bugproxy.com on 2010-01-14 11:43:42 EST ---

------- Comment From brking.com 2010-01-14 11:25 EDT-------
I've discussed the problem with VIOS development. Would it be possible to provide remote access to the VIOS so that VIOS development can take a look at the system live? This would help immensely in resolving this issue in a timely fashion.

--- Additional comment from bugproxy.com on 2010-01-15 11:10:24 EST ---

------- Comment From tpnoonan.com 2010-01-15 10:34 EDT-------
Hi Red Hat, since the original problem has been resolved, is the new problem in the way of moving Red Hat Cluster Suite on PWR LPARs from "tech preview" to full support in RHEL 5.5? Or is the new problem "just" a defect to be resolved? Thanks.

--- Additional comment from nstraz on 2010-01-15 11:14:19 EST ---

I haven't been able to get through enough testing to determine if the hang on reboot is gone because the new problem of virtual disks disappearing is getting in the way.

Comment 1 IBM Bug Proxy 2010-01-15 21:10:48 UTC
------- Comment From kumarr.com 2010-01-15 16:07 EDT-------
Mirroring to IBM

Comment 2 IBM Bug Proxy 2010-01-18 17:11:07 UTC
------- Comment From brking.com 2010-01-18 12:00 EDT-------
One of the things the VIOS team has indicated they will need is a snap from the failing VIOSes. Can you log in as padmin to the VIOSes you are seeing the problem on and run the "snap" command? It should collect a bunch of data and create a snap file in the home directory. Please attach the snap data to this bug when it is available.
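
Roughly (a sketch; the hostname is a placeholder and the exact snap file name depends on the VIOS level):

ssh padmin@vios1   # log in to the failing VIOS as padmin
snap               # collects diagnostic data and writes a snap file in padmin's home directory
ls                 # confirm the snap file exists before copying it off and attaching it here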

Comment 3 Nate Straz 2010-01-18 20:15:07 UTC
I hit this again over the weekend, this time a little differently.  The systems were idle and I found this message on the VIOS partition's console:

Thu Jan 14 23:51:07 CST 2010
Automatic Error Log Analysis for hdisk1 has detected a problem.
The Service Request Number is 
  2643-129: Error log analysis indicates a SCSI bus problem..


hdisk1 corresponds to one of the internal SCSI disks.  This caused the Linux partitions to fail and the VIOS console was unresponsive when I tried it.  I initiated a system dump which I will attach.  After the system came back up I ran the snap command and I'll attach that file as well.

Comment 4 Nate Straz 2010-01-18 20:16:48 UTC
When the Linux LPARs came back up after comment 3 they tried to boot off the network instead of the disk.  I went back into SMS and listed all bootable devices.

 Version SF240_382
 SMS 1.6 (c) Copyright IBM Corp. 2000,2005 All rights reserved.
-------------------------------------------------------------------------------
 Select Device
 Device  Current  Device
 Number  Position  Name
 1.        2      Virtual Ethernet
                  ( loc=U9110.51A.06B1DFD-V3-C4-T1 )
 2.        -      SCSI CD-ROM
                  ( loc=U9110.51A.06B1DFD-V3-C2-T1-W8200000000000000-L0 )








 -------------------------------------------------------------------------------
 Navigation keys:
 M = return to Main Menu
 ESC key = return to previous screen         X = eXit System Management Services
 -------------------------------------------------------------------------------
 Type menu item number and press Enter or select Navigation key:

Comment 5 Nate Straz 2010-01-18 20:30:15 UTC
Created attachment 385236 [details]
system dump from basic after hdisk1 errors

Comment 6 IBM Bug Proxy 2010-01-18 22:25:20 UTC
------- Comment From brking.com 2010-01-18 17:11 EDT-------
Is this problem being seen on just one system, multiple systems, or all of the systems running this test?

Comment 7 Nate Straz 2010-01-18 22:30:58 UTC
I've hit this on all three systems I'm running the test on.

Comment 8 IBM Bug Proxy 2010-01-18 22:34:54 UTC
------- Comment From brking.com 2010-01-18 17:23 EDT-------
Did this problem not occur with RHEL 5.4, or is this the first time these tests have been run? If this is a regression with RHEL 5.5, it's possible this is related to a couple of new features that have been enabled in the ibmvscsi driver in Linux. They can be disabled through module parameters. You could try adding the following module parameters to /etc/modprobe.conf:

options ibmvscsic fast_fail=0 client_reserve=0

Rebuild the initrd as appropriate, reboot, and retry...
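
On a RHEL 5 guest those steps amount to roughly the following (a sketch; module parameters in /etc/modprobe.conf are space-separated, and the initrd path assumes the standard naming for the running kernel):

# add the module options to /etc/modprobe.conf
echo 'options ibmvscsic fast_fail=0 client_reserve=0' >> /etc/modprobe.conf

# rebuild the initrd so the driver picks the options up at boot, then reboot
mkinitrd -f /boot/initrd-$(uname -r).img $(uname -r)
reboot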

Comment 9 Nate Straz 2010-01-18 22:44:55 UTC
Would the changes to the ibmvscsi driver cause SMS not to see the virtual disks?

Comment 10 IBM Bug Proxy 2010-01-18 23:23:36 UTC
------- Comment From brking.com 2010-01-18 18:11 EDT-------
(In reply to comment #10)
> Would the changes to the ibmvscsi driver cause SMS not to see the virtual
> disks?

I don't think the chances are very high that these changes have anything to do with the problem we are seeing; I just suggested it as something we can do on the client side to eliminate variables. The ibmvscsi driver changes alter the behavior of the VIOS regarding error recovery. These settings should be discarded when the OpenFirmware driver loads and initializes the VSCSI adapter, which is why it's not very likely this is the cause of the problem. However, it's possible that due to fast_fail being enabled, the VIOS is timing out commands sooner, which is sending it down a different error recovery path such that the VSCSI adapter becomes "unavailable" for a period of time.

Comment 11 IBM Bug Proxy 2010-01-19 18:51:06 UTC
------- Comment From brking.com 2010-01-19 12:50 EDT-------
Thanks for providing the snap file. VIOS development is in the process of analyzing it. One thing that was noticed is that there is an FC adapter with a non-IBM part number in the system. Is that expected?

Comment 12 Nate Straz 2010-01-19 19:41:09 UTC
Yes, that is expected.  We bought them through CDW and selected the version which was compatible with the pSeries hardware.

What does SMS look for to determine if a disk is bootable?  I moved the unbootable volume from basic-p2 to basic-p1 and the usual boot signature is there, 0x55aa at offset 510.  I've restarted basic-p2 several times and it still is not able to boot from the vdisk.
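
For reference, the signature check was along these lines (a sketch; /dev/sdb is a placeholder for whatever device the moved volume appears as):

# read the two bytes at offset 510 of the disk; a bootable MBR ends in 55 aa
dd if=/dev/sdb bs=1 skip=510 count=2 2>/dev/null | od -An -tx1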

Comment 13 IBM Bug Proxy 2010-01-19 20:14:23 UTC
------- Comment From brking.com 2010-01-19 15:06 EDT-------
VIOS development has reviewed the snap and provided the following feedback:

It appears an unsupported Emulex 4Gb adapter is being used, because the VPD data does not match what an IBM one should look like. So this may be an off-the-shelf Emulex adapter and thus would not be supported. There is only one FC adapter in this system and its VPD is:
fcs0 U788C.001.AAA6467-P1-C14-C1-T1 FC Adapter

Part Number.................LP11000-M4
Serial Number...............VM64043527
Network Address.............10000000C95C4D2F
ROS Level and ID............02C82774
Device Specific.(Z0)........1036406D
Device Specific.(Z1)........00000000
Device Specific.(Z2)........00000000
Device Specific.(Z3)........03000909
Device Specific.(Z4)........FFC01231
Device Specific.(Z5)........02C82774
Device Specific.(Z6)........06C32715
Device Specific.(Z7)........07C32774
Device Specific.(Z8)........20000000C95C4D2F
Device Specific.(Z9)........BS2.71X4
Device Specific.(ZA)........B1D2.70A5
Device Specific.(ZB)........B2D2.71X4
Device Specific.(ZC)........00000000
Hardware Location Code......U788C.001.AAA6467-P1-C14-C1-T1
The above part number and serial number are invalid. It is also missing the Device Specific (ZM) field that AIX requires for fast fail, dynamic tracking, and IP over FC. As a point of reference, here's what a supported 4Gb adapter should look like:
fcs0 U0.1-P2-I4/Q1 FC Adapter

Part Number.................03N5029
EC Level....................A
Serial Number...............1E313BB001
Device Specific.(CC)........5759
Manufacturer................001E
FRU Number.................. 03N5029
Device Specific.(ZM)........3
Network Address.............10000000C94AE262
ROS Level and ID............02C82138
Device Specific.(Z0)........1036406D
Device Specific.(Z1)........00000000
Device Specific.(Z2)........00000000
Device Specific.(Z3)........03000909
Device Specific.(Z4)........FFC01159
Device Specific.(Z5)........02C82138
Device Specific.(Z6)........06C12138
Device Specific.(Z7)........07C12138
Device Specific.(Z8)........20000000C94AE262
Device Specific.(Z9)........BS2.10X8
Device Specific.(ZA)........B1F2.10X8
Device Specific.(ZB)........B2F2.10X8
Device Specific.(YL)........U0.1-P2-I4/Q1

Comment 14 IBM Bug Proxy 2010-01-20 16:56:43 UTC
------- Comment From brking.com 2010-01-20 10:52 EDT-------
Are there any errors on the Winchester array? We are seeing commands timing out from the VIOS.

Comment 15 Nate Straz 2010-02-09 21:11:53 UTC
Closing this as not a bug since the hardware is not supported by IBM.

