Description of problem:

Our LPAR cluster uses two LPARs on each physical machine for Linux, plus the AIX VIOS LPAR. During one of our GFS fsck tests we reboot all nodes in the cluster in order to test fsck after recovery. We send a "reboot -fin" command to each node in the cluster. Sometimes when we do that the entire machine hangs, including the VIOS LPAR. I'm not able to get any response from the serial console for the VIOS LPAR or any other LPAR on the physical host. I have to go to ASM and reboot the entire machine to recover from this state.

Version-Release number of selected component (if applicable):

How reproducible:
About 20% of the time.

Steps to Reproduce:
1. reboot -fin on all Linux LPARs at the same time.

Actual results:
Entire pSeries system becomes unusable.

Expected results:
LPARs should reboot.

Additional info:
Hi Nate - a couple of questions to help us understand this better:

- I'm not familiar with reboot -fin. Where do you run this command? Does this reboot all LPARs on a machine, or reboot the entire managed system?
- When does the hang happen - prior to the actual power cycle? After restart but prior to the fsck? During the fsck?
- Is the VIO LPAR active prior to the client LPARs (whose storage is served by the VIO server) starting?
- Do you see anything related to this hang (op panel codes, etc.) in the Serviceable Events or ASM logs?

Please post output from the following run on the HMC (or provide HMC access instructions):

- lshmc -V  <-- upper case V
- lshmc -v  <-- lower case v
- lshmc -b
- lssyscfg -r sys
(In reply to comment #1)
> Hi Nate - couple questions to help us understand this better:
>
> - I'm not familiar with reboot -fin. Where do you run this command? Does this
> reboot all lpars on a machine, or reboot the entire managed system?

That's run as root on each Linux LPAR.

> - When does the hang happen - prior to the actual power cycle? after restart
> but prior to the fsck? during the fsck?

Just after the reboot commands. All LPARs, including the VIOS LPAR, hang.

> - Is the VIO lpar active prior to the client lpars (whose storage is served by
> the vio server) starting?

It is active and the storage is working before the reboot commands are issued.

> - Do you see anything related to this hang (op panel codes, etc) in the
> Serviceable Events or ASM logs?

There is usually a new informational firmware event in the ASM log. I can attach one for you.

> Please post output from the following run on the HMC (or provide hmc access
> instructions):
>
> - lshmc -V <-- upper case V
> - lshmc -v <-- lower case v
> - lshmc -b
> - lssyscfg -r sys

These systems use IVM, not an HMC.

$ lsivm -V
9110-51A,06B1DFD,1

$ lssyscfg -r sys
name=basic,type_model=9110-51A,serial_num=06B1DFD,ipaddr=10.15.90.22,state=Operating,sys_time=11/25/09 10:19:15,power_off_policy=0,active_lpar_mobility_capable=0,inactive_lpar_mobility_capable=0,cod_mem_capable=0,cod_proc_capable=1,vet_activation_capable=1,os400_capable=0,active_lpar_share_idle_procs_capable=0,micro_lpar_capable=1,dlpar_mem_capable=1,assign_phys_io_capable=0,lhea_capable=0,max_lpars=20,max_power_ctrl_lpars=1,service_lpar_id=1,service_lpar_name=basic,mfg_default_config=0,curr_configured_max_lpars=11,pend_configured_max_lpars=11,config_version=0100020000000000,pend_lpar_config_state=enabled,lpar_avail_priority_capable=0,lpar_proc_compat_mode_capable=0,max_vios_lpar_id=1,virtual_io_server_capable=0
Created attachment 373774 [details] Sample firmware event from basic
We need to move quickly if we expect this to be resolved in RHEL-5.5. Any recommendations, Kevin? Thanks
Kevin has returned to Austin so Steve Best, onsite Power guy in Westford, will be the best source.
------- Comment From kumarr.com 2009-12-10 16:18 EDT------- Mirroring to IBM
------- Comment From kumarr.com 2009-12-11 12:05 EDT------- Hi Red Hat, I notice this problem was reported on RHEL 5.3. Have you seen this on the RHEL 5.5 builds? Have you tried this on RHEL 5.4? Also, could you please confirm that this is not a hardware issue? Thanks!
------- Comment From kumarr.com 2009-12-11 12:13 EDT-------
I was talking to Nadia Fry - our Test Team lead for Linux on Power hardware. She had some questions:

1. What hardware is this running on?
2. Is this running at the most current level of Firmware? Please provide the VIOS level you are running.

Thanks!
We did hit this during RHEL 5.4 testing. I have not attempted PPC testing on RHEL 5.5 at this time. We had an IBM support person on site and he was not able to find anything wrong with the hardware. The hardware is model 9110-51A.

$ lsfware
system:SF240_338 (t) SF240_259 (p) SF240_338 (t)
------- Comment From tpnoonan.com 2009-12-18 09:12 EDT-------
Changing IBM severity to block, as this blocks the PWR LPAR Red Hat Software Cluster Suite in RHEL5.5.
------- Comment From tpnoonan.com 2009-12-18 11:11 EDT-------
Red Hat, please consider this under the exception process for RHEL5.5, as it blocks the RHEL5.5 PWR LPAR Red Hat Software Cluster Suite, which was a technology preview in RHEL5.4 and is planned for production in RHEL5.5. Thanks.
This doesn't need an exception since it is proposed by RH engineering as a blocker.
------- Comment From kumarr.com 2009-12-18 14:38 EDT-------
Hello Red Hat, We have been working with Steven Sombar, a developer in our Firmware team, and he has this request. Could you please provide this information ASAP? Thanks!
-----------------------------------------------------
We will need an FSPDUMP (service processor dump) to see what is going on (maybe one already happened automatically!)

You can initiate fspdumps from ASMI (usually http://<fsp_ip_address>):
- Log in as dev (usually dev / FipSdev)
- Under System Configuration -> Hardware Deconfiguration, you can view what is deconfigured (if anything).
- You can view error logs with System Service Aids -> Error/Event Logs.
- You can determine cec/fsp code level with this command from "System Service Aids" -> "Service Processor Command Line":
  registry -r fstp/DriverName
- Manually initiate a service processor dump (I think System Service Aids -> Service Processor Dump)

Dumps are uploaded to
HMC -------------------> /dump
Linux -----------------> /var/log/dump

Send up all dumps that you find, or kindly upload them to /afs or /gsa and we will go from there.

-Steven Sombar
smsombar @ us.ibm.com
(In reply to comment #14)
> We will need a FSPDUMP (service processor dump) to see what is going on
> (maybe one already happened automatically!)
> You can initiate fspdumps from asmi (usually http://<fsp_ip_address>)
> - Log in as dev (usually dev / FipSdev)

This login is not working. We're running firmware SF240_338.

> - Under System Configuration ->Hardware Deconfiguration, you can view what is
> deconfigured (if anything).

Nothing is deconfigured.

> - You can view error logs with System Service Aids->Error/Event Logs.

I have a sample attached to the bug already, but I can attach another example from another machine.

> - You can determine cec/fsp code level with this command
> registry -r fstp/DriverName from "System Service Aids"->"Service Processor
> Command Line"
> - Manually initiate a service processor dump (I think System Service
> Aids->service processor dump)
> Dumps are uploaded to
> HMC -------------------> /dump
> Linux -----------------> /var/log/dump
> Send up all dumps that you find, or kindly upload them
> to /afs or /gsa and we will go from there.

I can't do this until I can log in as dev.
------- Comment From kumarr.com 2009-12-18 15:28 EDT-------
Hi Red Hat, As we are looking into this, can you please update to the latest firmware level? I understand that some stability issues have been resolved in the most recent update for your system. Thanks!
------- Comment From smsombar.com 2009-12-18 22:27 EDT-------
Here is the daily dev login password to ASMI. Use this to log in to ASMI and initiate the fspdumps, etc. if needed after the firmware update, or anytime. The password is different every day.

Family: i-pFSP_DE_9110-51A
Machine Type/Model: 9110-51A
Serial#: 06B1DFD

12/18/2009 0D9EAE46
12/19/2009 62D5BB43
12/20/2009 02648540
12/21/2009 57A3965D
12/22/2009 A4FEE35A
12/23/2009 4489ED57
Created attachment 379740 [details]
Logs from the latest recreation

I updated the firmware to SF240_382 on all nodes and updated a few other drivers which IVM found updates for. I was able to reproduce this again using the latest RHEL 5.5 nightly build. Attached are three hidden log entries from around the time I started running the test case.

Here is the firmware level from the service processor command line:

Command entered: registry -r fstp/DriverName
fips240/b0421a_0910.240

I initiated a service processor dump but I am unclear where the files are supposed to show up. We're running IVM and three Linux LPARs, all of which are hung when I get into this state. Where should I find the service processor dump? Do I need any specific packages installed in order to retrieve the dumps?
------- Comment From smsombar.com 2009-12-22 08:57 EDT-------
The attachment (id=50147) [details] "Logs from the latest recreation" shows only Informational logs, which normally can safely be ignored. We are mostly concerned with Predictive or Unrecoverable logs; none seen yet.

If there is no HMC, look here:
Linux -----------------> /var/log/dump

If you don't find any, run this command on the fsp command line:
fipsdump --list

This will tell us the name of the dump(s) and whether they have been uploaded or not to the Linux partition. Note: if there is more than one FSP, please execute this command on both FSPs.

-Steven Sombar
Created attachment 379829 [details] fsp dump I found the fsp dump in IVM and was able to download it. Also, we're running VIOS version 1.5.2.1-FP-11.1. I know this is far behind and I'll try to find some time to get it upgraded to 2.1.2.0.
------- Comment From smsombar.com 2009-12-22 11:17 EDT-------
FSPDUMP.06B1DFD.01000002.20091221234156 (length 131071) does not expand with our tools. :-( It is an "incomplete" dump.

- Was there adequate time between dump initiation and the upload?
- This is a binary file. If it was FTP-ed, was binary mode used?

Sometimes an incomplete dump can happen due to hardware problems or if the dump is interrupted for some reason.

-Steven Sombar
(In reply to comment #21)
> FSPDUMP.06B1DFD.01000002.20091221234156
> length 131071
> does not expand with our tools. :-( It is an "incomplete" dump.
> - was there adequate time between dump initiation and the upload?

There were several hours. IVM only showed a 128k dump. How large should they be? I issued the first service processor dump while the VIOS partition was not responding. I issued another one just a few minutes ago. I have not seen it show up in IVM yet. How long does that take?

The output from fipsdump --list is now:

[0] FSPDUMP.06B1DFD.01000002.20091221234156 id:01000002 o:00000004 s:0005a065 v:01 r=00000000 t:009 e0000 d:00000004 dev:0
[1] FSPDUMP.06B1DFD.01000003.20091222170031 id:01000003 o:0005a06c s:0005c163 v:01 r=00000000 t:009 e0000 d:0005a06c dev:0
Success Exitting fips dump 0:

> - this is a binary file. If it was FTP-ed, was binary mode used?

No, it was downloaded directly from IVM over HTTP and uploaded to bugzilla.

> Sometimes an incomplete dump can happen due to hardware problems
> or if the dump is interrupted for some reason.

What process is supposed to pull the dump from the service processor to the Linux partition?
------- Comment From smsombar.com 2009-12-22 15:26 EDT-------
There should be some Predictive or Unrecoverable SRCs! If not, it will be difficult to debug the reboot issue. ASMI's Platform Error Log will show Predictive or Unrecoverable errors. I am perplexed as to why you are not seeing any.

A manually (or automatically) initiated service processor dump (fspdump) taken near the time of system failure will also report key Predictive or Unrecoverable errors. It also shows other data such as CTRMs and bad hardware.

FSPDUMPs (service processor dumps) are normally 300k-1.2M and typically take only a few minutes to upload to the partition (in this case /var/log/dump) or HMC. If the partition or HMC is not up, the fsp will keep the dump until it sees something it can upload the dump to. The OS is notified that a dump is available via RTAS notification, I believe (I am not a dump expert).

I see FSPDUMP.06B1DFD.01000002.20091221234156 is reported to be only 128k bytes. However, the fsp sees its size as s=0005a065, which is 368741 bytes. So my guess is that this dump was likely not fully uploaded to the partition.

What we in development might do in this situation would be to
- mount a directory from a "linux companion box" via NFS to /nfs on the fsp, and offload any fspdumps with the command fipsdump --r, or
- attach an HMC, or
- restart the partition. The fsp should complete uploading of the dumps, provided the partition is properly configured.

I am also thinking you might want to consult one of the PE support team experts (previously mentioned) as to the best course of action for field data collection. They might have additional suggestions.

Regards,
Steven Sombar
(In reply to comment #0)
> Steps to Reproduce:
> 1. reboot -fin on all Linux LPARs at the same time.

I think the reboot issue is related to the I/O load we put on the LPARs before we reboot them. Here's a better description of the test case:

1. Shared storage is exported to the Linux LPARs
2. A cluster is configured on the Linux LPARs
3. GFS is mounted on all Linux LPARs
4. Create an I/O load to the GFS mount point on all Linux LPARs
5. Reboot all Linux LPARs w/ `reboot -fin`

At this point the VIOS LPAR hangs and the entire system becomes unusable.
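The steps above can be sketched as a small driver script. This is only an illustrative dry run: the node names, GFS mount point, and dd load are placeholders I'm assuming for the example, not the actual gfs_fsck_stress harness, and the script records the remote commands instead of executing them.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps above. Node names and the
# GFS mount point are placeholders; the real test uses gfs_fsck_stress.
NODES="basic-p1 basic-p2"
GFS_MNT=/mnt/gfs

CMDS=""
run() {
    # Record each remote command instead of executing it (dry run).
    CMDS="$CMDS$*
"
}

# Step 4: put an I/O load on the shared GFS mount from every node.
for n in $NODES; do
    run ssh "$n" "dd if=/dev/zero of=$GFS_MNT/load.$n bs=1M count=1024 &"
done

# Step 5: force-reboot all nodes at once (-f force, -i shut down network
# interfaces, -n no sync) - this is the combination that triggers the hang.
for n in $NODES; do
    run ssh "$n" "reboot -fin"
done

printf '%s' "$CMDS"
```

To actually run it, replace the body of run() with "$@" so the recorded commands execute for real.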
Created attachment 379926 [details] fsp dump Here is a new dump I did. This wasn't done around the time the hang occurred, but after the system was rebooted and brought back online.
Created attachment 379933 [details] fsp dump id 1000004 This dump may be more useful. I took it after reproducing the reboot hang again. After rebooting the entire system the dump came up truncated in IVM (256k) so I restarted the VIOS partition and got the entire thing.
------- Comment From smsombar.com 2009-12-23 09:20 EDT-------
I have expanded FSPDUMP.06B1DFD.01000004.20091222220853 (to our local storage at /gsa/ausgsa/home/s/m/smsombar/screensms/734634/dump_01000004/dumpout).

In file Errorlogs:
The number of Unrecoverable Errors = 0
The number of Predictive Errors = 0
The number of Recovered Errors = 0

Was this dump manually initiated or did it happen automatically as a result of the hanging partitions?

Without any specific errors recorded, I am thinking the phyp or VIOS team will need to help debug this one. So I am sending defect 734634/FW502123 (our internal version of this BZ) to the phyp team next.

-Steven Sombar
(In reply to comment #27) > Was this dump manually initiated or did it happen automatically as a > result of the hanging partitions? It was manually initiated from ASMI while the partitions were hung.
------- Comment From smsombar.com 2009-12-23 09:52 EDT-------
Currently 734634/FW502123 is owned by ajmac.com. I tried to add him to this BZ as a CC but bz said cryptically:

CC: ajmac.com did not match anything

I guess that means ajmac has no direct BZ access.
------- Comment From lxie.com 2009-12-23 10:35 EDT------- Ramon, have you seen this in your RHCS testing (rhel5 or rhel6)?
------- Comment From kumarr.com 2009-12-28 12:20 EDT-------
Hello Red Hat, Our PHYP team which works on VIOS is working on debugging this issue. Could you please look at the following questions from them and provide us with a sysdump as mentioned below? Please let us know if you are having problems. Thanks!

If all partitions hang, have you taken a platform (PHYP) sysdump? If so, where is it located? Otherwise, is a remote connection with piranha phyp viable?
------- Comment From lxie.com 2009-12-29 09:26 EDT------- Ramon, have you seen this in your RHCS testing?
------- Comment From rcvalle.ibm.com 2009-12-29 09:58 EDT-------
I didn't see this occur in my RHCS testing environment. Can you please attach a copy of your cluster configuration file? Do you have any fence devices configured? Are you using two_node="1" in the cluster configuration?
Created attachment 381645 [details]
cluster.conf

I'm in the process of upgrading all VIOS partitions to 2.1 w/ Fix Pack 22 and interim fix IZ63813. I'll attempt to reproduce the hang and get a partition dump after the upgrade is complete.

Attached is the cluster.conf I'm using. I am using IVM for fencing. We have three pSeries model 9110-51A with two Linux LPARs on each in the cluster. I hit this most often during the gfs_fsck_stress test when an I/O load is applied to the shared file system and then all nodes are rebooted.
------- Comment From brking.com 2010-01-04 18:07 EDT------- If the problem still occurs after the VIOS is updated, a system dump would be requested when the system is in the hung state. To capture this dump, on ASMI, under System Service Aids -> System Dump, select Initiate. Once the system is back up after the dump, it will be written to /var/adm/ras/platform on the VIOS. There will be both a SYSDUMP and an FSPDUMP. Please upload both of them for analysis. Thanks.
------- Comment From rcvalle.ibm.com 2010-01-05 07:22 EDT-------
There are some things I'd like you to try:

1. Create a failover domain for all nodes of your cluster
2. Add the tag <fcdriver>lpfc</fcdriver> in each "clusternode" tag
3. Change the "name" attribute of each "method" tag to "1" in each "fence" tag
4. Use the full hostname in the "ipaddr" attribute of the "fencedevice" tag
5. Create a resource and a service for the GFS filesystem
6. Try to fence one node of your cluster manually using the fence_lpar script and see if it works as expected
7. Make sure that all nodes of your cluster have each other and the fence devices listed in their "/etc/hosts" files

(In reply to comment #39)
> Created an attachment (id=50211) [details]
> cluster.conf
> ------- Comment on attachment From nstraz 2010-01-04 16:04:47 EDT-------
>
> I'm in the process of upgrading all VIOS partitions to 2.1 w/ Fix Pack 22 and
> interim fix IZ63813. I'll attempt to reproduce the hang and get a partition
> dump after the upgrade is complete.
>
> Attached is the cluster.conf I'm using. I am using IVM for fencing. We have
> three pSeries model 9110-51A with two Linux LPARs on each in the cluster. I
> hit this most often during the gfs_fsck_stress test when an I/O load is applied
> to the shared file system then all nodes are rebooted.
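For item 6, a manual fence test might look something like the following. This is only a hedged sketch: the IVM address, login, password, and partition name are placeholder assumptions, the exact set of fence_lpar options varies by release (check its man page), and the command is printed rather than executed.

```shell
#!/bin/sh
# Sketch of manually fencing one node via the IVM (item 6 above).
# Address, login, password, and partition name below are placeholders.
IVM_ADDR=ivm.example.com
CMD="fence_lpar -a $IVM_ADDR -l padmin -p PASSWORD -n basic-p1 -o status"
# Print rather than execute, so this stays a harmless dry run:
echo "would run: $CMD"
```

A "status" action is a safe first check; once that works, "-o reboot" would exercise the same path the cluster uses when it fences a node.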
Created attachment 381747 [details] cluster.conf ------- Comment (attachment only) From rcvalle.ibm.com 2010-01-05 07:23 EDT-------
Created attachment 381871 [details]
system dump compressed with bzip2

After getting everything updated to the latest VIOS fix level 2.1.2.10-FP-22 I have not yet been able to recreate the hang. I have hit another odd issue. Sometimes when basic-p1 reboots (one of the Linux LPARs) the shared disk is not available. I see messages like this during boot:

scsi 0:0:3:0: aborting command. lun 0x8300000000000000, tag 0xc00000007faf2af8
scsi 0:0:3:0: aborted task tag 0xc00000007faf2af8 completed
scsi 0:0:3:0: timing out command, waited 22s

The shared disk is a Winchester FC-SATA array attached via an Emulex LightPulse FC card and QLogic switches. IVM still says the physical disk is available and there were no other messages. After rebooting the LPAR again the disk came back. After a few cycles of this I started getting errors in Linux:

INFO: task gfs_mkfs:3861 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gfs_mkfs      D 000000000ff111f8  6624  3861  3860 (NOTLB)
Call Trace:
[C00000006795EF50] [C0000000679D2C30] 0xc0000000679d2c30 (unreliable)
[C00000006795F120] [C000000000010AA0] .__switch_to+0x124/0x148
[C00000006795F1B0] [C0000000003D8F38] .schedule+0xc08/0xdbc
[C00000006795F2C0] [C0000000003D9C3C] .io_schedule+0x58/0xa8
[C00000006795F350] [C0000000000FD2D8] .sync_buffer+0x68/0x80
[C00000006795F3C0] [C0000000003DA054] .__wait_on_bit+0xa0/0x114
[C00000006795F470] [C0000000003DA160] .out_of_line_wait_on_bit+0x98/0xc8
[C00000006795F570] [C0000000000FD19C] .__wait_on_buffer+0x30/0x48
[C00000006795F5F0] [C0000000000FDAC0] .__block_prepare_write+0x3ac/0x440
[C00000006795F720] [C0000000000FDB88] .block_prepare_write+0x34/0x64
[C00000006795F7A0] [C000000000104C4C] .blkdev_prepare_write+0x28/0x40
[C00000006795F820] [C0000000000C09C4] .generic_file_buffered_write+0x420/0x76c
[C00000006795F960] [C0000000000C10B4] .__generic_file_aio_write_nolock+0x3a4/0x448
[C00000006795FA60] [C0000000000C1524] .generic_file_aio_write_nolock+0x30/0xa4
[C00000006795FB00] [C0000000000C1AAC] .generic_file_write_nolock+0x78/0xb0
[C00000006795FC70] [C00000000010464C] .blkdev_file_write+0x20/0x34
[C00000006795FCF0] [C0000000000F93FC] .vfs_write+0x118/0x200
[C00000006795FD90] [C0000000000F9B6C] .sys_write+0x4c/0x8c
[C00000006795FE30] [C0000000000086A4] syscall_exit+0x0/0x40

ibmvscsi 30000002: Command timed out (1). Resetting connection
sd 0:0:3:0: abort bad SRP RSP type 1
sd 0:0:3:0: timing out command, waited 360s
end_request: I/O error, dev sdb, sector 104667554
Buffer I/O error on device dm-2, logical block 817712
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 817713
...
ibmvscsi 30000002: Command timed out (1). Resetting connection
printk: 46 messages suppressed.
sd 0:0:3:0: abort bad SRP RSP type 1

At this point I initiated a system dump, which is attached.
------- Comment From tpnoonan.com 2010-01-06 10:03 EDT------- Hi Red Hat, In case IBM needs to do a firmware/software update to the machines at Red Hat, can you please clarify if this problem is on the Red Hat owned PWR systems in MN or on the other PWR systems loaned to Red Hat by IBM? Thanks.
These are the Red Hat owned systems in MN. I already did all the firmware and software upgrades to the latest versions.
------- Comment From kumarr.com 2010-01-06 14:39 EDT-------
Red Hat, Thanks for the latest update and the dump. At this point, it looks like you have encountered a different bug. We have some questions/requests for you:

1. Would it be possible for you to provide us with new error logs, like fsp error logs and other logs on the VIOS, related to the problem you are seeing?
2. Please provide the uname -a output from the system you ran these tests on.
3. If there is anything more you can add to the console output, that would be great.

Thanks!
Created attachment 382071 [details]
error and console logs from basic related to system dump

Attached are the Error/Event Logs entries from around the time I initiated the system dump. Following that in the attachment are the console logs from basic-p1.

The kernel I'm using is the latest development kernel for RHEL 5.5:

Linux version 2.6.18-183.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Mon Dec 21 18:42:39 EST 2009

I've continued running this test and I've hit this again on another pSeries box. I initiated a system dump there too. I also get into a situation where one partition can't see the shared disk and the other can't see its root disk.
------- Comment From brking.com 2010-01-06 16:46 EDT------- Are there any error logs in the VIOS? If you ssh to the VIOS as padmin, then run the "errlog" command it will list the errors. "errlog -ls" will list all the additional error log data for all the errors.
Created attachment 382106 [details]
errlog -ls output from basic

Thank you for the errlog instructions. There are lots of entries, so I've compressed them with bzip2 and attached the result.
------- Comment From brking.com 2010-01-07 11:19 EDT------- This is looking like it might be a VIOS issue. I discussed with some folks in VIOS and they need a PMR opened so they can work the problem. Can you open a software PMR against VIOS? Once you have the PMR opened, let me know the PMR number and I can make sure it gets escalated appropriately. They will need to know your configuration in the PMR and the service team will most likely request some logs as well.
I was able to open a PMR over the phone, it is 30158379. While doing so we found out that our software support contract had expired in August 2009. Someone from IBM Sales is supposed to contact me with pricing and I was given the tracking number 1572759.
------- Comment From brking.com 2010-01-14 11:25 EDT------- I've discussed the problem with VIOS development. Would it be possible to provide remote access to the VIOS so that VIOS development can take a look at the system live? This would help immensely in resolving this issue in a timely fashion.
------- Comment From tpnoonan.com 2010-01-15 10:34 EDT-------
Hi Red Hat, Since the original problem has been resolved, is the new problem in the way of moving PWR LPAR Red Hat Software Cluster Suite from "tech preview" to full support in RHEL5.5? Or is the new problem "just" a defect to be resolved? Thanks.
I haven't been able to get through enough testing to determine if the hang on reboot is gone because the new problem of virtual disks disappearing is getting in the way.
I just copied the disk related issues to new bug 555871 and will continue with that issue there. IBM should probably mirror that bug for the VIOS team to follow.
------- Comment From tpnoonan.com 2010-01-15 14:51 EDT-------
Hi Red Hat, IBM can't reproduce either problem in our test bed. Note that the disk storage you have in MN is not in the support matrix.
While the shared storage is not in the support matrix, the internal storage used for the LPARs' root disks is, and that disappears on reboots too. We have our LPARs set to boot from the HD and then the network, and during my testing they'll sometimes boot from the network instead of the HD, which is a volume local to each system.
Nate, last I heard for this issue the hardware that you are running with wasn't support and IBM is trying to get you supported hardware that you can test with. Should we close this bz, or is there some reason that we should leave it open? Thanks, Steve
Considering this not a bug since the hardware is not supported by IBM.
------- Comment From kumarr.com 2010-02-09 17:37 EDT------- Closing per above comments