Description of problem:

Our LPAR cluster uses two LPARs on each physical machine for Linux, plus the AIX VIOS LPAR. During one of our GFS fsck tests we reboot all nodes in the cluster in order to test fsck after recovery. We send a "reboot -fin" command to each node in the cluster. Sometimes when we do that the entire machine hangs, including the VIOS LPAR. I'm not able to get any response from the serial console for the VIOS LPAR or any other LPAR on the physical host. I have to go to ASM and reboot the entire machine to recover from this state.

Version-Release number of selected component (if applicable):

How reproducible:
About 20% of the time.

Steps to Reproduce:
1. reboot -fin on all Linux LPARs at the same time.

Actual results:
Entire pSeries system becomes unusable.

Expected results:
LPARs should reboot.

Additional info:
Hi Nate - a couple of questions to help us understand this better:

- I'm not familiar with reboot -fin. Where do you run this command? Does this reboot all LPARs on a machine, or reboot the entire managed system?
- When does the hang happen - prior to the actual power cycle? After restart but prior to the fsck? During the fsck?
- Is the VIO LPAR active prior to the client LPARs (whose storage is served by the VIO server) starting?
- Do you see anything related to this hang (op panel codes, etc.) in the Serviceable Events or ASM logs?

Please post output from the following run on the HMC (or provide HMC access instructions):

- lshmc -V  <-- upper case V
- lshmc -v  <-- lower case v
- lshmc -b
- lssyscfg -r sys
(In reply to comment #1)
> Hi Nate - couple questions to help us understand this better:
>
> - I'm not familiar with reboot -fin. Where do you run this command? Does this
> reboot all lpars on a machine, or reboot the entire managed system?

That's run as root on each Linux LPAR.

> - When does the hang happen - prior to the actual power cycle? after restart
> but prior to the fsck? during the fsck?

Just after the reboot commands. All LPARs, including the VIOS LPAR, hang.

> - Is the VIO lpar active prior to the client lpars (whose storage is served by
> the vio server) starting?

It is active and the storage is working before the reboot commands are issued.

> - Do you see anything related to this hang (op panel codes, etc) in the
> Serviceable Events or ASM logs?

There is usually a new informational firmware event in the ASM log. I can attach one for you.

> Please post output from the following run on the HMC (or provide hmc access
> instructions):
>
> - lshmc -V <-- upper case V
> - lshmc -v <-- lower case v
> - lshmc -b
> - lssyscfg -r sys

These systems use IVM, not an HMC.

$ lsivm -V
9110-51A,06B1DFD,1

$ lssyscfg -r sys
name=basic,type_model=9110-51A,serial_num=06B1DFD,ipaddr=10.15.90.22,state=Operating,sys_time=11/25/09 10:19:15,power_off_policy=0,active_lpar_mobility_capable=0,inactive_lpar_mobility_capable=0,cod_mem_capable=0,cod_proc_capable=1,vet_activation_capable=1,os400_capable=0,active_lpar_share_idle_procs_capable=0,micro_lpar_capable=1,dlpar_mem_capable=1,assign_phys_io_capable=0,lhea_capable=0,max_lpars=20,max_power_ctrl_lpars=1,service_lpar_id=1,service_lpar_name=basic,mfg_default_config=0,curr_configured_max_lpars=11,pend_configured_max_lpars=11,config_version=0100020000000000,pend_lpar_config_state=enabled,lpar_avail_priority_capable=0,lpar_proc_compat_mode_capable=0,max_vios_lpar_id=1,virtual_io_server_capable=0
Created attachment 373774 [details] Sample firmware event from basic
We need to move quickly if we expect this to be resolved in RHEL-5.5. Any recommendations, Kevin? Thanks
Kevin has returned to Austin so Steve Best, onsite Power guy in Westford, will be the best source.
------- Comment From kumarr.com 2009-12-10 16:18 EDT------- Mirroring to IBM
------- Comment From kumarr.com 2009-12-11 12:05 EDT------- Hi Red Hat, I notice this problem was reported on RHEL 5.3. Have you seen this on the RHEL 5.5 builds? Have you tried this on RHEL 5.4? Also, could you please confirm that this is not a hardware issue? Thanks!
------- Comment From kumarr.com 2009-12-11 12:13 EDT-------
I was talking to Nadia Fry - our Test Team lead for Linux on Power hardware. She had some questions:

1. What hardware is this running on?
2. Is this running at the most current level of Firmware? Please provide the VIOS level you are running.

Thanks!
We did hit this during RHEL 5.4 testing. I have not attempted PPC testing on RHEL 5.5 at this time. We had an IBM support person on site and he was not able to find anything wrong with the hardware. The hardware is model 9110-51A.

$ lsfware
system:SF240_338 (t) SF240_259 (p) SF240_338 (t)
------- Comment From tpnoonan.com 2009-12-18 09:12 EDT-------
Changing IBM severity to block, as this blocks the PWR LPAR Red Hat Software Cluster Suite in RHEL5.5.
------- Comment From tpnoonan.com 2009-12-18 11:11 EDT-------
Red Hat, please consider this under the exception process for RHEL5.5, as it blocks the RHEL5.5 PWR LPAR Red Hat Software Cluster Suite, which was a technology preview in RHEL5.4 and is planned for production in RHEL5.5. Thanks.
This doesn't need an exception since it is proposed by RH engineering as a blocker.
------- Comment From kumarr.com 2009-12-18 14:38 EDT-------
Hello Red Hat, We have been working with Steven Sombar, a developer in our Firmware team, and he has this request. Could you please provide this information ASAP? Thanks!
-----------------------------------------------------
We will need an FSPDUMP (service processor dump) to see what is going on (maybe one already happened automatically!)

You can initiate fspdumps from ASMI (usually http://<fsp_ip_address>):
- Log in as dev (usually dev / FipSdev)
- Under System Configuration -> Hardware Deconfiguration, you can view what is deconfigured (if anything).
- You can view error logs with System Service Aids -> Error/Event Logs.
- You can determine cec/fsp code level with this command from "System Service Aids" -> "Service Processor Command Line":
  registry -r fstp/DriverName
- Manually initiate a service processor dump (I think System Service Aids -> Service Processor Dump)

Dumps are uploaded to
HMC -------------------> /dump
Linux -----------------> /var/log/dump

Send up all dumps that you find, or kindly upload them to /afs or /gsa and we will go from there.

-Steven Sombar
smsombar @ us.ibm.com
(In reply to comment #14)
> We will need a FSPDUMP (service processor dump) to see what is going on
> (maybe one already happened automatically!)
> You can initiate fspdumps from asmi (usually http://<fsp_ip_address>)
> - Log in as dev (usually dev / FipSdev)

This login is not working. We're running firmware SF240_338.

> - Under System Configuration ->Hardware Deconfiguration, you can view what is
> deconfigured (if anything).

Nothing is deconfigured.

> - You can view error logs with System Service Aids->Error/Event Logs.

I have a sample attached to the bug already, but I can attach another example from another machine.

> - You can determine cec/fsp code level with this command
> registry -r fstp/DriverName from "System Service Aids"->"Service Processor
> Command Line"
> - Manually initiate a service processor dump (I think System Service
> Aids->service processor dump)
> Dumps are uploaded to
> HMC -------------------> /dump
> Linux -----------------> /var/log/dump
> Send up all dumps that you find, or kindly upload them
> to /afs or /gsa and we will go from there.

I can't do this until I can log in as dev.
------- Comment From kumarr.com 2009-12-18 15:28 EDT-------
Hi Red Hat, As we are looking into this, can you please update to the latest firmware level? I understand that some stability issues have been resolved in the most recent update for your system. Thanks!
------- Comment From smsombar.com 2009-12-18 22:27 EDT-------
Here is the daily dev login password to ASMI. Use this to log in to ASMI and initiate the fspdumps, etc. if needed after the firmware update, or anytime. The password is different every day.

Family: i-pFSP_DE_9110-51A
Machine Type/Model: 9110-51A
Serial#: 06B1DFD

12/18/2009 0D9EAE46
12/19/2009 62D5BB43
12/20/2009 02648540
12/21/2009 57A3965D
12/22/2009 A4FEE35A
12/23/2009 4489ED57
Created attachment 379740 [details]
Logs from the latest recreation

I updated the firmware to SF240_382 on all nodes and updated a few other drivers which IVM found updates for. I was able to reproduce this again using the latest RHEL 5.5 nightly build. Attached are three hidden log entries from around the time I started running the test case.

Here is the firmware level from the service processor command line:

Command entered: registry -r fstp/DriverName
fips240/b0421a_0910.240

I initiated a service processor dump but I am unclear where the files are supposed to show up. We're running IVM and three Linux LPARs, all of which are hung when I get into this state. Where should I find the service processor dump? Do I need any specific packages installed in order to retrieve the dumps?
------- Comment From smsombar.com 2009-12-22 08:57 EDT-------
The attachment (id=50147) [details] "Logs from the latest recreation" shows only Informational logs, which normally can safely be ignored. We are mostly concerned with Predictive or Unrecoverable logs; none seen yet.

If there is no HMC, look here:
Linux -----------------> /var/log/dump

If you don't find any, run this command on the fsp command line:
fipsdump --list

This will tell us the name of the dump(s) and whether they have been uploaded or not to the Linux partition. Note: if there is more than one FSP, please execute this command on both FSPs.

-Steven Sombar
Created attachment 379829 [details] fsp dump I found the fsp dump in IVM and was able to download it. Also, we're running VIOS version 1.5.2.1-FP-11.1. I know this is far behind and I'll try to find some time to get it upgraded to 2.1.2.0.
------- Comment From smsombar.com 2009-12-22 11:17 EDT-------
FSPDUMP.06B1DFD.01000002.20091221234156 (length 131071) does not expand with our tools. :-( It is an "incomplete" dump.

- Was there adequate time between dump initiation and the upload?
- This is a binary file. If it was FTP-ed, was binary mode used?

Sometimes an incomplete dump can happen due to hardware problems or if the dump is interrupted for some reason.

-Steven Sombar
(In reply to comment #21)
> FSPDUMP.06B1DFD.01000002.20091221234156
> length 131071
> does not expand with our tools. :-( It is an "incomplete" dump.
> - was there adequate time between dump initiation and the upload?

There were several hours. IVM only showed a 128k dump. How large should they be? I issued the first service processor dump while the VIOS partition was not responding. I issued another one just a few minutes ago. I have not seen it show up in IVM yet. How long does that take?

The output from fipsdump --list is now:

[0] FSPDUMP.06B1DFD.01000002.20091221234156 id:01000002 o:00000004 s:0005a065 v:01 r=00000000 t:009 e0000 d:00000004 dev:0
[1] FSPDUMP.06B1DFD.01000003.20091222170031 id:01000003 o:0005a06c s:0005c163 v:01 r=00000000 t:009 e0000 d:0005a06c dev:0
Success Exitting fips dump 0:

> - this is a binary file. If it was FTP-ed, was binary mode used?

No, it was downloaded directly from IVM over HTTP and uploaded to bugzilla.

> Sometimes an incomplete dump can happen due to hardware problems
> or if the dump is interrupted for some reason.

What process is supposed to pull the dump from the service processor to the Linux partition?
------- Comment From smsombar.com 2009-12-22 15:26 EDT-------
There should be some Predictive or Unrecoverable SRCs! If not, it will be difficult to debug the reboot issue. ASMI's Platform Error Log will show Predictive or Unrecoverable errors. I am perplexed as to why you are not seeing any.

A manually (or automatically) initiated service processor dump (fspdump) taken near the time of system failure will also report key Predictive or Unrecoverable errors. It also shows other data such as CTRMs and bad hardware.

FSPDUMPs (service processor dumps) are normally 300k-1.2M and typically take only a few minutes to upload to the partition (in this case /var/log/dump) or HMC. If the partition or HMC is not up, the fsp will keep the dump until it sees something it can upload the dump to. The OS is notified that a dump is available via RTAS notification, I believe (I am not a dump expert).

I see FSPDUMP.06B1DFD.01000002.20091221234156 is reported to be only 128k bytes. However, the fsp sees its size as s=0005a065, which is 368741 bytes. So my guess is that this dump was likely not fully uploaded to the partition.

What we in development might do in this situation would be to
- mount a directory from a "linux companion box" via NFS to /nfs on the fsp, and offload any fspdumps with the command fipsdump --r, or
- attach an HMC, or
- restart the partition. The fsp should complete uploading of the dumps, provided the partition is properly configured.

I am also thinking you might want to consult one of the PE support team experts (previously mentioned) as to the best course of action for field data collection. They might have additional suggestions.

Regards,
Steven Sombar
(In reply to comment #0)
> Steps to Reproduce:
> 1. reboot -fin on all Linux LPARs at the same time.

I think the reboot issue is related to the I/O load we put on the LPARs before we reboot them. Here's a better description of the test case:

1. Shared storage is exported to the Linux LPARs
2. A cluster is configured on the Linux LPARs
3. GFS is mounted on all Linux LPARs
4. Create an I/O load to the GFS mount point on all Linux LPARs
5. Reboot all Linux LPARs w/ `reboot -fin`

At this point the VIOS LPAR hangs and the entire system becomes unusable.
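The steps above can be sketched as a small driver script. This is only an illustrative dry run: the node names, GFS mount point, and dd load are placeholders I'm assuming for the example, not the actual gfs_fsck_stress harness, and the script records the remote commands instead of executing them.

```shell
#!/bin/sh
# Dry-run sketch of the reproduction steps above. Node names and the
# GFS mount point are placeholders; the real test uses gfs_fsck_stress.
NODES="basic-p1 basic-p2"
GFS_MNT=/mnt/gfs

CMDS=""
run() {
    # Record each remote command instead of executing it (dry run).
    CMDS="$CMDS$*
"
}

# Step 4: put an I/O load on the shared GFS mount from every node.
for n in $NODES; do
    run ssh "$n" "dd if=/dev/zero of=$GFS_MNT/load.$n bs=1M count=1024 &"
done

# Step 5: force-reboot all nodes at once (-f force, -i shut down network
# interfaces, -n no sync) - this is the combination that triggers the hang.
for n in $NODES; do
    run ssh "$n" "reboot -fin"
done

printf '%s' "$CMDS"
```

To actually run it, replace the body of run() with "$@" so the recorded commands execute for real.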
Created attachment 379926 [details] fsp dump Here is a new dump I did. This wasn't done around the time the hang occurred, but after the system was rebooted and brought back online.
Created attachment 379933 [details] fsp dump id 1000004 This dump may be more useful. I took it after reproducing the reboot hang again. After rebooting the entire system the dump came up truncated in IVM (256k) so I restarted the VIOS partition and got the entire thing.
------- Comment From smsombar.com 2009-12-23 09:20 EDT-------
I have expanded FSPDUMP.06B1DFD.01000004.20091222220853 (to our local storage at /gsa/ausgsa/home/s/m/smsombar/screensms/734634/dump_01000004/dumpout).

In file Errorlogs:
The number of Unrecoverable Errors = 0
The number of Predictive Errors = 0
The number of Recovered Errors = 0

Was this dump manually initiated or did it happen automatically as a result of the hanging partitions?

Without any specific errors recorded, I am thinking the phyp or VIOS team will need to help debug this one. So I am sending defect 734634/FW502123 (our internal version of this BZ) to the phyp team next.

-Steven Sombar
(In reply to comment #27) > Was this dump manually initiated or did it happen automatically as a > result of the hanging partitions? It was manually initiated from ASMI while the partitions were hung.
------- Comment From smsombar.com 2009-12-23 09:52 EDT-------
Currently 734634/FW502123 is owned by ajmac.com. I tried to add him to this BZ as a CC but bz said cryptically:

CC: ajmac.com did not match anything

I guess that means ajmac has no direct BZ access.
------- Comment From lxie.com 2009-12-23 10:35 EDT------- Ramon, have you seen this in your RHCS testing (rhel5 or rhel6)?
------- Comment From kumarr.com 2009-12-28 12:20 EDT-------
Hello Red Hat, Our PHYP team which works on VIOS is working on debugging this issue. Could you please look at the following questions from them and provide us with a sysdump as mentioned below? Please let us know if you are having problems. Thanks!

If all partitions hang, have you taken a platform (PHYP) sysdump? If so, where is it located? Otherwise, is a remote connection with piranha phyp viable?
------- Comment From lxie.com 2009-12-29 09:26 EDT------- Ramon, have you seen this in your RHCS testing?
------- Comment From rcvalle.ibm.com 2009-12-29 09:58 EDT-------
I didn't see this occur in my RHCS testing environment. Can you please attach a copy of your cluster configuration file? Do you have any fence devices configured? Are you using two_node="1" in the cluster configuration?
Created attachment 381645 [details]
cluster.conf

I'm in the process of upgrading all VIOS partitions to 2.1 w/ Fix Pack 22 and interim fix IZ63813. I'll attempt to reproduce the hang and get a partition dump after the upgrade is complete.

Attached is the cluster.conf I'm using. I am using IVM for fencing. We have three pSeries model 9110-51A with two Linux LPARs on each in the cluster. I hit this most often during the gfs_fsck_stress test when an I/O load is applied to the shared file system and then all nodes are rebooted.
------- Comment From brking.com 2010-01-04 18:07 EDT------- If the problem still occurs after the VIOS is updated, a system dump would be requested when the system is in the hung state. To capture this dump, on ASMI, under System Service Aids -> System Dump, select Initiate. Once the system is back up after the dump, it will be written to /var/adm/ras/platform on the VIOS. There will be both a SYSDUMP and an FSPDUMP. Please upload both of them for analysis. Thanks.
------- Comment From rcvalle.ibm.com 2010-01-05 07:22 EDT-------
There are some things I'd like you to try:

1. Create a failover domain for all nodes of your cluster
2. Add the tag <fcdriver>lpfc</fcdriver> in each "clusternode" tag
3. Change the "name" attribute of each "method" tag to "1" in each "fence" tag
4. Use the full hostname in the "ipaddr" attribute of the "fencedevice" tag
5. Create a resource and a service for the GFS filesystem
6. Try to fence one node of your cluster manually using the fence_lpar script and see if it works as expected
7. Make sure that all nodes of your cluster have each other and the fence devices listed in their "/etc/hosts" files

(In reply to comment #39)
> Created an attachment (id=50211) [details]
> cluster.conf
> ------- Comment on attachment From nstraz 2010-01-04 16:04:47 EDT-------
>
> I'm in the process of upgrading all VIOS partitions to 2.1 w/ Fix Pack 22 and
> interim fix IZ63813. I'll attempt to reproduce the hang and get a partition
> dump after the upgrade is complete.
>
> Attached is the cluster.conf I'm using. I am using IVM for fencing. We have
> three pSeries model 9110-51A with two Linux LPARs on each in the cluster. I
> hit this most often during the gfs_fsck_stress test when an I/O load is applied
> to the shared file system then all nodes are rebooted.
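For item 6, a manual fence test might look something like the following. This is only a hedged sketch: the IVM address, login, password, and partition name are placeholder assumptions, the exact set of fence_lpar options varies by release (check its man page), and the command is printed rather than executed.

```shell
#!/bin/sh
# Sketch of manually fencing one node via the IVM (item 6 above).
# Address, login, password, and partition name below are placeholders.
IVM_ADDR=ivm.example.com
CMD="fence_lpar -a $IVM_ADDR -l padmin -p PASSWORD -n basic-p1 -o status"
# Print rather than execute, so this stays a harmless dry run:
echo "would run: $CMD"
```

A "status" action is a safe first check; once that works, "-o reboot" would exercise the same path the cluster uses when it fences a node.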
Created attachment 381747 [details] cluster.conf ------- Comment (attachment only) From rcvalle.ibm.com 2010-01-05 07:23 EDT-------
Created attachment 381871 [details]
system dump compressed with bzip2

After getting everything updated to the latest VIOS fix level 2.1.2.10-FP-22 I have not yet been able to recreate the hang. I have hit another odd issue. Sometimes when basic-p1 reboots (one of the Linux LPARs) the shared disk is not available. I see messages like this during boot:

scsi 0:0:3:0: aborting command. lun 0x8300000000000000, tag 0xc00000007faf2af8
scsi 0:0:3:0: aborted task tag 0xc00000007faf2af8 completed
scsi 0:0:3:0: timing out command, waited 22s

The shared disk is a Winchester FC-SATA array attached via an Emulex LightPulse FC card and QLogic switches. IVM still says the physical disk is available and there were no other messages. After rebooting the LPAR again the disk came back. After a few cycles of this I started getting errors in Linux:

INFO: task gfs_mkfs:3861 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
gfs_mkfs      D 000000000ff111f8  6624  3861  3860 (NOTLB)
Call Trace:
[C00000006795EF50] [C0000000679D2C30] 0xc0000000679d2c30 (unreliable)
[C00000006795F120] [C000000000010AA0] .__switch_to+0x124/0x148
[C00000006795F1B0] [C0000000003D8F38] .schedule+0xc08/0xdbc
[C00000006795F2C0] [C0000000003D9C3C] .io_schedule+0x58/0xa8
[C00000006795F350] [C0000000000FD2D8] .sync_buffer+0x68/0x80
[C00000006795F3C0] [C0000000003DA054] .__wait_on_bit+0xa0/0x114
[C00000006795F470] [C0000000003DA160] .out_of_line_wait_on_bit+0x98/0xc8
[C00000006795F570] [C0000000000FD19C] .__wait_on_buffer+0x30/0x48
[C00000006795F5F0] [C0000000000FDAC0] .__block_prepare_write+0x3ac/0x440
[C00000006795F720] [C0000000000FDB88] .block_prepare_write+0x34/0x64
[C00000006795F7A0] [C000000000104C4C] .blkdev_prepare_write+0x28/0x40
[C00000006795F820] [C0000000000C09C4] .generic_file_buffered_write+0x420/0x76c
[C00000006795F960] [C0000000000C10B4] .__generic_file_aio_write_nolock+0x3a4/0x448
[C00000006795FA60] [C0000000000C1524] .generic_file_aio_write_nolock+0x30/0xa4
[C00000006795FB00] [C0000000000C1AAC] .generic_file_write_nolock+0x78/0xb0
[C00000006795FC70] [C00000000010464C] .blkdev_file_write+0x20/0x34
[C00000006795FCF0] [C0000000000F93FC] .vfs_write+0x118/0x200
[C00000006795FD90] [C0000000000F9B6C] .sys_write+0x4c/0x8c
[C00000006795FE30] [C0000000000086A4] syscall_exit+0x0/0x40

ibmvscsi 30000002: Command timed out (1). Resetting connection
sd 0:0:3:0: abort bad SRP RSP type 1
sd 0:0:3:0: timing out command, waited 360s
end_request: I/O error, dev sdb, sector 104667554
Buffer I/O error on device dm-2, logical block 817712
lost page write due to I/O error on dm-2
Buffer I/O error on device dm-2, logical block 817713
...
ibmvscsi 30000002: Command timed out (1). Resetting connection
printk: 46 messages suppressed.
sd 0:0:3:0: abort bad SRP RSP type 1

At this point I initiated a system dump, which is attached.
------- Comment From tpnoonan.com 2010-01-06 10:03 EDT------- Hi Red Hat, In case IBM needs to do a firmware/software update to the machines at Red Hat, can you please clarify if this problem is on the Red Hat owned PWR systems in MN or on the other PWR systems loaned to Red Hat by IBM? Thanks.
These are the Red Hat owned systems in MN. I already did all the firmware and software upgrades to the latest versions.
------- Comment From kumarr.com 2010-01-06 14:39 EDT-------
Red Hat, Thanks for the latest update and the dump. At this point, it looks like you have encountered a different bug. We have some questions/requests for you:

1. Would it be possible for you to provide us with new error logs, like fsp error logs and other logs on the VIOS, related to the problem you are seeing?
2. Please provide the uname -a output from the system you ran these tests on.
3. If there is anything more you can add to the console output, that would be great.

Thanks!
Created attachment 382071 [details]
error and console logs from basic related to system dump

Attached are the Error/Event Logs entries from around the time I initiated the system dump. Following that in the attachment are the console logs from basic-p1.

The kernel I'm using is the latest development kernel for RHEL 5.5:

Linux version 2.6.18-183.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Mon Dec 21 18:42:39 EST 2009

I've continued running this test and I've hit this again on another pSeries box. I initiated a system dump there too. I also get into a situation where one partition can't see the shared disk and the other can't see its root disk.
------- Comment From brking.com 2010-01-06 16:46 EDT------- Are there any error logs in the VIOS? If you ssh to the VIOS as padmin, then run the "errlog" command it will list the errors. "errlog -ls" will list all the additional error log data for all the errors.
Created attachment 382106 [details]
errlog -ls output from basic

Thank you for the errlog instructions. There are lots of entries, so I've compressed them with bzip2 and attached the result.
------- Comment From brking.com 2010-01-07 11:19 EDT------- This is looking like it might be a VIOS issue. I discussed with some folks in VIOS and they need a PMR opened so they can work the problem. Can you open a software PMR against VIOS? Once you have the PMR opened, let me know the PMR number and I can make sure it gets escalated appropriately. They will need to know your configuration in the PMR and the service team will most likely request some logs as well.
I was able to open a PMR over the phone, it is 30158379. While doing so we found out that our software support contract had expired in August 2009. Someone from IBM Sales is supposed to contact me with pricing and I was given the tracking number 1572759.
------- Comment From brking.com 2010-01-14 11:25 EDT------- I've discussed the problem with VIOS development. Would it be possible to provide remote access to the VIOS so that VIOS development can take a look at the system live? This would help immensely in resolving this issue in a timely fashion.
------- Comment From tpnoonan.com 2010-01-15 10:34 EDT-------
Hi Red Hat, Since the original problem has been resolved, is the new problem in the way of moving PWR LPAR Red Hat Software Cluster Suite from "tech preview" to full support in RHEL5.5? Or is the new problem "just" a defect to be resolved? Thanks.
I haven't been able to get through enough testing to determine if the hang on reboot is gone because the new problem of virtual disks disappearing is getting in the way.
I just copied the disk related issues to new bug 555871 and will continue with that issue there. IBM should probably mirror that bug for the VIOS team to follow.
------- Comment From tpnoonan.com 2010-01-15 14:51 EDT-------
Hi Red Hat, IBM can't reproduce either problem in our test bed. Note that the disk storage you have in MN is not in the support matrix.
While the shared storage is not in the support matrix, the internal storage used for the LPARs' root disks is, and that disappears on reboots too. We have our LPARs set to boot from the HD and then the network, and during my testing they'll sometimes boot from the network instead of the HD, which is a volume local to each system.
Nate, last I heard for this issue the hardware that you are running with wasn't support and IBM is trying to get you supported hardware that you can test with. Should we close this bz, or is there some reason that we should leave it open? Thanks, Steve
Considering this not a bug since the hardware is not supported by IBM.
------- Comment From kumarr.com 2010-02-09 17:37 EDT------- Closing per above comments