Bug 249775
| Field | Value |
|---|---|
| Summary | Request to backport zFCP NPIV support to RHEL 4 |
| Product | Red Hat Enterprise Linux 4 |
| Component | kernel |
| Version | 4.5 |
| Hardware | s390x |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | medium |
| Reporter | Issue Tracker <tao> |
| Assignee | Hans-Joachim Picht <hpicht> |
| QA Contact | Martin Jenner <mjenner> |
| CC | bhinson, bholden, bugproxy, cward, ghelleks, jglauber, jjarvis, marcobillpeter, peterm, riek, rlerch, swells, syeghiay, tao |
| Target Milestone | rc |
| Keywords | FutureFeature, Reopened, Triaged |
| Whiteboard | PM_RHEL4_8 |
| Fixed In Version | 5.0.0 |
| Doc Type | Enhancement |
| Last Closed | 2009-05-18 19:29:43 UTC |
| Bug Depends On | 458076 |
| Bug Blocks | 391511, 458752, 461297 |

Doc Text:

On Red Hat Enterprise Linux 4.8, N_Port ID Virtualization (NPIV) for System z guests using zFCP is now enabled. NPIV allows a Fibre Channel HBA to log in multiple times to a Fibre Channel fabric using a single physical port (N_Port). With this functionality, a Storage Area Network (SAN) administrator can assign one or more logical unit numbers (LUNs) to a particular System z guest, making those LUNs inaccessible to others. For further information, see "Introducing N_Port Identifier Virtualization for IBM System z9, REDP-4125", available at http://www.redbooks.ibm.com/abstracts/redp4125.html
Description — Issue Tracker, 2007-07-26 22:43:53 UTC

Description of problem:

We appear to have a correct hardware connection to our SAN fabric, and the SAN switches show NPIV logons from our zLinux host. We have ensured the following modules are loaded: zfcp, sd_mod, scsi_mod, scsi_transport_fc. As shown here:

```
[root@r6sant01 block]# lsmod
Module              Size  Used by
autofs4           240136  0
sunrpc            945664  1
vmcp               79880  0
loop              105744  0
qeth              543128  0
md5                57856  1
ipv6             2004592  25 qeth
ccwgroup           68608  1 qeth
sd_mod             39208  0
zfcp              237856  0 [permanent]
scsi_transport_fc  29440  1 zfcp
qdio               63824  5 qeth,zfcp
scsi_mod          210888  3 sd_mod,zfcp,scsi_transport_fc
dm_snapshot        42048  0
dm_zero            19712  0
dm_mirror          56848  0
ext3              201232  2
jbd               101168  1 ext3
dasd_fba_mod       29440  8
dasd_eckd_mod      86016  6
dasd_mod          105368  10 dasd_fba_mod,dasd_eckd_mod
dm_mod            107760  6 dm_snapshot,dm_zero,dm_mirror
```

We then issue the following to add the WWPN, per the Driver and Commands Guide for the 2.6 kernel (we are on RHEL 4 U5, 2.6.9-55.EL):

```
chccwdev -e 0.0.2000
echo 0xC05076FFD8001180 > /sys/bus/ccw/drivers/zfcp/0.0.2000/port_add
echo 0x0000000000000001 > /sys/bus/ccw/drivers/zfcp/0.0.2000/0xc05076ffd8001180/unit_add
```

After this, `cat /sys/block/sda/device/fcp_lun` should show our LUN information; however, /sys/block/sda does not exist. In dmesg we see:

```
zfcp: zfcp_fsf_open_port_handler(2454): FSF_SQ_NO_RETRY_POSSIBLE
zfcp: The remote port 0xc05076ffd8001180 on adapter 0.0.2000 could not be opened. Disabling it.
zfcp: port erp failed on port 0xc05076ffd8001180 on adapter 0.0.2000
scsi1 : zfcp
```

I am unable to find any documented action to take on the FSF_SQ_NO_RETRY_POSSIBLE return code. On the z/VM side (we are running z/VM 5.3), a query shows:

```
CP Q 2000
FCP  2000 ON FCP   2000 CHPID 30 SUBCHANNEL = 0019
2000 DEVTYPE FCP CHPID 30 FCP
2000 QDIO ACTIVE  QIOASSIST ACTIVE
2000
2000 INP + 01 IOCNT = 00000006 ADP = 128 PROG = 000 UNAVAIL = 000
2000 BYTES = 0000000000000000
2000 OUT + 01 IOCNT = 00000022 ADP = 000 PROG = 022 UNAVAIL = 106
2000 BYTES = 00000000000127EC
WWPN C05076FFD8001180
```

I tried using the san_disc tool, but it does not seem to function. I used lib-zfcp-hbaapi-1.4 and tried to pull in the lib-zfcp-hbaapi-1.3 version in case of a kernel-level issue, but was unable to download it due to improper file settings on the IBM website: it comes down as lib-zfcp-hbaapi-1.3.tar.tar, losing the .gz on the end and creating an invalid file.

I can collect any additional info you need to troubleshoot. We are on a deadline, of course, and need assistance to determine root cause and solution.

How reproducible:

Can be reproduced on every attempt to access the SAN.

Steps to Reproduce:

```
chccwdev -e 0.0.2000
echo 0xC05076FFD8001180 > /sys/bus/ccw/drivers/zfcp/0.0.2000/port_add
echo 0x0000000000000001 > /sys/bus/ccw/drivers/zfcp/0.0.2000/0xc05076ffd8001180/unit_add
```

Actual results:

In dmesg we see:

```
zfcp: zfcp_fsf_open_port_handler(2454): FSF_SQ_NO_RETRY_POSSIBLE
zfcp: The remote port 0xc05076ffd8001180 on adapter 0.0.2000 could not be opened. Disabling it.
zfcp: port erp failed on port 0xc05076ffd8001180 on adapter 0.0.2000
scsi1 : zfcp
```

Expected results:

According to the Driver and Commands Guide for the 2.6 kernel (we are on RHEL 4 U5, 2.6.9-55.EL), `cat /sys/block/sda/device/fcp_lun` should show our LUN information. /sys/block/sda does not exist, however.
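The attach sequence above can be wrapped in a small helper for review. This is a sketch, not part of the original report: `zfcp_attach` is a hypothetical name, and the function only prints the commands it would run (a dry run) so the sequence can be inspected off an s390x system; the device, WWPN, and LUN values are the ones used in the report.

```shell
#!/bin/sh
# Sketch (hypothetical helper, dry-run only): print the commands needed to
# bring a zFCP device online and attach one LUN, mirroring the sequence
# from the report above. On a real system, execute the printed commands.
zfcp_attach() {
    device="$1"; wwpn="$2"; lun="$3"
    base="/sys/bus/ccw/drivers/zfcp/$device"
    echo "chccwdev -e $device"
    echo "echo $wwpn > $base/port_add"
    echo "echo $lun > $base/$wwpn/unit_add"
}

zfcp_attach 0.0.2000 0xc05076ffd8001180 0x0000000000000001
```

After the real writes, one would expect /sys/block/sda (and fcp_lun under it) to appear; its absence is exactly the symptom reported here.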
Additional info:

This event sent from IssueTracker by bhinson [SEG - Storage], issue 127690.

NPIV support was added here: https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=184627

This is upstream in 2.6.14, so it is included in RHEL 5, but was never backported to RHEL 4. According to a comment early on, the code only impacts the zfcp device driver, so no changes are required to the SCSI midlayer. This is a good sign for consideration in backporting to RHEL 4, but we'll need input from the kernel developer on kABI issues. Let's go ahead and propose this backport for RHEL 4, but let's set the expectation that for the immediate future, RHEL 5 is the solution.

This event sent from IssueTracker by bhinson [SEG - Storage], issue 127690.

Brad, I think we should update the zfcp driver on RHEL 4 completely. That would give the customers missing features like NPIV and also ease porting fixes. Technically it should be possible to do the driver upgrade, but this will surely cost some development time and, of course, testing effort. When will the customer do the rollout?

IBM, any comments on the feasibility of backporting the upstream zFCP patch set for NPIV support to RHEL 4?

------- Additional Comments From jan.glauber.com 2007-08-01 11:21 EDT -------
Brad, we are currently estimating the effort to backport NPIV to RHEL 4.

changed:

What | Removed | Added
---|---|---
Owner | jan.glauber.com | buendgen.com

------- Additional Comments From jan.glauber.com 2007-08-02 07:17 EDT -------
Reassigning to Reinhard, who will take care of this issue.

Created attachment 160754 [details]
NPIV backport from 2.6.13

Patch is based on initial NPIV support added to 2.6.13.

developerWorks link: http://www.ibm.com/developerworks/linux/linux390/linux-2.6.13-s390-01-october2005.html (original patch: linux-2.6.13-s390-01-05-october2005.diff)

Created attachment 160756 [details]
NPIV backport from 2.6.13 (repost for IBM BZ mirroring purposes)

Patch is based on initial NPIV support added to 2.6.13. developerWorks link: http://www.ibm.com/developerworks/linux/linux390/linux-2.6.13-s390-01-october2005.html (original patch: linux-2.6.13-s390-01-05-october2005.diff)

Do we have anything new to report on this? The customer is looking for an update. Thanks!

This event sent from IssueTracker by esammons, issue 145211.

------- Comment From pavlic.com 2008-03-12 06:50 EDT -------
Hi, we are not going to backport NPIV to RHEL 4.7 or RHEL 4.8, and are still recommending that the customer upgrade to RHEL 5 to get NPIV support. Thus we are going to close this bug.
Frank Pavlic

------- Comment From mgrf.com 2008-04-23 03:36 EDT -------
I am reopening this BZ because the discussions on this item are getting restarted.

Is there an update for this BZ?

------- Comment From bhinson 2008-01-26 13:31 EST -------
The customer has reopened this request. From issue 145211: Another RFE that has come out of the planning session with JPMC. This was previously a hot topic; their belief was they could go to RHEL 5 in 1Q08. However, it now seems that has been delayed, and they would like to pursue this again.

Created attachment 310706 [details]
Updated NPIV backport
Updated NPIV patch, backported to RHEL 4 kernel 2.6.9-75, with the following additional patch included:
commit 2448c45965870ca9cfdb66388b4fcc93f1e12bb7
Author: Andreas Herrmann <aherrman.com>
Date: Thu Dec 1 02:50:36 2005 +0100
[SCSI] zfcp: fix adapter initialization
Fixed various problems in opening sequence of adapters which was previously
changed with NPIV support:
o corrected handling when exchange port data function is not supported,
otherwise adapters on z900 cannot be opened anymore
o corrected setup of timer for exchange port data if called from error
recovery
o corrected check of return code of exchange config data
Signed-off-by: Andreas Herrmann <aherrman.com>
Signed-off-by: James Bottomley <James.Bottomley>
Test packages are available at: http://people.redhat.com/bhinson/.zfcpNPIV/ — feedback on an NPIV-enabled setup is greatly appreciated.

Posted to rhkernel on Jul 2 by Hans-Joachim Picht <hpicht>.

------- Comment From gmuelas.com 2008-07-11 03:52 EDT -------
Waiting on the Red Hat PM decision to continue the activities with this bug...

Created attachment 314486 [details]
Updated NPIV backport

Finalized backport.

Updated kernel test packages are available at: http://people.redhat.com/bhinson/.zfcpNPIV/ This test kernel also includes all of the patches proposed for RHEL 4.8: http://people.redhat.com/bhinson/.zfcpNPIV/patches/ Please test and provide feedback.

The non-NPIV part of the zfcp driver appears to be broken: attaching a LUN without NPIV crashed the LPAR with a kernel panic.

Steps to reproduce, with the resulting operating system messages:
```
modprobe zfcp
echo 1 > /sys/bus/ccw/drivers/zfcp/0.0.b409/online
echo 0x500507630318848e > /sys/bus/ccw/drivers/zfcp/0.0.b409/port_add
echo 0x40ca400c00000000 > /sys/bus/ccw/drivers/zfcp/0.0.b409/0x500507630318848e/unit_add
```
```
SCSI subsystem initialized
zfcp: Switched fabric fibrechannel network detected at adapter 0.0.b409.
Unable to handle kernel pointer dereference at virtual kernel address 0000000000000000
Oops: 0004 [#1]
CPU:    1    Not tainted
Process zfcperp0.0.b409 (pid: 1319, task: 000000007e0ea040, ksp: 000000007a857d28)
Krnl PSW : 0400000180000000 000000008124655c (zfcp_fsf_open_unit+0x11c/0x254 [zfcp])
Krnl GPRS: 0000000000396f18 0000000000000000 0000000000000001 00000000000000b0
           0000000000000000 0000000000396f28 000000007a857de0 0000000002da5570
           000000007a916000 0000000002da5570 0000000000000000 0000000002da5570
           0000000081230000 000000008124c480 00000000812464f6 000000007a857d30
Krnl Code: 50 20 10 b4 e3 20 b0 28 00 04 a5 1e 08 00 41 30 21 50 58 40
Call Trace:
([<0000000081246490>] zfcp_fsf_open_unit+0x50/0x254 [zfcp])
 [<000000008123a076>] zfcp_erp_strategy_do_action+0x1266/0x15f8 [zfcp]
 [<000000008123b10a>] zfcp_erp_thread+0x5c2/0x15a0 [zfcp]
 [<0000000000019ab6>] kernel_thread_starter+0x6/0xc
 [<0000000000019ab0>] kernel_thread_starter+0x0/0xc
<0>Kernel panic - not syncing: Fatal exception: panic_on_oops
```
The bug is in the NPIV patch. The patched code in zfcp_fsf_open_unit() in zfcp_fsf.c looks like this:

```c
fsf_req->qtcb->header.port_handle = erp_action->port->handle;
fsf_req->qtcb->bottom.support.fcp_lun = erp_action->unit->fcp_lun;
if (!(erp_action->adapter->connection_features & FSF_FEATURE_NPIV_MODE))
	erp_action->fsf_req->qtcb->bottom.support.option =
		FSF_OPEN_LUN_SUPPRESS_BOXING;
atomic_set_mask(ZFCP_STATUS_COMMON_OPENING, &erp_action->unit->status);
fsf_req->data.open_unit.unit = erp_action->unit;
fsf_req->erp_action = erp_action;
erp_action->fsf_req = fsf_req;
```

erp_action->fsf_req->qtcb is an invalid pointer until erp_action->fsf_req is initialized in the last line above. The fix would be to change the line in the if-statement to use fsf_req directly:

```c
	fsf_req->qtcb->bottom.support.option = FSF_OPEN_LUN_SUPPRESS_BOXING;
```

Updated patch and kernel RPMs available here: http://people.redhat.com/bhinson/.zfcpNPIV

Updating PM score.

With the new test kernel 2.6.9-78.EL.zfcptest.2 it is possible to attach LUNs without NPIV support:

```
SCSI subsystem initialized
zfcp: Switched fabric fibrechannel network detected at adapter 0.0.b409.
  Vendor: IBM   Model: 2107900   Rev: .104
  Type: Direct-Access   ANSI SCSI revision: 05
SCSI device sda: 20971520 512-byte hdwr sectors (10737 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 20971520 512-byte hdwr sectors (10737 MB)
SCSI device sda: drive cache: write back
Attached scsi disk sda at scsi0, channel 0, id 1, lun 1074544842
```

Saw the following bug during a chpid off/on test (vary off/on via sysfs):

```
zfcp: adapter 0.0.b409: no path
kernel BUG at include/linux/module.h:397!
```
```
illegal operation: 0001 [#1]
CPU:    0    Not tainted
Process zfcperp0.0.b409 (pid: 1327, task: 000000007d6a0040, ksp: 000000007959fbe0)
Krnl PSW : 0700000180000000 0000000080921836 (qdio_shutdown+0x6ba/0x714 [qdio])
Krnl GPRS: 00000000000000a6 00000000002980e8 000000000000002a 0700000000000000
           0000000080921834 000000000020953e 0000000000000000 070000008091e000
           0000000002cf5200 000000007bee6100 0000000000000001 000000008093bc00
           000000008091e000 0000000080925bb8 0000000080921834 000000007959fc70
Krnl Code: 00 00 e3 20 0d 8a 00 91 eb 22 00 08 00 0d 41 22 b0 00 41 10
Call Trace:
([<0000000080921834>] qdio_shutdown+0x6b8/0x714 [qdio])
 [<0000000081237de2>] zfcp_erp_adapter_strategy_generic+0x8be/0x9cc [zfcp]
 [<0000000081238e8e>] zfcp_erp_strategy_do_action+0x7e/0x15f8 [zfcp]
 [<000000008123b10a>] zfcp_erp_thread+0x5c2/0x15a0 [zfcp]
 [<0000000000019ab6>] kernel_thread_starter+0x6/0xc
 [<0000000000019ab0>] kernel_thread_starter+0x0/0xc
<0>Kernel panic - not syncing: Fatal exception: panic_on_oops
```

Kernel: 2.6.9-78.EL.zfcptest.2

Steps to reproduce:
1. Run I/O to an FCP-attached storage subsystem.
2. Set one of the used chpids offline: echo off > /sys/devices/css0/chp0.da/status
3. Set the chpid online again: echo on > /sys/devices/css0/chp0.da/status

This bug is reproducible with and without NPIV.

Created attachment 315939 [details]
zfcp hostadapter patch

I tried to reproduce this panic, but was unsuccessful. However, looking at the backtrace from the panic led me to backport this patch from upstream: http://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.9-rc4/2.6.9-rc4-mm1/broken-out/s390-7-12-zfcp-host-adapter.patch

The kernel panic was caused by an invalid reference count, so the most relevant change is to zfcp_erp_action_cleanup(), bringing it a little closer to matching the upstream function.
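The chpid vary off/on procedure used in these reproduction steps can be expressed as a small guard-railed helper. This is a sketch only: `chpid_vary` is a hypothetical name, and it prints the sysfs write rather than performing it, since /sys/devices/css0 exists only on s390 systems.

```shell
#!/bin/sh
# Sketch (hypothetical helper, dry-run): emit the sysfs write used in the
# reproduction steps to vary a CHPID off or on; rejects any other state.
chpid_vary() {
    chpid="$1"; state="$2"
    case "$state" in
        on|off) echo "echo $state > /sys/devices/css0/chp0.$chpid/status" ;;
        *) echo "chpid_vary: state must be 'on' or 'off'" >&2; return 1 ;;
    esac
}

chpid_vary da off
chpid_vary da on
```

On a real s390 system the printed redirection would be executed directly, reproducing steps 2 and 3 above.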
Updated kernel RPMs are here: http://people.redhat.com/bhinson/.zfcpNPIV/ (removing POST as this hasn't yet been submitted to RHkernel)

Also, if the same panic occurs on kernel-2.6.9-78.EL.zfcptest.3, can you test this chpid on/off procedure on the RHEL 5.2 kernel for comparison? Thanks.

chpid off/on retest with kernel 2.6.9-78.EL.zfcptest.3: the LPAR with the NPIV environment crashed again with a kernel panic. A screenshot of the Operating System Messages is attached as kernel_panic_zfcptest3.JPG. As requested, the chpid on/off procedure was executed with the RHEL 5.2 kernel 2.6.18-92.1.10.el5; no kernel panic occurs.

Created attachment 316275 [details]
Operating System Message of kernel panic

Just to clarify one more thing: does the same chpid off/on test pass on an unpatched 2.6.9-78 (RHEL 4.7) kernel? Thanks.

Brad asked Hans, who asked me to have a look at this. The stack trace from the screenshot shows a problem in module.h triggered from qdio_shutdown. This looks like the problem with the qdio reference counter with offline channels. I could not find the data for the test setup, but I assume that there are multiple subchannels for the same chpid attached to the LPAR, and only one subchannel is in use. Running chpid off/on with the offline subchannels triggers the qdio bug where the reference counter is decremented in each cycle until it hits the check in module.h. The fix should be this one:

# 46531 - 458076 linux-2.6.9-s390-qdio_mod_refcnt.patch

and a workaround would be to attach only the subchannels to the LPAR that are actually being used.

Two remarks from my side:
1) Would it be possible for future problem reports to include more data? A dump for a crashed system, or a dbginfo tarball for a problem on a running system, would certainly help.
2) Would it be possible to include all known s390 RHEL 4 patches in this test kernel? If not, do we have a list of known s390 RHEL 4 patches that are not yet included? There are probably more known problems that will be hit again if the tests run without the known fixes.

Christof

After speaking with Hans: the patch (linux-2.6.9-s390-qdio_mod_refcnt.patch) included in the latest kernel (kernel-2.6.9-78.EL.zfcptest.3) was the wrong version of the patch. I've updated this patch and rebuilt. Updated kernel images are available at: http://people.redhat.com/.zfcpNPIV/

Correction, the link is: http://people.redhat.com/bhinson/.zfcpNPIV

At z/Expo in Las Vegas, Brad and I worked on this issue again, and Brad generated a new kernel RPM. Can you please try to reproduce the remaining bug with the following RPM file?
http://people.redhat.com/bhinson/.zfcpNPIV/kernel-2.6.9-78.EL.zfcptest.4.s390x.rpm

With best regards, --Hans

Has there been any movement on this?

I have a LUN configured with NPIV adapter 1940 (CHPID 59) and mounted as /mnt/1. I ran blast on the partition for an hour and performed CHPID off on the adapter. Blast terminated as expected and was not able to use the mount point (not able to write anything to it), reporting "read-only filesystem". When the CHPID was made online again, the filesystem appears to be corrupted: I am not able to write to the disk or remove its contents, even after remounting.

```
[root@h0530014 1]# uname -a
Linux h0530014.boeblingen.de.ibm.com 2.6.9-78.EL.zfcptest.4 #1 SMP Thu Oct 16 13:12:03 EDT 2008 s390x s390x s390x GNU/Linux
```

The LUNs were configured with the following script (FCP adapter 1940, CHPID 59, on z/VM guest H0530014):

```
[root@h0530014 ~]# cat zfcp_H05LP32_Conf_npiv.sh
echo 0x500507630303c562 > /sys/bus/ccw/drivers/zfcp/0.0.1940/port_add
echo 0x4014407000000000 > /sys/bus/ccw/drivers/zfcp/0.0.1940/0x500507630303c562/unit_add; sleep 1;
echo 0x4014407100000000 > /sys/bus/ccw/drivers/zfcp/0.0.1940/0x500507630303c562/unit_add; sleep 1;
echo 0x4014407200000000 > /sys/bus/ccw/drivers/zfcp/0.0.1940/0x500507630303c562/unit_add; sleep 1;
echo 0x4014407300000000 > /sys/bus/ccw/drivers/zfcp/0.0.1940/0x500507630303c562/unit_add; sleep 1;
echo 0x4014407400000000 > /sys/bus/ccw/drivers/zfcp/0.0.1940/0x500507630303c562/unit_add; sleep 1;
echo 0x4014407500000000 > /sys/bus/ccw/drivers/zfcp/0.0.1940/0x500507630303c562/unit_add; sleep 1;
echo 0x4014407600000000 > /sys/bus/ccw/drivers/zfcp/0.0.1940/0x500507630303c562/unit_add; sleep 1;
echo 0x4014407700000000 > /sys/bus/ccw/drivers/zfcp/0.0.1940/0x500507630303c562/unit_add; sleep 1;
```

When the CHPID was set offline, the following message appeared:

```
[root@h0530014 ~]# ls
Message from syslogd@h0530014 at Mon Dec 29 10:00:05 2008 ...
h0530014 kernel: journal commit I/O error
```

Blast terminated with the following messages:

```
Removal bypassed for File /mnt/1/test1/1/2/3/f1024.blt rc = 2 at 12/29/2008 10:03:44
*****************************************************************
* Error Code 30 detected on /mnt/1 at 12/29/2008 10:03:44
* EROFS Read-only file system
*****************************************************************
Create ERROR FILE /mnt/1/error.blt Failed 30 EROFS Read-only file system
Remove Directory /mnt/1/test1/1/2/3/4 halted due to error in removing contents
Test /mnt/1 ended with errors logged to: /root/blast/BLAST_h0530014__mnt_1_12_29_2008_10_03_43.log
BLAST Ended on /mnt/1 RC = -1 at 12/29/2008 10:03:44
*=== Run statistics ============================================*
| Elapsed time = 0 days, 0 hours, 0 minutes and 1 seconds
*===============================================================*
***** All test cases terminated OK *****
```

After this point I am not able to restart blast; it gives the same error again and again, even after remounting the filesystem.

Log messages in dmesg and /var/log/messages:

```
Dec 29 10:00:56 h0530014 kernel: zfcp: adapter 0.0.1940: operational again
Dec 29 10:00:57 h0530014 kernel: zfcp: Switched fabric fibrechannel network detected at adapter 0.0.1940.
Dec 29 10:02:23 h0530014 kernel: __journal_remove_journal_head: freeing b_committed_data
Dec 29 10:02:23 h0530014 last message repeated 3 times
Dec 29 10:02:28 h0530014 kernel: kjournald starting.  Commit interval 5 seconds
Dec 29 10:02:28 h0530014 kernel: EXT3 FS on sda1, internal journal
Dec 29 10:02:28 h0530014 kernel: EXT3-fs: recovery complete.
Dec 29 10:02:28 h0530014 kernel: EXT3-fs: mounted filesystem with ordered data mode.
```
```
Dec 29 10:02:34 h0530014 kernel: inode_doinit_with_dentry: getxattr returned 5 for dev=sda1 ino=458753
Dec 29 10:03:31 h0530014 kernel: EXT3-fs warning (device sda1): ext3_unlink: Deleting nonexistent file (458753), 0
Dec 29 10:03:44 h0530014 kernel: EXT3-fs error (device sda1): ext3_free_blocks_sb: bit already cleared for block 1382400
Dec 29 10:03:44 h0530014 kernel: Aborting journal on device sda1.
[the same "bit already cleared" error repeats for blocks 1382401 through 1382411]
Dec 29 10:03:44 h0530014 kernel: EXT3-fs error (device sda1) in ext3_reserve_inode_write: Journal has aborted
Dec 29 10:03:45 h0530014 kernel: EXT3-fs error (device sda1) in ext3_truncate: Journal has aborted
Dec 29 10:03:45 h0530014 kernel: EXT3-fs error (device sda1) in ext3_orphan_del: Journal has aborted
Dec 29 10:03:45 h0530014 kernel: EXT3-fs error (device sda1) in ext3_delete_inode: Journal has aborted
Dec 29 10:03:45 h0530014 kernel: __journal_remove_journal_head: freeing b_committed_data
Dec 29 10:03:45 h0530014 last message repeated 12 times
Dec 29 10:03:45 h0530014 kernel: ext3_abort called.
Dec 29 10:03:45 h0530014 kernel: EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Dec 29 10:03:45 h0530014 kernel: Remounting filesystem read-only
Dec 29 10:06:44 h0530014 kernel: sb orphan head is 688129
Dec 29 10:06:44 h0530014 kernel: sb_info orphan list:
Dec 29 10:06:44 h0530014 kernel: inode sda1:688347 at 000000006cb4e108: mode 100755, nlink 1, next 0
[the orphan-list message repeats many more times, interleaved and garbled on the console]
```

Sometimes running fsck on the partition works. Other messages:

```
__journal_remove_journal_head: freeing b_committed_data
ext3_abort called.
EXT3-fs error (device sda1): ext3_journal_start_sb: Detected aborted journal
Remounting filesystem read-only
inode_doinit_with_dentry: getxattr returned 5 for dev=sda1 ino=671745
EXT3-fs error (device sda1): ext3_xattr_get: inode 689110: bad block 1376771
inode_doinit_with_dentry: getxattr returned 5 for dev=sda1 ino=689110
[the getxattr message repeats for inodes 443394 through 443417]
```

Increasing severity to block, as data corruption occurred in this scenario, and data integrity problems can't be tolerated on System z.

Comments from Volker Sameske:

Okay, this problem cannot and will not be fixed; it works as designed. In a setup without multipathing there is only one path to the disk. If this path is interrupted, the filesystem gets I/O errors. At that point the filesystem on the disk may be in an inconsistent state. To avoid data loss (i.e. working on an inconsistent filesystem), this state is frozen immediately: the filesystem remounts itself read-only. Once the device is available again, the filesystem tries to recover from the failure. This should be possible in most cases, but there may be some cases where recovery is not possible. To eliminate that risk, our recommendation is to use multipathing. It is unlikely that all paths will fail at the same time, but if they do, there is a feature called "queue if no path" which queues I/O requests until the device comes back or the system runs out of memory. In the latter case the system will freeze on the next malloc request.
But as soon as the device is back, the I/O queue is emptied and the system should be accessible again. Using SCSI disks directly (without multipathing) does not support "queue if no path".

As a follow-up to comment #50, can we retest this with multipath?

Since the latest patch (presumably the one reposted on October 2nd, 2008: http://post-office.corp.redhat.com/archives/rhkernel-list/2008-October/msg00070.html) did not work based on the test result in comment #65, moving this bug back to ASSIGNED; based on comment #66, raising as a proposed blocker.

Committed in 79.EL. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Overall, tests went well. With multipath, if queue_if_no_path is enabled, there is no I/O stall. Since I/O failure when no path is available is expected, the multipath test with queueing enabled passes. Some setup still needs to be done; the tests with 256 FCP adapters are waiting for that setup and will be performed as soon as it is available.

Release note added. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

New Contents:
N_Port ID Virtualization (NPIV) utilizes a recent extension to the International Committee for Information Technology Standardization (INCITS) Fibre Channel standard. This extension allows a Fibre Channel HBA to log in multiple times to a Fibre Channel fabric using a single physical port (N_Port). (The previous implementation of the standard required a single physical FCP channel for each login.)
For further information, see "Introducing N_Port Identifier Virtualization for IBM System z9, REDP-4125", available at http://www.redbooks.ibm.com/abstracts/redp4125.html

Removed association with Issue Tracker #127690 and Issue Tracker #145211, as those tickets have been resolved.

The Red Hat backported code has been tested and sent to the specific customer.

Adapting the severity of this reverse mirror on the IBM side from "block" to "high" to reflect the severity on the Red Hat side.

This BZ is now used to track inclusion into RHEL 4.8.

------- Comment From schamart.ibm.com 2009-03-20 06:54 EDT-------
Currently I am performing the tests on RHEL 4.8. I am running high I/O with failover for over 48 hours and will update the results once they are available.

Until now the following problems were reported for NPIV in 4.8:
LTC Bugzilla No 51980 / RIT 272434

------- Comment From mjr.ibm.com 2009-03-24 11:36 EDT-------
(In reply to comment #61)
> Currently I am performing the tests on RHEL 4.8. I am running high I/O with
> failover for over 48 hours and will update the results once they are available.

Did that complete?

> Until now the following problems were reported for NPIV in 4.8:
> LTC Bugzilla No 51980 / RIT 272434

If we're already tracking new problems related to NPIV in other bugs, maybe this one can be closed?
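The multipath recommendation discussed above can be sketched as a device-mapper-multipath configuration. This is a minimal sketch only: the WWID and alias below are hypothetical examples, not values from this bug, and option spellings varied slightly across RHEL 4 multipath-tools versions.

```
# /etc/multipath.conf -- sketch; WWID and alias are hypothetical.
# "1 queue_if_no_path" tells dm-multipath to queue I/O when every path
# to the device has failed, instead of returning I/O errors (which would
# trigger the ext3 abort / read-only remount seen in the logs above).
defaults {
    user_friendly_names yes
}

multipaths {
    multipath {
        wwid     36005076303ffc56200000000000010a1   # hypothetical LUN WWID
        alias    npivdisk0
        features "1 queue_if_no_path"
    }
}
```

As noted in the comments, queued I/O is held until a path returns or memory is exhausted, so this trades a possible system stall for protection against filesystem inconsistency.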
------- Comment From schamart.ibm.com 2009-03-25 00:27 EDT-------
Tests executed successfully for NPIV in both good-path and failover tests. No problems were found during the I/O.

------- Comment From mjr.ibm.com 2009-03-25 02:34 EDT-------
Great, thank you for trying this out. I will mark this bug closed, then.

Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1,6 +1 @@
-N_Port ID Virtualization (NPIV) utilizes a recent extension to the International
-Committee for Information Technology Standardization (INCITS) Fibre Channel
-standard. This extension allows a Fibre Channel HBA to log in multiple times to a
-Fibre Channel fabric using a single physical port (N_Port). (The previous
-implementation of the standard required a single physical FCP channel for each
-login.) For further information, see "Introducing N_Port Identifier Virtualization for IBM System z9, REDP-4125" available at http://www.redbooks.ibm.com/abstracts/redp4125.html
+This update enables N_Port ID Virtualization (NPIV) for System z guests using zFCP. NPIV allows a Fibre Channel HBA to log in multiple times to a Fibre Channel fabric using a single physical port (N_Port). With this functionality, a Storage Area Network (SAN) administrator can assign one or more logical unit numbers (LUNs) to a particular System z guest, making that LUN inaccessible to others. For further information, see "Introducing N_Port Identifier Virtualization for IBM System z9, REDP-4125" available at http://www.redbooks.ibm.com/abstracts/redp4125.html

~~ Attention Partners! Snap 1 Released ~~
RHEL 4.8 Snapshot 1 has been released on partners.redhat.com. A fix that addresses this bug should be present. NOTE: only a short time is left to test; please test and report back results on this bug at your earliest convenience.
If you encounter any issues, please set the bug back to the ASSIGNED state and describe the issues you encountered. If you have found a NEW bug, clone this bug and describe the issues you encountered. Further questions can be directed to your Red Hat Partner Manager.

If you have VERIFIED the bug fix, please select your PartnerID from the Verified field above and leave a comment with your test result details, including which arches were tested, the package version, and any applicable logs.

- Red Hat QE Partner Management

Release note updated. If any revisions are required, please set the "requires_release_notes" flag to "?" and edit the "Release Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team.

Diffed Contents:
@@ -1 +1 @@
-This update enables N_Port ID Virtualization (NPIV) for System z guests using zFCP. NPIV allows a Fibre Channel HBA to log in multiple times to a Fibre Channel fabric using a single physical port (N_Port). With this functionality, a Storage Area Network (SAN) administrator can assign one or more logical unit numbers (LUNs) to a particular System z guest, making that LUN inaccessible to others. For further information, see "Introducing N_Port Identifier Virtualization for IBM System z9, REDP-4125" available at http://www.redbooks.ibm.com/abstracts/redp4125.html
+On Red Hat Enterprise Linux 4.8, N_Port ID Virtualization (NPIV) for System z guests using zFCP is now enabled. NPIV allows a Fibre Channel HBA to log in multiple times to a Fibre Channel fabric using a single physical port (N_Port). With this functionality, a Storage Area Network (SAN) administrator can assign one or more logical unit numbers (LUNs) to a particular System z guest, making that LUN inaccessible to others. For further information, see "Introducing N_Port Identifier Virtualization for IBM System z9, REDP-4125" available at http://www.redbooks.ibm.com/abstracts/redp4125.html

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1024.html
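As a rough illustration of what the backported zFCP support lets a guest do, the sysfs steps below show the classic way an NPIV-capable FCP device is brought online on Linux on System z. This is a sketch only: the device bus-ID, WWPN, and FCP LUN are hypothetical examples, and on later kernels with automatic port scanning the port_add step is not needed.

```
# Sketch -- 0.0.3c00 and both hex values are hypothetical examples.
# Bring the FCP subchannel (behind an NPIV-enabled CHPID) online:
echo 1 > /sys/bus/ccw/drivers/zfcp/0.0.3c00/online

# Register the remote storage port (older zfcp versions; newer ones
# discover ports automatically):
echo 0x500507630300c562 > /sys/bus/ccw/drivers/zfcp/0.0.3c00/port_add

# Attach the LUN behind that port; it then appears as a SCSI disk:
echo 0x4010400000000000 > \
    /sys/bus/ccw/drivers/zfcp/0.0.3c00/0x500507630300c562/unit_add
```

With NPIV, each such subchannel logs in to the fabric with its own N_Port ID, so the SAN administrator can zone and LUN-mask per guest rather than per physical port.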