Bug 442524
Summary: | kobject_issues_in_SAN_environment | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 5 | Reporter: | Jan Lieskovsky <jlieskov> |
Component: | kernel | Assignee: | Jes Sorensen <Jes.Sorensen> |
Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | low | Docs Contact: | |
Priority: | low | ||
Version: | 5.2 | CC: | arozansk, coughlan, lwang, mchristi, peterm, rproffit |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2012-05-04 14:56:03 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Bug Depends On: | 430984 | ||
Bug Blocks: |
Description
Jan Lieskovsky
2008-04-15 11:10:26 UTC
More details from GregKH: Here is an exerpt from logging at 'info' and 'notice' priority levels for the event: ------------------------------------------------------------ --------------- Feb 22 01:36:54 rsg1 kernel: qla2xxx 0000:03:09.0: scsi(3:0:1): Abort command issued -- 1 3818fa 2002. Feb 22 01:37:20 rsg1 hotplug: Event 871 requested remove for device: scsi_disk Feb 22 01:37:20 rsg1 hotplug: Event 870 requested remove for device: scsi_device Feb 22 01:37:20 rsg1 hotplug: Event 872 requested remove for device: block Feb 22 01:37:20 rsg1 hotplug: Event 873 requested remove for device: block Feb 22 01:37:20 rsg1 hotplug: Event 874 requested remove for device: scsi Feb 22 01:37:24 rsg1 kernel: qla2xxx 0000:03:09.0: scsi(3:0:1): Abort command issued -- 1 3818fb 2002. Feb 22 01:37:24 rsg1 kernel: scsi 3:0:0:1: scsi: Device offlined - not ready after error recovery Feb 22 01:37:24 rsg1 kernel: scsi 3:0:0:1: [sdd] Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK,SUGGEST_OK Feb 22 01:39:24 rsg1 kernel: scsi scan: INQUIRY result too short (5), using 36 Feb 22 01:39:24 rsg1 kernel: scsi 3:0:0:1: Direct-Access FCTARGET lvm(58,6) 0.9 PQ: 0 ANSI: 3 Feb 22 01:39:24 rsg1 kernel: sd 3:0:0:1: [sde] 418775040 512-byte hardware sectors (214413 MB) Feb 22 01:39:24 rsg1 kernel: sd 3:0:0:1: [sde] Write Protect is off Feb 22 01:39:24 rsg1 kernel: sd 3:0:0:1: [sde] Write cache: disabled, read cache: enabled, doesn't support DPO or FUA Feb 22 01:39:24 rsg1 hotplug: Event 876 requested add for device: scsi_disk Feb 22 01:39:24 rsg1 hotplug: Event 875 requested add for device: scsi Feb 22 01:39:24 rsg1 kernel: sde: sde1 Feb 22 01:39:24 rsg1 hotplug: Event 877 requested add for device: block Feb 22 01:39:24 rsg1 hotplug: Event 878 requested add for device: block Feb 22 01:39:24 rsg1 kernel: sd 3:0:0:1: [sde] Attached SCSI disk ------------------------------------------------------------ --------------- Which corresponds to the following messages at 'warn' priority: ------------------------------------------------------------ --------------- Feb 22 01:37:20 rsg1 kernel: rport-3:0-1: blocked FC remote port time out: removing target and saving binding Feb 22 01:37:24 rsg1 kernel: end_request: I/O error, dev sdd, sector 418774920 Feb 22 01:37:24 rsg1 kernel: md: super_written gets error=-5, uptodate=0 Feb 22 01:37:24 rsg1 kernel: raid1: Disk failure on sdd1, disabling device. Feb 22 01:37:24 rsg1 kernel: ^IOperation continuing on 1 devices Feb 22 01:37:24 rsg1 kernel: scsi 3:0:0:1: rejecting I/O to dead device Feb 22 01:37:24 rsg1 kernel: md: super_written gets error=-5, uptodate=0 Feb 22 01:37:24 rsg1 kernel: scsi 3:0:0:1: rejecting I/O to dead device Feb 22 01:37:24 rsg1 kernel: md: super_written gets error=-5, uptodate=0 Feb 22 01:37:24 rsg1 kernel: RAID1 conf printout: Feb 22 01:37:24 rsg1 kernel: --- wd:1 rd:2 Feb 22 01:37:24 rsg1 kernel: disk 0, wo:0, o:1, dev:sdc1 Feb 22 01:37:24 rsg1 kernel: disk 1, wo:1, o:0, dev:sdd1 Feb 22 01:37:24 rsg1 kernel: RAID1 conf printout: Feb 22 01:37:24 rsg1 kernel: --- wd:1 rd:2 Feb 22 01:37:24 rsg1 kernel: disk 0, wo:0, o:1, dev:sdc1 Feb 22 01:39:26 rsg1 kernel: kobject_add failed for 3:0:0:1 with -EEXIST, don't try to register things with the same name in the same directory. Feb 22 01:39:26 rsg1 kernel: [<c01bddde>] kobject_shadow_add+0x101/0x10a Feb 22 01:39:26 rsg1 kernel: [<c01edc5d>] device_add+0x7c/0x374 Feb 22 01:39:26 rsg1 kernel: [<c020cb47>] scsi_sysfs_add_sdev+0x27/0x148 Feb 22 01:39:26 rsg1 kernel: [<c020b09f>] scsi_add_lun+0x2b7/0x2cc Feb 22 01:39:26 rsg1 kernel: [<c020b228>] scsi_probe_and_add_lun+0x174/0x206 Feb 22 01:39:26 rsg1 kernel: [<c020b37d>] scsi_sequential_lun_scan+0xc3/0xda Feb 22 01:39:26 rsg1 kernel: [<c020b98e>] __scsi_scan_target+0xd1/0xe8 Feb 22 01:39:26 rsg1 kernel: [<c020ba4d>] scsi_scan_target+0xa8/0xc2 Feb 22 01:39:26 rsg1 kernel: [<c0213529>] fc_scsi_scan_rport+0x0/0x73 Feb 22 01:39:26 rsg1 kernel: [<c021357e>] fc_scsi_scan_rport+0x55/0x73 Feb 22 01:39:26 rsg1 kernel: [<c01227f9>] run_workqueue+0x77/0xf3 Feb 22 01:39:26 rsg1 kernel: [<c0122875>] worker_thread+0x0/0xb1 Feb 22 01:39:26 rsg1 kernel: [<c012291c>] worker_thread+0xa7/0xb1 Feb 22 01:39:26 rsg1 kernel: [<c01256e9>] autoremove_wake_function+0x0/0x33 Feb 22 01:39:26 rsg1 kernel: [<c01256e9>] autoremove_wake_function+0x0/0x33 Feb 22 01:39:26 rsg1 kernel: [<c0122875>] worker_thread+0x0/0xb1 Feb 22 01:39:26 rsg1 kernel: [<c0125307>] kthread+0x34/0x55 Feb 22 01:39:26 rsg1 kernel: [<c01252d3>] kthread+0x0/0x55 Feb 22 01:39:26 rsg1 kernel: [<c01030c7>] kernel_thread_helper+0x7/0x10 Feb 22 01:39:26 rsg1 kernel: ======================= ------------------------------------------------------------ --------------- As can be seen from the first set of logs the /dev/sdd device is what gets kicked out of the RAID1 device. The SAN target coming back on-line triggers an event which results in the same device being re-discovered as /dev/sde. My assumption is device discovery is forced as the 'next' SCSI device due to the fact the RAID1 MD device is still holding a reference to the failed /dev/sdd device, thus preventing its re-use. Thats at least how I interpret the: Feb 22 01:39:26 rsg1 kernel: kobject_add failed for 3:0:0:1 with -EEXIST, don't try to register things with the same name in the same directory. And the resultant stack trace. The interaction of this with udev tends to complicate the situation from a systems administration point of view. The loss of the /dev/sdd device is picked up by udevd which causes node entries for the block device and its associated partitions to be removed. This tends to violate the premise of 'least surprise' when the systems staff logs in to recover the situation and can't remove the failed device from the RAID1 device and are potentially forced to add a completely different device back into the MD device to recover from the event. Contacted Alan and Aristeu Rozanski due this one -- if this issue is the same as BZ#239604 [RHEL5] console: kobject_add failed, which is already tracked in BZ. Aristeu's reaction: Hello Jan, These issues aren't related. this message is displayed when any part of the kernel try to register an object with an existing name and usually means that there's a bug on the caller. On the vcs case, there's a race on the virtual console code while opening a not totally closed virtual console. If there isn't a BZ# about this RAID bug, please create one and contact the RAID maintainer. -- Aristeu There are two distinctly different bugs in this issue. The first is the kicking the error up to the filesystem and causing the ext3 filesystem to go readonly. That particular bug is a duplicate of bug 430984, so I'm going to entirely ignore that part of this bug report. The second issue is the reregistration of the fc disk under a different sd name due to the kobject still existing when the device is rescanned. I'm going to defer my decision on whether or not that can be fixed in this release cycle until after I've completed 430984 and can see if it impacts this at all. This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release. If you would like this request to be reviewed for the next minor release, ask your support representative to set the next rhel-x.y flag to "?". Unfortunately the previous automated notification about the non-inclusion of this request in Red Hat Enterprise Linux 5.3 used the wrong text template. It should have read: this request has been reviewed by Product Management and is not planned for inclusion in the current minor release of Red Hat Enterprise Linux. If you would like this request to be reviewed for the next minor release, ask your support representative to set the next rhel-x.y flag to "?" or raise an exception. (In reply to comment #3) > There are two distinctly different bugs in this issue. The first is the kicking > the error up to the filesystem and causing the ext3 filesystem to go readonly. > That particular bug is a duplicate of bug 430984, so I'm going to entirely > ignore that part of this bug report. The second issue is the reregistration of > the fc disk under a different sd name due to the kobject still existing when the > device is rescanned. I'm going to defer my decision on whether or not that can > be fixed in this release cycle until after I've completed 430984 and can see if > it impacts this at all. Looks like the second issue is a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=668905 even though it was reported there much later. There's a reference to an upstream fix, so I will dupe this bug to 668905 and work it there. Jes *** This bug has been marked as a duplicate of bug 668905 *** |