Red Hat Bugzilla – Bug 205781
multipath/SCSI hotplug issues on RHEL4 x86_64 2.6.9-34.ELsmp
Last modified: 2010-10-22 01:59:23 EDT
Description of problem:
I'm having real problems getting my servers to come up with a consistent
number of SCSI devices. Is there a timeout or something between drivers
being loaded and SCSI hotplug deciding to create the devices?
For the record, I have 125 LUNs with 8 paths to each LUN making a total
of 1000 SCSI devices.
A normal boot takes about 3-5 minutes but after booting I will be
missing about 20% of my dm-multipath devices due to all 8 paths not
being present. The remaining multipath devices will have a varying
number of paths present.
All this is due to only about 30-40% of the sd devices being created.
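The numbers above can be sanity-checked with a quick shell count. This is a sketch only: it assumes each path shows up as a whole-disk node like /dev/sda (partitions excluded), and the expected total comes from the 125 LUNs x 8 paths stated above.

```shell
# Sketch: count whole-disk sd nodes and compare against the expected
# total of 125 LUNs x 8 paths = 1000. Assumes nodes named /dev/sda,
# /dev/sdb, ... /dev/sdaa, etc.; partitions (sda1) are not counted.
expected=$((125 * 8))
found=$(ls /dev | grep -c '^sd[a-z][a-z]*$')
echo "found $found of $expected sd devices"
```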
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. Set up a SAN with 125 LUNs, each with 8 paths
2. Configure multipathing
Actual results:
Many paths and several dm-multipath devices will be missing. The missing
dm-multipath devices will be those for which all 8 paths are missing.
Expected results:
All paths and mpath devices available
I do have a foolproof workaround in rc.local:
for I in /dev/raw/raw* ; do raw $I 0 0 ; done
service multipathd stop
service multipathd start
service rawdevices start
This increases boot time to about 20 minutes and the load goes to 100+
during the multipath creation but I do get all my SCSI devices.
So, the question is, what is happening differently when the qlogic
driver is loaded at boot time, compared to when I reload it in rc.local?
I've briefly discussed this with Rob Kenna and he has requested a BZ be created.
Nick, do you get the same result with U4?
Are you using LVM? If so, have you adjusted pvcreate --metadatacopies as
described in the man page?
Ryan, I think you were seeing something like this. Was there a solution?
I ran into a similar issue with a large number of paths. I was using i386 and
the lpfc driver. I wasn't able to reliably reproduce the issue of missing paths,
though, as every reboot could give a completely different outcome. I believe we
may be experiencing the same issue here, but I recall the sd devices being
created on my test system, but the paths were not discovered by multipath.
Tom, I think the slow boot is a result of the workaround code Nick added to
rc.local.
Something I noticed recently when I went back to read about the system with
16,000 LUNs connected: this may have to do with dropped hotplug events. They
experienced events being dropped and were able to ensure all events were handled
by increasing the udev buffer to 16M from the 1M it had. It is also noted that
this change was made upstream, though there's no mention of a version. It's
possible that our udev package needs this patched in to support this many disks.
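If that's worth trying here, a minimal sketch of the change follows. The exact parameter the upstream patch touched isn't named in this bug, so the sysctl below is an assumption about the kernel-side receive-buffer ceiling a larger udev buffer would also need, not the patch itself.

```shell
# Sketch only: the report above describes raising udevd's event buffer
# from 1M to 16M. The commented sysctl is an assumption about the
# kernel-wide socket-buffer cap, not the actual udev patch.
target=$((16 * 1024 * 1024))   # 16M, up from the 1M default mentioned above
echo "target buffer: $target bytes"
# As root, raise the kernel-wide receive-buffer ceiling (assumption):
# sysctl -w net.core.rmem_max=$target
```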
Adding netapp engineers since I recall someone finding a similar bug in rhel4 u3
- couldn't find any bugzilla on it though.
Unfortunately client will not run with a different kernel unless a full root
cause analysis points that way - they are in UAT and configuration is supposed
to be frozen - original FAT tests were done with half as much storage and the
problem didn't show up then.
We are not using LVM - this is raw devices for ASM/Oracle.
We at NetApp have also found similar symptoms in one of our tests. When a
large number of LUNs are made visible to a host, the iscsi layer is able to
create device nodes in the /dev namespace for all the visible LUNs, but the
multipathing layer fails to create entries for some of the SCSI devices. The
test used a single path for each LUN, so we should see the same number of
/dev/sd* and /dev/dm* entries; instead, some /dev/dm* entries are missing.
The multipath layer fails to create devices for 3 or 4 out of a range of 120
to 180 iscsi devices.
Steps to recreate:
1. Map 150 iscsi LUNs to a host from the filer.
2. Start the iscsi service; all 150 LUNs are visible.
3. Start the multipath service; multipath creates entries for fewer than 150
iscsi devices, generally missing 3 to 4 LUNs.
4. Restarting the multipath service also misses a few LUNs, but a different
set each time, which points to a timing issue.
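The comparison implied by steps 2-3 can be scripted. A sketch, assuming single-path LUNs so the sd and dm node counts should match; the directory argument is a stand-in for /dev so the logic can be exercised anywhere.

```shell
# Sketch: count whole-disk sd nodes vs dm nodes in a directory and
# report the difference. With one path per LUN the difference should
# be 0; a positive number is how many LUNs multipath missed.
count_mismatch() {
    dir=$1
    sd=$(ls "$dir" | grep -c '^sd[a-z][a-z]*$')
    dm=$(ls "$dir" | grep -c '^dm-[0-9][0-9]*$')
    echo $((sd - dm))
}
# e.g. count_mismatch /dev
```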
This has been seen on both rhel4 u3 and rhel4 u4 x86 versions.
(In reply to comment #5)
> We at NetApp have also found similar symptoms in one of our tests. When a
> large number of LUNs are made visible to a host, the iscsi layer is able to
> create device nodes in the /dev namespace for all the visible LUNs, but the
> multipathing layer fails to create entries for some of the SCSI devices. The
> test used a single path for each LUN, so we should see the same number of
> /dev/sd* and /dev/dm* entries; instead, some /dev/dm* entries are missing.
> The multipath layer fails to create devices for 3 or 4 out of a range of 120
> to 180 iscsi devices.
This is the same behavior I observed with FC; however, I was unable to
consistently reproduce it.
Nick, I think this might be your problem:
Can you disable the network service (chkconfig network off) reboot your machine
(do you have a serial console or physical console), and let me know the results?
Ryan, I think your problem may be a different one, but identical to what NetApp
was seeing. Do you have a /var/log/messages file? Also, are you using iSCSI or FC?
I am hitting something similar to this problem now on one of my setups (rhel4
u4). At the moment I am running a test overnight but should be able to do the
experiment in #8 tomorrow. If this is the problem, I'll be sure to update the BZ.
Ok, initially I thought I was seeing missing paths (subject of this bug), but
apparently that's not the case. I'm just seeing the multipath device maps get
created without all paths in them (the other problem).
I rebooted my system with network disabled, and nothing changes (multipath
device maps get created with only a single path in them; they should all have
2 paths in my setup). My setup is an MSA1000 (an active/passive array) with
14 LUNs directly connected to a QLA2342.
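One way to spot maps that came up short is to tally path lines per map in `multipath -ll` output. A sketch only: the output format assumed here (map header lines starting with `mpath`, path lines containing an sd device name) varies between device-mapper-multipath versions.

```shell
# Sketch: count sd path entries under each mpath header read from stdin.
# Assumes `multipath -ll`-style output; format varies across versions.
paths_per_map() {
    awk '/^mpath/     { map = $1 }
         / sd[a-z]+ / { n[map]++ }
         END          { for (m in n) print m, n[m] }'
}
# e.g. multipath -ll | paths_per_map
```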
This is still on my radar screen but unfortunately I have not had many cycles to
investigate the original problem (/dev/sd*'s not appearing when there's a lot of
disks in the system). I am not sure I ever investigated Ryan's comment #2
(patch for udev to increase hotplug event buffer) so maybe this is the next step.