Created attachment 513837 [details]
logs

Description of problem:

Case:
1) host connected to 180 devices via FC, 4 paths each
2) reboot host
3) host does not answer ping for several hours
4) connecting to the console (which is only sometimes available), I see a flood of kernel I/O errors interleaved with LVM read failures. De-interleaved, the kernel messages are:

end_request: I/O error, dev dm-9, sector 1350567808
end_request: I/O error, dev dm-9, sector 1350567920
end_request: I/O error, dev dm-9, sector 134219776
end_request: I/O error, dev dm-9, sector 134219784
end_request: I/O error, dev dm-9, sector 134219776
end_request: I/O error, dev dm-9, sector 1084229504
end_request: I/O error, dev dm-9, sector 1084229616
end_request: I/O error, dev dm-9, sector 671090688
end_request: I/O error, dev dm-9, sector 671090696
end_request: I/O error, dev dm-9, sector 671090688
end_request: I/O error, dev dm-2, sector 102762368
end_request: I/O error, dev dm-2, sector 102762480
end_request: I/O error, dev dm-2, sector 100665344
end_request: I/O error, dev dm-2, sector 100665352
end_request: I/O error, dev dm-2, sector 100665344
end_request: I/O error, dev dm-4, sector 513804160
end_request: I/O error, dev dm-4, sector 513804272
end_request: I/O error, dev dm-4, sector 511707136
end_request: I/O error, dev dm-4, sector 511707144
end_request: I/O error, dev dm-4, sector 511707136
end_request: I/O error, dev dm-3, sector 425723776
end_request: I/O error, dev dm-3, sector 425723888

and the interleaved LVM read failures are:

/dev/2b27b725-9063-4a27-899a-3ec8d02c1ade/6355c334-af2d-4dc2-a64a-7098f359449d: read failed after 0 of 4096 at 1073733632: Input/output error
/dev/2b27b725-9063-4a27-899a-3ec8d02c1ade/6355c334-af2d-4dc2-a64a-7098f359449d: read failed after 0 of 4096 at 0: Input/output error
/dev/2b27b725-9063-4a27-899a-3ec8d02c1ade/6355c334-af2d-4dc2-a64a-7098f359449d: read failed after 0 of 4096 at 4096: I[truncated in capture]

This is reproducible. A few interesting facts:
1) once the host is up, all devices are up and everything works
2) if I remove all devices from the storage system and reboot the host, it finishes booting in 5 minutes

This is a serious pain.
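For anyone triaging a flood like this, the dm-N names can be mapped back to multipath maps with the standard device-mapper/multipath tools (a generic sketch; none of these names come from the attached logs):

    # Map dm minor numbers back to device-mapper map names
    dmsetup ls                          # name -> (major, minor) for every dm device
    ls -l /dev/mapper                   # /dev/mapper symlinks pointing at the dm-N nodes
    # Show each multipath map with its WWID, path list, and path states
    multipath -ll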
Can you try the following:

1. Check how many paths you are monitoring with:
   # multipathd paths count
2. Disable multipathd with:
   # chkconfig multipathd off
3. Reboot the machine. See how long it takes to reboot without multipathd running.
4. Make sure all of the SCSI devices have been created.
5. Start multipathd:
   # service multipathd start
6. Wait for multipathd to finish creating all the devices; you can check with:
   # multipathd paths count

See if it goes faster this way (consolidated sketch below). The issue I'm checking for is that SCSI devices sometimes get presented in a non-optimal order, and it then takes multipathd a long time to create the multipath devices. If multipathd can see all of the devices at once, it often goes much faster. I want to see if that's the issue you are running into.
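Putting those steps together as one shell sketch (the sysfs check in step 4 is just one way to confirm the SCSI devices exist, not something multipathd itself provides):

    multipathd paths count              # 1. note the current path count
    chkconfig multipathd off            # 2. keep multipathd from starting at boot
    reboot                              # 3. time the boot without multipathd
    # after boot:
    ls /sys/class/scsi_device | wc -l   # 4. confirm all SCSI devices were created
    service multipathd start            # 5. start multipathd by hand
    multipathd paths count              # 6. repeat until it matches the old count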
Still getting the same results.

Before reboot:
[root@rhev-a8c-02 ~]# multipathd paths count
Paths: 1044
Busy: False

Then:
[root@rhev-a8c-02 ~]# chkconfig multipathd off
[root@rhev-a8c-02 ~]# reboot

The machine still takes a long time to boot.
Do you think that it is still multipath that is making the boot take so long? Is multipath running in the initramfs? Is this a multipathed root system?

If this isn't a multipathed root system, can you leave multipathd chkconfig'd off and delete /etc/multipath.conf (save a copy so you can restore it later). Finally, make sure all your multipath devices are removed, and then rebuild your initramfs with the --hostonly option:

# dracut --force --hostonly

Note: this will overwrite your existing initramfs, so you may want to make a backup copy in case something goes wrong, and to make it easy to switch back.

This should make sure that multipath isn't doing anything on your system. If bootup still takes a long time, then multipath is not at fault.
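The whole procedure as a sketch, assuming the stock RHEL 6 initramfs naming; "multipath -F" is one way to flush the existing multipath devices:

    cp /etc/multipath.conf /etc/multipath.conf.bak      # save a copy to restore later
    rm /etc/multipath.conf
    cp /boot/initramfs-$(uname -r).img \
       /boot/initramfs-$(uname -r).img.bak              # backup before dracut overwrites it
    multipath -F                                        # flush all multipath devices
    dracut --force --hostonly                           # rebuild a host-only initramfs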
Ping - if this is indeed a blocker, can someone please respond with the requested data?
This looks like it's a repeat of Bug 500998, but not for iSCSI devices, which was my original guess in Comment 2.

Unfortunately, for some reason when I delete /etc/multipath.conf, it gets restored whenever I reboot the node. I assume that this isn't news to the RHEV people.

If you take Haim's first scenario from comment 7 and change the order, so it is:

- host is not connected to FC
- host boots up very quickly
- stop multipathd
- connect host to FC and scan the SCSI bus
- start multipathd

then the host never goes unresponsive, and multipathd finishes its work within a couple of minutes (sketch below).

The issue is this: once multipathd starts up, it creates multipath devices as soon as it sees a valid path. It has no idea how many paths there will eventually be, and the order it gets paths in is the order udev sends them. In this case, the first paths multipathd is seeing are not on the primary controller. This forces multipath to trespass the LUN to make it active. Later, when the primary path appears, multipath trespasses the LUN back. Assuming it gets the uevents for all the wrong paths first, multipath will need to do two trespasses for every LUN.

In the past, we've advised customers with a large number of LUNs behind a hardware handler that must run to switch active paths to make sure those devices are discovered in the initramfs, and that multipathd does not run until after the initramfs, unless it's necessary. That way, multipathd sees all the paths when it starts up and builds the devices without having to trespass the LUNs.

I've discussed the possibility of having multipath delay the creation of a device when it has only seen passive paths. However, it can't wait forever, since there is no guarantee that the active path will ever appear, and adding this feature is not trivial.
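For reference, the reordered scenario as a rough shell sketch (the sysfs "- - -" write is the standard kernel rescan trigger; exact host numbers vary per HBA):

    service multipathd stop
    # ...connect the host to FC / unblock the switch ports...
    for h in /sys/class/scsi_host/host*/scan; do
        echo "- - -" > "$h"             # rescan every channel/target/LUN on each HBA
    done
    service multipathd start
    multipathd paths count              # all paths are visible before any maps are built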
Looking at the initramfs, it appears that the qla2xxx module should be loaded. However, if the SCSI device handler modules aren't loaded before multipath starts creating paths, you will get I/O errors, slowdown, and system unresponsiveness, according to Bug 690523, even if multipath isn't in the initramfs. In fact, this bug looks pretty much like a dup of 690523.
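To verify whether the device handler modules are actually in the initramfs and loaded early, something like this should work (module names are the usual RHEL 6 scsi_dh set; pick the one matching the array type):

    lsinitrd /boot/initramfs-$(uname -r).img | grep -E 'qla2xxx|scsi_dh'
    lsmod | grep scsi_dh                # scsi_dh_alua / scsi_dh_rdac / scsi_dh_emc etc.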
*** This bug has been marked as a duplicate of bug 690523 ***