Created attachment 513837 [details]
logs

Description of problem:

Case:
1) host connected to 180 devices via FC, 4 paths each
2) reboot host
3) host does not answer ping for several hours
4) connecting to the console (which is only sometimes available), I see a flood of kernel I/O errors interleaved with LVM read failures. De-interleaved, the kernel messages are:

end_request: I/O error, dev dm-9, sector 1350567808
end_request: I/O error, dev dm-9, sector 1350567920
end_request: I/O error, dev dm-9, sector 134219776
end_request: I/O error, dev dm-9, sector 134219784
end_request: I/O error, dev dm-9, sector 134219776
end_request: I/O error, dev dm-9, sector 1084229504
end_request: I/O error, dev dm-9, sector 1084229616
end_request: I/O error, dev dm-9, sector 671090688
end_request: I/O error, dev dm-9, sector 671090696
end_request: I/O error, dev dm-9, sector 671090688
end_request: I/O error, dev dm-2, sector 102762368
end_request: I/O error, dev dm-2, sector 102762480
end_request: I/O error, dev dm-2, sector 100665344
end_request: I/O error, dev dm-2, sector 100665352
end_request: I/O error, dev dm-2, sector 100665344
end_request: I/O error, dev dm-4, sector 513804160
end_request: I/O error, dev dm-4, sector 513804272
end_request: I/O error, dev dm-4, sector 511707136
end_request: I/O error, dev dm-4, sector 511707144
end_request: I/O error, dev dm-4, sector 511707136
end_request: I/O error, dev dm-3, sector 425723776
end_request: I/O error, dev dm-3, sector 425723888

and the interleaved LVM read failures are:

/dev/2b27b725-9063-4a27-899a-3ec8d02c1ade/6355c334-af2d-4dc2-a64a-7098f359449d: read failed after 0 of 4096 at 1073733632: Input/output error
/dev/2b27b725-9063-4a27-899a-3ec8d02c1ade/6355c334-af2d-4dc2-a64a-7098f359449d: read failed after 0 of 4096 at 0: Input/output error
/dev/2b27b725-9063-4a27-899a-3ec8d02c1ade/6355c334-af2d-4dc2-a64a-7098f359449d: read failed after 0 of 4096 at 4096: I[truncated in capture]

This is reproducible. A few interesting facts:
1) once the host is up, all devices are up and everything works
2) if I remove all devices from the storage system and reboot the host, it finishes booting in 5 minutes

This is a serious pain.
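For anyone triaging a flood like this, the dm-N names can be mapped back to multipath maps with the standard device-mapper/multipath tools (a generic sketch; none of these names come from the attached logs):

    # Map dm minor numbers back to device-mapper map names
    dmsetup ls                          # name -> (major, minor) for every dm device
    ls -l /dev/mapper                   # /dev/mapper symlinks pointing at the dm-N nodes
    # Show each multipath map with its WWID, path list, and path states
    multipath -ll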
Can you try the following:

1. Check how many paths you are monitoring with:
   # multipathd paths count
2. Disable multipathd with:
   # chkconfig multipathd off
3. Reboot the machine. See how long it takes to reboot without multipathd running.
4. Make sure all of the SCSI devices have been created.
5. Start multipathd:
   # service multipathd start
6. Wait for multipathd to finish creating all the devices; you can check with:
   # multipathd paths count

See if it goes faster this way (consolidated sketch below). The issue I'm checking for is that SCSI devices sometimes get presented in a non-optimal order, and it then takes multipathd a long time to create the multipath devices. If multipathd can see all of the devices at once, it often goes much faster. I want to see if that's the issue you are running into.
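Putting those steps together as one shell sketch (the sysfs check in step 4 is just one way to confirm the SCSI devices exist, not something multipathd itself provides):

    multipathd paths count              # 1. note the current path count
    chkconfig multipathd off            # 2. keep multipathd from starting at boot
    reboot                              # 3. time the boot without multipathd
    # after boot:
    ls /sys/class/scsi_device | wc -l   # 4. confirm all SCSI devices were created
    service multipathd start            # 5. start multipathd by hand
    multipathd paths count              # 6. repeat until it matches the old count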
Still getting the same results.

Before reboot:
[root@rhev-a8c-02 ~]# multipathd paths count
Paths: 1044
Busy: False

Then:
[root@rhev-a8c-02 ~]# chkconfig multipathd off
[root@rhev-a8c-02 ~]# reboot

The machine still takes a long time to boot.
Do you think that it is still multipath that is making the boot take so long? Is multipath running in the initramfs? Is this a multipathed root system?

If this isn't a multipathed root system, can you leave multipathd chkconfig'd off and delete /etc/multipath.conf (save a copy so you can restore it later). Finally, make sure all your multipath devices are removed, and then rebuild your initramfs with the --hostonly option:

# dracut --force --hostonly

Note: this will overwrite your existing initramfs, so you may want to make a backup copy in case something goes wrong, and to make it easy to switch back.

This should make sure that multipath isn't doing anything on your system. If bootup still takes a long time, then multipath is not at fault.
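The whole procedure as a sketch, assuming the stock RHEL 6 initramfs naming; "multipath -F" is one way to flush the existing multipath devices:

    cp /etc/multipath.conf /etc/multipath.conf.bak      # save a copy to restore later
    rm /etc/multipath.conf
    cp /boot/initramfs-$(uname -r).img \
       /boot/initramfs-$(uname -r).img.bak              # backup before dracut overwrites it
    multipath -F                                        # flush all multipath devices
    dracut --force --hostonly                           # rebuild a host-only initramfs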
Ping - if this is indeed a blocker, can someone please respond with the requested data?
This looks like it's a repeat of Bug 500998, but not for iSCSI devices, which was my original guess in Comment 2.

Unfortunately, for some reason when I delete /etc/multipath.conf, it gets restored whenever I reboot the node. I assume that this isn't news to the RHEV people.

If you take Haim's first scenario from comment 7 and change the order, so it is:

- host is not connected to FC
- host boots up very quickly
- stop multipathd
- connect host to FC and scan the SCSI bus
- start multipathd

then the host never goes unresponsive, and multipathd finishes its work within a couple of minutes (sketch below).

The issue is this: once multipathd starts up, it creates multipath devices as soon as it sees a valid path. It has no idea how many paths there will eventually be, and the order it gets paths in is the order udev sends them. In this case, the first paths multipathd is seeing are not on the primary controller. This forces multipath to trespass the LUN to make it active. Later, when the primary path appears, multipath trespasses the LUN back. Assuming it gets the uevents for all the wrong paths first, multipath will need to do two trespasses for every LUN.

In the past, we've advised customers with a large number of LUNs behind a hardware handler that must run to switch active paths to make sure those devices are discovered in the initramfs, and that multipathd does not run until after the initramfs, unless it's necessary. That way, multipathd sees all the paths when it starts up and builds the devices without having to trespass the LUNs.

I've discussed the possibility of having multipath delay the creation of a device when it has only seen passive paths. However, it can't wait forever, since there is no guarantee that the active path will ever appear, and adding this feature is not trivial.
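For reference, the reordered scenario as a rough shell sketch (the sysfs "- - -" write is the standard kernel rescan trigger; exact host numbers vary per HBA):

    service multipathd stop
    # ...connect the host to FC / unblock the switch ports...
    for h in /sys/class/scsi_host/host*/scan; do
        echo "- - -" > "$h"             # rescan every channel/target/LUN on each HBA
    done
    service multipathd start
    multipathd paths count              # all paths are visible before any maps are built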
Looking at the initramfs, it appears that the qla2xxx module should be loaded. However, if the SCSI device handler modules aren't loaded before multipath starts creating paths, you will get I/O errors, slowdown, and system unresponsiveness, according to Bug 690523, even if multipath isn't in the initramfs. In fact, this bug looks pretty much like a dup of 690523.
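To verify whether the device handler modules are actually in the initramfs and loaded early, something like this should work (module names are the usual RHEL 6 scsi_dh set; pick the one matching the array type):

    lsinitrd /boot/initramfs-$(uname -r).img | grep -E 'qla2xxx|scsi_dh'
    lsmod | grep scsi_dh                # scsi_dh_alua / scsi_dh_rdac / scsi_dh_emc etc.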
*** This bug has been marked as a duplicate of bug 690523 ***