Bug 1383842 - [LLNL 7.5 Bug] iSCSI multipath fails to boot 10% of the time
Summary: [LLNL 7.5 Bug] iSCSI multipath fails to boot 10% of the time
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: iscsi-initiator-utils
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: rc
Target Release: 7.5
Assignee: Chris Leech
QA Contact: Filip Suba
URL:
Whiteboard:
Depends On:
Blocks: 1599298
Reported: 2016-10-11 23:53 UTC by Ben Woodard
Modified: 2021-09-03 13:48 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-04 18:46:32 UTC
Target Upstream Version:


Attachments
Console, rdsosreport.txt, and journalctl output (646.70 KB, text/plain)
2018-06-26 23:52 UTC, Trent D'Hooge


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1330865 1 None None None 2021-09-09 11:49:54 UTC

Internal Links: 1330865

Description Ben Woodard 2016-10-11 23:53:07 UTC
Description of problem:
The latest 7.3 beta is much better than 7.2 at booting from iSCSI multipath (see bz#1330865). However, about 10% of the nodes still fail to boot. The node PXE boots, starts booting, and loads the network card driver. At that point I lose the console connection and the node has to renegotiate its link with the switch.

We think there may be a timing issue where iscsi tries to wire up the second target while the link is still coming back up. That said, I see that dracut checks for a good link, but I could be missing something: maybe the wire-up starts before the driver is loaded, and while it is in progress the driver loads and has to relink, which then causes iscsi to fail. What we see is:

[   44.994155]  connection2:0: Could not send nopout
[   75.089995]  connection2:0: detected conn error (1020)

Version-Release number of selected component (if applicable):
dracut-033-462.el7.x86_64
iscsi-initiator-utils-6.2.0.873-35.el7.x86_64

How reproducible:
10% of the time

Comment 2 Ben Woodard 2016-10-12 00:00:23 UTC
Trent,

Since this might be related to the driver and the switch, and Opal is down for power work, can you fill in the details about those two things?

Comment 4 Lukáš Nykrýn 2016-10-12 12:24:20 UTC
Could you boot these machines with rd.debug on the kernel command line and, when the issue occurs, save or upload the contents of /run/initramfs/rdsosreport.txt and the output of journalctl before you reboot?

Comment 6 Trent D'Hooge 2018-06-26 23:52:36 UTC
Created attachment 1454837 [details]
Console, rdsosreport.txt, and journalctl output

Comment 7 Trent D'Hooge 2018-06-26 23:54:43 UTC
I don't remember seeing this error before; it may be a clue to what is going on. I see it on all the nodes that fail to boot.

sysfs: cannot create duplicate filename '/devices/platform/host10/session1/connection1:0'
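
For anyone triaging many nodes, the three messages quoted in this report can be counted per log with a small script. This is only a sketch: the patterns are taken from the lines quoted above, the label names and the `scan_log` helper are made up for illustration, and real logs may word these messages differently.

```python
import re
from collections import Counter

# Message fragments quoted in this bug report. The dictionary keys are
# illustrative labels, not names used by the kernel or by open-iscsi.
SIGNATURES = {
    "nopout_failed": re.compile(r"connection\d+:\d+: Could not send nopout"),
    "conn_error_1020": re.compile(
        r"connection\d+:\d+: detected conn error \(1020\)"
    ),
    "sysfs_duplicate": re.compile(
        r"sysfs: cannot create duplicate filename "
        r"'[^']*/session\d+/connection\d+:\d+'"
    ),
}


def scan_log(lines):
    """Count how many lines match each known failure signature."""
    counts = Counter()
    for line in lines:
        for label, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[label] += 1
    return counts
```

To use it, pass any iterable of log lines, for example an open file holding saved console or journalctl output from one node; a node whose counts are all zero shows none of the signatures reported here.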

Comment 8 Lukáš Nykrýn 2018-06-28 14:17:45 UTC
Based on the log, this looks more like an iSCSI issue.

Comment 9 Trent D'Hooge 2018-06-28 15:03:24 UTC
Do you have a feel for whether this is server side or client side? My guess is client. If I don't use multipath, I don't have failures, and I have the same amount of load going to the server side.

To make things more annoying, I found that when I use rd.break=pre-pivot with multipath, I don't see the issue. The same is true when I turn on rd.debug, so there is a timing issue somewhere in there.

Comment 10 Trent D'Hooge 2018-07-26 00:25:13 UTC
I tested modules.d/95iscsi/iscsiroot.sh from GitHub on 1000 nodes, and it seems to address the issue: all 1000 nodes booted the first time. It looks like they changed from iscsistart to iscsid.

commit b31f3fe0d1bea66078ef65c736df03a150f74607
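
As a rough sanity check on the 1000-node result, the sketch below estimates how likely a clean run would be if the roughly 10% failure rate from the description still applied. It assumes each node boots independently with the same failure probability, which this report does not establish, and the function names are made up for illustration.

```python
def prob_all_boot(nodes, failure_rate):
    """Probability that every node boots, assuming independent boots."""
    return (1.0 - failure_rate) ** nodes


def upper_bound_failure_rate(nodes, confidence=0.95):
    """Largest per-boot failure rate still consistent, at the given
    confidence, with zero failures in `nodes` independent boots.
    Solves (1 - p) ** nodes = 1 - confidence for p."""
    return 1.0 - (1.0 - confidence) ** (1.0 / nodes)


# Chance of 1000 clean boots if 10% of boots still failed.
p = prob_all_boot(1000, 0.10)

# Failure rate that 1000 clean boots can rule out at 95% confidence.
bound = upper_bound_failure_rate(1000)

print(f"P(all 1000 boot at 10% failure rate) = {p:.2e}")
print(f"95% upper bound on failure rate     = {bound:.4%}")
```

Under those assumptions a clean run would be vanishingly unlikely if nothing had changed, and the run bounds the remaining failure rate at roughly 0.3% per boot. It does not show which commit is responsible.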

Comment 11 Travis Gummels 2018-07-26 15:29:57 UTC
Hi Lukáš,

Trent @ LLNL has noted that a closer-to-upstream version of the iscsi startup scripts resolves the issue for them. He is specifically calling out:

https://github.com/dracutdevs/dracut/commit/b31f3fe0d1bea66078ef65c736df03a150f74607

Would it be possible to pull this change in to 7.6 this late in the schedule?  Could you build an rpm with this change for LLNL to validate?

Thank you,

Travis

Comment 12 Trent D'Hooge 2018-07-26 15:42:05 UTC
Note that I tested iscsiroot.sh as is. I was only calling out the commit that I felt likely brought in the fix.

Comment 13 Lukáš Nykrýn 2018-07-26 16:35:32 UTC
Definitely not for 7.6. Given the complexity of the patch, I am not sure we want to make such a change in RHEL 7 in general.

Comment 15 Travis Gummels 2020-08-04 18:46:32 UTC
LLNL has been carrying a later iscsiroot.sh which, at last report, had resolved the issue. Since RHEL 7 isn't entertaining any further enhancements, defects have to clear a high bar for inclusion, and there was earlier concern about including this change at all, I'm closing this bug. As far as I can discern, the version Trent was using (or a later one) is in RHEL 8 (dracut v49). If RHEL 8 exhibits the same defect, please log a new bug.

