Bug 1383842

Summary: [LLNL 7.5 Bug] iSCSI multipath fails to boot 10% of the time
Product: Red Hat Enterprise Linux 7 Reporter: Ben Woodard <woodard>
Component: iscsi-initiator-utils Assignee: Chris Leech <cleech>
Status: CLOSED WONTFIX QA Contact: Filip Suba <fsuba>
Severity: medium Docs Contact:
Priority: medium    
Version: 7.3 CC: dracut-maint-list, lnykryn, tdhooge, tgummels
Target Milestone: rc   
Target Release: 7.5   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-08-04 18:46:32 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Bug Depends On:    
Bug Blocks: 1599298    
Attachments:
Description: Console, rdsosreport.txt, and journalctl output
Flags: none

Description Ben Woodard 2016-10-11 23:53:07 UTC
Description of problem:
The latest 7.3 beta is much better than 7.2 at booting from iSCSI multipath (see bz#1330865). However, about 10% of the nodes still do not boot. What we see is that the node PXE boots, starts up, and loads the network card driver. At that point I lose the console connection and the node has to renegotiate its link with the switch.

We think there may be a timing issue where iscsi tries to wire up the second target while the link is still coming back up. That said, I see that dracut checks for a good link, but I could be missing something: maybe the wire-up starts before the driver is loaded, and while the wire-up is in progress the driver loads and has to relink, which then causes iscsi to fail. What we see is:

[   44.994155]  connection2:0: Could not send nopout
[   75.089995]  connection2:0: detected conn error (1020)

Version-Release number of selected component (if applicable):
dracut-033-462.el7.x86_64
iscsi-initiator-utils-6.2.0.873-35.el7.x86_64

How reproducible:
10% of the time

Comment 2 Ben Woodard 2016-10-12 00:00:23 UTC
Trent,

Since this might be related to the driver and the switch, and Opal is down for power work, can you fill in the details about those two things?

Comment 4 Lukáš Nykrýn 2016-10-12 12:24:20 UTC
Could you boot these machines with rd.debug on the kernel command line and, when the issue occurs, save or upload the contents of /run/initramfs/rdsosreport.txt and the output of journalctl before you reboot?

Comment 6 Trent D'Hooge 2018-06-26 23:52:36 UTC
Created attachment 1454837 [details]
Console, rdsosreport.txt, and journalctl output

Comment 7 Trent D'Hooge 2018-06-26 23:54:43 UTC
This error is not one I remember; it may be a clue to what is going on. I see it on all the nodes that fail to boot.

sysfs: cannot create duplicate filename '/devices/platform/host10/session1/connection1:0'
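
As an illustration of how the messages quoted in this report could be used to triage console logs from many nodes, here is a minimal Python sketch. The function name, the signature labels, and the sample text layout are hypothetical; only the message strings themselves come from this bug.

```python
import re

# Message patterns quoted in this bug report.
# The dictionary keys are labels made up for this sketch.
SIGNATURES = {
    "nopout_failed": re.compile(r"connection\d+:\d+: Could not send nopout"),
    "conn_error_1020": re.compile(r"connection\d+:\d+: detected conn error \(1020\)"),
    "sysfs_duplicate": re.compile(
        r"sysfs: cannot create duplicate filename '[^']*/session\d+/connection\d+:\d+'"
    ),
}


def find_signatures(log_text):
    """Return the sorted labels of all known signatures found in log_text."""
    return sorted(
        label for label, pattern in SIGNATURES.items() if pattern.search(log_text)
    )


# Sample input assembled from the lines quoted in this report.
sample = """\
[   44.994155]  connection2:0: Could not send nopout
[   75.089995]  connection2:0: detected conn error (1020)
sysfs: cannot create duplicate filename '/devices/platform/host10/session1/connection1:0'
"""
print(find_signatures(sample))
```

Run against the saved console output of each node, a check like this would separate nodes that show the duplicate sysfs entry from nodes that only show the connection errors.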

Comment 8 Lukáš Nykrýn 2018-06-28 14:17:45 UTC
Based on the log, this looks more like an iscsi issue.

Comment 9 Trent D'Hooge 2018-06-28 15:03:24 UTC
Do you have a feel for whether this is on the server or the client side? My guess is the client. If I don't use multipath, I don't get failures, and I have the same amount of load going to the server side.

To make things more annoying, I found that with rd.break=pre-pivot and multipath enabled, I don't see the issue. The same is true when I turn on rd.debug, so there is a timing issue somewhere in there.

Comment 10 Trent D'Hooge 2018-07-26 00:25:13 UTC
I tested modules.d/95iscsi/iscsiroot.sh from GitHub on 1000 nodes, and it seems to address the issue: all 1000 nodes booted the first time. It looks like they changed from iscsistart to iscsid.

commit b31f3fe0d1bea66078ef65c736df03a150f74607
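
For a rough sense of how strong that result is: if each boot still failed independently at the roughly 10% rate reported in the description, a clean run of 1000 nodes would be extremely unlikely. The sketch below is only that back-of-the-envelope calculation; the independence assumption and the function name are introduced here for illustration and are not part of the test that was run.

```python
def prob_all_boot(failure_rate, nodes):
    """Probability that every node boots, assuming independent failures."""
    return (1.0 - failure_rate) ** nodes


# 10% per-boot failure rate (from the description), 1000 nodes (from this comment).
p = prob_all_boot(0.10, 1000)
print(f"{p:.1e}")
```

The result is on the order of 1e-46, so under these assumptions the clean run is very hard to explain by chance alone.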

Comment 11 Travis Gummels 2018-07-26 15:29:57 UTC
Hi Lukáš,

Trent @ LLNL has noted that a closer-to-upstream version of the iscsi startup scripts resolves the issue for them. He is specifically calling out:

https://github.com/dracutdevs/dracut/commit/b31f3fe0d1bea66078ef65c736df03a150f74607

Would it be possible to pull this change into 7.6 this late in the schedule? Could you build an rpm with this change for LLNL to validate?

Thank you,

Travis

Comment 12 Trent D'Hooge 2018-07-26 15:42:05 UTC
Note that I only tested iscsiroot.sh as is. I was just calling out the commit that I felt most likely brought in the fix.

Comment 13 Lukáš Nykrýn 2018-07-26 16:35:32 UTC
Definitely not for 7.6. Given the complexity of the patch, I am not sure we want to make such a change in RHEL 7 at all.

Comment 15 Travis Gummels 2020-08-04 18:46:32 UTC
LLNL has been carrying a later iscsiroot.sh which, at last report, had resolved the issue. Since RHEL 7 isn't entertaining any further enhancements, defects have to clear a high bar for inclusion, and given the earlier concern about including the change at all, I'm closing this bug. As far as I can discern, the version Trent was using (or a later version) is in RHEL 8 (dracut v49). If RHEL 8 exhibits the same defect, please log a new bug.