Description of problem:
The latest 7.3 beta is much better than 7.2 when booting iSCSI multipath. See: bz#1330865. However, about 10% of the nodes still do not boot. What we see is the node PXE boots, starts booting, loads the network card driver. At that time I lose console connection and the node needs to renegotiate with the switch. We think that there may be a timing issue where iscsi is trying to wire up the second target at the same time the link is still coming back up. That said, I see in dracut where it is checking for a good link... but I could be missing something and maybe it starts the wire up before the driver is loaded and while the wire up is happening the driver loads and needs to relink, then causing iscsi to fail.

What we see is:
[ 44.994155] connection2:0: Could not send nopout
[ 75.089995] connection2:0: detected conn error (1020)

Version-Release number of selected component (if applicable):
dracut-033-462.el7.x86_64
iscsi-initiator-utils-6.2.0.873-35.el7.x86_64

How reproducible:
10% of the time
Trent, since this might be related to the driver and the switch, and Opal is down for power work, can you fill in the details about those two things?
Could you boot these machines with rd.debug on the kernel command line and, when the issue occurs, save or upload the contents of /run/initramfs/rdsosreport.txt and the output of journalctl before you reboot?
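A minimal sketch of how that capture might be scripted from the emergency shell on a failed node, assuming some writable destination (for example a mounted USB stick or network share) is available. The function name `collect_boot_debug` and its arguments are hypothetical; only `/run/initramfs/rdsosreport.txt` and `journalctl` come from the request above.

```shell
#!/bin/sh
# Hypothetical helper: copy the dracut debug report and the journal
# into a destination directory before rebooting a failed node.
# Usage: collect_boot_debug <report-file> <dest-dir>
collect_boot_debug() {
    report=$1
    dest=$2
    mkdir -p "$dest" || return 1

    # Copy the report if it exists; warn (but keep going) if it does not.
    if [ -f "$report" ]; then
        cp "$report" "$dest/rdsosreport.txt" || return 1
    else
        echo "no report found at $report" >&2
    fi

    # journalctl may be absent from the initramfs, so check before calling it.
    if command -v journalctl >/dev/null 2>&1; then
        journalctl -b --no-pager > "$dest/journalctl.txt" 2>&1
    fi
    return 0
}
```

It could be called as `collect_boot_debug /run/initramfs/rdsosreport.txt /mnt/usb/node123`, where the destination path is only an illustration.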
Created attachment 1454837: Console, rdsosreport.txt, and journalctl output
This error is not one I remember seeing before; it may be a clue to what is going on. I see it on all the nodes that fail to boot:
sysfs: cannot create duplicate filename '/devices/platform/host10/session1/connection1:0'
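Because the same messages appear on every failing node, saved console logs can be checked for them mechanically. A minimal sketch, assuming each node's console output has been saved to a text file; the function name `has_iscsi_boot_failure` is hypothetical, and the patterns are simply the three messages quoted in this report.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 if the given console log contains any of
# the failure messages quoted in this bug, non-zero otherwise.
has_iscsi_boot_failure() {
    # -F matches the patterns as fixed strings, so "(1020)" is literal.
    grep -q -F \
        -e 'Could not send nopout' \
        -e 'detected conn error (1020)' \
        -e 'sysfs: cannot create duplicate filename' \
        -- "$1"
}
```

Looping over a directory of logs and printing the file names for which the function succeeds would give a quick list of affected nodes after a large boot test.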
Based on the log, this looks more like an iSCSI issue.
Do you have a feel for whether it is on the server or the client side? My guess is client. If I don't use multipath, I don't have failures, and I have the same amount of load going to the server side. To make things more annoying, I found that with rd.break=pre-pivot and multipath enabled, I don't see the issue. The same goes for when I turn on rd.debug, so there is a timing issue somewhere in there.
I tested the modules.d/95iscsi/iscsiroot.sh from GitHub on 1000 nodes, and it seems to address the issue: all 1000 nodes booted the first time. It looks like they changed from iscsistart to iscsid in commit b31f3fe0d1bea66078ef65c736df03a150f74607.
Hi Lukáš, Trent @ LLNL has noted that a closer-to-upstream version of the iSCSI startup scripts resolves the issue for them. He is specifically calling out: https://github.com/dracutdevs/dracut/commit/b31f3fe0d1bea66078ef65c736df03a150f74607 Would it be possible to pull this change into 7.6 this late in the schedule? Could you build an rpm with this change for LLNL to validate? Thank you, Travis
Note I just tested iscsiroot.sh as is. I was just calling out what commit I felt likely brought in the fix.
Definitely not for 7.6. Given the complexity of the patch, I am not sure we want to make such a change in RHEL 7 at all.
LLNL has been carrying a later iscsiroot.sh which, at last report, had resolved the issue. I'm closing this bug because RHEL 7 isn't entertaining any further enhancements, defects have to clear a high bar for inclusion, and there was earlier concern about including the change at all. As far as I can discern, the version Trent was using (or a later version) is in RHEL 8 (dracut v49). If RHEL 8 exhibits the same defect, please log a new bug.