Bug 2083271 - iscsi: Keep existing session on "session exists"
Summary: iscsi: Keep existing session on "session exists"
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.50.0.13
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.1
Target Release: ---
Assignee: Nir Soffer
QA Contact: Shir Fishbain
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-09 15:07 UTC by Evelina Shames
Modified: 2022-06-23 05:54 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-06-23 05:54:58 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?
pm-rhel: blocker?




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | oVirt vdsm pull 166 | 0 | None | Merged | iscsi: Keep existing session on "session exists" | 2022-05-10 12:20:23 UTC
Red Hat Issue Tracker | RHV-45992 | 0 | None | None | None | 2022-05-09 15:21:17 UTC

Description Evelina Shames 2022-05-09 15:07:07 UTC
Description of problem:
If logging into a node fails with "session exists" (error 15):

iscsiadm: initiator reported error (15 - session exists)

We used to remove the node, which disconnects the node and removes it.
This is not new behavior, but it seems that in 4.4 this cleanup was not
effective when logging in to the same connection more than once. Now it
reliably disconnects the first node and leaves the host without any
connected nodes, which makes it non-operational.
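The fix named in the summary ("keep existing session on 'session exists'") can be sketched as follows: treat exit status 15 (ISCSI_ERR_SESS_EXISTS in iscsiadm) as "already logged in" and keep the node, removing it only on other login failures. This is a hypothetical model, not vdsm's code: `FakeInitiator`, `LoginError`, and `connect` are illustrative names, and an in-memory object stands in for iscsiadm.

```python
ISCSI_ERR_SESS_EXISTS = 15  # iscsiadm exit status: session is logged in


class LoginError(Exception):
    """Hypothetical error raised by the fake initiator on login failure."""

    def __init__(self, code, message):
        super().__init__(f"initiator reported error ({code} - {message})")
        self.code = code


class FakeInitiator:
    """Hypothetical in-memory stand-in for iscsiadm node/session state."""

    def __init__(self):
        self.nodes = set()
        self.sessions = set()

    def add_node(self, target):
        self.nodes.add(target)

    def login(self, target):
        if target in self.sessions:
            raise LoginError(ISCSI_ERR_SESS_EXISTS, "session exists")
        self.sessions.add(target)

    def remove_node(self, target):
        # Removing a node also logs out its session.
        self.sessions.discard(target)
        self.nodes.discard(target)


def connect(initiator, target):
    """Add the node and log in, keeping an already existing session."""
    initiator.add_node(target)
    try:
        initiator.login(target)
    except LoginError as e:
        if e.code == ISCSI_ERR_SESS_EXISTS:
            return  # Already logged in: keep the node and its session.
        initiator.remove_node(target)  # Other failures: clean up the node.
        raise


# Two duplicate connections to the same (hypothetical) target.
target = ("10.0.0.1", 3260, "iqn.2022-05.example:target")
initiator = FakeInitiator()
for _ in range(2):
    connect(initiator, target)
print(len(initiator.sessions))  # → 1
```

With this handling, connecting to the same target twice ends with one session, which is the expected result stated below.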

Version-Release number of selected component (if applicable):
RHV-4.5.0

How reproducible:
always

Steps to Reproduce:
1. Have several duplicate iSCSI connections in the engine DB and an iSCSI SD.
2. Deactivate, then activate the host.


Actual results:
If logging into a node fails with "session exists" (error 15), vdsm removes the node, which disconnects the existing session; the host is left without any connected nodes and becomes non-operational.

Expected results:
If the host connects to the same target more than once, it should end up with one connected target.

Additional info:

Comment 1 Avihai 2022-05-09 15:37:28 UTC
This one is breaking most QE automation runs: hosts go to non-operational state because we (RHV QE) use the Ansible role storages/tasks/main.yml (https://github.com/oVirt/ovirt-ansible-collection/blob/master/roles/infra/roles/storages/tasks/main.yml) for all our environment setups.
 
This occurs only in RHV 4.4 SP1/RHEL 8.6, so it is a regression.

I am not sure why connecting to an existing iSCSI connection drops the connection; it does not make sense. According to @Nir's patch, this was also the behavior on RHV 4.4/RHEL <= 8.5, BUT in RHV 4.4 the connection was NOT dropped, so this issue of the host becoming non-operational (due to dropped iSCSI connections) was not seen.

You can hit this in several ways:
1) Using ovirt-ansible-collection[1]:

This sets up RHV with several iSCSI storage connections (A, B, C).
The storage update task overwrites the connections in the RHV engine with one of the connections (A), creating multiple duplicate iSCSI connections (bug 2079896), and the engine allows it (bug 2079903).

This results in the following:
When the host is deactivated and activated, or rebooted, VDSM tries to reconnect to the same duplicate iSCSI connection (A)
-> This DROPS the A (duplicate) iSCSI connection from the host -> The host goes to non-operational state for several minutes(!) because it does not see the storage backend.

2) Using the API/UI: the customer can now override storage connections and create this issue, since the engine does not block it (bug 2079903).
The same as in 1) will occur.
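To make "duplicate connection" concrete, here is a minimal sketch of detecting connections that point at the same target. It assumes a connection is identified by portal address, port, and IQN; that key, the `Connection` fields, and `duplicate_targets` are hypothetical and may not match the engine's actual schema.

```python
from collections import Counter
from typing import NamedTuple


class Connection(NamedTuple):
    # Hypothetical fields; the engine's real connection schema may differ.
    name: str
    address: str
    port: int
    iqn: str


def duplicate_targets(connections):
    """Return (address, port, iqn) keys used by more than one connection."""
    counts = Counter((c.address, c.port, c.iqn) for c in connections)
    return {key for key, n in counts.items() if n > 1}


# Scenario 1): connections A, B and C all overwritten with A's target.
a = ("10.0.0.1", 3260, "iqn.2022-05.example:a")
conns = [Connection(name, *a) for name in ("A", "B", "C")]
print(duplicate_targets(conns))  # → {('10.0.0.1', 3260, 'iqn.2022-05.example:a')}
```

Each key returned here is a target that the host would try to log in to more than once on activation.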

The bottom line:
This should be fixed either in this bug or in one of the other mentioned bugs (bug 2079903 or bug 2079896) to avoid this nasty regression in RHV 4.4 SP1.

We hope not many customers use this script or add duplicate connections manually or via a script, but it is still better to fix it.


[1] https://github.com/oVirt/ovirt-ansible-collection/pull/493

Comment 2 RHEL Program Management 2022-05-09 15:37:34 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 3 Arik 2022-05-09 15:42:44 UTC
We can propose a workaround (WA) for QE if needed.

Comment 4 Nir Soffer 2022-05-09 16:16:10 UTC
I compared current code with

- commit 1f8eee87972a0e5c (virt: CPU hotplugging
  support for dedicated CPUs) from Tue Dec 14 17:45:32 2021. This was the last
  commit before the changes related to bug 1787192.

- commit 3d0583b3a474bb41 (New release: 4.40.100.2), the last 4.4 commit.

In both older commits we did:

    for con in connections:
        add iscsi node
        login to node
        if login failed, remove node

After the change we do:

    for con in connections:
        add iscsi node

    run login concurrently with all connections
    
where login does:

    login to node
    if login failed, remove the node

So, in theory, removing the node on login failure is not new, and we
should see the same behavior in 4.4 when the engine has duplicate connections.

It is possible that logging in concurrently makes a "session exists"
failure more likely, but that explanation does not make sense.
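Both flows described above can be modeled with a hypothetical in-memory initiator; this is not vdsm's code, and `FakeInitiator`, `login_or_remove`, `connect_serial`, and `connect_concurrent` are illustrative names. In this model, a duplicate connection ends with no session in both the serial and the concurrent flow, which matches the theoretical argument. It does not capture whatever real-world timing let 4.4 escape the problem.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

SESSION_EXISTS = 15  # iscsiadm ISCSI_ERR_SESS_EXISTS


class LoginError(Exception):
    def __init__(self, code):
        super().__init__(f"login failed with error {code}")
        self.code = code


class FakeInitiator:
    """Hypothetical thread-safe in-memory stand-in for iscsiadm."""

    def __init__(self):
        self._lock = threading.Lock()
        self.nodes = set()
        self.sessions = set()

    def add_node(self, target):
        with self._lock:
            self.nodes.add(target)

    def login(self, target):
        with self._lock:
            if target in self.sessions:
                raise LoginError(SESSION_EXISTS)
            self.sessions.add(target)

    def remove_node(self, target):
        # Removing the node also logs out its session.
        with self._lock:
            self.sessions.discard(target)
            self.nodes.discard(target)


def login_or_remove(initiator, target):
    try:
        initiator.login(target)
    except LoginError:
        initiator.remove_node(target)  # Old handling: remove on any failure.


def connect_serial(initiator, targets):
    # Before the change: add node, log in, remove on failure, one at a time.
    for t in targets:
        initiator.add_node(t)
        login_or_remove(initiator, t)


def connect_concurrent(initiator, targets):
    # After the change: add all nodes first, then log in concurrently.
    for t in targets:
        initiator.add_node(t)
    with ThreadPoolExecutor(max_workers=max(1, len(targets))) as pool:
        list(pool.map(lambda t: login_or_remove(initiator, t), targets))


target = ("10.0.0.1", 3260, "iqn.2022-05.example:a")
for flow in (connect_serial, connect_concurrent):
    initiator = FakeInitiator()
    flow(initiator, [target, target])  # duplicate connection
    print(flow.__name__, len(initiator.sessions))
# → connect_serial 0
# → connect_concurrent 0
```

Whichever login wins, the losing one hits "session exists" and its cleanup removes the shared node, logging out the winner's session.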

Comment 5 Nir Soffer 2022-05-09 16:22:16 UTC
I backported the fix to 4.5.0 for testing in QE environment:
https://github.com/oVirt/vdsm/pull/168

Comment 6 Michal Skrivanek 2022-05-10 07:54:21 UTC
(In reply to Nir Soffer from comment #5)
> I backported the fix to 4.5.0 for testing in QE environment:
> https://github.com/oVirt/vdsm/pull/168

There's no need for a backport; we won't be releasing this in 4.5.0. Nightly builds are already 4.5.1, so if the fix/change is already there, all is good...

Comment 7 Arik 2022-05-10 07:58:59 UTC
(In reply to Michal Skrivanek from comment #6)
> (In reply to Nir Soffer from comment #5)
> > I backported the fix to 4.5.0 for testing in QE environment:
> > https://github.com/oVirt/vdsm/pull/168
> 
> there's no need for a backport, we won't be releasing this in 4.5.0. Nightly
> builds are 4.5.1 already, so if the fix/change is already there then all is
> good...

Sure, it's just for testing.

Comment 9 Shir Fishbain 2022-06-13 04:44:19 UTC
Verified

The issue fixed by "iscsi: Keep existing session on "session exists"" no longer occurs when deactivating/activating the host.
Moreover, we don't have multiple duplicate iSCSI connections in the engine DB.

vdsm-4.50.1.2-1.el8ev.x86_64
ovirt-engine-4.5.1.1-0.14.el8ev.noarch

Comment 10 Sandro Bonazzola 2022-06-23 05:54:58 UTC
This bugzilla is included in oVirt 4.5.1 release, published on June 22nd 2022.
Since the problem described in this bug report should be resolved in oVirt 4.5.1 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

