Description of problem:

The host goes non-operational for several minutes due to a "15 - session exists" error in vdsm after the host is deactivated and reactivated. In this case we do not have duplicate connections in the engine DB, but there are duplicate entries in the host ISCSI cache, which triggers the same error (15 - session exists) as seen in https://bugzilla.redhat.com/show_bug.cgi?id=2083271.

Version-Release number of selected component (if applicable):
ovirt-engine-4.5.1.1-0.14.el8ev
vdsm-4.50.1.2-1.el8ev

How reproducible:
100%

Steps to Reproduce:
1. On the host, have one connection (for example x.x.x.x) to an iSCSI SD that has two entries:
   Create a stale file (/var/lib/iscsi/nodes/xxxxxxx/default) for node.tpgt = 1, or add it manually to the ISCSI DB.
   In our case, ALL QE hosts have two tpgt entries in the ISCSI cache for portal x.x.x.x (node.tpgt = 1, node.tpgt = 1034).
2. If additional ISCSI connections exist that are not x.x.x.x, remove them via the RHV UI.
3. Move the host to maintenance -> Activate.

Actual results:
initiator reported error (15 - session exists)
The host goes non-operational for several minutes. After ~4 minutes the host is auto-recovered.

Expected results:
We shouldn't see the "15 - session exists" error, and the host shouldn't switch to non-operational after being deactivated and reactivated.

Additional info:
This is a regression, since in bug 2083271 we got to the non-operational state of the host due to duplicate connections in the DB - now the DB has just one connection.
(In reply to sshmulev from comment #0)
> Steps to Reproduce:
> 1. On the host, have one connection (for example x.x.x.x) to an iSCSI SD
>    that has two entries:
>
>    Create a stale file (/var/lib/iscsi/nodes/xxxxxxx/default) for
>    node.tpgt = 1, or add it manually to the ISCSI DB.

This is not a valid way to reproduce. Why do you touch the iscsi DB manually?
If you touch it manually you own it, and you need to fix it yourself.

>    In our case, ALL QE hosts have two tpgt entries in the ISCSI cache for
>    portal x.x.x.x (node.tpgt = 1, node.tpgt = 1034).

It is not clear what iscsi configuration you are using.

> 2. If additional ISCSI connections exist that are not x.x.x.x, remove them
>    via the RHV UI.

It is not clear what you mean by that.

> Expected results:
> We shouldn't see the "15 - session exists" error, and the host shouldn't
> switch to non-operational after being deactivated and reactivated.

If the host has an existing session and vdsm tries to reconnect, the error is expected.
It is not clear why the host moves to non-operational.

> Additional info:
> This is a regression, since in bug 2083271 we got to the non-operational
> state of the host due to duplicate connections in the DB - now the DB has
> just one connection.

This is a regression only if you show that the same flow worked in previous versions.

The provided reproduction instructions are not helpful. We need clear instructions so a developer can follow them and reproduce the failure.
Looking at the attached logs:

1. Trying to connect to iqn.1992-08.com.netapp:vserver-rhv-qe on 10.46.16.10:3260

2022-06-15 11:30:23,559+0300 INFO (jsonrpc/6) [vdsm.api] START connectStorageServer(domType=3, spUUID='15e0fff7-39ba-4e91-be6f-161f6077b57e', conList=[{'password': '********', 'port': '3260', 'iqn': 'iqn.1992-08.com.netapp:vserver-rhv-qe', 'connection': '10.46.16.10', 'ipv6_enabled': 'false', 'id': 'cb463ed5-9d3a-45f9-92f7-31d5cfcd1684', 'user': '', 'tpgt': '1034'}]) from=::ffff:10.35.232.214,51286, flow_id=55f66251, task_id=c0798740-d934-4e64-bd09-9a88585f575e (api:48)
2022-06-15 11:30:23,560+0300 INFO (jsonrpc/6) [storage.iscsi] Adding iscsi node for target 10.46.16.10:3260,1034 iqn.1992-08.com.netapp:vserver-rhv-qe iface default (iscsi:209)
2022-06-15 11:30:23,765+0300 DEBUG (jsonrpc/6) [storage.iscsi] iface.net_ifacename not provided, skipping rp filter setup (iscsi:605)
2022-06-15 11:30:23,765+0300 INFO (jsonrpc/6) [storage.storageServer] Log in to 1 targets using 1 workers (storageServer:614)
2022-06-15 11:30:23,768+0300 INFO (iscsi-login/0) [storage.iscsi] Logging in to iscsi target 10.46.16.10:3260,1034 iqn.1992-08.com.netapp:vserver-rhv-qe via iface default (iscsi:231)

2. iscsiadm login failed

2022-06-15 11:30:25,058+0300 ERROR (iscsi-login/0) [storage.storageServer] Could not login to target (storageServer:621)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/iscsiadm.py", line 322, in node_login
    "-p", portal, "-l"])
  File "/usr/lib/python3.6/site-packages/vdsm/storage/iscsiadm.py", line 125, in run_cmd
    out = commands.run(cmd, sudo=True)
  File "/usr/lib/python3.6/site-packages/vdsm/common/commands.py", line 102, in run
    raise cmdutils.Error(args, p.returncode, out, err)
vdsm.common.cmdutils.Error: Command ['/sbin/iscsiadm', '-m', 'node', '-T', 'iqn.1992-08.com.netapp:vserver-rhv-qe', '-I', 'default', '-p', '10.46.16.10:3260,1034', '-l'] failed with rc=15 out=b'Logging in to [iface: default, target: iqn.1992-08.com.netapp:vserver-rhv-qe, portal: 10.46.16.10,3260]\nLogging in to [iface: default, target: iqn.1992-08.com.netapp:vserver-rhv-qe, portal: 10.46.16.10,3260]\nLogin to [iface: default, target: iqn.1992-08.com.netapp:vserver-rhv-qe, portal: 10.46.16.10,3260] successful.\n' err=b'iscsiadm: Could not login to [iface: default, target: iqn.1992-08.com.netapp:vserver-rhv-qe, portal: 10.46.16.10,3260].\niscsiadm: initiator reported error (15 - session exists)\niscsiadm: Could not log into all portals\n'

3. Vdsm reports the result to the engine

2022-06-15 11:30:28,142+0300 INFO (jsonrpc/6) [vdsm.api] FINISH connectStorageServer return={'statuslist': [{'id': 'cb463ed5-9d3a-45f9-92f7-31d5cfcd1684', 'status': 465}]} from=::ffff:10.35.232.214,51286, flow_id=55f66251, task_id=c0798740-d934-4e64-bd09-9a88585f575e (api:54)

Maybe the engine moves the host to non-operational because the single connection requested failed. We can consider returning success for existing connections, since "session exists" means that we already have a connection.

We don't see removal of the existing session, so this is not related to bug 2083271.
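For what it's worth, a minimal sketch of the "return success for existing connections" idea above, assuming a plain subprocess wrapper around iscsiadm. The function name, the constant, and the error handling are illustrative only, not vdsm's actual iscsiadm.py API; the value 15 comes from the "initiator reported error (15 - session exists)" failure in the log above.

    import subprocess

    ISCSI_ERR_SESS_EXISTS = 15  # iscsiadm exit code for "session exists"

    def login_node(target, portal, iface="default"):
        # Hypothetical wrapper, not vdsm's iscsiadm.node_login().
        cmd = ["iscsiadm", "-m", "node", "-T", target,
               "-I", iface, "-p", portal, "-l"]
        p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if p.returncode == ISCSI_ERR_SESS_EXISTS:
            # The session is already logged in, so the desired state is
            # reached; treat this as success instead of failing the whole
            # connectStorageServer flow.
            return
        if p.returncode != 0:
            raise RuntimeError("iscsiadm login failed: rc=%d err=%r"
                               % (p.returncode, p.stderr))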
Removed the Regression keyword, since the bug shows no evidence of a regression. To add the Regression keyword back, please provide clear reproduction instructions and specify the previously tested version that behaved differently.
(In reply to Nir Soffer from comment #3)
> Maybe the engine moves the host to non-operational because the single
> connection requested failed. We can consider returning success for
> existing connections, since "session exists" means that we already
> have a connection.

+1

Yeah, we had this kind of philosophical question in the past - you try to remove an image and find that it doesn't exist, so did you succeed or not? :)
I think that also in that case we've said that if we've reached the desired state, the operation can treat that as a successful operation.
(In reply to Arik from comment #5)
> I think that also in that case we've said that if we've reached the desired
> state, the operation can treat that as a successful operation.

Let me rephrase this one: I think that also in the aforementioned case we've said that if we reach the desired state, we can treat that as a successful operation.
Another scenario, via RHV only, to reach this state manually:
Removing and recreating the same ISCSI connection results in an additional entry with node.tpgt = 1 (the default), while the original connection remains in the ISCSI cache.

It looks like there is no way to add a storage connection via the UI with an ISCSI tpgt != 1, which can result in a duplicate entry in the ISCSI cache for the same target IP, and from there to the seen issue if the host has only one ISCSI storage connection for the storage domain.

Additional scenario details:

1) Create a DC, cluster, one host (host1), and a storage backend connected with multiple ISCSI targets (IP1, IP2).

2) Add a first ISCSI storage domain A via the UI:
   - Discovery on IP1
   - Login to all targets (IP1, IP2)
   - Add one LUN

3) Check the host ISCSI cache on host1 via CLI (see the helper sketch after this comment for spotting such duplicates programmatically):

   host> iscsiadm -m node -o show | egrep "node.name|node.tpg|node.conn\[0\]\.address" ;iscsiadm -m session

   Output: the host ISCSI cache looks good, with node.tpgt != 1 and the expected active connection:

   node.name = X
   node.tpgt != 1
   node.conn[0].address = IP1

4) Move storage domain A to maintenance.

5) Remove the ISCSI storage connection IP1:
   Go to Storage -> Storage Domains -> storage domain A -> Connections -> Remove the IP1 connection.

6) Add the same ISCSI storage connection IP1 as seen before in this page (address, port=3260 always, target).
   * Important note: I did not see a way to set the target tpgt.

7) Check the host ISCSI cache via CLI:

   host> iscsiadm -m node -o show | egrep "node.name|node.tpg|node.conn\[0\]\.address" ;iscsiadm -m session

   Output: the host ISCSI cache does NOT look good and now has two entries for the same target IP1:

   1st entry:
   node.name = X
   node.tpgt = 1
   node.conn[0].address = IP1

   2nd entry:
   node.name = X
   node.tpgt != 1
   node.conn[0].address = IP1

* From this point the host cache is dirty, with 2 entries for the same target IP1, which causes the 'session exists' error in VDSM and the engine. BUT for the engine to think the host has no connectivity and move it to non-operational, the engine needs to think no ISCSI storage connections exist to this storage domain A.

8) Remove ISCSI storage connection IP2 so you are left with only one ISCSI storage connection.

9) Deactivate -> Activate the host.

Host1 goes non-operational for several minutes (~2-4) and then auto-recovers.
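For reference, a small helper sketch (not part of vdsm) for spotting such duplicate node DB entries programmatically. It assumes the usual one-record-per-line output of "iscsiadm -m node", i.e. "<address>:<port>,<tpgt> <target-iqn>":

    import collections
    import subprocess

    def duplicate_iscsi_nodes():
        """Return (address:port, target) pairs recorded with more than one
        tpgt in the local iscsi node DB."""
        out = subprocess.run(["iscsiadm", "-m", "node"],
                             stdout=subprocess.PIPE, check=True).stdout.decode()
        counts = collections.Counter()
        for line in out.splitlines():
            # Expected record format: "<address>:<port>,<tpgt> <target-iqn>"
            portal, target = line.split(None, 1)
            address_port, _tpgt = portal.rsplit(",", 1)
            counts[(address_port, target)] += 1
        return sorted(key for key, count in counts.items() if count > 1)

On a host in the state described in step 7, this should report the (IP1:3260, X) pair, since the same address and target appear once with node.tpgt = 1 and once with the original tpgt.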
(In reply to Avihai from comment #8)
> Another scenario, via RHV only, to reach this state manually:
> Removing and recreating the same ISCSI connection results in an additional
> entry with node.tpgt = 1 (the default), while the original connection
> remains in the ISCSI cache.
>
> It looks like there is no way to add a storage connection via the UI with
> an ISCSI tpgt != 1, which can result in a duplicate entry in the ISCSI
> cache

What is "iscsi cache"? Maybe you mean the iscsi db?
/var/lib/iscsi/... is the iscsi database - this is not a cache.

> for the same target IP, and from there to the seen issue if the host has
> only one ISCSI storage connection for the storage domain.
>
> Additional scenario details:
>
> 1) Create a DC, cluster, one host (host1), and a storage backend connected
>    with multiple ISCSI targets (IP1, IP2).
>
> 2) Add a first ISCSI storage domain A via the UI:
>    - Discovery on IP1
>    - Login to all targets (IP1, IP2)
>    - Add one LUN
>
> 3) Check the host ISCSI cache on host1 via CLI:
>
>    host> iscsiadm -m node -o show | egrep
>    "node.name|node.tpg|node.conn\[0\]\.address" ;iscsiadm -m session
>
>    Output: the host ISCSI cache looks good, with node.tpgt != 1 and the
>    expected active connection:
>
>    node.name = X
>    node.tpgt != 1
>    node.conn[0].address = IP1
>
> 4) Move storage domain A to maintenance.
>
> 5) Remove the ISCSI storage connection IP1:
>    Go to Storage -> Storage Domains -> storage domain A -> Connections ->
>    Remove the IP1 connection.
>
> 6) Add the same ISCSI storage connection IP1 as seen before in this page
>    (address, port=3260 always, target).
>    * Important note: I did not see a way to set the target tpgt.
>
> 7) Check the host ISCSI cache via CLI:
>
>    host> iscsiadm -m node -o show | egrep
>    "node.name|node.tpg|node.conn\[0\]\.address" ;iscsiadm -m session
>
>    Output: the host ISCSI cache does NOT look good and now has two entries
>    for the same target IP1:
>
>    1st entry:
>    node.name = X
>    node.tpgt = 1
>    node.conn[0].address = IP1
>
>    2nd entry:
>    node.name = X
>    node.tpgt != 1
>    node.conn[0].address = IP1

Are you sure that the host iscsi db was modified after adding a connection on the engine side while the storage domain was in maintenance?

> * From this point the host cache is dirty, with 2 entries for the same
>   target IP1, which causes the 'session exists' error in VDSM and the
>   engine. BUT for the engine to think the host has no connectivity and
>   move it to non-operational, the engine needs to think no ISCSI storage
>   connections exist to this storage domain A.
>
> 8) Remove ISCSI storage connection IP2 so you are left with only one ISCSI
>    storage connection.
>
> 9) Deactivate -> Activate the host.
>
> Host1 goes non-operational for several minutes (~2-4) and then
> auto-recovers.

OK, this looks like reproducible instructions.

The flow - removing and adding back an iscsi connection, and then removing the second connection and leaving the host with a single connection - looks like a very unlikely situation, so I'm lowering the severity to medium.
I would argue that the priority should be low/medium in that case and the severity should remain high (if it may lead to the host switching to the non-operational state), but that's less important. If it's not a regression then it's something for 4.5.2, and the approach described in comment 3 makes perfect sense to me.
(In reply to Nir Soffer from comment #9)
> What is "iscsi cache"? Maybe you mean the iscsi db?

Yes.

> Are you sure that the host iscsi db was modified after adding a connection
> on the engine side while the storage domain was in maintenance?

You are correct - the storage domain must be activated to see the changes in the host ISCSI node DB.

Corrected and shorter scenario B (thanks for that :)):

1) Create a DC, cluster, one host (host1), and a storage backend connected with multiple ISCSI targets (IP1, IP2).

2) Add a first ISCSI storage domain A via the UI:
   - Discovery on IP1
   - Login to all targets (IP1, IP2)
   - Add one LUN

3) Check the host ISCSI node DB on host1 via CLI:

   host> iscsiadm -m node -o show | egrep "node.name|node.tpg|node.conn\[0\]\.address" ;iscsiadm -m session

   Output: the host ISCSI node DB looks good, with node.tpgt != 1 and the expected active connections:

   node.name = X
   node.tpgt != 1
   node.conn[0].address = IP1

   node.name = X
   node.tpgt != 1
   node.conn[0].address = IP2

4) Move storage domain A to maintenance.

5) Remove the ISCSI storage connection IP1:
   Go to Storage -> Storage Domains -> storage domain A -> Connections -> Detach and Remove the IP1 connection.

6) Add the same ISCSI storage connection IP1 as seen before in this page (address, port=3260 always, target):
   Go to Storage -> Storage Domains -> storage domain A -> Connections -> Add the IP1 connection.
   * Important note: I did not see a way to set the target tpgt, so tpgt=1 always(!)

7) Remove ISCSI storage connection IP2 so you are left with only one ISCSI storage connection:
   Go to Storage -> Storage Domains -> storage domain A -> Connections -> Detach and Remove the IP2 connection.

8) Activate the host.

The host goes non-operational for several minutes (2-4 min) and then recovers.
Future host deactivate-activate cycles will again cause non-operational for several minutes (2-4 min) (!)

> The flow - removing and adding back an iscsi connection, and then removing
> the second connection and leaving the host with a single connection - looks
> like a very unlikely situation, so I'm lowering the severity to medium.

I disagree, as this will also occur when an admin/customer wants to change a faulty target IP or add a new ISCSI target during maintenance, which is very probable and common.
Removing a faulty ISCSI connection IP, fixing the backend, and re-adding it is a common operation, and sometimes customers will be left with one working storage connection. They will expect RHV to support that, and not to disable their hosts for minutes each time they deactivate and activate a host, which is a very severe outcome.

Moving back to high severity.
I could not reproduce the issue in my environment. This is what I tested:

1. Move the host to maintenance
2. Move the storage domain to maintenance
3. Edit the storage domain connections:
   3.1 Remove connection 192.168.122.34:3260 iqn.2003-01.org.alpine.01
   3.2 Add connection 192.168.122.34:3260 iqn.2003-01.org.alpine.01
   3.3 Remove connection 192.168.122.35:3260 iqn.2003-01.org.alpine.01
4. Activate the storage domain
5. On the host, connect manually to 192.168.122.35:3260 iqn.2003-01.org.alpine.01

# iscsiadm -m session
tcp: [10] 192.168.122.34:3260,1 iqn.2003-01.org.alpine.01 (non-flash)

6. Activate the host

I got no "session exists" errors, likely because I don't have unexpected nodes on my host.

# iscsiadm -m node -P1
Target: iqn.2003-01.org.alpine.01
        Portal: 192.168.122.34:3260,1
                Iface Name: default
        Portal: 192.168.122.35:3260,1
                Iface Name: default
Target: iqn.2003-01.org.alpine.02
        Portal: 192.168.122.34:3260,1
                Iface Name: default
        Portal: 192.168.122.35:3260,1
                Iface Name: default

# iscsiadm -m session -P1
Target: iqn.2003-01.org.alpine.01 (non-flash)
        Current Portal: 192.168.122.34:3260,1
        Persistent Portal: 192.168.122.34:3260,1
                **********
                Interface:
                **********
                Iface Name: default
                Iface Transport: tcp
                Iface Initiatorname: iqn.1994-05.com.redhat:71982f294f4
                Iface IPaddress: 192.168.122.23
                Iface HWaddress: default
                Iface Netdev: default
                SID: 13
                iSCSI Connection State: LOGGED IN
                iSCSI Session State: LOGGED_IN
                Internal iscsid Session State: NO CHANGE
Target: iqn.2003-01.org.alpine.02 (non-flash)
        Current Portal: 192.168.122.34:3260,1
        Persistent Portal: 192.168.122.34:3260,1
                **********
                Interface:
                **********
                Iface Name: default
                Iface Transport: tcp
                Iface Initiatorname: iqn.1994-05.com.redhat:71982f294f4
                Iface IPaddress: 192.168.122.23
                Iface HWaddress: default
                Iface Netdev: default
                SID: 14
                iSCSI Connection State: LOGGED IN
                iSCSI Session State: LOGGED_IN
                Internal iscsid Session State: NO CHANGE
        Current Portal: 192.168.122.35:3260,1
        Persistent Portal: 192.168.122.35:3260,1
                **********
                Interface:
                **********
                Iface Name: default
                Iface Transport: tcp
                Iface Initiatorname: iqn.1994-05.com.redhat:71982f294f4
                Iface IPaddress: 192.168.122.23
                Iface HWaddress: default
                Iface Netdev: default
                SID: 15
                iSCSI Connection State: LOGGED IN
                iSCSI Session State: LOGGED_IN
                Internal iscsid Session State: NO CHANGE

Avihai, can you explain how you add the extra node with the unexpected tpgt on your system?
I could reproduce the error with manual iscsiadm commands.

Host initial state:

# iscsiadm -m node -P1
Target: iqn.2003-01.org.alpine.01
        Portal: 192.168.122.34:3260,1
                Iface Name: default
Target: iqn.2003-01.org.alpine.02
        Portal: 192.168.122.34:3260,1
                Iface Name: default
        Portal: 192.168.122.35:3260,1
                Iface Name: default

# iscsiadm -m session
iscsiadm: No active sessions.

Adding an unexpected portal:

# iscsiadm -m node -T iqn.2003-01.org.alpine.01 -I default -p 192.168.122.34:3260,2 --op=new
New iSCSI node [tcp:[hw=,ip=,net_if=,iscsi_if=default] 192.168.122.34,3260,2 iqn.2003-01.org.alpine.01] added

# iscsiadm -m node -P1
Target: iqn.2003-01.org.alpine.01
        Portal: 192.168.122.34:3260,1
                Iface Name: default
        Portal: 192.168.122.34:3260,2
                Iface Name: default
Target: iqn.2003-01.org.alpine.02
        Portal: 192.168.122.34:3260,1
                Iface Name: default
        Portal: 192.168.122.35:3260,1
                Iface Name: default

Trying to log in:

# iscsiadm -m node -T iqn.2003-01.org.alpine.01 -p 192.168.122.34:3260,2 -l
Logging in to [iface: default, target: iqn.2003-01.org.alpine.01, portal: 192.168.122.34,3260]
Logging in to [iface: default, target: iqn.2003-01.org.alpine.01, portal: 192.168.122.34,3260]
Login to [iface: default, target: iqn.2003-01.org.alpine.01, portal: 192.168.122.34,3260] successful.
iscsiadm: Could not login to [iface: default, target: iqn.2003-01.org.alpine.01, portal: 192.168.122.34,3260].
iscsiadm: initiator reported error (15 - session exists)
iscsiadm: Could not log into all portals

# iscsiadm -m session
tcp: [20] 192.168.122.34:3260,1 iqn.2003-01.org.alpine.01 (non-flash)

So the issue is an invalid configuration - 2 iscsi nodes with the same address, port, iqn, and iface. If this configuration is created by the engine, please open an engine bug.

On the vdsm side, while this configuration is invalid, we do have one connection on the right interface, so keeping it and warning about the invalid configuration can help users.
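Extending the sketch after comment 3, a possible shape for "keep the existing connection and warn": treat rc=15 as success only when an active session to the requested target really exists, and log a warning about the duplicate node records. This is only an illustration, not vdsm's actual code, and it assumes the "iscsiadm -m session" record format shown above ("tcp: [SID] <address>:<port>,<tpgt> <target-iqn> (non-flash)"):

    import logging
    import subprocess

    ISCSI_ERR_SESS_EXISTS = 15  # iscsiadm exit code for "session exists"
    log = logging.getLogger(__name__)

    def has_session(target, address_port):
        """Return True if an active session to target on address:port exists."""
        p = subprocess.run(["iscsiadm", "-m", "session"],
                           stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        for line in p.stdout.decode().splitlines():
            # Record format: "tcp: [SID] <address>:<port>,<tpgt> <target-iqn> (non-flash)"
            fields = line.split()
            if (len(fields) >= 4 and fields[3] == target
                    and fields[2].rsplit(",", 1)[0] == address_port):
                return True
        return False

    def login(target, portal, iface="default"):
        cmd = ["iscsiadm", "-m", "node", "-T", target,
               "-I", iface, "-p", portal, "-l"]
        p = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        if p.returncode == 0:
            return
        if (p.returncode == ISCSI_ERR_SESS_EXISTS
                and has_session(target, portal.rsplit(",", 1)[0])):
            # Desired state reached; the duplicate node record is the only
            # problem, so warn instead of failing connectStorageServer.
            log.warning("Session to %s %s already exists; check for "
                        "duplicate node db entries", target, portal)
            return
        raise RuntimeError("iscsiadm login failed: rc=%d err=%r"
                           % (p.returncode, p.stderr))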
Avihai, can you test the upstream build in the QE environment? https://github.com/oVirt/vdsm/actions/runs/2530243581
This is a user error. We never asked anyone to connect to the same target up front. The solution is simple - do not connect to anything prior to installation/adding the SD.
Fixing it to ignore existing sessions is a good failsafe, though.
I opened a new bug about the NonOperational status with a different scenario as you can see in: https://bugzilla.redhat.com/show_bug.cgi?id=2099640
(In reply to Kobi Hakimi from comment #17)
> I opened a new bug about the NonOperational status with a different scenario
> as you can see in:
> https://bugzilla.redhat.com/show_bug.cgi?id=2099640

Kobi, can you test it with https://github.com/oVirt/vdsm/actions/runs/2530243581?
If it's handled by the fix for this bug, then we need to understand whether that duplication is a new thing, or what its implications are, in order to see if it is worth looking into.
Arik, after I got the explanation from Nir on how to do this update, I did the following steps to try the new vdsm build:

1. Open the engine and reproduce the issue on the first host by moving the host to Maintenance and then Activate.
2. The host becomes NonOperational, and after ~5 minutes the host is restored and becomes UP.
3. SSH to the same host and check that we indeed have the extra tpgt=1 portal:
   iscsiadm -m node -P1
4. Download the new vdsm build rpm-el8 (https://github.com/oVirt/vdsm/actions/runs/2530243581).
5. Extract the zip file (unzip rpm-el8.zip -d rpm-el8) and cd to rpm-el8.
6. Run the update: dnf upgrade *.rpm

After this, moving the first host to Maintenance -> Activate leaves the host UP and running!!
Looks great!!

Notes:
- Tested on HE and non-HE environments.
- The extra tpgt=1 portal is still there.
(In reply to Michal Skrivanek from comment #15)
> This is a user error. We never asked anyone to connect to the same target up
> front. The solution is simple - do not connect to anything prior to
> installation/adding the SD.
> Fixing it to ignore existing sessions is a good failsafe, though.

We found two valid scenarios so far to reach this state which are not user errors but RHV issues leading to non-operational -> moving back to high severity.
Even if the user does not connect to anything prior to installation/adding the SD, they will hit the same issue.

Valid scenarios:
Scenario 1: Adding a new storage connection via webadmin - described in bug 2099997
Scenario 2: Adding a new storage domain via the ansible ovirt_storage_domain module - described in bug 2099640
Verified.

Versions:
vdsm-4.50.2-1.el8ev.x86_64
ovirt-engine-4.5.1.3-654.af4ac851f145.33.el8ev.noarch

The "session exists" error still shows up due to the extra tpgt=1 portal, but this time the host doesn't go to non-operational mode.
This bugzilla is included in oVirt 4.5.2 release, published on August 10th 2022. Since the problem described in this bug report should be resolved in oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.