Bug 1566083

Summary: NetApp - Duplicate iSCSI storage connections causing (probably) poor cleanup of iSCSI connections after host goes to maintenance
Product: [oVirt] ovirt-engine Reporter: Avihai <aefrat>
Component: BLL.Storage Assignee: Nobody <nobody>
Status: CLOSED DEFERRED QA Contact: Avihai <aefrat>
Severity: high Docs Contact:
Priority: unspecified    
Version: 4.2.2.6 CC: aefrat, bugs, eshenitz
Target Milestone: --- Keywords: Automation
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-09-29 11:33:06 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
engine , vdsm logs, print screen of storage domain window none

Description Avihai 2018-04-11 13:29:03 UTC
Created attachment 1420326 [details]
engine, vdsm logs, print screen of storage domain window

Description of problem:
This occurs on the NetApp storage server
netapp-cluster3.mgmt.lab.eng.tlv2.redhat.com - it DOES NOT OCCUR on XtremIO storage.

One & only host(only host on the DC/cluster/SD).
First, after discovery, login and storage domain creation using host1 we see duplicate ISCSI connections so before host goes to maintenance 4 connections appear which looks duplicates of each other.

Also note that 10.46.16.9 is the ip for FQDN "mantis-iscsi-lif2.lab.eng.tlv2.redhat.com" .

So actually all 4 connections are the same storage connections:

iscsiadm -m session before the host goes to maintenance:
tcp: [16] 10.46.16.9:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)
tcp: [17] mantis-iscsi-lif2.lab.eng.tlv2.redhat.com:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)
tcp: [18] mantis-iscsi-lif2.lab.eng.tlv2.redhat.com:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)
tcp: [19] 10.46.16.9:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)
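
Because the same target is reachable through both the raw IP and the FQDN, the four sessions collapse to a single logical connection once the portal hostname is resolved. A minimal diagnostic sketch (my own helper, not vdsm code; the parsing regex and grouping key are assumptions) that shows this on the host:

#!/usr/bin/env python3
# Diagnostic sketch only: group the iscsiadm sessions above by
# (resolved portal IP, port, tpgt, iqn) to show they are one target.
import re
import socket
import subprocess
from collections import defaultdict

SESSION_RE = re.compile(r'tcp: \[(\d+)\] (\S+):(\d+),(\d+) (\S+)')

def sessions():
    # iscsiadm exits non-zero when there are no sessions, so do not raise.
    out = subprocess.run(['iscsiadm', '-m', 'session'],
                         stdout=subprocess.PIPE, universal_newlines=True).stdout
    for line in out.splitlines():
        m = SESSION_RE.match(line.strip())
        if m:
            sid, portal, port, tpgt, iqn = m.groups()
            yield int(sid), portal, port, tpgt, iqn

def resolve(portal):
    # mantis-iscsi-lif2.lab.eng.tlv2.redhat.com resolves to 10.46.16.9,
    # so the FQDN-based and IP-based sessions land in the same group.
    try:
        return socket.gethostbyname(portal)
    except socket.gaierror:
        return portal

groups = defaultdict(list)
for sid, portal, port, tpgt, iqn in sessions():
    groups[(resolve(portal), port, tpgt, iqn)].append(sid)

for target, sids in groups.items():
    print(target, '-> sessions', sids)   # here: one target, sessions [16, 17, 18, 19]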

Then, after moving a host with iSCSI storage connections to maintenance, not all connections are cleared as expected.

After the host goes to maintenance, 2 connections still remain:
[root@infra4-nested-3 ~]# iscsiadm -m session
tcp: [17] mantis-iscsi-lif2.lab.eng.tlv2.redhat.com:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)
tcp: [18] mantis-iscsi-lif2.lab.eng.tlv2.redhat.com:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)

In the VDSM log we can see that only 2 connections were disconnected:
conList=[
{'id': '5a3be19c-1fb9-4d49-9962-812061db97f1', 'connection': 'mantis-iscsi-lif2.lab.eng.tlv2.redhat.com', 'iqn': 'iqn.1992-08.com.netapp:vserver-rhv-qe', 'user': u'', 'tpgt': '1033', 'password': '********', 'port': '3260'},
{'id': 'b71e88ab-8e66-48d2-bffa-ae69d58bf7cc', 'connection': '10.46.16.9', 'iqn': 'iqn.1992-08.com.netapp:vserver-rhv-qe', 'user': u'', 'tpgt': '1033', 'password': '********', 'port': '3260'}
]


2018-04-11 15:46:53,149+0300 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call StoragePool.disconnect succeeded in 2.94 seconds (__init__:573)
2018-04-11 15:46:53,177+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='63d6df2a-e4bd-4f95-b06d-ae53c2bec80e') moving from state init -> state preparing (task:602)
2018-04-11 15:46:53,178+0300 INFO  (jsonrpc/0) [vdsm.api] START disconnectStorageServer(domType=3, spUUID='3a173838-741d-440d-9047-ae74bbad5307', conList=[{'id': '5a3be19c-1fb9-4d49-9962-812061db97f1', 'connecti
on': 'mantis-iscsi-lif2.lab.eng.tlv2.redhat.com', 'iqn': 'iqn.1992-08.com.netapp:vserver-rhv-qe', 'user': u'', 'tpgt': '1033', 'password': '********', 'port': '3260'}, {'id': 'b71e88ab-8e66-48d2-bffa-ae69d58bf7c
c', 'connection': '10.46.16.9', 'iqn': 'iqn.1992-08.com.netapp:vserver-rhv-qe', 'user': u'', 'tpgt': '1033', 'password': '********', 'port': '3260'}], options=None) from=::ffff:10.46.16.68,56952, flow_id=3122df0
a, task_id=63d6df2a-e4bd-4f95-b06d-ae53c2bec80e (api:46)
2018-04-11 15:46:53,179+0300 INFO  (jsonrpc/0) [storage.Server.ISCSI] disconnecting (storageServer:551)
2018-04-11 15:46:53,215+0300 DEBUG (jsonrpc/0) [storage.Misc.excCmd] /usr/bin/taskset --cpu-list 0-0 /usr/bin/sudo -n /sbin/iscsiadm -m session -r 16 -u (cwd None) (commands:65)
2018-04-11 15:46:53,364+0300 DEBUG (jsonrpc/0) [storage.Misc.excCmd] SUCCESS: <err> = ''; <rc> = 0 (commands:86)
2018-04-11 15:46:53,365+0300 DEBUG (jsonrpc/0) [storage.ISCSI] iface.net_ifacename not provided, skipping rp filter setup (iscsi:572)
2018-04-11 15:46:53,365+0300 INFO  (jsonrpc/0) [storage.Server.ISCSI] disconnecting (storageServer:551)


2018-04-11 15:46:54,523+0300 INFO  (jsonrpc/0) [vdsm.api] FINISH disconnectStorageServer return={'statuslist': [{'status': 0, 'id': '5a3be19c-1fb9-4d49-9962-812061db97f1'}, {'status': 0, 'id': 'b71e88ab-8e66-48d2-bffa-ae69d58bf7cc'}]} from=::ffff:10.46.16.68,56952, flow_id=3122df0a, task_id=63d6df2a-e4bd-4f95-b06d-ae53c2bec80e (api:52)
2018-04-11 15:46:54,524+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='63d6df2a-e4bd-4f95-b06d-ae53c2bec80e') finished: {'statuslist': [{'status': 0, 'id': '5a3be19c-1fb9-4d49-9962-812061db97f1'}, {'status': 0, 'id': 'b71e88ab-8e66-48d2-bffa-ae69d58bf7cc'}]} (task:1201)
2018-04-11 15:46:54,524+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='63d6df2a-e4bd-4f95-b06d-ae53c2bec80e') moving from state preparing -> state finished (task:602)
2018-04-11 15:46:54,524+0300 DEBUG (jsonrpc/0) [storage.ResourceManager.Owner] Owner.releaseAll requests {} resources {} (resourceManager:910)
2018-04-11 15:46:54,525+0300 DEBUG (jsonrpc/0) [storage.ResourceManager.Owner] Owner.cancelAll requests {} (resourceManager:947)
2018-04-11 15:46:54,525+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='63d6df2a-e4bd-4f95-b06d-ae53c2bec80e') ref 0 aborting False (task:1002)
2018-04-11 15:46:54,526+0300 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call StoragePool.disconnectStorageServer succeeded in 1.35 seconds (__init__:573)
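
So the engine passed two unique connection records (one for the FQDN, one for the IP) and both were reported as disconnected, yet two of the four sessions survived. My guess (not confirmed from the vdsm code) is that the cleanup logs out of one session per connection record, leaving the duplicate session per portal behind. As a host-side workaround for QA, the sketch below (my own helper, not part of vdsm; the IQN is taken from this report) logs out of every remaining session for the target, using the same per-session logout seen in the log above:

#!/usr/bin/env python3
# Workaround sketch only (not vdsm code): log out of every remaining
# session for the NetApp target, however many duplicates exist per portal.
import re
import subprocess

TARGET_IQN = 'iqn.1992-08.com.netapp:vserver-rhv-qe'   # from this report
SESSION_RE = re.compile(r'tcp: \[(\d+)\] \S+ (\S+)')

def leftover_sessions(iqn):
    proc = subprocess.run(['iscsiadm', '-m', 'session'],
                          stdout=subprocess.PIPE, universal_newlines=True)
    if proc.returncode != 0:
        return []          # iscsiadm exits non-zero when no sessions exist
    return [m.group(1)
            for m in (SESSION_RE.match(l.strip()) for l in proc.stdout.splitlines())
            if m and m.group(2) == iqn]

for sid in leftover_sessions(TARGET_IQN):
    # Same command vdsm runs above ('iscsiadm -m session -r <sid> -u'),
    # applied to every surviving session rather than one per connection record.
    subprocess.check_call(['iscsiadm', '-m', 'session', '-r', sid, '-u'])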


Engine log:
The relevant time window shows no issues:
2018-04-11 15:46:48,857+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-9) [0d9cf46d-5fb0-4696-8219-245ea5139f74] EVENT_ID: USER_VDS_MAINTENANCE_WITHOUT_REASON(620), Host host_mixed_3 was switched to Maintenance mode by admin@internal-authz.
2018-04-11 15:46:50,190+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [] Updated host status from 'Preparing for Maintenance' to 'Maintenance' in database, host 'host_mixed_3'(676884d5-d403-4280-a7c8-fc295f5f768b)
2018-04-11 15:46:50,199+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-198334) [] Clearing cache of pool: '3a173838-741d-440d-9047-ae74bbad5307' for problematic entities of VDS: 'host_mixed_3'.
2018-04-11 15:46:50,199+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-198334) [] Removing vds '[676884d5-d403-4280-a7c8-fc295f5f768b]' from the domain in maintenance cache
2018-04-11 15:46:50,199+03 INFO  [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-198334) [] Removing host(s) '[676884d5-d403-4280-a7c8-fc295f5f768b]' from hosts unseen domain report cache
2018-04-11 15:46:50,201+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [] START, DisconnectStoragePoolVDSCommand(HostName = host_mixed_3, DisconnectStoragePoolVDSCommandParameters:{hostId='676884d5-d403-4280-a7c8-fc295f5f768b', storagePoolId='3a173838-741d-440d-9047-ae74bbad5307', vds_spm_id='1'}), log id: 77c65443
2018-04-11 15:46:51,239+03 INFO  [org.ovirt.engine.core.vdsbroker.storage.StoragePoolDomainHelper] (EE-ManagedThreadFactory-engineScheduled-Thread-2) [] Storage Pool '3a173838-741d-440d-9047-ae74bbad5307' - Updating Storage Domain 'd290a93f-42ce-4988-8aaf-8d1d1b30775c' status from 'Active' to 'Unknown', reason: no reporting hosts
2018-04-11 15:46:51,244+03 INFO  [org.ovirt.engine.core.bll.storage.pool.SetStoragePoolStatusCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-2) [33d26b39] Running command: SetStoragePoolStatusCommand internal: true. Entities affected :  ID: 3a173838-741d-440d-9047-ae74bbad5307 Type: StoragePool
2018-04-11 15:46:51,249+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-2) [33d26b39] EVENT_ID: SYSTEM_CHANGE_STORAGE_POOL_STATUS_PROBLEMATIC(980), Invalid status on Data Center dc1. Setting status to Non Responsive.
2018-04-11 15:46:53,148+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [] FINISH, DisconnectStoragePoolVDSCommand, log id: 77c65443
2018-04-11 15:46:53,151+03 INFO  [org.ovirt.engine.core.bll.storage.pool.DisconnectHostFromStoragePoolServersCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [3122df0a] Running command: DisconnectHostFromStoragePoolServersCommand internal: true. Entities affected :  ID: 3a173838-741d-440d-9047-ae74bbad5307 Type: StoragePool

At discovery time we see some ERRORs in the engine log:

2018-04-11 15:37:28,194+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-2) [7401dcb3-460d-4a60-840b-d1c63543981b] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command DiscoverSendTargetsVDS failed: Failed discovery of iSCSI targets: "portal=infra4-nested-3.lab.eng.tlv2.redhat.com:3260, err=(21, [], ['iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: connection login retries (reopen_max) 5 exceeded', 'iscsiadm: No portals found'])"

2018-04-11 15:37:28,194+03 ERROR [org.ovirt.engine.core.bll.storage.connection.DiscoverSendTargetsQuery] (default task-2) [7401dcb3-460d-4a60-840b-d1c63543981b] Query 'DiscoverSendTargetsQuery' failed: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to DiscoverSendTargetsVDS, error = Failed discovery of iSCSI targets: "portal=infra4-nested-3.lab.eng.tlv2.redhat.com:3260, err=(21, [], ['iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: connection login retries (reopen_max) 5 exceeded', 'iscsiadm: No portals found'])" (Failed with error iSCSIDiscoveryError and code 475)
2018-04-11 15:37:28,194+03 ERROR [org.ovirt.engine.core.bll.storage.connection.DiscoverSendTargetsQuery] (default task-2) [7401dcb3-460d-4a60-840b-d1c63543981b] Exception: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed in vdscommand to DiscoverSendTargetsVDS, error = Failed discovery of iSCSI targets: "portal=infra4-nested-3.lab.eng.tlv2.redhat.com:3260, err=(21, [], ['iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: cannot make connection to 10.46.16.97: Connection refused', 'iscsiadm: connection login retries (reopen_max) 5 exceeded', 'iscsiadm: No portals found'])" (Failed with error iSCSIDiscoveryError and code 475)


Version-Release number of selected component (if applicable):
engine 4.2.2.6-0.1
vdsm-4.20.23-1

How reproducible:
100%

Steps to Reproduce:
1) Create a new DC & cluster c1.
2) Move the host to the new cluster (c1) & activate it.
3) On the host, check that no other iSCSI connections exist via "iscsiadm -m session" and disconnect from all existing connections.
4) Create a new iSCSI storage domain: discover the NetApp storage targets via IP "10.46.16.9" & add a LUN (see print screen attached).
5) Check the iSCSI connections:

Before the host goes to maintenance, 4 sessions appear that look like duplicates of each other. Also note that 10.46.16.9 is the IP of the FQDN "mantis-iscsi-lif2.lab.eng.tlv2.redhat.com",
so all 4 sessions are actually the same storage connection.

iscsiadm -m session:
tcp: [16] 10.46.16.9:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)
tcp: [17] mantis-iscsi-lif2.lab.eng.tlv2.redhat.com:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)
tcp: [18] mantis-iscsi-lif2.lab.eng.tlv2.redhat.com:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)
tcp: [19] 10.46.16.9:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)

6) Move host to maintenance.



Actual results:
After the host goes to maintenance, 2 connections still remain:
[root@infra4-nested-3 ~]# iscsiadm -m session
tcp: [17] mantis-iscsi-lif2.lab.eng.tlv2.redhat.com:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)
tcp: [18] mantis-iscsi-lif2.lab.eng.tlv2.redhat.com:3260,1033 iqn.1992-08.com.netapp:vserver-rhv-qe (non-flash)

Expected results:
After the host goes to maintenance, no storage connections should remain.
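
A check of this expectation that we can run in automation (our own helper, not an oVirt API); with no sessions left, iscsiadm prints nothing and exits non-zero (the generic "not found" code, 21 in the err tuple shown in the discovery error above):

#!/usr/bin/env python3
# Verification sketch only: assert that no iSCSI sessions remain on the
# host once it is in maintenance.
import subprocess

proc = subprocess.run(['iscsiadm', '-m', 'session'],
                      stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                      universal_newlines=True)
assert proc.returncode != 0 and not proc.stdout.strip(), (
    'leftover iSCSI sessions after maintenance:\n' + proc.stdout)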


Additional info:
This occurs on the NetApp storage server
netapp-cluster3.mgmt.lab.eng.tlv2.redhat.com - it DOES NOT OCCUR on XtremIO storage.

Comment 1 Avihai 2018-04-11 13:34:58 UTC
This is not a regression bug, but it is related to the infra team's move to a new NetApp server (on which we run our TierX tests).

This is an automation blocker as it blocks many tests in the storage connections test plan.

Comment 2 Allon Mureinik 2018-04-12 12:19:47 UTC
We had the same issue since 3.1. I don't understand how this can be an automation blocker now.

Comment 3 Avihai 2018-04-12 13:24:13 UTC
(In reply to Allon Mureinik from comment #2)
> We had the same issue since 3.1.
I do not claim that this issue is a regression, but it is an automation blocker.

> I don't understand how this can be an automation blocker now.

We worked with a different NetApp iSCSI storage provider until the recent lab move and we did not see this issue (why? no idea).

Now we see the issue and it blocks automation tests, hence it is an automation blocker.

Comment 5 Sandro Bonazzola 2019-01-28 09:34:41 UTC
This bug has not been marked as a blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 7 Eyal Shenitzky 2021-08-23 12:29:19 UTC
This bug/RFE is more than 2 years old, has not received enough attention so far, and is now flagged as pending close.
Please review whether it is still relevant and provide additional details/justification/patches if you believe it should get more attention for the next oVirt release.

Comment 8 Michal Skrivanek 2021-09-29 11:33:06 UTC
This bug hasn't received any attention in a long time, and it's not planned for the foreseeable future. The oVirt development team has no plans to work on it.
Please feel free to reopen it if you have a plan for how to contribute this feature/bug fix.