Bug 2099640

Summary: Adding an iSCSI storage domain with the Ansible module ovirt_storage_domain / webadmin adds an extra iSCSI portal <IP:3260,1>
Product: [oVirt] ovirt-engine
Reporter: Kobi Hakimi <khakimi>
Component: BLL.Storage
Assignee: Mark Kemel <mkemel>
Status: CLOSED WONTFIX
QA Contact: Shir Fishbain <sfishbai>
Severity: medium
Priority: unspecified
Version: 4.5.1
CC: aefrat, ahadas, bugs, mnecas, mperina, nsoffer, sfishbai
Target Milestone: ---
Flags: pm-rhel: ovirt-4.5?
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2022-09-23 09:14:40 UTC
Type: Bug
oVirt Team: Storage

Description Kobi Hakimi 2022-06-21 11:48:19 UTC
Description of problem:
Adding an iSCSI storage domain with the Ansible module ovirt_storage_domain adds an extra portal.

Version-Release number of selected component (if applicable):
vdsm-4.50.1.3-1.el8ev.x86_64
ovirt-ansible-collection-2.1.0-1.el8ev.noarch
ovirt-engine-4.5.1.2-0.11.el8ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Remove all iscsi storages from the engine
2. Run on the first host the following commands: 
   [root@first_host ~]# iscsiadm -m session --logout
   iscsiadm: No matching sessions found

   [root@first_host ~]# iscsiadm -m node -o delete
   [root@first_host ~]# 
3. Make sure we have a clean env with the following commands:
   [root@first_host ~]# iscsiadm -m session -P1
   iscsiadm: No active sessions.

   [root@first_host ~]# iscsiadm -m node -P1
   iscsiadm: No records found
4. Run an Ansible YAML file that adds an iSCSI storage domain with the oVirt module ovirt_storage_domain (a minimal sketch follows)
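
   For reference, a minimal, hypothetical sketch of the kind of playbook used in
   step 4. It assumes the ovirt.ovirt collection is installed; the engine URL,
   credentials, data center name, and LUN ID are placeholders, not values taken
   from this report:

   ---
   - hosts: localhost
     connection: local
     gather_facts: false
     tasks:
       - name: Log in to the engine (placeholder credentials)
         ovirt.ovirt.ovirt_auth:
           url: https://engine.example.com/ovirt-engine/api
           username: admin@internal
           password: "{{ engine_password }}"

       - name: Add an iSCSI storage domain (placeholder LUN ID and data center)
         ovirt.ovirt.ovirt_storage_domain:
           auth: "{{ ovirt_auth }}"
           name: iscsi_sd
           host: first_host
           data_center: my_dc
           domain_function: data
           iscsi:
             address: 10.0.x.y
             port: 3260
             target: iqn.example.netapp:example-server
             lun_id: 3600a09803830447a4f244c4657595030

       - name: Log out from the engine
         ovirt.ovirt.ovirt_auth:
           state: absent
           ovirt_auth: "{{ ovirt_auth }}"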


Actual results:
The iSCSI storage domain is added as expected, but when we run the command:
[root@first_host ~]# iscsiadm -m node -P1
Target: iqn.example.netapp:example-server
	Portal: 10.0.x.y:3260,1
		Iface Name: default
	Portal: 10.0.x.y:3260,1034
		Iface Name: default
[root@caracal01 ~]# 
we can see an extra portal,
Portal: 10.0.x.y:3260,1
which was added to the node list.
As a side effect of this, when we move the host to maintenance and then activate it,
the host becomes NonOperational for ~5 minutes until auto-recovery.

Expected results:
1. The extra portal should not be added:
	Portal: 10.0.x.y:3260,1
2. Moving the host to maintenance and then activating it should work without the host becoming NonOperational

Additional info:
Ansible logs and the Ansible script are attached.

Comment 3 Kobi Hakimi 2022-06-21 11:55:20 UTC
By the way, when I added the same LUN as an iSCSI storage domain manually, the issue did not reproduce.

Comment 4 Avihai 2022-06-21 12:48:22 UTC
This extra portal, "Portal: 10.0.x.y:3260,1", is an additional scenario/root cause of the host moving to the non-operational state (after deactivating and activating the host) described in bug 2097614.

We saw this issue (extra portal) in ALL QE ENVs on the first host after we used the ovirt_storage_domain playbook to build/deploy QE ENVs.
Due to the effect this bug may have on customers using this Ansible script, I'm setting severity to High.

To clarify:
This bug should take care of the Ansible side that triggers the extra portal addition, and bug 2097614 should take care of the engine side that moves the host to the non-operational state.

Comment 5 Kobi Hakimi 2022-06-21 14:02:47 UTC
In our case, we build our iSCSI storage domains with the infra role, and at this point we can see that the extra portal is added in:
https://github.com/oVirt/ovirt-ansible-collection/blame/master/roles/infra/roles/storages/tasks/main.yml#L27
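
For context, the infra role builds its storage domains from a storages variable and passes each entry to ovirt_storage_domain in the task linked above. The following is only a rough illustration; the exact variable layout is an assumption on my part, and the domain name and LUN ID are placeholders:

storages:
  iscsi_sd:
    iscsi:
      address: 10.0.x.y
      port: 3260
      target: iqn.example.netapp:example-server
      lun_id: 3600a09803830447a4f244c4657595030

Note that, as far as I can tell, the module's iscsi options do not expose a portal/tpgt field.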

Comment 8 Arik 2022-06-21 14:13:57 UTC
(In reply to Avihai from comment #4)
> Due to the effect of this bug may have on customers using this ansible
> script I'm setting severity to High.
> 
> To clarify:
> This bug should take care of the ansible side that triggers that extra
> portal addition and bug 2097614 should take care of engine side that moves
> the host to non-operational state.

ok, so realizing that the transition to non-operational should be prevented by the fix for bz 2097614,
1. why do we care about the duplication? do we know of any side effect of that other than the transition to non-operational? 
2. why do we assume it's Ansible that causes this and not any other component in the chain? did we check if that happens with previous 4.4 versions?

Comment 9 Nir Soffer 2022-06-21 17:22:50 UTC
(In reply to Arik from comment #8)
> 1. why do we care about the duplication? do we know of any side effect of
> that other than the transition to non-operational? 

The fix may not work for all cases, and we get a warning for every connection when
we have duplicate portals. This is not a valid configuration and we need to fix this
issue.

> 2. why do we assume it's Ansible that causes this and not any other
> component in the chain? did we check if that happens with previous 4.4
> versions?

Looking at the discover-targets code in vdsm, it seems that the tpgt value
comes from the server and is returned by vdsm to the engine. The engine should
keep the tpgt value and use it later when connecting to the storage server.

My guess is that the Ansible code specifies tpgt=1 (maybe as a default)
instead of using the values returned by the engine.

We know that we never see this issue when using the engine UI.

Comment 10 Nir Soffer 2022-06-21 18:20:11 UTC
Here is an example of iSCSI discovery and how the engine keeps the info:

2022-06-21 21:06:39,734+0300 INFO  (jsonrpc/2) [vdsm.api] START discoverSendTargets(con={'connection': 'alpine', 'port': '3260', 'user': '', 'password': '', 'ipv6_enabled': 'false'}) from=::ffff:192.168.122.10,40802, flow_id=ae541098-cc78-4645-8455-caa892cfe65f, task_id=fa8d5017-82b9-4362-ba68-f492cc71bca8 (api:48)

2022-06-21 21:06:39,953+0300 INFO  (jsonrpc/2) [vdsm.api] FINISH discoverSendTargets return={'targets': ['iqn.2003-01.org.alpine.04', 'iqn.2003-01.org.alpine.04', 'iqn.2003-01.org.alpine.03', 'iqn.2003-01.org.alpine.03', 'iqn.2003-01.org.alpine.02', 'iqn.2003-01.org.alpine.02', 'iqn.2003-01.org.alpine.01', 'iqn.2003-01.org.alpine.01'], 'fullTargets': ['192.168.122.35:3260,1 iqn.2003-01.org.alpine.04', '192.168.122.34:3260,1 iqn.2003-01.org.alpine.04', '192.168.122.35:3260,1 iqn.2003-01.org.alpine.03', '192.168.122.34:3260,1 iqn.2003-01.org.alpine.03', '192.168.122.35:3260,1 iqn.2003-01.org.alpine.02', '192.168.122.34:3260,1 iqn.2003-01.org.alpine.02', '192.168.122.34:3260,1 iqn.2003-01.org.alpine.01', '192.168.122.35:3260,1 iqn.2003-01.org.alpine.01']} from=::ffff:192.168.122.10,40802, flow_id=ae541098-cc78-4645-8455-caa892cfe65f, task_id=fa8d5017-82b9-4362-ba68-f492cc71bca8 (api:54)

The formatted reply:

{
  ...
  "fullTargets": [
    "192.168.122.35:3260,1 iqn.2003-01.org.alpine.04",
    "192.168.122.34:3260,1 iqn.2003-01.org.alpine.04",
    "192.168.122.35:3260,1 iqn.2003-01.org.alpine.03",
    "192.168.122.34:3260,1 iqn.2003-01.org.alpine.03",
    "192.168.122.35:3260,1 iqn.2003-01.org.alpine.02",
    "192.168.122.34:3260,1 iqn.2003-01.org.alpine.02",
    "192.168.122.34:3260,1 iqn.2003-01.org.alpine.01",
    "192.168.122.35:3260,1 iqn.2003-01.org.alpine.01"
  ]
}

On the engine side this info is kept as:

engine=# select connection,iqn,port,portal from storage_server_connections where connection='192.168.122.34';
-[ RECORD 1 ]-------------------------
connection | 192.168.122.34
iqn        | iqn.2003-01.org.alpine.02
port       | 3260
portal     | 1
-[ RECORD 2 ]-------------------------
connection | 192.168.122.34
iqn        | iqn.2003-01.org.alpine.01
port       | 3260
portal     | 1

The term "portal" is wrong, the value 1 is the tpgt (Target Portal Group Tag).

So most likely the Ansible script specifies portal=1 when it should either not
specify anything or use the actual value reported when discovering the targets.

Comment 11 Nir Soffer 2022-06-21 18:29:42 UTC
Similar to bug 1493914.

Comment 12 Nir Soffer 2022-06-21 20:08:41 UTC
More testing shows that it is possible to get the same result from the UI by adding
a new connection. Since the UI does not have a way to specify the tpgt value
(stored as "portal" in the engine DB), new connections always use portal=1. It is
possible that the Ansible script does not specify the portal and the engine uses
the default.

Comment 13 Arik 2022-06-22 07:11:31 UTC
(In reply to Nir Soffer from comment #12)
> More testing shows that it is possible to get the same result from the UI,
> by adding
> a new connection. Since the UI does not have a way to specify the tgpt value 
> (stored as portal in engine db), new connections always use portal=1. It is
> possible
> that the ansible script does not specify the portal and engine uses the
> default.

Thanks Nir, that makes more sense. I see that we have a default value of "1" on the backend side [1], which is used by the REST-API layer, so it might be related to both of the aforementioned flows (Ansible and UI) that use the API (but that doesn't seem to be a new thing; it's been that way for 8 years as far as I can tell).

That almost answers my second question in comment 2; the other part, which seems more relevant now, is whether it also happens with previous 4.4 versions and what the implication of that is.
I understand that the fix for bz 2097614 does not necessarily fix this, but it may, right? If it does, and this is something that has "always" been this way, then it's far less interesting...

Moving to ovirt-engine and lowering the severity until we get a good picture of whether it's something new and what the implication is.

[1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.5.1.2/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/businessentities/StorageServerConnections.java#L15

Comment 14 Arik 2022-06-22 07:13:51 UTC
comment 2 -> comment 8

Comment 16 Avihai 2022-06-22 14:05:58 UTC
(In reply to Arik from comment #8)
> (In reply to Avihai from comment #4)
> > Due to the effect of this bug may have on customers using this ansible
> > script I'm setting severity to High.
> > 
> > To clarify:
> > This bug should take care of the ansible side that triggers that extra
> > portal addition and bug 2097614 should take care of engine side that moves
> > the host to non-operational state.
> 
> ok, so realizing that the transition to non-operational should be prevented
> by the fix for bz 2097614,
> 1. why do we care about the duplication?
It's something RHV adds that should not be there; it might cause future issues if RHEL changes, so it's better not to add it at all.

> do we know of any side effect of
> that other than the transition to non-operational? 
Not at the moment.
But it's enough for RHEL to change behavior (like issuing a new error type on duplicate nodes in the iSCSI DB) and we will get another issue.


> 2. why do we assume it's Ansible that causes this and not any other
> component in the chain?
Via the UI, adding a new iSCSI storage domain works without this extra tpgt=1 entry appearing in the host's iSCSI DB, but when using the Ansible ovirt_storage_domain module we do see the extra tpgt=1 entry.
I'm not sure whether it is Ansible that tells the engine to add it or the engine itself.

> did we check if that happens with previous 4.4
> versions?
It was there in older versions, but since we had the engine duplication issue we did not notice it or give it much attention.
Also, in RHEL 7 (RHV 4.3), RHEL did not issue 'session exists' or any other error on a duplicate iSCSI node DB entry when logging in, so we did not have any issues and did not look at those entries at all.
In short, nothing new; it has been there for a long time.

Comment 18 Arik 2022-06-27 14:22:20 UTC
*** Bug 2099997 has been marked as a duplicate of this bug. ***

Comment 19 Arik 2022-06-27 14:24:40 UTC
We plan to solve it by using the data we get from the discovery and matching it with the data specified in the request to get the proper tpgt setting and prevent ending up with duplicate connections (so the default tpgt would not be used unless we cannot find a connection that correlates with the specified input).

Comment 21 Arik 2022-09-23 09:14:40 UTC
(In reply to Arik from comment #19)
> We plan to solve it by using the data we get from the discovery and match it
> with the one specified in the request to get the proper tpgt setting and
> prevent ending with duplicate connections (so the default tpgt would not be
> used unless we didn't find a connection that can be correlated with the
> specified input)

We didn't get to that so far, and the protection we've added in VDSM and ovirt-engine renders this part a nice-to-have now.
Therefore closing, but if something similar happens with el9, we can get back to this since it was a nice idea.