Bug 2099640
| Summary: | Adding iscsi storage domain with ansible module ovirt_storage_domain / webadmin adds an extra ISCSI portal <IP:3260,1> | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-engine | Reporter: | Kobi Hakimi <khakimi> |
| Component: | BLL.Storage | Assignee: | Mark Kemel <mkemel> |
| Status: | CLOSED WONTFIX | QA Contact: | Shir Fishbain <sfishbai> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.5.1 | CC: | aefrat, ahadas, bugs, mnecas, mperina, nsoffer, sfishbai |
| Target Milestone: | --- | Flags: | pm-rhel: ovirt-4.5? |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-23 09:14:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Kobi Hakimi
2022-06-21 11:48:19 UTC
btw, when I add the same lun as an iscsi storage domain manually it didn't reproduce.

This extra portal "Portal: 10.0.x.y:3260,1" is an additional scenario/root cause of the host moving to a non-operational state (after deactivating and activating the host) described in bug 2097614. We saw this issue (extra portal) in ALL QE ENVs on the first host after we used the ovirt_storage_domain playbook to build/deploy QE ENVs.

Due to the effect this bug may have on customers using this ansible script, I'm setting severity to High.

To clarify:
This bug should take care of the ansible side that triggers the extra portal addition, and bug 2097614 should take care of the engine side that moves the host to a non-operational state.

In our case, we build our iscsi storage domain with the infra role, and at this point we can see that the extra portal is added in:
https://github.com/oVirt/ovirt-ansible-collection/blame/master/roles/infra/roles/storages/tasks/main.yml#L27

(In reply to Avihai from comment #4)
> Due to the effect of this bug may have on customers using this ansible
> script I'm setting severity to High.
>
> To clarify:
> This bug should take care of the ansible side that triggers that extra
> portal addition and bug 2097614 should take care of engine side that moves
> the host to non-operational state.

ok, so realizing that the transition to non-operational should be prevented by the fix for bz 2097614,
1. why do we care about the duplication? do we know of any side effect of that other than the transition to non-operational?
2. why do we assume it's Ansible that causes this and not any other component in the chain? did we check if that happens with previous 4.4 versions?

(In reply to Arik from comment #8)
> 1. why do we care about the duplication? do we know of any side effect of
> that other than the transition to non-operational?

The fix may not work for all cases, and we get a warning for every connection when we have duplicate portals. This is not a valid configuration and we need to fix this issue.

> 2. why do we assume it's Ansible that causes this and not any other
> component in the chain? did we check if that happens with previous 4.4
> versions?

Looking at the discover targets code in vdsm, it seems that the tpgt value comes from the server and is returned by vdsm to engine. Engine should keep the tpgt value and use it later when connecting to the storage server. My guess is that the ansible code is specifying tpgt=1 (maybe as a default) instead of using the values returned by engine. We know that using the engine UI we never see this issue.

Here is an example of iscsi discovery, and how engine keeps the info:
2022-06-21 21:06:39,734+0300 INFO (jsonrpc/2) [vdsm.api] START discoverSendTargets(con={'connection': 'alpine', 'port': '3260', 'user': '', 'password': '', 'ipv6_enabled': 'false'}) from=::ffff:192.168.122.10,40802, flow_id=ae541098-cc78-4645-8455-caa892cfe65f, task_id=fa8d5017-82b9-4362-ba68-f492cc71bca8 (api:48)
2022-06-21 21:06:39,953+0300 INFO (jsonrpc/2) [vdsm.api] FINISH discoverSendTargets return={'targets': ['iqn.2003-01.org.alpine.04', 'iqn.2003-01.org.alpine.04', 'iqn.2003-01.org.alpine.03', 'iqn.2003-01.org.alpine.03', 'iqn.2003-01.org.alpine.02', 'iqn.2003-01.org.alpine.02', 'iqn.2003-01.org.alpine.01', 'iqn.2003-01.org.alpine.01'], 'fullTargets': ['192.168.122.35:3260,1 iqn.2003-01.org.alpine.04', '192.168.122.34:3260,1 iqn.2003-01.org.alpine.04', '192.168.122.35:3260,1 iqn.2003-01.org.alpine.03', '192.168.122.34:3260,1 iqn.2003-01.org.alpine.03', '192.168.122.35:3260,1 iqn.2003-01.org.alpine.02', '192.168.122.34:3260,1 iqn.2003-01.org.alpine.02', '192.168.122.34:3260,1 iqn.2003-01.org.alpine.01', '192.168.122.35:3260,1 iqn.2003-01.org.alpine.01']} from=::ffff:192.168.122.10,40802, flow_id=ae541098-cc78-4645-8455-caa892cfe65f, task_id=fa8d5017-82b9-4362-ba68-f492cc71bca8 (api:54)
The formatted reply:
{
    ...
    "fullTargets": [
        "192.168.122.35:3260,1 iqn.2003-01.org.alpine.04",
        "192.168.122.34:3260,1 iqn.2003-01.org.alpine.04",
        "192.168.122.35:3260,1 iqn.2003-01.org.alpine.03",
        "192.168.122.34:3260,1 iqn.2003-01.org.alpine.03",
        "192.168.122.35:3260,1 iqn.2003-01.org.alpine.02",
        "192.168.122.34:3260,1 iqn.2003-01.org.alpine.02",
        "192.168.122.34:3260,1 iqn.2003-01.org.alpine.01",
        "192.168.122.35:3260,1 iqn.2003-01.org.alpine.01"
    ]
}
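To make the format explicit: each fullTargets entry packs the portal address, port, tpgt and IQN into one string. The following is a minimal illustrative sketch (not vdsm or engine code) of how such an entry can be split into its components:

```python
# Illustrative sketch (not vdsm/engine code): parse a discovery "fullTargets"
# entry of the form "<address>:<port>,<tpgt> <iqn>" into its components.
from typing import NamedTuple


class DiscoveredTarget(NamedTuple):
    address: str
    port: int
    tpgt: int      # Target Portal Group Tag, stored as "portal" in the engine db
    iqn: str


def parse_full_target(entry: str) -> DiscoveredTarget:
    portal_part, iqn = entry.split(" ", 1)
    address_port, tpgt = portal_part.rsplit(",", 1)
    address, port = address_port.rsplit(":", 1)
    return DiscoveredTarget(address, int(port), int(tpgt), iqn)


if __name__ == "__main__":
    sample = "192.168.122.34:3260,1 iqn.2003-01.org.alpine.01"
    print(parse_full_target(sample))
    # DiscoveredTarget(address='192.168.122.34', port=3260, tpgt=1,
    #                  iqn='iqn.2003-01.org.alpine.01')
```

Running it on the sample entry yields the address, port, tpgt and IQN, which is exactly the per-connection data engine keeps, as shown next.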
On engine side this info is kept as:
engine=# select connection,iqn,port,portal from storage_server_connections where connection='192.168.122.34';
-[ RECORD 1 ]-------------------------
connection | 192.168.122.34
iqn | iqn.2003-01.org.alpine.02
port | 3260
portal | 1
-[ RECORD 2 ]-------------------------
connection | 192.168.122.34
iqn | iqn.2003-01.org.alpine.01
port | 3260
portal | 1
The term "portal" here is misleading: the value 1 is actually the tpgt (Target Portal Group Tag). So most likely the ansible script specifies portal=1 when it should either not specify anything or pass the actual value reported when discovering the targets.
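To make the "duplicate portals" problem concrete, here is a small illustrative sketch (not engine code; the keys mirror the storage_server_connections columns shown above, and the sample addresses/IQNs are made up). It flags connections that point at the same target but disagree on the portal/tpgt value, which is what happens when one record comes from discovery and another is added through the API with the default of 1:

```python
# Illustrative sketch (not engine code): flag storage server connections that
# refer to the same target (connection address, iqn, port) but disagree on the
# portal/tpgt value -- the duplicate configuration discussed above.
from collections import defaultdict


def find_duplicate_portals(rows):
    """rows: dicts with 'connection', 'iqn', 'port' and 'portal' keys.
    Returns {(connection, iqn, port): sorted portal values} for targets
    that appear with more than one distinct portal/tpgt."""
    portals = defaultdict(set)
    for row in rows:
        portals[(row["connection"], row["iqn"], row["port"])].add(row["portal"])
    return {key: sorted(vals) for key, vals in portals.items() if len(vals) > 1}


if __name__ == "__main__":
    rows = [
        # Record created from discovery, carrying the tpgt reported by the target.
        {"connection": "10.0.1.2", "iqn": "iqn.2003-01.example:tgt1", "port": 3260, "portal": 2},
        # Record added through the API without a portal, falling back to the default of 1.
        {"connection": "10.0.1.2", "iqn": "iqn.2003-01.example:tgt1", "port": 3260, "portal": 1},
    ]
    print(find_duplicate_portals(rows))
    # {('10.0.1.2', 'iqn.2003-01.example:tgt1', 3260): [1, 2]}
```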
Similar to bug 1493914.

More testing shows that it is possible to get the same result from the UI, by adding a new connection. Since the UI does not have a way to specify the tpgt value (stored as portal in the engine db), new connections always use portal=1. It is possible that the ansible script does not specify the portal and engine uses the default.

(In reply to Nir Soffer from comment #12)
> More testing shows that it is possible to get the same result from the UI,
> by adding
> a new connection. Since the UI does not have a way to specify the tgpt value
> (stored as portal in engine db), new connections always use portal=1. It is
> possible
> that the ansible script does not specify the portal and engine uses the
> default.

Thanks Nir, that makes more sense - I see that we have a default value of "1" on the backend side [1] which is used by the REST-API layer, so it might be related to both of the aforementioned flows (Ansible and UI) that use the API (but that doesn't seem to be a new thing, it's been that way for 8 years as far as I can tell).

That almost answers my second question in comment 2 - the other part, which seems to be more relevant now, is whether it also happens with previous 4.4 versions and what the implication of that is. I understand that the fix for bz 2097614 does not necessarily fix this, but it may, right? If it does, and this is something that has "always" been this way, then it's far less interesting... Moving to ovirt-engine and lowering the severity until we get a good picture of whether it's something new and what the implication is.

[1] https://github.com/oVirt/ovirt-engine/blob/ovirt-engine-4.5.1.2/backend/manager/modules/common/src/main/java/org/ovirt/engine/core/common/businessentities/StorageServerConnections.java#L15

(In reply to Arik from comment #8)
> (In reply to Avihai from comment #4)
> > Due to the effect of this bug may have on customers using this ansible
> > script I'm setting severity to High.
> >
> > To clarify:
> > This bug should take care of the ansible side that triggers that extra
> > portal addition and bug 2097614 should take care of engine side that moves
> > the host to non-operational state.
>
> ok, so realizing that the transition to non-operational should be prevented
> by the fix for bz 2097614,
> 1. why do we care about the duplication?

It's something RHV adds that should not be there and that might cause future issues if RHEL changes its behavior, so better not to add it at all.

> do we know of any side effect of
> that other than the transition to non-operational?

Not at the moment. But it's enough for RHEL to change its behavior (for example, issuing a new error type on duplicate nodes in the iSCSI DB) and we will get another issue.

> 2. why do we assume it's Ansible that causes this and not any other
> component in the chain?

Adding a new iSCSI storage domain via the UI works without this extra tpgt=1 entry appearing in the host iSCSI DB, but when using the ansible ovirt_storage_domain module we do see the extra tpgt=1 entry. I'm not sure whether it's ansible that is telling the engine to add it or the engine itself.

> did we check if that happens with previous 4.4
> versions?

It was there in older versions, but since we had the engine duplication issue we did not notice it or give it much attention. Also, in RHEL 7 (RHV 4.3) RHEL did not issue a 'session exists' or any other error on duplicate iSCSI node DB entries when logging in, so we had no issues and did not look at those entries at all. In short, nothing new - it was there for a long time.
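As a side note on the "extra tpgt=1 entry in the host iSCSI DB" mentioned above: such duplicates can be spotted on a host with a short script like the one below. This is a hedged sketch, not an oVirt tool; it assumes that `iscsiadm -m node` prints one record per line in the usual `<address>:<port>,<tpgt> <iqn>` form.

```python
# Hedged sketch (not an oVirt component): list targets that appear in the
# host's iSCSI node DB under more than one tpgt, assuming `iscsiadm -m node`
# prints records as "<address>:<port>,<tpgt> <iqn>".
import subprocess
from collections import defaultdict


def duplicated_node_records():
    out = subprocess.run(
        ["iscsiadm", "-m", "node"], capture_output=True, text=True, check=True
    ).stdout
    tpgts = defaultdict(set)
    for line in out.splitlines():
        if " " not in line or "," not in line:
            continue  # skip anything that does not look like a node record
        portal_part, iqn = line.split(" ", 1)
        address_port, tpgt = portal_part.rsplit(",", 1)
        tpgts[(address_port, iqn.strip())].add(int(tpgt))
    return {key: sorted(vals) for key, vals in tpgts.items() if len(vals) > 1}


if __name__ == "__main__":
    for (portal, iqn), vals in duplicated_node_records().items():
        print(f"{portal} {iqn}: tpgt values {vals}")
```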
*** Bug 2099997 has been marked as a duplicate of this bug. ***

We plan to solve it by using the data we get from the discovery and matching it with the data specified in the request to get the proper tpgt setting and prevent ending up with duplicate connections (so the default tpgt would not be used unless we couldn't find a connection that can be correlated with the specified input).

(In reply to Arik from comment #19)
> We plan to solve it by using the data we get from the discovery and match it
> with the one specified in the request to get the proper tgpt setting and
> prevent ending with duplicate connections (so the default tgpt would not be
> used unless we didn't find a connection that can be correlated with the
> specified input)

We didn't get to that so far, and the protection we've added in VDSM and ovirt-engine renders this part a nice-to-have now. Therefore closing, but if something similar happens with el9 we can get back to this, since it was a nice idea.
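For illustration only, the matching approach described in comment #19 could look roughly like the sketch below. It is not the engine implementation (which, per the closing comment, was never written); it just shows the idea of taking the tpgt from the discovered target that correlates with the requested address/port/IQN and falling back to the default of 1 only when nothing matches.

```python
# Illustrative sketch of the proposed matching (not the engine implementation):
# pick the tpgt from the discovered target that matches the requested
# address/port/iqn, and use the default of 1 only when nothing matches.
DEFAULT_TPGT = 1


def resolve_tpgt(discovered, address, port, iqn):
    """discovered: iterable of (address, port, tpgt, iqn) tuples from discovery.
    Returns the tpgt to store for the requested connection."""
    for d_address, d_port, d_tpgt, d_iqn in discovered:
        if (d_address, d_port, d_iqn) == (address, port, iqn):
            return d_tpgt
    return DEFAULT_TPGT  # no correlated discovery record -> keep the old default


if __name__ == "__main__":
    discovered = [
        ("192.168.122.34", 3260, 1, "iqn.2003-01.org.alpine.01"),
        ("192.168.122.35", 3260, 1, "iqn.2003-01.org.alpine.01"),
    ]
    # Matches a discovered target, so its tpgt is used.
    print(resolve_tpgt(discovered, "192.168.122.34", 3260, "iqn.2003-01.org.alpine.01"))
    # No correlated target, so the default is used.
    print(resolve_tpgt(discovered, "192.168.122.99", 3260, "iqn.2003-01.org.alpine.01"))
```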