Bug 2079903 - Engine should not allow duplicate connection entries in DB, which causes hosts to move to 'Non-Operational' state
Summary: Engine should not allow duplicate connection entries in DB which causes hosts...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.5.0.5
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.5.2
Target Release: 4.5.2
Assignee: Mark Kemel
QA Contact: Evelina Shames
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-04-28 13:13 UTC by Evelina Shames
Modified: 2022-08-30 08:47 UTC
CC List: 6 users

Fixed In Version: ovirt-engine-4.5.2
Clone Of:
Environment:
Last Closed: 2022-08-30 08:47:42 UTC
oVirt Team: Storage
Embargoed:
sbonazzo: ovirt-4.5+
michal.skrivanek: blocker-




Links
System ID Private Priority Status Summary Last Updated
Github oVirt ovirt-engine pull 551 0 None Merged Change iSCSI storage connection validation to properly avoid duplications 2022-07-31 10:54:02 UTC
Red Hat Issue Tracker RHV-45887 0 None None None 2022-04-28 17:10:56 UTC

Description Evelina Shames 2022-04-28 13:13:45 UTC
Description of problem:
As part of bug 2079896, the engine should not allow duplicate connections in the DB.

When we have duplicate connections and we deactivate and then activate a host (or restart vdsm), the host goes to the 'Non-Operational' state, and we see "Could not login to target" errors in the VDSM log:

2022-04-28 10:56:30,332+0300 INFO  (jsonrpc/7) [storage.iscsi] Adding iscsi node for target x.x.x.10:3260,1034 x.com.netapp:vserver-rhv-qe iface default (iscsi:209)
2022-04-28 10:56:30,483+0300 INFO  (jsonrpc/7) [storage.iscsi] Adding iscsi node for target x.x.x.10:3260,1048 x.com.netapp:vserver-rhv-qe iface default (iscsi:209)
2022-04-28 10:56:30,623+0300 INFO  (jsonrpc/7) [storage.iscsi] Adding iscsi node for target x.x.x.10:3260,1033 x.com.netapp:vserver-rhv-qe iface default (iscsi:209)
2022-04-28 10:56:30,753+0300 INFO  (jsonrpc/7) [storage.storageServer] Log in to 3 targets using 3 workers (storageServer:614)
2022-04-28 10:56:30,754+0300 INFO  (iscsi-login/0) [storage.iscsi] Logging in to iscsi target x.x.x.10:3260,1034 x.com.netapp:vserver-rhv-qe via iface default (iscsi:231)
2022-04-28 10:56:30,759+0300 INFO  (iscsi-login/1) [storage.iscsi] Logging in to iscsi target x.x.x.10:3260,1048 x.com.netapp:vserver-rhv-qe via iface default (iscsi:231)
2022-04-28 10:56:30,768+0300 INFO  (iscsi-login/2) [storage.iscsi] Logging in to iscsi target x.x.x.10:3260,1033 x.com.netapp:vserver-rhv-qe via iface default (iscsi:231)
2022-04-28 10:56:30,888+0300 INFO  (iscsi-login/1) [storage.iscsi] Removing iscsi node for target x.x.x.10:3260,1048 x.com.netapp:vserver-rhv-qe iface default (iscsi:244)
2022-04-28 10:56:30,897+0300 INFO  (iscsi-login/2) [storage.iscsi] Removing iscsi node for target x.x.x.10:3260,1033 x.com.netapp:vserver-rhv-qe iface default (iscsi:244)
2022-04-28 10:56:31,082+0300 INFO  (iscsi-login/0) [storage.iscsi] Removing iscsi node for target x.x.x.10:3260,1034 x.com.netapp:vserver-rhv-qe iface default (iscsi:244)
2022-04-28 10:56:31,331+0300 ERROR (iscsi-login/1) [storage.storageServer] Could not login to target

DB looks like this:
 connection |             iqn             | port | portal 
------------+-----------------------------+------+--------
 x.x.x.10   | x.com.netapp:vserver-rhv-qe | 3260 |   1033
 x.x.x.10   | x.com.netapp:vserver-rhv-qe | 3260 |   1034
 x.x.x.10   | x.com.netapp:vserver-rhv-qe | 3260 |   1048


Version-Release number of selected component (if applicable):
Not sure when it started, but we started seeing this in RHV 4.5.
It was investigated in rhv-4.5.0-7.

How reproducible:
Always in our lab.

Steps to Reproduce:
1. Have a clean, freshly installed RHV 4.5 environment.
2. Add a data center, a cluster, and hosts.
3. Add iSCSI storage domains (iSCSI with multiple targets, 3 in our case).
   Check the DB to see these 3 connections:
   /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "select connection,iqn,port,portal from storage_server_connections;"
4. Run 'Fetch storages' and 'Update storage parameters'. During this process, the iSCSI connection list in the engine DB changes to 3 duplicate connections.
   Check the DB again to see it (a duplicate-detection query is sketched below).
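   A minimal duplicate-detection sketch (an illustration only, assuming - as the rows above suggest - that entries sharing connection, iqn and port count as duplicates regardless of portal):
   # Hypothetical helper query: list connection/iqn/port tuples that appear
   # more than once in storage_server_connections.
   /usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "
     SELECT connection, iqn, port, count(*) AS entries
     FROM storage_server_connections
     GROUP BY connection, iqn, port
     HAVING count(*) > 1;"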

Actual results:
The iSCSI connection list in the engine DB changes to 3 duplicate connections.

Expected results:
The iSCSI connection list in the engine DB should remain as it was; the engine should block any attempt to add duplicate connections to the DB.
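A minimal sketch of the kind of pre-insert check this implies, assuming connection/iqn/port is the uniqueness key; such a check would belong in the engine's validation logic rather than in raw SQL, and the literal values below are placeholders taken from the table above:
# Illustration only: detect an existing iSCSI connection with the same
# connection/iqn/port before allowing a new entry.
/usr/share/ovirt-engine/dbscripts/engine-psql.sh -c "
  SELECT EXISTS (
    SELECT 1 FROM storage_server_connections
    WHERE connection = 'x.x.x.10'
      AND iqn = 'x.com.netapp:vserver-rhv-qe'
      AND port = '3260');"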


Additional info:

Comment 2 Arik 2022-04-28 17:01:14 UTC
I see no reason for this not to happen even before, so let's add the regression flag when/if we see that it is not possible with a previous version.
Also dropping the automation blocker flags, since even if we add a validation here, it just means that the tests would stop due to that check.
As I see this, it's a change we can make to fail early when the client provides us with duplicate connections - but that really shouldn't happen.

Comment 3 Avihai 2022-05-01 12:49:45 UTC
Re-adding the Automation blocker and test blocker flags, as this blocks automation, and customers can add duplicate IPs, which should be blocked.
Raising to High as this causes hosts to go to non-operational, which is disruptive.

We might have had this behavior in automation (duplicate IPs) before, but NOW (RHV 4.5/RHEL 8.6) the host goes to non-operational, which did not occur before.
This is the FUNCTIONAL regression part (it might be RHEL, but it is still a regression).

Comment 4 Arik 2022-05-01 15:16:25 UTC
(In reply to Avihai from comment #3)
> Readding Automation blocker and test blocker flags as this blocks automation
> and customers can add duplicate IP's and this should be blocked.

Sure, I didn't close the bug - I thought I had targeted it, but it seems I didn't, so I'm targeting it now. It makes sense to handle this in a way that fails early.

> Raising to High as this cause hosts to go to non-operational which is
> disruptive .
> 
> We might had this behavior in automation ( dup IP's ) before but NOW
> (RHV4.5/RHEL8.6) host goes to non-operational which did not occur before.
> This is the FUNCTIONAL regression part. ( might be RHEL but still regression)

So what would be the expected result then (until we add a validation that would fail on invalid input)?

Comment 5 RHEL Program Management 2022-05-01 15:16:33 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 6 Avihai 2022-05-02 06:07:36 UTC
(In reply to Arik from comment #4)
> (In reply to Avihai from comment #3)
> > Readding Automation blocker and test blocker flags as this blocks automation
> > and customers can add duplicate IP's and this should be blocked.
> 
> Sure, I didn't close the bug - I thought I targeted it but seems I didn't so
> targeting it now, it makes sense to handle in a way that it would fail early
> 
> > Raising to High as this cause hosts to go to non-operational which is
> > disruptive .
> > 
> > We might had this behavior in automation ( dup IP's ) before but NOW
> > (RHV4.5/RHEL8.6) host goes to non-operational which did not occur before.
> > This is the FUNCTIONAL regression part. ( might be RHEL but still regression)
> 
> So what would be the expected result then (until we add a validation that
> would fail on invalid input)?
For this bug we agree that duplicate IPs should be blocked to avoid the host going to non-operational.
This is important and, in my opinion, should be added to 4.5.0, as it causes the host to go to the non-operational state.

About the expected results until validation is added: it should be as in RHV 4.4/4.3:
The host should NOT go to the non-operational state after deactivating and activating it.
This was the behavior until RHV 4.5.

We should find out why in RHV 4.4 SP1 (or perhaps because of RHEL 8.6) the host goes to the non-operational state after duplicate iSCSI storage connections are allowed and the host is deactivated and activated.

Comment 7 Arik 2022-05-02 06:20:36 UTC
The purpose of the non-operational state is to indicate that something is wrong - you have no connection to the storage, a network is missing, etc.
So it's actually better to set the host to the non-operational state in this case rather than proceeding with duplicate settings, which are incorrect.
I understand it requires some changes in our automation, but I would say that users would prefer to know about the issue rather than having their system operating with incorrect settings without them knowing, no?
I think this bug is nice-to-have; we should really concentrate on fixing the core issue, which is the duplicate settings we get from Ansible.

Comment 8 Michal Skrivanek 2022-05-02 11:55:36 UTC
Indeed. Non-Operational is the right behavior in this case. The reason for this in automation is already tracked in bug 2079896.

Is there anything else to check, perhaps whether this can be done by manually adding 2 identical connections?

Comment 9 Evelina Shames 2022-07-13 14:55:41 UTC
(In reply to Arik from comment #7)
> The purpose of non-operational state is to indicate that something is wrong
> - you have no connection to the storage, network is missing, etc
> So it's actually better to set the host to non-operational state in this
> case rather than proceeding with duplicate settings which are incorrect
> I understand it requires some changes in our automation but I would say that
> users would prefer to know about the issue rather than having their system
> operating with incorrect settings without them knowing, no?
> I think this bug is nice-to-have, we should really concentrate on fixing the
> core issue with is the duplicate settings we get from Ansible

The purpose of this bug is that the engine should NOT allow duplicate connections - if this were blocked by the engine, we would see the problem much earlier, and not only when deactivating and activating the host, when it goes non-operational and everything breaks.

I don't agree that this bug is nice-to-have; the engine should block the ability to have duplicate connections to avoid broken environments.

Comment 10 Avihai 2022-07-21 11:14:50 UTC
After a discussion with Arik, we understand that we do indeed want to block users from inserting duplicate connection entries and fix this bug.

It's not an urgent fix, and this is why:
Due to the fix and verification of VDSM bug 2097614 in 4.5.2, even if a user or Ansible enters a duplicate connection into the engine DB or via the UI, this should not break host deactivation-activation, as VDSM will no longer remove the last existing connection and thus cause the host to go to the non-operational state.

But as we do not want to take any risk with a user entering duplicate connections, both Arik and I agree this should be fixed.

Comment 11 Evelina Shames 2022-08-08 07:02:22 UTC
Verified on: ovirt-engine-4.5.2.1-0.1.el8ev

The engine doesn't allow duplicate connections and shows a clear message:
"Cannot add Storage Connection. Storage connection 5866559d-cca2-4356-9f08-2f8c63be20bb already exists"

Comment 12 Sandro Bonazzola 2022-08-30 08:47:42 UTC
This bugzilla is included in the oVirt 4.5.2 release, published on August 10th 2022.
Since the problem described in this bug report should be resolved in the oVirt 4.5.2 release, it has been closed with a resolution of CURRENT RELEASE.
If the solution does not work for you, please open a new bug report.

