Bug 2054745 - Setting SD to maintenance fails and turns the SD to inactive mode as a result
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.50
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.5.0
Target Release: 4.50.0.7
Assignee: Vojtech Juranek
QA Contact: sshmulev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-02-15 15:55 UTC by sshmulev
Modified: 2022-04-20 06:33 UTC
CC List: 6 users

Fixed In Version: vdsm-4.50.0.7
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-04-20 06:33:59 UTC
oVirt Team: Storage
Embargoed:
pm-rhel: ovirt-4.5?
pm-rhel: blocker?




Links:
- Github oVirt vdsm pull 77 (status: open, summary: Bz2054745), last updated 2022-02-16 21:44:38 UTC
- Red Hat Issue Tracker RHV-44704, last updated 2022-02-15 16:07:49 UTC

Description sshmulev 2022-02-15 15:55:32 UTC
Description of problem:
Setting an SD to maintenance fails and turns the SD to inactive mode.
Reproduced on iSCSI and FCP.
Does not reproduce on NFS or Gluster.
After the SD turns to inactive mode, it is possible to activate it again.


Version-Release number of selected component (if applicable):
ovirt-engine-4.5.0-582.gd548206.185.el8ev.noarch
vdsm-4.50.0.5-1.el8ev.x86_64

How reproducible:
100%

Steps to Reproduce:
In the web admin UI, go to "Compute" -> "Data Centers" -> choose an iSCSI or FCP SD -> set it to maintenance mode.

Actual results:
After a few seconds, an error message like the following appears:
"VDSM host_mixed_1 command DisconnectStorageServerVDS failed: Error storage server disconnection: ('domType=3, spUUID=a370dd63-8234-47db-a373-3040ea63f4e1, conList=[]',)
VDSM host_mixed_3 command DisconnectStorageServerVDS failed: Error storage server disconnection: ('domType=3, spUUID=a370dd63-8234-47db-a373-3040ea63f4e1, conList=[]',)
VDSM host_mixed_2 command DisconnectStorageServerVDS failed: Error storage server disconnection: ('domType=3, spUUID=a370dd63-8234-47db-a373-3040ea63f4e1, conList=[]',)
Failed to deactivate Storage Domain iscsi_0 (Data Center golden_env_mixed)."

and on the SPM host we get a Traceback:
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 110, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 1190, in prepare
    raise self.error
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 884, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-119>", line 2, in disconnectStorageServer
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2232, in disconnectStorageServer
    results = storageServer.disconnect(domType, conList)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 932, in disconnect
    con_class, connections = _prepare_connections(dom_type, con_defs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/storageServer.py", line 942, in _prepare_connections
    con_class = ConnectionFactory.registeredConnectionTypes[con_info.type]
UnboundLocalError: local variable 'con_info' referenced before assignment
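
The UnboundLocalError happens because, with an empty connection list, the loop that assigns con_info never runs. A minimal standalone sketch of the failure pattern (simplified, not the actual vdsm code):

    # Simplified sketch of the failure pattern; not the actual vdsm code.
    registered_connection_types = {"iscsi": "IscsiConnection"}

    def prepare_connections(con_defs):
        for con_info in con_defs:
            pass  # with an empty list, the loop body never runs...
        # ...so con_info is never assigned and the lookup below raises
        # UnboundLocalError when con_defs is empty.
        return registered_connection_types[con_info.type]

    prepare_connections([])  # UnboundLocalError: local variable 'con_info' ...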


Expected results:
Setting the SD to maintenance should succeed without any failures.

Additional info:
This is a regression that does not reproduce in the latest 4.4.10 build.
It seems that the fix for bug 1787192 (https://bugzilla.redhat.com/1787192) affected this flow.

Comment 2 Vojtech Juranek 2022-02-15 17:54:41 UTC
This is caused by the engine not sending the connection info to vdsm:

    2022-02-15 15:48:25,654+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStorageServerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-37827) [storagedomains_syncAction_bb8aad06-f] START, DisconnectStorageServerVDSCommand(HostName = host_mixed_3, StorageServerConnectionManagementVDSParameters:{hostId='1d0c3077-9101-4685-bfbb-4efadc7c8899', storagePoolId='a370dd63-8234-47db-a373-3040ea63f4e1', storageType='ISCSI', connectionList='[]', sendNetworkEventOnFailure='true'}), log id: 5499f753

connectionList is an empty list, which causes vdsm to crash.

That said, the vdsm code can be improved to handle this situation more gracefully.
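
One possible shape for such a guard, as a sketch only (the actual fix is in the linked vdsm pull request and may differ; _prepare_connections is the real vdsm helper from the traceback above):

    # Sketch of a graceful guard; illustrative only, not the actual patch.
    def disconnect(dom_type, con_defs):
        if not con_defs:
            # The engine sent an empty connection list; there is nothing
            # to disconnect, so return an empty result instead of crashing.
            return {}
        con_class, connections = _prepare_connections(dom_type, con_defs)
        # ... continue with the normal disconnect flow ...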

Comment 3 Michal Skrivanek 2022-02-16 10:43:04 UTC
A workaround exists: "After the SD turns to inactive mode, it is possible to activate it." -> high severity.

Comment 4 RHEL Program Management 2022-02-16 10:43:12 UTC
This bug report has Keywords: Regression or TestBlocker.
Since no regressions or test blockers are allowed between releases, it is also being identified as a blocker for this release. Please resolve ASAP.

Comment 5 Vojtech Juranek 2022-02-16 17:58:28 UTC
This happens only when multiple SDs are connected to the same iSCSI target; in such a case we cannot disconnect the target. In other words, this is expected engine behaviour, and vdsm should be fixed to do nothing when the connection list is empty.

Comment 6 Nir Soffer 2022-02-16 23:52:22 UTC
We need an engine bug: the engine should never send a disconnect request without
connections to disconnect. This is an invalid request. Unfortunately we must
support broken engines, so we cannot fail the request.

The vdsm schema should be updated to require non-empty connection list for
both connectStorageServer and disconnectStorageServer.
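
As a hypothetical illustration of such a check (not vdsm's actual schema machinery; validate_connection_list is an invented name, and, per the above, a real implementation would likely log and ignore rather than fail, for compatibility with older engines):

    # Hypothetical validation sketch; not vdsm's actual schema code.
    import logging

    log = logging.getLogger(__name__)

    def validate_connection_list(con_list, strict=False):
        if not con_list:
            if strict:
                raise ValueError(
                    "connectStorageServer/disconnectStorageServer "
                    "require a non-empty connection list")
            # Broken engines must still be supported, so by default we
            # only warn and let the caller skip the request.
            log.warning("Received empty connection list; ignoring request")
            return False
        return True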

Comment 7 Avihai 2022-02-22 07:29:03 UTC
(In reply to Michal Skrivanek from comment #3)
> workaround exists: "After the SD turns to inactive mode, it is possible to
> activate it." -> high severity.

Yes, there is a workaround (reactivating the SD), but the fact remains that the SD cannot be moved to maintenance, so customers cannot detach the SD and move it to a different DC or engine, which is a huge blocker IMO -> changing to urgent.

Comment 8 sshmulev 2022-03-01 12:57:59 UTC
Verified successfully.

Versions:
ovirt-engine-4.5.0-743c0a787472.211.el8ev.noarch
vdsm-4.50.0.7-1.el8ev

Verified steps:
1) Set SDs (FCP, iSCSI, NFS, Gluster) to maintenance on an HE environment and on a regular one.
2) Activate all the SDs that were set to maintenance mode.

Expected results:
All steps should pass successfully without error logs.

Actual results: 
As expected.

Comment 10 Sandro Bonazzola 2022-04-20 06:33:59 UTC
This bugzilla is included in the oVirt 4.5.0 release, published on April 20th 2022.

Since the problem described in this bug report should be resolved in the oVirt 4.5.0 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

