Bug 1667723 - After setting SPM priority, DisconnectStorageServer sometimes fails with NullPointerException for an iSCSI disconnection
Summary: After setting SPM priority, DisconnectStorageServer sometimes fails with Null...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Storage
Version: 4.3.0
Hardware: x86_64
OS: Unspecified
unspecified
medium
Target Milestone: ovirt-4.3.4
: 4.3.4.2
Assignee: Ahmad Khiet
QA Contact: Avihai
URL:
Whiteboard:
: 1656742 (view as bug list)
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-01-20 12:11 UTC by Elad
Modified: 2019-11-06 13:43 UTC (History)
5 users (show)

Fixed In Version: ovirt-engine-4.3.4.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-11 06:24:12 UTC
oVirt Team: Storage
rule-engine: ovirt-4.3+


Attachments (Terms of Use)
logs (1.61 MB, application/zip)
2019-01-20 12:11 UTC, Elad
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 98405 0 master MERGED engine: removing host while its being disconnected from storage servers 2019-03-14 15:55:55 UTC
oVirt gerrit 98824 0 ovirt-engine-4.3 MERGED Revert "engine: removing host while its being disconnected from storage servers" 2019-03-24 14:05:59 UTC
oVirt gerrit 99378 0 master MERGED engine: Removing host while it's being disconnected from storage servers 2019-05-28 11:48:38 UTC
oVirt gerrit 100362 0 ovirt-engine-4.3 MERGED engine: Removing host while it's being disconnected from storage servers 2019-05-28 15:09:50 UTC
oVirt gerrit 101344 0 master MERGED engine: remove and maintenance lock conflict 2019-06-30 15:14:54 UTC
oVirt gerrit 101382 0 ovirt-engine-4.3 MERGED engine: remove and maintenance lock conflict 2019-07-25 19:41:07 UTC

Description Elad 2019-01-20 12:11:37 UTC
Created attachment 1521961 [details]
logs

Description of problem:
Sometimes, DisconnectStorageServerVDSCommand fails with NullPointerException for a request to the host to disconnect an iSCSI connection


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.8.2-0.1.el7ev.noarch
vdsm-4.20.46-1.el7ev.x86_64


How reproducible:
Executed ~10 times, reproduced ~5 times, all for disconnection from iSCSI 

Steps to Reproduce:
Storage pool with iSCSI:
1. Set host (not the current SPM) SPM priority to 5
2. Put the host in maintenance 


Actual results:
Sometimes DisconnectStorageServerVDSCommand fails with NullPointerException:


2019-01-12 09:41:09,496+02 ERROR [org.ovirt.engine.core.bll.storage.pool.DisconnectHostFromStoragePoolServersCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-83) [11570294] Command 'org.ovirt.engine.core
.bll.storage.pool.DisconnectHostFromStoragePoolServersCommand' failed: EngineException: java.lang.NullPointerException (Failed with error ENGINE and code 5001)


It doesn't seem that this NPE affect the host maintenance procedure.


Expected results:
No NPE

Comment 1 Sandro Bonazzola 2019-01-28 09:34:33 UTC
This bug has not been marked as blocker for oVirt 4.3.0.
Since we are releasing it tomorrow, January 29th, this bug has been re-targeted to 4.3.1.

Comment 2 Ahmad Khiet 2019-02-13 12:30:30 UTC
need more logs when this issue reproducing.

Comment 3 Elad 2019-02-13 12:31:46 UTC
I'll provide the logs when I see the issue again but they won't be any different from the ones that are already attached

Comment 4 Benny Zlotnik 2019-03-07 16:18:35 UTC
Some notes on the issue:
Setting the SPM priority isn't really relevant here, the issue is that the host is being removed while being disconnected from the storage servers:

Host moved to maintenance:
2019-01-12 09:41:03,746+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.HostMonitoring] (EE-ManagedThreadFactory-engineScheduled-Thread-83) [] Updated host status from 'Preparing for Maintenance' to 'Maintenance' in database, host 'host_mixed_1'(51bfacf7-177b-481b-95f7-426e335c040e)

Then, we disconnect it from the storage pool and storage servers (this is done in async fashion by the VdsEventListener:
2019-01-12 09:41:03,822+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-83) [] START, DisconnectStoragePoolVDSCommand(HostName = host_mixed_1, DisconnectStoragePoolVDSCommandParameters:{hostId='51bfacf7-177b-481b-95f7-426e335c040e', storagePoolId='e027b568-15b6-11e9-8bfd-001a4a168bfc', vds_spm_id='1'}), log id: 5a9592bf

While the host was being disconnected, a remove command was issued on that host:
2019-01-12 09:41:06,041+02 INFO  [org.ovirt.engine.core.bll.RemoveVdsCommand] (default task-39) [hosts_delete_a1d39f63-0337-4f24] Lock Acquired to object 'EngineLock:{exclusiveLocks='[51bfacf7-177b-481b-95f7-426e335c040e=VDS]', sharedLocks=''}'

However DisconnectStoragPoolVDS wasn't finished at the time:
2019-01-12 09:41:09,437+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DisconnectStoragePoolVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-83) [] FINISH, DisconnectStoragePoolVDSCommand, log id: 5a9592bf

And DisconnectStorageServerVDSCommand failed because RemoveVdsCommand was already in motion

Comment 5 Ahmad Khiet 2019-03-10 12:12:32 UTC
as Benny mentioned, the issue is that the host is being removed while its being disconnected from the storage servers to move the host from active to maintenance.

the issue caused by the locking group is different for remove command and maintenance command.

in the patch, the groups has been set to be the same locking group.

Comment 6 Avihai 2019-06-02 12:05:16 UTC
Verified at ovirt engine 4.3.4.2-0.1.el7 with the same scenario as provided in the initial description, no NPE/ERROR occured.

Comment 7 Sandro Bonazzola 2019-06-11 06:24:12 UTC
This bugzilla is included in oVirt 4.3.4 release, published on June 11th 2019.

Since the problem described in this bug report should be
resolved in oVirt 4.3.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 8 Ahmad Khiet 2019-11-06 13:43:21 UTC
*** Bug 1656742 has been marked as a duplicate of this bug. ***


Note You need to log in before you can comment on or make changes to this bug.