Bug 1263695 - [engine-backend] AddStoragePoolWithStorages fails with NullPointerException after iSCSI connection failure
Status: CLOSED CURRENTRELEASE
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 3.6.0
Hardware: x86_64 Unspecified
Priority: unspecified Severity: high
Target Milestone: ovirt-3.6.1
Target Release: 3.6.0
Assigned To: Maor
QA Contact: Elad
Depends On:
Blocks:
 
Reported: 2015-09-16 08:36 EDT by Elad
Modified: 2016-03-10 07:00 EST
CC List: 9 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when a storage domain was imported and the engine was unable to read the storage domain metadata, the engine threw an exception. The query check has been fixed so that the engine now logs a warning when it cannot read the metadata. VDSM itself then causes the operation to fail when the domain is attached.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments
logs from engine and host (2.15 MB, application/x-gzip)
2015-09-16 08:36 EDT, Elad


External Trackers
Tracker       ID     Priority          Status  Summary                                     Last Updated
oVirt gerrit  47020  master            MERGED  core: Surround vds command with try catch.  Never
oVirt gerrit  47034  ovirt-engine-3.6  MERGED  core: Surround vds command with try catch.  Never

Description Elad 2015-09-16 08:36:02 EDT
Created attachment 1073998
logs from engine and host

Description of problem:
The engine does not correctly handle a VDSM failure to connect to the storage server while creating a storage pool (that is, while attaching the first storage domain in the DC). The CanDoActionFailure fails with a NullPointerException.
This happened on a hosted-engine environment, though I don't think that is relevant.

Version-Release number of selected component (if applicable):
rhevm-3.6.0-12
rhevm-3.6.0-0.15.master.el6.noarch
vdsm-4.17.5-1.el7ev.noarch

How reproducible:
Requires a VDSM failure to connect to the storage server (a failed login to the iSCSI target, in my case).

Steps to Reproduce:
1. Activate 2 hosts in a cluster.
2. Initiate iSCSI storage domain creation and cause one of the hosts to fail its login to the iSCSI server (this happened spontaneously in my case, but it could be reproduced by blocking connectivity with iptables/firewalld during the iSCSI login).


Actual results:
One of the hosts fails to connect to the storage server.

Thread-2831::ERROR::2015-09-16 06:13:30,496::hsm::2454::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2451, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 473, in connect
    iscsi.addIscsiNode(self._iface, self._target, self._cred)
  File "/usr/share/vdsm/storage/iscsi.py", line 201, in addIscsiNode
    iscsiadm.node_login(iface.name, portalStr, target.iqn)
  File "/usr/share/vdsm/storage/iscsiadm.py", line 314, in node_login
    raise IscsiNodeError(rc, out, err)
IscsiNodeError: (8, ['Logging in to [iface: default, target: iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05, portal: 10.35.146.225,3260] (multiple)'], ['iscsiadm: Could not login to [iface: default, target: iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05, portal: 10.35.146.225,3260].', 'iscsiadm: initiator reported error (8 - connection timed out)', 'iscsiadm: Could not log into all portals'])
Thread-2831::DEBUG::2015-09-16 06:13:30,496::hsm::2478::Storage.HSM::(connectStorageServer) knownSDs: {}


The engine does not handle this failure correctly. The CanDoActionFailure fails with a NullPointerException, and Webadmin shows an 'Internal engine error' message.


2015-09-16 00:08:28,784 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-11) [7c4f4cb1] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine_1 is not responding. Host cannot be fenced automatically because power management for the host is disabled.
2015-09-16 00:08:28,785 ERROR [org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand] (ajp-/127.0.0.1:8702-6) [] Error during CanDoActionFailure.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand.isStorageDomainAttachedToStoragePool(AddStoragePoolWithStoragesCommand.java:367) [bll.jar:]



Expected results:
The CanDoActionFailure should complete without throwing a NullPointerException.

Additional info:
logs from engine and host
Comment 1 Maor 2015-10-06 09:31:35 EDT
Worth remembering: if the host fails to read the Storage Domain metadata while the engine validates whether the Storage Domain is attached to another Data Center, the engine should not stop the attach operation (and certainly should not throw an NPE in the process). Instead, it should try to connect the Storage Domain to the DC; in the worst case, if the Storage Domain is still attached to another Data Center, VDSM should fail the operation.
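The behavior described in this comment could be sketched roughly as follows. This is only an illustration under stated assumptions, not the merged patch: DomainMetadata, MetadataReader, and AttachCheckSketch are hypothetical names that do not exist in ovirt-engine, and the real check lives in AddStoragePoolWithStoragesCommand. The idea is to surround the host query with a try/catch and treat unreadable metadata as "not known to be attached", logging a warning instead of throwing.

```java
import java.util.logging.Logger;

/** Hypothetical stand-in for the storage domain metadata the host returns. */
final class DomainMetadata {
    /** ID of the pool the domain is attached to, or null if unattached. */
    final String attachedPoolId;

    DomainMetadata(String attachedPoolId) {
        this.attachedPoolId = attachedPoolId;
    }
}

/** Hypothetical stand-in for the host-side query that reads the metadata. */
interface MetadataReader {
    DomainMetadata read(String storageDomainId) throws Exception;
}

final class AttachCheckSketch {
    private static final Logger LOG =
            Logger.getLogger(AttachCheckSketch.class.getName());

    /**
     * Returns true only when the metadata was read and reports an attached pool.
     * If the query fails or returns nothing, a warning is logged and false is
     * returned, so the attach proceeds and the host can reject it if needed.
     */
    static boolean isAttachedToAPool(MetadataReader reader, String domainId) {
        DomainMetadata metadata;
        try {
            metadata = reader.read(domainId);
        } catch (Exception e) {
            // Assumption: any failure of the host query means "could not verify".
            LOG.warning("Could not read metadata of storage domain " + domainId
                    + ": " + e.getMessage());
            return false;
        }
        if (metadata == null) {
            // Guard against the null result that would otherwise cause an NPE.
            LOG.warning("No metadata returned for storage domain " + domainId);
            return false;
        }
        return metadata.attachedPoolId != null;
    }
}
```

With this shape, a failed iSCSI login on the host surfaces as a warning in the validation step rather than as an internal engine error, and the final decision is left to the attach operation itself.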
Comment 2 Elad 2015-12-01 06:21:30 EST
A VDSM failure to connect to the iSCSI storage server is now handled correctly by the engine.

iSCSI login failure:  

jsonrpc.Executor/3::ERROR::2015-12-01 11:14:48,378::hsm::2465::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2462, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 480, in connect
    iscsi.addIscsiNode(self._iface, self._target, self._cred)
  File "/usr/share/vdsm/storage/iscsi.py", line 201, in addIscsiNode
    iscsiadm.node_login(iface.name, portalStr, target.iqn)
  File "/usr/share/vdsm/storage/iscsiadm.py", line 314, in node_login
    raise IscsiNodeError(rc, out, err)
IscsiNodeError: (8, ['Logging in to [iface: default, target: iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05, portal: 10.35.146.225,3260] (multiple)'], ['iscsiadm: Could not login to [iface: default, target: iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05, portal: 10.35.146.225,3260].', 'iscsiadm: initiator reported error (8 - connection timed out)', 'iscsiadm: Could not log into all portals'])

	
Engine:
	
Operation Canceled
Error while executing action Attach Storage Domain: Network error during communication with the Host.

Tested using:
rhevm-3.6.1-0.2.el6.noarch
vdsm-4.17.11-0.el7ev.noarch
Comment 3 Allon Mureinik 2016-03-10 05:37:59 EST
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE
Comment 4 Allon Mureinik 2016-03-10 05:38:22 EST
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE
Comment 5 Allon Mureinik 2016-03-10 05:44:11 EST
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE
Comment 6 Allon Mureinik 2016-03-10 07:00:54 EST
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE
