Bug 1263695

Summary: [engine-backend] AddStoragePoolWithStorages fails with NullPointerException after iSCSI connection failure
Product: Red Hat Enterprise Virtualization Manager
Reporter: Elad <ebenahar>
Component: ovirt-engine
Assignee: Maor <mlipchuk>
Status: CLOSED CURRENTRELEASE
QA Contact: Elad <ebenahar>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.6.0
CC: amureini, bgraveno, gklein, lsurette, rbalakri, Rhev-m-bugs, tnisan, yeylon, ykaul
Target Milestone: ovirt-3.6.1
Target Release: 3.6.0
Hardware: x86_64
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Previously, when importing a storage domain, the engine threw an exception if it was unable to read the storage domain metadata. The query check has been fixed so that if the engine is unable to read the metadata, it issues a warning instead. VDSM will then cause the operation to fail naturally when the domain is attached, if it is still attached elsewhere.
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
logs from engine and host (flags: none)

Description Elad 2015-09-16 12:36:02 UTC
Created attachment 1073998 [details]
logs from engine and host

Description of problem:
A vdsm failure to connect to the storage server while creating a storage pool (the first storage domain in the DC) is not handled correctly by the engine: the CanDoAction validation fails with a NullPointerException.
This happened on a hosted-engine environment, though I don't think that is relevant.

Version-Release number of selected component (if applicable):
rhevm-3.6.0-12
rhevm-3.6.0-0.15.master.el6.noarch
vdsm-4.17.5-1.el7ev.noarch

How reproducible:
Requires a vdsm failure to connect to the storage server (a failed login to the iSCSI target, in my case)

Steps to Reproduce:
1. Activate 2 hosts in a cluster
2. Initiate iSCSI storage domain creation and cause one of the hosts to fail its login to the iSCSI server (this happened spontaneously in my case, but it can be reproduced by blocking connectivity with iptables/firewalld during the iSCSI login)


Actual results:
One of the hosts fails to connect to the storage server.

Thread-2831::ERROR::2015-09-16 06:13:30,496::hsm::2454::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2451, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 473, in connect
    iscsi.addIscsiNode(self._iface, self._target, self._cred)
  File "/usr/share/vdsm/storage/iscsi.py", line 201, in addIscsiNode
    iscsiadm.node_login(iface.name, portalStr, target.iqn)
  File "/usr/share/vdsm/storage/iscsiadm.py", line 314, in node_login
    raise IscsiNodeError(rc, out, err)
IscsiNodeError: (8, ['Logging in to [iface: default, target: iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05, portal: 10.35.146.225,3260] (multiple)'], ['iscsiadm: Could not login to [iface: default, target: iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05, portal: 10.35.146.225,3260].', 'iscsiadm: initiator reported error (8 - connection timed out)', 'iscsiadm: Could not log into all portals'])
Thread-2831::DEBUG::2015-09-16 06:13:30,496::hsm::2478::Storage.HSM::(connectStorageServer) knownSDs: {}


This failure is not handled correctly by the engine: the CanDoAction validation fails with a NullPointerException, and Webadmin shows an 'Internal engine error' message.


2015-09-16 00:08:28,784 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (org.ovirt.thread.pool-7-thread-11) [7c4f4cb1] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: Host hosted_engine_1 is not responding. Host cannot be fenced automatically because power management for the host is disabled.
2015-09-16 00:08:28,785 ERROR [org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand] (ajp-/127.0.0.1:8702-6) [] Error during CanDoActionFailure.: java.lang.NullPointerException
        at org.ovirt.engine.core.bll.storage.AddStoragePoolWithStoragesCommand.isStorageDomainAttachedToStoragePool(AddStoragePoolWithStoragesCommand.java:367) [bll.jar:]
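
For context, the NPE most likely comes from dereferencing a null result after the host failed to read the domain metadata. Below is a minimal, self-contained sketch of that failure pattern; apart from the role it plays, every name here is a hypothetical stand-in, not the actual ovirt-engine source.

// Hypothetical sketch of the suspected failure pattern in
// isStorageDomainAttachedToStoragePool(); not the actual ovirt-engine code.
public class NpeSketch {

    /** Minimal stand-in for the storage domain metadata read from the host. */
    static class DomainInfo {
        final String storagePoolId; // null when not attached to any pool
        DomainInfo(String storagePoolId) { this.storagePoolId = storagePoolId; }
    }

    /**
     * Stands in for the VDSM metadata query. Returns null when the host could
     * not connect to the storage server (as after the iSCSI login timeout
     * above, where vdsm ends up with "knownSDs: {}").
     */
    static DomainInfo fetchDomainInfo(boolean hostConnected) {
        return hostConnected ? new DomainInfo(null) : null;
    }

    static boolean isAttachedToAnotherPool(boolean hostConnected) {
        DomainInfo info = fetchDomainInfo(hostConnected);
        // Missing null check: when the metadata read failed, this throws the
        // NullPointerException that surfaces in the engine log as
        // "Error during CanDoActionFailure.: java.lang.NullPointerException".
        return info.storagePoolId != null;
    }

    public static void main(String[] args) {
        System.out.println(isAttachedToAnotherPool(true)); // prints: false
        isAttachedToAnotherPool(false);                    // throws the NPE
    }
}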



Expected results:
The CanDoAction validation should handle the connection failure gracefully and report a meaningful error instead of throwing a NullPointerException.

Additional info:
logs from engine and host

Comment 1 Maor 2015-10-06 13:31:35 UTC
Worth remembering: when the host fails to read the storage domain metadata (which is needed to validate whether the storage domain is attached to another data center), the engine should not stop the attach operation, and certainly should not throw an NPE in the process. Instead it should try to connect the storage domain to the DC; in the worst case, if the storage domain is still attached to another data center, VDSM will fail the operation. See the sketch below.
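
A minimal sketch of that intended handling, reusing the hypothetical DomainInfo stand-in from the description above (again an illustration under assumed names, not the actual fix):

// Warn and continue instead of throwing, leaving the final rejection to VDSM.
static boolean isAttachedToAnotherPool(DomainInfo info) {
    if (info == null) {
        // Metadata could not be read (e.g. after an iSCSI login failure):
        // warn and do not block the attach; VDSM will reject the operation
        // itself if the domain is still attached to another data center.
        System.err.println("WARN: could not read storage domain metadata; "
                + "skipping the attached-to-pool check.");
        return false;
    }
    return info.storagePoolId != null;
}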

Comment 2 Elad 2015-12-01 11:21:30 UTC
A vdsm failure to connect to the storage server (iSCSI) is now handled correctly by the engine.

iSCSI login failure:  

jsonrpc.Executor/3::ERROR::2015-12-01 11:14:48,378::hsm::2465::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2462, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 480, in connect
    iscsi.addIscsiNode(self._iface, self._target, self._cred)
  File "/usr/share/vdsm/storage/iscsi.py", line 201, in addIscsiNode
    iscsiadm.node_login(iface.name, portalStr, target.iqn)
  File "/usr/share/vdsm/storage/iscsiadm.py", line 314, in node_login
    raise IscsiNodeError(rc, out, err)
IscsiNodeError: (8, ['Logging in to [iface: default, target: iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05, portal: 10.35.146.225,3260] (multiple)'], ['iscsiadm: Could not login to [iface: default, target: iqn.2008-05.com.xtremio:xio00153500071-514f0c50023f6c05, portal: 10.35.146.225,3260].', 'iscsiadm: initiator reported error (8 - connection timed out)', 'iscsiadm: Could not log into all portals'])

Engine:

Operation Canceled
Error while executing action Attach Storage Domain: Network error during communication with the Host.

Tested using:
rhevm-3.6.1-0.2.el6.noarch
vdsm-4.17.11-0.el7ev.noarch

Comment 3 Allon Mureinik 2016-03-10 10:37:59 UTC
RHEV 3.6.0 has been released, setting status to CLOSED CURRENTRELEASE
