Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1192642

Summary: [Bug 3.3] Can't add new FC storage domain
Product: Red Hat Enterprise Virtualization Manager
Reporter: Robert McSwain <rmcswain>
Component: ovirt-engine
Assignee: Daniel Erez <derez>
Status: CLOSED NOTABUG
QA Contact: Aharon Canan <acanan>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.3.0
CC: adahms, amureini, derez, ecohen, iheim, lpeer, lsurette, rbalakri, Rhev-m-bugs, rmcswain, yeylon
Target Milestone: ovirt-3.6.3
Target Release: 3.6.0
Hardware: x86_64
OS: Linux
Whiteboard: storage
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-02-17 15:16:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Storage
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Robert McSwain 2015-02-13 22:46:23 UTC
Release:    Red Hat Enterprise Virtualization Hypervisor release 6.5 
(20140407.0.el6ev)
Kernel:     Linux encl1-usrhv201 2.6.32-431.11.2.el6.x86_64 #1 SMP Mon Mar 3 
13:32:45 EST 2014 x86_64 x86_64 x86_64 GNU/Linux
Uptime:     08:46:48 up 4 days, 18:28,  0 users,  load average: 0.83, 0.74, 0.77

CPUINFO
        Processor count: 24
        vendor_id       : GenuineIntel
        model name      : Intel(R) Xeon(R) CPU E5-2667 0 @ 2.90GHz
        cpu MHz         : 2900.118
        siblings        : 12    (per socket)
        cpu cores       : 6     (per socket)
        Socket:  1   [logical cores:  12 ]
        Socket:  0   [logical cores:  12 ]

BIOS Information
                Vendor: HP
                Version: I31
                Release Date: 02/10/2014
                iLO     Firmware Revision: 1.51
System Information
                Manufacturer: HP
                Product Name: ProLiant BL460c Gen8

MEMINFO
        MemTotal:       132128300 kB            Note:  132,128,300  kB
        SwapTotal:      82841592 kB             Note:  82,841,592  kB
        SwapFree:       82841592 kB
        HugePages_Total:       0

==== Problem description ====

The customer has a RHEV 3.3 environment consisting of 10 BL460 Gen8 hypervisors
and a DL360p Gen8 running RHEV-M. They have a single FC storage domain on an
EMC VNX5500 storage array.

They recently ran a test in which they wanted to add an additional FC storage
array, an EMC VNXe3200, which is configured/zoned on the same FC switches. For
the test they created a LUN, and from the hypervisor the customer can see the
new LUN. He was able to create a new storage domain called "eng_lun_usdfs";
however, he was not able to attach the new LUN to the data center. The error was

"failed to attach data to the datacenter"

The LUN was presented to hypervisor "encl1-usrhv201" only as part of this
test. I wonder if this is the problem. In earlier releases of RHEV there was
a clear statement that storage domains needed to be accessible to all hosts.
It makes sense to me that ALL nodes need to be able to access the storage
domain.

I could only find one sentence in the 3.3 Administration Guide that suggests
this is mandatory:

<snip>
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Virtualization/
3.3/html-single/Administration_Guide/index.html

A local storage domain can be set up on a host. When you set up host to use 
local storage, the host automatically gets added to a new data center and 
cluster that no other hosts can be added to. Multiple host clusters require 
that all hosts have access to all storage domains, which is not possible with 
local storage. Virtual machines created in a single host cluster cannot be 
migrated, fenced or scheduled. 

</snip>

I found some comments that suggest this is not the case:

<snip>
All communication to the storage domain is via the selected host and not 
directly from the Red Hat Enterprise Virtualization Manager. At least one active 
host must exist in the system, and be attached to the chosen data center, before 
the storage is configured. 

</snip>

On RHEVM, engine.log shows when it was created: 

./var/log/ovirt-engine/engine.log:2015-02-09 09:08:05,933 INFO  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(ajp-/127.0.0.1:8702-7) [92383fb] Correlation ID: 92383fb, Job ID: 
c3fcfc65-043f-4978-9eaf-baa2233ecc2b, Call Stack: null, Custom Event ID: -1, 
Message: Storage Domain eng_lun_usdfs was added by admin@internal

./var/log/ovirt-engine/engine.log:2015-02-09 09:09:15,359 INFO  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(ajp-/127.0.0.1:8702-5) [604172a8] Correlation ID: 6cadeb4a, Job ID: 
b6cf8f3a-8c45-4916-b477-0caac69539f3, Call Stack: null, Custom Event ID: -1, 
Message: Failed to attach Storage Domain eng_lun_usdfs to Data Center Bothell. 
(User: admin@internal)

./var/log/ovirt-engine/engine.log:2015-02-09 09:11:04,705 INFO  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(pool-4-thread-31) [39724b62] Correlation ID: 4772dd74, Job ID: 
d9ebfabe-88e4-4ed1-b60d-282c073fff55, Call Stack: null, Custom Event ID: -1, 
Message: Failed to attach Storage Domain eng_lun_usdfs to Data Center Bothell. 
(User: admin@internal)

./var/log/ovirt-engine/engine.log:2015-02-09 10:10:13,553 INFO  
[org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] 
(pool-4-thread-38) [232ff5a8] Correlation ID: 723e9458, Job ID: 
2ac90545-750f-47d3-b6d3-ad514b70b431, Call Stack: null, Custom Event ID: -1, 
Message: Failed to attach Storage Domain eng_lun_usdfs to Data Center Bothell. 
(User: admin@internal)

When I look at the logs,  I can see that all nodes get the GetVGInfoVDSCommand:

*2015-02-09 09:08:06,464 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-39) [72f2ad3d] START, GetVGInfoVDSCommand(HostName = 
encl1-usrhv206, HostId = dadbd25f-4e1a-4c80-a215-59cee9fee3f9, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 1fef84c1

*2015-02-09 09:08:06,464 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-45) [16efcfd3] START, GetVGInfoVDSCommand(HostName = 
encl2-usrhv207, HostId = 89c931b6-64fe-4002-91a3-7301576f374c, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 2b371828

*2015-02-09 09:08:06,464 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-37) [49ca1813] START, GetVGInfoVDSCommand(HostName = 
encl1-usrhv203, HostId = be2a7d7a-7359-49c6-a838-df57f7d25f51, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 6219176f

*2015-02-09 09:08:06,465 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-36) [58e67dea] START, GetVGInfoVDSCommand(HostName = 
encl1-usrhv205, HostId = 295eba80-1289-44df-adc2-d3ebee14f2b8, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 617604af

*2015-02-09 09:08:06,472 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-49) [5a353168] START, GetVGInfoVDSCommand(HostName = 
encl1-usrhv202, HostId = d0de758e-0e85-41d9-b5ea-d503a74162d2, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 7d993780

*2015-02-09 09:08:06,473 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-50) [5a3ed11c] START, GetVGInfoVDSCommand(HostName = 
encl1-usrhv204, HostId = ecdfc282-56ff-4e30-8e0d-994fa5e2c41b, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 69559c6e

*2015-02-09 09:08:06,474 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-40) [56d12751] START, GetVGInfoVDSCommand(HostName = 
encl2-usrhv209, HostId = 67ccf3f9-afb2-484d-bb57-6fe8650de8c3, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 1e8af27c

*2015-02-09 09:08:06,474 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-26) [454bb3f] START, GetVGInfoVDSCommand(HostName = 
encl2-usrhv210, HostId = b9cea4b5-d2c8-4b75-8257-f3c20bf7199c, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 574ec52a

*2015-02-09 09:08:06,475 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-43) [5212b38d] START, GetVGInfoVDSCommand(HostName = 
encl1-usrhv201, HostId = 593d7488-cfa6-43e5-bdd2-9291861926ef, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 399b8b6d

*2015-02-09 09:08:06,478 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-24) [3d6c635e] START, GetVGInfoVDSCommand(HostName = 
encl2-usrhv208, HostId = f86b698d-d97e-4133-9bc7-c8b86304c4d7, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 73912fc8

But since only ONE host can see the volume group, the others fail.
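
(As a sanity check, something like the Python sketch below could confirm which
hypervisors actually see the volume group. It assumes passwordless ssh as root
to each host, and uses the VG UUID and host names taken from the log excerpts
in this report; it is only an illustration, not something run as part of this
case.)

import subprocess

VG_UUID = "z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k"
HOSTS = [
    "encl1-usrhv201", "encl1-usrhv202", "encl1-usrhv203", "encl1-usrhv204",
    "encl1-usrhv205", "encl1-usrhv206", "encl2-usrhv207", "encl2-usrhv208",
    "encl2-usrhv209", "encl2-usrhv210",
]

for host in HOSTS:
    # 'vgs --noheadings -o vg_uuid' prints the UUID of every VG the host can see
    cmd = ["ssh", "root@" + host, "vgs", "--noheadings", "-o", "vg_uuid"]
    try:
        out = subprocess.check_output(cmd).decode()
    except subprocess.CalledProcessError:
        out = ""  # host unreachable or vgs failed; treat as "not visible"
    status = "sees" if VG_UUID in out else "does NOT see"
    print("%s %s VG %s" % (host, status, VG_UUID))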

If I look at one node, I can see this pattern in the failure: 

2015-02-09 09:08:06,412 INFO  
[org.ovirt.engine.core.bll.storage.SyncLunsInfoForBlockStorageDomainCommand] 
(pool-4-thread-39) [72f2ad3d] Running command: 
SyncLunsInfoForBlockStorageDomainCommand internal: true. Entities affected :  
ID: 52645ff5-ee38-4574-9787-5cf72d893f39 Type: Storage

2015-02-09 09:08:06,464 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-39) [72f2ad3d] START, GetVGInfoVDSCommand(HostName = 
encl1-usrhv206, HostId = dadbd25f-4e1a-4c80-a215-59cee9fee3f9, 
VGID=z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k), log id: 1fef84c1

2015-02-09 09:08:06,671 ERROR 
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-39) [72f2ad3d] Failed in GetVGInfoVDS method

2015-02-09 09:08:06,672 ERROR 
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-39) [72f2ad3d] Error code VolumeGroupDoesNotExist and error 
message VDSGenericException: VDSErrorException: Failed to GetVGInfoVDS, error = 
Volume Group does not exist: ('vg_uuid: 
z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k',)

2015-02-09 09:08:06,672 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-39) [72f2ad3d] Command 
org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand return value 

2015-02-09 09:08:06,672 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-39) [72f2ad3d] HostName = encl1-usrhv206

2015-02-09 09:08:06,672 ERROR 
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-39) [72f2ad3d] Command GetVGInfoVDS execution failed. Exception: 
VDSErrorException: VDSGenericException: VDSErrorException: Failed to 
GetVGInfoVDS, error = Volume Group does not exist: ('vg_uuid: 
z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k',)

2015-02-09 09:08:06,672 INFO  
[org.ovirt.engine.core.vdsbroker.vdsbroker.GetVGInfoVDSCommand] 
(pool-4-thread-39) [72f2ad3d] FINISH, GetVGInfoVDSCommand, log id: 1fef84c1

2015-02-09 09:08:06,672 ERROR 
[org.ovirt.engine.core.bll.storage.SyncLunsInfoForBlockStorageDomainCommand] 
(pool-4-thread-39) [72f2ad3d] Command 
org.ovirt.engine.core.bll.storage.SyncLunsInfoForBlockStorageDomainCommand 
throw 
Vdc Bll exception. With error message VdcBLLException: 
org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: 
VDSGenericException: VDSErrorException: Failed to GetVGInfoVDS, error = Volume 
Group does not exist: ('vg_uuid: z1jxsy-dP5X-xfXy-0FF4-fTPh-Gzw4-2clq9k',) 
(Failed with error VolumeGroupDoesNotExist and code 506)
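
(For reference, this pattern can also be pulled out of engine.log
programmatically. The short Python sketch below is based only on the log
format shown in this report: it maps each GetVGInfoVDSCommand correlation ID
to its host and lists the hosts that hit VolumeGroupDoesNotExist. Treat it as
an illustration, not a supported tool.)

import re

LOG = "/var/log/ovirt-engine/engine.log"

# e.g. "[72f2ad3d] START, GetVGInfoVDSCommand(HostName = encl1-usrhv206, ..."
start_re = re.compile(r"\[(\w+)\] START, GetVGInfoVDSCommand\(HostName = ([\w.-]+),")
# e.g. "[72f2ad3d] Error code VolumeGroupDoesNotExist and error message ..."
fail_re = re.compile(r"\[(\w+)\] Error code VolumeGroupDoesNotExist")

host_by_corr = {}   # correlation ID -> host name
failed_hosts = set()

with open(LOG) as f:
    for line in f:
        m = start_re.search(line)
        if m:
            host_by_corr[m.group(1)] = m.group(2)
            continue
        m = fail_re.search(line)
        if m:
            failed_hosts.add(host_by_corr.get(m.group(1), "<unknown>"))

print("Hosts reporting VolumeGroupDoesNotExist:")
for host in sorted(failed_hosts):
    print("  " + host)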

Q1:  Is this expected behavior when the new lun for the storage domain is
     only exposed to one host in the cluster? 

Q2:  Where is the documentation that states that all nodes must have visibility
     to the lun if this is the case? 

Q3:  If I am off base, what is causing the failure? 

How reproducible: XXXX
Steps to reproduce: XXXX
Actual results: XXXX
Expected results: XXXX
Summary of actions taken to resolve/troubleshoot issue: XXXX

Comment 2 Allon Mureinik 2015-02-14 07:13:48 UTC
(In reply to Robert McSwain from comment #0)
> 
> Q1:  Is this expected behavior when the new lun for the storage domain is
>      only exposed to one host in the cluster? 
Yes.
Moreover, IIRC, in 3.4 RHEVM even checks this before attempting to create the domain.
Daniel, can you confirm/refute?

> 
> Q2:  Where is the documentation that states that all nodes must have
> visibility
>      to the lun if this is the case? 
I could not find such a note, but there definitely should be one.
Andrew, can you weigh in here please?

Comment 3 Andrew Dahms 2015-02-14 12:16:55 UTC
Hi Allon,

Thank you for the needinfo request.

We have a range of information about storage domains in the Administration Guide and Technical Guide, but I don't think there is a section where we call this out directly.

I would be happy to raise a documentation bug and work on including this information if you like.

Kind regards,

Andrew

Comment 4 Daniel Erez 2015-02-15 08:06:03 UTC
(In reply to Allon Mureinik from comment #2)
> (In reply to Robert McSwain from comment #0)
> > 
> > Q1:  Is this expected behavior when the new lun for the storage domain is
> >      only exposed to one host in the cluster? 
> Yes.
> Moreover, IIRC, in 3.4 RHEVM even checks this before attempting to create
> the domain.
> Daniel, can you confirm/refute?

Connectivity is verified only upon attaching the domain to a pool. I'm not sure about the documentation, but the limitation is explained in [1] (btw, this limitation should probably be relaxed to the cluster level after deprecating the pool in 3.6). As a quick workaround, you can try putting the problematic host into maintenance - if the issue reoccurs, please attach the relevant vdsm and engine logs (the tarball in the ticket seems to be broken).

[1] https://access.redhat.com/solutions/61912
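
(For anyone scripting the maintenance workaround mentioned above: in the RHEV
3.x REST API, moving a host to maintenance is exposed as the "deactivate"
action on the host resource. The sketch below is a minimal illustration using
the Python requests library; the engine URL, credentials, and certificate
handling are placeholders, and the host id shown is the one logged for
encl1-usrhv201 in this report.)

import requests

ENGINE = "https://rhevm.example.com"              # placeholder RHEV-M FQDN
HOST_ID = "593d7488-cfa6-43e5-bdd2-9291861926ef"  # encl1-usrhv201 (from the logs)

# POST /api/hosts/<id>/deactivate asks the engine to move the host to maintenance
resp = requests.post(
    ENGINE + "/api/hosts/" + HOST_ID + "/deactivate",
    data="<action/>",
    headers={"Content-Type": "application/xml"},
    auth=("admin@internal", "password"),          # placeholder credentials
    verify=False,                                 # or point verify= at the engine CA
)
resp.raise_for_status()
print("deactivate request returned HTTP %s" % resp.status_code)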

> 
> > 
> > Q2:  Where is the documentation that states that all nodes must have
> > visibility
> >      to the lun if this is the case? 
> I could not find such a note, but there definitely should be one.
> Andrew, can you weigh in here please?

Comment 5 Allon Mureinik 2015-02-17 15:16:25 UTC
A storage device used for a domain should be accessible from all hosts in the DC. This is by design.

I'll follow up on the documentation for 3.5.z to make this issue clearer.