Bug 966342

Summary: [RHEVM-RHS] One of the RHS Nodes in the gluster cluster regularly goes non-operational as soon as it is up
Product: Red Hat Enterprise Virtualization Manager
Reporter: SATHEESARAN <sasundar>
Component: ovirt-engine-webadmin-portal
Assignee: Sahina Bose <sabose>
Status: CLOSED NEXTRELEASE
QA Contact: SATHEESARAN <sasundar>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.1.4
CC: acathrow, dyasny, ecohen, grajaiya, hchiramm, iheim, Rhev-m-bugs, rhs-bugs, sabose, sasundar, vbellur, ykaul
Target Milestone: ---
Target Release: 3.3.0
Hardware: Unspecified
OS: Linux
Whiteboard: gluster
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment: virt rhev integration
Last Closed: 2013-06-24 06:39:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Gluster
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Attachments:
Screenshot of RHEVM, showing one non-operational RHS Nodes (flags: none)

Description SATHEESARAN 2013-05-23 06:05:20 UTC
Created attachment 752001 [details]
Screenshot of RHEVM, showing one non-operational RHS Nodes

Description of problem:

One of the RHS Nodes is shown as non-operational immediately after it comes up. This happens at regular intervals.

The Events tab for that RHS Node says, "Host rhs-node-3 cannot access one of the Storage Domains attached to the Data Center data-center-1. Setting Host state to Non-Operational."

Version-Release number of selected component (if applicable):
RHS   : glusterfs-3.3.0.8rhs-1.el6rhs [ RHS2.0 Update 5 ]
RHEVM : RHEVM 3.1.4 [3.1.0-53.el6e]/[si28.1]
VDSM  
RHS Nodes in gluster cluster: 
RHEL-H Node in virt cluster :

How reproducible:


Steps to Reproduce:
1. Create a POSIX FS Data center of compatibility 3.1
2. Create a gluster cluster in the newly created data center.
3. Add RHS Nodes to this gluster cluster. The RHS Nodes used for this gluster cluster contain the RHS 2.0 Update 5 rpms for glusterfs.
4. Create a new virt cluster and add a RHEL-H [RHEL 6.4] host to it.
5. Create a 6x2 Distributed Replicate volume using the RHS Nodes (a layout sketch follows these steps).
6. Create a Data Domain [storage domain] and attach it to the RHEL-H host.
7. Create an NFS export on the RHEVM machine itself and use it for the ISO Domain.
8. Attach the ISO Domain to the Data center.
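
For reference, the "6x2 Distributed Replicate" volume in step 5 means 12 bricks arranged as 6 replica pairs. The following minimal Java sketch (the host names and brick paths are invented for illustration, not taken from this setup) just prints one such layout:

// A minimal sketch of what "6x2 distributed-replicate" means in step 5:
// 12 bricks grouped into 6 replica pairs. Host names and brick paths are
// invented for illustration.
public class BrickLayout {
    public static void main(String[] args) {
        String[] hosts = { "rhs-node-1", "rhs-node-2", "rhs-node-3", "rhs-node-4" };
        int distributeCount = 6; // the "6": number of replica sets files are distributed over
        int replicaCount = 2;    // the "2": copies kept of each file within a set

        int brick = 0;
        for (int set = 0; set < distributeCount; set++) {
            StringBuilder line = new StringBuilder("replica set " + (set + 1) + ":");
            for (int copy = 0; copy < replicaCount; copy++, brick++) {
                // Round-robin bricks over the nodes so the two copies of a set
                // never land on the same host (true here because there are 4 hosts).
                line.append(" ").append(hosts[brick % hosts.length])
                    .append(":/bricks/brick").append(brick);
            }
            System.out.println(line);
        }
    }
}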

Actual results:
One of the RHS Nodes is shown as Non-Operational. After some time it comes up, and then immediately turns Non-Operational again. This happens regularly.

Expected results:
The data center should be up, and so should the Nodes in the cluster.

Additional info:

Comment 2 SATHEESARAN 2013-05-23 06:24:31 UTC
Additional Info:

Prior to hitting this issue, I saw the Data domain throwing errors while attaching the ISO domain. I also observed the Data center going down and coming up while attaching the ISO domain.

But this stopped happening later.

Comment 4 Sahina Bose 2013-05-23 10:54:31 UTC
Does it come back UP?

Comment 5 SATHEESARAN 2013-05-23 11:51:35 UTC
Sahina, 

The behavior is fluctuating.

I can see the following in the Events for the Node,

<snip>	
2013-May-23, 17:20
	
Host rhs-node-3 cannot access one of the Storage Domains attached to the Data Center data-center-1. Setting Host state to Non-Operational.
	
571ab2f0
	
2013-May-23, 17:20
	
Detected new Host rhs-node-3. Host state was set to Up.
</snip>

every 5 minutes.

The host comes up and immediately goes down.

Comment 6 SATHEESARAN 2013-05-23 12:45:18 UTC
(In reply to SATHEESARAN from comment #5)
> The behavior is fluctuating [...] The host comes up and immediately goes down.

I am still seeing this issue now.

Comment 7 SATHEESARAN 2013-05-23 12:45:46 UTC
(In reply to SATHEESARAN from comment #2)
> Prior to hitting this issue, I saw the Data domain throwing errors while attaching the ISO domain [...]

While attaching the ISO Domain, I see an error message. This blocks further test runs.

This seems like a bug; I have to explore more to understand the problem, but I could not find a trace of it now.

2013-05-22 23:32:49,357 ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper] (pool-3-thread-43) [3ffa59c6] The connection with details 10.70.37.72:/exports/iso failed because of error code 100 and error message is: general exception

2013-05-22 23:32:49,357 ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper] (pool-3-thread-46) [626e53ca] The connection with details 10.70.37.72:/exports/iso failed because of error code 100 and error message is: general exception

2013-05-22 23:32:49,358 ERROR [org.ovirt.engine.core.bll.storage.NFSStorageHelper] (pool-3-thread-47) [2777b6d] The connection with details 10.70.37.72:/exports/iso failed because of error code 100 and error message is: general exception

2013-05-22 23:32:49,359 ERROR [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (pool-3-thread-46) [626e53ca] Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand.

2013-05-22 23:32:49,360 ERROR [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (pool-3-thread-43) [3ffa59c6] Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand.

2013-05-22 23:32:49,360 ERROR [org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand] (pool-3-thread-47) [2777b6d] Transaction rolled-back for command: org.ovirt.engine.core.bll.storage.ConnectStorageToVdsCommand.
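
The rollback messages above follow the usual connect-then-undo pattern. As a hypothetical Java sketch only (class and method names are made up and are not the real ovirt-engine code), the flow looks roughly like this:

import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the connect-then-roll-back pattern visible in the
// log lines above: record the NFS connection, ask the host to mount it, and
// undo the record when the mount fails.
public class ConnectStorageSketch {

    static final int ERROR_GENERAL_EXCEPTION = 100; // the "error code 100" in engine.log

    private final List<String> connectedPaths = new ArrayList<>();

    boolean connect(String nfsPath) {
        connectedPaths.add(nfsPath);        // tentatively record the connection
        int rc = mountNfs(nfsPath);         // ask the host to mount the export
        if (rc != 0) {
            connectedPaths.remove(nfsPath); // the "Transaction rolled-back" step: undo the record
            System.out.println("Connection " + nfsPath + " failed, error code " + rc);
            return false;
        }
        return true;
    }

    private int mountNfs(String nfsPath) {
        // Stand-in for the real mount call; here it always fails the way
        // 10.70.37.72:/exports/iso did above.
        return ERROR_GENERAL_EXCEPTION;
    }

    public static void main(String[] args) {
        new ConnectStorageSketch().connect("10.70.37.72:/exports/iso");
    }
}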

Comment 8 SATHEESARAN 2013-05-23 12:48:24 UTC
(In reply to SATHEESARAN from comment #7)

This error was the one I was observing earlier while attaching the ISO Domain to the Data center. It can also be seen in engine.log.

But later this error vanished, followed by the RHS Nodes going up and down at regular intervals.

Comment 9 Sahina Bose 2013-06-14 08:22:30 UTC
rhs-node-3 was set as SPM:
2013-05-22 18:21:31,897 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.ConnectStorageServerVDSCommand] (pool-3-thread-41) [6cc90984] START, ConnectStorageServerVDSCommand(HostName = rhs-node-3, HostId = 825c6db4-c299-11e2-a8a9-525400e469d5, storagePoolId = 00000000-0000-0000-0000-000000000000, storageType = NFS, connectionList = [{ id: 18defe0a-3d29-489c-9760-4eaca5813220, connection: 10.70.37.72:/exports/iso, iqn: null, vfsType: null, mountOptions: null, nfsVersion: null, nfsRetrans: null, nfsTimeo: null };]), log id: 40f6fab8

and there was an issue with a storage domain, so the host was set to Non-Operational:
vds rhs-node-3 reported domain 7fb41558-ed54-4781-bd55-8e26cf22c362:data-domain as in problem, moving the vds to status NonOperational

However, the Gluster sync job was setting the Host back to Up, as it does not check the storage domain status.

Gluster cluster hosts should not have SPM status. This is fixed in 3.2.
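
As a rough illustration only (a hypothetical Java sketch with invented names, not the actual GlusterSyncJob code), the flip-flop described above, and the kind of check whose absence causes it, look roughly like this:

import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the interaction described above: the periodic gluster
// sync keeps marking the host Up, storage-domain monitoring keeps marking it
// Non-Operational, and the flapping stops only if the sync also takes
// storage/SPM eligibility into account.
public class GlusterSyncSketch {

    enum HostStatus { UP, NON_OPERATIONAL }

    static class Host {
        final String name;
        final boolean glusterOnlyCluster; // belongs to a gluster cluster, not a virt cluster
        final boolean storageDomainsOk;   // can reach all storage domains of the data center
        HostStatus status = HostStatus.NON_OPERATIONAL;

        Host(String name, boolean glusterOnlyCluster, boolean storageDomainsOk) {
            this.name = name;
            this.glusterOnlyCluster = glusterOnlyCluster;
            this.storageDomainsOk = storageDomainsOk;
        }
    }

    // Runs every sync interval (roughly the "every 5 minutes" flapping in comment 5).
    static void syncHosts(List<Host> hosts) {
        for (Host h : hosts) {
            if (h.glusterOnlyCluster || h.storageDomainsOk) {
                // Gluster-only hosts are not storage-domain monitors (and must never
                // be SPM), so gluster health alone is enough to keep them Up.
                h.status = HostStatus.UP;
            } else {
                // Without a check like this, the sync job flips the host to Up and
                // storage monitoring flips it straight back to Non-Operational.
                h.status = HostStatus.NON_OPERATIONAL;
            }
            System.out.println(h.name + " -> " + h.status);
        }
    }

    public static void main(String[] args) {
        List<Host> hosts = Arrays.asList(
                new Host("rhs-node-3", true, false),  // the flapping RHS node in this bug
                new Host("rhel-h-1", false, true));   // a virt-cluster host with working storage
        syncHosts(hosts);
    }
}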