1319657 – hosted_engine is restarted when any one node in cluster is down.

Summary: hosted_engine is restarted when any one node in cluster is down.

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	vdsm
Classification:	oVirt
Component:	Core
Sub Component:
Version:	4.17.23
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	ovirt-3.6.8
Target Release:	---
Assignee:	Simone Tiraboschi
QA Contact:	SATHEESARAN
Docs Contact:
URL:
Whiteboard:
Depends On:	1298693
Blocks:	Gluster-HC-1

Reported:	2016-03-21 09:42 UTC by RamaKasturi
Modified:	2016-07-28 11:41 UTC
CC List:	9 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2016-07-28 11:41:51 UTC
oVirt Team:	Gluster
Embargoed:
Dependent Products:
Flags:	sabose: ovirt-4.1? rule-engine: planning_ack? rule-engine: devel_ack? rule-engine: testing_ack?

Description RamaKasturi 2016-03-21 09:42:18 UTC

Description of problem:
Stopping glusterd on the node that was used as the first host when deploying the hosted engine moves the hosted_storage domain to the inactive state. The following error appears in the vdsm logs:

Thread-672139::INFO::2016-03-21 15:08:27,632::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=7, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'id': '7926ce79-7846-466a-aa13-5296272b1d24', 'vfs_type': 'glusterfs', 'connection': 'rhsqa1.lab.eng.blr.redhat.com:/engine', 'user': 'kvm'}], options=None)
Thread-672139::ERROR::2016-03-21 15:08:27,733::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 223, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 348, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 335, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 372, in _get_gluster_volinfo
    self._volfileserver)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeInfo
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
error: Connection failed. Please check if gluster daemon is operational.
return code: 1
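
For triage, records like the ones above can be split into fields so that ERROR entries are easy to filter out of a large vdsm.log. The following is only a sketch: the field layout is inferred from the records quoted in this report, not taken from any vdsm specification, and the names `RECORD` and `parse_record` are hypothetical.

```python
import re

# Field layout inferred from the log records quoted in this report
# (an assumption, not an official vdsm log format definition).
RECORD = re.compile(
    r"^(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]+)\) (?P<message>.*)$"
)


def parse_record(line):
    """Return the fields of one log record as a dict, or None if the
    line is not a record header (for example, a traceback line)."""
    match = RECORD.match(line)
    return match.groupdict() if match else None


if __name__ == "__main__":
    sample = (
        "Thread-672139::ERROR::2016-03-21 15:08:27,733::hsm::2473::"
        "Storage.HSM::(connectStorageServer) Could not connect to storageServer"
    )
    record = parse_record(sample)
    print(record["level"], record["func"], record["message"])
```

Lines that belong to a traceback do not match the pattern, so a caller would need to attach them to the preceding record if the full error context is wanted.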


Version-Release number of selected component (if applicable):
glusterfs-3.7.5-18.36.git0b0925d.el7rhgs.x86_64
vdsm-4.17.23-0.1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install an HC setup by following the installation doc.
2. Stop glusterd on the first node.

Actual results:
The hosted_storage domain goes to the inactive state.

Expected results:
The hosted_storage domain should remain in the active state.

Additional info:

Comment 1 RamaKasturi 2016-03-21 09:43:10 UTC

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1319631/

Comment 2 Sahina Bose 2016-03-22 11:22:10 UTC

I think this is related to Bug 1303977.

Is there a periodic storage domain query that tries to remount? Will bringing down glusterd cause the storage domain to be reactivated?

Comment 3 Sahina Bose 2016-04-11 04:37:15 UTC

Kasturi, can you check if you still see this error? The related bug 1303977 is in verified state.

Comment 4 RamaKasturi 2016-04-13 07:51:14 UTC

I still see the hosted_storage domain go to the inactive state when glusterd is brought down on the first host.

Sosreports from all the hosts can be found at the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1319657/

Comment 5 Sahina Bose 2016-04-20 11:01:44 UTC

Moving this to Hosted engine, as this problem is the same as the one described in comment 9 on bug 1298693.

Comment 6 Sahina Bose 2016-04-20 11:04:54 UTC

I have reassigned this to you. Could you take a look?

Comment 7 Yaniv Lavi 2016-04-27 07:31:47 UTC


*** This bug has been marked as a duplicate of bug 1298693 ***

Comment 8 Sahina Bose 2016-04-27 12:20:07 UTC

I'm re-opening this, as I see the error when HE storage has been accessed as localhost:/engine

Now every time a node in the cluster goes down, HE is restarted. This is not related to the SPOF RFE.
Logs are in bug 1298693#c8.

Comment 9 Simone Tiraboschi 2016-05-05 12:13:49 UTC

(In reply to Sahina Bose from comment #8)
> I'm re-opening this, as I see the error when HE storage has been accessed as
> localhost:/engine

In this case the issue is just here:
the first host that is not able to talk with the local gluster instance reports it as down and the engine flags it as partially failed.

Comment 10 Yaniv Lavi 2016-05-05 13:34:20 UTC

You should not use localhost to set up the storage.

Comment 11 Sahina Bose 2016-05-09 15:25:10 UTC

(In reply to Simone Tiraboschi from comment #9)
> (In reply to Sahina Bose from comment #8)
> > I'm re-opening this, as I see the error when HE storage has been accessed as
> > localhost:/engine
> 
> In this case the issue is just here:
> the first host that is not able to talk with the local gluster instance
> reports it as down and the engine flags it as partially failed.

When a node in the cluster is down, the other 2 nodes still have the glusterd running - Why is the host unable to talk with local gluster instance? Am I missing something evident here?

And why is it not recommended to use localhost to access storage in an HC setup?

Comment 12 Yaniv Lavi 2016-05-09 15:32:44 UTC

(In reply to Sahina Bose from comment #11)
> (In reply to Simone Tiraboschi from comment #9)
> > (In reply to Sahina Bose from comment #8)
> > > I'm re-opening this, as I see the error when HE storage has been accessed as
> > > localhost:/engine
> > 
> > In this case the issue is just here:
> > the first host that is not able to talk with the local gluster instance
> > reports it as down and the engine flags it as partially failed.
> 
> When a node in the cluster is down, the other 2 nodes still have the
> glusterd running - Why is the host unable to talk with local gluster
> instance? Am I missing something evident here?
> 
> And why is it not recommended to use localhost to access storage in an HC
> setup?

Because localhost means that the storage is local, whereas providing an FQDN lets the mount define failover to other hosts as well.

Comment 13 Sahina Bose 2016-05-23 11:53:15 UTC

This needs to be retested with additional mount options as per Bug 1298693.
Kasturi can you check this again?

Comment 14 Simone Tiraboschi 2016-05-26 12:03:59 UTC

The fix for bug 1298693 was merged for 3.6.7 RC1. Can you please retest this using a real host address and passing something like
 OVEHOSTED_STORAGE/mntOptions=str:backupvolfile-server=gluster.xyz.com,fetch-attempts=2,log-level=WARNING,log-file=/var/log/engine_domain.log
to hosted-engine-setup to avoid having a SPOF?
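
As a rough illustration of how that entry is put together, the sketch below joins a list of mount options into the `OVEHOSTED_STORAGE/mntOptions=str:...` form quoted above. The helper name `build_mnt_options` is hypothetical, and the option values are simply the example ones from this comment; how the resulting line is handed to hosted-engine-setup is outside the scope of the sketch.

```python
def build_mnt_options(options):
    """Join (key, value) pairs into the mntOptions entry quoted above.

    `options` is an ordered list of pairs; a comma inside a key or value
    is rejected because the comma is the separator between options.
    """
    for key, value in options:
        if "," in key or "," in str(value):
            raise ValueError(f"comma not allowed in option {key!r}")
    joined = ",".join(f"{key}={value}" for key, value in options)
    return "OVEHOSTED_STORAGE/mntOptions=str:" + joined


if __name__ == "__main__":
    # Example values copied from the comment; gluster.xyz.com is a placeholder.
    print(build_mnt_options([
        ("backupvolfile-server", "gluster.xyz.com"),
        ("fetch-attempts", 2),
        ("log-level", "WARNING"),
        ("log-file", "/var/log/engine_domain.log"),
    ]))
```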

Comment 15 Sahina Bose 2016-06-02 07:24:02 UTC

Simone, I tested with 3.6.7.
I have 3 hosts rhsdev9, rhsdev13, rhsdev14.
Engine volume mounted using rhsdev9 - with mntOptions=str:backup-volfile-servers=rhsdev13:rhsdev14.
HE was running on rhsdev13

First test - bring glusterd down on rhsdev9 - PASS. HE continues to be available.
Second test - power off rhsdev14 - HE is restarted on rhsdev9. However, there are no errors in the agent/broker logs.
Third test - power off rhsdev9 - HE is restarted, since it was running on rhsdev9.

Lowering severity and priority, as HE is accessible again after some time.
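
To make the mount setup in these tests easier to reason about, here is a toy model of which volfile servers are candidates for the mount. It is only a sketch: it parses the colon-separated `backup-volfile-servers` value and picks the first server that is not down. It does not model gluster quorum, client-side timeouts, or the HA agent, so it says nothing about why the engine VM was restarted in the second test. The function names are hypothetical.

```python
def volfile_servers(primary, mnt_options):
    """Return the primary server followed by any backup volfile servers
    listed in a comma-separated mount option string."""
    servers = [primary]
    for option in mnt_options.split(","):
        key, _, value = option.partition("=")
        if key.strip() == "backup-volfile-servers" and value:
            servers.extend(s for s in value.split(":") if s)
    return servers


def first_reachable(servers, down):
    """Return the first server that is not in the `down` set, or None."""
    for server in servers:
        if server not in down:
            return server
    return None


if __name__ == "__main__":
    candidates = volfile_servers(
        "rhsdev9", "backup-volfile-servers=rhsdev13:rhsdev14"
    )
    print(candidates)
    # With glusterd stopped on rhsdev9, a backup server is still available.
    print(first_reachable(candidates, down={"rhsdev9"}))
```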

Comment 16 Sahina Bose 2016-06-02 07:27:03 UTC

Additional note - hosted_storage domain is online for all three tests

Comment 17 Sahina Bose 2016-07-28 11:41:51 UTC

After reducing the network.ping-timeout value on the gluster volume, I did not encounter the issue. Closing this.
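
For reference, the option is changed with `gluster volume set`. The sketch below only builds the argument list for that command; it does not run it. The volume name `engine` is taken from the mount path in this report, while the timeout of 10 seconds is a purely hypothetical value, since the value actually used is not recorded here. The helper name is also hypothetical.

```python
def ping_timeout_command(volume, seconds):
    """Build the argv for setting network.ping-timeout on a gluster volume."""
    if not isinstance(seconds, int) or seconds <= 0:
        raise ValueError("seconds must be a positive integer")
    return [
        "gluster", "volume", "set",
        volume, "network.ping-timeout", str(seconds),
    ]


if __name__ == "__main__":
    # 10 is a placeholder; the value used to close this bug is not stated.
    print(" ".join(ping_timeout_command("engine", 10)))
```

The returned list could be passed to `subprocess.run` on a node where the gluster CLI is installed.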
