Description of problem:
Stopping glusterd on the node which is used as the first host while deploying hosted engine moves the hosted_storage domain to inactive state. I see the following error in the vdsm logs:

Thread-672139::INFO::2016-03-21 15:08:27,632::logUtils::48::dispatcher::(wrapper) Run and protect: connectStorageServer(domType=7, spUUID='00000000-0000-0000-0000-000000000000', conList=[{'id': '7926ce79-7846-466a-aa13-5296272b1d24', 'vfs_type': 'glusterfs', 'connection': 'rhsqa1.lab.eng.blr.redhat.com:/engine', 'user': 'kvm'}], options=None)
Thread-672139::ERROR::2016-03-21 15:08:27,733::hsm::2473::Storage.HSM::(connectStorageServer) Could not connect to storageServer
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/hsm.py", line 2470, in connectStorageServer
    conObj.connect()
  File "/usr/share/vdsm/storage/storageServer.py", line 223, in connect
    self.validate()
  File "/usr/share/vdsm/storage/storageServer.py", line 348, in validate
    replicaCount = self.volinfo['replicaCount']
  File "/usr/share/vdsm/storage/storageServer.py", line 335, in volinfo
    self._volinfo = self._get_gluster_volinfo()
  File "/usr/share/vdsm/storage/storageServer.py", line 372, in _get_gluster_volinfo
    self._volfileserver)
  File "/usr/share/vdsm/supervdsm.py", line 50, in __call__
    return callMethod()
  File "/usr/share/vdsm/supervdsm.py", line 48, in <lambda>
    **kwargs)
  File "<string>", line 2, in glusterVolumeInfo
  File "/usr/lib64/python2.7/multiprocessing/managers.py", line 773, in _callmethod
    raise convert_to_error(kind, result)
GlusterCmdExecFailedException: Command execution failed
error: Connection failed. Please check if gluster daemon is operational.
return code: 1

Version-Release number of selected component (if applicable):
glusterfs-3.7.5-18.36.git0b0925d.el7rhgs.x86_64
vdsm-4.17.23-0.1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
1. Install the HC setup by following the installation doc.
2. Stop glusterd on the first node (a command sketch follows under Additional info).
3.

Actual results:
The hosted_storage domain goes to inactive state.

Expected results:
The hosted_storage domain should remain in active state.

Additional info:
http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1319631/
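For completeness, the glusterd stop in step 2 amounts to something like the following on the first host (assuming an el7/systemd node, as the package versions above suggest):

  # stop the gluster management daemon on the host whose address was used
  # as the volfile server while deploying hosted engine
  systemctl stop glusterd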
I think this is related to Bug 1303977. Is there a periodic storage domain query that tries to remount? Will bringing down glusterd cause the storage domain to be reactivated?
Kasturi, can you check if you still see this error? The related bug 1303977 is in verified state.
I still see that the hosted_storage domain goes to inactive state when glusterd is brought down on the first host. sosreports from all the hosts can be found at the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1319657/
Moving this to Hosted Engine, as this problem is the same as the one mentioned in comment 9 on bug 1298693.
I have reassigned this to you. Could you take a look?
*** This bug has been marked as a duplicate of bug 1298693 ***
I'm re-opening this, as I see the error when HE storage has been accessed as localhost:/engine. Now every time a node in the cluster goes down, HE is restarted. This is not related to the SPOF RFE. Logs are in bug 1298693#c8.
(In reply to Sahina Bose from comment #8)
> I'm re-opening this, as I see the error when HE storage has been accessed
> as localhost:/engine.

In this case the issue is just this: the first host is not able to talk to its local gluster instance, so it reports the storage as down and the engine flags the domain as partially failed.
You should not use localhost to set up the storage.
(In reply to Simone Tiraboschi from comment #9)
> (In reply to Sahina Bose from comment #8)
> > I'm re-opening this, as I see the error when HE storage has been accessed
> > as localhost:/engine.
>
> In this case the issue is just this: the first host is not able to talk to
> its local gluster instance, so it reports the storage as down and the
> engine flags the domain as partially failed.

When a node in the cluster is down, the other two nodes still have glusterd running - why is the host unable to talk to its local gluster instance? Am I missing something evident here?

And why is it not recommended to use localhost to access the storage in an HC setup?
(In reply to Sahina Bose from comment #11)
> (In reply to Simone Tiraboschi from comment #9)
> > (In reply to Sahina Bose from comment #8)
> > > I'm re-opening this, as I see the error when HE storage has been
> > > accessed as localhost:/engine.
> >
> > In this case the issue is just this: the first host is not able to talk
> > to its local gluster instance, so it reports the storage as down and the
> > engine flags the domain as partially failed.
>
> When a node in the cluster is down, the other two nodes still have glusterd
> running - why is the host unable to talk to its local gluster instance? Am
> I missing something evident here?
>
> And why is it not recommended to use localhost to access the storage in an
> HC setup?

Because localhost means the storage is mounted only from the local host; providing an FQDN (together with backup volfile servers in the mount options) lets the mount fail over to the other hosts as well.
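To illustrate the difference, here is a minimal sketch of a gluster mount that can fail over (hostnames and mount point are hypothetical; backup-volfile-servers is the standard glusterfs fuse mount option also used in the tests below):

  # mount the engine volume through host1, allowing the volfile to be
  # fetched from host2/host3 if host1 is unreachable
  mount -t glusterfs \
    -o backup-volfile-servers=host2.example.com:host3.example.com \
    host1.example.com:/engine /mnt/engine

  # with localhost:/engine there is no such fallback: if the local glusterd
  # is down, the mount - and with it the hosted_storage domain - fails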
This needs to be retested with the additional mount options as per bug 1298693. Kasturi, can you check this again?
Bug 1298693 got merged for 3.6.7 RC1; can you please retest this using a real host address and passing something like OVEHOSTED_STORAGE/mntOptions=str:backupvolfile-server=gluster.xyz.com,fetch-attempts=2,log-level=WARNING,log-file=/var/log/engine_domain.log to hosted-engine-setup, to avoid having a SPOF?
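For reference, a sketch of how that option string could be passed to hosted-engine-setup through an answer file (the file path is hypothetical; the option string is the one quoted above):

  # /root/he-answers.conf (hypothetical path)
  [environment:default]
  OVEHOSTED_STORAGE/mntOptions=str:backupvolfile-server=gluster.xyz.com,fetch-attempts=2,log-level=WARNING,log-file=/var/log/engine_domain.log

  # then deploy with:
  #   hosted-engine --deploy --config-append=/root/he-answers.conf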
Simone, I tested with 3.6.7. I have 3 hosts: rhsdev9, rhsdev13, rhsdev14. The engine volume is mounted using rhsdev9, with mntOptions=str:backup-volfile-servers=rhsdev13:rhsdev14. HE was running on rhsdev13.

First test - bring glusterd down on rhsdev9 - PASS. HE continues to be available.
Second test - power off rhsdev14 - HE is restarted on rhsdev9. No errors in the agent/broker logs, however.
Third test - power off rhsdev9 - HE is restarted, since it was running on rhsdev9.

Lowering sev and prio, as HE is accessible again after some time.
Additional note - hosted_storage domain is online for all three tests
After reducing the network.ping-timeout value on the gluster volume, I did not encounter the issue. Closing this.
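For the record, that tuning is the standard gluster volume option; a sketch, assuming the volume is named "engine" and using 10 seconds purely as an example value (the exact value used is not recorded here):

  # lower the ping timeout (default is 42 seconds) so a dead server is
  # detected before the client mount gives up on the storage domain
  gluster volume set engine network.ping-timeout 10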