Bug 1319657
| Summary: | hosted_engine is restarted when any one node in cluster is down. | | |
|---|---|---|---|
| Product: | [oVirt] vdsm | Reporter: | RamaKasturi <knarra> |
| Component: | Core | Assignee: | Simone Tiraboschi <stirabos> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | SATHEESARAN <sasundar> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 4.17.23 | CC: | ahino, bugs, cshao, knarra, leiwang, nsoffer, sabose, stirabos, ycui |
| Target Milestone: | ovirt-3.6.8 | Keywords: | Reopened |
| Target Release: | --- | Flags: | sabose: ovirt-4.1? rule-engine: planning_ack? rule-engine: devel_ack? rule-engine: testing_ack? |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-07-28 11:41:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Gluster | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1298693 | | |
| Bug Blocks: | 1258386 | | |
Description (RamaKasturi, 2016-03-21 09:42:18 UTC)
I think this is related to Bug 1303977. Is there a periodic storage domain query that tries to remount? Will bringing down glusterd cause the storage domain to be reactivated?

Kasturi, can you check if you still see this error? The related bug 1303977 is in verified state.

I still see that the hosted_storage goes to inactive state when glusterd is brought down on the first host. sosreports from all the hosts can be found at the link below.

http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/HC/1319657/

Moving this to Hosted engine, as this problem is the same as the one mentioned in Comment 9 on bug 1298693.

I have reassigned this to you. Could you take a look?

*** This bug has been marked as a duplicate of bug 1298693 ***

I'm re-opening this, as I see the error when HE storage has been accessed as localhost:/engine. Now, every time a node in the cluster goes down, HE is restarted. This is not related to the SPOF RFE. Logs are in bug 1298693#c8.

(In reply to Sahina Bose from comment #8)
> I'm re-opening this, as I see the error when HE storage has been accessed as
> localhost:/engine

In this case the issue is just here: the first host, which is not able to talk with the local gluster instance, reports the storage as down, and the engine flags it as partially failed. You should not use localhost to set up the storage.

(In reply to Simone Tiraboschi from comment #9)
> In this case the issue is just here:
> the first host that is not able to talk with the local gluster instance
> reports it as down and the engine flags it as partially failed.

When a node in the cluster is down, the other 2 nodes still have glusterd running. Why is the host unable to talk with the local gluster instance? Am I missing something evident here?

And why is it not recommended to use localhost to access storage in an HC setup?
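As background on the localhost question above, here is a minimal sketch of what a non-localhost gluster mount looks like. The host names (gluster1/gluster2/gluster3), the volume name, and the mount point are illustrative, not taken from this setup; `backup-volfile-servers` is the glusterfs native client mount option that supplies fallback volfile servers:

```
# Illustrative /etc/fstab entry (hypothetical hosts gluster1/gluster2/gluster3):
# the client fetches the volfile from gluster1, but can fall back to
# gluster2 or gluster3 if gluster1 is unreachable at mount time.
gluster1:/engine  /mnt/engine  glusterfs  backup-volfile-servers=gluster2:gluster3,_netdev  0 0
```

With `localhost:/engine` there is no such fallback: the mount source is tied to the one host, which makes the local glusterd a single point of failure for that mount.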
(In reply to Sahina Bose from comment #11)
> When a node in the cluster is down, the other 2 nodes still have
> glusterd running - Why is the host unable to talk with the local gluster
> instance?
>
> And why is it not recommended to use localhost to access storage in an HC
> setup?

Because localhost means that the storage is local, whereas providing an FQDN also defines failover to the other hosts in the mount.

This needs to be retested with the additional mount options as per Bug 1298693. Kasturi, can you check this again?

Bug 1298693 got merged for 3.6.7 RC1. Can you please retest this using a real host address and passing something like OVEHOSTED_STORAGE/mntOptions=str:backupvolfile-server=gluster.xyz.com,fetch-attempts=2,log-level=WARNING,log-file=/var/log/engine_domain.log to hosted-engine-setup, to avoid having a SPOF?

Simone, I tested with 3.6.7. I have 3 hosts: rhsdev9, rhsdev13, and rhsdev14. The engine volume is mounted using rhsdev9, with mntOptions=str:backup-volfile-servers=rhsdev13:rhsdev14. HE was running on rhsdev13.

First test - bring glusterd down on rhsdev9 - PASS. HE continues to be available.

Second test - power off rhsdev14 - the HE engine is restarted on rhsdev9. No errors in the agent/broker logs, however.

Third test - power off rhsdev9 - the HE engine is restarted, since it was running on rhsdev9.

Lowering severity and priority, as the HE engine is accessible again after some time.

Additional note - the hosted_storage domain is online for all three tests.

After reducing the network.ping-timeout value on the gluster volume, I did not encounter the issue. Closing this.
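The resolution above (lowering `network.ping-timeout`) can be sketched as follows. This is a hedged example: the timeout value of 10 seconds is illustrative, not the value used in the bug, and the assumption is that the gluster default of 42 seconds stalls clients longer than the hosted-engine HA agent tolerates, so the engine VM is restarted before the volume recovers:

```shell
# Reduce how long gluster clients wait before declaring a failed node dead.
# While the ping timeout is running, I/O on the volume stalls; shortening it
# lets the hosted engine storage recover before the HA agent gives up.
gluster volume set engine network.ping-timeout 10

# Combined with the retest setup from the comments above, passed as an
# answer to hosted-engine-setup (hosts as in the test environment):
#   OVEHOSTED_STORAGE/mntOptions=str:backup-volfile-servers=rhsdev13:rhsdev14
```

`gluster volume set <VOLNAME> <KEY> <VALUE>` applies cluster-wide, so it only needs to be run once on any node of the trusted pool.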