Bug 1597719
| Summary: | [HE] Restart of vdsmd service took this host out of HA pool for ~15 minutes | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] ovirt-hosted-engine-ha | Reporter: | Kobi Hakimi <khakimi> | ||||
| Component: | Agent | Assignee: | Simone Tiraboschi <stirabos> | ||||
| Status: | CLOSED NOTABUG | QA Contact: | Pavel Stehlik <pstehlik> | ||||
| Severity: | unspecified | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 2.3.0 | CC: | bugs, khakimi, michal.skrivanek | ||||
| Target Milestone: | --- | Keywords: | Automation | ||||
| Target Release: | --- | ||||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | If docs needed, set a value | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2018-07-16 07:55:26 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Integration | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
Why do you expect a few seconds? Why is it a problem anyway?

I expect something between a few seconds and 2 minutes until ovirt-ha-agent monitors the host and takes action.

We are talking about a High Availability environment... IMHO we should be up and running as fast as we can, to be ready for some failure.
In this case, I think that we can handle it in less than 2-3 minutes in the worst case.

(In reply to Kobi Hakimi from comment #2)
> I expect something between a few seconds and 2 minutes until ovirt-ha-agent
> monitors the host and takes action.
>
> We are talking about a High Availability environment... IMHO we should be up
> and running as fast as we can, to be ready for some failure.
> In this case, I think that we can handle it in less than 2-3 minutes in the
> worst case.

A fencing action typically takes at least 5 minutes, and usually more in a more complicated environment. Here only one node is affected, by a manual action which you're not supposed to do (not while the host is not in maintenance).
So again, why do you care if it takes a few minutes to stabilize?

(In reply to Michal Skrivanek from comment #3)
> (In reply to Kobi Hakimi from comment #2)
> > I expect something between a few seconds and 2 minutes until ovirt-ha-agent
> > monitors the host and takes action.
> >
> > We are talking about a High Availability environment... IMHO we should be up
> > and running as fast as we can, to be ready for some failure.
> > In this case, I think that we can handle it in less than 2-3 minutes in the
> > worst case.
>
> A fencing action typically takes at least 5 minutes, and usually more in a
> more complicated environment. Here only one node is affected, by a manual
> action which you're not supposed to do (not while the host is not in
> maintenance).
> So again, why do you care if it takes a few minutes to stabilize?

You surely know the timings better than I do, but still, we are talking here about 15 minutes...
You are right, we did manually restart the vdsmd service.
On the other hand, some malfunction could happen to vdsmd, and as an HA solution we should take care to be up and running as much as we can.
You can prioritize this bug as low/medium, but I don't agree that this is the expected result :)

The issue is here:
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/mgmt_bridge.py", line 47, in action
    caps = cli.Host.getCapabilities()
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 278, in _call
    raise TimeoutError(method, kwargs, timeout)
TimeoutError: Request Host.getCapabilities with args {} timed out after 900 seconds

(In reply to Martin Sivák from comment #5)
> TimeoutError: Request Host.getCapabilities with args {} timed out after 900
> seconds

We could evaluate lowering that value to 60 or 90 seconds.

Closing as not a bug.
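
For illustration only: the 900 seconds in the traceback equal 15 minutes (900 / 60), which matches the reported outage, so the monitor appears to stay blocked for as long as the request timeout allows. The following is a minimal sketch of how a shorter limit bounds that blocking time. It does not use the vdsm client at all; `call_with_timeout`, `probe`, `make_hung_call` and `CallTimeout` are hypothetical names, and the hung request is simulated with a thread that simply waits.

```python
import concurrent.futures
import threading
import time


class CallTimeout(Exception):
    """Hypothetical error for a call that did not answer in time."""


def call_with_timeout(func, timeout):
    """Run func() in a worker thread; give up after `timeout` seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        raise CallTimeout("no reply after %s seconds" % timeout)
    finally:
        # Do not block on a hung worker; it ends when func() returns.
        pool.shutdown(wait=False)


def probe(func, timeout):
    """One monitoring attempt: return (ok, seconds_blocked)."""
    start = time.monotonic()
    try:
        call_with_timeout(func, timeout)
        ok = True
    except CallTimeout:
        ok = False
    return ok, time.monotonic() - start


def make_hung_call(max_wait=2.0):
    """Simulate a request whose peer was restarted and never replies.

    Returns (func, release): func blocks until release is set or
    max_wait seconds pass, so the demo always terminates.
    """
    release = threading.Event()

    def func():
        release.wait(max_wait)
        return {}

    return func, release


if __name__ == "__main__":
    # Scaled down: 0.2 s plays the role of the proposed 60-90 s limit.
    hung, release = make_hung_call()
    ok, blocked = probe(hung, timeout=0.2)
    print("ok=%s, blocked for about %.1f s" % (ok, blocked))
    release.set()
```

The point of the sketch is only that the caller's worst-case blocking time is the timeout itself: with a 900-second limit a single lost reply can hold a monitor for 15 minutes, while a 60 or 90 second limit would let it fail and retry much sooner.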

Created attachment 1456247
lynx01 logs

Description of problem:
[HE] Restart of vdsmd service took this host out of HA pool for ~15 minutes

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.23-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.15-1.el7ev.noarch
rhvm-appliance-4.2-20180620.0.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Open a shell on a host that has Hosted Engine HA capabilities.
2. Run "hosted-engine --vm-status" to make sure everything runs as expected (or check in the engine that this host shows Hosted Engine HA: Active (Score: 3400)).
3. Restart the vdsmd service.

Actual results:
The host lost its Hosted Engine HA capabilities for ~15 minutes.

Expected results:
The host regains its Hosted Engine HA capabilities after a few seconds, 2 minutes at most.

Additional info:
We worked around it by restarting the following services (after the restart of the vdsmd service):
- ovirt-ha-agent
- ovirt-ha-broker

Attached is a tar with logs from the vdsm folder and the ovirt-ha folder.
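
The reproduction step and the workaround above could be scripted roughly as follows. This is a sketch under assumptions: the service names and the `hosted-engine --vm-status` command come from the report, but the use of `systemctl` to restart the services, the restart order of the two HA services, and the helper names `workaround_commands` and `run` are assumptions made for illustration. The script defaults to a dry run and only prints the commands.

```python
import subprocess

# Service names come from the report; using systemctl is an assumption.
VDSM_SERVICE = "vdsmd"
HA_SERVICES = ["ovirt-ha-agent", "ovirt-ha-broker"]


def workaround_commands():
    """Return the vdsmd restart followed by the reported workaround."""
    commands = [["systemctl", "restart", VDSM_SERVICE]]
    commands += [["systemctl", "restart", name] for name in HA_SERVICES]
    # Check the HA state afterwards, as in step 2 of the report.
    commands.append(["hosted-engine", "--vm-status"])
    return commands


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(workaround_commands(), dry_run=True)
```

Passing `dry_run=False` would actually restart the services, which, as noted in the discussion, is not something to do on a host that is not in maintenance.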