Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1597719

Summary:

[HE] Restart of vdsmd service took this host out of HA pool for ~15 minutes

Product:

[oVirt] ovirt-hosted-engine-ha

Reporter:

Kobi Hakimi <khakimi>

Component:

Agent

Assignee:

Simone Tiraboschi <stirabos>

Status:

CLOSED NOTABUG

QA Contact:

Pavel Stehlik <pstehlik>

Severity:

unspecified

Docs Contact:

Priority:

unspecified

Version:

2.3.0

CC:

bugs, khakimi, michal.skrivanek

Target Milestone:

---

Keywords:

Automation

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2018-07-16 07:55:26 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Integration

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
lynx01 logs	none

Description Kobi Hakimi 2018-07-03 13:56:17 UTC

Created attachment 1456247
lynx01 logs

Description of problem:
[HE] Restart of vdsmd service took this host out of HA pool for ~15 minutes

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.23-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.15-1.el7ev.noarch
rhvm-appliance-4.2-20180620.0.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Open a shell on a host that has Hosted Engine HA capabilities.
2. Run "hosted-engine --vm-status" to make sure everything is running as expected (or check in the engine that this host shows Hosted Engine HA: Active (Score: 3400)).
3. Restart the vdsmd service.
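
To put a number on the recovery time after step 3, the score can be polled in a loop. The following is a minimal sketch, not part of the original report. It assumes that "hosted-engine --vm-status --json" prints one JSON object per host carrying a "score" field, and that the local host is listed under the key "1"; the helper names ha_scores, seconds_until_score and read_status_from_cli are hypothetical.

```python
import json
import subprocess
import time


def ha_scores(status_json):
    """Map host key -> HA score from the JSON text of the status command.

    Assumption: every per-host entry is a JSON object with a "score" field.
    Entries that are not objects (for example a maintenance flag) are skipped.
    """
    data = json.loads(status_json)
    return {
        key: entry["score"]
        for key, entry in data.items()
        if isinstance(entry, dict) and "score" in entry
    }


def seconds_until_score(read_status, host_key, wanted=3400, poll=10, limit=1200):
    """Poll read_status() until host_key reports `wanted`.

    Returns the elapsed seconds, or None if `limit` seconds pass first.
    """
    start = time.monotonic()
    while time.monotonic() - start < limit:
        if ha_scores(read_status()).get(host_key) == wanted:
            return time.monotonic() - start
        time.sleep(poll)
    return None


def read_status_from_cli():
    """Run the status command on the host and return its JSON output as text."""
    return subprocess.check_output(
        ["hosted-engine", "--vm-status", "--json"], text=True
    )
```

On an affected host, calling seconds_until_score(read_status_from_cli, "1") right after the vdsmd restart would return the number of seconds until the score is back at 3400.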

Actual results:
The host lost its Hosted Engine HA capabilities for ~15 minutes.

Expected results:
The host regains its Hosted Engine HA capabilities within a few seconds, 2 minutes at most.

Additional info:
We worked around it by restarting these services:
 - ovirt-ha-agent
 - ovirt-ha-broker
(after the restart of the vdsmd service)
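
The workaround can be scripted. This is only a sketch: the report does not say how the services were restarted, so the use of "systemctl restart" is an assumption, and the names HA_UNITS and restart_ha_services are hypothetical. The units are restarted in the order listed above.

```python
import subprocess

# Units named in the workaround, in the order the report lists them.
HA_UNITS = ("ovirt-ha-agent", "ovirt-ha-broker")


def restart_commands(units=HA_UNITS):
    """Build one restart command per unit (assumes systemd-managed services)."""
    return [["systemctl", "restart", unit] for unit in units]


def restart_ha_services(run=subprocess.check_call, units=HA_UNITS):
    """Run each restart command; `run` is injectable so the logic can be dry-run."""
    for command in restart_commands(units):
        run(command)
```

Calling restart_ha_services() as root performs the restarts; passing run=print only prints the commands.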

Attached is a tar with logs from the vdsm folder and the ovirt-ha folder.

Comment 1 Michal Skrivanek 2018-07-04 04:56:50 UTC

Why do you expect a few seconds?

Why is it a problem anyway?

Comment 2 Kobi Hakimi 2018-07-04 06:15:53 UTC

I expect somewhere between a few seconds and 2 minutes, until the ovirt-ha-agent monitors and takes action.

We are talking about a High Availability environment... IMHO we should be up and running as fast as we can, to be ready for a failure.
In this case, I think we can handle it in less than 2-3 minutes in the worst case.

Comment 3 Michal Skrivanek 2018-07-04 06:41:49 UTC

(In reply to Kobi Hakimi from comment #2)
> I expect somewhere between a few seconds and 2 minutes, until the
> ovirt-ha-agent monitors and takes action.
> 
> We are talking about a High Availability environment... IMHO we should be
> up and running as fast as we can, to be ready for a failure.
> In this case, I think we can handle it in less than 2-3 minutes in the
> worst case.

A fencing action typically takes at least 5 minutes, usually more in a more complicated environment. Here only one node is affected, by a manual action which you're not supposed to do (at least not while the host is not in maintenance). So again, why do you care if it takes a few minutes to stabilize?

Comment 4 Kobi Hakimi 2018-07-04 07:06:46 UTC

(In reply to Michal Skrivanek from comment #3)
> (In reply to Kobi Hakimi from comment #2)
> > I expect somewhere between a few seconds and 2 minutes, until the
> > ovirt-ha-agent monitors and takes action.
> > 
> > We are talking about a High Availability environment... IMHO we should be
> > up and running as fast as we can, to be ready for a failure.
> > In this case, I think we can handle it in less than 2-3 minutes in the
> > worst case.
> 
> A fencing action typically takes at least 5 minutes, usually more in a more
> complicated environment. Here only one node is affected, by a manual action
> which you're not supposed to do (at least not while the host is not in
> maintenance). So again, why do you care if it takes a few minutes to
> stabilize?

You surely know the timings better than I do,
but still, we are talking about 15 minutes here...
 
You're right, we restarted the vdsmd service manually.
But on the other hand, vdsmd could malfunction on its own, and as an HA component we should make sure to be up and running as much as we can.
You can prioritize this bug as low/medium, but I don't agree that this is the expected result :)

Comment 5 Martin Sivák 2018-07-04 09:19:40 UTC

The issue is here:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/mgmt_bridge.py", line 47, in action
    caps = cli.Host.getCapabilities()
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 278, in _call
    raise TimeoutError(method, kwargs, timeout)
TimeoutError: Request Host.getCapabilities with args {} timed out after 900 seconds

Comment 6 Simone Tiraboschi 2018-07-04 12:03:50 UTC

(In reply to Martin Sivák from comment #5)
> TimeoutError: Request Host.getCapabilities with args {} timed out after 900
> seconds

We could evaluate lowering that value to 60 or 90 seconds.
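
The 900 seconds in the traceback equal 15 minutes, which matches the reported outage. The effect of a shorter limit can be illustrated with a generic sketch. It does not use the vdsm client API: RequestTimeout, call_with_timeout and slow_service are hypothetical stand-ins, and the code targets Python 3 rather than the Python 2.7 seen in the traceback.

```python
import queue
import threading
import time


class RequestTimeout(Exception):
    """Raised when no response arrives within the allowed time."""


def call_with_timeout(send_request, timeout):
    """Run send_request() in a helper thread and wait at most `timeout` seconds.

    The helper thread is not cancelled on timeout; the caller simply stops
    waiting, so a monitoring loop can report a failure and retry sooner.
    """
    responses = queue.Queue(maxsize=1)
    worker = threading.Thread(
        target=lambda: responses.put(send_request()), daemon=True
    )
    worker.start()
    try:
        return responses.get(timeout=timeout)
    except queue.Empty:
        raise RequestTimeout("no response after %s seconds" % timeout) from None


def slow_service(delay):
    """Stand-in for a service that answers only after `delay` seconds."""
    def request():
        time.sleep(delay)
        return {"status": "ok"}
    return request
```

With a 60 or 90 second limit, a request sent while the service is restarting would fail after that interval instead of after 900 seconds, so the caller could retry much earlier.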

Comment 7 Sandro Bonazzola 2018-07-16 07:55:26 UTC

Closing as not a bug.