Bug 1597719

Summary: [HE] Restart of vdsmd service took this host out of HA pool for ~15 minutes
Product: [oVirt] ovirt-hosted-engine-ha Reporter: Kobi Hakimi <khakimi>
Component: Agent Assignee: Simone Tiraboschi <stirabos>
Status: CLOSED NOTABUG QA Contact: Pavel Stehlik <pstehlik>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 2.3.0 CC: bugs, khakimi, michal.skrivanek
Target Milestone: --- Keywords: Automation
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-07-16 07:55:26 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Integration RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
lynx01 logs none

Description Kobi Hakimi 2018-07-03 13:56:17 UTC
Created attachment 1456247
lynx01 logs

Description of problem:
[HE] Restart of vdsmd service took this host out of HA pool for ~15 minutes

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-setup-2.2.23-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.15-1.el7ev.noarch
rhvm-appliance-4.2-20180620.0.el7.noarch

How reproducible:
100%

Steps to Reproduce:
1. Open a shell on a host that has Hosted Engine HA capabilities.
2. Run "hosted-engine --vm-status" to make sure everything runs as expected (or check in the engine that this host shows Hosted Engine HA: Active (Score: 3400)).
3. Restart the vdsmd service.

Actual results:
The host lost its Hosted Engine HA capabilities for ~15 minutes.

Expected results:
The host regains its Hosted Engine HA capabilities after a few seconds, 2 minutes at most.

Additional info:
We worked around it by restarting these services:
 - ovirt-ha-agent
 - ovirt-ha-broker
(after the restart of the vdsmd service)

Attached is a tar with logs from the vdsm folder and the ovirt-ha folder.

Comment 1 Michal Skrivanek 2018-07-04 04:56:50 UTC
Why do you expect a few seconds?

Why is it a problem anyway?

Comment 2 Kobi Hakimi 2018-07-04 06:15:53 UTC
I expect somewhere between a few seconds and 2 minutes, until the ovirt-ha-agent monitors the host and takes action.

We are talking about a High Availability environment... IMHO we should be up and running as fast as we can, to be ready for some failure.
In this case, I think that we can handle it in less than 2-3 minutes in the worst case.

Comment 3 Michal Skrivanek 2018-07-04 06:41:49 UTC
(In reply to Kobi Hakimi from comment #2)
> I expect between "few seconds-2 minutes", until the ovirt-ha-agent monitor
> and take action.  
> 
> We talking about High Availablity environment... IMHO we should be up and
> running as fast as we can, to be ready for some failure. 
> In this case, I think that we can handle it in less than 2-3 minutes in the
> worse case.

Fencing typically takes at least 5 minutes, and usually more in a more complicated environment. Here only one node is affected, by a manual action which you're not supposed to do (not while the host is not in maintenance). So again, why do you care if it takes a few minutes to stabilize?

Comment 4 Kobi Hakimi 2018-07-04 07:06:46 UTC
(In reply to Michal Skrivanek from comment #3)
> (In reply to Kobi Hakimi from comment #2)
> > I expect between "few seconds-2 minutes", until the ovirt-ha-agent monitor
> > and take action.  
> > 
> > We talking about High Availablity environment... IMHO we should be up and
> > running as fast as we can, to be ready for some failure. 
> > In this case, I think that we can handle it in less than 2-3 minutes in the
> > worse case.
> 
> Fencing action typically takes 5 minutes at least, typically more in more
> complicated environment. Here only one node is affected by a manual action
> which you're not supposed to do (not while the host is not in maintenance).
> So again, why do you care if it takes few minutes to stabilize?

You surely know the timings better than I do,
but still, we are talking here about 15 minutes...

You are right, we did restart the vdsmd service manually.
But on the other hand, some malfunction could happen to vdsmd, and as HA we should take care to be up and running as much as we can.
You can prioritize this bug as low/medium, but I don't agree that this is the expected result :)

Comment 5 Martin Sivák 2018-07-04 09:19:40 UTC
The issue is here:

Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitor_base.py", line 115, in _worker
    self.action(self._options)
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/submonitors/mgmt_bridge.py", line 47, in action
    caps = cli.Host.getCapabilities()
  File "/usr/lib/python2.7/site-packages/vdsm/client.py", line 278, in _call
    raise TimeoutError(method, kwargs, timeout)
TimeoutError: Request Host.getCapabilities with args {} timed out after 900 seconds

Comment 6 Simone Tiraboschi 2018-07-04 12:03:50 UTC
(In reply to Martin Sivák from comment #5)
> TimeoutError: Request Host.getCapabilities with args {} timed out after 900
> seconds

We could evaluate lowering that value to 60 or 90 seconds.
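
To illustrate what a lower bound would mean for a monitor loop, here is a minimal sketch in plain Python. It is not the vdsm client API or the ovirt-hosted-engine-ha code: call_with_timeout, CallTimeout and slow_reply are hypothetical names, and the thread-based wrapper is only one assumed way to cap how long a caller waits for a reply.

```python
import concurrent.futures
import time


class CallTimeout(Exception):
    """Raised when the wrapped call does not answer in time (hypothetical)."""


def call_with_timeout(func, timeout_seconds):
    """Run func() in a worker thread and stop waiting after timeout_seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(func)
    try:
        return future.result(timeout=timeout_seconds)
    except concurrent.futures.TimeoutError:
        raise CallTimeout("no reply after %s seconds" % timeout_seconds)
    finally:
        # Do not block on a stuck worker; it finishes in the background.
        executor.shutdown(wait=False)


def slow_reply():
    # Hypothetical stand-in for a request that hangs after a service restart.
    time.sleep(0.5)
    return "caps"


if __name__ == "__main__":
    try:
        # With a short limit the caller can retry or reconnect quickly
        # instead of staying blocked for the full default timeout.
        call_with_timeout(slow_reply, 0.05)
    except CallTimeout as exc:
        print("monitor would retry:", exc)
```

With a limit of 60 or 90 seconds instead of 900, a submonitor blocked on a lost request would fail and retry within a couple of minutes rather than ~15.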

Comment 7 Sandro Bonazzola 2018-07-16 07:55:26 UTC
Closing as not a bug.