Bug 1275606 - [hosted-engine-ha] ha-agent service is not restarted once connectivity to the storage is restored
Status: CLOSED DUPLICATE of bug 1030441
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 1.3.1
Hardware: x86_64 Unspecified
Priority: unspecified Severity: urgent
Target Milestone: ovirt-3.6.1
Target Release: ---
Assigned To: Martin Sivák
Ilanit Stein
sla
Depends On:
Blocks:
Reported: 2015-10-27 06:17 EDT by Elad
Modified: 2016-02-10 14:20 EST
CC List: 5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-18 04:50:36 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
dfediuck: ovirt‑3.6.z?
gklein: blocker?
ebenahar: planning_ack?
dfediuck: devel_ack+
ebenahar: testing_ack?


Attachments
journalctl (3.99 MB, application/x-gzip)
2015-11-01 11:14 EST, Elad

Description Elad 2015-10-27 06:17:03 EDT
Description of problem:
The ha-agent service is not restarted automatically once connectivity to the storage is restored. As a result, the hosted-engine VM is not resumed from EIO after storage connectivity returns.


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
On hosted-engine env:
1. Block connectivity from all HE hosts to the storage server where the HE SD is located. 
2. Restore connectivity to the storage

Actual results:
The ha-agent service enters the 'failed' state and stays there even after connectivity to the storage is restored:


[root@green-vdsa images]# systemctl status ovirt-ha-agent.service 
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2015-10-27 08:43:37 IST; 35min ago
  Process: 4726 ExecStop=/usr/lib/systemd/systemd-ovirt-ha-agent stop (code=exited, status=0/SUCCESS)
  Process: 2592 ExecStart=/usr/lib/systemd/systemd-ovirt-ha-agent start (code=exited, status=0/SUCCESS)
 Main PID: 2620 (code=exited, status=157)

Oct 27 08:43:06 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:11 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:16 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:21 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:27 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:32 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Too many errors occurred, giving up. Plea...ng a bug.
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service: main process exited, code=exited, status=157/n/a
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-agent.service entered failed state.
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

ha-agent doesn't get restarted. Hosted-engine VM remains down.


Expected results:
ha-agent should be restarted after the connectivity to the storage is resumed.

Additional info:
logs (/var/log/ from hosts):
http://file.tlv.redhat.com/ebenahar/bug2.tar.gz



https://gerrit.ovirt.org/#/c/47227/1/ -

[Unit]
Description=oVirt Hosted Engine High Availability Communications Broker

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/ovirt-ha-broker
ExecStart=/usr/lib/systemd/systemd-ovirt-ha-broker start
ExecStop=/usr/lib/systemd/systemd-ovirt-ha-broker stop
Restart=on-abnormal

[Install]
WantedBy=multi-user.target
Comment 1 Martin Sivák 2015-10-27 11:17:15 EDT
Please note that the current EL7.1 systemd (208) does not support Restart=on-abnormal. Can you test this using Fedora or EL7.2 (systemd 217 or higher iirc)?
Comment 2 Elad 2015-10-27 11:52:01 EDT
(In reply to Martin Sivák from comment #1)
> Please note that the current EL7.1 systemd (208) does not support
> Restart=on-abnormal. Can you test this using Fedora or EL7.2 (systemd 217 or
> higher iirc)?

It was tested using el7.2
Comment 3 Martin Sivák 2015-10-29 04:49:26 EDT
So did systemd try to restart the service? Can you check the full journal log? If no attempt was made, please check the systemd version.
Comment 4 Elad 2015-11-01 11:14 EST
Created attachment 1088371
journalctl

Martin, checked it again using the latest systemd:

systemd-python-219-19.el7.x86_64
systemd-libs-219-19.el7.x86_64
systemd-219-19.el7.x86_64
systemd-sysv-219-19.el7.x86_64

ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
vdsm-4.17.10-5.el7ev.noarch


Blocked connectivity from all hosts in the DC to the hosted-engine storage domain. The VM moved to paused and the ha-agent service failed. Restored connectivity to the storage server and waited for ~30 minutes. systemd made no attempt to restart the ha-agent.
Attached journalctl log.
Comment 5 Martin Sivák 2015-11-03 10:28:26 EST
It seems we need the more traditional on-failure restart mode in this case.
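
The difference between the two restart modes can be sketched in Python. This is a simplified model of the exit-cause table in the systemd.service documentation, not code from ovirt-hosted-engine-ha. The names RESTART_TABLE and will_restart are hypothetical, and options such as SuccessExitStatus=, RestartPreventExitStatus= and start rate limiting are ignored.

```python
# Simplified model of which Restart= settings trigger a restart for each
# exit cause, following the table in the systemd.service documentation.
# RESTART_TABLE and will_restart are illustrative names, not a systemd API.
RESTART_TABLE = {
    "clean_exit": {"always", "on-success"},
    "unclean_exit_code": {"always", "on-failure"},
    "unclean_signal": {"always", "on-failure", "on-abnormal", "on-abort"},
    "timeout": {"always", "on-failure", "on-abnormal"},
    "watchdog": {"always", "on-failure", "on-abnormal", "on-watchdog"},
}


def will_restart(restart_setting, exit_cause):
    """Return True if a unit with this Restart= value is restarted
    after the main process ends with the given exit cause."""
    return restart_setting in RESTART_TABLE[exit_cause]


if __name__ == "__main__":
    # The agent in this report ended with "code=exited, status=157",
    # which is an unclean exit code rather than a signal or a timeout.
    for setting in ("on-abnormal", "on-failure"):
        print(setting, will_restart(setting, "unclean_exit_code"))
```

Under this model, the agent's exit with status 157 counts as an unclean exit code, which on-abnormal does not cover and on-failure does.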
Comment 6 Martin Sivák 2015-11-18 04:50:36 EST
This should be resolved as part of a wider fix for the referenced bug #1030441

*** This bug has been marked as a duplicate of bug 1030441 ***
