Bug 1275606 - [hosted-engine-ha] ha-agent service is not restarted once connectivity to the storage is restored
Keywords:
Status: CLOSED DUPLICATE of bug 1030441
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 1.3.1
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-3.6.1
Target Release: ---
Assignee: Martin Sivák
QA Contact: Ilanit Stein
URL:
Whiteboard: sla
Depends On:
Blocks:
Reported: 2015-10-27 10:17 UTC by Elad
Modified: 2019-04-25 10:40 UTC (History)

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-11-18 09:50:36 UTC
oVirt Team: SLA
Embargoed:
dfediuck: ovirt-3.6.z?
gklein: blocker?
ebenahar: planning_ack?
dfediuck: devel_ack+
ebenahar: testing_ack?


Attachments
journalctl (3.99 MB, application/x-gzip)
2015-11-01 16:14 UTC, Elad

Description Elad 2015-10-27 10:17:03 UTC
Description of problem:
The ha-agent service is not restarted automatically once connectivity to the storage is restored. As a result, the hosted-engine VM is not resumed from EIO after the storage becomes reachable again.


Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
On hosted-engine env:
1. Block connectivity from all HE hosts to the storage server where the HE SD is located. 
2. Restore connectivity to the storage

Actual results:
The ha-agent service enters the 'failed' state and stays there even though connectivity to the storage has been restored:


[root@green-vdsa images]# systemctl status ovirt-ha-agent.service 
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2015-10-27 08:43:37 IST; 35min ago
  Process: 4726 ExecStop=/usr/lib/systemd/systemd-ovirt-ha-agent stop (code=exited, status=0/SUCCESS)
  Process: 2592 ExecStart=/usr/lib/systemd/systemd-ovirt-ha-agent start (code=exited, status=0/SUCCESS)
 Main PID: 2620 (code=exited, status=157)

Oct 27 08:43:06 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:11 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:16 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:21 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:27 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:32 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Too many errors occurred, giving up. Plea...ng a bug.
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service: main process exited, code=exited, status=157/n/a
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-agent.service entered failed state.
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

ha-agent doesn't get restarted. Hosted-engine VM remains down.


Expected results:
ha-agent should be restarted after the connectivity to the storage is resumed.

Additional info:
logs (/var/log/ from hosts):
http://file.tlv.redhat.com/ebenahar/bug2.tar.gz



https://gerrit.ovirt.org/#/c/47227/1/ -

[Unit]
Description=oVirt Hosted Engine High Availability Communications Broker

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/ovirt-ha-broker
ExecStart=/usr/lib/systemd/systemd-ovirt-ha-broker start
ExecStop=/usr/lib/systemd/systemd-ovirt-ha-broker stop
Restart=on-abnormal

[Install]
WantedBy=multi-user.target

Comment 1 Martin Sivák 2015-10-27 15:17:15 UTC
Please note that the current EL7.1 systemd (208) does not support Restart=on-abnormal. Can you test this using Fedora or EL7.2 (systemd 217 or higher iirc)?

Comment 2 Elad 2015-10-27 15:52:01 UTC
(In reply to Martin Sivák from comment #1)
> Please note that the current EL7.1 systemd (208) does not support
> Restart=on-abnormal. Can you test this using Fedora or EL7.2 (systemd 217 or
> higher iirc)?

It was tested using el7.2

Comment 3 Martin Sivák 2015-10-29 08:49:26 UTC
So did systemd try to restart the service? Can you check the full journal log? If no attempt was made, please check the systemd version.

Comment 4 Elad 2015-11-01 16:14:45 UTC
Created attachment 1088371
journalctl

Martin, checked it again using the latest systemd:

systemd-python-219-19.el7.x86_64
systemd-libs-219-19.el7.x86_64
systemd-219-19.el7.x86_64
systemd-sysv-219-19.el7.x86_64

ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
vdsm-4.17.10-5.el7ev.noarch


Blocked connectivity from all hosts in the DC to the hosted-engine storage domain. The VM moved to paused and the ha-agent service failed. Restored connectivity to the storage server and waited for ~30 minutes. systemd made no attempt to restart the ha-agent.
Attached journalctl log.

Comment 5 Martin Sivák 2015-11-03 15:28:26 UTC
It seems we need the more traditional on-failure restart mode in this case.
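
For context, a sketch of why the restart mode matters here: per the systemd.service documentation, Restart=on-abnormal restarts a unit after an unclean signal, a timeout, or a watchdog expiry, but not after a non-zero exit status. The agent above exited with status=157, which would explain why no restart was attempted. Restart=on-failure also covers unclean exit codes. The drop-in below is only an illustration: the file name restart.conf and the RestartSec value are assumptions, and this is not necessarily the change made for bug 1030441.

```ini
# Hypothetical drop-in file (name and location are assumptions):
#   /etc/systemd/system/ovirt-ha-agent.service.d/restart.conf
[Service]
# on-failure restarts the service after a non-zero exit status
# (such as 157 in the report above), as well as after an unclean
# signal, a timeout, or a watchdog expiry.
Restart=on-failure
# Assumed delay between restart attempts; the value is illustrative only.
RestartSec=30
```

After adding or changing a drop-in, `systemctl daemon-reload` is needed before systemd picks it up.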

Comment 6 Martin Sivák 2015-11-18 09:50:36 UTC
This should be resolved as part of a wider fix for bug 1030441.

*** This bug has been marked as a duplicate of bug 1030441 ***

