Description of problem:
The ha-agent service is not restarted automatically once connectivity to the storage is restored. As a result, the hosted-engine VM is not resumed from EIO once connectivity to the storage is restored.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch

How reproducible:
Always

Steps to Reproduce:
On hosted-engine env:
1. Block connectivity from all HE hosts to the storage server where the HE SD is located.
2. Restore connectivity to the storage.

Actual results:
The ha-agent service is in status 'failed' and stays there even though connectivity to the storage is OK:

[root@green-vdsa images]# systemctl status ovirt-ha-agent.service
● ovirt-ha-agent.service - oVirt Hosted Engine High Availability Monitoring Agent
   Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-agent.service; enabled; vendor preset: disabled)
   Active: failed (Result: exit-code) since Tue 2015-10-27 08:43:37 IST; 35min ago
  Process: 4726 ExecStop=/usr/lib/systemd/systemd-ovirt-ha-agent stop (code=exited, status=0/SUCCESS)
  Process: 2592 ExecStart=/usr/lib/systemd/systemd-ovirt-ha-agent start (code=exited, status=0/SUCCESS)
 Main PID: 2620 (code=exited, status=157)

Oct 27 08:43:06 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:11 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:16 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:21 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:27 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:32 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: 'path to storage domain 6a46276a-0...art agent
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com ovirt-ha-agent[2620]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Too many errors occurred, giving up. Plea...ng a bug.
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service: main process exited, code=exited, status=157/n/a
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com systemd[1]: Unit ovirt-ha-agent.service entered failed state.
Oct 27 08:43:37 green-vdsa.qa.lab.tlv.redhat.com systemd[1]: ovirt-ha-agent.service failed.
Hint: Some lines were ellipsized, use -l to show in full.

ha-agent doesn't get restarted. Hosted-engine VM remains down.

Expected results:
ha-agent should be restarted after connectivity to the storage is restored.

Additional info:
logs (/var/log/ from hosts):
http://file.tlv.redhat.com/ebenahar/bug2.tar.gz

https://gerrit.ovirt.org/#/c/47227/1/ -

[Unit]
Description=oVirt Hosted Engine High Availability Communications Broker

[Service]
Type=forking
EnvironmentFile=-/etc/sysconfig/ovirt-ha-broker
ExecStart=/usr/lib/systemd/systemd-ovirt-ha-broker start
ExecStop=/usr/lib/systemd/systemd-ovirt-ha-broker stop
Restart=on-abnormal

[Install]
WantedBy=multi-user.target
Please note that the current EL7.1 systemd (208) does not support Restart=on-abnormal. Can you test this using Fedora or EL7.2 (systemd 217 or higher iirc)?
(In reply to Martin Sivák from comment #1)
> Please note that the current EL7.1 systemd (208) does not support
> Restart=on-abnormal. Can you test this using Fedora or EL7.2 (systemd 217 or
> higher iirc)?

It was tested using el7.2
So did systemd try to restart the service? Can you check the full journal log? If no attempt was made, please check the systemd version.
Created attachment 1088371 [details]
journalctl

Martin, checked it again using the latest systemd:
systemd-python-219-19.el7.x86_64
systemd-libs-219-19.el7.x86_64
systemd-219-19.el7.x86_64
systemd-sysv-219-19.el7.x86_64
ovirt-hosted-engine-ha-1.3.1-1.el7ev.noarch
ovirt-hosted-engine-setup-1.3.0-1.el7ev.noarch
vdsm-4.17.10-5.el7ev.noarch

Blocked connectivity between all hosts in the DC to the hosted-engine storage domain. The VM moved to paused, the ha-agent service failed.
Restored the connectivity to the storage server and waited for ~30 minutes. No attempt to restart the ha-agent was done by systemd.
Attached journalctl log.
It seems we need the more traditional on-failure restart mode in this case.
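For reference, a rough sketch of why on-abnormal made no restart attempt here. The journal shows the agent's main process exiting with code=exited, status=157, i.e. a non-zero exit code rather than a signal, timeout or watchdog event. The table below is transcribed from the Restart= description in systemd.service(5); the cause labels and the would_restart helper are made up for illustration and are not part of systemd or of the ovirt packages. It also ignores SuccessExitStatus=, RestartPreventExitStatus= and RestartForceExitStatus=, which can change the outcome.

```python
# Which exit causes trigger a restart for each Restart= setting,
# transcribed from the table in systemd.service(5).
# The cause labels below are illustrative names, not systemd identifiers.
RESTART_TABLE = {
    "no": set(),
    "always": {"clean", "unclean-exit-code", "unclean-signal", "timeout", "watchdog"},
    "on-success": {"clean"},
    "on-failure": {"unclean-exit-code", "unclean-signal", "timeout", "watchdog"},
    "on-abnormal": {"unclean-signal", "timeout", "watchdog"},
    "on-abort": {"unclean-signal"},
    "on-watchdog": {"watchdog"},
}


def would_restart(restart_setting, exit_cause):
    """Return True if systemd would restart the unit for this exit cause."""
    return exit_cause in RESTART_TABLE[restart_setting]


# The agent gave up with status=157, which is an unclean exit code.
print(would_restart("on-abnormal", "unclean-exit-code"))  # False
print(would_restart("on-failure", "unclean-exit-code"))   # True
```

With on-abnormal the agent giving up on its own is simply not a restart condition, which matches what the journal shows; on-failure covers it.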
This should be resolved as part of a wider fix for the referenced bug #1030441

*** This bug has been marked as a duplicate of bug 1030441 ***