Bug 1124826

Summary: hosted engine upgrade has issues restarting the services with systemd
Product: [Retired] oVirt
Reporter: Antoni Segura Puimedon <asegurap>
Component: ovirt-hosted-engine-ha
Assignee: Sandro Bonazzola <sbonazzo>
Status: CLOSED WORKSFORME
QA Contact: meital avital <mavital>
Severity: medium
Docs Contact:
Priority: medium
Version: 3.5
CC: bazulay, bugs, ecohen, gklein, iheim, rbalakri, sbonazzo, s.kieske, yeylon
Target Milestone: ---
Keywords: TestCaseNeeded, TestCaseProvided
Target Release: 3.5.1
Hardware: Unspecified
OS: Unspecified
URL: http://www.ovirt.org/QA:TestCase_Hosted_Engine_Upgrade
Whiteboard: sla
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-11-28 10:51:19 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: SLA
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Antoni Segura Puimedon 2014-07-30 12:14:29 UTC
Description of problem:
To upgrade a hosted engine environment I followed:
    http://www.ovirt.org/Hosted_Engine_Howto#Upgrade_Hosted_Engine
The 5th step says:
     Restart ha-agent and broker services(# service ovirt-ha-broker restart && service ovirt-ha-agent restart)

which translates to:
    systemctl restart ovirt-ha-broker 
    systemctl restart ovirt-ha-agent

However, after doing that, the daemons are not running and systemctl
shows:
    [root@dev-20 ~]# systemctl status ovirt-ha-broker
    ovirt-ha-broker.service - oVirt Hosted Engine High Availability Communications Broker
       Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service; enabled)
       Active: inactive (dead) since Wed 2014-07-30 13:55:19 CEST; 2min 44s ago
      Process: 26794 ExecStop=/lib/systemd/systemd-ovirt-ha-broker stop (code=exited, status=0/SUCCESS)
      Process: 26786 ExecStart=/lib/systemd/systemd-ovirt-ha-broker start (code=exited, status=0/SUCCESS)
     Main PID: 861 (code=killed, signal=KILL)

    Jul 30 13:55:19 dev-20.rhev.lab.eng.brq.redhat.com systemd[1]: Starting oVirt Hosted Engine High Availability Co......
    Jul 30 13:55:19 dev-20.rhev.lab.eng.brq.redhat.com systemd-ovirt-ha-broker[26794]: Stopping ovirt-ha-broker: [FAILED]
    Jul 30 13:55:19 dev-20.rhev.lab.eng.brq.redhat.com systemd[1]: Started oVirt Hosted Engine High Availability Com...er.
    Hint: Some lines were ellipsized, use -l to show in full.
    [root@dev-20 ~]# ps aux | grep ovirt

The only way to get the daemons up and running again was to bypass systemd and
do:
    /lib/systemd/systemd-ovirt-ha-broker restart
    /lib/systemd/systemd-ovirt-ha-agent restart


After doing that, hosted-engine --vm-status reports stale data, though this is
possibly unrelated:

    --== Host 2 status ==--

    Status up-to-date                  : False
    Hostname                           : 10.34.63.180
    Host ID                            : 2
    Engine status                      : unknown stale-data
    Score                              : 2000
    Local maintenance                  : False
    Host timestamp                     : 1406640293
    Extra metadata (valid at timestamp):
         metadata_parse_version=1
         metadata_feature_version=1
         timestamp=1406640293 (Tue Jul 29 15:24:53 2014)
         host-id=2
         score=2000
         maintenance=False
         bridge=True
         cpu-load=0.0856
         engine-health={"health": "good", "vm": "up", "detail": "up"}
         gateway=True
         mem-free=3192

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-1.2.1-0.2.master.20140725101556.fc20.noarch

How reproducible:
I only have one setup, so I did it once


Steps to Reproduce:
1. Follow the guide above
2. Try step 7: hosted-engine --set-maintenance --mode=none

Actual results:
    [root@dev-20 ~]# hosted-engine --set-maintenance --mode=none
    Traceback (most recent call last):
      File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
        "__main__", fname, loader, pkg_name)
      File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
        exec code in run_globals
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/set_maintenance.py", line 72, in <module>
        if not maintenance.set_mode(sys.argv[1]):
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_setup/set_maintenance.py", line 60, in set_mode
        value=m_global,
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 248, in set_maintenance_mode
        str(value))
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/client/client.py", line 196, in set_global_md_flag
        with broker.connection():
      File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
        return self.gen.next()
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 99, in connection
        self.connect()
      File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/brokerlink.py", line 78, in connect
        raise BrokerConnectionError(error_msg)
    ovirt_hosted_engine_ha.lib.exceptions.BrokerConnectionError: Failed to connect to broker, the number of errors has exceeded the limit (5)
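The final message in the traceback indicates that the client gave up after a fixed number of failed connection attempts. The sketch below illustrates that general pattern only. It is not the actual brokerlink.py code: `connect_with_limit` and `try_connect` are hypothetical names, and the local `BrokerConnectionError` class is a stand-in for the real exception.

```python
class BrokerConnectionError(Exception):
    """Stand-in for ovirt_hosted_engine_ha.lib.exceptions.BrokerConnectionError."""


def connect_with_limit(try_connect, max_errors=5):
    """Call try_connect() until it succeeds or max_errors failures occur.

    try_connect is a hypothetical callable that returns a connection object
    or raises OSError; it is not part of the real brokerlink API.
    """
    errors = 0
    while errors < max_errors:
        try:
            return try_connect()
        except OSError:
            # Count the failure and try again until the limit is reached.
            errors += 1
    raise BrokerConnectionError(
        "Failed to connect to broker, the number of errors has exceeded "
        "the limit (%d)" % max_errors
    )
```

With the broker not running, every attempt fails, so a loop of this shape would end in the exception shown above regardless of which hosted-engine subcommand was invoked.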

Expected results:
It silently succeeds


Additional info:

The workaround described above bypasses systemd, but its results are not
optimal.
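One way to confirm whether the restart in step 5 actually left both daemons running is to ask systemd directly before moving on to step 7. This is a minimal sketch that relies on `systemctl is-active --quiet`, which exits with 0 only when the unit is active. The helper names `is_active` and `inactive_units` are made up for illustration, and the `run` parameter exists only so the check can be exercised without systemd.

```python
import subprocess

# Unit names as given in this report.
UNITS = ["ovirt-ha-broker.service", "ovirt-ha-agent.service"]


def is_active(unit, run=subprocess.call):
    """Return True if `systemctl is-active --quiet UNIT` exits with 0."""
    return run(["systemctl", "is-active", "--quiet", unit]) == 0


def inactive_units(units, run=subprocess.call):
    """Return the subset of `units` that systemd does not report as active."""
    return [unit for unit in units if not is_active(unit, run)]
```

Calling `inactive_units(UNITS)` on the host after the restart would return an empty list if both services came back; in the situation described here it would be expected to list at least the broker.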

Comment 1 Sandro Bonazzola 2014-09-29 15:16:42 UTC
(In reply to Antoni Segura Puimedon from comment #0)
> Description of problem:
> To upgrade a hosted engine environment I followed:
>     http://www.ovirt.org/Hosted_Engine_Howto#Upgrade_Hosted_Engine
> The 5th step says:
>      Restart ha-agent and broker services(# service ovirt-ha-broker restart
> && service ovirt-ha-agent restart)
> 
> which translates to:
>     systemctl restart ovirt-ha-broker 
>     systemctl restart ovirt-ha-agent

Please note it translates to

  systemctl restart ovirt-ha-broker.service
  systemctl restart ovirt-ha-agent.service



> 
> However, after doing that, the daemons are not running and systemctl
> shows:
>     [root@dev-20 ~]# systemctl status ovirt-ha-broker
>     ovirt-ha-broker.service - oVirt Hosted Engine High Availability
> Communications Broker
>        Loaded: loaded (/usr/lib/systemd/system/ovirt-ha-broker.service;
> enabled)
>        Active: inactive (dead) since Wed 2014-07-30 13:55:19 CEST; 2min 44s
> ago
>       Process: 26794 ExecStop=/lib/systemd/systemd-ovirt-ha-broker stop
> (code=exited, status=0/SUCCESS)
>       Process: 26786 ExecStart=/lib/systemd/systemd-ovirt-ha-broker start
> (code=exited, status=0/SUCCESS)
>      Main PID: 861 (code=killed, signal=KILL)
> 
>     Jul 30 13:55:19 dev-20.rhev.lab.eng.brq.redhat.com systemd[1]: Starting
> oVirt Hosted Engine High Availability Co......
>     Jul 30 13:55:19 dev-20.rhev.lab.eng.brq.redhat.com
> systemd-ovirt-ha-broker[26794]: Stopping ovirt-ha-broker: [FAILED]
>     Jul 30 13:55:19 dev-20.rhev.lab.eng.brq.redhat.com systemd[1]: Started
> oVirt Hosted Engine High Availability Com...er.
>     Hint: Some lines were ellipsized, use -l to show in full.
>     [root@dev-20 ~]# ps aux | grep ovirt
> 
> The only way to get the daemons up and running again was to bypass systemd
> and
> do:
>     /lib/systemd/systemd-ovirt-ha-broker restart
>     /lib/systemd/systemd-ovirt-ha-agent restart
> 

Can you check whether, for some reason, the services were not reloaded by the rpm update?
Can you reproduce it when executing:

 systemctl reload ovirt-ha-broker.service
 systemctl reload ovirt-ha-agent.service

before trying to restart the services?

Comment 2 Sandro Bonazzola 2014-09-29 15:17:17 UTC
About the stale file, I've no clue right now.

Comment 3 Sven Kieske 2014-09-29 15:58:09 UTC
(In reply to Sandro Bonazzola from comment #1)

> 
> Can you check if for some reason the services have not been reloaded by the
> rpm update?


AFAIK rpm updates should never reload services on their own?

Comment 4 Sandro Bonazzola 2014-11-27 12:33:08 UTC
the spec file contains:

%postun
%if 0%{?with_systemd}
%systemd_postun_with_restart ovirt-ha-agent.service
%systemd_postun_with_restart ovirt-ha-broker.service
%else
...


so on upgrade the scriptlet reloads the systemd daemon and executes try-restart for the 2 services.

Comment 5 Sandro Bonazzola 2014-11-28 10:51:19 UTC
I can't reproduce the issue when upgrading an F20 host with an F19 engine VM.

I upgraded from 3.4.4 to the 3.5.1 snapshot from
http://jenkins.ovirt.org/view/Publishers/job/publish_ovirt_rpms_nightly_3.5/197/