Bug 1365242 - 3.4->3.5->3.6->4.0 SHE migration: ovirt-ha-agent not working correctly / state=AgentStopped
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.0.1
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ovirt-4.0.4
Target Release: 2.0.4
Assignee: Simone Tiraboschi
QA Contact: Jiri Belka
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-08-08 17:04 UTC by Jiri Belka
Modified: 2016-09-26 12:35 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
With 4.0 we moved to the jsonrpc protocol; this fix adds additional checks on jsonrpc responses.
Clone Of:
Environment:
Last Closed: 2016-09-26 12:35:39 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: exception+
ylavi: planning_ack+
sbonazzo: devel_ack+
mavital: testing_ack+



Links
System        ID     Private  Priority  Status  Summary                                  Last Updated
oVirt gerrit  62162  0        master    MERGED  jsonrpc: safely parsing empty responses  2016-08-30 10:06:40 UTC
oVirt gerrit  63001  0        v2.0.z    MERGED  jsonrpc: safely parsing empty responses  2016-09-02 10:01:00 UTC

Description Jiri Belka 2016-08-08 17:04:44 UTC
Description of problem:

After following the SHE migration path 3.4->3.5->3.6->4.0 and ending global maintenance, the HE VM was not started automatically, and hosted-engine --vm-status showed the agent in 'state=AgentStopped'.

Manually starting the HE VM with hosted-engine --vm-start worked fine.

~~~
# hosted-engine --vm-status | sed 's/rhev.lab.eng.brq.redhat/example.com/'
/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/storage_backends.py:15: DeprecationWarning: vdscli uses xmlrpc. since ovirt 3.6 xmlrpc is deprecated, please use vdsm.jsonrpcvdscli
  import vdsm.vdscli


--== Host 1 status ==--

Status up-to-date                  : False
Hostname                           : 10-34-60-151.example.com.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : e9f3cf55
Host timestamp                     : 231104
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=231104 (Mon Aug  8 15:53:34 2016)
        host-id=1
        score=0
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

Status up-to-date                  : True
Hostname                           : 10-34-60-215.example.com.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 0
stopped                            : False
Local maintenance                  : True
crc32                              : 2e113351
Host timestamp                     : 234997
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=234997 (Mon Aug  8 18:39:40 2016)
        host-id=2
        score=0
        maintenance=True
        state=LocalMaintenance
        stopped=False
~~~

The log contains many occurrences of "ovirt-ha-agent ovirt_hosted_engine_ha.agent.agent.Agent ERROR Error: ''items'' - trying to restart agent".

Both hosts were EL7 with 4.0 rpms.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.0.1-1.el7ev.noarch

How reproducible:
Hard to reproduce, if it is reproducible at all.

Steps to Reproduce:
1. Discovered as part of the 3.4->3.5->3.6->4.0 SHE migration.

Actual results:
HE VM was not started after ending global maintenance

Expected results:
HE VM should be started automatically.

Additional info:

Comment 2 Simone Tiraboschi 2016-08-09 15:43:45 UTC
The issue is here:

MainThread::WARNING::2016-08-08 15:51:03,712::hosted_engine::480::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Unexpected error
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 445, in start_monitoring
    self._initialize_storage_images()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/agent/hosted_engine.py", line 667, in _initialize_storage_images
    img.prepare_images()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/image.py", line 141, in prepare_images
    for volUUID in vm_vol_uuid_list['items']:
KeyError: 'items'
MainThread::INFO::2016-08-08 15:51:05,328::hosted_engine::496::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring) Sleeping 60 seconds
MainThread::INFO::2016-08-08 15:52:05,455::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1470664325.46 type=state_transition detail=GlobalMaintenance-ReinitializeFSM hostname='10-34-60-151.rhev.lab.eng.brq.redhat.com'

It seems that at a certain point you got an image without a volume (still not sure how), and our code failed while scanning it.
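
The traceback above fails on a direct ['items'] lookup when the response has no such key. The following is only a minimal sketch of a defensive lookup in the spirit of the linked gerrit change title ("safely parsing empty responses"), not the merged patch itself. The helper name extract_volume_uuids and the sample payloads are hypothetical and are not taken from ovirt-hosted-engine-ha or vdsm.

```python
def extract_volume_uuids(response):
    """Return the list of volume UUIDs from a response dict.

    Hypothetical helper: tolerates a missing or empty 'items' key
    (an image with no volumes) instead of raising KeyError.
    """
    if not isinstance(response, dict):
        return []
    # .get() yields None when the key is absent; "or []" also covers
    # an explicit None or an empty value.
    return list(response.get('items') or [])


# Hypothetical payloads, for illustration only.
normal = {'items': ['vol-uuid-1', 'vol-uuid-2']}
empty = {}

print(extract_volume_uuids(normal))
print(extract_volume_uuids(empty))

# The unguarded lookup, as in the traceback, fails on the empty case.
try:
    for vol_uuid in empty['items']:
        pass
except KeyError as err:
    print('KeyError:', err)
```

With a guard like this, an image without volumes would yield an empty list to iterate over rather than an exception that sends the agent into its restart loop. Whether an empty list should be treated as an error further up is a separate decision that this sketch does not address.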

Comment 3 Jiri Belka 2016-09-19 23:27:31 UTC
OK with ovirt-hosted-engine-ha-2.0.4-1.el7ev.noarch.

The issue described in comment #2 no longer appears.

