Bug 1583712 - hosted-engine metadata are not correctly read and written on hosts set into maintenance mode from the engine
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Broker
Version: 2.2.12
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.2.4
Target Release: 2.2.13
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Duplicates: 1599110
Depends On:
Blocks: ovirt-hosted-engine-ha-2.2.14
 
Reported: 2018-05-29 14:11 UTC by Simone Tiraboschi
Modified: 2021-09-09 14:29 UTC (History)

Fixed In Version: ovirt-hosted-engine-ha-2.2.13-1.el7ev
Doc Type: Bug Fix
Doc Text:
Setting a host into maintenance mode from the engine implicitly disconnects the storage. In the past, the hosted-engine storage domain was reconnected by ovirt-ha-agent so that the metadata area could be read and written. That capability has now been moved to ovirt-ha-broker.
Clone Of:
Environment:
Last Closed: 2018-06-26 08:42:41 UTC
oVirt Team: Integration
rule-engine: ovirt-4.2+
rule-engine: blocker+



Links
System ID Private Priority Status Summary Last Updated
Red Hat Issue Tracker RHV-43513 0 None None None 2021-09-09 14:29:41 UTC
Red Hat Knowledge Base (Solution) 3238221 0 None None None 2018-07-09 06:27:19 UTC
oVirt gerrit 91739 0 'None' MERGED broker: restart the service on storage errors 2021-01-09 02:47:32 UTC
oVirt gerrit 91792 0 'None' MERGED broker: restart the service on storage errors 2021-01-09 02:47:32 UTC

Description Simone Tiraboschi 2018-05-29 14:11:27 UTC
Description of problem:
Setting maintenance mode from the engine disconnects all the configured storage domains, including the HE one.
In the past, ovirt-ha-agent was able to detect that and restart itself in order to reconnect the HE storage domain.
In 4.2, the storage domain connection has been moved to ovirt-ha-broker, which doesn't handle the storage reconnection.
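
The restart-on-error behaviour described above (and named in the linked gerrit patches, "broker: restart the service on storage errors") can be illustrated with a minimal Python sketch. This is not the merged patch: the class name StorageErrorWatchdog, the max_failures threshold, and the idea of counting consecutive EIO failures are all assumptions made for illustration only.

```python
import errno


class StorageErrorWatchdog(object):
    """Hypothetical helper: decide when a service should stop and be restarted."""

    def __init__(self, max_failures=3):
        # Illustrative threshold; the real broker may use a different policy.
        self.max_failures = max_failures
        self.failures = 0

    def run_once(self, operation):
        """Run one storage operation; return True once a restart is warranted."""
        try:
            operation()
        except (IOError, OSError) as e:
            # Only count Input/output errors, as seen in broker.log;
            # anything else is re-raised unchanged.
            if e.errno != errno.EIO:
                raise
            self.failures += 1
        else:
            # A successful operation resets the consecutive-failure counter.
            self.failures = 0
        return self.failures >= self.max_failures
```

A caller would typically exit with a non-zero status when run_once() returns True and rely on the service manager to start the process again, at which point the storage domain gets reconnected during startup. That last part is also an assumption about deployment, not something stated in this report.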

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha 2.2.12

How reproducible:
100%

Steps to Reproduce:
1. Deploy hosted-engine with 2 or more HE hosts.
2. Move the host where the engine VM is running into maintenance mode from the engine.

Actual results:
The engine VM correctly migrates to another host.
The host is moved into maintenance mode, which implies HE local maintenance mode.

All the storage domains, including the HE one, are disconnected, and nothing reconnects the HE storage domain until the user manually restarts ovirt-ha-broker.

In broker.log we see something like:
 Thread-7::INFO::2018-05-29 13:54:30,895::engine_health::191::engine_health.EngineHealth::(_result_from_stats) VM successfully migrated away from this host.
 Thread-3::INFO::2018-05-29 13:54:30,903::mem_free::51::mem_free.MemFree::(action) memFree: 5191
 Thread-5::INFO::2018-05-29 13:54:37,700::engine_health::94::engine_health.EngineHealth::(action) VM not on this host
 StatusStorageThread::ERROR::2018-05-29 13:54:38,925::storage_broker::161::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/3e80ea57-56f2-4b97-9723-331d37319678/9b572739-1f4c-40ca-89e6-5c0b337f96af/d43a872f-239a-48cf-b667-ba68758bb02b
 Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 156, in get_raw_stats
     fin.readinto(direct_io_buffer)
 IOError: [Errno 5] Input/output error
 StatusStorageThread::ERROR::2018-05-29 13:54:38,941::status_broker::92::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
 Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 88, in run
     self._storage_broker.get_raw_stats()
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
     .format(str(e)))
 RequestError: failed to read metadata: [Errno 5] Input/output error
 Listener::INFO::2018-05-29 13:54:39,132::storage_broker::304::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(_get_domain_monitor_status) VDSM domain monitor status: NONE
 StatusStorageThread::ERROR::2018-05-29 13:54:39,159::storage_broker::211::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(put_stats) Failed to write metadata for host 1 to /var/run/vdsm/storage/3e80ea57-56f2-4b97-9723-331d37319678/9b572739-1f4c-40ca-89e6-5c0b337f96af/d43a872f-239a-48cf-b667-ba68758bb02b
 Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 207, in put_stats
     uninterruptible(os.write, f, direct_io_buffer)
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/lib/util.py", line 183, in uninterruptible
     return method(*args, **kwargs)
 OSError: [Errno 5] Input/output error
 StatusStorageThread::ERROR::2018-05-29 13:54:39,160::status_broker::85::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to update state.
 Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 81, in run
     entry.data
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 213, in put_stats
     .format(str(e)))
 RequestError: failed to write metadata: [Errno 5] Input/output error
 StatusStorageThread::ERROR::2018-05-29 13:54:39,162::storage_broker::161::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/3e80ea57-56f2-4b97-9723-331d37319678/9b572739-1f4c-40ca-89e6-5c0b337f96af/d43a872f-239a-48cf-b667-ba68758bb02b
 Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 156, in get_raw_stats
     fin.readinto(direct_io_buffer)
 IOError: [Errno 5] Input/output error
 StatusStorageThread::ERROR::2018-05-29 13:54:39,163::status_broker::92::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
 Traceback (most recent call last):
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 88, in run
     self._storage_broker.get_raw_stats()
   File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
     .format(str(e)))
 RequestError: failed to read metadata: [Errno 5] Input/output error
 Thread-8::WARNING::2018-05-29 13:54:39,431::storage_domain::60::storage_domain.EngineHealth::(action) Hosted-engine storage domain is in invalid state
 Thread-3::INFO::2018-05-29 13:54:39,924::mem_free::51::mem_free.MemFree::(action) memFree: 5193
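
To spot this failure pattern in a longer broker.log, the entries can be parsed and the ERROR ones counted per logger and function. The following is a rough Python sketch; the field layout is inferred only from the excerpt above, and the names LOG_LINE, parse_line and count_errors are hypothetical.

```python
import re
from collections import Counter

# Layout inferred from the excerpt:
# thread::LEVEL::timestamp::module::lineno::logger::(function) message
LOG_LINE = re.compile(
    r"^\s*(?P<thread>[^:]+)::(?P<level>[A-Z]+)::"
    r"(?P<time>\d{4}-\d\d-\d\d \d\d:\d\d:\d\d,\d+)::"
    r"(?P<module>[^:]+)::(?P<lineno>\d+)::(?P<logger>[^:]+)::"
    r"\((?P<func>[^)]+)\)\s*(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict of fields, or None for traceback/continuation lines."""
    match = LOG_LINE.match(line)
    return match.groupdict() if match else None


def count_errors(lines):
    """Count ERROR entries, keyed by (logger, function)."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["level"] == "ERROR":
            counts[(entry["logger"], entry["func"])] += 1
    return counts
```

Applied to the excerpt above, repeated ERROR entries from get_raw_stats and put_stats right after the "VM successfully migrated away from this host" message would point to the disconnected HE storage domain.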


On the host set into maintenance mode, hosted-engine --vm-status reports the host's own engine status as '{"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}' and all the other hosts as 'unknown stale-data'.

All the other hosts, in turn, report the host in maintenance mode as 'unknown stale-data'.
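
The engine status string quoted above is plain JSON, so it can be checked programmatically. A small Python sketch follows; the function name summarize_engine_status is hypothetical, and treating "good"/"up" as the healthy values is an assumption, since this report only shows the "bad"/"down" case.

```python
import json


def summarize_engine_status(engine_status_text):
    """Parse an engine status JSON string into a small summary dict."""
    status = json.loads(engine_status_text)
    # Assumption: a healthy engine reports health "good" and vm "up".
    healthy = status.get("health") == "good" and status.get("vm") == "up"
    return {
        "healthy": healthy,
        "reason": status.get("reason", ""),
        "detail": status.get("detail", ""),
    }


# The status string reported by the host in maintenance mode.
reported = ('{"reason": "vm not running on this host", "health": "bad", '
            '"vm": "down", "detail": "unknown"}')
summary = summarize_engine_status(reported)
```

Note that an unhealthy summary on its own is expected for a host that is not running the engine VM; the symptom specific to this bug is the 'unknown stale-data' state reported for the other hosts.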


Expected results:
The hosted-engine storage domain is correctly reconnected on the host set into maintenance mode, and hosted-engine --vm-status works correctly on all the hosts.


Additional info:

Comment 1 Raz Tamir 2018-05-30 10:38:33 UTC
This bug was discovered by automation; adding keyword.

Comment 2 Nikolai Sednev 2018-06-10 14:14:01 UTC
Works for me on these components:
ovirt-hosted-engine-ha-2.2.13-1.el7ev.noarch
ovirt-hosted-engine-setup-2.2.22-1.el7ev.noarch
rhvm-appliance-4.2-20180601.0.el7.noarch
ovirt-engine-setup-4.2.4.1-0.1.el7.noarch
Linux 3.10.0-862.6.1.el7.x86_64 #1 SMP Mon Jun 4 15:33:25 EDT 2018 x86_64 x86_64 x86_64 GNU/Linux
Red Hat Enterprise Linux Server release 7.5 (Maipo)

No errors in broker.log were discovered and migrations passed without any issues.
Tested over NFS deployment.

Comment 3 Sandro Bonazzola 2018-06-26 08:42:41 UTC
This bugzilla is included in the oVirt 4.2.4 release, published on June 26th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.4 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

Comment 4 Germano Veit Michel 2018-07-09 06:05:27 UTC
*** Bug 1599110 has been marked as a duplicate of this bug. ***

