Bug 1599110
| Summary: | Hosted-Engine HA failed and started looping, not recovering itself. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | ovirt-hosted-engine-ha | Assignee: | Martin Sivák <msivak> |
| Status: | CLOSED DUPLICATE | QA Contact: | meital avital <mavital> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.3 | CC: | lsurette |
| Target Milestone: | --- | Flags: | lsvaty: testing_plan_complete- |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-09 06:05:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
(In reply to Germano Veit Michel from comment #0)

> There was a problem with the hosted-engine metadata LV (somehow it got deactivated).

I think this could have been caused by the vdsm restart, as it cleans up "stale" LVs on start-up. Will try to reproduce.

Reproduced the LV deactivation by vdsm, but the problem does not fully reproduce in 4.2.4: it recovers automatically because the HA broker restarts.
```
# date && lvs | grep 8a9577f7-2e95-49a2-8ab8-8447c935446f
Mon Jul 9 15:58:40 AEST 2018
  8a9577f7-2e95-49a2-8ab8-8447c935446f 6c873880-0244-4823-90d5-03e4ec562cd8 -wi-a----- 1.00g

# systemctl restart vdsmd

vdsm.log:
2018-07-09 15:59:02,877+1000 INFO (hsm/init) [storage.LVM] Deactivating lvs: vg=6c873880-0244-4823-90d5-03e4ec562cd8 lvs=['8a9577f7-2e95-49a2-8ab8-8447c935446f', 'ad62b2ec-3f99-47cd-a8e0-d22d4b1b3d69', '3fc2c155-65b7-4b2c-988a-b8beab04a01a', 'c9042982-1ec4-4b74-8c97-b7ab119ac8d6'] (lvm:684)

# date && lvs | grep 8a9577f7-2e95-49a2-8ab8-8447c935446f
Mon Jul 9 15:59:23 AEST 2018
  8a9577f7-2e95-49a2-8ab8-8447c935446f 6c873880-0244-4823-90d5-03e4ec562cd8 -wi------- 1.00g
```
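In the `lvs` output above, the fifth character of the attribute field is the LV state bit: it flips from `a` (active) to `-` once vdsm deactivates the LV. A minimal sketch of checking that bit (the helper name `lv_is_active` is mine, not from the vdsm or HA codebase):

```python
def lv_is_active(lv_attr: str) -> bool:
    """Return True if an `lvs` attribute string (the lv_attr column)
    has the 'active' state bit set. That bit is the 5th character:
    '-wi-a-----' means active, '-wi-------' means deactivated."""
    return len(lv_attr) >= 5 and lv_attr[4] == "a"

# Attribute strings taken from the lvs output above:
assert lv_is_active("-wi-a-----")      # before the vdsm restart
assert not lv_is_active("-wi-------")  # after vdsm deactivated the LV
```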
```
StatusStorageThread::ERROR::2018-07-09 15:59:09,945::storage_broker::161::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/6c873880-0244-4823-90d5-03e4ec562cd8/8be3e565-99ee-4a07-b515-58ca38c2e8ec/3fc2c155-65b7-4b2c-988a-b8beab04a01a
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 152, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
OSError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/6c873880-0244-4823-90d5-03e4ec562cd8/8be3e565-99ee-4a07-b515-58ca38c2e8ec/3fc2c155-65b7-4b2c-988a-b8beab04a01a'
StatusStorageThread::ERROR::2018-07-09 15:59:09,946::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
    .format(str(e)))
RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/6c873880-0244-4823-90d5-03e4ec562cd8/8be3e565-99ee-4a07-b515-58ca38c2e8ec/3fc2c155-65b7-4b2c-988a-b8beab04a01a'
StatusStorageThread::ERROR::2018-07-09 15:59:09,947::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the broker
```
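The traceback above shows the failure path: `get_raw_stats` opens the metadata path with `O_DIRECT` and, when the open fails, wraps the `OSError` into a `RequestError`, which the status thread then turns into a broker restart. A simplified sketch of that error path, assuming the signatures seen in the traceback (the `RequestError` class here is a stand-in for the project's own exception):

```python
import os


class RequestError(Exception):
    """Stand-in for the RequestError raised by the real broker code."""


def get_raw_stats(path, block_size=4096):
    """Simplified from storage_broker.py: read one block of HA metadata.

    If the metadata volume's symlink under /var/run/vdsm/storage is gone
    (e.g. the LV was deactivated), os.open fails with ENOENT and we
    re-raise as RequestError, as seen in the log above.
    """
    direct_flag = getattr(os, "O_DIRECT", 0)  # O_DIRECT exists on Linux only
    try:
        fd = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
    except OSError as e:
        raise RequestError("failed to read metadata: {0}".format(str(e)))
    try:
        return os.read(fd, block_size)
    finally:
        os.close(fd)
```

In 4.2.4 this `RequestError` reaches `trigger_restart`, so the broker restarts and re-prepares its storage, which is why the issue recovers there.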
Tracked it down to this: BZ1583712
*** This bug has been marked as a duplicate of bug 1583712 ***
Description of problem:

There was a problem with the hosted-engine metadata LV (somehow it got deactivated). After this the HA mechanism was unable to recover from it. It looped through states, sending tons of emails, and never attempted to prepareImages/refresh the SD to bring that LV up again and recover from the incident. Storage was available all the time.

See the details:

1. At 15:58:25, the hosted-engine metadata volume LV was somehow deactivated. VDSM logs from this time have rotated; all I could find was that vdsm restarted 1 minute earlier, and it was not through soft fencing.

```
MainThread::INFO::2018-07-04 15:57:55,791::hosted_engine::506::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Required service vdsmd is not up.
StatusStorageThread::ERROR::2018-07-04 15:58:25,676::status_broker::92::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 88, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
    .format(str(e)))
RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/2fc4a59d-f27b-41be-a851-940694b94314/e14df275-5499-49fe-a158-4460948e54aa/f3efb31c-ee23-4a24-9c96-c695831409d6'
```

The missing path is indeed the metadata volume:

```
$ grep f3efb31c-ee23-4a24-9c96-c695831409d6 etc/ovirt-hosted-engine/hosted-engine.conf
metadata_volume_UUID=f3efb31c-ee23-4a24-9c96-c695831409d6
```

2. Then the agent restarts (due to vdsm restarting, I understand):

```
MainThread::INFO::2018-07-04 15:58:38,094::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2018-07-04 15:58:49,824::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.11 started
```

3. The HA mechanism did not recover from this anymore.
After the ha-agent restart it loops through the states below while the HE VM was running on another host (HE disk lock acquired):

EngineDown - EngineStart - EngineStarting - EngineForceStop - ReinitializeFSM

This sent dozens of emails, as this specific host was unable to read the metadata LV to see the other hosts' states and therefore kept trying (and failing) to start the HE VM. For example, after 15:58, host id 1's timestamp (the logs above are from host id 5) is always 2265573 from the point of view of this failed host. But the agent does not try to refresh the storage domain or prepare images to bring up the metadata LV and read a newer status of the other hosts. It just keeps looping.

This is a 5-host HE cluster; only host id 5 faced this problem, all other 4 hosts were fine.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch

How reproducible:
Unknown

Actual results:
HA agent of a single host looping, not recovering.

Expected results:
Bring up the missing LV (prepareVolumes/refresh storage) and recover.
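The reported loop versus the expected recovery can be sketched as a tiny state machine. This is purely illustrative (the class, method names, and metadata contents below are mine, not the agent's real FSM): without a storage-refresh step a host that cannot read the metadata keeps cycling through force-stop/reinit; with a refresh attempt it can reactivate the volume and settle back to EngineDown.

```python
class HostedEngineFSMSketch:
    """Illustrative model of the reported loop and the proposed fix."""

    def __init__(self, storage, refresh_on_error=False):
        self.storage = storage              # dict standing in for the metadata LV
        self.refresh_on_error = refresh_on_error
        self.state = "EngineDown"
        self.history = [self.state]

    def _move(self, state):
        self.state = state
        self.history.append(state)

    def step(self):
        if "metadata" not in self.storage:
            if self.refresh_on_error:
                # Proposed behaviour: prepareVolumes / refresh the storage
                # domain, which brings the deactivated LV back up.
                self.storage["metadata"] = "host 1 ts=2265580"
                self._move("EngineDown")    # re-read peer state, stay down
                return
            # Reported behaviour: blind start attempt, force stop, reinit.
            for s in ("EngineStart", "EngineStarting",
                      "EngineForceStop", "ReinitializeFSM", "EngineDown"):
                self._move(s)


looping = HostedEngineFSMSketch({})
looping.step(); looping.step()        # cycles through EngineForceStop twice

fixed = HostedEngineFSMSketch({}, refresh_on_error=True)
fixed.step(); fixed.step()            # recovers the metadata, never force-stops
```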