Bug 1569827 - Hosted Engine stuck after (probing edd... ok)
Summary: Hosted Engine stuck after (probing edd... ok)
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: General
Version: 2.2.10
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.2.4
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-20 03:35 UTC by Eamonn Nugent
Modified: 2018-07-09 18:35 UTC
CC: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-05-25 15:20:27 UTC
oVirt Team: Integration
Embargoed:
rick: needinfo-
rule-engine: ovirt-4.2+
rule-engine: exception+



Description Eamonn Nugent 2018-04-20 03:35:44 UTC
Description of problem:
I'm back with an unsolvable problem! I've managed (somehow) to get my Hosted Engine into a state where it gets stuck after the console prints (probing edd... ok); the HA service then force-reboots it, and the same thing happens again. I suspect the kernel is corrupted.


How reproducible:
Honestly, I'm not sure. Perhaps a hard power-off of the host at some inopportune moment?


Actual results:
Well, it gets stuck at (probing edd... ok), and the HA service force-reboots it in a loop.

Expected results:
The VM boots normally instead of getting stuck.

Additional info:
hosted-engine --console shows a console with the escape-character notice (^]) but nothing else, and it does not respond to input. Here's the state:

--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : s1.virt.stm.inf.demilletech.net
Host ID                            : 1
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : a1dceb37
local_conf_timestamp               : 597
Host timestamp                     : 597
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=597 (Thu Apr 19 22:18:10 2018)
	host-id=1
	score=3400
	vm_conf_refresh_time=597 (Thu Apr 19 22:18:10 2018)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineStarting
	stopped=False

And then it reboots. So no, I do not have working infrastructure right now :(

Comment 1 Doron Fediuck 2018-04-20 08:50:11 UTC
Hi,
I'm missing information, mainly log files.

This line:

Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}

means that the hosted engine VM is up but the engine inside it is not. This can be the result of several scenarios, ranging from network issues to VM corruption.
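As far as I know, the liveliness check itself is just an HTTP probe of the engine's health page, so you can also run it by hand from the host (a sketch; engine.example.com is a placeholder for your engine FQDN):

# Manual version of the liveliness check
curl http://engine.example.com/ovirt-engine/services/health
# A healthy engine answers HTTP 200 with a short status string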

To stop the VM from being rebooted, you can move the system to global maintenance from the command line:
hosted-engine --set-maintenance --mode=global
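You can confirm the mode took effect with hosted-engine --vm-status, and later leave maintenance with:

hosted-engine --set-maintenance --mode=none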

The next step will be to figure out the state of the HE VM and the engine.
Can you ssh into the VM? Check the ovirt-engine service status? Start the engine if it's not running? Get the engine log files?
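For example (a rough sketch; engine-vm.example.com is a placeholder and the paths are the oVirt defaults):

ssh root@engine-vm.example.com         # from the host, if the VM's network is up
systemctl status ovirt-engine          # check the engine service
systemctl start ovirt-engine           # start it if it is down
less /var/log/ovirt-engine/engine.log  # the engine log to attach here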

Comment 3 Julia DeMille 2018-04-20 21:25:01 UTC
(I work with Eamonn.) We've managed to get it to stop rebooting, but all the video console shows is the EDD probing message. The network isn't up, and it doesn't respond to console input.

Comment 4 Rick Sherman 2018-05-15 23:46:14 UTC
I experienced the same issue, and was able to recover.

For me, the XFS file systems of the HostedEngine disk were corrupted. This is a rough outline of the steps I took to recover.

# Manually install libguestfs-xfs package 
rpm -Uhv libguestfs-xfs-1.36.3-6.el7.x86_64.rpm --nodeps

# Determine location of HostedEngine disk
virsh -r dumpxml HostedEngine
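# e.g. to pull just the disk source out of the XML (a sketch; the attribute
# is file= for file-backed storage, dev= for block devices)
virsh -r dumpxml HostedEngine | grep -E '<source (file|dev)='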

# run guestfish to repair VM disk
LIBGUESTFS_BACKEND=direct guestfish --rw -a <disk>

><fs> run
><fs> list-filesystems
/dev/sda1: xfs
/dev/sda3: xfs
/dev/ovirt/audit: xfs
/dev/ovirt/home: xfs
/dev/ovirt/log: xfs
/dev/ovirt/swap: swap
/dev/ovirt/tmp: xfs
/dev/ovirt/var: xfs

# I repaired every filesystem listed above (one shown here; a scripted
# version follows below)
><fs> xfs-repair /dev/ovirt/log
><fs> exit
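# Alternatively, guestfish takes commands on its own command line, so the
# repairs can be scripted in one pass (a sketch, assuming the same volume
# layout as the list-filesystems output above; <disk> as before)
for fs in /dev/sda1 /dev/sda3 /dev/ovirt/{audit,home,log,tmp,var}; do
    LIBGUESTFS_BACKEND=direct guestfish --rw -a <disk> run : xfs-repair "$fs"
done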

# Then I destroyed the loaded VM
hosted-engine --vm-poweroff

# Last, I disabled EDD to complete the boot - this may not be needed

# Restart the HostedEngine VM paused
hosted-engine --vm-start-paused

# Add VNC password
hosted-engine --add-console-password

# Connect via Remote Viewer
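# e.g., assuming the HE VM's VNC is on the host's default port 5900:
remote-viewer vnc://hostname.local:5900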

# Start the VM and quickly edit GRUB in the viewer (move fast)
virsh -c qemu://hostname.local/system resume HostedEngine

# Append to the linux16 line:
edd=off
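# To make the workaround permanent once the VM is up (the usual RHEL 7
# procedure; adjust as needed): inside the engine VM, append edd=off to
# GRUB_CMDLINE_LINUX in /etc/default/grub, then regenerate the config
grub2-mkconfig -o /boot/grub2/grub.cfg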

Comment 5 Simone Tiraboschi 2018-05-16 10:25:01 UTC
(In reply to Rick Sherman from comment #4)
> I experienced the same issue, and was able to recover.
> 
> For me the XFS file system of the HostedEngine was corrupted.  This is a
> rough outline of the steps I took to recover.

Thanks - the real question is why it got corrupted in the first place.
Did you notice anything else suspicious?

Comment 6 Simone Tiraboschi 2018-05-25 15:20:27 UTC
Closing as not systematically reproducible.
Feel free to reopen if needed.

