Description of problem:
I'm back with an unsolvable problem! I've managed (somehow) to get my Hosted Engine into a state where it gets stuck after the console prints "probing edd... okay", and then the HA service force-reboots it, after which the issue returns. I suspect the kernel is corrupted.

How reproducible:
Honestly, I'm not sure. Perhaps a hard poweroff of the host at some unlucky point?

Actual results:
Well, it gets stuck.

Expected results:
It doesn't get stuck.

Additional info:
--console shows a console with an escape character ^], but nothing else, and it does not respond to input.

Here's the state:

--== Host 1 status ==--

conf_on_shared_storage : True
Status up-to-date      : True
Hostname               : s1.virt.stm.inf.demilletech.net
Host ID                : 1
Engine status          : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                  : 3400
stopped                : False
Local maintenance      : False
crc32                  : a1dceb37
local_conf_timestamp   : 597
Host timestamp         : 597
Extra metadata (valid at timestamp):
	metadata_parse_version=1
	metadata_feature_version=1
	timestamp=597 (Thu Apr 19 22:18:10 2018)
	host-id=1
	score=3400
	vm_conf_refresh_time=597 (Thu Apr 19 22:18:10 2018)
	conf_on_shared_storage=True
	maintenance=False
	state=EngineStarting
	stopped=False

And then it reboots. So no, I do not have working infrastructure right now :(
Hi, I'm missing information, mainly log files.

This line:

Engine status : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}

means that the hosted engine VM is up, but the engine inside it is not. This can result from multiple scenarios, ranging from network issues to VM corruption.

To stop the VM from rebooting, you can move the system to global maintenance using the command line:

hosted-engine --set-maintenance --mode=global

The next step will be to figure out the HE VM and engine status. Can you ssh into the VM? Check the ovirt-engine service status? Start the engine if it's not running? Get the engine log files?
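A reviewable dry run of the triage sequence above. The hosted-engine flags are the real CLI; ENGINE_FQDN and the exact log path are placeholder assumptions for illustration, not values from this bug.

```shell
#!/bin/sh
# Hypothetical triage plan: print the commands rather than run them,
# so each step can be vetted before touching a broken deployment.
ENGINE_FQDN=engine.example.com   # assumption: your engine VM's hostname

TRIAGE="hosted-engine --set-maintenance --mode=global
ssh root@$ENGINE_FQDN systemctl status ovirt-engine
ssh root@$ENGINE_FQDN tail -n 100 /var/log/ovirt-engine/engine.log"

echo "$TRIAGE"
```

Once the engine is recovered, remember to leave global maintenance again with --mode=none, or the HA agents will never restart the VM on failure.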
(I work with Eamonn) We've managed to get it to stop rebooting, but all the video console shows is the "probing edd" message. The network isn't up, and it doesn't respond to console input.
I experienced the same issue, and was able to recover.

For me the XFS file system of the HostedEngine was corrupted. This is a rough outline of the steps I took to recover.

# Manually install the libguestfs-xfs package
rpm -Uhv libguestfs-xfs-1.36.3-6.el7.x86_64.rpm --nodeps

# Determine the location of the HostedEngine disk
virsh -r dumpxml HostedEngine

# Run guestfish to repair the VM disk
LIBGUESTFS_BACKEND=direct guestfish --rw -a <disk>
><fs> run
><fs> list-filesystems
/dev/sda1: xfs
/dev/sda3: xfs
/dev/ovirt/audit: xfs
/dev/ovirt/home: xfs
/dev/ovirt/log: xfs
/dev/ovirt/swap: swap
/dev/ovirt/tmp: xfs
/dev/ovirt/var: xfs

# I repaired all filesystems, e.g.:
><fs> xfs-repair /dev/ovirt/log
><fs> exit

# Then I destroyed the loaded VM
hosted-engine --vm-poweroff

# Last I disabled EDD to complete boot - may not be needed

# Restart the HostedEngine paused
hosted-engine --vm-start-paused

# Add a VNC password
hosted-engine --add-console-password

# Connect via Remote Viewer

# Start the VM and quickly edit GRUB in the viewer (move fast)
virsh -c qemu://hostname.local/system resume HostedEngine
# Append the linux16 line with edd=off
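The repair loop above can be sketched as a dry run that builds the guestfish session for every XFS filesystem. DISK is a placeholder (on a real host, take it from `virsh -r dumpxml HostedEngine`), and the filesystem list is copied from the `list-filesystems` output above, minus the swap volume, which is not XFS and must not be passed to xfs-repair.

```shell
#!/bin/sh
# Hypothetical sketch: generate (not execute) the guestfish commands
# needed to xfs-repair each filesystem inside the HE disk image.
DISK=/path/to/hosted-engine-disk   # placeholder: from virsh dumpxml

# XFS filesystems from list-filesystems; /dev/ovirt/swap deliberately excluded
FILESYSTEMS="/dev/sda1 /dev/sda3 /dev/ovirt/audit /dev/ovirt/home /dev/ovirt/log /dev/ovirt/tmp /dev/ovirt/var"

PLAN="LIBGUESTFS_BACKEND=direct guestfish --rw -a $DISK run"
for fs in $FILESYSTEMS; do
    PLAN="$PLAN
xfs-repair $fs"
done

echo "$PLAN"
```

Printing the plan first keeps a record of exactly what was repaired, which is useful if the corruption later needs a root-cause analysis.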
(In reply to Rick Sherman from comment #4)
> I experienced the same issue, and was able to recover.
>
> For me the XFS file system of the HostedEngine was corrupted. This is a
> rough outline of the steps I took to recover.

Thanks. The question is why it got corrupted. Anything else suspicious?
Closing it as not systematically reproducible. Feel free to reopen it if needed.