Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1892416

Summary: QEMU guest non responsive
Product: [oVirt] ovirt-guest-agent
Reporter: Mister X <phonixx6>
Component: General
Assignee: Tomáš Golembiovský <tgolembi>
Status: CLOSED DUPLICATE
QA Contact: Lucie Leistnerova <lleistne>
Severity: high
Docs Contact:
Priority: unspecified
Version: 1.0.16
CC: ahadas, bugs
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-11-03 17:20:40 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                     Flags
vddm.log                        none
libvirt guest log               none
messages.log                    none
sanlock                         none
journalctl the last 2 days      none
vdsm                            none
messages.log_Host               none
agent.log                       none
messages.log_Engine             none
engine-journalctl               none
host-journalctl                 none
sanlock                         none

Description Mister X 2020-10-28 18:12:57 UTC
Created attachment 1724881 [details]
vddm.log

Description of problem:
The guest is stuck on "reboot in progress"; neither shutdown nor power off from the GUI helps.
Even a complete restart of the host, the engine, and the backend storage does not solve the problem.

Version-Release number of selected component (if applicable):
oVirt Software Version: 4.4.2.6-1.el8
RHL-release-8.2-2.2004.0.2.el8.x86_64

How reproducible:
100% - persistent after restart and power off 

Steps to Reproduce:
1. Start the host and the engine.
2. Connect to the hosted engine and try to control the guest in any way.
3. Try to interact with the guest (restart, power off, console); the guest agent is not responding.

Actual results:
Any interaction with the guest fails (Remove is greyed out).

Expected results:
The guest can be controlled and completes the requested action.

Additional info: 

hosted-engine --clean-metadata
INFO:ovirt_hosted_engine_ha.agent.agent.Agent:ovirt-hosted-engine-ha agent 2.4.4 started
INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Certificate common name not found, using hostname to identify host
INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing ha-broker connection
INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Broker initialized, all submonitors started
ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Cannot clean unclean metadata block. Consider --force-clean.
INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down

Running hosted-engine --set-maintenance --mode=global followed by a restart also does not work.
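
(For completeness, the variant that the error message above suggests would be roughly the following; this is only a sketch of the error's own hint, not something attempted in this report, and --force-clean overrides the safety check that produced the "Cannot clean unclean metadata block" error, so use it with care:)

   hosted-engine --clean-metadata --force-clean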

Thanks and KR,

Comment 1 Mister X 2020-10-28 18:13:46 UTC
Created attachment 1724882 [details]
libvirt guest log

Comment 2 Mister X 2020-10-28 18:14:38 UTC
Created attachment 1724884 [details]
messages.log

Comment 3 Mister X 2020-10-28 18:15:33 UTC
Created attachment 1724885 [details]
sanlock

Comment 4 Tomáš Golembiovský 2020-10-29 09:32:15 UTC
To be clear, is this bug about a problem with the Hosted Engine VM or a regular VM?
Please attach the journal log from the VM containing messages between the reboot and the power-off of the VM:

1) enable persistent journal by: mkdir /var/log/journal
2) perform reboot of VM
3) wait a minute or two
4) force power off the VM
5) start VM again
6) get logs from previous run: journalctl -b -1 > vm-journal.log
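
For reference, the whole sequence as commands run inside the guest would be roughly the following (a sketch that simply mirrors the steps above; the output file name is just an example):

   # 1) enable persistent journal storage
   mkdir /var/log/journal
   # 2) reboot the VM
   reboot
   # 3)-5) wait a minute or two, force power off the VM from the engine, then start it again
   # 6) collect the journal of the previous boot
   journalctl -b -1 > vm-journal.log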

Comment 5 RHEL Program Management 2020-10-29 09:32:23 UTC
The documentation text flag should only be set after 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

Comment 6 Mister X 2020-10-29 09:57:56 UTC
(In reply to Tomáš Golembiovský from comment #4)
> To be clear, is this bug about problem with Hosted Engine VM or a regular VM?
> Please attach journal log from the VM containing messages between reboot and
> poweroff of the VM:
Hi Tomáš,

Thanks for your reply.

> 1) enable persistent journal by: mkdir /var/log/journal
> 2) perform reboot of VM
The VM is stuck on reboot endlessly, no possible way to overcome this!
> 3) wait a minute or two
> 4) force power off the VM
Not possible, power off produces no reaction whatsoever - the VM will not power off!
> 5) start VM again
Not possible, dependent on step 4
> 6) get logs from previous run: journalctl -b -1 > vm-journal.log

I will follow the suggested steps to generate the requested log collection and upload it ASAP.

Thanks,

Comment 7 Mister X 2020-10-29 10:02:24 UTC
Forgot to mention:
The VM is not the hosted-engine!
It’s a normal Linux VM!
Thanks 😊

Comment 8 Mister X 2020-10-29 14:09:08 UTC
I have created the folder as requested and tried to power off/shut down; restart and start are both not selectable.

Here is the output:
-- Logs begin at Thu 2020-10-29 14:55:03 CET, end at Thu 2020-10-29 15:05:33 CET. --
Oct 29 14:55:03 vapp.int.example.com sshd[12656]: pam_unix(sshd:session): session closed for user root
Oct 29 14:55:03 vapp.int.example.com systemd-logind[1450]: Session 11 logged out. Waiting for processes to exit.
Oct 29 14:55:03 vapp.int.example.com systemd-logind[1450]: Removed session 11.
Oct 29 14:55:49 vapp.int.example.com su[12039]: pam_unix(su:session): session closed for user root
Oct 29 14:55:49 vapp.int.example.com systemd-journald[862]: Time spent on flushing to /var is 1.253ms for 3 entries.
Oct 29 14:56:13 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 14:56:13 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 14:56:13 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 14:56:13 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.
Oct 29 14:56:18 vapp.int.example.com su[13359]: (to root) user on pts/0
Oct 29 14:56:18 vapp.int.example.com su[13359]: pam_systemd(su:session): Cannot create session: Already running in a session or user slice
Oct 29 14:56:18 vapp.int.example.com su[13359]: pam_unix(su:session): session opened for user root by user(uid=1000)
Oct 29 14:58:33 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 14:58:33 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 14:58:33 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 14:58:33 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.
Oct 29 14:59:03 vapp.int.example.com systemd[1]: Starting Cleanup of Temporary Directories...
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/certmonger.conf:3] Line references path below legacy directory /var/run/, updating /var/run/certmonger → /run/certmonger; please update the tmpfiles.d/ dr>
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:1] Line references path below legacy directory /var/run/, updating /var/run/vdsm → /run/vdsm; please update the tmpfiles.d/ drop-in file accordi>
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:2] Line references path below legacy directory /var/run/, updating /var/run/vdsm/dhclientmon → /run/vdsm/dhclientmon; please update the tmpfiles>
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:3] Line references path below legacy directory /var/run/, updating /var/run/vdsm/trackedInterfaces → /run/vdsm/trackedInterfaces; please update >
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:4] Line references path below legacy directory /var/run/, updating /var/run/vdsm/v2v → /run/vdsm/v2v; please update the tmpfiles.d/ drop-in file>
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:5] Line references path below legacy directory /var/run/, updating /var/run/vdsm/vhostuser → /run/vdsm/vhostuser; please update the tmpfiles.d/ >
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:6] Line references path below legacy directory /var/run/, updating /var/run/vdsm/payload → /run/vdsm/payload; please update the tmpfiles.d/ drop>
Oct 29 14:59:03 vapp.int.example.com systemd[1]: Started Cleanup of Temporary Directories.
Oct 29 15:00:09 vapp.int.example.com systemd[1]: Starting system activity accounting tool...
Oct 29 15:00:09 vapp.int.example.com systemd[1]: Started system activity accounting tool.
Oct 29 15:00:53 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 15:00:53 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 15:00:53 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 15:00:53 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.
Oct 29 15:01:01 vapp.int.example.com CROND[15127]: (root) CMD (run-parts /etc/cron.hourly)
Oct 29 15:01:01 vapp.int.example.com run-parts[15130]: (/etc/cron.hourly) starting 0anacron
Oct 29 15:01:01 vapp.int.example.com anacron[15136]: Anacron started on 2020-10-29
Oct 29 15:01:01 vapp.int.example.com anacron[15136]: Will run job `cron.daily' in 45 min.
Oct 29 15:01:01 vapp.int.example.com anacron[15136]: Jobs will be executed sequentially
Oct 29 15:01:01 vapp.int.example.com run-parts[15138]: (/etc/cron.hourly) finished 0anacron
Oct 29 15:01:01 vapp.int.example.com run-parts[15140]: (/etc/cron.hourly) starting vdsm-logrotate
Oct 29 15:01:01 vapp.int.example.com run-parts[15146]: (/etc/cron.hourly) finished vdsm-logrotate
Oct 29 15:03:13 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 15:03:13 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 15:03:13 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 15:03:13 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.
Oct 29 15:05:11 vapp.int.example.com sshd[16640]: Accepted publickey for root from 192.168.2.132 port 53800 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxx
Oct 29 15:05:11 vapp.int.example.com systemd-logind[1450]: New session 12 of user root.
Oct 29 15:05:11 vapp.int.example.com systemd[1]: Started Session 12 of user root.
Oct 29 15:05:11 vapp.int.example.com sshd[16640]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 29 15:05:11 vapp.int.example.com platform-python[16678]: ansible-setup Invoked with gather_subset=['all'] gather_timeout=10 filter=* fact_path=/etc/ansible/facts.d
Oct 29 15:05:13 vapp.int.example.com platform-python[16804]: ansible-command Invoked with _raw_params=yum check-update -q | cut -d ' ' -f1 | sed '/^$/d' >> /tmp/yum_updates _uses_shell=True warn=True stdin_add_newline=True strip_empty_en>
Oct 29 15:05:13 vapp.int.example.com platform-python[16804]: ansible-command [WARNING] Consider using the yum module rather than running 'yum'.  If you need to use command because yum is insufficient you can add 'warn: false' to this com>
Oct 29 15:05:14 vapp.int.example.com platform-python[16829]: ansible-lineinfile Invoked with path=/tmp/yum_updates line=Obsoleting state=absent backrefs=False create=False backup=False firstmatch=False follow=False regexp=None insertafte>
Oct 29 15:05:15 vapp.int.example.com platform-python[16840]: ansible-slurp Invoked with src=/tmp/yum_updates
Oct 29 15:05:15 vapp.int.example.com platform-python[16851]: ansible-file Invoked with path=/tmp/yum_updates state=absent recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access_time_format=%Y%m%d%H%M.%S _orig>
Oct 29 15:05:33 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 15:05:33 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 15:05:33 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 15:05:33 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.


KR,

Comment 9 Mister X 2020-10-29 15:32:08 UTC
I also don't see how it is possible to execute a shutdown or power off using the virsh commands or any CLI.

Comment 10 Mister X 2020-10-29 15:57:16 UTC
Created attachment 1725101 [details]
journalctl the last 2 days

Comment 11 Tomáš Golembiovský 2020-10-29 17:56:54 UTC
The attached journal is not really helpful. It does not contain anything from the phase after shutdown was issued.

(In reply to Mister X from comment #9)
> I also don’t see how it’s being possible to execute a shutdown or power off
> using the virsh commands or any cli.?

In the Web UI there is an arrow next to the "Shutdown" button; when you click the arrow you get an option to power off the VM. From the command line you can issue 'virsh destroy' to kill the VM.
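
For example, roughly (replace <vm-name> with the actual VM name; that a read-only connection works without SASL credentials is an assumption about the usual oVirt host configuration):

   # list all domains, including shut-off ones, over a read-only connection
   virsh -r list --all
   # forcefully terminate the stuck VM (read-write connection; on oVirt hosts this asks for SASL credentials)
   virsh destroy <vm-name>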

Comment 12 Mister X 2020-10-29 18:36:52 UTC
(In reply to Tomáš Golembiovský from comment #11)
> The attached journal is not really helpful. It does not contain anything
> from the phase after shutdown was issued.

Can't really tell why; it seems that these actions are not being registered in the journal.


> (In reply to Mister X from comment #9)
> > I also don’t see how it’s being possible to execute a shutdown or power off
> > using the virsh commands or any cli.?

virsh does not list anything other than the hosted engine:
virsh # list
 Id   Name           State
------------------------------
 1    HostedEngine   running

even 'list --all' won't show it:

virsh # list --all
 Id   Name           State
------------------------------
 1    HostedEngine   running


I have another 7 VMs on the same host and storage.

> In Web UI there is an arrow next to the "Shutdown" button, when you click
> the arrow you have an option to power off the VM. From command line you can
I know that arrow; I have clicked it more than 30 times already and nothing happens. If I do it with another VM, it works perfectly.

> issue 'virsh destroy' to kill the VM.
Like I said, virsh is not listing any VMs except the hosted engine.

Thanks,

Comment 13 Mister X 2020-10-29 18:44:50 UTC
If, for example, I try to get the domain ID of the problematic VM, I get this:

virsh # domid Ops-infra
error: failed to get domain 'Ops-infra'

Here are the details of the VM:

Name: Ops-Infra
Defined Memory: 8000 MB
Origin: oVirt
Description:
Physical Memory Guaranteed: 8000 MB
Run On: Any Host in Cluster
Status: Reboot In Progress
Guest OS Memory Free/Cached/Buffered: Not Configured
Custom Properties: Not Configured
Uptime:
Number of CPU Cores: 4 (4:1:1)
Cluster Compatibility Version: 4.4
Template: Blank
Guest CPU Count: N/A
VM ID: 78fae722-6f01-415e-b2e7-f1622f4c945b
Operating System: Red Hat Enterprise Linux 7.x x64
Guest CPU: Skylake-Server,+spec-ctrl,+ssbd,+md-clear,-hle,-rtm
Chipset/Firmware Type: Q35 Chipset with UEFI BIOS
Highly Available: No
Graphics protocol: SPICE + VNC
Number of Monitors: 1
Video Type: QXL
USB Policy: Disabled
Hardware Clock Time Offset: Etc/GMT
Priority: Low
Created By: admin
Optimized for: Server

Comment 14 Mister X 2020-10-29 18:50:53 UTC
Another thing I have found is that libvirt might have a bug:

systemctl status libvirtd
● libvirtd.service - Virtualization daemon
   Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
  Drop-In: /etc/systemd/system/libvirtd.service.d
           └─unlimited-core.conf
   Active: active (running) since Thu 2020-10-29 14:44:21 CET; 5h 3min ago
     Docs: man:libvirtd(8)
           https://libvirt.org
 Main PID: 2387 (libvirtd)
    Tasks: 18 (limit: 32768)
   Memory: 72.7M
   CGroup: /system.slice/libvirtd.service
           └─2387 /usr/sbin/libvirtd

Oct 29 19:19:49 vapp.int.example.com libvirtd[2387]: authentication failed: Failed to start SASL negotiation: -20 (SASL(-13): user not found: unable to canonify user and get auxprops)
Oct 29 19:19:49 vapp.int.example.com libvirtd[2387]: authentication failed: authentication failed
Oct 29 19:19:49 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:26:27 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:38:02 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:41:54 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:46:54 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:48:12 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Comment 15 Tomáš Golembiovský 2020-10-30 09:34:32 UTC
Now I noticed that the journal log is actually from the host, but I would like to see the journal from the *guest* to be able to spot any issues with the agent. So please also attach the journal from the VM.

But since it now looks more like an issue with libvirt and/or the engine, we will need more information here. Please enable debug logs for the virt logger in vdsm. You can do that either by modifying /etc/vdsm/logger.conf or by issuing the command:

   vdsm-client Host setLogLevel level=DEBUG name=virt

Please do that *before* you start the VM.

Then after a while shut down the VM as usual to reproduce the issue. Note that with debug logs enabled the logs get rotated quite often so please make sure you collect all the log files. Attach also engine logs from the same time period.
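
For reference, the equivalent change in /etc/vdsm/logger.conf would be editing the virt logger section roughly like this (a sketch; the exact section contents vary between vdsm versions, and unlike the vdsm-client command above a config-file change typically requires restarting vdsmd to take effect):

   [logger_virt]
   level = DEBUG
   # leave the remaining keys of the section unchanged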

Comment 16 Mister X 2020-10-30 10:54:31 UTC
I’ll get the logs you requested and attach those as requested.

Just to clarify regarding the problematic VM: it is literally not responsive at all. I cannot access it over SSH or the SPICE/VNC console, and I cannot restart, shut down, or power it off; no interaction with the VM works at all!

The only thing shown under the VM "state" is rebooting.

Thx

Comment 17 Mister X 2020-10-30 12:15:36 UTC
(In reply to Tomáš Golembiovský from comment #15)
> Now I noticed that the journal log is actually from the host, but I would
> like to see the journal from the *guest* to be able to spot any issues with
> the agent. So please attach also journal from the VM.

I have attached journalctl output for a specific period of time during which I executed shutdown and power-off commands, to catch them in the logs (journalctl --since "2020-10-30 12:24:00" --until "2020-10-30 12:45:00").
Unfortunately I see nothing on the engine that could give a hint, but in the host logs I do see something:

Oct 30 12:25:34 vapp.int.example.com ovirt-ha-agent[7868]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore ERROR Unable to extract HEVM OVF
Oct 30 12:25:34 vapp.int.example.com ovirt-ha-agent[7868]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf


> But since, it now looks more like issue with libvirt and/or engine, we will
> need more information here. Please enable debug logs for virt logger in
> vdsm. You can do that either by modifying /etc/vdsm/logger.conf or by
> issuing command:
> 
>    vdsm-client Host setLogLevel level=DEBUG name=virt

Done! Set to DEBUG.
> Please do that *before* you start the VM.
Done! After I set the debug level, Oct 30th 12:55 to 12:57 is when I shut down and powered off the VM, so it should appear in the logs.

> Then after a while shut down the VM as usual to reproduce the issue. Note
> that with debug logs enabled the logs get rotated quite often so please make
> sure you collect all the log files. Attach also engine logs from the same
> time period.
Collected the logs also from the engine.

KR,

Comment 18 Mister X 2020-10-30 12:19:42 UTC
Created attachment 1725274 [details]
vdsm

Comment 19 Mister X 2020-10-30 12:20:35 UTC
Created attachment 1725275 [details]
messages.log_Host

Comment 20 Mister X 2020-10-30 12:21:16 UTC
Created attachment 1725276 [details]
agent.log

Comment 21 Mister X 2020-10-30 12:22:15 UTC
Created attachment 1725277 [details]
messages.log_Engine

Comment 22 Mister X 2020-10-30 12:23:39 UTC
Created attachment 1725284 [details]
engine-journalctl

Comment 23 Mister X 2020-10-30 12:24:08 UTC
Created attachment 1725285 [details]
host-journalctl

Comment 24 Mister X 2020-10-30 12:24:42 UTC
Created attachment 1725286 [details]
sanlock

Comment 25 Arik 2020-11-03 17:20:40 UTC
See [1] for a workaround until 4.4.4 is released (the fix for bz 1834233 should already significantly reduce the probability of this happening in 4.4.3).

[1] https://www.mail-archive.com/users@ovirt.org/msg62928.html

*** This bug has been marked as a duplicate of bug 1880251 ***

Comment 26 Mister X 2020-11-03 22:57:13 UTC
(In reply to Arik from comment #25)
> See [1] for a workaround until 4.4.4 is released (the fix for bz 1834233
> should reduce the probability of this to happen significantly already in
> 4.4.3)

Thanks for the workaround.

Two days ago, due to the communication issues I had with virsh, which showed up as a complete inability to use the virsh commands at all, I completely reinstalled the host and the engine from scratch.
Since then everything works well and I am able to use the virsh CLI without any problem.
Now I just hope it keeps working as it should.

Many Thanks,