Bug 1892416
| Summary: | QEMU guest non-responsive | | |
|---|---|---|---|
| Product: | [oVirt] ovirt-guest-agent | Reporter: | Mister X <phonixx6> |
| Component: | General | Assignee: | Tomáš Golembiovský <tgolembi> |
| Status: | CLOSED DUPLICATE | QA Contact: | Lucie Leistnerova <lleistne> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 1.0.16 | CC: | ahadas, bugs |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-11-03 17:20:40 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Virt | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Created attachment 1724882 [details]
libvirt guest log
Created attachment 1724884 [details]
messages.log
Created attachment 1724885 [details]
sanlock
To be clear, is this bug about a problem with the Hosted Engine VM or a regular VM?

Please attach the journal log from the VM containing messages between reboot and poweroff of the VM:
1) enable persistent journal by: mkdir /var/log/journal
2) perform a reboot of the VM
3) wait a minute or two
4) force power off the VM
5) start the VM again
6) get logs from the previous run: journalctl -b -1 > vm-journal.log

The documentation text flag should only be set after the 'doc text' field is provided. Please provide the documentation text and set the flag to '?' again.

(In reply to Tomáš Golembiovský from comment #4)
> To be clear, is this bug about a problem with the Hosted Engine VM or a regular VM?
> Please attach the journal log from the VM containing messages between reboot and
> poweroff of the VM:

Hi Tomáš,

Thanks for your reply.

> 1) enable persistent journal by: mkdir /var/log/journal
> 2) perform a reboot of the VM

The VM is stuck on reboot endlessly; there is no way to overcome this!

> 3) wait a minute or two
> 4) force power off the VM

Not possible; power off produces no reaction whatsoever - the VM will not power off!

> 5) start the VM again

Not possible; this depends on step 4.

> 6) get logs from the previous run: journalctl -b -1 > vm-journal.log

I will follow the suggested steps to generate the requested log collection and upload it ASAP.

Thanks,

Forgot to mention: the VM is not the hosted engine! It's a normal Linux VM! Thanks 😊

I have created the folder as requested and tried to power off/shutdown; restart and start are both not selectable. Here is the output:

-- Logs begin at Thu 2020-10-29 14:55:03 CET, end at Thu 2020-10-29 15:05:33 CET. --
Oct 29 14:55:03 vapp.int.example.com sshd[12656]: pam_unix(sshd:session): session closed for user root
Oct 29 14:55:03 vapp.int.example.com systemd-logind[1450]: Session 11 logged out. Waiting for processes to exit.
Oct 29 14:55:03 vapp.int.example.com systemd-logind[1450]: Removed session 11.
Oct 29 14:55:49 vapp.int.example.com su[12039]: pam_unix(su:session): session closed for user root
Oct 29 14:55:49 vapp.int.example.com systemd-journald[862]: Time spent on flushing to /var is 1.253ms for 3 entries.
Oct 29 14:56:13 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 14:56:13 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 14:56:13 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 14:56:13 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.
Oct 29 14:56:18 vapp.int.example.com su[13359]: (to root) user on pts/0
Oct 29 14:56:18 vapp.int.example.com su[13359]: pam_systemd(su:session): Cannot create session: Already running in a session or user slice
Oct 29 14:56:18 vapp.int.example.com su[13359]: pam_unix(su:session): session opened for user root by user(uid=1000)
Oct 29 14:58:33 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 14:58:33 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 14:58:33 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 14:58:33 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.
Oct 29 14:59:03 vapp.int.example.com systemd[1]: Starting Cleanup of Temporary Directories...
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/certmonger.conf:3] Line references path below legacy directory /var/run/, updating /var/run/certmonger → /run/certmonger; please update the tmpfiles.d/ dr>
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:1] Line references path below legacy directory /var/run/, updating /var/run/vdsm → /run/vdsm; please update the tmpfiles.d/ drop-in file accordi>
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:2] Line references path below legacy directory /var/run/, updating /var/run/vdsm/dhclientmon → /run/vdsm/dhclientmon; please update the tmpfiles>
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:3] Line references path below legacy directory /var/run/, updating /var/run/vdsm/trackedInterfaces → /run/vdsm/trackedInterfaces; please update >
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:4] Line references path below legacy directory /var/run/, updating /var/run/vdsm/v2v → /run/vdsm/v2v; please update the tmpfiles.d/ drop-in file>
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:5] Line references path below legacy directory /var/run/, updating /var/run/vdsm/vhostuser → /run/vdsm/vhostuser; please update the tmpfiles.d/ >
Oct 29 14:59:03 vapp.int.example.com systemd-tmpfiles[14379]: [/usr/lib/tmpfiles.d/vdsm.conf:6] Line references path below legacy directory /var/run/, updating /var/run/vdsm/payload → /run/vdsm/payload; please update the tmpfiles.d/ drop>
Oct 29 14:59:03 vapp.int.example.com systemd[1]: Started Cleanup of Temporary Directories.
Oct 29 15:00:09 vapp.int.example.com systemd[1]: Starting system activity accounting tool...
Oct 29 15:00:09 vapp.int.example.com systemd[1]: Started system activity accounting tool.
Oct 29 15:00:53 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 15:00:53 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 15:00:53 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 15:00:53 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.
Oct 29 15:01:01 vapp.int.example.com CROND[15127]: (root) CMD (run-parts /etc/cron.hourly)
Oct 29 15:01:01 vapp.int.example.com run-parts[15130]: (/etc/cron.hourly) starting 0anacron
Oct 29 15:01:01 vapp.int.example.com anacron[15136]: Anacron started on 2020-10-29
Oct 29 15:01:01 vapp.int.example.com anacron[15136]: Will run job `cron.daily' in 45 min.
Oct 29 15:01:01 vapp.int.example.com anacron[15136]: Jobs will be executed sequentially
Oct 29 15:01:01 vapp.int.example.com run-parts[15138]: (/etc/cron.hourly) finished 0anacron
Oct 29 15:01:01 vapp.int.example.com run-parts[15140]: (/etc/cron.hourly) starting vdsm-logrotate
Oct 29 15:01:01 vapp.int.example.com run-parts[15146]: (/etc/cron.hourly) finished vdsm-logrotate
Oct 29 15:03:13 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 15:03:13 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 15:03:13 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 15:03:13 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.
Oct 29 15:05:11 vapp.int.example.com sshd[16640]: Accepted publickey for root from 192.168.2.132 port 53800 ssh2: RSA SHA256:xxxxxxxxxxxxxxxxxxxxxxxxxxx
Oct 29 15:05:11 vapp.int.example.com systemd-logind[1450]: New session 12 of user root.
Oct 29 15:05:11 vapp.int.example.com systemd[1]: Started Session 12 of user root.
Oct 29 15:05:11 vapp.int.example.com sshd[16640]: pam_unix(sshd:session): session opened for user root by (uid=0)
Oct 29 15:05:11 vapp.int.example.com platform-python[16678]: ansible-setup Invoked with gather_subset=['all'] gather_timeout=10 filter=* fact_path=/etc/ansible/facts.d
Oct 29 15:05:13 vapp.int.example.com platform-python[16804]: ansible-command Invoked with _raw_params=yum check-update -q | cut -d ' ' -f1 | sed '/^$/d' >> /tmp/yum_updates _uses_shell=True warn=True stdin_add_newline=True strip_empty_en>
Oct 29 15:05:13 vapp.int.example.com platform-python[16804]: ansible-command [WARNING] Consider using the yum module rather than running 'yum'. If you need to use command because yum is insufficient you can add 'warn: false' to this com>
Oct 29 15:05:14 vapp.int.example.com platform-python[16829]: ansible-lineinfile Invoked with path=/tmp/yum_updates line=Obsoleting state=absent backrefs=False create=False backup=False firstmatch=False follow=False regexp=None insertafte>
Oct 29 15:05:15 vapp.int.example.com platform-python[16840]: ansible-slurp Invoked with src=/tmp/yum_updates
Oct 29 15:05:15 vapp.int.example.com platform-python[16851]: ansible-file Invoked with path=/tmp/yum_updates state=absent recurse=False force=False follow=True modification_time_format=%Y%m%d%H%M.%S access_time_format=%Y%m%d%H%M.%S _orig>
Oct 29 15:05:33 vapp.int.example.com dbus-daemon[1387]: [system] Activating via systemd: service name='org.freedesktop.timedate1' unit='dbus-org.freedesktop.timedate1.service' requested by ':1.100' (uid=0 pid=8043 comm="/usr/bin/cockpit->
Oct 29 15:05:33 vapp.int.example.com systemd[1]: Starting System clock and RTC settings service...
Oct 29 15:05:33 vapp.int.example.com dbus-daemon[1387]: [system] Successfully activated service 'org.freedesktop.timedate1'
Oct 29 15:05:33 vapp.int.example.com systemd[1]: Started System clock and RTC settings service.

KR,

I also don't see how it's possible to execute a shutdown or power off using the virsh commands or any CLI?

Created attachment 1725101 [details]
journalctl the last 2 days
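The log-collection procedure from comment #4 can be sketched as a small script. This is an illustrative sketch only: the journal directory is parameterized here so the sketch can run unprivileged, whereas on a real guest the path is /var/log/journal and the commands require root; steps 2-5 (reboot, wait, force off, start) happen outside the guest and are only noted in comments.

```shell
#!/bin/sh
# Sketch of the journal-collection steps requested in comment #4.
# JOURNAL_DIR defaults to a scratch path for illustration; on a real
# guest it would be /var/log/journal.
JOURNAL_DIR="${JOURNAL_DIR:-/tmp/demo/var/log/journal}"

# 1) enable persistent journal
mkdir -p "$JOURNAL_DIR"

# 2) reboot the VM, 3) wait a minute or two, 4) force power off,
# 5) start the VM again -- done from the engine UI, not from inside
#    this script.

# 6) dump the journal of the previous boot, if journalctl exists here
if command -v journalctl >/dev/null 2>&1; then
    journalctl -b -1 > /tmp/vm-journal.log 2>/dev/null || true
fi

echo "persistent journal directory prepared: $JOURNAL_DIR"
```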
The attached journal is not really helpful. It does not contain anything from the phase after shutdown was issued.

(In reply to Mister X from comment #9)
> I also don't see how it's possible to execute a shutdown or power off
> using the virsh commands or any CLI?

In the Web UI there is an arrow next to the "Shutdown" button; when you click the arrow you have an option to power off the VM. From the command line you can issue 'virsh destroy' to kill the VM.

(In reply to Tomáš Golembiovský from comment #11)
> The attached journal is not really helpful. It does not contain anything
> from the phase after shutdown was issued.

I can't really tell why; it seems that the tasks are not being registered in the journal.

> (In reply to Mister X from comment #9)
> > I also don't see how it's possible to execute a shutdown or power off
> > using the virsh commands or any CLI?

virsh does not list any information other than the hosted engine:

virsh # list
 Id   Name           State
------------------------------
 1    HostedEngine   running

Even 'list --all' won't show them:

virsh # list --all
 Id   Name           State
------------------------------
 1    HostedEngine   running

I have another 7 VMs on the same host and storage.

> In the Web UI there is an arrow next to the "Shutdown" button; when you click
> the arrow you have an option to power off the VM. From command line you can

I know that arrow; I have clicked it already more than 30 times and nothing happens. If I do it with another VM, it works perfectly.

> issue 'virsh destroy' to kill the VM.

Like I said, virsh is not listing any VMs except the hosted engine.
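The force-off path suggested above ('virsh destroy' from the CLI, the Power Off arrow in the Web UI) can be sketched as a guarded helper. The VM name is illustrative, and the guard and fallback message are assumptions so the sketch degrades gracefully on machines without libvirt; note that in this bug the stuck VM did not even appear in 'virsh list --all', so the destroy step would have failed regardless.

```shell
#!/bin/sh
# Sketch: force off a VM the way the thread suggests. 'virsh destroy'
# kills the QEMU process without a clean guest shutdown.
VM_NAME="${1:-Ops-infra}"   # illustrative default taken from this report

if command -v virsh >/dev/null 2>&1; then
    # First see what libvirt actually knows about; a VM missing here
    # cannot be destroyed via virsh at all.
    virsh list --all || true
    virsh destroy "$VM_NAME" || echo "destroy failed (VM unknown to libvirt?)"
else
    echo "virsh not available; use the arrow next to Shutdown in the Web UI"
fi
```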
Thanks,

If I, for example, try to get the domain ID of the problematic VM, I get this:

virsh # domid Ops-infra
error: failed to get domain 'Ops-infra'

Here are the details of the VM:

Name: Ops-Infra
Defined Memory: 8000 MB
Origin: oVirt
Description:
Physical Memory Guaranteed: 8000 MB
Run On: Any Host in Cluster
Status: Reboot In Progress
Guest OS Memory Free/Cached/Buffered: Not Configured
Custom Properties: Not Configured
Uptime:
Number of CPU Cores: 4 (4:1:1)
Cluster Compatibility Version: 4.4
Template: Blank
Guest CPU Count: N/A
VM ID: 78fae722-6f01-415e-b2e7-f1622f4c945b
Operating System: Red Hat Enterprise Linux 7.x x64
Guest CPU: Skylake-Server,+spec-ctrl,+ssbd,+md-clear,-hle,-rtm
Chipset/Firmware Type: Q35 Chipset with UEFI BIOS
Highly Available: No
Graphics protocol: SPICE + VNC
Number of Monitors: 1
Video Type: QXL
USB Policy: Disabled
Hardware Clock Time Offset: Etc/GMT
Priority: Low
Created By: admin
Optimized for: Server

Another thing I have found out is that libvirt could have a bug:
systemctl status libvirtd
● libvirtd.service - Virtualization daemon
Loaded: loaded (/usr/lib/systemd/system/libvirtd.service; enabled; vendor preset: enabled)
Drop-In: /etc/systemd/system/libvirtd.service.d
└─unlimited-core.conf
Active: active (running) since Thu 2020-10-29 14:44:21 CET; 5h 3min ago
Docs: man:libvirtd(8)
https://libvirt.org
Main PID: 2387 (libvirtd)
Tasks: 18 (limit: 32768)
Memory: 72.7M
CGroup: /system.slice/libvirtd.service
└─2387 /usr/sbin/libvirtd
Oct 29 19:19:49 vapp.int.example.com libvirtd[2387]: authentication failed: Failed to start SASL negotiation: -20 (SASL(-13): user not found: unable to canonify user and get auxprops)
Oct 29 19:19:49 vapp.int.example.com libvirtd[2387]: authentication failed: authentication failed
Oct 29 19:19:49 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:26:27 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:38:02 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:41:54 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:46:54 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Oct 29 19:48:12 vapp.int.example.com libvirtd[2387]: End of file while reading data: Input/output error
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
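The SASL "user not found" errors in the libvirtd status above typically mean a client failed to authenticate against libvirt's SASL user database. A hedged diagnostic sketch follows; the database path is the libvirt default and the fallback messages are illustrative, and this is only a way to inspect the symptom, not a confirmed root cause for this bug.

```shell
#!/bin/sh
# Diagnostic sketch for the SASL authentication failures shown above.
if command -v sasldblistusers2 >/dev/null 2>&1; then
    # List users known to libvirt's SASL database (default path)
    sasldblistusers2 -f /etc/libvirt/passwd.db 2>/dev/null || true
else
    echo "sasldblistusers2 not available on this machine"
fi

if command -v virsh >/dev/null 2>&1; then
    # Attempt a local connection to see whether libvirtd answers at all
    virsh list --all || true
else
    echo "virsh not available on this machine"
fi

DONE=1
```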
Now I noticed that the journal log is actually from the host, but I would like to see the journal from the *guest* to be able to spot any issues with the agent. So please also attach the journal from the VM.

But since it now looks more like an issue with libvirt and/or the engine, we will need more information here. Please enable debug logs for the virt logger in vdsm. You can do that either by modifying /etc/vdsm/logger.conf or by issuing the command:

vdsm-client Host setLogLevel level=DEBUG name=virt

Please do that *before* you start the VM. Then, after a while, shut down the VM as usual to reproduce the issue. Note that with debug logs enabled the logs get rotated quite often, so please make sure you collect all the log files. Attach also the engine logs from the same time period.

I'll get the logs you requested and attach them. Just to clarify regarding the problematic VM: it is literally not responsive at all. I cannot access it over SSH or the SPICE/VNC console, and I cannot restart, shut down, or power it off - no interaction with the VM works at all! The only thing shown under the VM's state is "rebooting".

Thx

(In reply to Tomáš Golembiovský from comment #15)
> Now I noticed that the journal log is actually from the host, but I would
> like to see the journal from the *guest* to be able to spot any issues with
> the agent. So please attach also journal from the VM.
I have attached journalctl output for a specific period of time in which I executed the shutdown and poweroff commands, to catch those in the logs (journalctl --since "2020-10-30 12:24:00" --until "2020-10-30 12:45:00").

Unfortunately I see nothing on the engine that could give a hint, but in the host logs I do see something:

Oct 30 12:25:34 vapp.int.example.com ovirt-ha-agent[7868]: ovirt-ha-agent ovirt_hosted_engine_ha.lib.ovf.ovf_store.OVFStore ERROR Unable to extract HEVM OVF
Oct 30 12:25:34 vapp.int.example.com ovirt-ha-agent[7868]: ovirt-ha-agent ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine.config.vm ERROR Failed extracting VM OVF from the OVF_STORE volume, falling back to initial vm.conf

> But since, it now looks more like issue with libvirt and/or engine, we will
> need more information here. Please enable debug logs for virt logger in
> vdsm. You can do that either by modifying /etc/vdsm/logger.conf or by
> issuing command:
>
> vdsm-client Host setLogLevel level=DEBUG name=virt

Done! Set to DEBUG.

> Please do that *before* you start the VM.

Done! After I set the debug level, I shut down and powered off the VM between Oct 30th 12:55 and 12:57, so that should appear in the logs.

> Then after a while shut down the VM as usual to reproduce the issue. Note
> that with debug logs enabled the logs get rotated quite often so please make
> sure you collect all the log files. Attach also engine logs from the same
> time period.

Collected the logs also from the engine.

KR,

Created attachment 1725274 [details]
vdsm
Created attachment 1725275 [details]
messages.log_Host
Created attachment 1725276 [details]
agent.log
Created attachment 1725277 [details]
messages.log_Engine
Created attachment 1725284 [details]
engine-journalctl
Created attachment 1725285 [details]
host-journalctl
Created attachment 1725286 [details]
sanlock
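The debug-logging and collection routine requested in comment #15 can be sketched as one guarded script. The vdsm-client invocation is quoted from the comment; the guards, fallback messages, and collection directory are illustrative assumptions so the sketch is safe to run anywhere.

```shell
#!/bin/sh
# Sketch of enabling vdsm virt debug logs and gathering rotated files.
COLLECT_DIR="${COLLECT_DIR:-/tmp/vdsm-debug-logs}"   # illustrative path

if command -v vdsm-client >/dev/null 2>&1; then
    # Command quoted from comment #15; must run before starting the VM
    vdsm-client Host setLogLevel level=DEBUG name=virt || true
else
    echo "vdsm-client not available; edit /etc/vdsm/logger.conf instead"
fi

# With DEBUG enabled vdsm.log rotates quickly, so copy every rotation
mkdir -p "$COLLECT_DIR"
cp /var/log/vdsm/vdsm.log* "$COLLECT_DIR"/ 2>/dev/null || true

echo "collected logs (if any) are under $COLLECT_DIR"
DONE=1
```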
See [1] for a workaround until 4.4.4 is released (the fix for bz 1834233 should already significantly reduce the probability of this happening in 4.4.3).

[1] https://www.mail-archive.com/users@ovirt.org/msg62928.html

*** This bug has been marked as a duplicate of bug 1880251 ***

(In reply to Arik from comment #25)
> See [1] for a workaround until 4.4.4 is released (the fix for bz 1834233
> should reduce the probability of this to happen significantly already in
> 4.4.3)

Thanks for the workaround.

Two days ago, due to the communication issues I had with virsh, expressed in a complete inability to use the virsh commands at all, I completely reinstalled the host and the engine from scratch. Since then everything works well and I am able to use the virsh CLI without any problem. Now I just hope it works as it should.

Many Thanks,
Created attachment 1724881 [details]
vddm.log

Description of problem:
The guest is stuck on "Reboot In Progress"; neither shutdown nor power off in the GUI helps. Even a complete restart of the host, the engine, and the backend storage does not solve the problem.

Version-Release number of selected component (if applicable):
oVirt Software Version: 4.4.2.6-1.el8
RHL-release-8.2-2.2004.0.2.el8.x86_64

How reproducible:
100% - persistent after restart and power off

Steps to Reproduce:
1. Start the host and the engine.
2. Connect to the hosted engine and try to control the guest in any way.
3. Try to interact with the guest (restart, power off, console); the guest agent is not responding.

Actual results:
Any interaction with the guest fails (Remove is greyed out).

Expected results:
The guest can be controlled and completes the requested action.

Additional info:

hosted-engine --clean-metadata
INFO:ovirt_hosted_engine_ha.agent.agent.Agent:ovirt-hosted-engine-ha agent 2.4.4 started
INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Certificate common name not found, using hostname to identify host
INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Initializing ha-broker connection
INFO:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Broker initialized, all submonitors started
ERROR:ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine:Cannot clean unclean metadata block. Consider --force-clean.
INFO:ovirt_hosted_engine_ha.agent.agent.Agent:Agent shutting down

hosted-engine --set-maintenance --mode=global followed by a restart also does not work.

Thanks and KR,
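The metadata-cleanup attempt from the description can be sketched as follows. The --force-clean flag comes from the agent's own error message above; the CONFIRM guard, the dry-run fallback, and the final --mode=none step are illustrative assumptions (added so the sketch never touches a real deployment unless explicitly confirmed), not a verified recovery procedure for this bug.

```shell
#!/bin/sh
# Sketch of the hosted-engine metadata cleanup attempted in the report.
# Destructive on a real host, so it only runs with CONFIRM=yes.
if [ "${CONFIRM:-no}" = "yes" ] && command -v hosted-engine >/dev/null 2>&1; then
    hosted-engine --set-maintenance --mode=global
    # --force-clean is what the agent suggests for an unclean block
    hosted-engine --clean-metadata --force-clean
    hosted-engine --set-maintenance --mode=none
    MODE="executed"
else
    echo "dry run: set CONFIRM=yes on an oVirt host to actually run this"
    MODE="dry-run"
fi
```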