Bug 1334982
Summary: | [RFE] Gracefully shutdown Virtual Machines on Host reboot/shutdown.
---|---
Product: | Red Hat Enterprise Virtualization Manager
Reporter: | Germano Veit Michel <gveitmic>
Component: | ovirt-engine
Assignee: | Petr Kotas <pkotas>
Status: | CLOSED ERRATA
QA Contact: | Israel Pinto <ipinto>
Severity: | medium
Priority: | medium
Docs Contact: |
Version: | 3.6.5
CC: | apinnick, dcadzow, dfediuck, eheftman, fgarciad, giordy, gordon.stocks, gveitmic, ipinto, lsurette, mavital, michal.skrivanek, mtessun, pkotas, rbalakri, Rhev-m-bugs, sbonazzo, srevivo, tjelinek, ykaul
Target Milestone: | ovirt-4.2.3
Keywords: | FutureFeature
Target Release: | ---
Flags: | mavital: testing_plan_complete+
Hardware: | x86_64
OS: | Linux
Whiteboard: |
Fixed In Version: |
Doc Type: | Enhancement
Doc Text: | Previously, when a host had to be shut down in an emergency to preserve the data center, running virtual machines were killed by systemd without a graceful shutdown. As a result, the virtual machines' state became undefined, which was problematic for virtual machines running databases such as Oracle or SAP. In this release, virtual machines are gracefully shut down before systemd continues the host shutdown. VDSM itself is stopped only after the virtual machines have been shut down, their state has been reported to the Manager, and a 5-second grace period has elapsed for the Manager to acknowledge that the virtual machines are down.
Story Points: | ---
Clone Of: |
Cloned To: | 1401054 (view as bug list)
Environment: |
Last Closed: | 2018-05-15 17:38:32 UTC
Type: | Bug
Regression: | ---
Mount Type: | ---
Documentation: | ---
CRM: |
Verified Versions: |
Category: | ---
oVirt Team: | Virt
RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | ---
Target Upstream Version: |
Embargoed: |
Bug Depends On: | 1401054
Bug Blocks: | 1507935
Attachments: |
Description (Germano Veit Michel, 2016-05-11 03:34:20 UTC)
Maybe a good first step would be to get this hard-coded string

"Before", "as", 1, "libvirt-guests.service") < 0)  <-----------

out of libvirt and make it customizable. Or add another 'Before' line for a new 'rhev-guests.service'.

(In reply to Germano Veit Michel from comment #2)
> Maybe a good first step would be to get this hard-coded string
>
> "Before", "as", 1, "libvirt-guests.service") < 0)  <-----------
>
> out of libvirt and make it customizable. Or add another 'Before' line for a
> new 'rhev-guests.service'.

The above sounds like a libvirt issue, then?

(In reply to Yaniv Kaul from comment #3)
> The above sounds like a libvirt issue, then?

Yes, correct. I even considered opening a separate BZ for that. As this RFE will likely spawn a few other child RFEs (libvirt, vdsm, ...), I would prefer not to get ahead of myself and to let engineering properly decide what needs to be done in each component. So I created just this one, as a feature request whose ultimate goal is a nice, usable feature, and described the problems I found to provide some help and context.

We could go ahead and open the libvirt-specific BZ, but first it would be good to have someone else assess what exactly the best implementation on libvirt's side would be to support this RFE. Michal?

Isn't this a Virt area?

Thinking about it a bit with Moran and msivak, it seems to me that a feasible approach would be:

- put a message in /etc/motd saying that this host is part of an xyz environment and should be put into maintenance before any actual system maintenance
- drop in a shell alias to echo a similar message when admin/root types poweroff, shutdown, or reboot.

This should be fairly simple to add to host deploy and NGN. What do you think, Sandro? Other alternatives that actually block it from happening are way more complicated to do; the above should be good enough for most cases.

(In reply to Michal Skrivanek from comment #6)
> - put a message in /etc/motd saying that this host is part of an xyz
>   environment and should be put into maintenance before any actual system
>   maintenance
> - drop in a shell alias to echo a similar message when admin/root types
>   poweroff, shutdown, or reboot.

What about setting systemd to disable poweroff and shutdown [1], re-enabling them when the host is moved to maintenance?

[1] https://www.freedesktop.org/software/systemd/man/systemd-inhibit.html

Should probably be handled by VDSM/infra in this case.

Hm, that would be great, but it seems to me it assumes a command to execute, and it would require manipulation from vdsm. Note that currently vdsm is not aware of its own maintenance status. Sounds too complex to me...

Just to summarize what I understood so far, and to have a discussion starting point: vdsm disables libvirt-guests.service, which leads to systemd killing the VMs. The default behaviour of the guests service is to suspend the VMs on shutdown (which is in any case not what vdsm/RHV wants). According to the service definition, this can be changed in /etc/sysconfig/libvirt-guests:

ON_SHUTDOWN=shutdown

The number of parallel shutdowns can also be increased (PARALLEL_SHUTDOWN=<n>); a configuration sketch follows below.
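As a concrete illustration of the settings discussed here and in the following comments (ON_BOOT comes up just below), here is a minimal sketch of /etc/sysconfig/libvirt-guests, assuming the stock keys documented by libvirt-guests (ON_BOOT, ON_SHUTDOWN, PARALLEL_SHUTDOWN, SHUTDOWN_TIMEOUT); the values are illustrative, not the defaults oVirt eventually shipped:

```
# /etc/sysconfig/libvirt-guests -- sketch only, values are illustrative
ON_BOOT=ignore          # do not let libvirt auto-start guests on boot;
                        # the engine decides what runs where
ON_SHUTDOWN=shutdown    # gracefully shut guests down instead of suspending them
PARALLEL_SHUTDOWN=10    # how many guests to shut down concurrently
SHUTDOWN_TIMEOUT=300    # seconds to wait for a guest before giving up
```

With such a configuration in place, the enable/disable checkbox proposed below would map onto `systemctl enable libvirt-guests` / `systemctl disable libvirt-guests`.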
If I understood this correctly, using a properly configured libvirt-guests service could easily solve this issue, as a reboot would then do the following:

1. Shut down the VMs via the libvirt-guests service
2. Shut down libvirt and vdsm ---> fencing eventually comes into play here, but the VMs should already have been stopped.

For a nicer approach we can add some UI elements that change the libvirt-guests configuration (enable/disable it). So the implementation could be:

- vdsm-config configures the libvirt-guests service to shut down VMs instead of suspending them.
- Enabling/disabling the service can be achieved via the UI by selecting a checkbox like "Gracefully shutdown VMs in case of external Poweroff". If this is checked the libvirt-guests service is enabled, otherwise it is disabled.

Thoughts?

Forgot to mention: ON_BOOT=ignore also needs to be set so that the VMs are not restarted by libvirt during boot, as this could lead to VMs running more than once.

I see no use in keeping the current behavior; a shutdown should be the default. The suspend is useless, as it won't work when initiated by libvirt. Another, more complicated enhancement would be to trigger host maintenance instead... but the shutdown seems like a good enough starting point.

Please make sure to also verify the case explained in https://bugzilla.redhat.com/show_bug.cgi?id=1485967

*** Bug 1485967 has been marked as a duplicate of this bug. ***

Depends on https://github.com/oVirt/ovirt-ansible/pull/62. Cannot be tested before this gets merged.

Hey Petr, I'm reviewing the Doc Text that you wrote and, where required, opening documentation bugs to cover RFEs. In the doc text, don't you think we should mention that the solution should be enabled/disabled via the UI, as discussed here: https://bugzilla.redhat.com/show_bug.cgi?id=1334982#c10

Another question, Petr: is the 5 second configurable delay that you mentioned a new feature? I don't really understand the connection between what you wrote and what Martin wrote here: https://bugzilla.redhat.com/show_bug.cgi?id=1334982#c10 Can you please clarify exactly what changes will be made? Thanks!

Hi Emma, thanks for the questions.

> In the doc text, don't you think we should mention that the solution should be enabled/disabled via the UI?

Currently, there is no GUI to enable or disable this functionality. The feature is on by default.

> Is the 5 second configurable delay that you mentioned a new feature?

This is not a feature, although it is a tweakable parameter. I am not expecting it to change a lot. The 5 second delay exists only to allow the engine to finish all communication before the host machine shuts down.
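To make the "delaying the systemd process" idea from the Doc Text concrete: systemd stops units in the reverse of their start-up ordering, so a unit ordered After=vdsmd.service has its stop action executed before vdsmd is stopped. The unit below is a hypothetical sketch of that ordering trick only, not the actual implementation from the ovirt-ansible change linked above (the helper script name is invented):

```
# /etc/systemd/system/graceful-vm-shutdown.service -- hypothetical sketch
[Unit]
Description=Gracefully shut down guests before vdsmd stops (sketch)
After=vdsmd.service libvirtd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/bin/true
# invented helper: shuts guests down, reports their state, then waits a few
# seconds so the engine can acknowledge before the shutdown proceeds
ExecStop=/usr/local/bin/shutdown-guests.sh
TimeoutStopSec=300

[Install]
WantedBy=multi-user.target
```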
I am testing on version 4.2.0-0.5.master.el7 (build 7). I set the host_deploy_shutdown_timeout to 6000 and ran the first case with 2 VMs on the same host, running "poweroff -f" on the host (see https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/wiki/Compute/_RFE_%20Gracefully%20shutdown%20Virtual%20Machines%20on%20Host%20reboot_shutdown_).

I see that the VMs were not stopped by vdsm before the host powered off. From the engine log I see that the VMs (test_vm, test_vm_2) are down with an error:

2017-12-04 10:48:30,206+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_vm_2 is down with error. Exit message: VM terminated with error.
2017-12-04 10:48:31,011+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [] VM '635a2291-f060-48a0-9fce-c4c05d6722d8'(test_vm) moved from 'Unknown' --> 'Down'
2017-12-04 10:48:31,039+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-58) [] EVENT_ID: VM_DOWN_ERROR(119), VM test_vm is down with error. Exit message: VM terminated with error.

From my script (a hypothetical reconstruction follows after this comment) I see that the VMs were running until the host and vdsm went down (sampling 3 times per second):

check #225 10:44:07
 Id    Name                           State
----------------------------------------------------
 1     test_vm_2                      running
 2     test_vm                        running
vdsm: Active: active (running) since Mon 2017-12-04 10:40:08 IST; 3min 59s ago
================================================
check #226 10:44:07
 Id    Name                           State
----------------------------------------------------
 1     test_vm_2                      running
 2     test_vm                        running
vdsm: Active: active (running) since Mon 2017-12-04 10:40:08 IST; 3min 59s ago
================================================
check #227 10:44:07
 Id    Name                           State
----------------------------------------------------
 1     test_vm_2                      running
 2     test_vm                        running
vdsm: Active: active (running) since Mon 2017-12-04 10:40:08 IST; 4min 0s ago
================================================
check #228 10:44:08
 Id    Name                           State
----------------------------------------------------
 1     test_vm_2                      running
 2     test_vm                        running
vdsm: Active: active (running) since Mon 2017-12-04 10:40:08 IST; 4min 0s ago
================================================
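The monitoring script itself is not attached; the following is a hypothetical reconstruction of that kind of sampling loop, polling the guest list and vdsmd status roughly three times per second:

```
#!/bin/sh
# hypothetical reconstruction of the sampling loop used in the test above
i=0
while :; do
    i=$((i + 1))
    echo "check #$i $(date +%H:%M:%S)"
    virsh -r list                                     # read-only guest list
    echo "vdsm: $(systemctl status vdsmd | grep 'Active:')"
    echo "================================================"
    sleep 0.33
done
```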
(In reply to Israel Pinto from comment #22) Please add logs, namely the journal from the host, vdsm.log, and the qemu log for one of those VMs.

Created attachment 1365319 [details]
logs (engine, vdsm, vms-qemu, script log)

Still need the journal and the exact time you ran poweroff.

Created attachment 1365484 [details]
logs (engine, vdsm, vms-qemu, script log, journalctl)
That doesn't seem to capture the time you did the test. Please note down when you did things and double-check the timestamps in the attached logs.

1. poweroff command date:

[root@puma42 home]# date
Sun Dec 10 08:58:16 IST 2017
[root@puma42 home]# poweroff -f
Powering off.

Michal, I see the jump in time also in /var/log/messages. I did the poweroff at 08:58:16, but I don't see that time in the journal or in /var/log/messages. If you need, I will re-run it on a different host.

Update:

Engine: Software Version: 4.2.1.6-0.1.el7
Host:
OS Version: RHEL - 7.4 - 18.el7
Kernel Version: 3.10.0 - 693.17.1.el7.x86_64
KVM Version: 2.9.0 - 16.el7_4.14
LIBVIRT Version: libvirt-3.9.0-8.el7
VDSM Version: vdsm-4.20.17-1.el7ev

Steps:
1. Create a VM pool with 10 VMs
2. Shut down the host

The VMs went down with an error:

[root@compute-ge-6 ovirt-engine]# grep -ir test-2
engine.log:2018-02-07 10:07:07,563+02 INFO [org.ovirt.engine.core.bll.AddVmCommand] (default task-7) [283f0642-f07b-47ed-9e78-8326598ed7c5] Lock Acquired to object 'EngineLock:{exclusiveLocks='[test-2=VM_NAME]', sharedLocks='[ad6dc22c-9274-4664-a6e6-e1a23e57de3d=VM_POOL, c6080af6-8fe5-48d0-a8c4-99c33a644f27=TEMPLATE, e778d44a-a7a9-4eec-8481-0fad5f008f13=DISK]'}'
engine.log:2018-02-07 10:07:09,667+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-7) [22859810] EVENT_ID: USER_ADD_VM_STARTED(37), VM test-2 creation was initiated by admin@internal-authz.
engine.log:2018-02-07 10:07:29,123+02 INFO [org.ovirt.engine.core.bll.AddVmCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] Lock freed to object 'EngineLock:{exclusiveLocks='[test-2=VM_NAME]', sharedLocks='[ad6dc22c-9274-4664-a6e6-e1a23e57de3d=VM_POOL, c6080af6-8fe5-48d0-a8c4-99c33a644f27=TEMPLATE, e778d44a-a7a9-4eec-8481-0fad5f008f13=DISK]'}'
engine.log:2018-02-07 10:07:29,154+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: USER_ADD_VM_FINISHED_SUCCESS(53), VM test-2 creation has been completed.
engine.log:2018-02-07 11:02:22,614+02 INFO [org.ovirt.engine.core.vdsbroker.CreateVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1399) [20dd9e65-48d0-4f59-a52c-71857bd0134d] START, CreateVDSCommand( CreateVDSCommandParameters:{hostId='4a284ac7-8075-430b-98ee-c7c45d1cc04a', vmId='7dbb3560-8299-44fc-a97d-d669ac7dd31e', vm='VM [test-2]'}), log id: 69c52db4
engine.log:2018-02-07 11:02:22,617+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.CreateBrokerVDSCommand] (EE-ManagedThreadFactory-engine-Thread-1399) [20dd9e65-48d0-4f59-a52c-71857bd0134d] START, CreateBrokerVDSCommand(HostName = host_mixed_1, CreateVDSCommandParameters:{hostId='4a284ac7-8075-430b-98ee-c7c45d1cc04a', vmId='7dbb3560-8299-44fc-a97d-d669ac7dd31e', vm='VM [test-2]'}), log id: 1b4977b
engine.log: <name>test-2</name>
engine.log:2018-02-07 11:02:22,803+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1399) [20dd9e65-48d0-4f59-a52c-71857bd0134d] EVENT_ID: USER_STARTED_VM(153), VM test-2 was started by admin@internal-authz (Host: host_mixed_1).
engine.log:2018-02-07 11:02:40,180+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-61) [] VM '7dbb3560-8299-44fc-a97d-d669ac7dd31e'(test-2) moved from 'WaitForLaunch' --> 'PoweringUp'
engine.log:2018-02-07 11:02:56,199+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [] VM '7dbb3560-8299-44fc-a97d-d669ac7dd31e'(test-2) moved from 'PoweringUp' --> 'Up'
engine.log:2018-02-07 11:02:56,341+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-84) [] EVENT_ID: USER_RUN_VM(32), VM test-2 started on Host host_mixed_1
engine.log:2018-02-07 11:36:16,228+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-1784) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM test-2 was set to the Unknown status.
engine.log:2018-02-07 11:42:35,335+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [] VM '7dbb3560-8299-44fc-a97d-d669ac7dd31e'(test-2) moved from 'Unknown' --> 'Down'
engine.log:2018-02-07 11:42:35,359+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-51) [] EVENT_ID: VM_DOWN_ERROR(119), VM test-2 is down with error. Exit message: VM terminated with error.
engine.log:2018-02-07 11:42:39,963+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-1904) [64f3541a] Couldn't update VM 'test-2' (7dbb3560-8299-44fc-a97d-d669ac7dd31e) version from it's template, continue with restoring stateless snapshot.

logs: https://drive.google.com/open?id=1vzLUwlP3oB6Z2PFWbq_Y8ahlLPfnpI1E

Failing on VM pools; the VM is down with an error. Failed QA, see c31.

Hi, I can see there is no libvirt-guests entry in the journal log. Is the journal from the host or from the guest? Moreover, on which RHEL were the tests done? There is a known bug in 7.4 for VM shutdown, which is fixed in 7.5. If so, can you please retest on RHEL 7.5? Next, can you please redo the test with 10 VMs that are not in a pool? This will help eliminate the pool as the source of the issue. Many thanks, Petr

let's wait for confirmation

I retested with:
Engine version: 4.2.2-0.1.el7
Host:
OS Version: RHEL - 7.5 - 6.el7
Kernel Version: 3.10.0 - 851.el7.x86_64
KVM Version: 2.10.0 - 20.el7
LIBVIRT Version: libvirt-3.9.0-13.el7
VDSM Version: vdsm-4.20.18-1.el7ev

Cases:
1. Create a VM pool with 15 VMs, run all VMs, shut down the host
2. Create 12 VMs, run all VMs, shut down the host

Results:
Case 1: 2 VMs stayed in Unknown status
Case 2: 1 VM stayed in Unknown status

See logs at: https://drive.google.com/open?id=1YdeA3msXn6w8fNMLlXVmO3qO3RPe1bab

Case 1, pool with 15 VMs: VMs pool_test_-7 and pool_test_-5 were set to "Unknown status".

shutdown command:
[root@puma42 ~]# shutdown
Shutdown scheduled for Wed 2018-02-21 09:21:03 IST, use 'shutdown -c' to cancel.
[root@puma42 ~]#
Broadcast message from root.lab.tlv.redhat.com (Wed 2018-02-21 09:20:03 IST):
The system is going down for power-off at Wed 2018-02-21 09:21:03 IST!
Engine log:

2018-02-21 09:21:10,962+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] VM '9206a475-bcc5-4e24-a125-13085a2b02be'(pool_test_-1) moved from 'Up' --> 'PoweringDown'
2018-02-21 09:21:10,971+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] VM '5c0b4c2f-d1f0-4e48-93c9-131c7ef61ef1'(pool_test_-3) moved from 'Up' --> 'Down'
2018-02-21 09:21:11,112+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-65) [] EVENT_ID: VM_DOWN(61), VM pool_test_-3 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:11,475+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21509) [27f2ac26] Couldn't update VM 'pool_test_-3' (5c0b4c2f-d1f0-4e48-93c9-131c7ef61ef1) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:26,176+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] VM '2ddcad0a-e94e-4f17-9268-91624d9bd9da'(pool_test_-2) moved from 'Up' --> 'Down'
2018-02-21 09:21:26,297+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] EVENT_ID: VM_DOWN(61), VM pool_test_-2 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:26,306+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] VM '9206a475-bcc5-4e24-a125-13085a2b02be'(pool_test_-1) moved from 'PoweringDown' --> 'Down'
2018-02-21 09:21:26,341+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] EVENT_ID: VM_DOWN(61), VM pool_test_-1 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:26,422+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21514) [3fbc9284] Couldn't update VM 'pool_test_-1' (9206a475-bcc5-4e24-a125-13085a2b02be) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:26,516+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21514) [4b3e995b] Couldn't update VM 'pool_test_-2' (2ddcad0a-e94e-4f17-9268-91624d9bd9da) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:41,395+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [] VM 'b5af689a-6f2e-411f-89bf-dfca090a495e'(pool_test_-4) moved from 'Up' --> 'Down'
2018-02-21 09:21:41,536+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [] EVENT_ID: VM_DOWN(61), VM pool_test_-4 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:41,549+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [] VM '3d95f01e-1e92-4ad7-bbcd-7d00167929a1'(pool_test_-14) moved from 'Up' --> 'Down'
2018-02-21 09:21:41,570+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-25) [] EVENT_ID: VM_DOWN(61), VM pool_test_-14 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:41,661+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21519) [bc90f5a] Couldn't update VM 'pool_test_-4' (b5af689a-6f2e-411f-89bf-dfca090a495e) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:41,764+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21519) [2d7151cd] Couldn't update VM 'pool_test_-14' (3d95f01e-1e92-4ad7-bbcd-7d00167929a1) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:21:56,621+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [] VM '2dd11809-dd6d-4041-ae01-b4d402f99579'(pool_test_-6) moved from 'Up' --> 'Down'
2018-02-21 09:21:56,646+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-40) [] EVENT_ID: VM_DOWN(61), VM pool_test_-6 is down. Exit message: User shut down from within the guest
2018-02-21 09:21:56,728+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21524) [ac7ac7b] Couldn't update VM 'pool_test_-6' (2dd11809-dd6d-4041-ae01-b4d402f99579) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:11,697+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] VM '52fd7858-cdc1-47fb-9603-34fb874c34fc'(pool_test_-8) moved from 'Up' --> 'Down'
2018-02-21 09:22:11,721+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: VM_DOWN(61), VM pool_test_-8 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:11,802+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21529) [2bb28719] Couldn't update VM 'pool_test_-8' (52fd7858-cdc1-47fb-9603-34fb874c34fc) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:26,765+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] VM 'd6d9f51c-323e-46a2-82da-f43da155425f'(pool_test_-11) moved from 'Up' --> 'Down'
2018-02-21 09:22:26,786+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] EVENT_ID: VM_DOWN(61), VM pool_test_-11 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:26,794+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] VM 'd21fd87b-95e9-435f-9155-96ec25ccb272'(pool_test_-15) moved from 'Up' --> 'Down'
2018-02-21 09:22:26,810+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-34) [] EVENT_ID: VM_DOWN(61), VM pool_test_-15 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:26,880+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21534) [127a20ab] Couldn't update VM 'pool_test_-15' (d21fd87b-95e9-435f-9155-96ec25ccb272) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:26,963+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21534) [52917cf7] Couldn't update VM 'pool_test_-11' (d6d9f51c-323e-46a2-82da-f43da155425f) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:41,857+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-7) [] VM '8f441bc7-148f-44c0-8890-a7f8db9eca57'(pool_test_-9) moved from 'Up' --> 'Down'
2018-02-21 09:22:41,899+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-7) [] EVENT_ID: VM_DOWN(61), VM pool_test_-9 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:41,981+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21537) [760ea1af] Couldn't update VM 'pool_test_-9' (8f441bc7-148f-44c0-8890-a7f8db9eca57) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:22:56,938+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] VM '389f1314-cd45-4e55-ba70-c62ff1e6e715'(pool_test_-10) moved from 'Up' --> 'Down'
2018-02-21 09:22:56,972+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: VM_DOWN(61), VM pool_test_-10 is down. Exit message: User shut down from within the guest
2018-02-21 09:22:57,041+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21542) [4aedf5e2] Couldn't update VM 'pool_test_-10' (389f1314-cd45-4e55-ba70-c62ff1e6e715) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:23:12,012+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [] VM '52ed58bf-7a49-4776-b51e-58eabceb5c60'(pool_test_-12) moved from 'Up' --> 'Down'
2018-02-21 09:23:12,186+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-56) [] EVENT_ID: VM_DOWN(61), VM pool_test_-12 is down. Exit message: User shut down from within the guest
2018-02-21 09:23:12,432+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21547) [63fb6556] Couldn't update VM 'pool_test_-12' (52ed58bf-7a49-4776-b51e-58eabceb5c60) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:23:27,222+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-75) [] VM '7bbd15b8-6bdc-4df0-956b-8d3dd7409485'(pool_test_-13) moved from 'Up' --> 'Down'
2018-02-21 09:23:27,247+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-75) [] EVENT_ID: VM_DOWN(61), VM pool_test_-13 is down. Exit message: User shut down from within the guest
2018-02-21 09:23:27,321+02 WARN [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-21550) [3069f554] Couldn't update VM 'pool_test_-13' (7bbd15b8-6bdc-4df0-956b-8d3dd7409485) version from it's template, continue with restoring stateless snapshot.
2018-02-21 09:25:02,768+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-21578) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM pool_test_-7 was set to the Unknown status.
2018-02-21 09:25:02,778+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-21578) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM pool_test_-5 was set to the Unknown status.

-------------------------------------------------------------------------------

Case 2, 12 standalone VMs; vm_2 stayed in "Unknown status".

shutdown command:
[root@puma42 ~]# shutdown
Shutdown scheduled for Wed 2018-02-21 10:30:00 IST, use 'shutdown -c' to cancel.
[root@puma42 ~]#
Broadcast message from root.lab.tlv.redhat.com (Wed 2018-02-21 10:29:00 IST):
The system is going down for power-off at Wed 2018-02-21 10:30:00 IST!

Engine log, vm_2 was set to the Unknown status:

2018-02-21 10:30:28,324+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] VM '1ba7d522-17cf-4c68-a829-271ec4c8288f'(vm_1) moved from 'Up' --> 'Down'
2018-02-21 10:30:28,478+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: VM_DOWN(61), VM vm_1 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:28,491+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] VM '49adc5c8-aa6a-4576-93ba-734750c713fb'(vm_11) moved from 'Up' --> 'Down'
2018-02-21 10:30:28,611+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: VM_DOWN(61), VM vm_11 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:28,620+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] VM 'b36b411b-864a-4cce-a6df-143ecb345538'(vm_5) moved from 'Up' --> 'Down'
2018-02-21 10:30:28,744+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-55) [] EVENT_ID: VM_DOWN(61), VM vm_5 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:43,804+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] VM 'c215a1be-7a75-41d8-b470-bea29d0ccd32'(vm_7) moved from 'Up' --> 'Down'
2018-02-21 10:30:43,926+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] EVENT_ID: VM_DOWN(61), VM vm_7 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:58,975+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] VM '80349a03-8886-4cf0-a577-a39f7221875d'(vm_12) moved from 'Up' --> 'Down'
2018-02-21 10:30:59,115+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] EVENT_ID: VM_DOWN(61), VM vm_12 is down. Exit message: User shut down from within the guest
2018-02-21 10:30:59,134+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] VM '8625c439-792b-4a46-9c71-f85d0c1cdd08'(vm_8) moved from 'Up' --> 'Down'
2018-02-21 10:30:59,159+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-10) [] EVENT_ID: VM_DOWN(61), VM vm_8 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:14,202+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] VM 'a7822cf9-6086-45e7-9ffc-b501457709a0'(vm_6) moved from 'Up' --> 'Down'
2018-02-21 10:31:14,223+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] EVENT_ID: VM_DOWN(61), VM vm_6 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:14,233+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] VM '710c361d-e66a-49c6-a4fc-4aed3a5ca2d6'(vm_3) moved from 'Up' --> 'Down'
2018-02-21 10:31:14,253+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-86) [] EVENT_ID: VM_DOWN(61), VM vm_3 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:29,292+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] VM 'e46b3232-4cbb-47fb-8f24-8c1b72486482'(vm_4) moved from 'Up' --> 'Down'
2018-02-21 10:31:29,425+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] EVENT_ID: VM_DOWN(61), VM vm_4 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:29,433+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] VM '3fd29d04-0e93-49a4-b1ed-464aa2b4b73c'(vm_9) moved from 'Up' --> 'Down'
2018-02-21 10:31:29,451+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-4) [] EVENT_ID: VM_DOWN(61), VM vm_9 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:44,487+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] VM 'e61ac7f6-6026-42e9-bc8b-f42113a89815'(vm_10) moved from 'Up' --> 'Down'
2018-02-21 10:31:44,511+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-53) [] EVENT_ID: VM_DOWN(61), VM vm_10 is down. Exit message: User shut down from within the guest
2018-02-21 10:31:59,541+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] VM 'b8da67a1-7c10-4170-a9e6-e87cf6d397e1'(vm_13) moved from 'Up' --> 'Down'
2018-02-21 10:31:59,563+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-6) [] EVENT_ID: VM_DOWN(61), VM vm_13 is down. Exit message: User shut down from within the guest
2018-02-21 10:33:33,656+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-22791) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm_2 was set to the Unknown status.
2018-02-21 10:33:33,656+02 INFO [org.ovirt.engine.core.vdsbroker.VdsManager] (EE-ManagedThreadFactory-engine-Thread-22791) [] Server failed to respond, vds_id='4a284ac7-8075-430b-98ee-c7c45d1cc04a', vds_name='host_mixed_1', vm_count=2, spm_status='SPM', non-responsive_timeout (seconds)=81, error: java.net.NoRouteToHostException: No route to host

Hi, I have examined the logs and I cannot find a conclusive result. Please repeat the test, this time with debug logs switched on. This issue cannot be solved without them. Thank you.

Engine: Software Version: 4.2.2.1-0.1.el7
Host:
OS Version: RHEL - 7.5 - 6.el7
Kernel Version: 3.10.0 - 855.el7.x86_64
KVM Version: 2.10.0 - 21.el7
LIBVIRT Version: libvirt-3.9.0-13.el7
VDSM Version: vdsm-4.20.19-1.el7ev

I reran with pools and standalone VMs and turned the logs up to debug; see logs at: https://drive.google.com/open?id=1QktaTFoo_fdq9k61-v0HvF4iRqmniKnM

Same results: we have VMs in an Unknown state after shutting down the host. Failed QA.

We've sent the event during shutdown as expected:

2018-02-27 13:36:38,841+0200 DEBUG (libvirt/events) [jsonrpc.Notification] Sending event {"params": {"7c2a20fc-12a9-44c5-95e7-1bfb3282d056": {"status": "Down", "exitMessage": "User shut down from within the guest", "vmId": "7c2a20fc-12a9-44c5-95e7-1bfb3282d056", "exitReason": 7, "exitCode": 0}, "notify_time": 4364298540}, "jsonrpc": "2.0", "method": "|virt|VM_status|7c2a20fc-12a9-44c5-95e7-1bfb3282d056"} (__init__:181)
2018-02-27 13:36:38,842+0200 DEBUG (libvirt/events) [vds] Sending notification |virt|VM_status|7c2a20fc-12a9-44c5-95e7-1bfb3282d056 with params {'7c2a20fc-12a9-44c5-95e7-1bfb3282d056': {'status': 'Down', 'exitMessage': 'User shut down from within the guest', 'vmId': '7c2a20fc-12a9-44c5-95e7-1bfb3282d056', 'exitReason': 7, 'exitCode': 0}, 'notify_time': 4364298540} (clientIF:212)
2018-02-27 13:36:43,121+0200 DEBUG (MainThread) [vds.MultiProtocolAcceptor] Stopping Acceptor (protocoldetector:214)
2018-02-27 13:36:43,122+0200 INFO (MainThread) [jsonrpc.JsonRpcServer] Stopping JsonRPC Server (__init__:703)

but there is only the following error on the engine side and nothing else:

2018-02-27 13:36:29,930+02 INFO [org.ovirt.engine.core.bll.RestoreStatelessVmCommand] (EE-ManagedThreadFactory-engine-Thread-30299) [cc26175] Lock freed to object 'EngineLock:{exclusiveLocks='[b46f4e62-d5eb-4f81-ad75-00163341ac65=VM]', sharedLocks=''}'
2018-02-27 13:36:46,956+02 ERROR [org.ovirt.vdsm.jsonrpc.client.reactors.Reactor] (SSL Stomp Reactor) [] Unable to process messages Broken pipe

Israel, the debug log level was NOT enabled on the engine side, so we cannot see whether the message was received or not. Please repeat with debug logs on all sides.
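For reference, host-side debug logging can be switched on through vdsm's standard Python logging configuration; a sketch assuming the stock /etc/vdsm/logger.conf layout (key names and levels may differ between versions, and the engine side has its own, separate logging configuration not covered here):

```
# back up, raise vdsm loggers from INFO to DEBUG, restart -- sketch only
cp /etc/vdsm/logger.conf /etc/vdsm/logger.conf.bak
sed -i 's/level=INFO/level=DEBUG/g' /etc/vdsm/logger.conf
systemctl restart vdsmd
```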
Retested with:
Engine: Software Version: 4.2.2.2-0.1.el7
Host:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 858.el7.x86_64
KVM Version: 2.10.0 - 21.el7
LIBVIRT Version: libvirt-3.9.0-13.el7
VDSM Version: vdsm-4.20.19-1.el7ev

See logs: https://drive.google.com/open?id=1wkEIyXsmSpTywZvFfr_0k8ucFHSePu4G

This time the host did a graceful shutdown with power management. From /var/log/messages:

Mar 5 15:39:08 puma42 systemd: Starting Power-Off...
Mar 5 15:39:08 puma42 systemd: Stopping Monitoring of LVM2 mirrors, snapshots etc. using dmeventd or progress polling...
Mar 5 15:39:08 puma42 systemd: Shutting down.
Mar 5 15:39:08 puma42 systemd-shutdown[1]: Sending SIGTERM to remaining processes...
Mar 5 15:39:08 puma42 journal: Journal stopped

After booting the host again:

Mar 5 17:41:16 puma42 journal: Runtime journal is using 8.0M (max allowed 1.5G, trying to leave 2.3G free of 15.5G available → current limit 1.5G).
Mar 5 17:41:16 puma42 kernel: microcode: microcode updated early to revision 0x710, date = 2013-06-17

vm_16 was set to the Unknown state:

engine.log:2018-03-05 15:37:29,261+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-709) [] EVENT_ID: VM_SET_TO_UNKNOWN_STATUS(142), VM vm_16 was set to the Unknown status.

It seems like two issues, really:
1) All VMs are shut down, but a few of the last VMs (when there are more than just a few) do not seem to send their update correctly, and the engine declares them Unknown (the event is lost).
2) All VMs are shut down as "user shut down from the guest" due to mis-identification of the shutdown reason.

Hi Derek, I think with this behaviour and HA VMs (with and without leases), we need a documentation section that explains the flow(s) so that the customer understands the impact. In case an HA VM is shut down cleanly (be it due to this BZ, from the UI, or from within the VM), the HA VM isn't restarted. While this might be expected for the UI and in-VM use cases, it is probably not 100% clear for the host shutdown use case. As such, I would like to have that clearly documented in the HA-VM section. Do you think this is feasible? Thanks! Martin

In a word, yes. Can we meet on BlueJeans after next week (travelling to the OPS QBR next week) to go through the detailed planning discussion for docs?

(In reply to Michal Skrivanek from comment #41)
> It seems like two issues, really:
> 1) all VMs are shut down, but a few of the last VMs do not seem to send
>    their update correctly and the engine declares them Unknown (event is lost)
> 2) all VMs are shut down as "user shut down from the guest" due to
>    mis-identification of the shutdown reason

Who's looking at this?

I have retested the setup mentioned by QE. All of the VMs terminated gracefully and without the error; it seems that the issue was due to an unrelated storage error. As for the second part, I have started working on a patch that sets the right status on VMs terminating during a system shutdown.

Verified with:
Engine: 4.2.3.3-0.1.el7
Host:
OS Version: RHEL - 7.5 - 8.el7
Kernel Version: 3.10.0 - 862.el7.x86_64
KVM Version: 2.10.0 - 21.el7_5.2
LIBVIRT Version: libvirt-3.9.0-14.el7_5.2
VDSM Version: vdsm-4.20.27-1.el7ev
SPICE Version: 0.14.0 - 2.el7

Test run: https://polarion.engineering.redhat.com/polarion/#/project/RHEVM3/testrun?id=42-4&tab=records&result=passed

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:1488