Description of problem:
Editing the number of prestarted VMs and saving the pool configuration does not behave as expected.

Version-Release number of selected component (if applicable):
av2

How reproducible:

Steps to Reproduce:
1. Create a pool from a template with 50 VMs.
2. Edit the pool and set 50 VMs prestarted.
3. Wait.
4. Edit the pool again and set 0 VMs prestarted.

Actual results:
Only 18 of the 50 VMs get started, although neither the reported host CPU nor memory usage exceeds 20%. For two VMs the error "Failed to complete starting of " is reported, without giving any reason why the engine was unable to start the VM. The engine makes no further attempt to start VMs for the next 30 minutes. After step 4, all 18 running VMs are still running.

Expected results:
The engine attempts to start all 50 VMs. If starting fails due to insufficient host resources, this is reported. When the number of prestarted VMs is set to zero, all the prestarted running VMs are stopped.

Additional info:
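For reference, steps 2 and 4 can also be scripted against the REST API. The sketch below is a rough illustration only; the API root, credentials, and the prestarted_vms element name are assumptions that may differ between versions, and the pool id is the one from the attached engine.log.

# Hypothetical reproduction script (paths, element names and credentials are
# assumptions -- adjust to the environment under test).
import requests

ENGINE = "https://engine.example.com/api"          # assumed API root
AUTH = ("admin@internal", "password")              # assumed credentials
HEADERS = {"Content-Type": "application/xml"}
POOL_ID = "e21c20c0-4b0d-4d90-bac2-9b6d16933eed"   # pool id from engine.log

def set_prestarted(count):
    """Steps 2 and 4: edit the pool and set the number of prestarted VMs."""
    body = "<vmpool><prestarted_vms>%d</prestarted_vms></vmpool>" % count
    resp = requests.put("%s/vmpools/%s" % (ENGINE, POOL_ID),
                        data=body, headers=HEADERS, auth=AUTH, verify=False)
    resp.raise_for_status()

set_prestarted(50)   # step 2
# ... wait (step 3) and watch how many pool VMs actually come up ...
set_prestarted(0)    # step 4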
logs?
Created attachment 872593: engine.log, related parts at the end
Petr, the existing VMs should not be stopped. How long did you wait when you changed prestarted to 0? Likely the original VMs were not yet started, hence you got just a few of them prestarted…
I waited about 30 minutes. VMs were being started during the first 10 minutes, then nothing happened for the next 20 minutes.
OK, it's clear from the log that after 18 VMs started, there were no more resources to start any more VMs, although the engine tried to start more every 5 minutes:

2014-03-06 17:01:54,164 INFO [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler_Worker-3) VmPool e21c20c0-4b0d-4d90-bac2-9b6d16933eed is missing 50 prestarted Vms, attempting to prestart 5 Vms
..
2014-03-06 17:19:58,405 INFO [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler_Worker-3) VmPool e21c20c0-4b0d-4d90-bac2-9b6d16933eed is missing 32 prestarted Vms, attempting to prestart 5 Vms
2014-03-06 17:19:58,456 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (DefaultQuartzScheduler_Worker-3) Candidate host rhel6 (35845df9-45db-4eab-b9e4-270217cd52b2) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory
..
2014-03-06 17:24:58,614 INFO [org.ovirt.engine.core.bll.VmPoolMonitor] (DefaultQuartzScheduler_Worker-29) VmPool e21c20c0-4b0d-4d90-bac2-9b6d16933eed is missing 32 prestarted Vms, attempting to prestart 5 Vms

This continued until the prestarted count decreased. As mentioned in comment 4, we should not stop any running VMs.

Do you think your setup should have been able to run more than these VMs?
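For what it's worth, the throttling visible in those log lines (a batch of 5 prestart attempts per roughly 5-minute monitoring cycle, each attempt still subject to the scheduling filters) behaves roughly like this simplified sketch; the batch size and interval come from the log above, everything else (the capacity model, the names) is made up for illustration:

# Simplified model of the prestart loop seen in the log: each monitoring
# cycle tries to start at most 5 of the missing prestarted VMs, and every
# start can still be rejected by a scheduling filter (Memory in this case).
BATCH_PER_CYCLE = 5      # "attempting to prestart 5 Vms"
CYCLE_MINUTES = 5        # interval between VmPoolMonitor runs in the log

def run_cycles(missing, host_capacity, cycles):
    started = 0
    for cycle in range(cycles):
        to_start = min(BATCH_PER_CYCLE, missing - started)
        for _ in range(to_start):
            if started < host_capacity:   # the Memory filter would pass
                started += 1
            else:                         # host filtered out, nothing visible in the UI
                print("cycle %d: host filtered out by Memory filter" % cycle)
    return started

# 50 prestarted VMs requested, host memory fits only 18 of them,
# 30 minutes of monitoring (6 cycles): only 18 ever come up.
print(run_cycles(missing=50, host_capacity=18, cycles=30 // CYCLE_MINUTES))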
OK, if we don't stop VMs when the prestarted VM count is decreased, that's fine. I have no idea how many VMs the setup should handle, but I'd expect an alert message letting me know I cannot have more VMs. Ideally, it would give a reason (CPU, memory, ...).
you didn't see anything in the event log? (stating insufficient memory as the reason)
No, I've rechecked this in av4. I set the prestarted VM count to 30 on the same setup and got two "Failed to complete starting of VM" errors, with no reason given. Now I've got 12 VMs of the pool running, plus one which was already running before this test. The last VM was started about an hour ago; current host usage is 20% memory and 2% CPU, but I expect the CPU usage was much higher while many VMs were booting. I see in the engine.log that VMs could not be started because the host was filtered out due to memory, but there is no such message in the WebUI event log.
Smells like SLA; Martin, please check.
SLA indeed. Currently, when there is a scheduling failure, you only get this line in the logs indicating there was some problem:

2014-03-06 17:29:59,005 INFO [org.ovirt.engine.core.bll.scheduling.SchedulingManager] (DefaultQuartzScheduler_Worker-7) Candidate host rhel6 (35845df9-45db-4eab-b9e4-270217cd52b2) was filtered out by VAR__FILTERTYPE__INTERNAL filter Memory

It would be better if the SchedulingManager emitted standard audit log events, so the administrator would directly see what the problem is.
This is definitely not SLA. SLA correctly says that there are not enough resources; VM pool management is 100% virt :) SchedulingManager is separated from AuditLog by design, and the consumer of the scheduling results is responsible for logging them. But I wrote the VmPoolMonitor patch for you.
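To illustrate the separation described above (all names below are hypothetical stand-ins, not the engine's actual classes): the scheduler only returns a result together with the reasons a host was filtered out, and the caller, e.g. the pool monitor, is the one that turns a failure into a user-visible event.

# Sketch of the "consumer logs the scheduling outcome" design; the classes
# and functions here are illustrative stand-ins, not engine code.
class SchedulingResult(object):
    def __init__(self, host=None, filter_reasons=()):
        self.host = host
        self.filter_reasons = list(filter_reasons)

def schedule(vm, hosts):
    # scheduler side: pick a host or report why none qualified; it does not
    # write audit-log events itself
    for host in hosts:
        if host["free_memory_mb"] >= vm["memory_mb"]:
            return SchedulingResult(host=host)
    return SchedulingResult(filter_reasons=["Memory"])

def audit_log(message):
    # stand-in for the engine's event/audit log shown in the WebUI
    print("EVENT: %s" % message)

def prestart_vm(vm, hosts):
    # consumer side (e.g. the pool monitor): decides how to surface a failure
    result = schedule(vm, hosts)
    if result.host is None:
        audit_log("Failed to prestart VM %s: host filtered out by %s"
                  % (vm["name"], ", ".join(result.filter_reasons)))
        return False
    return True

prestart_vm({"name": "pool-19", "memory_mb": 2048},
            [{"name": "rhel6", "free_memory_mb": 512}])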
Verified: when a pre-started VM fails to start, an indication appears in the Events/Tasks tabs.

Verified using:
rhevm-3.5.0-0.21.el6ev.noarch
sanlock-2.8-1.el6.x86_64
libvirt-0.10.2-46.el6_6.2.x86_64
qemu-kvm-rhev-0.12.1.2-2.448.el6.x86_64
vdsm-4.16.7.5-1.el6ev.x86_64
RHEV-M 3.5.0 has been released