Bug 1448628
| Summary: | Sending a large number of tasks to RHEVM causes hypervisors to apparently go offline | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Greg Scott <gscott> |
| Component: | ovirt-engine | Assignee: | Oved Ourfali <oourfali> |
| Status: | CLOSED ERRATA | QA Contact: | guy chen <guchen> |
| Severity: | urgent | Docs Contact: | |
| Priority: | urgent | | |
| Version: | 3.5.7 | CC: | eberman, gscott, gveitmic, lsurette, mgoldboi, michal.skrivanek, mperina, oourfali, pstehlik, rbalakri, rgolan, Rhev-m-bugs, srevivo, ykaul, ylavi |
| Target Milestone: | ovirt-4.1.3 | Keywords: | TestOnly, ZStream |
| Target Release: | --- | Flags: | lsvaty: testing_plan_complete- |
| Hardware: | x86_64 | | |
| OS: | All | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-07-06 07:30:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Greg Scott
2017-05-06 11:54:31 UTC
Thinking about this some more - the VMs might not need to be Windows. Doing something with, say, ssh, to bulk-shutdown a bunch of Fedora or RHEL VMs might also do the trick. I was thinking that selecting a large number of VMs in the RHVM GUI and starting them all simultaneously might also do the trick - but that might also bury the SPM host and disguise the too-many-tasks problem.

I am NOT sure this test is valid, but it's as close as I can get to a mass shutdown. It uses vdsmfake to create thousands of running "fake VMs", then shuts all of them down at once.

1. Deploy a 4.1 Manager (standalone).
2. Add a host and a storage domain.
3. Install Docker somewhere else and run:

   ```
   docker build -t vdsmfake github.com/ovirt/ovirt-vdsmfake --network=host
   docker run --rm -p54322:54322 -p54321:54321 --network=host vdsmfake
   ```

4. Set these options in the 4.1 DB:

   ```sql
   UPDATE vdc_options SET option_value = 'false' WHERE option_name = 'InstallVds';
   UPDATE vdc_options SET option_value = 'true'  WHERE option_name = 'UseHostNameIdentifier';
   UPDATE vdc_options SET option_value = '0'     WHERE option_name = 'HostPackagesUpdateTimeInHours';
   UPDATE vdc_options SET option_value = 'false' WHERE option_name = 'SSLEnabled';
   UPDATE vdc_options SET option_value = 'false' WHERE option_name = 'EncryptHostCommunication';
   ```

5. Add the fakevdsm host from step 3 (Hosts -> Add ...).
6. Configure -> Scheduling Policies -> none -> Copy -> none_no_mem.
7. Edit the new none_no_mem policy and remove "Memory" from "Enabled Filters".
8. Move the real host to maintenance mode; the fake host will get SPM.
9. Create thousands of VMs using the API, pinning them to the fake host and starting them up [1].
10. Stop the engine (`systemctl stop ovirt-engine`).
11. Edit the 4.1 DB with these values and restart the engine:

    ```sql
    UPDATE vdc_options SET option_value = 3    WHERE option_name = 'DefaultMinThreadPoolSize';
    UPDATE vdc_options SET option_value = 3    WHERE option_name = 'DefaultMaxThreadPoolSize';
    UPDATE vdc_options SET option_value = 1000 WHERE option_name = 'DefaultMaxThreadWaitQueueSize';
    ```

12. Activate the real host, run a real VM on it, and switch SPM to it.
13. Shut down all VMs [2].

Results: I get no problems at all while shutting down thousands of VMs at once. I can switch storage domains (real) and hosts (real) to maintenance mode without any problems while the fake VMs on the fake host are being shut down; switching SPM and moving storage domains to maintenance also works fine. The engine does get a bit unresponsive, but it does the job. Interestingly, there are no tasks piling up.

Again, I'm not sure if this is valid due to using fakevdsm, but it's as close as I can get to a high number of VMs. To run a real test I assume we would need a lab with at least around 512 GB of RAM. Mine has ~30 GB, which is not enough for even 100 real VMs. I'm attaching scripts [1] and [2], as they might be useful for a test with real hosts.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1692
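The attached scripts [1] and [2] are not reproduced in this record. As a rough sketch only (not the reporter's actual script), bulk-creating VMs pinned to one host via the oVirt v4 REST API might look like the following; the engine URL, credentials, cluster name, VM naming, and host name are all placeholder assumptions:

```python
import base64
import urllib.request
from xml.etree import ElementTree as ET

# Placeholder connection details -- assumptions, not taken from the bug report.
ENGINE = "https://engine.example.com/ovirt-engine/api"
USER, PASSWORD = "admin@internal", "password"

def vm_payload(name, pinned_host):
    """Build the XML body for POST /vms: a Blank-template VM pinned to one host."""
    vm = ET.Element("vm")
    ET.SubElement(vm, "name").text = name
    ET.SubElement(ET.SubElement(vm, "template"), "name").text = "Blank"
    ET.SubElement(ET.SubElement(vm, "cluster"), "name").text = "Default"
    policy = ET.SubElement(vm, "placement_policy")
    host = ET.SubElement(ET.SubElement(policy, "hosts"), "host")
    ET.SubElement(host, "name").text = pinned_host
    ET.SubElement(policy, "affinity").text = "pinned"
    return ET.tostring(vm, encoding="unicode")

def create_vms(count, pinned_host):
    """Build one POST /vms request per VM; starting them would be a second pass."""
    token = base64.b64encode(("%s:%s" % (USER, PASSWORD)).encode()).decode()
    for i in range(count):
        body = vm_payload("fake_vm_%04d" % i, pinned_host).encode()
        req = urllib.request.Request(
            ENGINE + "/vms", data=body, method="POST",
            headers={"Content-Type": "application/xml",
                     "Authorization": "Basic " + token})
        # urllib.request.urlopen(req)  # only meaningful against a live engine

if __name__ == "__main__":
    create_vms(1000, "fakehost")
```

Starting each created VM would then be a POST of an empty `<action/>` body to `/vms/{id}/start`.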
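In the same spirit, a hedged sketch of the mass shutdown referenced as script [2]: list the VMs with GET /vms, then POST an empty `<action/>` to each VM's `/shutdown` sub-resource (endpoint shape per the oVirt v4 REST API; the helper names and sample response below are invented for illustration):

```python
from xml.etree import ElementTree as ET

def vm_ids(vms_xml):
    """Extract VM ids from a GET /vms response body."""
    return [vm.get("id") for vm in ET.fromstring(vms_xml).findall("vm")]

def shutdown_urls(vms_xml, engine="https://engine.example.com/ovirt-engine/api"):
    """One POST target per VM; each request body is just '<action/>'."""
    return ["%s/vms/%s/shutdown" % (engine, vid) for vid in vm_ids(vms_xml)]

# Illustrative response shape only -- real ids are engine-assigned UUIDs.
sample = '<vms><vm id="a1"/><vm id="b2"/></vms>'
targets = shutdown_urls(sample)
```

Issuing those POSTs in a tight loop is what would have generated the large burst of tasks this bug is about.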