Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1448628

Summary: Sending a large number of tasks to RHEVM causes hypervisors to apparently go offline
Product: Red Hat Enterprise Virtualization Manager
Reporter: Greg Scott <gscott>
Component: ovirt-engine
Assignee: Oved Ourfali <oourfali>
Status: CLOSED ERRATA
QA Contact: guy chen <guchen>
Severity: urgent
Priority: urgent
Docs Contact:
Version: 3.5.7
CC: eberman, gscott, gveitmic, lsurette, mgoldboi, michal.skrivanek, mperina, oourfali, pstehlik, rbalakri, rgolan, Rhev-m-bugs, srevivo, ykaul, ylavi
Target Milestone: ovirt-4.1.3
Keywords: TestOnly, ZStream
Target Release: ---
Flags: lsvaty: testing_plan_complete-
Hardware: x86_64
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-07-06 07:30:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Greg Scott 2017-05-06 11:54:31 UTC
Description of problem:

Send a large number of tasks to RHEVM, such as a bulk shutdown of several VMs. Various task queues overflow, RHEVM gets busy, misses the heartbeats from RHEVH hosts, and declares them offline, eventually taking the whole environment down.


Version-Release number of selected component (if applicable):
3.x, possibly 4.x

How reproducible:
At will.

Steps to Reproduce:
1. Build a RHEV environment with a large number of Windows VMs. Make them all part of an Active Directory domain.  Start them all.
2. From another Windows system, execute a script similar to the one pasted into "Additional Info" below.

Actual results:

RHEVM declares some or all hypervisors offline, leaving tasks backed up with nowhere to execute.  Sometimes orphan tasks also plague the database.

Expected results:

RHEVM needs a mechanism in 4.x to deal gracefully with overflowing queues.  We also need a 3.x workaround to increase the queue depth and minimize this problem.

Additional info:

Germano reproduced the problem on 3.x by setting the relevant queue depth parameters to an artificially low number and using a Python program to inject tasks into RHEVM. He was unable to inject tasks fast enough to overflow the queue in 4.1, so the problem could not be reproduced there.

However, a script like the one pasted below might do the trick. Build up a 4.1 RHV environment with, say, 1000 Windows VMs, make them all members of the same Active Directory domain, and start them all. From another Windows system in the same AD domain, execute a script similar to this one. When VMs shut down, the hypervisors are supposed to notice and notify RHEVM, which generates a bunch of tasks. Bulk-shutting down enough VMs may generate enough tasks to overflow the queues; let's see what happens. Note that the VMs don't have to be in a pool; it just might be convenient to set them up that way.

@Title 1.CD2.bat
@echo *********  Press any key to Shutdown VMs now! ********* 
@Echo                      Ctrl C to Abort
@echo ******************************************************* 
@pause
@echo off
@echo(
shutdown /s  /m \\mypool-1.example.local
shutdown /s  /m \\mypool-10.example.local
shutdown /s  /m \\mypool-100.example.local
shutdown /s  /m \\mypool-101.example.local
shutdown /s  /m \\mypool-102.example.local
shutdown /s  /m \\mypool-103.example.local
shutdown /s  /m \\mypool-104.example.local
shutdown /s  /m \\mypool-105.example.local
shutdown /s  /m \\mypool-106.example.local
shutdown /s  /m \\mypool-107.example.local
shutdown /s  /m \\mypool-108.example.local
shutdown /s  /m \\mypool-109.example.local
shutdown /s  /m \\mypool-11.example.local
shutdown /s  /m \\mypool-110.example.local
shutdown /s  /m \\mypool-111.example.local
shutdown /s  /m \\mypool-112.example.local
shutdown /s  /m \\mypool-113.example.local
shutdown /s  /m \\mypool-114.example.local
shutdown /s  /m \\mypool-115.example.local
shutdown /s  /m \\mypool-116.example.local
shutdown /s  /m \\mypool-117.example.local
shutdown /s  /m \\mypool-118.example.local
shutdown /s  /m \\mypool-119.example.local
shutdown /s  /m \\mypool-12.example.local
shutdown /s  /m \\mypool-120.example.local
shutdown /s  /m \\mypool-121.example.local
shutdown /s  /m \\mypool-122.example.local
shutdown /s  /m \\mypool-123.example.local
shutdown /s  /m \\mypool-124.example.local
shutdown /s  /m \\mypool-125.example.local
shutdown /s  /m \\mypool-126.example.local
shutdown /s  /m \\mypool-127.example.local
shutdown /s  /m \\mypool-128.example.local
shutdown /s  /m \\mypool-129.example.local
shutdown /s  /m \\mypool-13.example.local
shutdown /s  /m \\mypool-130.example.local
shutdown /s  /m \\mypool-131.example.local
shutdown /s  /m \\mypool-132.example.local
shutdown /s  /m \\mypool-133.example.local
shutdown /s  /m \\mypool-134.example.local
shutdown /s  /m \\mypool-135.example.local
shutdown /s  /m \\mypool-136.example.local
shutdown /s  /m \\mypool-137.example.local
shutdown /s  /m \\mypool-138.example.local
shutdown /s  /m \\mypool-139.example.local
shutdown /s  /m \\mypool-14.example.local
shutdown /s  /m \\mypool-140.example.local
shutdown /s  /m \\mypool-141.example.local
shutdown /s  /m \\mypool-142.example.local
shutdown /s  /m \\mypool-143.example.local
shutdown /s  /m \\mypool-144.example.local
shutdown /s  /m \\mypool-145.example.local
shutdown /s  /m \\mypool-146.example.local
shutdown /s  /m \\mypool-147.example.local
shutdown /s  /m \\mypool-148.example.local
shutdown /s  /m \\mypool-149.example.local
shutdown /s  /m \\mypool-15.example.local
shutdown /s  /m \\mypool-150.example.local
shutdown /s  /m \\mypool-151.example.local
shutdown /s  /m \\mypool-152.example.local
shutdown /s  /m \\mypool-153.example.local
shutdown /s  /m \\mypool-154.example.local
shutdown /s  /m \\mypool-155.example.local
shutdown /s  /m \\mypool-156.example.local
shutdown /s  /m \\mypool-157.example.local
shutdown /s  /m \\mypool-158.example.local
shutdown /s  /m \\mypool-159.example.local
shutdown /s  /m \\mypool-16.example.local
shutdown /s  /m \\mypool-160.example.local
shutdown /s  /m \\mypool-161.example.local
shutdown /s  /m \\mypool-162.example.local
shutdown /s  /m \\mypool-163.example.local
shutdown /s  /m \\mypool-164.example.local
shutdown /s  /m \\mypool-165.example.local
shutdown /s  /m \\mypool-166.example.local
shutdown /s  /m \\mypool-167.example.local
shutdown /s  /m \\mypool-168.example.local
shutdown /s  /m \\mypool-169.example.local
shutdown /s  /m \\mypool-17.example.local
shutdown /s  /m \\mypool-170.example.local
shutdown /s  /m \\mypool-18.example.local
shutdown /s  /m \\mypool-19.example.local
shutdown /s  /m \\mypool-2.example.local
shutdown /s  /m \\mypool-20.example.local
shutdown /s  /m \\mypool-21.example.local
shutdown /s  /m \\mypool-22.example.local
shutdown /s  /m \\mypool-23.example.local
shutdown /s  /m \\mypool-24.example.local
shutdown /s  /m \\mypool-25.example.local
shutdown /s  /m \\mypool-26.example.local
shutdown /s  /m \\mypool-27.example.local
shutdown /s  /m \\mypool-28.example.local
shutdown /s  /m \\mypool-29.example.local
shutdown /s  /m \\mypool-3.example.local
shutdown /s  /m \\mypool-30.example.local
shutdown /s  /m \\mypool-31.example.local
shutdown /s  /m \\mypool-32.example.local
shutdown /s  /m \\mypool-33.example.local
shutdown /s  /m \\mypool-34.example.local
shutdown /s  /m \\mypool-35.example.local
shutdown /s  /m \\mypool-36.example.local
shutdown /s  /m \\mypool-37.example.local
shutdown /s  /m \\mypool-38.example.local
shutdown /s  /m \\mypool-39.example.local
shutdown /s  /m \\mypool-4.example.local
shutdown /s  /m \\mypool-40.example.local
shutdown /s  /m \\mypool-41.example.local
shutdown /s  /m \\mypool-42.example.local
shutdown /s  /m \\mypool-43.example.local
shutdown /s  /m \\mypool-44.example.local
shutdown /s  /m \\mypool-45.example.local
shutdown /s  /m \\mypool-46.example.local
shutdown /s  /m \\mypool-47.example.local
shutdown /s  /m \\mypool-48.example.local
shutdown /s  /m \\mypool-49.example.local
shutdown /s  /m \\mypool-5.example.local
shutdown /s  /m \\mypool-50.example.local
shutdown /s  /m \\mypool-51.example.local
shutdown /s  /m \\mypool-52.example.local
shutdown /s  /m \\mypool-53.example.local
shutdown /s  /m \\mypool-54.example.local
shutdown /s  /m \\mypool-55.example.local
shutdown /s  /m \\mypool-56.example.local
shutdown /s  /m \\mypool-57.example.local
shutdown /s  /m \\mypool-58.example.local
shutdown /s  /m \\mypool-59.example.local
shutdown /s  /m \\mypool-6.example.local
shutdown /s  /m \\mypool-60.example.local
shutdown /s  /m \\mypool-61.example.local
shutdown /s  /m \\mypool-62.example.local
shutdown /s  /m \\mypool-63.example.local
shutdown /s  /m \\mypool-64.example.local
shutdown /s  /m \\mypool-65.example.local
shutdown /s  /m \\mypool-66.example.local
shutdown /s  /m \\mypool-67.example.local
shutdown /s  /m \\mypool-68.example.local
shutdown /s  /m \\mypool-69.example.local
shutdown /s  /m \\mypool-7.example.local
shutdown /s  /m \\mypool-70.example.local
shutdown /s  /m \\mypool-71.example.local
shutdown /s  /m \\mypool-72.example.local
shutdown /s  /m \\mypool-73.example.local
shutdown /s  /m \\mypool-74.example.local
shutdown /s  /m \\mypool-75.example.local
shutdown /s  /m \\mypool-76.example.local
shutdown /s  /m \\mypool-77.example.local
shutdown /s  /m \\mypool-78.example.local
shutdown /s  /m \\mypool-79.example.local
shutdown /s  /m \\mypool-8.example.local
shutdown /s  /m \\mypool-80.example.local
shutdown /s  /m \\mypool-81.example.local
shutdown /s  /m \\mypool-82.example.local
shutdown /s  /m \\mypool-83.example.local
shutdown /s  /m \\mypool-84.example.local
shutdown /s  /m \\mypool-85.example.local
shutdown /s  /m \\mypool-86.example.local
shutdown /s  /m \\mypool-87.example.local
shutdown /s  /m \\mypool-88.example.local
shutdown /s  /m \\mypool-89.example.local
shutdown /s  /m \\mypool-9.example.local
shutdown /s  /m \\mypool-90.example.local
shutdown /s  /m \\mypool-91.example.local
shutdown /s  /m \\mypool-92.example.local
shutdown /s  /m \\mypool-93.example.local
shutdown /s  /m \\mypool-94.example.local
shutdown /s  /m \\mypool-95.example.local
shutdown /s  /m \\mypool-96.example.local
shutdown /s  /m \\mypool-97.example.local
shutdown /s  /m \\mypool-98.example.local
shutdown /s  /m \\mypool-99.example.local
@echo *********  Script Completed *********
@echo off
@echo(
@pause

Comment 1 Greg Scott 2017-05-06 12:26:32 UTC
Thinking about this some more - the VMs might not need to be Windows. Using ssh to bulk-shutdown a bunch of Fedora or RHEL VMs might also do the trick (a rough sketch of that approach is below).

I was thinking that selecting a large number of VMs in the RHVM GUI and starting them all simultaneously might also do the trick - but that might also bury the SPM host and disguise the too-many-tasks problem.
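
For what it's worth, here is a minimal sketch of that ssh-based variant. It assumes Python with paramiko on the controlling machine, key-based root login on the guests, and the same mypool-N naming as the batch script above; all of those details are assumptions, not part of this bug.

import paramiko  # hypothetical helper script, not an attachment to this bug

# Guest names follow the mypool-N pattern from the batch script above;
# adjust the range, user, and key for the actual lab.
hosts = ['mypool-%d.example.local' % i for i in range(1, 171)]

for host in hosts:
    client = paramiko.SSHClient()
    client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    try:
        client.connect(host, username='root', key_filename='/root/.ssh/id_rsa')
        # Power off from inside the guest so the hypervisor notices the
        # shutdown and reports it to RHEVM, just like the Windows script does.
        client.exec_command('systemctl poweroff')
    finally:
        client.close()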

Comment 6 Germano Veit Michel 2017-05-09 06:37:09 UTC
I am NOT sure this test is valid, but it's as close as I can get to a mass shutdown. It uses vdsmfake to create thousands of running "fake VMs", then shuts all of them down at once.

1. Deploy 4.1 Manager (standalone)
2. Add a Host and a Storage Domain
3. Install Docker somewhere else and run this:
docker build -t vdsmfake github.com/ovirt/ovirt-vdsmfake --network=host
docker run --rm -p54322:54322 -p54321:54321 --network=host vdsmfake
4. Set these options in the 4.1 DB:
UPDATE vdc_options set option_value = 'false' where option_name = 'InstallVds';
UPDATE vdc_options set option_value = 'true' WHERE option_name = 'UseHostNameIdentifier';
UPDATE vdc_options set option_value = '0' WHERE option_name = 'HostPackagesUpdateTimeInHours';
UPDATE vdc_options set option_value = 'false' WHERE option_name = 'SSLEnabled';
UPDATE vdc_options set option_value = 'false' WHERE option_name = 'EncryptHostCommunication';
5. Add the fakevdsm Host from step 3 (Hosts -> Add ...)
6. Configure -> Scheduling Policies -> none -> Copy -> none_no_mem
7. Edit none_no_mem -> Remove "Memory" from "Enabled Filters"
8. Move the real host to Maintenance Mode, Fake host will get SPM
9. Create thousands of VMs using the API, pinning them to the FAKE host and starting them up [1] (see the sketch after this list)
10. systemctl stop ovirt-engine
11. Edit 4.1 DB with these values and restart the engine
UPDATE vdc_options SET option_value = 3 WHERE option_name = 'DefaultMinThreadPoolSize';
UPDATE vdc_options SET option_value = 3 WHERE option_name = 'DefaultMaxThreadPoolSize';
UPDATE vdc_options SET option_value = 1000 WHERE option_name = 'DefaultMaxThreadWaitQueueSize';
12. Activate the real Host, run a real VM on it, switch SPM to it
13. Shut down all VMs [2]
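
For reference, here is a rough sketch of the kind of SDK calls scripts [1] and [2] presumably make, using the oVirt Python SDK (ovirtsdk4). The engine URL, credentials, cluster, host, and VM names below are placeholders, not values taken from the actual attachments.

# Hypothetical sketch only -- not the attached scripts [1]/[2].
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

connection = sdk.Connection(
    url='https://engine.example.local/ovirt-engine/api',
    username='admin@internal',
    password='secret',
    insecure=True,  # lab setup only
)
system = connection.system_service()
vms_service = system.vms_service()

# Step 9 / attachment [1]: create VMs pinned to the fake host and start them.
fake_host = system.hosts_service().list(search='name=fakevdsm-host')[0]
for i in range(1, 1001):
    vm = vms_service.add(types.Vm(
        name='fake-vm-%d' % i,
        cluster=types.Cluster(name='Default'),
        template=types.Template(name='Blank'),
        placement_policy=types.VmPlacementPolicy(
            hosts=[types.Host(id=fake_host.id)],
            affinity=types.VmAffinity.PINNED,
        ),
    ))
    vms_service.vm_service(vm.id).start()

# Step 13 / attachment [2]: shut them all down at once.
for vm in vms_service.list(search='name=fake-vm-*'):
    vms_service.vm_service(vm.id).shutdown()

connection.close()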

Results:

I get no problems at all while shutting down 1000s of VMs at once. I can switch the (real) SDs and Hosts to Maintenance mode without any problems while the fake VMs on the fake host are being shut down; switching SPM and moving SDs to maintenance also work fine. The engine does get a bit unresponsive but does the job. Interestingly, there are no tasks piling up.

Again, I'm not sure if this is valid due to using fakevdsm, but it's as close as I can get to a high number of VMs. To run a real test I assume we would need a lab with at least around 512GB of RAM. Mine has ~30GB, which is not enough for even 100 real VMs.

I'm attaching scripts [1] and [2], as they might be useful for a test with real hosts.

Comment 35 errata-xmlrpc 2017-07-06 07:30:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1692

Comment 36 Daniel Gur 2019-08-28 12:57:57 UTC
sync2jira

Comment 37 Daniel Gur 2019-08-28 13:03:05 UTC
sync2jira

Comment 38 Daniel Gur 2019-08-28 13:14:28 UTC
sync2jira

Comment 39 Daniel Gur 2019-08-28 13:19:30 UTC
sync2jira