Bug 1208654 - Satellite6 is unresponsive during vmware or rhevm image clone
Summary: Satellite6 is unresponsive during vmware or rhevm image clone
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Orchestration
Version: 6.0.8
Hardware: All
OS: All
Priority: unspecified
Severity: high
Target Milestone: Unspecified
Assignee: Ivan Necas
QA Contact: Katello QA List
URL: http://projects.theforeman.org/issues...
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-04-02 19:16 UTC by ldomb
Modified: 2019-09-26 13:53 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-14 17:14:32 UTC
Target Upstream Version:
Embargoed:




Links:
Foreman Issue Tracker 10579 (last updated 2016-04-22 15:36:22 UTC)

Description ldomb 2015-04-02 19:16:12 UTC
Description of problem:

When creating a new VM/host from an image via the REST API (it does not matter whether VMware or RHEV, as it happens on both), Satellite 6 drops into a non-responsive state and rejects all requests until the VMDK or image clone is done. Once the disk is released, Satellite 6 is responsive again.

This results in failed registrations for any host trying to register, which is quite critical. It also prevents the user from creating multiple hosts from an image at the same time (bulk host creation via the API), meaning the user can only provision sequentially and has to wait for each host until it has uploaded at least one report.


Version-Release number of selected component (if applicable): 6.0.8 


How reproducible:

Create a VM host via the API (image-based, not PXE) and watch the logs; the host becomes non-responsive.



Steps to Reproduce:
1. API call for VMware:


if provider == "vmware"
# Choose params
post_json(url+"hosts", JSON.generate({"host" => {
"name"=>hostname,
"organization_id"=>organization_id,
"location_id"=>location_id,
"hostgroup_id"=>hostgroup_id,
"compute_resource_id"=>compute_resource_id,
"environment_id"=>environment_id,
"content_source_id"=>"1",
"managed"=>"true",
"type"=>"Host::Managed",
"compute_attributes"=>{"cpus"=>cpus, "corespersocket"=>corespersocket, "memory_mb"=>memory, "cluster"=>vmcluster, "path"=>"/Datacenters/#{datacenter}/vm", "guest_id"=>"otherGuest64", "interfaces_attributes"=>{"new_interfaces"=>{"type"=>"VirtualE1000", "network"=>network, "_delete"=>""}, "0"=>{"type"=>"VirtualE1000", "network"=>network, "_delete"=>""}}, "volumes_attributes"=>{"new_volumes"=>{"datastore"=>datastore, "name"=>"Hard disk", "size_gb"=>disksize, "thin"=>"true", "_delete"=>""}, "0"=>{"datastore"=>datastore, "name"=>"Hard disk", "size_gb"=>disksize, "thin"=>"true", "_delete"=>""}}, "scsi_controller_type"=>"VirtualLsiLogicController", "start"=>"1", "image_id"=>"templates/#{template}"},
"domain_id"=>domain_id,
"realm_id"=>"",
"mac"=>"",
"subnet_id"=>subnet_id,
"ip"=>ipaddr,
"interfaces_attributes"=>{"new_interfaces"=>{"_destroy"=>"false", "type"=>"Nic::Managed", "mac"=>"", "name"=>"", "domain_id"=>"", "ip"=>"", "provider"=>"IPMI"}},
"architecture_id"=>architecture_id,
"operatingsystem_id"=>operatingsystem_id,
"provision_method"=>"image",
"build"=>"1",
"root_pass"=>rootpw,
"medium_id"=>"",
"disk"=>"",
"enabled"=>"1",
"model_id"=>"",
"comment"=>"",
"overwrite"=>"false"}}))
end

if provider == "RedhatVirt"
memsize = memory.to_i * 1024 * 1024
number1 = rand.to_s[2..14]
number2 = rand.to_s[2..14]  
post_json(url+"hosts", JSON.generate({"host" => {
"name"=>hostname,
"organization_id"=>organization_id,
"location_id"=>location_id,
"hostgroup_id"=>hostgroup_id,
"compute_resource_id"=>compute_resource_id,
"environment_id"=>environment_id,
"content_source_id"=>"1",
"managed"=>"true",
"type"=>"Host::Managed",
"compute_attributes"=>{"cpus"=>cpus,"cores"=>corespersocket, "memory"=>memsize, "cluster"=>cluster, "interfaces_attributes"=>{"new_interfaces"=>{"name"=>"", "network"=>network, "_delete"=>""}, "new_#{number1}"=>{"name"=>"eth0", "network"=>network, "_delete"=>""}}, "volumes_attributes"=>{"new_volumes"=>{"size_gb"=>"", "storage_domain"=>datastore, "_delete"=>"", "id"=>""}, "new_#{number2}"=>{"size_gb"=>disksize, "storage_domain"=>datastore, "_delete"=>"", "id"=>""}}, "start"=>"1", "image_id"=>template},
"domain_id"=>domain_id,
"realm_id"=>"", 
"mac"=>"",
"subnet_id"=>subnet_id,
"ip"=>ipaddr,
"interfaces_attributes"=>{"new_interfaces"=>{"_destroy"=>"false", "type"=>"Nic::Managed", "mac"=>"", "name"=>"", "domain_id"=>"", "ip"=>"", "provider"=>"IPMI"}},
"architecture_id"=>architecture_id,
"operatingsystem_id"=>operatingsystem_id,
"provision_method"=>"image",
"build"=>"1",
"disk"=>"", 
"root_pass"=>rootpw,
"enabled"=>"1",
"model_id"=>"",
"comment"=>"",
"overwrite"=>"false"}}))
end

2. Wait until it breaks.
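The reproduction snippets in step 1 assume a post_json helper that is not shown in the report. A minimal sketch of what such a helper might look like, using Ruby's net/http; the Satellite URL, credentials, and SSL handling here are illustrative assumptions, not taken from the original script:

```ruby
require 'net/http'
require 'uri'
require 'json'
require 'openssl'

# Builds the POST request used by post_json; split out so it can be
# inspected without touching the network. Credentials are placeholders.
def build_json_post(uri, payload, user: 'admin', password: 'changeme')
  request = Net::HTTP::Post.new(uri.request_uri)
  request.basic_auth(user, password)
  request['Content-Type'] = 'application/json'
  request.body = payload
  request
end

# Hypothetical post_json helper: POSTs a pre-serialized JSON payload to the
# given Satellite API endpoint and returns the HTTP response.
def post_json(endpoint, payload)
  uri = URI.parse(endpoint)
  http = Net::HTTP.new(uri.host, uri.port)
  http.use_ssl = (uri.scheme == 'https')
  # Satellite installs often use a self-signed certificate; relax
  # verification for the sake of the sketch only.
  http.verify_mode = OpenSSL::SSL::VERIFY_NONE
  http.request(build_json_post(uri, payload))
end
```

With a helper along these lines, the calls in step 1 would send the generated JSON host definition to the hosts endpoint of the Satellite API.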


Actual results:
Sat6 is not responsive during an image clone.

Expected results:
Being able to communicate with Sat6 during the image clone.

Additional info:

This was tested on a BL465 G6 with an SB40 (6x15k disks), so it's not a hardware issue.

Comment 1 RHEL Program Management 2015-04-02 19:33:34 UTC
Since this issue was entered in Red Hat Bugzilla, the release flag has been
set to ? to ensure that it is properly evaluated for this release.

Comment 3 Ivan Necas 2015-04-07 07:39:49 UTC
Was that observed when doing multiple host creations at once? Or even on the first host-create API call? I assume a finish template (not user-data) was used for the provisioning…

Comment 4 ldomb 2015-04-08 03:33:22 UTC
Ivan

The lock happens on every image-provisioning task. It does not matter whether you provision one or ten hosts. The lock happens when the provider starts cloning the image.

Yes I used the Satellite Finish Default template. How does the template in any way affect image cloning?

Comment 5 Ivan Necas 2015-04-08 10:14:42 UTC
I'm asking because the type of template used might also influence how the provisioning is handled (with the finish template, we need to wait for the machine to be ready so we can run the finish script on it).

When does the UI get responsive again? After the cloning is finished, or after the whole provisioning is done?

Also, do the Satellite and the provider share any resources (running in the same cluster, sharing the data storage, etc.)?

I was not able to reproduce this behavior on my setup, therefore I'm asking for more details that could point us to the root cause of the issue so that we can fix that.

Comment 7 ldomb 2015-04-08 12:29:31 UTC
It gets responsive again after the clone is finished (so after the VM starts up). No, Satellite 6 and the VM storage are on two different storage types. The only thing they share is the blade chassis.

Comment 10 ldomb 2015-04-20 15:19:12 UTC
Ivan and I just tested and were able to fix the issue by increasing

PassengerMinInstances 1 to PassengerMinInstances 10
PassengerMaxPoolSize 6 (default) to PassengerMaxPoolSize 20

in /etc/httpd/conf.d/05-foreman-ssl.conf

After that, the API requests were no longer blocked by the finish scripts, as new workers were spawned.

Image provisioning now works on RHEV and VMware without locking issues.

It might be good to add this info to a KB article so that others who run into the same issue know how to fix it.
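For reference, the changed directives would read as follows. This is only an excerpt with the values from the comment above; the surrounding VirtualHost context is omitted, and on some installs PassengerMaxPoolSize may instead live in a global Passenger config file:

```apache
# /etc/httpd/conf.d/05-foreman-ssl.conf (excerpt)
# Keep more app processes warm so finish scripts cannot starve the API:
PassengerMinInstances 10
# Allow more concurrent worker processes overall (default is 6):
PassengerMaxPoolSize 20
```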

Comment 12 Ivan Necas 2015-04-20 16:19:52 UTC
Yes, we might be interested in making this a default, although there is always an issue with calculating the proper defaults based on the hardware it's being installed on.

For the mid-term, however, the proper fix should be getting the provisioning into foreman-tasks, so that it does not block the web-server processes. Also, the process might be optimized to not block while waiting for SSH to be ready (or even for the cloning to finish). Every process that blocks while just waiting for an external event is a waste of resources, and Dynflow is ready to help with that.
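The "don't block while waiting for an external event" idea can be sketched generically. This is an illustrative polling loop, not the actual Foreman or Dynflow implementation:

```ruby
# Illustrative only: poll for a condition on a timer instead of holding a
# web-server worker in one long blocking call. In Dynflow, the sleep below
# would instead suspend the step and free the process for other work.
def wait_until(timeout: 600, interval: 5)
  deadline = Time.now + timeout
  until Time.now >= deadline
    return true if yield # condition met
    sleep interval       # back off before re-checking
  end
  false # timed out without the condition becoming true
end
```

For example, waiting for SSH to become reachable would pass a block that attempts a connection and returns true on success.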

Comment 13 Ivan Necas 2015-05-04 12:38:51 UTC
I will work on a patch for setting the worker numbers via the installer, with some better defaults.

Comment 14 Ivan Necas 2015-05-21 16:07:32 UTC
Created redmine issue http://projects.theforeman.org/issues/10579 from this bug

Comment 15 Ivan Necas 2015-05-21 16:25:54 UTC
Apparently, the config option for setting the min-instances is already there:


   katello-installer --foreman-passenger-min-instances=6

Comment 17 Bryan Kearney 2015-08-25 18:34:55 UTC
Upstream bug component is Orchestration

Comment 20 Bryan Kearney 2017-03-14 17:14:32 UTC
An upstream issue has been opened for this. When it is fixed, the next version of Satellite will contain the fix. We will no longer be tracking this downstream. If you feel this was closed in error, please feel free to re-open with additional information.

