Bug 1490138 - concurrent registrations/attach or api causes task conflict
Summary: concurrent registrations/attach or api causes task conflict
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Satellite
Classification: Red Hat
Component: Registration
Version: 6.2.11
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: Unspecified
Assignee: satellite6-bugs
QA Contact: Katello QA List
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-09-10 20:34 UTC by Jan Hutař
Modified: 2017-09-11 20:00 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-11 20:00:42 UTC
Target Upstream Version:
Embargoed:




Links
Red Hat Bugzilla 1486375 (Priority: high, Status: CLOSED) - Katello::Host::Update gets retriggered frequently when it fails due to locking (Last Updated: 2023-09-07 18:56:19 UTC)

Internal Links: 1486375

Description Jan Hutař 2017-09-10 20:34:47 UTC
Description of problem:
I'm not sure about the trigger, but I can see lots of errors like:

2017-09-10 16:11:15 ea9effae [app] [I] Started POST "/rhsm/environments/1/consumers" for 172.17.1.16 at 2017-09-10 16:11:15 -0400
2017-09-10 16:11:15 ea9effae [app] [I] Processing by Katello::Api::Rhsm::CandlepinProxiesController#consumer_create as JSON
2017-09-10 16:11:15 ea9effae [app] [I]   Parameters: {"facts"=>"[FILTERED]", "installedProducts"=>[{"productName"=>"Red Hat Enterprise Linux Server", "arch"=>"x86_64", "version"=>"7.4", "productId"=>"69"}], "type"=>"system", "name"=>"c03-h26-r620container878.example.com", "contentTags"=>["rhel-7", "rhel-7-server"], "environment_id"=>"1"}
2017-09-10 16:11:15 ea9effae [app] [I] Expire fragment views/tabs_and_title_records-3 (0.1ms)
2017-09-10 16:11:15 ea9effae [katello/cp_proxy] [I] Authorized user admin(Admin User)
2017-09-10 16:11:15 ea9effae [foreman-tasks/action] [E] Required lock is already taken by other running tasks.
 | Please inspect their state, fix their errors and resume them.
 | 
 | Required lock: read
 | Conflicts with tasks:
 | - https://<fqdn>/foreman_tasks/tasks/63865873-fa5d-4a1a-98cd-e1ae07f4870d (ForemanTasks::Lock::LockConflict)
 | /opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.7.14.14/app/models/foreman_tasks/lock.rb:49:in `block in <class:Lock>'
 | /opt/rh/rh-ror41/root/usr/share/gems/gems/activesupport-4.1.5/lib/active_support/callbacks.rb:438:in `instance_exec'
 | /opt/rh/rh-ror41/root/usr/share/gems/gems/activesupport-4.1.5/lib/active_support/callbacks.rb:438:in `block in make_lambda'
[...]


Version-Release number of selected component (if applicable):
Sat: satellite-6.2.11-2.0.el7sat.noarch
Capsule: satellite-capsule-6.2.11-2.0.el7sat.noarch
Client: python-rhsm-1.19.9-1.el7.x86_64 subscription-manager-1.19.21-1.el7.x86_64


How reproducible:
Always, with the setup described below.


Steps to Reproduce:
1. Have Satellite and 16 capsules and 96 systems being registered in parallel
2. By registered, I mean lots of actions (one pass is sketched below, after these steps):
     subscription-manager register --org Default_Organization --environment Library --username <usr> --password <pass> --force
     subscription-manager list --available --all --matches "..." --pool-only
     subscription-manager attach --pool "..."
     API GET to /api/locations?per_page=1000000 (there is one location per capsule + default one)
     API PUT to /api/hosts/<hostname> to change location by its ID
3. Because we have various retries in place and because of the number of systems involved, I do not know for sure what is causing this, but I suspect the registration.
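
For reference, here is roughly what one pass of that per-system flow looks like as a shell sketch. The credentials, pool match string, location ID, and Satellite FQDN are placeholders, and the curl calls only illustrate the two API requests listed above; the real harness adds its own retries on top of this.

     #!/bin/bash
     # Rough sketch of one registration pass per client; values in angle
     # brackets are placeholders, not taken from the real harness.
     subscription-manager register --org Default_Organization --environment Library \
         --username "<usr>" --password "<pass>" --force
     pool=$(subscription-manager list --available --all --matches "<sku or product>" --pool-only | head -n 1)
     subscription-manager attach --pool "$pool"
     # Fetch all locations (one per capsule plus the default one) ...
     curl -sk -u "<usr>:<pass>" "https://<satellite_fqdn>/api/locations?per_page=1000000"
     # ... then move this host into its location by ID.
     curl -sk -u "<usr>:<pass>" -X PUT -H "Content-Type: application/json" \
         -d '{"host": {"location_id": <location_id>}}' \
         "https://<satellite_fqdn>/api/hosts/$(hostname -f)"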


Actual results:
Lots of retries are needed at various stages of the process, and there are lots of tracebacks in Satellite's logs.


Expected results:
Registration should work and should not throw tracebacks into the logs.
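
As a side note, the conflicting task UUID from the traceback above can be inspected directly through the foreman-tasks API; a minimal sketch, assuming admin credentials and the UUID from the error above (python is only used for pretty-printing):

     # Minimal sketch: fetch the conflicting task named in the LockConflict
     # error; <fqdn> and the admin password are placeholders.
     curl -sk -u "admin:<pass>" \
         "https://<fqdn>/foreman_tasks/api/tasks/63865873-fa5d-4a1a-98cd-e1ae07f4870d" \
         | python -m json.tool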

Comment 5 Ivan Necas 2017-09-11 07:12:53 UTC
Looking at the tasks export, it seems the first registration took over 3 minutes; the reason was that importing the package profile took too long (30s for a host), blocking the workers from doing other work.

It seems to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1483677 (there is a hotfix attached there: I suggest trying it out, but before that consult with jsherril about the usage, expectations and consequences).

Comment 6 Ivan Necas 2017-09-11 07:23:08 UTC
Another thing that seems to be happening is that, when the timeout occurs, sub-man retries the registration several times:

From the task export that is in the foreman-debug output, I see:

1. bfcdb31c-1bd7-49c7-b9e1-84d19a113949.html - the long-running registration task (due to package profile taking too much time)

2. The conflicting calls

   681c0534-4606-450e-a0a7-49fa1b615032.html
   a8a12044-1d72-41ec-ba33-8b52f6e8559e.html
   701ae477-b15f-45c3-8ed5-2fd8ff918aec.html
   3132ded9-1279-4a58-b0cf-00b57a1d089e.html
   9b0cf81b-21df-4f8a-999c-eda20164ba6c.html


@Adam: might this be related to the other case with too many re-registrations happening at the same time? I wonder if sub-man has some retry logic that would cause this.

Comment 7 Jan Hutař 2017-09-11 08:26:06 UTC
(In reply to Ivan Necas from comment #5)
> Looking at the tasks export, it seems the first registration took over 3
> minutes and the reason was importing package profile took too much time (30s
> for a host), blocking the workers from doing other stuff.

Hmm, would it make sense to serialize these tasks then? If they use the same resource, that makes sense to me.

Comment 8 Adam Ruzicka 2017-09-11 08:49:28 UTC
(In reply to Ivan Necas from comment #6)
> @Adam: might this be something related to the other case with too many
> re-registration happening at the same time? I wonder if sub-man has some
> logic around re-tries that would cause this

It seems close to BZ 1486375; the timings between retries are different (10 to 30 seconds here), but that may be caused by the server being under load.

I tried simulating the failure to see whether sub-man retries, but it doesn't seem to do so, which would mean there's something wrong on our side. I'll reach out to the Candlepin folks to see if they know anything about it, just to be sure.
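
For reference, the spacing between the retries can be read straight from the Satellite side; a rough sketch, assuming a default install (the log path is the default location, and the hostname is the one from the report above):

     # Each matching Parameters line starts with a timestamp, so the gaps
     # between registration attempts for this host are visible directly.
     grep '"name"=>"c03-h26-r620container878.example.com"' /var/log/foreman/production.log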

Comment 9 Adam Ruzicka 2017-09-11 09:09:20 UTC
I accidentally cleared the needinfo on Justin, raising it again.

Comment 10 Ivan Necas 2017-09-11 11:17:22 UTC
@Jan we are working on making it possible to put specific tasks into a different queue, so that they don't compete for resources with the short-running actions; see also https://bugzilla.redhat.com/show_bug.cgi?id=1386283

Comment 11 Jan Hutař 2017-09-11 20:00:42 UTC
As per email discussion:

Yes, my script does lots of retries on failure, and per Justin this is more a side effect of bug 1490019.

Thank you very much!

