Bug 1490138

Summary: concurrent registrations/attach or API calls cause task conflicts
Product: Red Hat Satellite
Reporter: Jan Hutař <jhutar>
Component: Registration
Assignee: satellite6-bugs <satellite6-bugs>
Status: CLOSED NOTABUG
QA Contact: Katello QA List <katello-qa-list>
Severity: high
Priority: unspecified
Version: 6.2.11
CC: aruzicka, inecas, jsherril, psuriset, sthirugn
Target Milestone: Unspecified
Keywords: Performance, Regression
Target Release: Unused   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2017-09-11 20:00:42 UTC
Type: Bug

Description Jan Hutař 2017-09-10 20:34:47 UTC
Description of problem:
I'm not sure about the trigger, but I can see lots of errors like:

2017-09-10 16:11:15 ea9effae [app] [I] Started POST "/rhsm/environments/1/consumers" for 172.17.1.16 at 2017-09-10 16:11:15 -0400
2017-09-10 16:11:15 ea9effae [app] [I] Processing by Katello::Api::Rhsm::CandlepinProxiesController#consumer_create as JSON
2017-09-10 16:11:15 ea9effae [app] [I]   Parameters: {"facts"=>"[FILTERED]", "installedProducts"=>[{"productName"=>"Red Hat Enterprise Linux Server", "arch"=>"x86_64", "version"=>"7.4", "productId"=>"69"}], "type"=>"system", "name"=>"c03-h26-r620container878.example.com", "contentTags"=>["rhel-7", "rhel-7-server"], "environment_id"=>"1"}
2017-09-10 16:11:15 ea9effae [app] [I] Expire fragment views/tabs_and_title_records-3 (0.1ms)
2017-09-10 16:11:15 ea9effae [katello/cp_proxy] [I] Authorized user admin(Admin User)
2017-09-10 16:11:15 ea9effae [foreman-tasks/action] [E] Required lock is already taken by other running tasks.
 | Please inspect their state, fix their errors and resume them.
 | 
 | Required lock: read
 | Conflicts with tasks:
 | - https://<fqdn>/foreman_tasks/tasks/63865873-fa5d-4a1a-98cd-e1ae07f4870d (ForemanTasks::Lock::LockConflict)
 | /opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.7.14.14/app/models/foreman_tasks/lock.rb:49:in `block in <class:Lock>'
 | /opt/rh/rh-ror41/root/usr/share/gems/gems/activesupport-4.1.5/lib/active_support/callbacks.rb:438:in `instance_exec'
 | /opt/rh/rh-ror41/root/usr/share/gems/gems/activesupport-4.1.5/lib/active_support/callbacks.rb:438:in `block in make_lambda'
[...]


Version-Release number of selected component (if applicable):
Sat: satellite-6.2.11-2.0.el7sat.noarch
Capsule: satellite-capsule-6.2.11-2.0.el7sat.noarch
Client: python-rhsm-1.19.9-1.el7.x86_64 subscription-manager-1.19.21-1.el7.x86_64


How reproducible:
Always with the setup described below.


Steps to Reproduce:
1. Have a Satellite with 16 capsules and 96 systems being registered in parallel
2. By registered, I mean lots of actions (a rough per-client sketch follows these steps):
     subscription-manager register --org Default_Organization --environment Library --username <usr> --password <pass> --force
     subscription-manager list --available --all --matches "..." --pool-only
     subscription-manager attach --pool "..."
     API GET to /api/locations?per_page=1000000 (there is one location per capsule + default one)
     API PUT to /api/hosts/<hostname> to change location by its ID
3. Because we have various retries in place and because of the number of systems involved, I do not know for sure what is causing this, but I suspect the registration.
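
For reference, this is roughly the per-client sequence from step 2, run on each of the 96 systems in parallel. It is only a sketch: the curl invocations, the credential variables and the JSON body of the host update are my assumptions based on common Foreman API usage, not a copy of the actual script, and the elided --matches expression and pool ID are kept as placeholders:

     # Placeholders (assumptions): adjust for the real environment.
     SAT_FQDN="satellite.example.com"
     SAT_USER="admin"; SAT_PASS="changeme"
     POOL_ID="..."          # pool picked from the 'list --available' output below
     LOCATION_ID="..."      # ID of this client's capsule-specific location

     subscription-manager register --org Default_Organization --environment Library \
         --username "$SAT_USER" --password "$SAT_PASS" --force
     subscription-manager list --available --all --matches "..." --pool-only
     subscription-manager attach --pool "$POOL_ID"

     # Location handling via the API; the PUT body follows the usual Foreman
     # "host" wrapper and is an assumption, not taken from the original script.
     curl -sk -u "$SAT_USER:$SAT_PASS" "https://$SAT_FQDN/api/locations?per_page=1000000"
     curl -sk -u "$SAT_USER:$SAT_PASS" -X PUT -H 'Content-Type: application/json' \
         -d "{\"host\":{\"location_id\":$LOCATION_ID}}" \
         "https://$SAT_FQDN/api/hosts/$(hostname -f)"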


Actual results:
Lots of retries are needed at various stages of the process and lots of tracebacks appear in the Satellite's logs.


Expected results:
Should work and should not throw tracebacks into the logs

Comment 5 Ivan Necas 2017-09-11 07:12:53 UTC
Looking at the tasks export, it seems the first registration took over 3 minutes and the reason was that importing the package profile took too much time (30s for a host), blocking the workers from doing other work.

It seems to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1483677 (there is a hotfix attached there). I suggest trying it out, but before that consult with jsherril about the usage, expectations and consequences.

Comment 6 Ivan Necas 2017-09-11 07:23:08 UTC
Another thing that seems to be happening is that, when the timeout occurs, sub-man retries the registration several times:

From the task export that is in the foreman-debug I see:

1. bfcdb31c-1bd7-49c7-b9e1-84d19a113949.html - the long-running registration task (due to package profile taking too much time)

2. The conflicting calls

   681c0534-4606-450e-a0a7-49fa1b615032.html
   a8a12044-1d72-41ec-ba33-8b52f6e8559e.html
   701ae477-b15f-45c3-8ed5-2fd8ff918aec.html
   3132ded9-1279-4a58-b0cf-00b57a1d089e.html
   9b0cf81b-21df-4f8a-999c-eda20164ba6c.html


@Adam: might this be related to the other case with too many re-registrations happening at the same time? I wonder if sub-man has some retry logic that would cause this.

Comment 7 Jan Hutař 2017-09-11 08:26:06 UTC
(In reply to Ivan Necas from comment #5)
> Looking at the tasks export, it seems the first registration took over 3
> minutes and the reason was importing package profile took too much time (30s
> for a host), blocking the workers from doing other stuff.

Hmm, would it make sense to serialize these tasks then? If they use the same resource, that makes sense to me.

Comment 8 Adam Ruzicka 2017-09-11 08:49:28 UTC
(In reply to Ivan Necas from comment #6)
> @Adam: might this be something related to the other case with too many
> re-registration happening at the same time? I wonder if sub-man has some
> logic around re-tries that would cause this

It seems close to BZ 1486375; the timings between retries are different (from 10 to 30 seconds here), but that may be caused by the server being under load.

I tried simulating the failure to see whether sub-man retries, but it doesn't seem to do so, which would mean there's something wrong on our side. I'll try reaching out to the Candlepin folks to see if they know anything about it, just to be sure.

Comment 9 Adam Ruzicka 2017-09-11 09:09:20 UTC
I accidentally cleared the needinfo on Justin, raising it again.

Comment 10 Ivan Necas 2017-09-11 11:17:22 UTC
@Jan: we are working on making it possible to put specific tasks into a different queue, so that they don't compete for resources with the short-running actions; see also https://bugzilla.redhat.com/show_bug.cgi?id=1386283

Comment 11 Jan Hutař 2017-09-11 20:00:42 UTC
As per email discussion:

Yes, my script does lots of retries on failure, and per Justin this is more a side effect of bug 1490019.

Thank you very much!