Bug 1490138
| Summary: | concurrent registrations/attach or api causes task conflict | | |
|---|---|---|---|
| Product: | Red Hat Satellite | Reporter: | Jan Hutař <jhutar> |
| Component: | Registration | Assignee: | satellite6-bugs <satellite6-bugs> |
| Status: | CLOSED NOTABUG | QA Contact: | Katello QA List <katello-qa-list> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.2.11 | CC: | aruzicka, inecas, jsherril, psuriset, sthirugn |
| Target Milestone: | Unspecified | Keywords: | Performance, Regression |
| Target Release: | Unused | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-09-11 20:00:42 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Looking at the tasks export, it seems the first registration took over 3 minutes, and the reason was that importing the package profile took too much time (30s per host), blocking the workers from doing other work. It seems to be related to https://bugzilla.redhat.com/show_bug.cgi?id=1483677 (there is a hotfix attached there: I suggest trying it out, but before that consult with jsherril about the usage, expectations and consequences).

Another thing that seems to be happening is that, when the timeout occurs, sub-man re-tries the registration several times. From the task export that is in the foreman-debug I see:

1. bfcdb31c-1bd7-49c7-b9e1-84d19a113949.html - the long-running registration task (due to the package profile taking too much time)
2. The conflicting calls:
   - 681c0534-4606-450e-a0a7-49fa1b615032.html
   - a8a12044-1d72-41ec-ba33-8b52f6e8559e.html
   - 701ae477-b15f-45c3-8ed5-2fd8ff918aec.html
   - 3132ded9-1279-4a58-b0cf-00b57a1d089e.html
   - 9b0cf81b-21df-4f8a-999c-eda20164ba6c.html

@Adam: might this be related to the other case with too many re-registrations happening at the same time? I wonder if sub-man has some retry logic that would cause this.

(In reply to Ivan Necas from comment #5)
> Looking at the tasks export, it seems the first registration took over 3
> minutes and the reason was importing package profile took too much time
> (30s for a host), blocking the workers from doing other stuff.

Hmm, would it make sense to serialize these tasks then? If they use the same resource, that makes sense to me.

(In reply to Ivan Necas from comment #6)
> @Adam: might this be something related to the other case with too many
> re-registration happening at the same time? I wonder if sub-man has some
> logic around re-tries that would cause this

It seems close to BZ1486375; the timings between retries are different here (from 10 to 30 seconds), but that may be caused by the server being under load. I tried simulating the failure to see whether subman retries, but it doesn't seem to do so, which would mean there's something wrong on our side. I'll reach out to the Candlepin folks to see if they know anything about it, just to be sure.

I accidentally cleared the needinfo on Justin, raising it again.

@Jan: we are working on making it possible to put specific tasks into a different queue, so that they don't compete for resources with the short-running actions; see also https://bugzilla.redhat.com/show_bug.cgi?id=1386283.

As per email discussion: yes, my script does lots of retries on failure, and as per Justin this is more a side effect of bug 1490019. Thank you very much!
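For illustration only, a minimal sketch of the kind of client-side retry loop being described; the `retry()` helper, the attempt count, and the delays are hypothetical assumptions, not the actual test script used in this report:

```bash
#!/bin/bash
# Hypothetical retry wrapper around a single registration step; the real test
# script is not part of this report, so names and delays here are assumptions.
retry() {
    local attempt=1 max_attempts=5 delay=10
    until "$@"; do
        if (( attempt >= max_attempts )); then
            echo "giving up after ${attempt} attempts: $*" >&2
            return 1
        fi
        echo "attempt ${attempt} failed, retrying in ${delay}s: $*" >&2
        sleep "${delay}"
        delay=$(( delay * 2 ))      # back off so a loaded server is not hammered
        attempt=$(( attempt + 1 ))
    done
}

# Example usage with the registration command from the reproducer below:
retry subscription-manager register --org Default_Organization \
    --environment Library --username "<usr>" --password "<pass>" --force
```

A fixed or growing delay in this range would also match the 10-30 second spacing between retries seen in the task export; the point of the back-off is simply to keep many parallel clients from re-hitting a server that is already busy with a long-running registration task.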
Description of problem:

I'm not sure about the trigger, but I can see lots of errors like:

    2017-09-10 16:11:15 ea9effae [app] [I] Started POST "/rhsm/environments/1/consumers" for 172.17.1.16 at 2017-09-10 16:11:15 -0400
    2017-09-10 16:11:15 ea9effae [app] [I] Processing by Katello::Api::Rhsm::CandlepinProxiesController#consumer_create as JSON
    2017-09-10 16:11:15 ea9effae [app] [I]   Parameters: {"facts"=>"[FILTERED]", "installedProducts"=>[{"productName"=>"Red Hat Enterprise Linux Server", "arch"=>"x86_64", "version"=>"7.4", "productId"=>"69"}], "type"=>"system", "name"=>"c03-h26-r620container878.example.com", "contentTags"=>["rhel-7", "rhel-7-server"], "environment_id"=>"1"}
    2017-09-10 16:11:15 ea9effae [app] [I] Expire fragment views/tabs_and_title_records-3 (0.1ms)
    2017-09-10 16:11:15 ea9effae [katello/cp_proxy] [I] Authorized user admin(Admin User)
    2017-09-10 16:11:15 ea9effae [foreman-tasks/action] [E] Required lock is already taken by other running tasks.
     | Please inspect their state, fix their errors and resume them.
     |
     | Required lock: read
     | Conflicts with tasks:
     | - https://<fqdn>/foreman_tasks/tasks/63865873-fa5d-4a1a-98cd-e1ae07f4870d (ForemanTasks::Lock::LockConflict)
     | /opt/theforeman/tfm/root/usr/share/gems/gems/foreman-tasks-0.7.14.14/app/models/foreman_tasks/lock.rb:49:in `block in <class:Lock>'
     | /opt/rh/rh-ror41/root/usr/share/gems/gems/activesupport-4.1.5/lib/active_support/callbacks.rb:438:in `instance_exec'
     | /opt/rh/rh-ror41/root/usr/share/gems/gems/activesupport-4.1.5/lib/active_support/callbacks.rb:438:in `block in make_lambda'
    [...]

Version-Release number of selected component (if applicable):
- Sat: satellite-6.2.11-2.0.el7sat.noarch
- Capsule: satellite-capsule-6.2.11-2.0.el7sat.noarch
- Client: python-rhsm-1.19.9-1.el7.x86_64, subscription-manager-1.19.21-1.el7.x86_64

How reproducible:
Always, with the setup described below.

Steps to Reproduce:
1. Have a Satellite, 16 capsules, and 96 systems being registered in parallel.
2. By "registered" I mean lots of actions (a reconstructed sketch of this per-host workflow is shown after the expected results below):
   - subscription-manager register --org Default_Organization --environment Library --username <usr> --password <pass> --force
   - subscription-manager list --available --all --matches "..." --pool-only
   - subscription-manager attach --pool "..."
   - API GET to /api/locations?per_page=1000000 (there is one location per capsule, plus the default one)
   - API PUT to /api/hosts/<hostname> to change the host's location by its ID
3. Because we have various retries in place, and because of the number of systems involved, I do not know for sure what is causing this, but I suspect the registration.

Actual results:
Lots of retries needed at various stages of the process and lots of tracebacks in Satellite's logs.

Expected results:
Should work and should not throw tracebacks into the logs.
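For reference, a minimal sketch of the per-host workflow listed in step 2, assuming the Satellite REST API is driven with curl; the hostname, credentials, placeholders, and the JSON payload shape for the hosts API call are assumptions for illustration, not details taken from this report:

```bash
#!/bin/bash
# Hypothetical reconstruction of the per-host workflow from step 2 above.
# SAT_FQDN, the credentials, and the PUT payload shape are assumptions.
SAT_FQDN="satellite.example.com"
SAT_USER="<usr>"
SAT_PASS="<pass>"

# 1) Register the host against the Library environment of the default org.
subscription-manager register --org Default_Organization --environment Library \
    --username "${SAT_USER}" --password "${SAT_PASS}" --force

# 2) Find a matching pool and attach to it.
POOL_ID=$(subscription-manager list --available --all \
    --matches "<subscription name>" --pool-only | head -n 1)
subscription-manager attach --pool "${POOL_ID}"

# 3) Look up the locations (one per capsule, plus the default one).
curl -s -u "${SAT_USER}:${SAT_PASS}" \
    "https://${SAT_FQDN}/api/locations?per_page=1000000"

# 4) Move the host into the desired location by ID; the payload shape here
#    ({"host": {"location_id": ...}}) is an assumption about the hosts API.
curl -s -u "${SAT_USER}:${SAT_PASS}" -H "Content-Type: application/json" \
    -X PUT -d '{"host": {"location_id": <location_id>}}' \
    "https://${SAT_FQDN}/api/hosts/$(hostname -f)"
```

Run for 96 hosts in parallel across 16 capsules, each such sequence hits the registration and host-update paths at the same time, which is where the ForemanTasks::Lock::LockConflict errors shown above appear.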