Bug 1320557

Summary: unregistering a Content Host can pause ListenOnCandlepinEvents with Candlepin::Consumer: 410 Gone error
Product: Red Hat Satellite Reporter: Pavel Moravec <pmoravec>
Component: Tasks PluginAssignee: Tomas Strachota <tstrachota>
Status: CLOSED ERRATA QA Contact: Kedar Bidarkar <kbidarka>
Severity: high Docs Contact:
Priority: high    
Version: 6.1.7CC: bbuckingham, bcourt, bkearney, chrobert, ealcaniz, kbidarka, ktordeur, mmccune, sthirugn, tstrachota, xdmoon
Target Milestone: UnspecifiedKeywords: Triaged
Target Release: Unused   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1325879 (view as bug list) Environment:
Last Closed: 2016-05-17 05:33:01 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1325879    
Attachments:
Description Flags
hotfix patch none

Description Pavel Moravec 2016-03-23 13:44:03 UTC
Description of problem:
When unregistering a content host (either by subscription-manager unregister, or deleting Content Host, or deleting Host), Actions::Katello::System::Destroy task is being processed.

This task has a concurrency bug that with some probability causes Actions::Candlepin::ListenOnCandlepinEvents task to be paused with error:

Katello::Resources::Candlepin::Consumer: 410 Gone {"displayMessage":"Unit b3012a90-41b6-4788-98a9-5b41839b6dca has been deleted","requestUuid":"04dbfeda-75cb-4ffa-ad9f-9f4c818fc868","deletedId":"b3012a90-41b6-4788-98a9-5b41839b6dca"} (GET /candlepin/consumers/b3012a90-41b6-4788-98a9-5b41839b6dca)

(UUID matches the UUID of just being deleted Content Host)

sequence of steps leading to the bug:
- Actions::Candlepin::Consumer::Destroy sub-task executed
  - it deletes the consumer from candlepin and recalculates compliance for it (imho redundant step when we delete the consumer, but katello is aware of it)
  - it announces the compliance.create event to ListenOnCandlepinEvents task
  - katello finds the katello_system is present so it runs "re-indexing content host .."
- the parent task Actions::Katello::System::Destroy even _now_ enters Actions::Katello::System::Destroy
  - this subtask deletes the system from katello_systems _after_ the check before "re-indexing content host .." is made, so the re-index is _not_ skipped
- the re-index of the content host calls GET consumer/<uuid> on candlepin, what triggers the 410 Gone in ListenOnCandlepinEvents task


Version-Release number of selected component (if applicable):
Sat 6.1.7


How reproducible:
100% within few minutes


Steps to Reproduce:
1. tail -f /var/log/foreman/production.log | grep -e "in phase Finalize Actions::Katello::System::Destroy" -e "skip re-indexing of non-existent content host" -e "re-indexing content host"

2. Have opened Actions::Candlepin::ListenOnCandlepinEvents task in WebUI

3. On some Content Host, register and unregister it in a loop (here RHEL7 used, update if using RHEL5 or 6 accordingly):

while true; do
  subscription-manager register --force --org="Default_Organization" --environment="Library" --username=admin --password=faYakexMm5XN543x
  subscription-manager subscribe --pool=8aa2d415526494380152732fc8d20dd7
  subscription-manager repos --enable rhel-7-server-rpms --enable rhel-7-server-satellite-tools-6.1-rpms
  date
  subscription-manager unregister
  date
  sleep 5
done

4. monitor the tail -f output and ListenOnCandlepinEvents task


Actual results:
- tail shows:

2016-03-23 14:17:05 [D] re-indexing content host pmoravec-rhel7.gsslab.brq.redhat.com
2016-03-23 14:17:05 [D]          Step f5705893-2577-41c4-9af6-9b7c10ccb646: 6   running >>   success in phase Finalize Actions::Katello::System::Destroy

- ListenOnCandlepinEvents task is paused/error with error:

Katello::Resources::Candlepin::Consumer: 410 Gone {"displayMessage":"Unit b3012a90-41b6-4788-98a9-5b41839b6dca has been deleted","requestUuid":"04dbfeda-75cb-4ffa-ad9f-9f4c818fc868","deletedId":"b3012a90-41b6-4788-98a9-5b41839b6dca"} (GET /candlepin/consumers/b3012a90-41b6-4788-98a9-5b41839b6dca)


Expected results:
- tail can be ok (rather symptom for devels), just the ListenOnCandlepinEvents task needs to be running without an error


Additional info:
seems like a lack of locking / concurrency bug where reindex_pool_subscription_handler.rb:

      def reindex_consumer(message)
        if message.content['newEntity']
          uuid = JSON.parse(message.content['newEntity'])['consumer']['uuid']
          system = ::Katello::System.find_by_uuid(uuid)
          if system.nil?
            @logger.debug "skip re-indexing of non-existent content host #{uuid}"
          else
            @logger.debug "re-indexing content host #{system.name}"
            system.update_index
          end

needs to be executed as atomic operation (not concurrently with Actions::Katello::System::Destroy)

Comment 2 Pavel Moravec 2016-03-23 14:43:39 UTC
Present also in 6.1.8

Comment 3 Pavel Moravec 2016-03-24 11:12:34 UTC
An alternative way to fixing the race/concurrency condition in katello would be to prevent candlepin in generating and sending the (just deleted) consumer compliance. Consumer deletion triggers entitlements deletion what triggers back compliance computation for the just deleted consumer - the latest is ridiculous and sending the compliance.create message to katello in fact triggers the race condition there.

Comment 9 Mike McCune 2016-05-11 21:59:16 UTC
Created attachment 1156316 [details]
hotfix patch

Comment 10 Mike McCune 2016-05-11 22:05:32 UTC
== HOTFIX INSTRUCTIONS ==

1) Download 1320557-001.patch from this bug

2) Copy to:

cp 1320557-001.patch /opt/rh/ruby193/root/usr/share/gems/gems/katello-`rpm -q --queryformat '%{VERSION}\n' ruby193-rubygem-katello`

3) Patch:

patch -p1 < 1320557-001.patch

4) restart:

katello-service restart

Comment 11 Kedar Bidarkar 2016-05-12 20:02:47 UTC
I ran the script in loop for almost an hour and nothing seen, also the dynflow tasks was clean.

VERIFIED With sat6.1.9 compose1

Comment 12 Kedar Bidarkar 2016-05-12 20:15:34 UTC
Need to test is properly, re-opening this bug.

Comment 13 Kedar Bidarkar 2016-05-12 21:48:53 UTC
Also tested by grepping for "410 Gone" 

This issue is no longer seen with sat6.1.9 compose1

Comment 15 errata-xmlrpc 2016-05-17 05:33:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2016:1084