Bug 1159303
| Field | Value |
|---|---|
| Summary | pulp.agent.<uuid> queue not deleted when consumer is deleted |
| Product | [Retired] Pulp |
| Component | consumers |
| Version | 2.4.3 |
| Priority | high |
| Severity | high |
| Status | CLOSED UPSTREAM |
| Reporter | Brian Bouterse <bmbouter> |
| Assignee | Jeff Ortel <jortel> |
| QA Contact | Irina Gulina <igulina> |
| CC | bkearney, cperry, djuran, igulina, jortel, mhrivnak, pmoravec |
| Keywords | Triaged |
| Target Milestone | --- |
| Target Release | 2.6.0 |
| Hardware | Unspecified |
| OS | Unspecified |
| Doc Type | Bug Fix |
| Story Points | --- |
| : | 1159961 (view as bug list) |
| Last Closed | 2015-02-28 22:42:48 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| Category | --- |
| oVirt Team | --- |
| Cloudforms Team | --- |
| Bug Depends On | 1174361, 1175512 |
| Bug Blocks | 1139277, 1159281, 1159961 |
|
Description
Brian Bouterse
2014-10-31 12:46:27 UTC
After some investigation, it was determined that Pulp does not remove a consumer's pulp.agent.<uuid> queue during unregistration. Deleting the pulp.agent.<uuid> queue is the expected behavior, so we should fix this. After some discussion with jortel, we determined the following changes need to be made:

1) pulp.agent.<uuid> needs to be 100% managed server side, including the creation and deletion of the queues. The 'manager' code for the consumer can handle this responsibility. Goferd needs to be adjusted to optionally not declare queues from the consumer side, and pulp-agent and katello-agent will need small changes so that they know the server will create and delete the consumer queues.

2) A reaper task that removes these queues needs to be added. It runs periodically with some frequency and removes "orphaned" queues that have been orphaned for longer than X amount of time after an unregister event. These two options (the frequency and X) will be specified in server.conf and will come with defaults.

3) Docs need to be written indicating how cleanup occurs.

The reaper design is necessary, versus a "delete right now during unregistration" design, because there is one final message that needs to be delivered to the consumer, and we should allow a reasonable amount of time for that to occur before the server force-deletes the queue.

(In reply to bbouters from comment #1)
> 2) A reaper task that removes these queues needs to be added. This runs
> periodically with some frequency and removes "orphaned_queues" that have
> been orphaned after an unregister event longer than X amount of time. These
> two options will be specified in server.conf and will come with defaults.

Note that you can use auto-delete queues with a deletion timeout (x-declare argument qpid.auto_delete_timeout). If a queue loses its last consumer, the qpid broker waits for the timeout (in seconds) and then deletes the queue itself. This might simplify the implementation.
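To make the auto-delete suggestion concrete, here is a minimal sketch (an assumption for illustration, not Pulp's or gofer's actual code) of building a qpid.messaging address string that declares the consumer queue with qpid.auto_delete_timeout set:

```python
# Hypothetical helper: build a qpid.messaging address for pulp.agent.<uuid>
# so the broker itself deletes the queue `timeout_seconds` after the last
# consumer detaches. The address syntax follows qpid.messaging conventions;
# the function name and defaults are illustrative only.

def agent_queue_address(consumer_uuid, timeout_seconds=600):
    return (
        "pulp.agent.%s; {"
        "create: receiver, "
        "node: {"
        "durable: False, "
        "x-declare: {"
        "auto-delete: True, "
        "arguments: {'qpid.auto_delete_timeout': %d}"
        "}}}" % (consumer_uuid, timeout_seconds)
    )

print(agent_queue_address("bobik"))
```

With python-qpid installed and a broker running, such an address would be passed to `session.receiver(...)` when the agent attaches.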
Note, however, that if the queue never has a consumer, it won't be deleted at all (the auto-delete timer is triggered _only_ when the last consumer unsubscribes). Also, I am not sure whether the same functionality is available in RabbitMQ.

Auto-delete would be an elegant way to solve this, but I'm not sure how it would allow for consumers to be powered off for longer than the auto_delete timeout. A consumer that was running, but then had its agent service stopped or was powered off, will wake up at some point in the future to find its queue missing. Today, if an agent finds its queue missing it recreates it, so it would recover, but any meaningful messages that were issued to the consumer (i.e. bind or unbind) while it was turned off would be missing. This could cause some very unexpected behavior for the user: they remember doing an action such as binding a group of consumers to a given repo, but whichever consumer wasn't powered on at that moment doesn't receive the config. Suggestions for how we can make the auto_delete approach work are welcome, because it is more elegant, but I'm not sure how to resolve this one lingering problem.

(In reply to bbouters from comment #3)
> Auto delete would be an elegant way to solve this, but I'm not sure how it
> would allow for consumers to be powered off for longer than the auto_delete
> timeout. A consumer that was running, but then had its agent service stopped
> or is powered off will wake up at some point in the future to find its queue
> missing.

Good point. What about an auto_delete queue (optionally with some timeout) that has an alternate exchange set, with the exchange routing messages to these queues having the same alternate exchange set, plus one durable auxiliary queue that gets all messages routed via the alternate exchange?
Then:

- When deleting the queue due to the auto-delete parameter, all messages in the queue are redirected via the alternate exchange to the auxiliary queue.
- When a consumer is off and the broker should deliver a message to its (deleted) queue, it finds the original exchange does not have a matching binding (to the deleted queue), so the broker re-routes the message to the alternate exchange, i.e. to the auxiliary queue.

When a consumer is starting and detects its queue is gone, it would have to call the QMF method "queueMoveMessages", which moves messages from the auxiliary queue to the newly created pulp.agent.<uuid> queue, with a proper filter (to move just the messages relevant to that pulp consumer).

Gotchas:

- Needs some more testing and probably bigger code changes.
- If some default exchange (like the "" one) is used for distributing the messages to pulp.agent.<uuid> queues, you can't set an alternate exchange for such an exchange. But you can create a new exchange for this traffic.
- This solution does not cover situations where a pulp consumer won't power on at all (not sure if that is an allowed scenario). Then the auxiliary queue would keep messages for this consumer forever. But IMHO this problem is common to any solution - is there a way to identify a consumer that won't ever power on?

This design introduces a lot of complexity, but it doesn't fully address all of the gotchas. I think marking a queue as safe-to-be-deleted no sooner than X minutes after the unregistration event occurs is much simpler and does correctly handle all cases.

Raising the priority given the bugs this BZ blocks.

FYI, a trivial reproducer for this is just registering and unregistering a content host via subscription-manager, like:

```
subscription-manager register --org="Default_Organization" --environment="Library" --username=admin --password=<Sat6_admin_password>
subscription-manager unregister
subscription-manager clean
```

Doing so, one extra pulp.agent.<uuid> queue is created and not deleted.
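As a side note, the leftover queues the reproducer creates can also be spotted programmatically. A hypothetical helper (not part of the bug report) that filters the text output of `qpid-stat -q` for surviving agent queues:

```python
# Hypothetical helper: given the captured stdout of `qpid-stat -q`,
# return the names of pulp.agent.<uuid> queues still present on the
# broker, e.g. after a consumer has been unregistered.

def leftover_agent_queues(qpid_stat_output):
    queues = []
    for line in qpid_stat_output.splitlines():
        fields = line.split()
        # The queue name is the first column of each data row.
        if fields and fields[0].startswith("pulp.agent."):
            queues.append(fields[0])
    return queues

sample = """\
queue                dur  autoDel  excl  msg  msgIn  msgOut
pulp.agent.lelik     Y    0        0     1    763    763
celery               Y    0        0     0    0      0
"""
print(leftover_agent_queues(sample))  # -> ['pulp.agent.lelik']
```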
Note that this reproducer covers just one use case, not all of them.

2.6.0-0.7.beta

```
>> rpm -qa pulp-server
pulp-server-2.6.0-0.7.beta.fc20.noarch

>> pulp-consumer -u admin -p admin register --consumer-id bobik
Consumer [bobik] successfully registered
>> qpid-stat -q | grep bobik
pulp.agent.bobik   Y   0   0   0   0   0   0   1   1
>> pulp-consumer unregister
Consumer [bobik] successfully unregistered
>> qpid-stat -q | grep bobik
```

The previous test was with gofer running. Here is test #2, with the agent not running:

```
>> pulp-consumer -u admin -p admin register --consumer-id lelik
Consumer [lelik] successfully registered
>> qpid-stat -q | grep lelik
pulp.agent.lelik   Y   0   0   0   0   0   0   0   1
>> systemctl stop goferd
>> systemctl status goferd
goferd.service - Gofer Agent
   Loaded: loaded (/usr/lib/systemd/system/goferd.service; enabled)
   Active: inactive (dead) since Fri 2015-02-20 16:01:27 UTC; 2s ago
 Main PID: 31303 (code=killed, signal=TERM)
...
>> qpid-stat -q | grep lelik
pulp.agent.lelik   Y   0   0   0   0   0   0   0   1
>> pulp-consumer unregister
Consumer [lelik] successfully unregistered
>> date
Fri Feb 20 16:11:22 UTC 2015
>> qpid-stat -q | grep lelik
pulp.agent.lelik   Y   1   1   0   763   763   0   0   1
>> date
Fri Feb 20 16:20:52 UTC 2015
>> qpid-stat -q | grep lelik
pulp.agent.lelik   Y   1   1   0   763   763   0   0   1
>> date
Fri Feb 20 16:24:39 UTC 2015
>> qpid-stat -q | grep lelik
```

Moved to https://pulp.plan.io/issues/603
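For reference, the reaper design described in the original report (delete a queue no sooner than a grace period after unregistration) could be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code that shipped; the function name, the orphan-tracking dict, and the `delete_queue` callable are all hypothetical (the real fix was tracked at https://pulp.plan.io/issues/603):

```python
# Sketch of a periodic reaper task: force-delete pulp.agent.<uuid> queues
# whose consumer unregistered more than GRACE_PERIOD ago, leaving recently
# unregistered queues alone so a final message can still be delivered.
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(minutes=10)  # would come from server.conf in Pulp

def reap_orphaned_queues(orphans, delete_queue, now=None):
    """orphans: dict mapping queue name -> unregistration time (UTC).
    delete_queue: callable that removes the named queue on the broker."""
    now = now or datetime.utcnow()
    for name, unregistered_at in list(orphans.items()):
        if now - unregistered_at > GRACE_PERIOD:
            delete_queue(name)   # force-delete on the broker
            del orphans[name]    # stop tracking the reaped queue
```

In the real system this would run on the scheduler with a configurable frequency; here it can be exercised with any callable, e.g. `reap_orphaned_queues(orphans, deleted_names.append)`.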