Bug 1417345
Summary: | Orphaned queues are not auto-deleted for Qpid at scale | |
---|---|---|---
Product: | Red Hat Satellite | Reporter: | Pradeep Kumar Surisetty <psuriset>
Component: | katello-agent | Assignee: | Jeff Ortel <jortel>
Status: | CLOSED ERRATA | QA Contact: | sbadhwar
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 6.2.6 | CC: | bbuckingham, bkearney, cdonnell, cduryee, cwelton, daniele, dsynk, egolov, jcallaha, jentrena, jhutar, jortel, jsherril, jturel, mkalyat, mmccune, mverma, pcreech, pmoravec, psuriset, raalee, sbadhwar, shisingh, stephenkay, tbhowmik, tsourab, vanhoof
Target Milestone: | 6.4.0 | Keywords: | Triaged
Target Release: | Unused | |
Hardware: | All | |
OS: | Linux | |
Whiteboard: | scale_lab | |
Fixed In Version: | gofer-2.12.1-1.el6.noarch, gofer-2.12.1-1.el7.noarch, gofer-2.11.5-1.el5.noarch | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-10-16 17:17:28 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1546813 | |
Description
Pradeep Kumar Surisetty
2017-01-28 02:38:03 UTC
```
[root@gprfc018 tmp]# hammer content-host list --organization-id 1 --uuid 0e0ce30c-ae99-4130-9d7f-6e6d16fe26b3
[Foreman] Password for admin:
---|------|-------------------
ID | NAME | INSTALLABLE ERRATA
---|------|-------------------
[root@gprfc018 tmp]#
```

Workaround:

```
for queue in $(generate list of pulp.agent.UUID queues); do
  rm -rf /var/lib/qpidd/.qpidd/qls/jrnl/${queue}
  qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 add queue $queue --durable
done
```

Anyway, it is really weird that:

- there is a client that thinks it has some UUID and tries to connect via goferd to qpidd
- Satellite isn't aware of a Content Host with that UUID (hammer list returns nothing, and the katello_systems table in the foreman psql database doesn't have it)
- qpidd does not know the pulp.agent.UUID queue (neither a QMF nor a real Queue object exists)
- BUT /var/lib/qpidd/.qpidd/qls/jrnl/ has a directory and journal files as if such a durable queue existed
- and the BDB backend of qpidd also has that pulp.agent.UUID in its (binary) logs, so the queue was present there

That smells like two separate underlying problems:

- the Content Host thinks it has some UUID and actively uses it, but Satellite isn't aware of it
- qpidd thinks the queue isn't there, but it has journal files for it (a possible scenario leading to that: re-registering a Content Host from one Capsule to another without a goferd restart, such that it keeps the old UUID in memory and keeps trying to use it)

I noticed the first such errors appeared at:

```
Jan 26 05:09:37 gprfc018 qpidd: 2017-01-26 05:09:37 [Protocol] error Error on attach: Node not found: pulp.agent.13a3438c-cff9-49e2-8e09-a120b2d61996
Jan 26 05:09:37 gprfc018 qpidd[1088]: 2017-01-26 05:09:37 [Protocol] error Error on attach: Node not found: pulp.agent.13a3438c-cff9-49e2-8e09-a120b2d61996
```

(times in EST / the server time)

Please provide what actions happened shortly before that (wrt. content hosts, like (re)registering them from one capsule to another, rebooting them, changing something, ...).

Moving to high while we investigate.

What's going on with this? I've got similar problems.

(In reply to Robert Lee from comment #11)
> What's going on with this? I've got similar problems.

Please rather raise a support case to investigate it separately. Provide there:

- a sosreport from Satellite and every Capsule
- three tables from the backend databases:

```
mkdir bz1417345
cd bz1417345
su - postgres -c "psql foreman -c 'select * from katello_systems;'" | gzip > katello_systems.txt.gz
su - postgres -c "psql candlepin -c 'select * from cp_consumers;'" | gzip > cp_consumers.txt.gz
mongo pulp_database --eval "DBQuery.shellBatchSize = 10000000; db.consumers.find({}).shellPrint()" | gzip > consumers.txt.gz
cd ..
tar czf bz1417345.tar.gz bz1417345
```

(and attach the bz1417345.tar.gz file)

(In reply to Pavel Moravec from comment #12)
> su - postgres -c "psql candlepin -c 'select * from cp_consumers;'" | gzip > cp_consumers.txt.gz

Apologies for the typo; this command should be:

```
su - postgres -c "psql candlepin -c 'select * from cp_consumer;'" | gzip > cp_consumers.txt.gz
```

Repro steps:

* register a system to Satellite (use a VM)
* suspend the VM
* delete the system via the Satellite web UI
* wait at least 10 minutes for pulp to perform queue cleanup
* resume the VM

At this point, you'll see the "node not found" error on the Satellite. You will also see similar errors from goferd on the client. The VM suspend is needed because goferd will check whether the system is still registered when it starts up. The 10-minute delay is needed so that the queue cleanup job removes the queue, which contains the "I've been unregistered" message for the client.
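The workaround loop quoted earlier leaves "generate list of pulp.agent.UUID queues" as a placeholder. A minimal sketch of one possible enumeration, assuming the leftover linearstore journal directories under /var/lib/qpidd/.qpidd/qls/jrnl are named after their queues (as this bug's description suggests); the path and naming are taken from this report, not from any documented qpid interface:

```shell
# Hypothetical: list orphan-candidate pulp.agent.* queues by the
# journal directories they left behind. Prints nothing if the
# journal path does not exist on this machine.
JRNL=/var/lib/qpidd/.qpidd/qls/jrnl
queues=$(ls "$JRNL" 2>/dev/null | grep '^pulp\.agent\.')
for queue in $queues; do
  echo "orphan candidate: $queue"
done
```

Before recreating anything, it would be prudent to cross-check the list against the live queues (e.g. via `qpid-stat -q`), since a journal directory may still belong to an active queue.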
Created redmine issue http://projects.theforeman.org/issues/24006 from this bug.

Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/24006 has been resolved.

Moving this back to ON_QA as the test was done with an old version of katello-agent.

TK,

Can you tell me exactly the steps you used to reproduce the problem? Following the steps in #20 I got the output shown below, which makes me think the issue no longer exists:

```
Sep 28 16:50:59 rhel7.strangeways.example.com goferd[13054]: [INFO][pulp.agent.2591a830-b6e9-44d6-b442-9fb2f997abfc] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://sat-r220-01.lab.eng.rdu2.redhat.com:5647
Sep 28 16:50:59 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] gofer.agent.plugin:394 - plugin:katello, detached [pulp.agent.2591a830-b6e9-44d6-b442-9fb2f997abfc]
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] gofer.agent.plugin:692 - plugin:katello loaded using: katello.agent.goferd.plugin
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-5] gofer.rmi.store:108 - Using: /var/lib/gofer/messaging/pending/katello
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] rhsm.connection:868 - Connection built: host=sat-r220-01.lab.eng.rdu2.redhat.com port=443 handler=/rhsm auth=identity_cert ca_dir=/etc/rhsm/ca/ insecure=False
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] rhsm.connection:586 - Response: status=410, request="GET /rhsm/consumers/2591a830-b6e9-44d6-b442-9fb2f997abfc"
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] gofer.agent.plugin:494 - plugin:katello, reloaded
```

(In reply to Jonathon Turel from comment #36)
> Can you tell me exactly the steps you used to reproduce the problem?

1. Registered 3.5K hosts to Satellite.
2. Installed required packages via Remote Execution.
3. Deleted a host from the hosts tab.
4. Waited for at least an hour.

Thanks TK. I was just talking with Chris Duryee, who provided the repro steps above and has made an important observation: the error you're seeing is still going to be logged, but it should not be logged over and over again. You should observe that the agent is reloaded, with a log line such as the one I pasted above. Do you observe that the plugin is reloading, or does that error continue to fill the logs?
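Whether the plugin reloads once or the error keeps looping can be checked from the client's journal. A hedged sketch, assuming goferd runs under systemd and logs to the journal as in the excerpt above:

```shell
# Count 'Node not found' errors and katello plugin reloads for goferd
# over the last hour. With the fix applied, the error count should stay
# at roughly one per detach event, each followed by a reload line.
errors=$(journalctl -u goferd --since "-1 hour" 2>/dev/null | grep -c 'Node not found')
reloads=$(journalctl -u goferd --since "-1 hour" 2>/dev/null | grep -c 'plugin:katello, reloaded')
echo "node-not-found: $errors, reloads: $reloads"
```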
Reload log line looks like: 0ef1a86dfa650e76785cdab4a738cfbc940630e7

The BZ was written without "Expected results:", so based on the available information, engineering determined that the main complaint was the repeated logging of 'node not found'. This was resolved. However, when a host is removed (unregistered) on the server and the queue is deleted out from under the agent, a single error will still be logged. This is by design. If the expectation is instead that either:

1. a single message is logged at WARN/DEBUG (instead of ERROR), or
2. nothing is logged,

please update the BZ.

I vote for logging at DEBUG because it eliminates the noise in the log while still supporting troubleshooting.

+1 to the DEBUG log level.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2929

Can we remove the needinfo flag from this issue since it's resolved?
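To confirm that a given client actually carries the fix, the installed gofer version can be compared against the "Fixed In Version" field above. A minimal sketch, assuming `rpm` and GNU `sort -V` are available on the client; the 2.12.1 threshold is taken from the el6/el7 fixed versions:

```shell
# Report whether the installed gofer is at or above the fixed version.
# Falls back to '0' when gofer is not installed, so the check degrades
# gracefully on machines without the package.
installed=$(rpm -q --qf '%{VERSION}' gofer 2>/dev/null || echo 0)
fixed=2.12.1
lowest=$(printf '%s\n%s\n' "$installed" "$fixed" | sort -V | head -1)
if [ "$lowest" = "$fixed" ]; then
  echo "fix present (gofer $installed)"
else
  echo "update gofer (found $installed, need >= $fixed)"
fi
```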