Bug 1417345

Summary: Orphaned queues are not auto-deleted for Qpid at scale
Product: Red Hat Satellite
Reporter: Pradeep Kumar Surisetty <psuriset>
Component: katello-agent
Assignee: Jeff Ortel <jortel>
Status: CLOSED ERRATA
QA Contact: sbadhwar
Severity: high
Docs Contact:
Priority: high
Version: 6.2.6
CC: bbuckingham, bkearney, cdonnell, cduryee, cwelton, daniele, dsynk, egolov, jcallaha, jentrena, jhutar, jortel, jsherril, jturel, mkalyat, mmccune, mverma, pcreech, pmoravec, psuriset, raalee, sbadhwar, shisingh, stephenkay, tbhowmik, tsourab, vanhoof
Target Milestone: 6.4.0
Keywords: Triaged
Target Release: Unused
Hardware: All
OS: Linux
Whiteboard: scale_lab
Fixed In Version: gofer-2.12.1-1.el6.noarch, gofer-2.12.1-1.el7.noarch, gofer-2.11.5-1.el5.noarch
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-16 17:17:28 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1546813

Description Pradeep Kumar Surisetty 2017-01-28 02:38:03 UTC
Description of problem:

We have been running 6k+ content hosts against a Satellite 6.2 server and capsule.

We started noticing 'error Error on attach: Node not found:' errors in the qpidd log:
---
 qpidd[13021]: 2017-01-27 21:35:37 [Protocol] error Error on attach: Node not found: pulp.agent.7e8e5e57-b61b-49be-a023-d2d375914382
 qpidd: 2017-01-27 21:35:37 [Protocol] error Error on attach: Node not found: pulp.agent.7e8e5e57-b61b-49be-a023-d2d375914382
 qpidd: 2017-01-27 21:35:40 [Protocol] error Error on attach: Node not found: pulp.agent.4d91dd05-a52c-418d-8535-ffb2eb9bc18d
 qpidd[13021]: 2017-01-27 21:35:40 [Protocol] error Error on attach: Node not found: pulp.agent.4d91dd05-a52c-418d-8535-ffb2eb9bc18d
---


Each of the UUIDs below has an orphaned queue which no longer has any node associated with it:

[root@host tmp]# cat /var/log/messages| grep  "error Error on attach: Node not found" |  awk -F "error Error on attach: Node not found: " '{ print $2 }' | cut -d. -f3 | sort -u 

01342d48-7dd6-4f41-b53c-97c92ed20818
0e0ce30c-ae99-4130-9d7f-6e6d16fe26b3
13a3438c-cff9-49e2-8e09-a120b2d61996
170a7bd6-d892-4e49-baa3-c5610b294979
20c8e747-b0f3-4c9f-bd32-bd6052140a81
27a99624-8c85-457f-b976-73636e52d4fc
3607362d-b323-47c4-b630-2da60fb21544
3613911b-5441-491b-8206-5c8869359669
49bfa8a3-e128-4cf9-b146-98f830399dba
4d91dd05-a52c-418d-8535-ffb2eb9bc18d
4f6a2f5b-da05-4ff8-a708-80464762c0bb
53d9acd9-445c-4f61-aa5b-160515ee9245
54806e8b-a2c3-489c-a3e6-d5457982e44d
59256ebe-57e0-41d9-84d3-5cc412aa7926
6b053bfa-5da0-4758-b3bb-9f02ce62f9f1
7376d9ce-7e32-4f67-943c-6df196f1e073
7714e9fc-a545-4f50-b614-3076a602f53c
7e8e5e57-b61b-49be-a023-d2d375914382
8aee2667-a8f5-4f44-9c8f-a5fb2db6f498
8c18a6cb-9c84-4427-8075-e9b67ee299a2
8ec32fd5-23f5-477d-819a-980b4a1ab93a
9667fea9-0efa-4289-871c-6a0a4e56df87
ae4dd874-8101-44b5-90d6-002d9f83e390
b110adc1-5c61-4fb1-91d9-3a5c3c6fa90f
b65b87d3-4842-4fab-aa4a-310137ebea4b
b8d0f181-6d86-4518-bc29-a31ed0bb2fc2
be1a0629-55ed-4d1b-b307-bc34cf85fcc4
c432563f-baa6-4bad-9d85-dec46d02e793
cdcf497f-58f8-4143-9bbc-cc2b5178201e
d04924cc-7a65-46f6-aa8a-5d2850261445
e386f227-9d9e-4da5-80f3-ea2251ae16f0
f2a8f65d-78ef-4c54-93f7-783501fa2a0f
f63a1132-1d49-46d6-bf26-f7f6e8ea991b
fcfad18c-69ca-4930-9bbb-e97432228896


There is no such node with this UUID:

[root@host tmp]# qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q | grep 0e0ce30c-ae99-4130-9d7f-6e6d16fe26b3
[root@host tmp]# 


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Run at scale; on some hosts the client (goferd) dies, etc.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Pradeep Kumar Surisetty 2017-01-28 02:38:51 UTC
[root@gprfc018 tmp]# hammer content-host list  --organization-id 1 --uuid 0e0ce30c-ae99-4130-9d7f-6e6d16fe26b3
[Foreman] Password for admin: 
---|------|-------------------
ID | NAME | INSTALLABLE ERRATA
---|------|-------------------
[root@gprfc018 tmp]#

Comment 2 Pavel Moravec 2017-01-28 15:53:19 UTC
Workaround:

for queue in $(generate list of pulp.agent.UUID queues); do
  rm -rf /var/lib/qpidd/.qpidd/qls/jrnl/${queue}
  qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 add queue $queue --durable
done
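
One concrete way to fill in the queue list (a sketch only; it assumes the orphaned queue names are exactly the pulp.agent.<UUID> names reported in the 'Node not found' errors, as in the grep from the description):

BROKER=amqps://localhost:5671
CERT=/etc/pki/katello/qpid_client_striped.crt

for queue in $(grep "error Error on attach: Node not found: " /var/log/messages \
               | awk -F "Node not found: " '{ print $2 }' | sort -u); do
    # drop any stale journal files left behind for this queue ...
    rm -rf "/var/lib/qpidd/.qpidd/qls/jrnl/${queue}"
    # ... and recreate it as a durable queue so the agent can attach again
    qpid-config --ssl-certificate="$CERT" -b "$BROKER" add queue "$queue" --durable
done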

Anyway, it is really weird that:
- there is a client that thinks it has some UUID and keeps trying to connect to qpidd via goferd
- Satellite isn't aware of a Content Host with that UUID (a hammer list returns nothing, and the katello_systems table in the psql foreman DB doesn't have it)
- qpidd does not know the pulp.agent.UUID queue (neither the QMF nor the real Queue object exists)
- BUT /var/lib/qpidd/.qpidd/qls/jrnl/ has a directory and journal files as if such a durable queue existed
- and the BDB backend of qpidd also has that pulp.agent.UUID in its (binary) logs - so the queue was present there at some point
(each of these observations can be re-checked with the sketch at the end of this comment)

That smells like 2 separate underlying problems:
- the Content Host thinks it has some UUID and actively uses it, but Satellite isn't aware of it
- qpidd thinks the queue isn't there, but it still has journal files for it

(A possible scenario leading to that: re-registering a Content Host from one Capsule to another without a goferd restart, such that it keeps the old UUID in memory and keeps trying to use it.)
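
For a single suspect UUID, the state described above can be re-checked with something like this (a sketch; the UUID is one taken from the description, and the commands mirror the ones used elsewhere in this BZ):

UUID=0e0ce30c-ae99-4130-9d7f-6e6d16fe26b3   # example UUID from the description

# 1) Satellite's view - empty output means Satellite does not know the host
hammer content-host list --organization-id 1 --uuid "$UUID"
su - postgres -c "psql foreman -c 'select * from katello_systems;'" | grep "$UUID"

# 2) qpidd's view - empty output means the broker has no such queue
qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt \
          -b amqps://localhost:5671 -q | grep "$UUID"

# 3) on-disk journal - a hit here means stale durable-queue state was left behind
ls -ld /var/lib/qpidd/.qpidd/qls/jrnl/pulp.agent."$UUID" 2>/dev/null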

Comment 3 Pavel Moravec 2017-01-28 16:20:51 UTC
I noticed first such errors appeared at:

Jan 26 05:09:37 gprfc018 qpidd: 2017-01-26 05:09:37 [Protocol] error Error on attach: Node not found: pulp.agent.13a3438c-cff9-49e2-8e09-a120b2d61996
Jan 26 05:09:37 gprfc018 qpidd[1088]: 2017-01-26 05:09:37 [Protocol] error Error on attach: Node not found: pulp.agent.13a3438c-cff9-49e2-8e09-a120b2d61996

(times in EST / the server time)

Please describe what actions happened shortly before that (wrt. content hosts: e.g., re-registering them from one capsule to another, rebooting them, changing something, ...).

Comment 10 Bryan Kearney 2017-08-11 13:32:26 UTC
Moving to high priority while we investigate.

Comment 11 Robert Lee 2017-08-15 16:42:02 UTC
What's going on with this? I've got similar problems.

Comment 12 Pavel Moravec 2017-08-15 19:26:46 UTC
(In reply to Robert Lee from comment #11)
> What's going on with this? I've got similar problems.

Please rather raise a support case so it can be investigated separately. Provide there:

- sosreport from Satellite and every Capsule
- three tables from backend databases:

mkdir bz1417345

cd bz1417345

su - postgres -c "psql foreman -c 'select * from katello_systems;'" | gzip > katello_systems.txt.gz

su - postgres -c "psql candlepin -c 'select * from cp_consumers;'" | gzip > cp_consumers.txt.gz

mongo pulp_database --eval "DBQuery.shellBatchSize = 10000000; db.consumers.find({}).shellPrint()" | gzip > consumers.txt.gz

cd ..

tar czf bz1417345.tar.gz bz1417345

(and attach the bz1417345.tar.gz file)

Comment 13 Pavel Moravec 2017-08-18 13:40:10 UTC
(In reply to Pavel Moravec from comment #12)
> (In reply to Robert Lee from comment #11)
> > ..
> su - postgres -c "psql candlepin -c 'select * from cp_consumers;'" | gzip >
> cp_consumers.txt.gz

Apologies for a typo; this command should be:

su - postgres -c "psql candlepin -c 'select * from cp_consumer;'" | gzip > cp_consumers.txt.gz

Comment 20 Chris Duryee 2018-04-03 13:27:39 UTC
repro steps:

* register system to Satellite (use a VM)
* suspend VM
* delete system via Satellite web UI
* wait at least 10 minutes for pulp to perform queue cleanup
* resume VM

At this point, you'll see the "node not found" error on the Satellite. You will also see similar errors from goferd on the client.

The VM suspend is needed because goferd will check if the system is still registered when it starts up. The 10 minute delay is needed so that the queue cleanup job removes the queue, which contains the "I've been unregistered" message for the client.
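
A minimal scripted version of those steps (a sketch only; the VM name, hostnames, organization, and activation key below are hypothetical placeholders):

# on the client VM: register it to the Satellite (hypothetical org/activation key)
subscription-manager register --org="Default_Organization" --activationkey="ak-rhel7"

# on the hypervisor: suspend the VM so goferd cannot see the unregistration
virsh suspend rhel7-client

# on the Satellite (or via the web UI): delete the host, then wait for the
# periodic queue-cleanup job (at least 10 minutes) to drop the pulp.agent.<UUID> queue
hammer host delete --name rhel7-client.example.com
sleep 700

# back on the hypervisor: resume the VM; goferd now tries to attach to a queue
# that no longer exists, producing the "Node not found" error on the Satellite
virsh resume rhel7-client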

Comment 26 Jonathon Turel 2018-06-19 16:32:48 UTC
Created redmine issue http://projects.theforeman.org/issues/24006 from this bug

Comment 27 Satellite Program 2018-06-20 18:20:54 UTC
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/24006 has been resolved.

Comment 33 Mike McCune 2018-09-18 15:41:50 UTC
Moving this back to ON_QA, as the test was done with an old version of katello-agent.

Comment 36 Jonathon Turel 2018-09-30 22:20:08 UTC
TK,

Can you tell me exactly which steps you used to reproduce the problem? Following the steps in #20, I got the output shown below, which makes me think the issue no longer exists:

Sep 28 16:50:59 rhel7.strangeways.example.com goferd[13054]: [INFO][pulp.agent.2591a830-b6e9-44d6-b442-9fb2f997abfc] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://sat-r220-01.lab.eng.rdu2.redhat.com:5647
Sep 28 16:50:59 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] gofer.agent.plugin:394 - plugin:katello, detached [pulp.agent.2591a830-b6e9-44d6-b442-9fb2f997abfc]
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] gofer.agent.plugin:692 - plugin:katello loaded using: katello.agent.goferd.plugin
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-5] gofer.rmi.store:108 - Using: /var/lib/gofer/messaging/pending/katello
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] rhsm.connection:868 - Connection built: host=sat-r220-01.lab.eng.rdu2.redhat.com port=443 handler=/rhsm auth=identity_cert ca_dir=/etc/rhsm/ca/ insecure=False
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] rhsm.connection:586 - Response: status=410, request="GET /rhsm/consumers/2591a830-b6e9-44d6-b442-9fb2f997abfc"
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] gofer.agent.plugin:494 - plugin:katello, reloaded

Comment 37 T K Sourab 2018-10-01 04:37:30 UTC
(In reply to Jonathon Turel from comment #36)
> TK,
> 
> Can you tell me exactly the steps you used to reproduce the problem?
> Following the steps in #20 I got output shown below which makes me thing the
> issue no longer exists:
> 
> Sep 28 16:50:59 rhel7.strangeways.example.com goferd[13054]:
> [INFO][pulp.agent.2591a830-b6e9-44d6-b442-9fb2f997abfc]
> gofer.messaging.adapter.proton.connection:131 - closed:
> proton+amqps://sat-r220-01.lab.eng.rdu2.redhat.com:5647
> Sep 28 16:50:59 rhel7.strangeways.example.com goferd[13054]:
> [INFO][Thread-4] gofer.agent.plugin:394 - plugin:katello, detached
> [pulp.agent.2591a830-b6e9-44d6-b442-9fb2f997abfc]
> Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]:
> [INFO][Thread-4] gofer.agent.plugin:692 - plugin:katello loaded using:
> katello.agent.goferd.plugin
> Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]:
> [INFO][Thread-5] gofer.rmi.store:108 - Using:
> /var/lib/gofer/messaging/pending/katello
> Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]:
> [INFO][Thread-4] rhsm.connection:868 - Connection built:
> host=sat-r220-01.lab.eng.rdu2.redhat.com port=443 handler=/rhsm
> auth=identity_cert ca_dir=/etc/rhsm/ca/ insecure=False
> Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]:
> [INFO][Thread-4] rhsm.connection:586 - Response: status=410, request="GET
> /rhsm/consumers/2591a830-b6e9-44d6-b442-9fb2f997abfc"
> Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]:
> [INFO][Thread-4] gofer.agent.plugin:494 - plugin:katello, reloaded

1. Registered 3.5K hosts to the Satellite.
2. Installed the required packages via Remote Execution.
3. Deleted a host from the Hosts tab.
4. Waited for at least an hour.

Comment 39 Jonathon Turel 2018-10-01 14:48:45 UTC
Thanks TK. I was just talking with Chris Duryee, who provided the repro steps above, and he made an important observation: the error you're seeing is still going to be logged - but it should not be logged over and over again. You should observe that the agent is reloaded, with a log line such as the one I pasted above.

Do you observe that the plugin is reloading or does that error continue to fill the logs?

Reload log line looks like: 0ef1a86dfa650e76785cdab4a738cfbc940630e7
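
A quick way to check that on the client (a sketch; the 'reloaded' string matches the gofer.agent.plugin message shown in comment 36):

journalctl -u goferd | grep "plugin:katello, reloaded"   # or grep the same string in /var/log/messages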

Comment 40 Jeff Ortel 2018-10-01 15:30:36 UTC
The BZ was written without "Expected results:", so based on the available information, engineering determined that the main complaint was the repeated logging of 'node not found'. That was resolved. However, when a host is removed (unregistered) on the server and its queue is deleted out from under the agent, a single error will still be logged. This is by design. However, if the expectation is that either:

1. A single message is logged at WARN/DEBUG (instead of ERROR)
2. Nothing is logged.

Please update the BZ.

I vote for logging at DEBUG because it eliminates the noise in the log while still supporting troubleshooting.

Comment 41 Mike McCune 2018-10-01 15:34:01 UTC
+1 to DEBUG log level

Comment 45 errata-xmlrpc 2018-10-16 17:17:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2929

Comment 46 T K Sourab 2019-11-15 07:09:54 UTC
Can we remove the needinfo flag from this issue since it's resolved?