Description of problem:

Have been running 6k+ content hosts with Satellite 6.2 / capsule. Started noticing 'error Error on attach: Node not found:' messages:

---
qpidd[13021]: 2017-01-27 21:35:37 [Protocol] error Error on attach: Node not found: pulp.agent.7e8e5e57-b61b-49be-a023-d2d375914382
qpidd: 2017-01-27 21:35:37 [Protocol] error Error on attach: Node not found: pulp.agent.7e8e5e57-b61b-49be-a023-d2d375914382
qpidd: 2017-01-27 21:35:40 [Protocol] error Error on attach: Node not found: pulp.agent.4d91dd05-a52c-418d-8535-ffb2eb9bc18d
qpidd[13021]: 2017-01-27 21:35:40 [Protocol] error Error on attach: Node not found: pulp.agent.4d91dd05-a52c-418d-8535-ffb2eb9bc18d
---

Each UUID below has an orphaned queue with no node associated with it:

[root@host tmp]# cat /var/log/messages | grep "error Error on attach: Node not found" | awk -F "error Error on attach: Node not found: " '{ print $2 }' | cut -d. -f3 | sort -u
01342d48-7dd6-4f41-b53c-97c92ed20818
0e0ce30c-ae99-4130-9d7f-6e6d16fe26b3
13a3438c-cff9-49e2-8e09-a120b2d61996
170a7bd6-d892-4e49-baa3-c5610b294979
20c8e747-b0f3-4c9f-bd32-bd6052140a81
27a99624-8c85-457f-b976-73636e52d4fc
3607362d-b323-47c4-b630-2da60fb21544
3613911b-5441-491b-8206-5c8869359669
49bfa8a3-e128-4cf9-b146-98f830399dba
4d91dd05-a52c-418d-8535-ffb2eb9bc18d
4f6a2f5b-da05-4ff8-a708-80464762c0bb
53d9acd9-445c-4f61-aa5b-160515ee9245
54806e8b-a2c3-489c-a3e6-d5457982e44d
59256ebe-57e0-41d9-84d3-5cc412aa7926
6b053bfa-5da0-4758-b3bb-9f02ce62f9f1
7376d9ce-7e32-4f67-943c-6df196f1e073
7714e9fc-a545-4f50-b614-3076a602f53c
7e8e5e57-b61b-49be-a023-d2d375914382
8aee2667-a8f5-4f44-9c8f-a5fb2db6f498
8c18a6cb-9c84-4427-8075-e9b67ee299a2
8ec32fd5-23f5-477d-819a-980b4a1ab93a
9667fea9-0efa-4289-871c-6a0a4e56df87
ae4dd874-8101-44b5-90d6-002d9f83e390
b110adc1-5c61-4fb1-91d9-3a5c3c6fa90f
b65b87d3-4842-4fab-aa4a-310137ebea4b
b8d0f181-6d86-4518-bc29-a31ed0bb2fc2
be1a0629-55ed-4d1b-b307-bc34cf85fcc4
c432563f-baa6-4bad-9d85-dec46d02e793
cdcf497f-58f8-4143-9bbc-cc2b5178201e
d04924cc-7a65-46f6-aa8a-5d2850261445
e386f227-9d9e-4da5-80f3-ea2251ae16f0
f2a8f65d-78ef-4c54-93f7-783501fa2a0f
f63a1132-1d49-46d6-bf26-f7f6e8ea991b
fcfad18c-69ca-4930-9bbb-e97432228896

There is no such node with this UUID:

[root@host tmp]# qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 -q | grep 0e0ce30c-ae99-4130-9d7f-6e6d16fe26b3
[root@host tmp]#

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run at scale; some hosts' clients die, etc.
2.
3.

Actual results:

Expected results:

Additional info:
[root@gprfc018 tmp]# hammer content-host list --organization-id 1 --uuid 0e0ce30c-ae99-4130-9d7f-6e6d16fe26b3
[Foreman] Password for admin:
---|------|-------------------
ID | NAME | INSTALLABLE ERRATA
---|------|-------------------
[root@gprfc018 tmp]#
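A minimal check along these lines can confirm, for each UUID pulled from the log, whether qpidd still knows the queue and whether journal files remain on disk. This is only a sketch; it assumes the same /var/log/messages contents, journal directory, and client certificate path shown above.

# Sketch: per-UUID check of broker state vs. on-disk journal files.
for uuid in $(grep "error Error on attach: Node not found" /var/log/messages \
              | awk -F "error Error on attach: Node not found: " '{ print $2 }' \
              | cut -d. -f3 | sort -u); do
    echo "== ${uuid}"
    # is the queue known to the broker?
    if qpid-stat --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt \
                 -b amqps://localhost:5671 -q | grep -q "${uuid}"; then
        echo "   queue exists in qpidd"
    else
        echo "   queue NOT known to qpidd"
    fi
    # are there leftover journal files on disk?
    ls -d "/var/lib/qpidd/.qpidd/qls/jrnl/pulp.agent.${uuid}" 2>/dev/null \
        || echo "   no journal directory"
done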
Workaround:

for queue in $(generate list of pulp.agent.UUID queues); do
  rm -rf /var/lib/qpidd/.qpidd/qls/jrnl/${queue}
  qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt -b amqps://localhost:5671 add queue $queue --durable
done

(one possible way to generate that list is sketched after this comment)

Anyway, it is really weird that:
- there is a client that thinks it has some UUID and keeps trying to connect via goferd to qpidd
- Satellite isn't aware of a Content Host with that UUID (hammer list returns nothing, and the katello_systems table in the psql foreman DB doesn't have it)
- qpidd does not know the pulp.agent.UUID queue (neither the QMF nor the real Queue object exists)
- BUT /var/lib/qpidd/.qpidd/qls/jrnl/ has a directory and journal files as if such a durable queue existed
- and the BDB backend of qpidd also has that pulp.agent.UUID in its (binary) logs, so the queue was present there at some point

That smells like 2 separate underlying problems:
- the Content Host thinks it has some UUID and actively uses it, but Satellite isn't aware of it
- qpidd thinks the queue isn't there but still has journal files for it

(a possible scenario leading to that: re-registering a Content Host from one Capsule to another without a goferd restart, such that it keeps the old UUID in memory and keeps trying to use it)
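One possible way to drive the workaround loop above, as a sketch only: it reuses the log-derived UUID list from the description and assumes the same journal and certificate paths.

# Sketch: rebuild the orphaned durable queues reported in the logs.
# The UUID extraction mirrors the grep/awk pipeline from the description;
# journal and certificate paths are assumed to match the ones shown there.
for uuid in $(grep "error Error on attach: Node not found" /var/log/messages \
              | awk -F "error Error on attach: Node not found: " '{ print $2 }' \
              | cut -d. -f3 | sort -u); do
    queue="pulp.agent.${uuid}"
    rm -rf "/var/lib/qpidd/.qpidd/qls/jrnl/${queue}"
    qpid-config --ssl-certificate=/etc/pki/katello/qpid_client_striped.crt \
                -b amqps://localhost:5671 add queue "${queue}" --durable
done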
I noticed the first such errors appeared at:

Jan 26 05:09:37 gprfc018 qpidd: 2017-01-26 05:09:37 [Protocol] error Error on attach: Node not found: pulp.agent.13a3438c-cff9-49e2-8e09-a120b2d61996
Jan 26 05:09:37 gprfc018 qpidd[1088]: 2017-01-26 05:09:37 [Protocol] error Error on attach: Node not found: pulp.agent.13a3438c-cff9-49e2-8e09-a120b2d61996

(times in EST / the server time)

Please describe what actions happened shortly before that (wrt. content hosts, like (re)registering them from one capsule to another, rebooting them, changing something, ...).
Moving to high while we investigate.
What's going on with this? I've got similar problems.
(In reply to Robert Lee from comment #11)
> What's going on with this? I've got similar problems.

Please raise a support case instead so it can be investigated separately. Provide there:
- a sosreport from the Satellite and from every Capsule
- three tables from the backend databases:

mkdir bz1417345
cd bz1417345
su - postgres -c "psql foreman -c 'select * from katello_systems;'" | gzip > katello_systems.txt.gz
su - postgres -c "psql candlepin -c 'select * from cp_consumers;'" | gzip > cp_consumers.txt.gz
mongo pulp_database --eval "DBQuery.shellBatchSize = 10000000; db.consumers.find({}).shellPrint()" | gzip > consumers.txt.gz
cd ..
tar czf bz1417345.tar.gz bz1417345

(and attach the bz1417345.tar.gz file)
(In reply to Pavel Moravec from comment #12)
> (In reply to Robert Lee from comment #11)
> > ..
> su - postgres -c "psql candlepin -c 'select * from cp_consumers;'" | gzip > cp_consumers.txt.gz

Apologies for the typo; this command should be:

su - postgres -c "psql candlepin -c 'select * from cp_consumer;'" | gzip > cp_consumers.txt.gz
Repro steps:

* register system to Satellite (use a VM)
* suspend VM
* delete system via Satellite web UI
* wait at least 10 minutes for pulp to perform queue cleanup
* resume VM

At this point, you'll see the "node not found" error on the Satellite. You will also see similar errors from goferd on the client.

The VM suspend is needed because goferd will check if the system is still registered when it starts up. The 10 minute delay is needed so that the queue cleanup job removes the queue, which contains the "I've been unregistered" message for the client.
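A rough command-level version of those steps, as a sketch only: it assumes a libvirt-managed VM named "client-vm" registered as client-vm.example.com with a placeholder org and activation key, and it substitutes an equivalent hammer call for the web UI deletion.

# Sketch of the repro, not a verified script; names below are placeholders.

# on the client VM: register and make sure goferd is running
subscription-manager register --org="ExampleOrg" --activationkey="example-ak"
systemctl start goferd

# on the hypervisor: suspend the VM so goferd cannot react to the deletion
virsh suspend client-vm

# on the Satellite: delete the host (same effect as the web UI action)
hammer host delete --name client-vm.example.com

# wait at least 10 minutes for pulp's queue cleanup to remove pulp.agent.<uuid>
sleep 660

# resume the VM; goferd now tries to attach to a queue that no longer exists
virsh resume client-vm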
Created redmine issue http://projects.theforeman.org/issues/24006 from this bug
Moving this bug to POST for triage into Satellite 6 since the upstream issue http://projects.theforeman.org/issues/24006 has been resolved.
Moving this back to ON_QA as the test was done with an old version of katello-agent.
TK,

Can you tell me exactly the steps you used to reproduce the problem? Following the steps in #20 I got the output shown below, which makes me think the issue no longer exists:

Sep 28 16:50:59 rhel7.strangeways.example.com goferd[13054]: [INFO][pulp.agent.2591a830-b6e9-44d6-b442-9fb2f997abfc] gofer.messaging.adapter.proton.connection:131 - closed: proton+amqps://sat-r220-01.lab.eng.rdu2.redhat.com:5647
Sep 28 16:50:59 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] gofer.agent.plugin:394 - plugin:katello, detached [pulp.agent.2591a830-b6e9-44d6-b442-9fb2f997abfc]
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] gofer.agent.plugin:692 - plugin:katello loaded using: katello.agent.goferd.plugin
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-5] gofer.rmi.store:108 - Using: /var/lib/gofer/messaging/pending/katello
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] rhsm.connection:868 - Connection built: host=sat-r220-01.lab.eng.rdu2.redhat.com port=443 handler=/rhsm auth=identity_cert ca_dir=/etc/rhsm/ca/ insecure=False
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] rhsm.connection:586 - Response: status=410, request="GET /rhsm/consumers/2591a830-b6e9-44d6-b442-9fb2f997abfc"
Sep 28 16:51:01 rhel7.strangeways.example.com goferd[13054]: [INFO][Thread-4] gofer.agent.plugin:494 - plugin:katello, reloaded
(In reply to Jonathon Turel from comment #36)
> TK,
>
> Can you tell me exactly the steps you used to reproduce the problem?
> Following the steps in #20 I got the output shown below, which makes me think
> the issue no longer exists:
> ..

1. Registered 3.5K hosts to Satellite.
2. Installed required packages via Remote Execution.
3. Deleted a host from the Hosts tab.
4. Waited for at least an hour.
Thanks TK. I was just talking with Chris Duryee, who provided the repro steps above, and he made an important observation: the error you're seeing is still going to be logged, but it should not be logged over and over again. You should observe that the agent is reloaded with a log line such as the one I pasted above. Do you observe that the plugin is reloading, or does that error continue to fill the logs?

Reload log line looks like: 0ef1a86dfa650e76785cdab4a738cfbc940630e7
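One quick way to tell which behaviour a client shows (a sketch, assuming goferd logs via syslog to /var/log/messages on the client) is to compare how often each of the two lines appears:

# Sketch: a single attach error followed by a reload is the fixed behaviour;
# a steadily growing error count with no reloads is the old behaviour.
grep -c "Error on attach: Node not found" /var/log/messages
grep -c "plugin:katello, reloaded" /var/log/messages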
The BZ was written without "Expected results:", so based on the available information, engineering determined that the main complaint was the repeated logging of 'node not found'. This was resolved. However, when a host is removed (unregistered) on the server and the queue is deleted out from under the agent, a single error will still be logged. This is by design.

If the expectation is instead that either:
1. a single message is logged at WARN/DEBUG (instead of ERROR), or
2. nothing is logged,
please update the BZ.

I vote for logging at DEBUG because it eliminates the noise in the log while still supporting troubleshooting.
+1 to DEBUG log level
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2929
Can we remove the needinfo flag from this issue since it's resolved?