Description of problem:
While patching systems using Katello Agent, errata management tasks/actions hung for more than 14 hours, at which point we cancelled them. The hang was caused by qpidd/qdrouterd on the Satellite server not having capacity for all the clients we have (we need to increase max open files and fs.aio-max-nr), but this bug is about the task hanging indefinitely.

Version-Release number of selected component (if applicable):
Red Hat Satellite (build: 6.6.0 Beta), Version 6.6

How reproducible:
Always

Steps to Reproduce:
1. Log in to the Satellite operational portal.
2. Navigate to Hosts -> Content Hosts -> Select multiple hosts -> Click on Select Action -> Manage Errata -> Select all the errata available -> Click on Install Selected -> via katello agent -> Done
3. To check the status of the task, go to Monitor -> Tasks. There you will see one task named "Bulk action" which is shown as pending.

Actual results:
The task is stuck; it started 14 hours ago. We tried with a group of 10 hosts: about half failed and the rest hung.

Expected results:
The bulk action should not get stuck. It should end in either success or failure.

Additional info:
Katello Agent is installed on all the hosts.
Applying the following tunings resolved the problem:

https://github.com/redhat-performance/satellite-tune/blob/master/ansible/roles/qpidd-fs-aio-max-nr/tasks/main.yaml
https://github.com/redhat-performance/satellite-tune/blob/master/ansible/roles/qdrouterd-max-open-files/tasks/main.yaml
https://github.com/redhat-performance/satellite-tune/blob/master/ansible/roles/qpidd-max-open-files/tasks/main.yaml
Clearing the needinfo, as no request was made.
We definitely have a limit of roughly 500 connected agents in 6.6 with default settings. I configured ~700 agent containers to connect to a Satellite 6.6 server, and it maxes out at about 500:

# qpid-stat -q --ssl-certificate=/etc/pki/pulp/qpid/client.crt -b amqps://localhost:5671 |grep pulp.agent | wc -l
503
# docker ps | wc -l
700

New connection attempts result in:

Sep 28 11:37:13 ci-vm-10-0-150-175.hosted.upshift.rdu2.redhat.com goferd[13653]: [ERROR][worker-0] gofer.messaging.adapter.connect:33 - connect: proton+amqps://sat-r220-09.lab.eng.rdu2.redhat.com:5647, failed: Connection amqps://sat-r220-09.lab.eng.rdu2.redhat.com:5647 disconnected: Condition('amqp:resource-limit-exceeded', 'local-idle-timeout expired')

The fix is to add these 2 settings to /etc/foreman-installer/custom-hiera.yaml:

qpid::open_file_limit: 65536
qpid::router::open_file_limit: 150100

Then run 'satellite-installer' and restart. Once applied, clients are able to connect:

Sep 28 11:58:18 ci-vm-10-0-150-175.hosted.upshift.rdu2.redhat.com goferd[13653]: [INFO][pulp.agent.70dc6424-48d7-43bf-92a0-f465df9eea89] gofer.messaging.adapter.connect:30 - connected: proton+amqps://sat-r220-09.lab.eng.rdu2.redhat.com:5647

This is covered in the 6.5 and 6.6 Tuning Guide:

https://access.redhat.com/solutions/4224211

as well as in the documented Tuning Profiles:

https://github.com/RedHatSatellite/satellite-support/tree/master/tuning-profiles

Closing this as NOTABUG, since it is documented in our tuning guides.
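
To confirm that the new open file limits actually took effect after the installer run, one option is to read the limits of the running broker and router processes from /proc. The following is a minimal sketch, not part of the tuning guide: the helper names (parse_max_open_files, find_pids, report) are made up for illustration, and it assumes the processes appear as "qpidd" and "qdrouterd" in /proc/<pid>/comm. Reading another user's /proc/<pid>/limits typically requires root.

```python
import os
import re


def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits.

    A value of None means the kernel reports the limit as 'unlimited'.
    """
    for line in limits_text.splitlines():
        match = re.match(r"Max open files\s+(\S+)\s+(\S+)", line)
        if match:
            return tuple(
                None if value == "unlimited" else int(value)
                for value in match.groups()
            )
    raise ValueError("no 'Max open files' row found")


def find_pids(name, proc_root="/proc"):
    """Return the PIDs whose /proc/<pid>/comm equals `name`."""
    pids = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as fh:
                if fh.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process exited or is not readable
    return pids


def report(names=("qpidd", "qdrouterd"), proc_root="/proc"):
    """Print the soft/hard open file limits of each matching process."""
    # Process names are an assumption; adjust if they differ on your system.
    for name in names:
        for pid in find_pids(name, proc_root):
            with open(os.path.join(proc_root, str(pid), "limits")) as fh:
                soft, hard = parse_max_open_files(fh.read())
            print(f"{name} (pid {pid}): soft={soft} hard={hard}")
```

Calling report() as root on the Satellite server prints one line per process. If the values still show the old defaults after running the installer, the services have probably not been restarted yet.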
Note: fs.aio-max-nr tuning is not required for 500 gofer/katello-agent clients; only the open_file_limit settings are.
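
For anyone who wants to check whether the AIO limit is anywhere near being a constraint on their own server, the kernel exposes the current usage and the ceiling as fs.aio-nr and fs.aio-max-nr under /proc/sys/fs. A small sketch follows; the function name aio_usage is hypothetical, and the bug gives no threshold at which this becomes a problem.

```python
from pathlib import Path


def aio_usage(proc_sys_fs="/proc/sys/fs"):
    """Return (in_use, maximum, fraction) for kernel AIO contexts.

    in_use comes from fs.aio-nr, maximum from fs.aio-max-nr.
    """
    base = Path(proc_sys_fs)
    in_use = int((base / "aio-nr").read_text())
    maximum = int((base / "aio-max-nr").read_text())
    # Guard against a zero ceiling so the ratio is always defined.
    fraction = in_use / maximum if maximum else float("inf")
    return in_use, maximum, fraction
```

A fraction well below 1.0 would be consistent with the note above that only the open file limits matter at this client count.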