Description of problem:
logrotate sends a HUP to httpd on a daily basis to release and recreate logfiles after rotation. This triggers a restart of all Pulp-related child processes of Apache. During the restart, Katello gets an HTTP 500 error for any new jobs.

Version-Release number of selected component (if applicable):
Satellite 6.2.12

How reproducible:
At random times

Steps to Reproduce:
1. Add sync jobs to Satellite
2. Wait until Actions::Pulp::Repository::DistributorPublish happens at the same time as logrotate

Actual results:
Actions::Pulp::Repository::DistributorPublish fails with error RestClient::InternalServerError 500 Internal Server Error

Expected results:
Katello jobs do not fail, or they retry once Pulp is accepting requests again.

Additional info:
Further feedback about the issue. The Actions::Pulp::Repository::DistributorPublish jobs are filed by a cron daily script and - if there are unpublished repos - executed at the same time as the httpd logrotate. A clash is almost unavoidable:

---8<---
# rpm -qf /etc/cron.daily/katello-repository-publish-check
katello-common-3.0.0-26.el7sat.noarch
# cat /etc/cron.daily/katello-repository-publish-check
#!/bin/env bash

# Script for checking for any repositories that have not been published since
# their last sync, and republishing them
foreman-rake katello:publish_unpublished_repositories RAILS_ENV=production >/dev/null 2>&1
---8<---
If logrotate sends SIGHUP to httpd, it's expected that the child httpd processes will be terminated and restarted, right? I guess SIGHUP is the default for logrotate. Maybe a more graceful restart of Apache would be better, e.g. SIGUSR1? https://httpd.apache.org/docs/2.4/stopping.html#graceful

I'm not sure what Pulp can do here. Any suggestions from the Katello side on how to handle Apache downtime?
Tanya, I had a quick check of how logrotate restarts Apache:

# less /etc/logrotate.d/httpd
...
    postrotate
        /bin/systemctl reload httpd.service > /dev/null 2>/dev/null || true

# less /usr/lib/systemd/system/httpd.service
...
ExecReload=/usr/sbin/httpd $OPTIONS -k graceful

So we already have a graceful restart. The running requests should be served correctly. From my understanding, Apache usually keeps new connections on hold until the backends are started again; at least I never had an issue with PHP or FastCGI based applications during a restart. We might have a situation in WSGI where requests are already being handled before the Python application is ready.

But yes, the longer I think about this kind of issue, the more I think Katello should do some retry when Pulp returns unexpected error messages.

HTH,
Beat
We could add a retry for simple task status fetching (which I believe is the case here), but I'm hesitant to add it for all calls to Pulp. That should at least get us through the restart and not 'block' the task monitoring piece.
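For illustration only, a retry limited to the status-fetching call could look roughly like the sketch below. This is not Katello's actual code: `with_retries`, `TransientServerError` (a stand-in for RestClient::InternalServerError so the snippet runs without the rest-client gem), the attempt count, and the delays are all assumptions made up for this example.

```ruby
# Hypothetical stand-in for RestClient::InternalServerError, so this sketch
# runs without the rest-client gem installed.
class TransientServerError < StandardError; end

# Run the block, retrying only on TransientServerError.
# Gives up after max_attempts tries and re-raises the last error.
# Waits base_delay * 2**(attempt - 1) seconds between tries.
# sleeper is injectable so callers (or tests) can skip the real sleep.
def with_retries(max_attempts: 5, base_delay: 1.0, sleeper: ->(s) { sleep(s) })
  attempt = 0
  begin
    attempt += 1
    yield attempt
  rescue TransientServerError
    raise if attempt >= max_attempts
    sleeper.call(base_delay * (2**(attempt - 1)))
    retry
  end
end

# Simulated status poll: the first two calls fail as if httpd were still
# reloading, the third one succeeds.
calls = 0
status = with_retries(max_attempts: 4, base_delay: 0.01) do
  calls += 1
  raise TransientServerError, "500 Internal Server Error" if calls < 3
  "finished"
end
```

Only the idempotent status fetch is wrapped here. Calls that change state in Pulp are deliberately left out, which matches the hesitation above about retrying every call.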
The root cause of the WSGI app failure is likely the same as in BZ#1516481. An update of gofer is needed to avoid this failure.
Since most seem to think that this is indeed on the Pulp / gofer side, I have reassigned the component and marked it as dependent on BZ#1516481. Please let me know if it should be a duplicate or something else.
I think this can also be closed as fixed in 6.4. The dependent BZ https://bugzilla.redhat.com/show_bug.cgi?id=1572298 is already closed for 6.4
Thanks Peter! Closing this one based upon comment 13. The solution for bug 1572298 is indeed in 6.4. If there are any concerns or if the issue re-appears, please feel free to re-open.