Bug 1850934
Summary: | Satellite-installer failed with error "Could not evaluate: Proxy xyz..com cannot be retrieved: unknown error (response 502)" | |
---|---|---|---
Product: | Red Hat Satellite | Reporter: | Devendra Singh <desingh>
Component: | Installation | Assignee: | Eric Helms <ehelms>
Status: | CLOSED ERRATA | QA Contact: | Devendra Singh <desingh>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 6.8.0 | CC: | ahumbe, ehelms, ekohlvan, mawerner, mmccune, swadeley, vijsingh, vsedmik, zhunting
Target Milestone: | 6.8.0 | Keywords: | AutomationBlocker, Triaged, Upgrades
Target Release: | Unused | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | foreman-2.1.2.2-1 | Doc Type: | Known Issue
Doc Text: | Satellite-installer failed with error "Could not evaluate: Proxy xyz..com cannot be retrieved: unknown error (response 502)" | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2020-10-27 13:03:32 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Devendra Singh
2020-06-25 08:30:49 UTC
This appears to be some sort of timing issue:

1) I can reproduce the issue on qe-sat6-upgrade-rhel7, but not with 100% reliability:

```
[root@qe-sat6-upgrade-rhel7 ~]# foreman-maintain packages install -y foreman-discovery-image
...
--------------------------------------------------------------------------------
Procedures::Installer::Upgrade: [OK]
--------------------------------------------------------------------------------
Check status of version locking of packages:
Automatic locking of package versions is enabled in installer. Packages are locked. [OK]
--------------------------------------------------------------------------------
```

**** Worked fine, try again:

```
[root@qe-sat6-upgrade-rhel7 ~]# foreman-maintain packages install -y foreman-discovery-image
...
--------------------------------------------------------------------------------
Install packages: [OK]
--------------------------------------------------------------------------------
Procedures::Installer::Upgrade: [FAIL]
/Stage[main]/Foreman_proxy::Register/Foreman_smartproxy[qe-sat6-upgrade-rhel7.satqe.lab.eng.rdu2.redhat.com]: Could not evaluate: Proxy qe-sat6-upgrade-rhel7.satqe.lab.eng.rdu2.redhat.com cannot be retrieved: unknown error (response 502)
```

2) I tried this repeatedly on other VMs and bare metal and it worked 100% of the time, with no errors.

Is there a host, other than qe-sat6-upgrade-rhel7, where we can reproduce this error?

Updated the title of the BZ to reflect that the error now only occurs when using foreman-maintain to install packages. Please clarify if the bug is broader than that.

It is broader than that, and the previous summary was fine. We've also seen it in upstream CI, where we don't use foreman-maintain at all. The problem is that Apache returns an HTTP 502 when foreman-installer attempts to register the Foreman Proxy. It does this when it makes an HTTP call to the REST API in order to get the installed proxy.

Ewoud, do we have a plan of attack for this issue? How frequent is it? The workaround is just "run the installer again", right?

We don't see it anymore in upstream CI, so it disappeared from our radar. It has all the signs of a timing issue, which is probably a missing dependency. Today I managed to reproduce it on my own system with a very underpowered VM.
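Since the failure happens on the installer's REST lookup of the smart proxy, the same request can be issued by hand to see whether the 502 is reproducible outside the installer. The sketch below is illustrative only: the URL pattern is taken from the installer debug log further down, the admin credentials are placeholders, and the default Puma listener on 127.0.0.1:3000 is assumed.

```
# Query the smart proxy through Apache, the same call foreman_smartproxy makes
# (replace admin:changeme with real credentials).
curl -sk -u admin:changeme \
  "https://$(hostname -f)/api/v2/smart_proxies?search=name=%22$(hostname -f)%22"

# Check whether the backend application behind Apache is answering at all;
# during the failure window this is the part that is still booting.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3000/
```

If the direct request to port 3000 hangs or fails while the service claims to be running, the 502 from Apache is just the proxy timeout firing, which matches the analysis below.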
Some observations.

From the installer log:

```
[ WARN 2020-07-28T09:15:18 main] /Service[foreman]: Triggered 'refresh' from 1 event
[DEBUG 2020-07-28T09:15:18 main] /Service[foreman]: The container Class[Foreman::Service] will propagate my refresh event
[ INFO 2020-07-28T09:15:18 main] /Service[foreman]: Evaluated in 1.22 seconds
--- SNIP ---
[DEBUG 2020-07-28T09:15:51 main] Foreman_smartproxy[centos7-katello-nightly.wisse.example.com](provider=rest_v3): Making get request to https://centos7-katello-nightly.wisse.example.com/api/v2/smart_proxies?search=name=%22centos7-katello-nightly.wisse.example.com%22
[DEBUG 2020-07-28T09:16:52 main] Foreman_smartproxy[centos7-katello-nightly.wisse.example.com](provider=rest_v3): Received response 502 from request to https://centos7-katello-nightly.wisse.example.com/api/v2/smart_proxies?search=name=%22centos7-katello-nightly.wisse.example.com%22
[ERROR 2020-07-28T09:16:52 main] /Stage[main]/Foreman_proxy::Register/Foreman_smartproxy[centos7-katello-nightly.wisse.example.com]: Could not evaluate: Proxy centos7-katello-nightly.wisse.example.com cannot be retrieved: unknown error (response 502)
```

This corresponds to the Apache log:

```
[Tue Jul 28 09:16:51.984032 2020] [proxy_http:error] [pid 15486] (70007)The timeout specified has expired: [client 192.168.122.182:59632] AH01102: error reading status line from remote server 127.0.0.1:3000
```

The interesting part is the service restart. It happens in 1 second. However, looking at the systemd journal, it takes much longer:

```
Jul 28 09:15:18 centos7-katello-nightly.wisse.example.com systemd[1]: Stopped Foreman.
Jul 28 09:15:18 centos7-katello-nightly.wisse.example.com systemd[1]: Started Foreman.
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: /usr/share/foreman/lib/foreman.rb:8: warning: already initialized constant Foreman::UUID_REGEXP
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: /usr/share/foreman/lib/foreman.rb:8: warning: previous definition of UUID_REGEXP was here
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: => Booting Puma
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: => Rails 6.0.3.1 application starting in production
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: => Run `rails server --help` for more startup options
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] Puma starting in cluster mode...
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Version 4.3.3 (ruby 2.5.5-p157), codename: Mysterious Traveller
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Min threads: 0, max threads: 16
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Environment: production
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Process workers: 2
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Preloading application
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Activated tcp://127.0.0.1:3000
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] Use Ctrl-C to stop
Jul 28 09:16:49 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] - Worker 1 (pid: 7273) booted, phase: 0
Jul 28 09:16:49 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] - Worker 0 (pid: 7268) booted, phase: 0
```

It actually came online just before the timeout of 1 minute was reached. Most likely the individual workers were still booting and not ready to respond.

The restart time of 1 second is surprising, because you would expect it to take longer. This is because the service uses the simple type in systemd: it starts the process and returns.

```
# systemctl cat foreman.service | grep Type
Type=simple
```

I looked into using Type=notify (which would make systemctl (re)start wait for the service to really be online, so the installer could then guarantee it's online) in https://projects.theforeman.org/issues/29356, but the library needs some work to be compatible with current versions of Puma.

This leads me to the conclusion that, at least as far as I can reproduce it, this is again related to slow hardware. We've seen this more often in bug reports, also when using Passenger.

It is possible to increase the timeout by setting it via Hiera. Either a global default for all proxy config:

```
apache::mod::proxy::proxy_timeout: 120
```

Or specifically for the Foreman vhost:

```
foreman::config::apache::proxy_params:
  retry: 0
```

The `retry: 0` is what we set by default. Options are documented at https://httpd.apache.org/docs/current/mod/mod_proxy.html#proxypass. I suspect it's either connectiontimeout or timeout (or both) that need to be raised.

I was asked to be more concrete in the suggestion. Based on my previous comment, I'd suggest running the following command (as root) prior to installing/upgrading as a workaround:

```
echo "apache::mod::proxy::proxy_timeout: 120" >> /etc/foreman-installer/custom-hiera.yaml
```

I have no basis for 120 other than it's twice as high as the observed default of 60. That this makes a difference is based only on reasoning and reading documentation; I didn't actually test it myself. After running it, you should observe that it sets ProxyTimeout in /etc/httpd/conf.modules.d/proxy.conf.
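Put together, the workaround might look like the following sequence; the 120-second value and the echo command are taken from the suggestion above, while the installer re-run and the grep are just a sanity check of where the Puppet apache module writes the directive.

```
# Raise the mod_proxy timeout via the installer's custom Hiera data
# (run as root before installing/upgrading).
echo "apache::mod::proxy::proxy_timeout: 120" >> /etc/foreman-installer/custom-hiera.yaml

# Re-run the installer so the setting is applied, then confirm the directive landed.
satellite-installer
grep ProxyTimeout /etc/httpd/conf.modules.d/proxy.conf
```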
(In reply to Ewoud Kohl van Wijngaarden from comment #26)
> I was asked to be more concrete in the suggestion. Based on my previous, I'd
> suggest running the following command (as root) prior to
> installing/upgrading as a workaround:
>
> echo "apache::mod::proxy::proxy_timeout: 120" >>
> /etc/foreman-installer/custom-hiera.yaml
>
> I have no basis for 120 other than it's twice as high as the observed
> default of 60. That this makes a difference is only based on reasoning and
> reading documentation, didn't actually test this myself.
>
> After running it, you should observe it sets ProxyTimeout in
> /etc/httpd/conf.modules.d/proxy.conf.

This workaround works. I tried it after applying the workaround and didn't see any problem.

Below are observations from the Azure provisioning tests using the provided workaround.

6.8.0:
1) After setting the Proxy_Timeout value to 120 (/etc/httpd/conf.modules.d/proxy.conf), the tests (UIs/APIs/CLIs) still fail with errors such as "502 Server Error: Proxy Error", "TimeOutError" or "Error: 502 Bad Gateway".
2) After setting the Proxy_Timeout value to 450+, all tests run fine.

However, in 6.7:
1) The Proxy_Timeout value is 60 (the default, in /etc/httpd/conf.modules.d/proxy.conf) and the UIs/CLIs/APIs tests complete successfully.

So overall it looks like the Proxy_Timeout value needs to be set 3 to 4 times higher than in 6.7.z.

The difference is that previously the proxy timeout was mostly irrelevant. mod_passenger, as used in 6.7, runs inside the Apache process, while Puma, as used in 6.8, runs as a separate process and Apache proxies to it.

It is very concerning that a value higher than 120 is needed. That is *very* slow. That a higher timeout does work indicates a high load rather than actual failures. It would be good to know how heavily loaded the machine is. Can you collect some numbers on that?

With 6.8 you can revert back to Passenger using `--foreman-passenger true`. That's another thing you can try to verify. With that I would also use the default proxy timeout.
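To answer the "how heavily loaded is the machine" question, a few standard commands are usually enough. This is a generic sketch rather than anything specific to Satellite; the sampling intervals are arbitrary.

```
# Rough load snapshot while the installer or the tests are running.
uptime                      # load averages
free -m                     # memory and swap usage
vmstat 5 6                  # CPU, run queue, swap in/out over ~30 seconds
iostat -x 5 3               # per-device disk utilisation (requires sysstat)
```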
(In reply to Ewoud Kohl van Wijngaarden from comment #29)
> The difference is that previously the proxy timeout was mostly irrelevant.
> mod_passenger, as used in 6.7, runs inside the Apache process while Puma, as
> used in 6.8 runs as a separate process and Apache proxies to it.
>
> It is very concerning that a value higher than 120 is needed. That is *very*
> slow. That a higher timeout does work indicates a high load rather than
> actual failures. It would be good to know how heavily loaded the machine is.
> Can you collect some numbers on that?

I checked on a freshly installed Satellite but did not notice any specific load there.

> With 6.8 you can revert back to passenger using `--foreman-passenger true`.
> That's another thing you can try to verify. With that I would also use the
> default proxy timeout.

I tried to use that, but it failed with an error during `--foreman-passenger true`:

```
[ERROR 2020-08-18T09:23:32 main] /Stage[main]/Foreman_proxy::Register/Foreman_smartproxy[dhcp-2-17.vms.sat]: Could not evaluate: Proxy dhcp-2-17.vms.sat cannot be retrieved: unknown error (response 500)
```

An HTTP 500 indicates Apache or Passenger or Foreman itself failed to respond correctly. You should check /var/log/httpd and /var/log/foreman for errors. I have also seen this happen when the Foreman application itself failed to start up in time. Given we're seeing the same issues with the standalone service, this would be my bet.

If there is load (CPU, memory, disk or swapping), then that is the problem. If there isn't, then it's probably waiting on some timeout. Can be networking. Is there by any chance a resolver that's dropping requests in /etc/resolv.conf?

(In reply to Ewoud Kohl van Wijngaarden from comment #32)
> A HTTP 500 indicates Apache or Passenger or Foreman itself failed to respond
> correctly. You should check /var/log/httpd and /var/log/foreman for errors.
> I have also seen this happen when the Foreman application itself failed to
> start up in time. Given we're seeing the same issues with the standalone
> service, this would be my bet.

Saw a few errors in /var/log/httpd/error_log:

```
[ 2020-08-19 03:30:16.5683 11965/7fc7d493f700 Pool2/Implementation.cpp:849 ]: Could not spawn process for group /usr/share/foreman#default: An error occured while starting up the preloader.
  in 'void Passenger::ApplicationPool2::SmartSpawner::handleErrorResponse(Passenger::ApplicationPool2::SmartSpawner::StartupDetails&)' (SmartSpawner.h:455)
  in 'std::string Passenger::ApplicationPool2::SmartSpawner::negotiatePreloaderStartup(Passenger::ApplicationPool2::SmartSpawner::StartupDetails&)' (SmartSpawner.h:558)
  in 'void Passenger::ApplicationPool2::SmartSpawner::startPreloader()' (SmartSpawner.h:206)
  in 'virtual Passenger::ApplicationPool2::ProcessPtr Passenger::ApplicationPool2::SmartSpawner::spawn(const Passenger::ApplicationPool2::Options&)' (SmartSpawner.h:744)
  in 'void Passenger::ApplicationPool2::Group::spawnThreadRealMain(const SpawnerPtr&, const Passenger::ApplicationPool2::Options&, unsigned int)' (Implementation.cpp:782)
[ 2020-08-19 03:30:16.5685 11965/7fc7d2a1b700 agents/HelperAgent/RequestHandler.h:1971 ]: [Client 20] Cannot checkout session. An error occured while starting up the preloader.
```

Attached logs for reference.

> If there is load (CPU, memory, disk or swapping), then that is the problem.

With `--foreman-passenger true` a slight load increase was noticed (attached logs for reference).

> If there isn't, then it's probably waiting on some timeout. Can be
> networking. Is there by any chance a resolver that's dropping requests in
> /etc/resolv.conf?

No, there is no drop noticed due to /etc/resolv.conf.

Question here: is `--foreman-passenger true` a recommendation for customers too if they face the same issue?

We are going to look to address this by adding systemd notify support to our Puma service deployment. This will mean packaging the puma-systemd-plugin (https://github.com/sj26/puma-plugin-systemd) and adding some minor configuration for it. We need some upstream PRs merged into the project for it to work, and if no traction from the maintainer is achieved, we will look to maintain a fork in the meantime.
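The plan above is about making systemd aware of when Puma is actually ready. As a rough illustration of what that involves (not the packaging change that eventually shipped), the pieces are a readiness notification from Puma, for example via the puma-plugin-systemd gem mentioned above, and a Type=notify service unit; the drop-in path and the assumption that the plugin is installed and loaded are illustrative.

```
# Illustrative only: Type=notify makes 'systemctl restart foreman' block until
# the application signals READY=1. That signal has to come from Puma itself
# (e.g. the puma-plugin-systemd gem loaded from the Puma config), so this
# drop-in alone would not fix anything.
mkdir -p /etc/systemd/system/foreman.service.d
cat > /etc/systemd/system/foreman.service.d/notify.conf <<'EOF'
[Service]
Type=notify
EOF
systemctl daemon-reload
```

With that in place, the installer's service refresh would only return once the workers have booted, instead of after the 1-second Type=simple start seen in the journal above.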
I dug into the Apache logs and retrieved the actual error:

```
[ 2020-08-19 03:30:16.5685 11965/7fc7d2a1b700 agents/HelperAgent/RequestHandler.h:1971 ]: [Client 20] Cannot checkout session. An error occured while starting up the preloader.
Error page:
File exists @ dir_s_mkdir - /usr/share/foreman/log (Errno::EEXIST)
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:232:in `mkdir'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:232:in `fu_mkdir'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:210:in `block (2 levels) in mkdir_p'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:208:in `reverse_each'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:208:in `block in mkdir_p'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:193:in `each'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:193:in `mkdir_p'
/usr/share/foreman/lib/foreman/logging.rb:134:in `ensure_log_directory'
/usr/share/foreman/lib/foreman/logging.rb:15:in `configure'
/usr/share/foreman/config/application.rb:220:in `<class:Application>'
/usr/share/foreman/config/application.rb:93:in `<module:Foreman>'
/usr/share/foreman/config/application.rb:92:in `<top (required)>'
/opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
/opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
/usr/share/foreman/config/environment.rb:2:in `<top (required)>'
/opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
/opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
config.ru:5:in `block in <main>'
/opt/theforeman/tfm/root/usr/share/gems/gems/rack-2.2.2/lib/rack/builder.rb:125:in `instance_eval'
/opt/theforeman/tfm/root/usr/share/gems/gems/rack-2.2.2/lib/rack/builder.rb:125:in `initialize'
config.ru:1:in `new'
config.ru:1:in `<main>'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:105:in `eval'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:105:in `preload_app'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:150:in `<module:App>'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:29:in `<module:PhusionPassenger>'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:28:in `<main>'
```

This suggests /usr/share/foreman/log is a dangling symlink. That's weird, but it's what I'd look into. Maybe there are permission issues.
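Whether the log path really is a dangling symlink (or has wrong ownership) is quick to check; the commands below are generic diagnostics, not something taken from this bug.

```
# Does /usr/share/foreman/log point at something that exists,
# and with what ownership/permissions along the way?
ls -ld /usr/share/foreman/log
readlink -f /usr/share/foreman/log
namei -om /usr/share/foreman/log   # owner and mode of every path component
```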
Moving discussion here as it looks like a different BZ than this one: https://bugzilla.redhat.com/show_bug.cgi?id=1875426

Verified on 6.8 Snap14.

Verification points:
1) # rpm -q foreman-2.1.2.7-1.el7sat.noarch
   foreman-2.1.2.7-1.el7sat.noarch
2) Checked satellite-installer during the upgrade and didn't see this problem.
3) Installed foreman-discovery-image multiple times using foreman-maintain and didn't see this issue.

```
# foreman-maintain packages install -y foreman-discovery-image
Running preparation steps required to run the next scenarios
================================================================================
Check if tooling for package locking is installed: [OK]
--------------------------------------------------------------------------------

Running install packages in unlocked session
================================================================================
Confirm installer run is allowed:
WARNING: This script runs satellite-installer after the yum execution to ensure the Satellite is in a consistent state. As a result some of your services may be restarted.
Do you want to proceed? (assuming yes) [OK]
--------------------------------------------------------------------------------
Unlock packages: [OK]
--------------------------------------------------------------------------------
Install packages: [OK]
--------------------------------------------------------------------------------
Procedures::Installer::Upgrade: [OK]
--------------------------------------------------------------------------------
Check status of version locking of packages:
Automatic locking of package versions is enabled in installer. Packages are locked. [OK]
--------------------------------------------------------------------------------
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Satellite 6.8 release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4366