Bug 1850934
Summary: | Satellite-installer failed with error "Could not evaluate: Proxy xyz..com cannot be retrieved: unknown error (response 502)" | |
---|---|---|---
Product: | Red Hat Satellite | Reporter: | Devendra Singh <desingh>
Component: | Installation | Assignee: | Eric Helms <ehelms>
Status: | CLOSED ERRATA | QA Contact: | Devendra Singh <desingh>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 6.8.0 | CC: | ahumbe, ehelms, ekohlvan, mawerner, mmccune, swadeley, vijsingh, vsedmik, zhunting
Target Milestone: | 6.8.0 | Keywords: | AutomationBlocker, Triaged, Upgrades
Target Release: | Unused | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | foreman-2.1.2.2-1 | Doc Type: | Known Issue
Doc Text: | Satellite-installer failed with error "Could not evaluate: Proxy xyz..com cannot be retrieved: unknown error (response 502)" | |
Story Points: | --- | |
Clone Of: | | Environment: |
Last Closed: | 2020-10-27 13:03:32 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Devendra Singh
2020-06-25 08:30:49 UTC
This appears to be some sort of timing issue:

1) I can reproduce the issue on qe-sat6-upgrade-rhel7, but not with 100% reliability:

```
[root@qe-sat6-upgrade-rhel7 ~]# foreman-maintain packages install -y foreman-discovery-image
...
--------------------------------------------------------------------------------
Procedures::Installer::Upgrade: [OK]
--------------------------------------------------------------------------------
Check status of version locking of packages:
Automatic locking of package versions is enabled in installer. Packages are locked. [OK]
--------------------------------------------------------------------------------
```

**** Worked fine, try again:

```
[root@qe-sat6-upgrade-rhel7 ~]# foreman-maintain packages install -y foreman-discovery-image
...
--------------------------------------------------------------------------------
Install packages: [OK]
--------------------------------------------------------------------------------
Procedures::Installer::Upgrade: [FAIL]
/Stage[main]/Foreman_proxy::Register/Foreman_smartproxy[qe-sat6-upgrade-rhel7.satqe.lab.eng.rdu2.redhat.com]: Could not evaluate: Proxy qe-sat6-upgrade-rhel7.satqe.lab.eng.rdu2.redhat.com cannot be retrieved: unknown error (response 502)
```

2) I tried this repeatedly on other VMs and bare metal and it worked 100% of the time, with no errors.

Is there a host, other than qe-sat6-upgrade-rhel7, where we can reproduce this error?

Updated the title of the BZ to reflect that the error now only occurs when using foreman-maintain to install packages. Please clarify if the bug is broader than that.

It is broader than that, and the previous summary was fine. We've also seen it in upstream CI, where we don't use foreman-maintain at all. The problem is that Apache returns an HTTP 502 when foreman-installer attempts to register the Foreman Proxy. It does this when it makes an HTTP call to the REST API in order to get the installed proxy.

Ewoud, do we have a plan of attack for this issue? How frequent is it? The workaround is just "run the installer again", right?

We don't see it anymore in upstream CI, so it disappeared from our radar. It has all the signs of a timing issue, which is probably a missing dependency. Today I managed to reproduce it on my own system with a very underpowered VM.
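Since the failure happens on the installer's REST lookup of the smart proxy, the same request can be issued by hand to see whether the 502 is reproducible outside the installer. The sketch below is illustrative only: the URL pattern is taken from the installer debug log further down, the admin credentials are placeholders, and the default Puma listener on 127.0.0.1:3000 is assumed.

```
# Query the smart proxy through Apache, the same call foreman_smartproxy makes
# (replace admin:changeme with real credentials).
curl -sk -u admin:changeme \
  "https://$(hostname -f)/api/v2/smart_proxies?search=name=%22$(hostname -f)%22"

# Check whether the backend application behind Apache is answering at all;
# during the failure window this is the part that is still booting.
curl -s -o /dev/null -w '%{http_code}\n' http://127.0.0.1:3000/
```

If the direct request to port 3000 hangs or fails while the service claims to be running, the 502 from Apache is just the proxy timeout firing, which matches the analysis below.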
Some observations.

From the installer log:

```
[ WARN 2020-07-28T09:15:18 main] /Service[foreman]: Triggered 'refresh' from 1 event
[DEBUG 2020-07-28T09:15:18 main] /Service[foreman]: The container Class[Foreman::Service] will propagate my refresh event
[ INFO 2020-07-28T09:15:18 main] /Service[foreman]: Evaluated in 1.22 seconds
--- SNIP ---
[DEBUG 2020-07-28T09:15:51 main] Foreman_smartproxy[centos7-katello-nightly.wisse.example.com](provider=rest_v3): Making get request to https://centos7-katello-nightly.wisse.example.com/api/v2/smart_proxies?search=name=%22centos7-katello-nightly.wisse.example.com%22
[DEBUG 2020-07-28T09:16:52 main] Foreman_smartproxy[centos7-katello-nightly.wisse.example.com](provider=rest_v3): Received response 502 from request to https://centos7-katello-nightly.wisse.example.com/api/v2/smart_proxies?search=name=%22centos7-katello-nightly.wisse.example.com%22
[ERROR 2020-07-28T09:16:52 main] /Stage[main]/Foreman_proxy::Register/Foreman_smartproxy[centos7-katello-nightly.wisse.example.com]: Could not evaluate: Proxy centos7-katello-nightly.wisse.example.com cannot be retrieved: unknown error (response 502)
```

This corresponds to the Apache log:

```
[Tue Jul 28 09:16:51.984032 2020] [proxy_http:error] [pid 15486] (70007)The timeout specified has expired: [client 192.168.122.182:59632] AH01102: error reading status line from remote server 127.0.0.1:3000
```

The interesting part is the service restart. It happens in 1 second. However, looking at the systemd journal, it takes much longer:

```
Jul 28 09:15:18 centos7-katello-nightly.wisse.example.com systemd[1]: Stopped Foreman.
Jul 28 09:15:18 centos7-katello-nightly.wisse.example.com systemd[1]: Started Foreman.
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: /usr/share/foreman/lib/foreman.rb:8: warning: already initialized constant Foreman::UUID_REGEXP
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: /usr/share/foreman/lib/foreman.rb:8: warning: previous definition of UUID_REGEXP was here
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: => Booting Puma
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: => Rails 6.0.3.1 application starting in production
Jul 28 09:15:24 centos7-katello-nightly.wisse.example.com foreman[7070]: => Run `rails server --help` for more startup options
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] Puma starting in cluster mode...
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Version 4.3.3 (ruby 2.5.5-p157), codename: Mysterious Traveller
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Min threads: 0, max threads: 16
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Environment: production
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Process workers: 2
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Preloading application
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] * Activated tcp://127.0.0.1:3000
Jul 28 09:16:48 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] Use Ctrl-C to stop
Jul 28 09:16:49 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] - Worker 1 (pid: 7273) booted, phase: 0
Jul 28 09:16:49 centos7-katello-nightly.wisse.example.com foreman[7070]: [7070] - Worker 0 (pid: 7268) booted, phase: 0
```

It actually came online just before the timeout of 1 minute was reached. Most likely the individual workers were still booting and not ready to respond.

The restart time of 1 second is surprising, because you would expect it to take longer. This is because the service uses the simple type in systemd: it starts the process and returns.

```
# systemctl cat foreman.service | grep Type
Type=simple
```

I looked into using Type=notify (which would make systemctl (re)start wait for the service to really be online, so the installer could then guarantee it's online) in https://projects.theforeman.org/issues/29356, but the library needs some work to be compatible with current versions of Puma.

This leads me to the conclusion that, at least as far as I can reproduce it, this is again related to slow hardware. We've seen this more often in bug reports, also when using Passenger.

It is possible to increase the timeout by setting it via Hiera. Either a global default for all proxy config:

```
apache::mod::proxy::proxy_timeout: 120
```

Or specifically for the Foreman vhost:

```
foreman::config::apache::proxy_params:
  retry: 0
```

The `retry: 0` is what we set by default. Options are documented at https://httpd.apache.org/docs/current/mod/mod_proxy.html#proxypass. I suspect it's either connectiontimeout or timeout (or both) that need to be raised.

I was asked to be more concrete in the suggestion. Based on my previous comment, I'd suggest running the following command (as root) prior to installing/upgrading as a workaround:

```
echo "apache::mod::proxy::proxy_timeout: 120" >> /etc/foreman-installer/custom-hiera.yaml
```

I have no basis for 120 other than it's twice as high as the observed default of 60. That this makes a difference is based only on reasoning and reading documentation; I didn't actually test it myself. After running it, you should observe that it sets ProxyTimeout in /etc/httpd/conf.modules.d/proxy.conf.
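Put together, the workaround might look like the following sequence; the 120-second value and the echo command are taken from the suggestion above, while the installer re-run and the grep are just a sanity check of where the Puppet apache module writes the directive.

```
# Raise the mod_proxy timeout via the installer's custom Hiera data
# (run as root before installing/upgrading).
echo "apache::mod::proxy::proxy_timeout: 120" >> /etc/foreman-installer/custom-hiera.yaml

# Re-run the installer so the setting is applied, then confirm the directive landed.
satellite-installer
grep ProxyTimeout /etc/httpd/conf.modules.d/proxy.conf
```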
(In reply to Ewoud Kohl van Wijngaarden from comment #26)
> I was asked to be more concrete in the suggestion. Based on my previous, I'd
> suggest running the following command (as root) prior to
> installing/upgrading as a workaround:
>
> echo "apache::mod::proxy::proxy_timeout: 120" >>
> /etc/foreman-installer/custom-hiera.yaml
>
> I have no basis for 120 other than it's twice as high as the observed
> default of 60. That this makes a difference is only based on reasoning and
> reading documentation, didn't actually test this myself.
>
> After running it, you should observe it sets ProxyTimeout in
> /etc/httpd/conf.modules.d/proxy.conf.

This workaround works. I tried it after applying the workaround and didn't see any problem.

Below are observations from the Azure provisioning tests using the provided workaround.

6.8.0:
1) After setting the Proxy_Timeout value to 120 (/etc/httpd/conf.modules.d/proxy.conf), the tests (UIs/APIs/CLIs) still fail with errors such as "502 Server Error: Proxy Error", "TimeOutError" or "Error: 502 Bad Gateway".
2) After setting the Proxy_Timeout value to 450+, all tests run fine.

However, in 6.7:
1) The Proxy_Timeout value is 60 (the default, in /etc/httpd/conf.modules.d/proxy.conf) and the UIs/CLIs/APIs tests complete successfully.

So overall it looks like the Proxy_Timeout value needs to be set 3 to 4 times higher than in 6.7.z.

The difference is that previously the proxy timeout was mostly irrelevant. mod_passenger, as used in 6.7, runs inside the Apache process, while Puma, as used in 6.8, runs as a separate process and Apache proxies to it.

It is very concerning that a value higher than 120 is needed. That is *very* slow. That a higher timeout does work indicates a high load rather than actual failures. It would be good to know how heavily loaded the machine is. Can you collect some numbers on that?

With 6.8 you can revert back to Passenger using `--foreman-passenger true`. That's another thing you can try to verify. With that I would also use the default proxy timeout.
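To answer the "how heavily loaded is the machine" question, a few standard commands are usually enough. This is a generic sketch rather than anything specific to Satellite; the sampling intervals are arbitrary.

```
# Rough load snapshot while the installer or the tests are running.
uptime                      # load averages
free -m                     # memory and swap usage
vmstat 5 6                  # CPU, run queue, swap in/out over ~30 seconds
iostat -x 5 3               # per-device disk utilisation (requires sysstat)
```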
(In reply to Ewoud Kohl van Wijngaarden from comment #29)
> The difference is that previously the proxy timeout was mostly irrelevant.
> mod_passenger, as used in 6.7, runs inside the Apache process while Puma, as
> used in 6.8 runs as a separate process and Apache proxies to it.
>
> It is very concerning that a value higher than 120 is needed. That is *very*
> slow. That a higher timeout does work indicates a high load rather than
> actual failures. It would be good to know how heavily loaded the machine is.
> Can you collect some numbers on that?

I checked on a freshly installed Satellite but did not notice any specific load there.

> With 6.8 you can revert back to passenger using `--foreman-passenger true`.
> That's another thing you can try to verify. With that I would also use the
> default proxy timeout.

I tried to use that, but it failed with an error during `--foreman-passenger true`:

```
[ERROR 2020-08-18T09:23:32 main] /Stage[main]/Foreman_proxy::Register/Foreman_smartproxy[dhcp-2-17.vms.sat]: Could not evaluate: Proxy dhcp-2-17.vms.sat cannot be retrieved: unknown error (response 500)
```

An HTTP 500 indicates Apache or Passenger or Foreman itself failed to respond correctly. You should check /var/log/httpd and /var/log/foreman for errors. I have also seen this happen when the Foreman application itself failed to start up in time. Given we're seeing the same issues with the standalone service, this would be my bet.

If there is load (CPU, memory, disk or swapping), then that is the problem. If there isn't, then it's probably waiting on some timeout. Can be networking. Is there by any chance a resolver that's dropping requests in /etc/resolv.conf?

(In reply to Ewoud Kohl van Wijngaarden from comment #32)
> A HTTP 500 indicates Apache or Passenger or Foreman itself failed to respond
> correctly. You should check /var/log/httpd and /var/log/foreman for errors.
> I have also seen this happen when the Foreman application itself failed to
> start up in time. Given we're seeing the same issues with the standalone
> service, this would be my bet.

Saw a few errors in /var/log/httpd/error_log:

```
[ 2020-08-19 03:30:16.5683 11965/7fc7d493f700 Pool2/Implementation.cpp:849 ]: Could not spawn process for group /usr/share/foreman#default: An error occured while starting up the preloader.
  in 'void Passenger::ApplicationPool2::SmartSpawner::handleErrorResponse(Passenger::ApplicationPool2::SmartSpawner::StartupDetails&)' (SmartSpawner.h:455)
  in 'std::string Passenger::ApplicationPool2::SmartSpawner::negotiatePreloaderStartup(Passenger::ApplicationPool2::SmartSpawner::StartupDetails&)' (SmartSpawner.h:558)
  in 'void Passenger::ApplicationPool2::SmartSpawner::startPreloader()' (SmartSpawner.h:206)
  in 'virtual Passenger::ApplicationPool2::ProcessPtr Passenger::ApplicationPool2::SmartSpawner::spawn(const Passenger::ApplicationPool2::Options&)' (SmartSpawner.h:744)
  in 'void Passenger::ApplicationPool2::Group::spawnThreadRealMain(const SpawnerPtr&, const Passenger::ApplicationPool2::Options&, unsigned int)' (Implementation.cpp:782)
[ 2020-08-19 03:30:16.5685 11965/7fc7d2a1b700 agents/HelperAgent/RequestHandler.h:1971 ]: [Client 20] Cannot checkout session. An error occured while starting up the preloader.
```

Attached logs for reference.

> If there is load (CPU, memory, disk or swapping), then that is the problem.

With `--foreman-passenger true` a slight load increase was noticed (attached logs for reference).

> If there isn't, then it's probably waiting on some timeout. Can be
> networking. Is there by any chance a resolver that's dropping requests in
> /etc/resolv.conf?

No, there is no drop noticed due to /etc/resolv.conf.

Question here: is `--foreman-passenger true` a recommendation for customers too if they face the same issue?

We are going to look to address this by adding systemd notify support to our Puma service deployment. This will mean packaging the puma-systemd-plugin (https://github.com/sj26/puma-plugin-systemd) and adding some minor configuration for it. We need some upstream PRs merged into the project for it to work, and if no traction from the maintainer is achieved, we will look to maintain a fork in the meantime.
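The plan above is about making systemd aware of when Puma is actually ready. As a rough illustration of what that involves (not the packaging change that eventually shipped), the pieces are a readiness notification from Puma, for example via the puma-plugin-systemd gem mentioned above, and a Type=notify service unit; the drop-in path and the assumption that the plugin is installed and loaded are illustrative.

```
# Illustrative only: Type=notify makes 'systemctl restart foreman' block until
# the application signals READY=1. That signal has to come from Puma itself
# (e.g. the puma-plugin-systemd gem loaded from the Puma config), so this
# drop-in alone would not fix anything.
mkdir -p /etc/systemd/system/foreman.service.d
cat > /etc/systemd/system/foreman.service.d/notify.conf <<'EOF'
[Service]
Type=notify
EOF
systemctl daemon-reload
```

With that in place, the installer's service refresh would only return once the workers have booted, instead of after the 1-second Type=simple start seen in the journal above.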
I dug into the Apache logs and retrieved the actual error:

```
[ 2020-08-19 03:30:16.5685 11965/7fc7d2a1b700 agents/HelperAgent/RequestHandler.h:1971 ]: [Client 20] Cannot checkout session. An error occured while starting up the preloader.
Error page:
File exists @ dir_s_mkdir - /usr/share/foreman/log (Errno::EEXIST)
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:232:in `mkdir'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:232:in `fu_mkdir'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:210:in `block (2 levels) in mkdir_p'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:208:in `reverse_each'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:208:in `block in mkdir_p'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:193:in `each'
/opt/rh/rh-ruby25/root/usr/share/ruby/fileutils.rb:193:in `mkdir_p'
/usr/share/foreman/lib/foreman/logging.rb:134:in `ensure_log_directory'
/usr/share/foreman/lib/foreman/logging.rb:15:in `configure'
/usr/share/foreman/config/application.rb:220:in `<class:Application>'
/usr/share/foreman/config/application.rb:93:in `<module:Foreman>'
/usr/share/foreman/config/application.rb:92:in `<top (required)>'
/opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
/opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
/usr/share/foreman/config/environment.rb:2:in `<top (required)>'
/opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
/opt/rh/rh-ruby25/root/usr/share/rubygems/rubygems/core_ext/kernel_require.rb:59:in `require'
config.ru:5:in `block in <main>'
/opt/theforeman/tfm/root/usr/share/gems/gems/rack-2.2.2/lib/rack/builder.rb:125:in `instance_eval'
/opt/theforeman/tfm/root/usr/share/gems/gems/rack-2.2.2/lib/rack/builder.rb:125:in `initialize'
config.ru:1:in `new'
config.ru:1:in `<main>'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:105:in `eval'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:105:in `preload_app'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:150:in `<module:App>'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:29:in `<module:PhusionPassenger>'
/usr/share/gems/gems/passenger-4.0.18/helper-scripts/rack-preloader.rb:28:in `<main>'
```

This suggests /usr/share/foreman/log is a dangling symlink. That's weird, but it's what I'd look into. Maybe there are permission issues.
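Whether the log path really is a dangling symlink (or has wrong ownership) is quick to check; the commands below are generic diagnostics, not something taken from this bug.

```
# Does /usr/share/foreman/log point at something that exists,
# and with what ownership/permissions along the way?
ls -ld /usr/share/foreman/log
readlink -f /usr/share/foreman/log
namei -om /usr/share/foreman/log   # owner and mode of every path component
```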
Moving discussion here as it looks like a different BZ than this one: https://bugzilla.redhat.com/show_bug.cgi?id=1875426

Verified on 6.8 Snap14.

Verification points:
1) # rpm -q foreman-2.1.2.7-1.el7sat.noarch
   foreman-2.1.2.7-1.el7sat.noarch
2) Checked satellite-installer during the upgrade and didn't see this problem.
3) Installed foreman-discovery-image multiple times using foreman-maintain and didn't see this issue.

```
# foreman-maintain packages install -y foreman-discovery-image
Running preparation steps required to run the next scenarios
================================================================================
Check if tooling for package locking is installed: [OK]
--------------------------------------------------------------------------------

Running install packages in unlocked session
================================================================================
Confirm installer run is allowed:
WARNING: This script runs satellite-installer after the yum execution to ensure the Satellite is in a consistent state. As a result some of your services may be restarted.
Do you want to proceed? (assuming yes) [OK]
--------------------------------------------------------------------------------
Unlock packages: [OK]
--------------------------------------------------------------------------------
Install packages: [OK]
--------------------------------------------------------------------------------
Procedures::Installer::Upgrade: [OK]
--------------------------------------------------------------------------------
Check status of version locking of packages:
Automatic locking of package versions is enabled in installer. Packages are locked. [OK]
--------------------------------------------------------------------------------
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: Satellite 6.8 release), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:4366