Bug 1866117
| Field | Value |
| --- | --- |
| Summary | Too many CA certs in additionalTrustBundles of install-config.yaml causes installation to fail |
| Product | OpenShift Container Platform |
| Component | Machine Config Operator |
| Version | 4.5 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Target Release | 4.6.0 |
| Reporter | David Johnston <djohnsto> |
| Assignee | Antonio Murdaca <amurdaca> |
| QA Contact | Micah Abbott <miabbott> |
| CC | adahiya, bbreard, bgilbert, donny, imcleod, jbasquil, jcallen, jcall, jima, jligon, kgarriso, miabbott, nstielau, rioliu, scuppett, skunkerk |
| Flags | miabbott: needinfo- (x3) |
| Clones | 1874815 (view as bug list) |
| Bug Blocks | 1874815 |
| Type | Bug |
| Last Closed | 2020-10-27 16:23:15 UTC |
Description
David Johnston
2020-08-04 22:22:21 UTC
The Ignition config is served by the machine-config-server (MCS) and fetched by the `ignition` binary on RHCOS. We might have to allow users to configure this timeout, or improve transport time with compression on the server side. Secondly, adding 181 certs to the additional trust bundle is, I think, overuse: we should recommend that users include only the certs required for things like trusting the custom registry, proxy trust, etc., and not an entire database of certificates. If the RHCOS team thinks nothing can be done on their side, move this back to the Installer and we can close it or limit the size of the additional trust bundle.

---

181 CA certs during install seems... excessive. I would echo Abhinav's comment that users should limit the CA certs used during the install to those that are absolutely required during the install phase. Users can add additional CA certs into specific containers[1] or land CA certs onto the nodes via MachineConfigs[2] as day-2 operations.

I'm going to ask Benjamin or Sohan for input on whether we can (or should) do anything to assist with handling a large number of CA certs. If nothing comes up, let's move this back to the Installer.

[1] https://docs.openshift.com/container-platform/4.5/networking/configuring-a-custom-pki.html#certificate-injection-using-operators_configuring-a-custom-pki
[2] https://docs.openshift.com/container-platform/4.5/security/certificate-types-descriptions.html#proxy-certificates_ocp-certificates

---

Ignition provides the ignition.timeouts.http_response_headers config field, defaulting to 10 seconds. The config could specify a larger value if desired. The described times seem excessive, though; it might be worth checking whether the MCS is doing any unnecessary work during CA rendering.

---

Passing over to the MCO team to see if there is anything to be done on the MCS side re: comment #4. If nothing can be done, please send back to the Installer team so they can investigate limiting the size of the additional trust bundle (comment #2).

---

May we please have an RCA as to why this worked in 4.4 but now doesn't work in 4.5? What exactly changed?

---

Looking around, I see several template changes for vSphere in 4.5 and am wondering if any of them are relevant to the issue at hand:

https://github.com/openshift/machine-config-operator/pull/1657/commits/fa4b14a966506b07fcc6fec2d6233bbc7a04f507
https://github.com/openshift/machine-config-operator/commit/5c11d552e3a5f87c9a57d764bd9a663b48bccbae#diff-fed36e93a0509e20f2dc96cbbd85b678

We are continuing to look at this, but will pass to the vSphere team to look further at the vSphere-specific changes in 4.5.

---

Has the customer verified the workaround of setting ignition.timeouts.http_response_headers to a value higher than 10s? That needs to be set in the stub Ignition file that's used to boot your hosts, BTW.
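For reference, a minimal sketch of what that override could look like in a stub config. The spec version and the URL are assumptions here (4.5-era stubs are Ignition spec 2.x, and the installer-generated master.ign already carries a `config.append` entry pointing at the MCS):

```json
{
  "ignition": {
    "version": "2.2.0",
    "config": {
      "append": [
        { "source": "https://api-int.<cluster-domain>:22623/config/master" }
      ]
    },
    "timeouts": {
      "httpResponseHeaders": 60
    }
  }
}
```

Note that the JSON key is `httpResponseHeaders`, the camelCase form of the ignition.timeouts.http_response_headers field.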
---

> https://bugzilla.redhat.com/show_bug.cgi?id=1866117#c10

Can you verify and respond on two points:

1. Does using a smaller additional trust bundle, i.e. one including only the specific CAs that are actually required, work for you? (A sketch of a trimmed bundle follows this list.)
2. With the larger trust bundle still specified, if you update the stub Ignition files generated by the installer for the UPI workflow, i.e. master.ign and worker.ign, to set the ignition.timeouts.http_response_headers value to, say, 1 minute, does that work for you?
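For illustration, a trimmed bundle in install-config.yaml would carry only the PEM blocks needed during install; the certificate contents below are placeholders, not real values:

```yaml
# install-config.yaml (excerpt): include only the CAs needed at install
# time, e.g. the mirror registry CA and the proxy CA, not a full CA store.
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <PEM of the mirror registry CA>
  -----END CERTIFICATE-----
  -----BEGIN CERTIFICATE-----
  <PEM of the egress proxy CA>
  -----END CERTIFICATE-----
```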
---

(In reply to David Johnston from comment #6)
> May we please have an RCA as to why this worked in 4.4 but now doesn't work in 4.5? What exactly changed?

As for the RCA, it seems the MCO team moved this back to the Installer team without identifying any cause for the increased times, so that is probably not going to happen.

---

Reading through the case, they can pretty clearly show that this is a regression in MCS configuration rendering using the same install-config.yaml input; as such, I'm moving this back to MCO.

4.5.5:

```
[core@188435v6 ~]$ time curl -k https://localhost:22623/config/master

real	0m14.225s
user	0m0.017s
sys	0m0.008s
```

4.4.15 result:

```
[core@188435v6 ~]$ time curl -k -v https://localhost:22623/config/master

real	0m0.055s
user	0m0.017s
sys	0m0.013s
```

---

Working on this. This seems to be a pretty bad regression in the code we use to serve the Ignition config via the MCS; I'll have a patch to test soon.

---

(In reply to Scott Dodson from comment #14)
> Reading through the case, they can pretty clearly show that this is a regression in MCS configuration rendering using the same install-config.yaml input; as such, I'm moving this back to MCO.
>
> 4.5.5:
> [core@188435v6 ~]$ time curl -k https://localhost:22623/config/master
> real 0m14.225s
> user 0m0.017s
> sys 0m0.008s
>
> 4.4.15 result:
> [core@188435v6 ~]$ time curl -k -v https://localhost:22623/config/master
> real 0m0.055s
> user 0m0.017s
> sys 0m0.013s

I have linked a PR. Results:

4.5 without the fix:

```
real	0m2.379s
user	0m0.013s
sys	0m0.008s
```

4.5 with the fix:

```
real	0m0.489s
user	0m0.018s
sys	0m0.007s
```

The above brings us closer to 4.4, accounting for the changes we made in 4.5 to start supporting Ignition spec v2 and v3. I still think 181 CAs are a lot, but the linked PR should put us in a much better performance position.

---

Spoke with Antonio about verification steps, and we decided on comparing the time of raw `curl` access to the MCS in a 4.5 cluster (without the fix) and a 4.6 cluster (with the fix). I created an MC that has 100 entries in `spec.config.storage.files` and applied it to the cluster; a sketch of how such an MC could be generated follows.
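The exact MachineConfig used for verification isn't attached to this bug; the generator below is a sketch of how a comparable one could be produced. The file paths and contents are illustrative; only the `99-so-many-files` name, the output file name, and the 100-entry count come from the sessions that follow:

```bash
#!/bin/bash
# Emit a MachineConfig whose spec.config.storage.files has 100 entries,
# each a tiny file delivered via a data: URL. Ignition spec 2.2.0, as
# embedded in 4.5-era MachineConfigs, is assumed.
out=bz1866117-out.yaml
cat > "$out" <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-so-many-files
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
EOF
for i in $(seq 1 100); do
  cat >> "$out" <<EOF
      - filesystem: root
        path: /etc/so-many-files/file-${i}.txt
        mode: 420
        contents:
          source: "data:,file-${i}"
EOF
done
```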
Using a 4.5.8 cluster:

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.8     True        False         32m     Cluster version is 4.5.8

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-131-226.us-west-2.compute.internal   Ready    worker   74m   v1.18.3+6c42de8
ip-10-0-151-86.us-west-2.compute.internal    Ready    master   85m   v1.18.3+6c42de8
ip-10-0-180-42.us-west-2.compute.internal    Ready    worker   74m   v1.18.3+6c42de8
ip-10-0-180-9.us-west-2.compute.internal     Ready    master   84m   v1.18.3+6c42de8
ip-10-0-206-158.us-west-2.compute.internal   Ready    master   83m   v1.18.3+6c42de8
ip-10-0-219-209.us-west-2.compute.internal   Ready    worker   74m   v1.18.3+6c42de8

$ oc apply -f ../machineConfigs/bz1866117-out.yaml
machineconfig.machineconfiguration.openshift.io/99-so-many-files created

$ oc debug node/ip-10-0-151-86.us-west-2.compute.internal
Starting pod/ip-10-0-151-86us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.151.86
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.219s
user	0m0.013s
sys	0m0.007s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.216s
user	0m0.014s
sys	0m0.005s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.212s
user	0m0.013s
sys	0m0.003s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.211s
user	0m0.009s
sys	0m0.007s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.214s
user	0m0.016s
sys	0m0.003s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.215s
user	0m0.017s
sys	0m0.007s
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
```

Compared to the performance on a 4.6.0-fc.4 cluster:

```
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.4   True        False         15m     Cluster version is 4.6.0-fc.4

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-128-143.us-west-2.compute.internal   Ready    master   39m   v1.19.0-rc.2+514f31a
ip-10-0-134-45.us-west-2.compute.internal    Ready    worker   28m   v1.19.0-rc.2+514f31a
ip-10-0-189-97.us-west-2.compute.internal    Ready    worker   30m   v1.19.0-rc.2+514f31a
ip-10-0-191-113.us-west-2.compute.internal   Ready    master   39m   v1.19.0-rc.2+514f31a
ip-10-0-194-243.us-west-2.compute.internal   Ready    master   39m   v1.19.0-rc.2+514f31a
ip-10-0-211-28.us-west-2.compute.internal    Ready    worker   28m   v1.19.0-rc.2+514f31a

$ oc apply -f ../machineConfigs/bz1866117-out.yaml
machineconfig.machineconfiguration.openshift.io/99-so-many-files created

$ oc debug node/ip-10-0-128-143.us-west-2.compute.internal
Starting pod/ip-10-0-128-143us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.143
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.118s
user	0m0.011s
sys	0m0.005s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.163s
user	0m0.016s
sys	0m0.004s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.106s
user	0m0.010s
sys	0m0.006s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.104s
user	0m0.013s
sys	0m0.003s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.127s
user	0m0.017s
sys	0m0.002s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.152s
user	0m0.012s
sys	0m0.007s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.118s
user	0m0.015s
sys	0m0.004s
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
```

Marking VERIFIED with 4.6.0-fc.4.

---

For those that are running into this bug (like me) in a new cluster deployment process that does not use any modified certs, I wanted to capture the workaround that worked for me:

```
openshift-install create install-config --dir=workaround
openshift-install create ignition-configs --dir=workaround
vi workaround/master.ign
```

Edit the timeouts value; something like this should suffice:

```
"timeouts": {"httpResponseHeaders": 60},
```

Now also set this value in the worker.ign and bootstrap.ign files. You can then execute a deployment that will load the Ignition files and not hit this timeout bug:

```
openshift-install --dir=workaround create cluster --log-level=debug
```
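Rather than hand-editing each file, the same change can be scripted; a small sketch, assuming `jq` is installed and the configs live in the `workaround` directory used above:

```bash
# Set a 60-second response-header timeout in all three stub Ignition configs.
for f in workaround/{bootstrap,master,worker}.ign; do
  jq '.ignition.timeouts.httpResponseHeaders = 60' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```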
---

I would also like to add that I am using IPI on RHOSP, ran into this bug, and the workaround above gets me by until this is resolved.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196