Bug 1866117
| Field | Value |
| --- | --- |
| Summary | Too many CA certs in additionalTrustBundles of install-config.yaml causes installation to fail |
| Product | OpenShift Container Platform |
| Component | Machine Config Operator |
| Version | 4.5 |
| Hardware | x86_64 |
| OS | Linux |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | high |
| Target Release | 4.6.0 |
| Reporter | David Johnston <djohnsto> |
| Assignee | Antonio Murdaca <amurdaca> |
| QA Contact | Micah Abbott <miabbott> |
| CC | adahiya, bbreard, bgilbert, donny, imcleod, jbasquil, jcallen, jcall, jima, jligon, kgarriso, miabbott, nstielau, rioliu, scuppett, skunkerk |
| Flags | miabbott: needinfo- (x3) |
| Clones | 1874815 (view as bug list) |
| Bug Blocks | 1874815 |
| Type | Bug |
| Last Closed | 2020-10-27 16:23:15 UTC |
Description
David Johnston
2020-08-04 22:22:21 UTC
The Ignition config is served by the machine-config-server (MCS) and fetched by the `ignition` binary on RHCOS. We might have to allow users to configure this timeout, or improve transport time with compression on the server side. Secondly, adding 181 certs to the additional trust bundle is, I think, overuse: we should recommend that users include only the certs required for things like trusting the custom registry, proxy trust, etc., and not an entire database of certificates. If the RHCOS team thinks nothing can be done on their side, move this back to the Installer and we can close it or limit the size of the additional trust bundle.

---

181 CA certs during install seems... excessive. I would echo Abhinav's comment that users should limit the CA certs used during the install to those that are absolutely required during the install phase. Users can add additional CA certs into specific containers[1] or land CA certs onto the nodes via MachineConfigs[2] as day-2 operations.

I'm going to ask Benjamin or Sohan for input on whether we can (or should) do anything to assist with handling a large number of CA certs. If nothing comes up, let's move this back to the Installer.

[1] https://docs.openshift.com/container-platform/4.5/networking/configuring-a-custom-pki.html#certificate-injection-using-operators_configuring-a-custom-pki
[2] https://docs.openshift.com/container-platform/4.5/security/certificate-types-descriptions.html#proxy-certificates_ocp-certificates

---

Ignition provides the ignition.timeouts.http_response_headers config field, defaulting to 10 seconds. The config could specify a larger value if desired. The described times seem excessive, though; it might be worth checking whether the MCS is doing any unnecessary work during CA rendering.

---

Passing over to the MCO team to see if there is anything to be done on the MCS side re: comment #4. If nothing can be done, please send back to the Installer team so they can investigate limiting the size of the additional trust bundle (comment #2).

---

May we please have an RCA as to why this worked in 4.4 but now doesn't work in 4.5? What exactly changed?

---

Looking around, I see several template changes for vSphere in 4.5 and am wondering if any of them are relevant to the issue at hand:

https://github.com/openshift/machine-config-operator/pull/1657/commits/fa4b14a966506b07fcc6fec2d6233bbc7a04f507
https://github.com/openshift/machine-config-operator/commit/5c11d552e3a5f87c9a57d764bd9a663b48bccbae#diff-fed36e93a0509e20f2dc96cbbd85b678

We are continuing to look at this, but will pass to the vSphere team to look further at the vSphere-specific changes in 4.5.

---

Has the customer verified the workaround of setting ignition.timeouts.http_response_headers to a value higher than 10s? That needs to be set in the stub Ignition file that's used to boot your hosts, BTW.
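For reference, a minimal sketch of what that override could look like in a stub config. The spec version and the URL are assumptions here (4.5-era stubs are Ignition spec 2.x, and the installer-generated master.ign already carries a `config.append` entry pointing at the MCS):

```json
{
  "ignition": {
    "version": "2.2.0",
    "config": {
      "append": [
        { "source": "https://api-int.<cluster-domain>:22623/config/master" }
      ]
    },
    "timeouts": {
      "httpResponseHeaders": 60
    }
  }
}
```

Note that the JSON key is `httpResponseHeaders`, the camelCase form of the ignition.timeouts.http_response_headers field.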
---

> https://bugzilla.redhat.com/show_bug.cgi?id=1866117#c10

Can you verify and respond on two points:

1. Does using a smaller additional trust bundle, i.e. one including only the specific CAs that are actually required, work for you? (A sketch of a trimmed bundle follows this list.)
2. With the larger trust bundle still specified, if you update the stub Ignition files generated by the installer for the UPI workflow, i.e. master.ign and worker.ign, to set the ignition.timeouts.http_response_headers value to, say, 1 minute, does that work for you?
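For illustration, a trimmed bundle in install-config.yaml would carry only the PEM blocks needed during install; the certificate contents below are placeholders, not real values:

```yaml
# install-config.yaml (excerpt): include only the CAs needed at install
# time, e.g. the mirror registry CA and the proxy CA, not a full CA store.
additionalTrustBundle: |
  -----BEGIN CERTIFICATE-----
  <PEM of the mirror registry CA>
  -----END CERTIFICATE-----
  -----BEGIN CERTIFICATE-----
  <PEM of the egress proxy CA>
  -----END CERTIFICATE-----
```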
---

(In reply to David Johnston from comment #6)
> May we please have an RCA as to why this worked in 4.4 but now doesn't work in 4.5? What exactly changed?

As for the RCA, it seems the MCO team moved this back to the Installer team without identifying any cause for the increased times, so that is probably not going to happen.

---

Reading through the case, they can pretty clearly show that this is a regression in MCS configuration rendering using the same install-config.yaml input; as such, I'm moving this back to MCO.

4.5.5:

```
[core@188435v6 ~]$ time curl -k https://localhost:22623/config/master

real	0m14.225s
user	0m0.017s
sys	0m0.008s
```

4.4.15 result:

```
[core@188435v6 ~]$ time curl -k -v https://localhost:22623/config/master

real	0m0.055s
user	0m0.017s
sys	0m0.013s
```

---

Working on this. This seems to be a pretty bad regression in the code we use to serve the Ignition config via the MCS; I'll have a patch to test soon.

---

(In reply to Scott Dodson from comment #14)
> Reading through the case, they can pretty clearly show that this is a regression in MCS configuration rendering using the same install-config.yaml input; as such, I'm moving this back to MCO.
>
> 4.5.5:
> [core@188435v6 ~]$ time curl -k https://localhost:22623/config/master
> real 0m14.225s
> user 0m0.017s
> sys 0m0.008s
>
> 4.4.15 result:
> [core@188435v6 ~]$ time curl -k -v https://localhost:22623/config/master
> real 0m0.055s
> user 0m0.017s
> sys 0m0.013s

I have linked a PR. Results:

4.5 without the fix:

```
real	0m2.379s
user	0m0.013s
sys	0m0.008s
```

4.5 with the fix:

```
real	0m0.489s
user	0m0.018s
sys	0m0.007s
```

The above brings us closer to 4.4, accounting for the changes we made in 4.5 to start supporting Ignition spec v2 and v3. I still think 181 CAs are a lot, but the linked PR should put us in a much better performance position.

---

Spoke with Antonio about verification steps, and we decided on comparing the time of raw `curl` access to the MCS in a 4.5 cluster (without the fix) and a 4.6 cluster (with the fix). I created an MC that has 100 entries in `spec.config.storage.files` and applied it to the cluster; a sketch of how such an MC could be generated follows.
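The exact MachineConfig used for verification isn't attached to this bug; the generator below is a sketch of how a comparable one could be produced. The file paths and contents are illustrative; only the `99-so-many-files` name, the output file name, and the 100-entry count come from the sessions that follow:

```bash
#!/bin/bash
# Emit a MachineConfig whose spec.config.storage.files has 100 entries,
# each a tiny file delivered via a data: URL. Ignition spec 2.2.0, as
# embedded in 4.5-era MachineConfigs, is assumed.
out=bz1866117-out.yaml
cat > "$out" <<'EOF'
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-so-many-files
spec:
  config:
    ignition:
      version: 2.2.0
    storage:
      files:
EOF
for i in $(seq 1 100); do
  cat >> "$out" <<EOF
      - filesystem: root
        path: /etc/so-many-files/file-${i}.txt
        mode: 420
        contents:
          source: "data:,file-${i}"
EOF
done
```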
Using a 4.5.8 cluster:

```
$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.8     True        False         32m     Cluster version is 4.5.8

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-131-226.us-west-2.compute.internal   Ready    worker   74m   v1.18.3+6c42de8
ip-10-0-151-86.us-west-2.compute.internal    Ready    master   85m   v1.18.3+6c42de8
ip-10-0-180-42.us-west-2.compute.internal    Ready    worker   74m   v1.18.3+6c42de8
ip-10-0-180-9.us-west-2.compute.internal     Ready    master   84m   v1.18.3+6c42de8
ip-10-0-206-158.us-west-2.compute.internal   Ready    master   83m   v1.18.3+6c42de8
ip-10-0-219-209.us-west-2.compute.internal   Ready    worker   74m   v1.18.3+6c42de8

$ oc apply -f ../machineConfigs/bz1866117-out.yaml
machineconfig.machineconfiguration.openshift.io/99-so-many-files created

$ oc debug node/ip-10-0-151-86.us-west-2.compute.internal
Starting pod/ip-10-0-151-86us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.151.86
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.219s
user	0m0.013s
sys	0m0.007s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.216s
user	0m0.014s
sys	0m0.005s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.212s
user	0m0.013s
sys	0m0.003s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.211s
user	0m0.009s
sys	0m0.007s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.214s
user	0m0.016s
sys	0m0.003s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m2.215s
user	0m0.017s
sys	0m0.007s
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
```

Compared to the performance on a 4.6.0-fc.4 cluster:

```
$ oc get clusterversion
NAME      VERSION      AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-fc.4   True        False         15m     Cluster version is 4.6.0-fc.4

$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-128-143.us-west-2.compute.internal   Ready    master   39m   v1.19.0-rc.2+514f31a
ip-10-0-134-45.us-west-2.compute.internal    Ready    worker   28m   v1.19.0-rc.2+514f31a
ip-10-0-189-97.us-west-2.compute.internal    Ready    worker   30m   v1.19.0-rc.2+514f31a
ip-10-0-191-113.us-west-2.compute.internal   Ready    master   39m   v1.19.0-rc.2+514f31a
ip-10-0-194-243.us-west-2.compute.internal   Ready    master   39m   v1.19.0-rc.2+514f31a
ip-10-0-211-28.us-west-2.compute.internal    Ready    worker   28m   v1.19.0-rc.2+514f31a

$ oc apply -f ../machineConfigs/bz1866117-out.yaml
machineconfig.machineconfiguration.openshift.io/99-so-many-files created

$ oc debug node/ip-10-0-128-143.us-west-2.compute.internal
Starting pod/ip-10-0-128-143us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.128.143
If you don't see a command prompt, try pressing enter.
sh-4.2# chroot /host
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.118s
user	0m0.011s
sys	0m0.005s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.163s
user	0m0.016s
sys	0m0.004s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.106s
user	0m0.010s
sys	0m0.006s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.104s
user	0m0.013s
sys	0m0.003s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.127s
user	0m0.017s
sys	0m0.002s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.152s
user	0m0.012s
sys	0m0.007s
sh-4.4# time curl -s -k -o /dev/null https://localhost:22623/config/worker

real	0m0.118s
user	0m0.015s
sys	0m0.004s
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...
```

Marking VERIFIED with 4.6.0-fc.4.

---

For those that are running into this bug (like me) in a new cluster deployment process that does not use any modified certs, I wanted to capture the workaround that worked for me:

```
openshift-install create install-config --dir=workaround
openshift-install create ignition-configs --dir=workaround
vi workaround/master.ign
```

Edit the timeouts value; something like this should suffice:

```
"timeouts": {"httpResponseHeaders": 60},
```

Now also set this value in the worker.ign and bootstrap.ign files. You can then execute a deployment that will load the Ignition files and not hit this timeout bug:

```
openshift-install --dir=workaround create cluster --log-level=debug
```
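Rather than hand-editing each file, the same change can be scripted; a small sketch, assuming `jq` is installed and the configs live in the `workaround` directory used above:

```bash
# Set a 60-second response-header timeout in all three stub Ignition configs.
for f in workaround/{bootstrap,master,worker}.ign; do
  jq '.ignition.timeouts.httpResponseHeaders = 60' "$f" > "$f.tmp" && mv "$f.tmp" "$f"
done
```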
---

I would also like to add that I am using IPI on RHOSP, ran into this bug, and the workaround above gets me by until this is resolved.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196