+++ This bug was initially created as a clone of Bug #1794313 +++

Description of problem:
If you don't specify "ovirt_insecure: true" in your oVirt credentials config, the installation eventually fails, apparently because the machine-controller container (and possibly some others as well) does not trust the engine's CA, even though the CA is trusted by the bastion's operating system.

Version-Release number of the following components:
./openshift-install version
./openshift-install unreleased-master-2320-g6791d02a6fadedd44f9263fb72f9f65dbd51bfe0-dirty
built from commit 6791d02a6fadedd44f9263fb72f9f65dbd51bfe0
release image registry.svc.ci.openshift.org/ovirt/ovirt-release@sha256:c46483c4bfd9418226d3bbf46e15b7905dfefcccfe899b652db3a8c88b522b96

How reproducible:
I tried it only once, but I believe this behaviour is consistent.

Steps to Reproduce:
1. Make sure your bastion machine (the one from which you conduct the installation) trusts your engine's CA. If your engine is your bastion, it is as easy as running:

   ln -sf /etc/pki/ovirt-engine/ca.pem /etc/pki/ca-trust/source/anchors/ && update-ca-trust

2. Follow the installation steps with one specific change: when setting up your oVirt credentials file, completely omit the line that says "ovirt_insecure: true". It should default to false. Mine looks like this:

   cat ~/.ovirt/ovirt-config.yaml
   ovirt_url: https://<engine_fqdn>/ovirt-engine/api
   ovirt_username: admin@internal
   ovirt_password: <pass>

3. Try to install OCP4 and monitor the progress.

Actual results:
The installation got pretty far and most of the cluster operators came up, though not all: http://pastebin.test.redhat.com/828699
Worker nodes were also not created.

Expected results:
The installation finishes successfully.

Additional info:
openshift-install output: http://pastebin.test.redhat.com/828698
Logs from authentication: http://pastebin.test.redhat.com/828702
Logs from console: http://pastebin.test.redhat.com/828704
Logs from ingress: http://pastebin.test.redhat.com/828706
Logs from monitoring: http://pastebin.test.redhat.com/828707
oc get pods -n openshift-machine-api: http://pastebin.test.redhat.com/828772
cluster-autoscaler-operator: http://pastebin.test.redhat.com/828766
machine-api-operator: http://pastebin.test.redhat.com/828770

And most importantly, here is the error message about the untrusted CA:
machine-api-controllers: http://pastebin.test.redhat.com/828779
http://pastebin.test.redhat.com/828768
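
As a side note for anyone reproducing this: a quick way to see the difference between the bastion's system trust store and an explicit CA file is to verify the engine certificate against each one separately. This is only an illustrative check with placeholder hostnames, not part of the installer:

   # should succeed on the bastion after update-ca-trust (system trust store)
   curl -sS -o /dev/null -w '%{http_code}\n' https://<engine_fqdn>/ovirt-engine/api

   # roughly what a container has to rely on when it is only given an explicit CA file
   openssl s_client -connect <engine_fqdn>:443 -CAfile /etc/pki/ovirt-engine/ca.pem </dev/null 2>/dev/null | grep 'Verify return code'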
I have tried verifying this with openshift-install-linux-4.4.0-0.nightly-2020-03-12-052849, but some operators still did not come up:

oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                 Unknown     Unknown       True       64m
cloud-credential                           4.4.0-0.nightly-2020-03-12-052849   True        False         False      69m
cluster-autoscaler                         4.4.0-0.nightly-2020-03-12-052849   True        False         False      53m
console                                    4.4.0-0.nightly-2020-03-12-052849   Unknown     True          False      55m
dns                                        4.4.0-0.nightly-2020-03-12-052849   True        False         False      61m
etcd                                       4.4.0-0.nightly-2020-03-12-052849   True        False         False      60m
image-registry                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      53m
ingress                                    unknown                             False       True          True       54m
insights                                   4.4.0-0.nightly-2020-03-12-052849   True        False         False      54m
kube-apiserver                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      60m
kube-controller-manager                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      60m
kube-scheduler                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      61m
kube-storage-version-migrator              4.4.0-0.nightly-2020-03-12-052849   False       False         False      64m
machine-api                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      61m
machine-config                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      61m
marketplace                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      54m
monitoring                                                                     False       True          True       48m
network                                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      65m
node-tuning                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      64m
openshift-apiserver                        4.4.0-0.nightly-2020-03-12-052849   True        False         False      56m
openshift-controller-manager               4.4.0-0.nightly-2020-03-12-052849   True        False         False      54m
openshift-samples                          4.4.0-0.nightly-2020-03-12-052849   True        False         False      52m
operator-lifecycle-manager                 4.4.0-0.nightly-2020-03-12-052849   True        False         False      62m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-03-12-052849   True        False         False      62m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-03-12-052849   True        False         False      60m
service-ca                                 4.4.0-0.nightly-2020-03-12-052849   True        False         False      64m
service-catalog-apiserver                  4.4.0-0.nightly-2020-03-12-052849   True        False         False      64m
service-catalog-controller-manager         4.4.0-0.nightly-2020-03-12-052849   True        False         False      64m
storage                                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      54m

The only (maybe?) relevant log message I could find is this:

oc logs pod/ingress-operator-74578cc864-rxtjz -c ingress-operator | grep ERROR | grep certificate_controller
2020-03-12T12:22:09.137Z ERROR operator.init.controller-runtime.controller controller/controller.go:218 Reconciler error {"controller": "certificate_controller", "request": "openshift-ingress-operator/default", "error": "failed to lookup wildcard cert: secrets \"router-certs-default\" not found", "errorCauses": [{"error": "failed to lookup wildcard cert: secrets \"router-certs-default\" not found"}]}
2020-03-12T12:22:43.309Z ERROR operator.init.controller-runtime.controller controller/controller.go:218 Reconciler error {"controller": "certificate_controller", "request": "openshift-ingress-operator/default", "error": "failed to publish router CA: failed to ensure \"default-ingress-cert\" in \"openshift-config-managed\" was published: Post https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/configmaps: read tcp 10.129.0.21:59996->172.30.0.1:443: read: connection reset by peer", "errorCauses": [{"error": "failed to publish router CA: failed to ensure \"default-ingress-cert\" in \"openshift-config-managed\" was published: Post https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/configmaps: read tcp 10.129.0.21:59996->172.30.0.1:443: read: connection reset by peer"}]}

However, I had tried deploying a cluster with the same installer in the same environment just one hour before this attempt, and all went smoothly. The only difference was that for the second attempt I specified ovirt_insecure: false in ~/.ovirt/ovirt-config.yaml. Once I figure out a place where to store must-gather logs, I'll post them here.
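
If it helps anyone hitting the same thing: the CA-trust failure itself should show up in the machine-controller container rather than in the ingress-operator. Something along these lines can be used to check (resource and container names are the ones from this report and may differ between releases):

   oc get pods -n openshift-machine-api
   oc logs -n openshift-machine-api deploy/machine-api-controllers -c machine-controller | grep -iE 'x509|certificate'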
Just one additional piece of information: after the failed attempt, I once again tried to deploy OCP using the same installer into the same environment, just with ovirt_insecure: true. The deployment was once again successful.
Verified with:
openshift-install-linux-4.4.0-0.nightly-2020-03-22-130538
rhvm-4.3.9.0-0.1.el7.noarch

Verification steps:
1. Have an oVirt credentials file like this:

   ovirt_url: https://<engine_fqdn>/ovirt-engine/api
   ovirt_username: admin@internal
   ovirt_password: "<engine_password>"
   ovirt_ca_bundle: |-
     <content of /etc/pki/ovirt-engine/ca.pem>

2. Run openshift-install create cluster
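
For reference, if the engine host is also the bastion, one way to fill in ovirt_ca_bundle is to append the indented CA file to the credentials file. This is a minimal sketch, assuming the paths already used in this bug (/etc/pki/ovirt-engine/ca.pem and ~/.ovirt/ovirt-config.yaml):

   # add the ovirt_ca_bundle key and indent the PEM so it nests under the YAML block scalar
   { echo 'ovirt_ca_bundle: |-'; sed 's/^/  /' /etc/pki/ovirt-engine/ca.pem; } >> ~/.ovirt/ovirt-config.yaml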
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581