Bug 1811760 - [ovirt] Some cluster operators fail to come up because RHV CA is not trusted by a pod
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: 4.4.0
Assignee: Roy Golan
QA Contact: Jan Zmeskal
URL:
Whiteboard:
Depends On: 1794313
Blocks:
 
Reported: 2020-03-09 17:16 UTC by Roy Golan
Modified: 2020-05-13 22:01 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1794313
Environment:
Last Closed: 2020-05-13 22:00:59 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cloud-credential-operator pull 165 0 None closed Bug 1811760: Some cluster operators fail to come up because RHV CA is not trusted by a pod 2020-10-21 13:01:56 UTC
Github openshift cluster-api-provider-ovirt pull 42 0 None closed [release-4.4] Bug 1811760: Some cluster operators fail to come up because RHV CA is not trusted by a pod 2020-10-21 13:01:56 UTC
Github openshift installer pull 3267 0 None closed Bug 1811760: Some cluster operators fail to come up because RHV CA is not trusted by a pod 2020-10-21 13:02:10 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-13 22:01:00 UTC

Description Roy Golan 2020-03-09 17:16:30 UTC
+++ This bug was initially created as a clone of Bug #1794313 +++

Description of problem:
If you don't specify "ovirt_insecure: true" in your oVirt credentials config, the installation eventually fails, apparently because the machine-controller container (and possibly some others as well) does not trust the engine's CA, even though the CA is trusted by the bastion's operating system.

Version-Release number of the following components:
./openshift-install version
./openshift-install unreleased-master-2320-g6791d02a6fadedd44f9263fb72f9f65dbd51bfe0-dirty
built from commit 6791d02a6fadedd44f9263fb72f9f65dbd51bfe0
release image registry.svc.ci.openshift.org/ovirt/ovirt-release@sha256:c46483c4bfd9418226d3bbf46e15b7905dfefcccfe899b652db3a8c88b522b96

How reproducible:
I tried it only once but I believe this behaviour is consistent.

Steps to Reproduce:
1. Make sure your bastion machine (the one from which you conduct the installation) trusts your engine's CA. If your engine is your bastion, it's as easy as running this:
ln -sf /etc/pki/ovirt-engine/ca.pem /etc/pki/ca-trust/source/anchors/ && update-ca-trust

2. Now just follow the installation steps with one specific change: when setting up your oVirt credentials file, completely omit the line that says "ovirt_insecure: true". It should default to false. Mine looks like this:
cat ~/.ovirt/ovirt-config.yaml 
ovirt_url: https://<engine_fqdn>/ovirt-engine/api
ovirt_username: admin@internal
ovirt_password: <pass>

3. Try to install OCP4 and monitor the progress.
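Before kicking off a long installation, it can be worth sanity-checking the credentials file for exactly this trap. The following is a hedged sketch, not part of the original report: the `check_ovirt_cfg` helper name is made up, and it only inspects the `ovirt_insecure` and `ovirt_ca_bundle` fields discussed in the steps above.

```shell
# Sanity-check an ovirt-config.yaml before running openshift-install.
# Sketch only: assumes the flat key layout shown in the reproduction steps.
# Returns 0 when in-cluster TLS verification can plausibly succeed.
check_ovirt_cfg() {
    cfg="$1"
    [ -f "$cfg" ] || { echo "missing config: $cfg" >&2; return 1; }
    # With ovirt_insecure omitted (defaulting to false) and no CA bundle,
    # in-cluster pods such as the machine-controller cannot verify the
    # engine's certificate, which is the failure mode described above.
    if ! grep -q '^ovirt_insecure:[[:space:]]*true' "$cfg" \
       && ! grep -q '^ovirt_ca_bundle:' "$cfg"; then
        echo "warning: neither ovirt_insecure: true nor ovirt_ca_bundle is set" >&2
        return 2
    fi
    echo "ok: $cfg"
}
```

Typical usage would be `check_ovirt_cfg ~/.ovirt/ovirt-config.yaml` right before step 3.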

Actual results:
The installation got pretty far and most of the cluster operators came up, though not all of them: http://pastebin.test.redhat.com/828699
Also, worker nodes were not created.

Expected results:
The installation is finished successfully. 

Additional info:
openshift-install output: http://pastebin.test.redhat.com/828698 
Logs from authentication: http://pastebin.test.redhat.com/828702 
Logs from console: http://pastebin.test.redhat.com/828704           
Logs from ingress: http://pastebin.test.redhat.com/828706
Logs from monitoring: http://pastebin.test.redhat.com/828707
oc get pods -n openshift-machine-api: http://pastebin.test.redhat.com/828772
cluster-autoscaler-operator: http://pastebin.test.redhat.com/828766 
machine-api-operator: http://pastebin.test.redhat.com/828770
And most importantly, here are the error messages about the untrusted CA:
machine-api-controllers: http://pastebin.test.redhat.com/828779 http://pastebin.test.redhat.com/828768

Comment 3 Jan Zmeskal 2020-03-12 13:34:50 UTC
I have tried verifying this with openshift-install-linux-4.4.0-0.nightly-2020-03-12-052849, but some operators still did not come up:

oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                                                                 Unknown     Unknown       True       64m
cloud-credential                           4.4.0-0.nightly-2020-03-12-052849   True        False         False      69m
cluster-autoscaler                         4.4.0-0.nightly-2020-03-12-052849   True        False         False      53m
console                                    4.4.0-0.nightly-2020-03-12-052849   Unknown     True          False      55m
dns                                        4.4.0-0.nightly-2020-03-12-052849   True        False         False      61m
etcd                                       4.4.0-0.nightly-2020-03-12-052849   True        False         False      60m
image-registry                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      53m
ingress                                    unknown                             False       True          True       54m
insights                                   4.4.0-0.nightly-2020-03-12-052849   True        False         False      54m
kube-apiserver                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      60m
kube-controller-manager                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      60m
kube-scheduler                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      61m
kube-storage-version-migrator              4.4.0-0.nightly-2020-03-12-052849   False       False         False      64m
machine-api                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      61m
machine-config                             4.4.0-0.nightly-2020-03-12-052849   True        False         False      61m
marketplace                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      54m
monitoring                                                                     False       True          True       48m
network                                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      65m
node-tuning                                4.4.0-0.nightly-2020-03-12-052849   True        False         False      64m
openshift-apiserver                        4.4.0-0.nightly-2020-03-12-052849   True        False         False      56m
openshift-controller-manager               4.4.0-0.nightly-2020-03-12-052849   True        False         False      54m
openshift-samples                          4.4.0-0.nightly-2020-03-12-052849   True        False         False      52m
operator-lifecycle-manager                 4.4.0-0.nightly-2020-03-12-052849   True        False         False      62m
operator-lifecycle-manager-catalog         4.4.0-0.nightly-2020-03-12-052849   True        False         False      62m
operator-lifecycle-manager-packageserver   4.4.0-0.nightly-2020-03-12-052849   True        False         False      60m
service-ca                                 4.4.0-0.nightly-2020-03-12-052849   True        False         False      64m
service-catalog-apiserver                  4.4.0-0.nightly-2020-03-12-052849   True        False         False      64m
service-catalog-controller-manager         4.4.0-0.nightly-2020-03-12-052849   True        False         False      64m
storage                                    4.4.0-0.nightly-2020-03-12-052849   True        False         False      54m

The only (maybe?) relevant log message I could find is this:
oc logs pod/ingress-operator-74578cc864-rxtjz -c ingress-operator | grep ERROR | grep certificate_controller
2020-03-12T12:22:09.137Z	ERROR	operator.init.controller-runtime.controller	controller/controller.go:218	Reconciler error	{"controller": "certificate_controller", "request": "openshift-ingress-operator/default", "error": "failed to lookup wildcard cert: secrets \"router-certs-default\" not found", "errorCauses": [{"error": "failed to lookup wildcard cert: secrets \"router-certs-default\" not found"}]}
2020-03-12T12:22:43.309Z	ERROR	operator.init.controller-runtime.controller	controller/controller.go:218	Reconciler error	{"controller": "certificate_controller", "request": "openshift-ingress-operator/default", "error": "failed to publish router CA: failed to ensure \"default-ingress-cert\" in \"openshift-config-managed\" was published: Post https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/configmaps: read tcp 10.129.0.21:59996->172.30.0.1:443: read: connection reset by peer", "errorCauses": [{"error": "failed to publish router CA: failed to ensure \"default-ingress-cert\" in \"openshift-config-managed\" was published: Post https://172.30.0.1:443/api/v1/namespaces/openshift-config-managed/configmaps: read tcp 10.129.0.21:59996->172.30.0.1:443: read: connection reset by peer"}]}

However, I tried deploying a cluster with the same installer in the same environment just one hour before this attempt, and all went smoothly. The only difference was that for the second attempt I specified ovirt_insecure: false in ~/.ovirt/ovirt-config.yaml.

Once I figure out where to store must-gather logs, I'll post them here.

Comment 5 Jan Zmeskal 2020-03-12 14:31:28 UTC
Just one additional piece of information. After the failed attempt, I once again tried to deploy OCP using the same installer into the same environment, just with ovirt_insecure: true. The deployment was once again successful.

Comment 8 Jan Zmeskal 2020-03-25 09:25:45 UTC
Verified with:
openshift-install-linux-4.4.0-0.nightly-2020-03-22-130538
rhvm-4.3.9.0-0.1.el7.noarch

Verification steps:
1. Have an oVirt credentials file like this:
ovirt_url: https://<engine_fqdn>/ovirt-engine/api
ovirt_username: admin@internal
ovirt_password: "<engine_password>"
ovirt_ca_bundle: |-
  <content of /etc/pki/ovirt-engine/ca.pem>
2. Run openshift-install create cluster
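Getting the `ovirt_ca_bundle` block scalar right by hand is fiddly, since every PEM line must carry consistent two-space indentation. The sketch below (not from the report) generates the credentials file from a CA file; the `make_ovirt_cfg` name is invented, and the engine URL, username, and password are left as the same placeholders used in the verification steps.

```shell
# Build an ovirt-config.yaml with the engine CA embedded as a YAML
# literal block scalar. Sketch only: the placeholder values must be
# replaced with real engine details before use.
make_ovirt_cfg() {
    ca_file="$1"   # e.g. /etc/pki/ovirt-engine/ca.pem copied from the engine
    out="$2"       # e.g. ~/.ovirt/ovirt-config.yaml
    {
        echo 'ovirt_url: https://<engine_fqdn>/ovirt-engine/api'
        echo 'ovirt_username: admin@internal'
        echo 'ovirt_password: "<engine_password>"'
        echo 'ovirt_ca_bundle: |-'
        sed 's/^/  /' "$ca_file"   # indent every PEM line by two spaces
    } > "$out"
}
```

For example: `make_ovirt_cfg /etc/pki/ovirt-engine/ca.pem ~/.ovirt/ovirt-config.yaml`, then run step 2.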

Comment 10 errata-xmlrpc 2020-05-13 22:00:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

