Bug 1806593
Summary: | Cannot deploy from stage with OCP 4.3: error pinging docker registry registry.stage.redhat.io: Get https://registry.stage.redhat.io/v2/: x509: certificate signed by unknown authority | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Petr Balogh <pbalogh> |
Component: | Node | Assignee: | Miloslav Trmač <mitr> |
Status: | CLOSED NOTABUG | QA Contact: | Sunil Choudhary <schoudha> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.3.0 | CC: | aos-bugs, bparees, dkiselev, eparis, jkaur, jokerman, mitr, mpatel, obulatov, rphillips, umohnani, wking |
Target Milestone: | --- | Keywords: | Automation, AutomationBlocker |
Target Release: | 4.5.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-04-02 20:22:30 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Petr Balogh
2020-02-24 15:16:50 UTC
We are just before the release of OCS 4.2.2, so I cannot provide you the clusters at the moment. Maybe tomorrow, after the release, I can spin some up for you; ping me on hangout chat and we will agree on a timeframe during which you will have this env.

This is the job which passed and ran the tier1 execution after a successful deployment: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/4828/consoleFull

You can see the CSV going through the install phases without the issue:

```
17:00:34 - MainThread - ocs_ci.ocs.ocp - WARNING - Failed to get resource: ocs-operator.v4.2.2 of kind: csv, selector: None, Error: Error during execution of command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml. Error is Error from server (NotFound): clusterserviceversions.operators.coreos.com "ocs-operator.v4.2.2" not found
17:00:34 - MainThread - ocs_ci.ocs.ocp - WARNING - Number of attempts to get resource reached!
17:00:34 - MainThread - ocs_ci.ocs.ocp - INFO - Cannot find resource object ocs-operator.v4.2.2
17:00:34 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:39 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:39 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Pending!
17:00:39 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:44 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:45 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Pending!
17:00:45 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:50 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:50 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Pending!
17:00:50 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:55 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:56 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: InstallReady!
17:00:56 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:01:01 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:01:01 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Installing!
17:01:01 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:01:06 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
```

And finally it succeeded:

```
17:01:46 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Succeeded!
17:01:46 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: ocs.openshift.io/v1
```

Some must-gather data was collected, for example for this failed test case: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/

Here you should have much more must-gather data available, from the other failed test cases: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/

I think this is good proof that it worked with 4.2 like a charm, and you have must-gather data from both versions. If you are new to the code, can you please ask someone experienced, or the owner of the code, to clarify? As I already wrote, I have tried to get some more details in this mail thread: http://post-office.corp.redhat.com/archives/aos-devel/2020-February/msg00337.html But no luck.
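The phase polling in the log above boils down to a simple retry loop. Here is a minimal sketch of that pattern; `wait_for_phase`, its parameters, and the callable shape are illustrative, not the actual ocs_ci API:

```python
import time

def wait_for_phase(get_phase, target="Succeeded", attempts=60, interval=5):
    """Poll a resource until it reports the target phase.

    In ocs_ci, get_phase would wrap `oc get csv <name> -o yaml` and
    extract .status.phase; here it is just a callable returning a string.
    """
    for _ in range(attempts):
        if get_phase() == target:
            return True
        # Matches "Going to sleep for 5 seconds before next iteration"
        time.sleep(interval)
    return False
```

In the log the CSV moves Pending, InstallReady, Installing, Succeeded over roughly a minute of such iterations.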
Thanks,
Petr

---

Thanks, this confirms my analysis that AdditionalTrustedCAs were never implemented in the machine-config-operator. Compare (4.2) http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-ffd318f6852f3939f2f8a54abd10023e.yaml and (4.3) http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200221T110518/logs/failed_testcase_ocs_logs_1582284003/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-d3a505a9c3cde231b04cdd0f6541dfd9.yaml: both have Ignition data for a completely empty /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt .
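The "completely empty file" observation can be checked mechanically against the rendered MachineConfig's Ignition payload. A hedged sketch follows; the helper and sample data are made up for illustration, the field names follow the Ignition `storage.files` schema, and the assumption is that an empty file renders as the bare data URL `data:,`:

```python
def file_contents_source(ignition, path):
    """Return the contents.source data URL for `path` in an Ignition
    config dict, or None if no such file entry exists."""
    for entry in ignition.get("storage", {}).get("files", []):
        if entry.get("path") == path:
            return entry.get("contents", {}).get("source", "")
    return None

CA_PATH = "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt"

# Stand-in for the Ignition section of a rendered-worker MachineConfig:
rendered = {"storage": {"files": [{"path": CA_PATH,
                                   "contents": {"source": "data:,"}}]}}

# "data:," carries no payload, i.e. the CA bundle file is empty.
is_empty = file_contents_source(rendered, CA_PATH) == "data:,"
```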
---

AFAICT, the way it _did_ work on 4.2 was that the CA configuration was:

- read by cluster-image-registry-operator in https://github.com/openshift/cluster-image-registry-operator/blob/6d953d3f8a25ba8137ea7f776ed6d3f102080ce0/pkg/resource/caconfig.go#L102-L119
- inserted into an openshift-image-registry/image-registry-certificates config map, as you can see in http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/namespaces/openshift-image-registry/core/configmaps/image-registry-certificates.yaml
- and that config map was in turn converted into /etc/docker/certs.d/*/*.crt files via https://github.com/openshift/cluster-image-registry-operator/blob/master/bindata/nodecadaemon.yaml [1] == http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/namespaces/openshift-image-registry/apps/daemonsets/node-ca.yaml

(This should be possible to confirm by looking for the node-ca-created file on nodes.)

[1] What the… ultimately seems to have been added in https://github.com/openshift/cluster-image-registry-operator/pull/72 .
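The effect of that daemonset is easy to sketch: copy each key of the image-registry-certificates config map into the per-registry CA layout the container runtime consults. This is a hypothetical illustration, not the real node-ca script; in particular, the `..` to `:` decoding for ports mirrors how config-map keys (which cannot contain `:`) are commonly encoded, and should be treated as an assumption:

```python
import os

def sync_registry_certs(configmap_data, certs_root):
    """Write each config-map entry to <certs_root>/<registry>/ca.crt,
    the layout the runtime reads for per-registry trusted CAs
    (normally certs_root is /etc/docker/certs.d)."""
    for key, pem in configmap_data.items():
        # Assumption: "myregistry..5000" in a config-map key means "myregistry:5000"
        registry = key.replace("..", ":")
        target_dir = os.path.join(certs_root, registry)
        os.makedirs(target_dir, exist_ok=True)
        with open(os.path.join(target_dir, "ca.crt"), "w") as fh:
            fh.write(pem)
```

With a key like `registry.stage.redhat.io` holding the stage CA PEM, the runtime would find /etc/docker/certs.d/registry.stage.redhat.io/ca.crt and the x509 "unknown authority" error in the bug summary would not occur.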
---

And as for why it works on 4.2 and not 4.3: on 4.2, http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/imageregistry.operator.openshift.io/configs/cluster.yaml shows the ImageRegistry operator CRD to be set up as "Managed" with backing storage, but on 4.3 http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200221T110518/logs/failed_testcase_ocs_logs_1582284003/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/imageregistry.operator.openshift.io/configs/cluster.yaml shows the CRD to be "Removed". When this CRD is "Removed", the node-ca daemonset is not set up, and the operator seems not to be maintaining the image-registry-certificates config map.

---

Open questions:

- Validate this hypothesis by finding the node-ca-created certificate on nodes.
- Explain how/why the registry deployment differs between 4.2 and 4.3.
- (Actually implement AdditionalTrustedCAs, and then drop at least that part from cluster-image-registry-operator.)

---

Hello Miloslav,

Once you started talking about the image registry and managementState, I remembered that we also hit an issue related to the image registry during our post-deployment verification of OCS. In 4.3 we found out that no image-registry was there on VMware. The issue we opened in our ocs-ci repo is this: https://github.com/red-hat-storage/ocs-ci/issues/1436

See this URL in the documentation: https://docs.openshift.com/container-platform/4.3/registry/configuring-registry-operator.html#registry-removed_configuring-registry-operator

I think this has to be related to what you've found.
We currently have to do this patch:

```
oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState": "Managed"}}'
```

and then the image registry is there, but currently we do it after the deployment of OCS.

So do you think that if we do the patch right after the OCP deployment we should be OK? Or do you suggest any other solution?

Thanks

---

(In reply to Petr Balogh from comment #6)
> See this URL in the documentation:
> https://docs.openshift.com/container-platform/4.3/registry/configuring-registry-operator.html#registry-removed_configuring-registry-operator
>
> I think this has to be related with what you've found.
> We currently have to do this patch:
> oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState": "Managed"}}'
>
> and then the image registry is there but currently we do after deployment of OCS.
>
> So do you think that if we will do the patch just after OCP deployment that we should be OK?
> Or do you suggest any other solution?

Ultimately the _correct fix_ is, I think, to make the AdditionalTrustedCAs independent of the cluster-image-registry-operator state. That might mean making node-ca independent of the rest of cluster-image-registry-operator, or, I think more likely, moving this functionality into the machine-config-operator. Cc: Ben and Urvashi, who have been involved in the API / current implementation, for their opinions (please see also comment#5). That's definitely going to take at least a few days to get done.

---

As far as "why this worked in 4.2", I think that's now conclusively attributed to the node-ca DaemonSet.

As for an immediate workaround: if you can enable the image registry _before_ relying on the stage-ca.crt registry, I think that should work right now. (But note that in the 4.2 must-gather, the registry uses registry-cephfs-rwx-pvc, so if the relevant Ceph infrastructure is hosted on the stage registry, that might not work; not sure.)
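For what it's worth, `--type merge` in the `oc patch` command quoted above is an RFC 7386 JSON merge patch, so it only overwrites `spec.managementState` and leaves sibling `spec` fields alone. A minimal sketch of those semantics (illustrative only, not how `oc` itself is implemented; the config fragment is made up):

```python
def json_merge_patch(target, patch):
    """Apply an RFC 7386 JSON merge patch: dicts merge recursively,
    None deletes a key, any other value replaces the original."""
    if not isinstance(patch, dict):
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result

# A made-up imageregistry config fragment currently set to "Removed":
config = {"spec": {"managementState": "Removed", "httpSecret": "s3cret"}}
patched = json_merge_patch(config, {"spec": {"managementState": "Managed"}})
# patched flips managementState to "Managed" and keeps "httpSecret" intact
```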
I think Oleg recently fixed this, or has a bug related to it, to make the node-ca daemon independent of the registry. The only reason the node-ca daemon is tied to the registry operator today is that the registry needed to drive the CAs the node trusted (so the node would trust the internal registry), but there's no technical reason to keep the two tied together.

---

I haven't fixed it, but we track this bug with the node-ca daemon at https://bugzilla.redhat.com/show_bug.cgi?id=1807471.

---

OK, that would work. (Petr, note the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1807471#c3 .)

Is the image-registry-operator the right place to maintain AdditionalTrustedCAs, in principle? Wouldn't users who don't want/need an integrated registry be tempted to just disable that operator, and all its operands, entirely? I'm perfectly happy to have this bug solved in a different place, but then again Proxy.Spec.TrustedCA and a cloudConfig CA are maintained by the MCO. How should we, and the users, think about the separation of duties?

---

Closing with workaround.

---

*** Bug 1846625 has been marked as a duplicate of this bug. ***