Description of problem:

I have a discussion here:
http://post-office.corp.redhat.com/archives/aos-devel/2020-February/msg00337.html

As was suggested there, I have opened a ticket:
https://projects.engineering.redhat.com/browse/REGISTRY-97

But it doesn't look like an issue with the registry itself; rather, something changed between OCP 4.2 and 4.3. I am opening this BZ to get proper information about what has changed.

Version-Release number of selected component (if applicable):
v4.3.1 GA

As mentioned in the mail thread, I didn't face any issue with OCP 4.2.

Issue from the mail thread:

I've tried to run the automation we created for OCS testing from the stage registry, but on OCP 4.3 I now see this error:

Get https://registry.stage.redhat.io/v2/: x509: certificate signed by unknown authority

Here is the end of the describe output of one of the pods failing to come up:

Events:
  Type     Reason     Age                    From                 Message
  ----     ------     ----                   ----                 -------
  Normal   Scheduled  <unknown>              default-scheduler    Successfully assigned openshift-storage/ocs-operator-7d97b7fd8d-mdt6k to compute-2
  Normal   Pulling    67m (x4 over 68m)      kubelet, compute-2   Pulling image "registry.stage.redhat.io/ocs4/ocs-rhel8-operator@sha256:54202499a2a7e87f5fa6afa28fe621fdb53b6e076c8417eda21c60816004c1be"
  Warning  Failed     67m (x4 over 68m)      kubelet, compute-2   Failed to pull image "registry.stage.redhat.io/ocs4/ocs-rhel8-operator@sha256:54202499a2a7e87f5fa6afa28fe621fdb53b6e076c8417eda21c60816004c1be": rpc error: code = Unknown desc = error pinging docker registry registry.stage.redhat.io: Get https://registry.stage.redhat.io/v2/: x509: certificate signed by unknown authority
  Warning  Failed     67m (x4 over 68m)      kubelet, compute-2   Error: ErrImagePull
  Warning  Failed     58m (x43 over 68m)     kubelet, compute-2   Error: ImagePullBackOff
  Normal   BackOff    3m33s (x284 over 68m)  kubelet, compute-2   Back-off pulling image "registry.stage.redhat.io/ocs4/ocs-rhel8-operator@sha256:54202499a2a7e87f5fa6afa28fe621fdb53b6e076c8417eda21c60816004c1be"

We add the certificate in this function:
https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/deployment/deployment.py#L107

This is the certificate we use, which worked well before:
https://github.com/red-hat-storage/ocs-ci/blob/master/ocs_ci/templates/ocp-deployment/stage-ca.crt

I am attaching the stage-registry-config-configmap.yaml file (part of it below), where I see apiVersion: v1, while the error mentions registry.stage.redhat.io/v2/ ; this v2 seems odd to me and maybe it is related. Can someone with more experience shed some light on it, please?

---
apiVersion: v1
items:
- apiVersion: v1
  data:
    registry.stage.redhat.io: |
      -----BEGIN CERTIFICATE-----
      OMITTED
      -----END CERTIFICATE-----
  kind: ConfigMap
  metadata:
    creationTimestamp: "2020-02-19T16:55:01Z"
    name: stage-registry-config
    namespace: openshift-config
    resourceVersion: "18929"
    selfLink: /api/v1/namespaces/openshift-config/configmaps/stage-registry-config
    uid: 88057a0b-dcfb-44d4-851d-5c02a927d7dd
- apiVersion: v1
  data:
    registry.stage.redhat.io: |
      -----BEGIN CERTIFICATE-----
      OMITTED
...

I've just tried to deploy with OCP 4.2 using the same approach and it passed, so it really looks like something changed in OCP 4.3. Jenkins job with OCP 4.2:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/4828/

Can someone please help me identify the issue?

How reproducible:

Steps to Reproduce:
1. Deploy OCP 4.3 on VMware in the internal network.
2. Try to deploy OCS 4.2 from stage, following the description from the config.
3. We have the most information available and tracked in our infra ticket: https://projects.engineering.redhat.com/browse/OCSQE-121?filter=34238

Actual results:
We are not able to deploy from stage with OCP 4.3.

Expected results:
Be able to deploy from stage without the cert issue.
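For context, the documented OCP 4.x mechanism this automation relies on is: put the CA into a ConfigMap in the openshift-config namespace (key = registry hostname) and reference it from the cluster-wide image config. A sketch of those commands, assuming a local file named stage-ca.crt and the ConfigMap name used above (these need a live cluster, so this is a configuration fragment, not a runnable test):

```shell
# Sketch of the documented additionalTrustedCA flow; the cert file name
# is an assumption. The ConfigMap key must be the registry hostname
# (for host:port registries, ':' is written as '..').
oc create configmap stage-registry-config \
    --from-file=registry.stage.redhat.io=stage-ca.crt \
    -n openshift-config

# Point the cluster-wide image config at that ConfigMap.
oc patch image.config.openshift.io/cluster --type merge \
    -p '{"spec":{"additionalTrustedCA":{"name":"stage-registry-config"}}}'
```

The rest of this bug is about which operator is responsible for propagating that ConfigMap down to the nodes.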
Additional info:

4.2 job passed here:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/4828/

This is another reproduction of the BZ. Must-gather logs from the second reproduction:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200221T110518/logs/failed_testcase_ocs_logs_1582284003/deployment_ocs_logs/

oc get pod -n openshift-storage
noobaa-operator-76595d5d95-7rrlj     0/1   ImagePullBackOff   0   126m
ocs-operator-7d97b7fd8d-vsmwl        0/1   ImagePullBackOff   0   126m
rook-ceph-operator-d7b89f965-r6cll   0/1   ImagePullBackOff   0   126m
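The underlying x509 failure just means the kubelet's trust store does not contain the CA that signed the stage registry's certificate. That condition can be reproduced locally with openssl, independent of any cluster (all names below are made up for the demo):

```shell
set -e
tmp=$(mktemp -d); cd "$tmp"

# Create a throwaway CA.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=Fake Stage CA" \
    -keyout ca.key -out ca.crt -days 1 2>/dev/null

# Create a server certificate signed by that CA.
openssl req -newkey rsa:2048 -nodes -subj "/CN=registry.stage.example" \
    -keyout srv.key -out srv.csr 2>/dev/null
openssl x509 -req -in srv.csr -CA ca.crt -CAkey ca.key -CAcreateserial \
    -out srv.crt -days 1 2>/dev/null

# Verification succeeds only when the signing CA is in the trust bundle...
openssl verify -CAfile ca.crt srv.crt

# ...against the default trust store the chain is rejected, which is the
# moral equivalent of "x509: certificate signed by unknown authority".
openssl verify srv.crt 2>&1 | grep "unable to get local issuer certificate"
```

This is why a fix that merely stores the CA in a ConfigMap, without writing it out on the nodes, is not enough.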
We are just before the release of OCS 4.2.2, so I cannot provide you with the clusters at the moment. Maybe tomorrow, after the release, I can spin some up for you; ping me on Hangouts chat and we will agree on a timeframe during which you will have this env.

This is the job which passed and ran the tier1 execution after the successful deployment:
https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/4828/consoleFull

You can see the CSV going through the installation phases without the issue:

17:00:34 - MainThread - ocs_ci.ocs.ocp - WARNING - Failed to get resource: ocs-operator.v4.2.2 of kind: csv, selector: None, Error: Error during execution of command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml. Error is Error from server (NotFound): clusterserviceversions.operators.coreos.com "ocs-operator.v4.2.2" not found
17:00:34 - MainThread - ocs_ci.ocs.ocp - WARNING - Number of attempts to get resource reached!
17:00:34 - MainThread - ocs_ci.ocs.ocp - INFO - Cannot find resource object ocs-operator.v4.2.2
17:00:34 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:39 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:39 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Pending!
17:00:39 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:44 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:45 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Pending!
17:00:45 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:50 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:50 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Pending!
17:00:50 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:55 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:56 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: InstallReady!
17:00:56 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:01:01 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:01:01 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Installing!
17:01:01 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:01:06 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml

And finally it succeeded:

17:01:46 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Succeeded!
17:01:46 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: ocs.openshift.io/v1

Some must-gather data was collected, for example, in this failed test case:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/

Here you should have much more must-gather data available from other failed test cases:
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/

I think this is good proof that it worked with 4.2 like a charm, and you have must-gather data from both versions.

If you are new to the code, can you please ask someone experienced, or the owner of the code, to clarify?

As I already wrote, I have tried to get some more details in this mail thread:
http://post-office.corp.redhat.com/archives/aos-devel/2020-February/msg00337.html
But no luck.

Thanks,
Petr
Thanks, this confirms my analysis that AdditionalTrustedCAs were never implemented in the machine-config-operator.

Compare (4.2)
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-ffd318f6852f3939f2f8a54abd10023e.yaml
and (4.3)
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200221T110518/logs/failed_testcase_ocs_logs_1582284003/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-d3a505a9c3cde231b04cdd0f6541dfd9.yaml
: both have Ignition data for a completely empty /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt .
---

AFAICT, the way it _did_ work on 4.2 was that the CA configuration was:

- read by cluster-image-registry-operator in https://github.com/openshift/cluster-image-registry-operator/blob/6d953d3f8a25ba8137ea7f776ed6d3f102080ce0/pkg/resource/caconfig.go#L102-L119
- inserted into an openshift-image-registry/image-registry-certificates config map, as you can see in http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/namespaces/openshift-image-registry/core/configmaps/image-registry-certificates.yaml
- and that config map was, in turn, turned into /etc/docker/certs.d/*/*.crt files via https://github.com/openshift/cluster-image-registry-operator/blob/master/bindata/nodecadaemon.yaml [1] == http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/namespaces/openshift-image-registry/apps/daemonsets/node-ca.yaml

(This should be possible to confirm by looking for the node-ca-created file on nodes.)

[1] What the… ultimately seems to have been added in https://github.com/openshift/cluster-image-registry-operator/pull/72 .
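The last step of that chain is simple enough to sketch locally: the node-ca DaemonSet essentially walks the keys of the image-registry-certificates config map (projected into the pod as files) and writes each one out as /etc/docker/certs.d/&lt;host&gt;/ca.crt, decoding '..' back into ':' for host:port registries. A rough stand-in with fake temp directories (the real daemon uses a projected volume and a host mount, and its exact script may differ):

```shell
set -e
# Fake stand-ins for the projected configmap volume and the host's certs.d.
src=$(mktemp -d)   # would be the image-registry-certificates projection
dst=$(mktemp -d)   # would be /etc/docker/certs.d on the host

printf -- '-----BEGIN CERTIFICATE-----\nOMITTED\n-----END CERTIFICATE-----\n' \
    > "$src/registry.stage.redhat.io"
printf -- '-----BEGIN CERTIFICATE-----\nOMITTED\n-----END CERTIFICATE-----\n' \
    > "$src/internal.registry..5000"

for f in "$src"/*; do
    name=$(basename "$f")
    host=${name//../:}          # configmap keys encode ':' as '..'
    mkdir -p "$dst/$host"
    cp "$f" "$dst/$host/ca.crt"
done

ls "$dst"
```

Which is exactly why, when the DaemonSet is absent, nothing on the node ever sees the CA, no matter what is stored in openshift-config.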
---

And as for why it works on 4.2 and not 4.3:

On 4.2,
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/imageregistry.operator.openshift.io/configs/cluster.yaml
shows the ImageRegistry operator config resource to be set up as "Managed" with backing storage, but on 4.3
http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200221T110518/logs/failed_testcase_ocs_logs_1582284003/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/imageregistry.operator.openshift.io/configs/cluster.yaml
shows it as "Removed". When the config resource is "Removed", the node-ca daemonset is not set up, and the operator seems not to be maintaining the image-registry-certificates config map.

---

Open questions:

- Validate this hypothesis by finding the node-ca-created certificate on nodes.
- Explain how/why the registry deployment differs between 4.2 and 4.3.
- (Actually implement AdditionalTrustedCAs, and then drop at least that part from cluster-image-registry-operator.)
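The first open question can be checked directly on a node; a sketch of the inspection commands (the host path follows from the node-ca analysis above, and the registry name is the one from this bug, so this needs a live node and is not runnable here):

```shell
# From a debug shell on any worker (e.g. `oc debug node/compute-2`,
# then `chroot /host`), look for the file node-ca would have written:
ls -l /etc/docker/certs.d/registry.stage.redhat.io/
cat /etc/docker/certs.d/registry.stage.redhat.io/ca.crt
```

On the failing 4.3 cluster the directory should simply be missing, while on 4.2 it should contain the stage CA.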
Hello Miloslav,

Once you started talking about the image registry and managementState, I remembered we also had an issue related to the image registry in some verification we do after OCS deployment. On 4.3 we found out that no image-registry was there on VMware. The issue we opened in our ocs-ci repo is this:
https://github.com/red-hat-storage/ocs-ci/issues/1436

See this URL in the documentation:
https://docs.openshift.com/container-platform/4.3/registry/configuring-registry-operator.html#registry-removed_configuring-registry-operator

I think this has to be related to what you've found. We currently have to do this patch:

oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState": "Managed"}}'

and then the image registry is there, but currently we do it after the deployment of OCS.

So do you think that if we do the patch right after the OCP deployment, we should be OK? Or do you suggest any other solution?

Thanks
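For the record, the state can be inspected before and after the patch; a sketch of the checks (cluster-level commands, so again only a configuration fragment):

```shell
# Inspect the current state (reported as "Removed" on the failing 4.3 cluster).
oc get configs.imageregistry.operator.openshift.io/cluster \
    -o jsonpath='{.spec.managementState}'

# Flip it to Managed so the operator (re)creates its operands.
oc patch configs.imageregistry.operator.openshift.io/cluster --type merge \
    -p '{"spec":{"managementState":"Managed"}}'

# Per the analysis above, the node-ca DaemonSet should then appear:
oc -n openshift-image-registry get daemonset node-ca
```

Whether Managed also requires configuring backing storage on this platform is a separate question for the registry team.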
(In reply to Petr Balogh from comment #6)
> See this URL in the documentation:
> https://docs.openshift.com/container-platform/4.3/registry/configuring-
> registry-operator.html#registry-removed_configuring-registry-operator
>
> I think this has to be related with what you've found.
> We currently have to do this patch:
> oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p
> '{"spec":{"managementState": "Managed"}}'
>
> and then the image registry is there but currently we do after deployment of
> OCS.
>
> So do you think that if we will do the patch just after OCP deployment that
> we should be OK?
> Or do you suggest any other solution?

Ultimately the _correct fix_ is, I think, to make the AdditionalTrustedCAs independent of the cluster-image-registry-operator state. That might mean making node-ca independent of the rest of cluster-image-registry-operator, or, I think more likely, moving this functionality into the machine-config-operator. Cc: Ben and Urvashi, who have been involved in the API / current implementation, for their opinions (please see also comment #5). That's definitely going to take at least a few days to get done.

---

As far as "why this worked in 4.2" goes, I think that's now conclusively attributed to the node-ca DaemonSet.

As for an immediate workaround: if you can enable the image registry _before_ relying on the stage-ca.crt registry, I think that should work right now. (But note that in the 4.2 must-gather, the registry uses registry-cephfs-rwx-pvc, so if the relevant Ceph infrastructure is hosted on the stage registry, that might not work; not sure.)
I think Oleg recently fixed this or has a bug related to it, to make the node-ca daemon independent of the registry. The only reason the node-ca daemon is tied to the registry operator today is because the registry needed to drive the CAs that the node trusted (so the node would trust the internal registry), but there's no technical reason to keep the two tied together.
I haven't fixed it, but we track this bug with the node-ca daemon at https://bugzilla.redhat.com/show_bug.cgi?id=1807471.
OK, that would work. (Petr, note the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1807471#c3 .)

Is the image-registry-operator the right place to maintain AdditionalTrustedCAs, in principle? Wouldn't users who don't want/need an integrated registry be tempted to just disable that operator, and all its operands, entirely?

I'm perfectly happy to have this bug solved in a different place, but then again Proxy.Spec.TrustedCA and a cloudConfig CA are maintained by the MCO. How should we, and the users, think about the separation of duties?
Closing with workaround.
*** Bug 1846625 has been marked as a duplicate of this bug. ***