Bug 1806593
Summary: | Cannot deploy from stage with OCP 4.3: error pinging docker registry registry.stage.redhat.io: Get https://registry.stage.redhat.io/v2/: x509: certificate signed by unknown authority | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Petr Balogh <pbalogh> |
Component: | Node | Assignee: | Miloslav Trmač <mitr> |
Status: | CLOSED NOTABUG | QA Contact: | Sunil Choudhary <schoudha> |
Severity: | urgent | Docs Contact: | |
Priority: | urgent | ||
Version: | 4.3.0 | CC: | aos-bugs, bparees, dkiselev, eparis, jkaur, jokerman, mitr, mpatel, obulatov, rphillips, umohnani, wking |
Target Milestone: | --- | Keywords: | Automation, AutomationBlocker |
Target Release: | 4.5.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2020-04-02 20:22:30 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Petr Balogh
2020-02-24 15:16:50 UTC
We are just before the release of OCS 4.2.2, so I cannot provide you the clusters at the moment. Maybe tomorrow, after the release, I can spin some up for you; ping me on hangout chat and we will agree on a timeframe during which you will have this env.

This is the job which passed and ran the tier1 execution after a successful deployment: https://ocs4-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/qe-deploy-ocs-cluster/4828/consoleFull

You can see the CSV going through the install phases without the issue:

```
17:00:34 - MainThread - ocs_ci.ocs.ocp - WARNING - Failed to get resource: ocs-operator.v4.2.2 of kind: csv, selector: None, Error: Error during execution of command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml. Error is Error from server (NotFound): clusterserviceversions.operators.coreos.com "ocs-operator.v4.2.2" not found
17:00:34 - MainThread - ocs_ci.ocs.ocp - WARNING - Number of attempts to get resource reached!
17:00:34 - MainThread - ocs_ci.ocs.ocp - INFO - Cannot find resource object ocs-operator.v4.2.2
17:00:34 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:39 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:39 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Pending!
17:00:39 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:44 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:45 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Pending!
17:00:45 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:50 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:50 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Pending!
17:00:50 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:00:55 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:00:56 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: InstallReady!
17:00:56 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:01:01 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
17:01:01 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Installing!
17:01:01 - MainThread - ocs_ci.utility.utils - INFO - Going to sleep for 5 seconds before next iteration
17:01:06 - MainThread - ocs_ci.utility.utils - INFO - Executing command: oc -n openshift-storage --kubeconfig /home/jenkins/current-cluster-dir/openshift-cluster-dir/auth/kubeconfig get csv ocs-operator.v4.2.2 -n openshift-storage -o yaml
```

And finally it succeeded:

```
17:01:46 - MainThread - ocs_ci.ocs.ocp - INFO - Resource ocs-operator.v4.2.2 is in phase: Succeeded!
17:01:46 - MainThread - ocs_ci.utility.templating - INFO - apiVersion: ocs.openshift.io/v1
```

Some must-gather data was collected, for example for this failed test case: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/

Here you should have much more must-gather data available, from the other failed test cases: http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/

I think this is good proof that it worked with 4.2 like a charm, and you have must-gather data from both versions. If you are new to the code, can you please ask someone experienced, or the owner of the code, to clarify? As I already wrote, I have tried to get some more details in this mail thread: http://post-office.corp.redhat.com/archives/aos-devel/2020-February/msg00337.html But no luck.
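The phase polling in the log above boils down to a simple retry loop. Here is a minimal sketch of that pattern; `wait_for_phase`, its parameters, and the callable shape are illustrative, not the actual ocs_ci API:

```python
import time

def wait_for_phase(get_phase, target="Succeeded", attempts=60, interval=5):
    """Poll a resource until it reports the target phase.

    In ocs_ci, get_phase would wrap `oc get csv <name> -o yaml` and
    extract .status.phase; here it is just a callable returning a string.
    """
    for _ in range(attempts):
        if get_phase() == target:
            return True
        # Matches "Going to sleep for 5 seconds before next iteration"
        time.sleep(interval)
    return False
```

In the log the CSV moves Pending, InstallReady, Installing, Succeeded over roughly a minute of such iterations.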
Thanks,
Petr

---

Thanks, this confirms my analysis that AdditionalTrustedCAs were never implemented in the machine-config-operator. Compare (4.2) http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-ffd318f6852f3939f2f8a54abd10023e.yaml and (4.3) http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200221T110518/logs/failed_testcase_ocs_logs_1582284003/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/machineconfiguration.openshift.io/machineconfigs/rendered-worker-d3a505a9c3cde231b04cdd0f6541dfd9.yaml: both have Ignition data for a completely empty /etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt .
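The "completely empty file" observation can be checked mechanically against the rendered MachineConfig's Ignition payload. A hedged sketch follows; the helper and sample data are made up for illustration, the field names follow the Ignition `storage.files` schema, and the assumption is that an empty file renders as the bare data URL `data:,`:

```python
def file_contents_source(ignition, path):
    """Return the contents.source data URL for `path` in an Ignition
    config dict, or None if no such file entry exists."""
    for entry in ignition.get("storage", {}).get("files", []):
        if entry.get("path") == path:
            return entry.get("contents", {}).get("source", "")
    return None

CA_PATH = "/etc/pki/ca-trust/source/anchors/openshift-config-user-ca-bundle.crt"

# Stand-in for the Ignition section of a rendered-worker MachineConfig:
rendered = {"storage": {"files": [{"path": CA_PATH,
                                   "contents": {"source": "data:,"}}]}}

# "data:," carries no payload, i.e. the CA bundle file is empty.
is_empty = file_contents_source(rendered, CA_PATH) == "data:,"
```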
---

AFAICT, the way it _did_ work on 4.2 was that the CA configuration was:

- read by cluster-image-registry-operator in https://github.com/openshift/cluster-image-registry-operator/blob/6d953d3f8a25ba8137ea7f776ed6d3f102080ce0/pkg/resource/caconfig.go#L102-L119
- inserted into an openshift-image-registry/image-registry-certificates config map, as you can see in http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/namespaces/openshift-image-registry/core/configmaps/image-registry-certificates.yaml
- and that config map was in turn converted into /etc/docker/certs.d/*/*.crt files via https://github.com/openshift/cluster-image-registry-operator/blob/master/bindata/nodecadaemon.yaml [1] == http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/namespaces/openshift-image-registry/apps/daemonsets/node-ca.yaml

(This should be possible to confirm by looking for the node-ca-created file on nodes.)

[1] What the… ultimately seems to have been added in https://github.com/openshift/cluster-image-registry-operator/pull/72 .
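The effect of that daemonset is easy to sketch: copy each key of the image-registry-certificates config map into the per-registry CA layout the container runtime consults. This is a hypothetical illustration, not the real node-ca script; in particular, the `..` to `:` decoding for ports mirrors how config-map keys (which cannot contain `:`) are commonly encoded, and should be treated as an assumption:

```python
import os

def sync_registry_certs(configmap_data, certs_root):
    """Write each config-map entry to <certs_root>/<registry>/ca.crt,
    the layout the runtime reads for per-registry trusted CAs
    (normally certs_root is /etc/docker/certs.d)."""
    for key, pem in configmap_data.items():
        # Assumption: "myregistry..5000" in a config-map key means "myregistry:5000"
        registry = key.replace("..", ":")
        target_dir = os.path.join(certs_root, registry)
        os.makedirs(target_dir, exist_ok=True)
        with open(os.path.join(target_dir, "ca.crt"), "w") as fh:
            fh.write(pem)
```

With a key like `registry.stage.redhat.io` holding the stage CA PEM, the runtime would find /etc/docker/certs.d/registry.stage.redhat.io/ca.crt and the x509 "unknown authority" error in the bug summary would not occur.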
---

And as for why it works on 4.2 and not 4.3: on 4.2, http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200220T160208/logs/failed_testcase_ocs_logs_1582218699/test_must_gather_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/imageregistry.operator.openshift.io/configs/cluster.yaml shows the ImageRegistry operator CRD to be set up as "Managed" with backing storage, but on 4.3 http://magna002.ceph.redhat.com/ocsci-jenkins/openshift-clusters/pbalogh-stage/pbalogh-stage_20200221T110518/logs/failed_testcase_ocs_logs_1582284003/deployment_ocs_logs/ocp_must_gather/quay-io-openshift-origin-must-gather-sha256-ee4eae4c297a6f0c80de95d12266c61f7348349a3e72d909a294644e8371e3aa/cluster-scoped-resources/imageregistry.operator.openshift.io/configs/cluster.yaml shows the CRD to be "Removed". When this CRD is "Removed", the node-ca daemonset is not set up, and the operator seems not to be maintaining the image-registry-certificates config map.

---

Open questions:

- Validate this hypothesis by finding the node-ca-created certificate on nodes.
- Explain how/why the registry deployment differs between 4.2 and 4.3.
- (Actually implement AdditionalTrustedCAs, and then drop at least that part from cluster-image-registry-operator.)

---

Hello Miloslav,

Once you started talking about the image registry and managementState, I remembered that we also hit an issue related to the image registry during our post-deployment verification of OCS. In 4.3 we found out that no image-registry was there on VMware. The issue we opened in our ocs-ci repo is this: https://github.com/red-hat-storage/ocs-ci/issues/1436

See this URL in the documentation: https://docs.openshift.com/container-platform/4.3/registry/configuring-registry-operator.html#registry-removed_configuring-registry-operator

I think this has to be related to what you've found.
We currently have to do this patch:

```
oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState": "Managed"}}'
```

and then the image registry is there, but currently we do it after the deployment of OCS.

So do you think that if we do the patch right after the OCP deployment we should be OK? Or do you suggest any other solution?

Thanks

---

(In reply to Petr Balogh from comment #6)
> See this URL in the documentation:
> https://docs.openshift.com/container-platform/4.3/registry/configuring-registry-operator.html#registry-removed_configuring-registry-operator
>
> I think this has to be related with what you've found.
> We currently have to do this patch:
> oc patch configs.imageregistry.operator.openshift.io/cluster --type merge -p '{"spec":{"managementState": "Managed"}}'
>
> and then the image registry is there but currently we do after deployment of OCS.
>
> So do you think that if we will do the patch just after OCP deployment that we should be OK?
> Or do you suggest any other solution?

Ultimately the _correct fix_ is, I think, to make the AdditionalTrustedCAs independent of the cluster-image-registry-operator state. That might mean making node-ca independent of the rest of cluster-image-registry-operator, or, I think more likely, moving this functionality into the machine-config-operator. Cc: Ben and Urvashi, who have been involved in the API / current implementation, for their opinions (please see also comment#5). That's definitely going to take at least a few days to get done.

---

As far as "why this worked in 4.2", I think that's now conclusively attributed to the node-ca DaemonSet.

As for an immediate workaround: if you can enable the image registry _before_ relying on the stage-ca.crt registry, I think that should work right now. (But note that in the 4.2 must-gather, the registry uses registry-cephfs-rwx-pvc, so if the relevant Ceph infrastructure is hosted on the stage registry, that might not work; not sure.)
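For what it's worth, `--type merge` in the `oc patch` command quoted above is an RFC 7386 JSON merge patch, so it only overwrites `spec.managementState` and leaves sibling `spec` fields alone. A minimal sketch of those semantics (illustrative only, not how `oc` itself is implemented; the config fragment is made up):

```python
def json_merge_patch(target, patch):
    """Apply an RFC 7386 JSON merge patch: dicts merge recursively,
    None deletes a key, any other value replaces the original."""
    if not isinstance(patch, dict):
        return patch
    result = dict(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = json_merge_patch(result.get(key), value)
    return result

# A made-up imageregistry config fragment currently set to "Removed":
config = {"spec": {"managementState": "Removed", "httpSecret": "s3cret"}}
patched = json_merge_patch(config, {"spec": {"managementState": "Managed"}})
# patched flips managementState to "Managed" and keeps "httpSecret" intact
```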
I think Oleg recently fixed this, or has a bug related to it, to make the node-ca daemon independent of the registry. The only reason the node-ca daemon is tied to the registry operator today is that the registry needed to drive the CAs the node trusted (so the node would trust the internal registry), but there's no technical reason to keep the two tied together.

---

I haven't fixed it, but we track this bug with the node-ca daemon at https://bugzilla.redhat.com/show_bug.cgi?id=1807471.

---

OK, that would work. (Petr, note the workaround in https://bugzilla.redhat.com/show_bug.cgi?id=1807471#c3 .)

Is the image-registry-operator the right place to maintain AdditionalTrustedCAs, in principle? Wouldn't users who don't want/need an integrated registry be tempted to just disable that operator, and all its operands, entirely? I'm perfectly happy to have this bug solved in a different place, but then again Proxy.Spec.TrustedCA and a cloudConfig CA are maintained by the MCO. How should we, and the users, think about the separation of duties?

---

Closing with workaround.

---

*** Bug 1846625 has been marked as a duplicate of this bug. ***