Description of problem:
After OCP 4 installation on AWS, I checked the pods and found that the image-registry pods have issues. The error message is "Error: secrets "image-registry-private-configuration" not found", and that secret does not exist in the project. How can I create the secret?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Create a cluster with the installer
2.
3.

Actual results:
The image-registry pod fails to run.

Expected results:
The image-registry pod should run without errors.

Additional info:
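For anyone hitting the same symptom, here is a minimal sketch of how the failing pod and the missing secret can be confirmed, assuming the default openshift-image-registry namespace and object names from a 4.0 install:

~~~
# List the registry pods and inspect the deployment that mounts the secret
$ oc get pods -n openshift-image-registry
$ oc describe deployment image-registry -n openshift-image-registry

# The secret the pod complains about; "NotFound" here matches the error above
$ oc get secret image-registry-private-configuration -n openshift-image-registry
~~~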
The secret should be created automatically; please provide the registry operator pod logs and the registry config resource yaml.
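For reference, a minimal sketch of how the operator logs can be gathered (assuming the operator Deployment name below; adjust if it differs in your cluster):

~~~
$ oc logs -n openshift-image-registry deployment/cluster-image-registry-operator --tail=200
~~~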
Hi Ben, I just recreated the cluster to confirm this issue. It turns out I can reproduce it every time. I will attach the operator pod logs, but what is the registry config resource yaml file? -Jooho
Created attachment 1525927 [details] image registry operator log
You can retrieve the registry config via:

$ oc get configs.imageregistry.operator.openshift.io/instance -o yaml
Thanks. Here is the data.

~~~
apiVersion: imageregistry.operator.openshift.io/v1
kind: Config
metadata:
  creationTimestamp: 2019-02-01T18:49:33Z
  finalizers:
  - imageregistry.operator.openshift.io/finalizer
  generation: 1
  name: instance
  resourceVersion: "18351"
  selfLink: /apis/imageregistry.operator.openshift.io/v1/configs/instance
  uid: 1d2d64a3-2652-11e9-b5fa-0207f390ebee
spec:
  httpSecret: 61ad07f5cdb8105fc626806f9bfb0172702534ef81f7137c02cc3804431364fd41b2d6a1357768b99392784e337eecd61a29bbe61ede42e1e20721c53caf4482
  logging: 2
  managementState: Managed
  proxy: {}
  replicas: 1
  requests:
    read: {}
    write: {}
  storage:
    s3: {}
status:
  conditions:
  - lastTransitionTime: 2019-02-01T18:49:33Z
    message: Deployment does not have available replicas
    status: "False"
    type: Available
  - lastTransitionTime: 2019-02-01T18:49:33Z
    message: 'Unable to apply resources: unable to sync secrets: timed out waiting
      for the condition'
    status: "True"
    type: Progressing
  - lastTransitionTime: 2019-02-01T18:49:33Z
    status: "False"
    type: Failing
  - lastTransitionTime: 2019-02-01T18:49:33Z
    status: "False"
    type: Removed
  generations: null
  internalRegistryHostname: ""
  observedGeneration: 1
  readyReplicas: 0
  storage: {}
  storageManaged: false
  version: ""
~~~
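The Progressing condition ("Unable to apply resources: unable to sync secrets: timed out waiting for the condition") suggests the operator is blocked waiting on a secret it cannot find. A minimal diagnostic sketch (standard oc commands; namespace taken from this report):

~~~
# Which secrets actually exist in the registry namespace
$ oc get secrets -n openshift-image-registry

# Recent events, which often show why the sync is stuck
$ oc get events -n openshift-image-registry --sort-by=.lastTimestamp
~~~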
This cluster seems significantly old; we've fixed several issues related to this since v4.0.0-0.148.0.0-dirty. Please install a newer cluster version (it looks like .153 is the latest).
I use openshift installer 0.11.0, which is the latest. Should I change the cluster version manually? If so, please let me know how to change it.

Thanks,
Jooho Lee
> I use openshift installer 0.11.0, which is the latest.
> Should I change the cluster version manually?
> If so, please let me know how to change it.

That would be a question for the install team; maybe Trevor can answer it. On origin it's simply a matter of using the master openshift-installer branch. Not sure what the OCP process is.
Looks like OpenShift tried to update the operator but it failed with this error:

```
Unable to apply resources: unable to apply objects: failed to create object *v1.Image, Name=cluster: images.config.openshift.io "cluster" is invalid: metadata.resourceVersion: Invalid value: 0x0: must be specified for an update
```

Is there a way to change it manually? -Jooho
No, that is an issue in the operator itself which is fixed in newer releases. The only resolution is to move to a newer openshift release.
@Trevor, I tried the latest version after building the tool on my end. This is using 4.0.0-24-g29c4cc2-dirty, but I still have the issue. Moreover, there are more issues:

1. The Cluster Settings page shows the following message:
   Could not retrieve updates. Unable to retrieve available updates: Get http://localhost:8080/graph: dial tcp [::1]:8080: connect: connection refused

2. machine-config-operator failed with the following error:
   FailingFailed when progressing towards 3.11.0-543-g6c3e3e6a-dirty because: error syncing: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready. status: (total: 3, updated: 0, unavailable: 1)

-Jooho
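For reference, a minimal sketch of how these cluster-level failures can be inspected (standard oc commands; output varies by release):

~~~
# Overall cluster version / update status reported by the CVO
$ oc get clusterversion -o yaml

# Per-operator status, including image-registry and machine-config
$ oc get clusteroperators

# Machine config pool readiness (total/updated/unavailable counts)
$ oc get machineconfigpools
~~~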
> I use openshift installer 0.11.0, which is the latest.

That should have installed 4.0.0-0.2. And we just cut installer v0.12.0, pinning update payload 4.0.0-0.3. How were you getting v4.0.0-0.148.0.0-dirty? Anyhow, try 0.12.0, which monitors CVO progress and should make debugging operator issues easier.
I think the 0.148.0 is the operator version. How/why our operator version information is different from the payload version, I'm not sure (I need to learn more about our release process...).
Ah, v4.0.0-0.148.0.0-dirty is the operator version. Installer v0.12.0's pinned update payload 4.0.0-0.3 looks like it pins registry operator v4.0.0-0.150.0.0-dirty [1]. In comment 6, Ben places the fix between 148 and 153, so I'm not sure if it has the fix or not. Ben, was the fix in [2]?

[1]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-4.0/3674/artifacts/release-e2e-aws/clusteroperators.json
[2]: https://github.com/openshift/cluster-image-registry-operator/pull/170
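For reference, a hedged sketch of how to check which registry-operator commit a given update payload pins; `<release-pullspec>` is a placeholder for the payload image:

~~~
$ oc adm release info --commits <release-pullspec> | grep cluster-image-registry-operator
~~~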
> How/why our operator version information is different from the payload version, I'm not sure (I need to learn more about our release process...).

I dunno where operator versions come from. Clayton picks update-payload versions when he pushes to quay.io. The installer team picks installer versions when we push installer tags to GitHub.
The operator versions come from something the ART team does when they tag the commits in dist-git during the release build process. (The operator code itself uses the current git tag to determine its version.)

I'm surprised 0.12 is still picking up such an old version, given that 0.153 was the latest 4 days ago and 0.12 was just created. Is there a lengthy QE/vetting process?

In any case, I believe this in particular was the credential minter issue, in which the credential minter creds/secrets were getting deleted by garbage collection because the cred minter operator was tagging ownerrefs that crossed namespaces. Devan fixed it (the fix went into the credential operator), so he'd have to tell us which specific dist-git tag included the fix.
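If that is the issue, one way to check whether the minted credentials were garbage-collected is to look at the registry's credentials secret and its ownerReferences. This is a sketch with a placeholder secret name, since the exact name comes from the registry's CredentialsRequest:

~~~
# CredentialsRequests created for cluster operators (includes the image registry's)
$ oc get credentialsrequests -n openshift-cloud-credential-operator

# Does the minted secret still exist, and what ownerReferences does it carry?
$ oc get secrets -n openshift-image-registry
$ oc get secret <minted-credentials-secret> -n openshift-image-registry \
    -o jsonpath='{.metadata.ownerReferences}'
~~~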
> I'm surprised 0.12 is still picking up such an old version, given that 0.153 was the latest 4 days ago and 0.12 was just created. Is there a lengthy QE/vetting process?

There is a QE process (although less this week with the new year). But that update payload was selected on Friday when it was fairly young, and the OCP builds haven't passed CI since then [1]. I expect the release pipeline will tighten up as we get used to ART releases.

[1]: https://openshift-release.svc.ci.openshift.org
I tested with 0.12.0 but this issue is still present.

~~~
I0206 22:13:05.498224 1 main.go:24] Cluster Image Registry Operator Version: v4.0.0-0.150.0.0-dirty
~~~
These are the event logs:

~~~
LAST SEEN   TYPE      REASON              KIND         MESSAGE
19m         Normal    Scheduled           Pod          Successfully assigned openshift-image-registry/cluster-image-registry-operator-6d6b45bfdf-qq76v to ip-10-0-36-239.us-east-2.compute.internal
19m         Normal    Pulling             Pod          pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:419bbec1250bbd9214b18059d1621f28d3cdcac5a7e757cf3ada69a7e0b55679"
19m         Normal    Pulled              Pod          Successfully pulled image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:419bbec1250bbd9214b18059d1621f28d3cdcac5a7e757cf3ada69a7e0b55679"
19m         Normal    Created             Pod          Created container
19m         Normal    Started             Pod          Started container
20m         Warning   FailedCreate        ReplicaSet   Error creating: No API token found for service account "cluster-image-registry-operator", retry after the token is automatically created and added to the service account
19m         Normal    SuccessfulCreate    ReplicaSet   Created pod: cluster-image-registry-operator-6d6b45bfdf-qq76v
20m         Normal    ScalingReplicaSet   Deployment   Scaled up replica set cluster-image-registry-operator-6d6b45bfdf to 1
~~~
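The FailedCreate warning ("No API token found for service account") is normally transient: the ReplicaSet retries once the token controller creates the token, which is consistent with the pod being created successfully a minute later. A quick way to confirm the token exists (a sketch using the service account name from the events above):

~~~
$ oc get sa cluster-image-registry-operator -n openshift-image-registry -o yaml
$ oc get secrets -n openshift-image-registry | grep cluster-image-registry-operator-token
~~~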
> In any case, I believe this in particular was the credential minter issue, in which the credential minter creds/secrets were getting deleted by garbage collection because the cred minter operator was tagging ownerrefs that crossed namespaces.

That sounds like [1]. Recent credentials operator master commits:

$ git log --first-parent --oneline -7 origin/master
0798caf Merge pull request #28 from dgoodwin/updated-docs
0d85290 Merge pull request #27 from dgoodwin/oom-memory-bump
36a39e8 Merge pull request #25 from joelddiaz/byo-aws-verbs
8567ac5 Merge pull request #26 from dgoodwin/secret-hotloop
df8cdda Merge pull request #16 from Miciah/NE-140-openshift-ingress-change-namespace-and-permission
7b5bec4 Merge pull request #23 from joelddiaz/update-controller-runtime
94ce207 Merge pull request #15 from joelddiaz/cr-conditions

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-30-145955 | grep credential
cloud-credential-operator https://github.com/openshift/cloud-credential-operator 94ce2075731d1d031f0e36664e49887c13c75ca5

So yeah, 2019-01-30-145955 was too early. Sampling [2]:

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-01-31-184459 | grep credential
cloud-credential-operator https://github.com/openshift/cloud-credential-operator 94ce2075731d1d031f0e36664e49887c13c75ca5
$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-06-035427
error: image does not exist
$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-06-214833
error: image does not exist
$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-02-06-225535
error: image does not exist

So it looks like there are no OCP builds with the fixed credential operator, and recent OCP update payloads are missing entirely despite showing up in [2]. The ART team understood the problem, but it sounded unresolved as of yesterday. I'm not sure if there's a tracking issue for it or not.

[1]: https://github.com/openshift/cloud-credential-operator/commit/22b0b0781a799b83765a589bad1a74e200932862
[2]: https://openshift-release.svc.ci.openshift.org/
Can't reproduce this bug with installer version:
name=openshift/ose-installer
release=1
version=v4.0.6

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-02-28-054829   True        False         37m     Cluster version is 4.0.0-0.nightly-2019-02-28-054829

$ oc get secret -n openshift-image-registry | grep image
cluster-image-registry-operator-dockercfg-hqbkg   kubernetes.io/dockercfg               1   50m
cluster-image-registry-operator-token-dw8cf       kubernetes.io/service-account-token   3   52m
cluster-image-registry-operator-token-mk5b8       kubernetes.io/service-account-token   3   52m
image-registry-private-configuration              Opaque                                2   51m
image-registry-tls                                kubernetes.io/tls                     2   51m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0758