During cluster installation, this debug message is emitted just prior to completion of the install:
"DEBUG Still waiting for the cluster to initialize: Could not update servicemonitor "openshift-svcat-apiserver-operator/openshift-svcat-apiserver-operator" (245 of 297): the server does not recognize this resource, check extension API servers"
fixed by https://github.com/openshift/cluster-svcat-apiserver-operator/pull/31
I didn't encounter this issue when installing OCP 4.0 via the installer, so I'm marking it VERIFIED.
Cluster version is: 4.0.0-0.nightly-2019-03-06-074438
I saw this again in CI just now:
level=fatal msg="failed to initialize the cluster: Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (294 of 307): the server does not recognize this resource, check extension API servers: timed out waiting for the condition"
That cluster included svcat#31:
$ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1568/pull-ci-openshift-installer-master-e2e-aws/5060/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "cluster-svcat-apiserver-operator").annotations'
$ git log --first-parent --format='%ad %h %d %s' --date=iso -10 0e8e9ffc5e8
2019-03-29 13:38:06 -0700 0e8e9ffc (up-qtCV2teSdU/release-4.0, up-qtCV2teSdU/master, origin/release-4.0, origin/master, origin/HEAD) Merge pull request #42 from jboyd01/bump-and-crd-cleanup
2019-03-22 07:43:48 -0700 8f2edb80 (HEAD -> master) Merge pull request #41 from jboyd01/probes
2019-03-15 07:32:49 -0700 84360159 Merge pull request #40 from jboyd01/apiservices-proto
2019-03-13 07:53:37 -0700 63455274 Merge pull request #39 from jboyd01/version
2019-03-07 07:12:35 -0800 292e1eac Merge pull request #37 from jboyd01/new-namespace
2019-03-07 07:07:07 -0800 0ac33d71 Merge pull request #35 from jboyd01/unsupported-config
2019-03-05 21:16:51 -0500 e1a4b4ae Merge pull request #33 from jboyd01/allow-null-management
2019-03-05 15:51:27 -0800 b764df9e Merge pull request #36 from jboyd01/priority-class
2019-03-05 13:01:21 -0800 a064d1f1 Merge pull request #34 from jboyd01/image-pull-policy
2019-02-24 04:26:15 +0100 547648cb Merge pull request #31 from jboyd01/fix-servicemonitor-runlevel
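For the record, the by-eye check above (scanning the first-parent log for the PR's merge commit) can be sketched programmatically. This is only an illustration: the log lines are excerpted from the output above, and `merged_prs` is a made-up helper, not a real tool.

```python
import re

# First-parent log output from the payload's commit (excerpt from above).
log_lines = [
    "2019-03-29 13:38:06 -0700 0e8e9ffc Merge pull request #42 from jboyd01/bump-and-crd-cleanup",
    "2019-02-24 04:26:15 +0100 547648cb Merge pull request #31 from jboyd01/fix-servicemonitor-runlevel",
]

def merged_prs(lines):
    """Extract PR numbers from 'Merge pull request #N' subject lines."""
    prs = set()
    for line in lines:
        m = re.search(r"Merge pull request #(\d+)", line)
        if m:
            prs.add(int(m.group(1)))
    return prs

# svcat#31 (the servicemonitor run-level fix) is in the payload's history.
print(31 in merged_prs(log_lines))
```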
So I'm punting back to ASSIGNED, although feel free to open a new issue if this turns out to be a separate problem that just happens to have the same symptoms.
Created attachment 1553920 [details]
Occurrences of this error in CI from 2019-04-08T19:15 to 2019-04-09T18:36 UTC
This occurred in 3 of our 344 failures (~1%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours. Generated with:
$ deck-build-log-plot 'failed to initialize the cluster: Could not update servicemonitor'
3 failed to initialize the cluster: Could not update servicemonitor
1 https://github.com/operator-framework/operator-lifecycle-manager/pull/759 ci-op-jm8w2mcj
1 https://github.com/openshift/installer/pull/1568 ci-op-19x7pmlp
1 https://github.com/openshift/installer/pull/1567 ci-op-njv384hk
That is some nice git-fu, @Trevor! Your error actually looks a bit different, though. This PR's issue was specific to "the server does not recognize this resource" and also specific to the service catalog apiserver operator. The greps you have here all end in "timed out waiting" and are spread across three different operators (openshift-image-registry, kube-apiserver, service catalog apiserver).
This PR's issue was caused by the fact that I was trying to create the service monitor at the wrong run level, prior to the Prometheus ServiceMonitor CRD being created. That was fixed; the issue you identified here is different.
pr-logs/pull/operator-framework_operator-lifecycle-manager/759/pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-aws-olm/1529/build-log.txt:level=fatal msg="failed to initialize the cluster: Could not update servicemonitor \"openshift-image-registry/image-registry\" (285 of 307): the server does not recognize this resource, check extension API servers: timed out waiting for the condition"
pr-logs/pull/openshift_installer/1568/pull-ci-openshift-installer-master-e2e-aws/5060/build-log.txt:level=fatal msg="failed to initialize the cluster: Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (294 of 307): the server does not recognize this resource, check extension API servers: timed out waiting for the condition"
pr-logs/pull/openshift_installer/1567/pull-ci-openshift-installer-master-e2e-aws/5057/build-log.txt:level=fatal msg="failed to initialize the cluster: Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (288 of 307): the server does not recognize this resource, check extension API servers: timed out waiting for the condition"
I could be wrong about it being a totally different error (in the original PR the error message could have been truncated; maybe they are all "timed out waiting for the condition"). I'll do some more digging; I'm thinking the Prometheus CRDs haven't been created.
> I could be wrong about it being a totally different error...
Feel free to move this back to VERIFIED and spin the new report off into its own bug if you suspect it is different. We can always close it as a dup if it turns out to be the same after further digging (or stay here until we have more evidence that it's a different issue).
For the failure specific to the service catalog apiserver operator (installer/pull/1567), I see the monitoring cluster operator is in a failing state, indicating it failed to roll out; therefore the ServiceMonitor CRD won't be available. In https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1567/pull-ci-openshift-installer-master-e2e-aws/5057/artifacts/e2e-aws/clusteroperators.json:
"message": "Failed to rollout the stack. Error: running task Updating Cluster Monitoring Operator failed: reconciling Cluster Monitoring Operator ServiceMonitor failed: creating ServiceMonitor object failed: the server could not find the requested resource (post servicemonitors.monitoring.coreos.com)",
"message": "Rolling out the stack.",
For openshift_installer/1568 the monitoring cluster operator isn't listed at all, i.e. it hasn't been started by the CVO. I take that to mean monitoring hasn't been installed and therefore its CRDs aren't recognized. There are other core failures identified in there indicating issues with the kube-apiserver, kube-controller-manager, and kube-scheduler static pods, and machine-config is also in a failing state.
For operator-framework_operator-lifecycle-manager/759 (https://storage.googleapis.com/origin-ci-test/pr-logs/pull/operator-framework_operator-lifecycle-manager/759/pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-aws-olm/1529/artifacts/e2e-aws-olm/clusteroperators.json) the monitoring cluster operator again isn't listed, and the scheduler is failing with a message about no static pods.
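That triage (is monitoring listed, and which operators report failure) can be sketched in Python. This is a hypothetical illustration against a trimmed, made-up clusteroperators.json-style payload, assuming the 4.0-era "Failing" condition type; real CI artifacts carry many more operators and condition fields.

```python
import json

# Hypothetical, trimmed clusteroperators.json-style list; the names,
# conditions, and messages here are illustrative only.
doc = json.loads("""
{
  "items": [
    {"metadata": {"name": "monitoring"},
     "status": {"conditions": [
       {"type": "Failing", "status": "True",
        "message": "Failed to rollout the stack."}]}},
    {"metadata": {"name": "kube-scheduler"},
     "status": {"conditions": [
       {"type": "Failing", "status": "False", "message": ""}]}}
  ]
}
""")

# Is the monitoring cluster operator listed at all?
names = {item["metadata"]["name"] for item in doc["items"]}
print("monitoring listed:", "monitoring" in names)

# Which operators report a Failing=True condition?
for item in doc["items"]:
    for cond in item["status"]["conditions"]:
        if cond["type"] == "Failing" and cond["status"] == "True":
            print(item["metadata"]["name"], "is failing:", cond["message"])
```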
I don't understand why the servicemonitor error is the one reported by the installer while there are other cluster operators in a failed state that (at least in two of these cases) are most likely causing the servicemonitor error. Is the servicemonitor error just one of several errors, but the only one reported? From what I am seeing, it is misleading and a red herring.
@Trevor: could this be an installer issue not reporting all errors? If not installer, I'm inclined to think CVO.
> I don't understand why the servicemonitor error is the one reported by the installer while there are other cluster operators in a failed state that (at least in two of these cases) are most likely causing the servicemonitor error. Is the servicemonitor error just one of several errors, but the only one reported? From what I am seeing, it is misleading and a red herring.
Thanks for digging :). That makes sense to me, and I'll move this back to VERIFIED. There is more discussion in bug 1691513 about improving logging in these situations, and work in progress in [1,2] to make the CVO more informative. I'll check for issues with the core components you mention and follow up with them if necessary.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.