Bug 1680201 - Install message: Still waiting for the cluster to initialize: Could not update servicemonitor
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Service Catalog
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.1.0
Assignee: Dan Geoffroy
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-02-23 01:57 UTC by Jay Boyd
Modified: 2019-06-04 10:44 UTC
CC List: 1 user

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-04 10:44:26 UTC
Target Upstream Version:


Attachments
Occurrences of this error in CI from 2019-04-08T19:15 to 2019-04-09T18:36 UTC (250.80 KB, image/svg+xml)
2019-04-09 19:17 UTC, W. Trevor King


Links
Red Hat Product Errata RHBA-2019:0758 (last updated 2019-06-04 10:44:33 UTC)

Description Jay Boyd 2019-02-23 01:57:23 UTC
During cluster installation, this debug message is emitted just prior to the completion of the install:

"DEBUG Still waiting for the cluster to initialize: Could not update servicemonitor "openshift-svcat-apiserver-operator/openshift-svcat-apiserver-operator" (245 of 297): the server does not recognize this resource, check extension API servers"

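A quick way to see what the installer is actually waiting on (an assumed diagnostic step, not something captured in this report) is to query the ClusterVersion conditions directly, since the installer message is just relaying them:

  # Assumed command; prints each ClusterVersion condition with its status and message.
  $ oc get clusterversion version -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
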
Comment 3 Jian Zhang 2019-03-08 10:16:42 UTC
I didn't encounter this issue when installing OCP 4.0 via the installer.  Verifying it.

Cluster version is: 4.0.0-0.nightly-2019-03-06-074438

Comment 5 W. Trevor King 2019-04-09 19:12:36 UTC
I saw this again in CI just now [1]:

  level=fatal msg="failed to initialize the cluster: Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (294 of 307): the server does not recognize this resource, check extension API servers: timed out waiting for the condition"

That cluster included svcat#31:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1568/pull-ci-openshift-installer-master-e2e-aws/5060/artifacts/release-latest/release-payload-latest/image-references | jq -r '.spec.tags[] | select(.name == "cluster-svcat-apiserver-operator").annotations'
  {
    "io.openshift.build.commit.id": "0e8e9ffc5e874be13ee99c471216f930575c083b",
    "io.openshift.build.commit.ref": "master",
    "io.openshift.build.source-location": "https://github.com/openshift/cluster-svcat-apiserver-operator"
  }
  $ git log --first-parent --format='%ad %h %d %s' --date=iso -10 0e8e9ffc5e8
  2019-03-29 13:38:06 -0700 0e8e9ffc  (up-qtCV2teSdU/release-4.0, up-qtCV2teSdU/master, origin/release-4.0, origin/master, origin/HEAD) Merge pull request #42 from jboyd01/bump-and-crd-cleanup
  2019-03-22 07:43:48 -0700 8f2edb80  (HEAD -> master) Merge pull request #41 from jboyd01/probes
  2019-03-15 07:32:49 -0700 84360159  Merge pull request #40 from jboyd01/apiservices-proto
  2019-03-13 07:53:37 -0700 63455274  Merge pull request #39 from jboyd01/version
  2019-03-07 07:12:35 -0800 292e1eac  Merge pull request #37 from jboyd01/new-namespace
  2019-03-07 07:07:07 -0800 0ac33d71  Merge pull request #35 from jboyd01/unsupported-config
  2019-03-05 21:16:51 -0500 e1a4b4ae  Merge pull request #33 from jboyd01/allow-null-management
  2019-03-05 15:51:27 -0800 b764df9e  Merge pull request #36 from jboyd01/priority-class
  2019-03-05 13:01:21 -0800 a064d1f1  Merge pull request #34 from jboyd01/image-pull-policy
  2019-02-24 04:26:15 +0100 547648cb  Merge pull request #31 from jboyd01/fix-servicemonitor-runlevel

So I'm punting back to ASSIGNED, although feel free to open a new issue if this turns out to be a separate problem that just happens to have the same symptoms.

[1]: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_installer/1568/pull-ci-openshift-installer-master-e2e-aws/5060

Comment 6 W. Trevor King 2019-04-09 19:17:35 UTC
Created attachment 1553920 [details]
Occurrences of this error in CI from 2019-04-08T19:15 to 2019-04-09T18:36 UTC

This occurred in 3 of our 344 failures (0%) in *-e2e-aws* jobs across the whole CI system over the past 23 hours.  Generated with [1]:

  $ deck-build-log-plot 'failed to initialize the cluster: Could not update servicemonitor'
  3	failed to initialize the cluster: Could not update servicemonitor
  	1	https://github.com/operator-framework/operator-lifecycle-manager/pull/759	ci-op-jm8w2mcj
  	1	https://github.com/openshift/installer/pull/1568	ci-op-19x7pmlp
  	1	https://github.com/openshift/installer/pull/1567	ci-op-njv384hk

[1]: https://github.com/wking/openshift-release/tree/debug-scripts/deck-build-log

Comment 7 Jay Boyd 2019-04-09 20:06:44 UTC
That is some nice git-fu, @Trevor!  Your error actually looks a bit different, though.  This PR's issue was specific to "the server does not recognize this resource" and also specific to the service catalog apiserver operator.  The greps you have here are all "timed out waiting" and are spread across 3 different operators (openshift image registry, kube apiserver, service catalog apiserver).

This PR's issue was caused by the fact that I was trying to create the service monitor at the wrong run level, prior to the Prometheus ServiceMonitor CRD being created.  That was fixed; the issue you identified here is different.
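
For background (and as an assumption about the mechanics, not a quote of the actual change in svcat#31): the CVO applies release-payload manifests in the order implied by their numeric filename prefixes, so a fix like this presumably amounts to making the ServiceMonitor manifest sort after the run level where the monitoring CRDs are created.  A hypothetical rename of this shape illustrates the idea; the filenames and run levels below are made up:

  # Hypothetical illustration only - the real filenames and run levels differ.
  $ git mv manifests/0000_50_cluster-svcat-apiserver-operator_07_servicemonitor.yaml \
           manifests/0000_90_cluster-svcat-apiserver-operator_07_servicemonitor.yaml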


pr-logs/pull/operator-framework_operator-lifecycle-manager/759/pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-aws-olm/1529/build-log.txt:level=fatal msg="failed to initialize the cluster: Could not update servicemonitor \"openshift-image-registry/image-registry\" (285 of 307): the server does not recognize this resource, check extension API servers: timed out waiting for the condition"

pr-logs/pull/openshift_installer/1568/pull-ci-openshift-installer-master-e2e-aws/5060/build-log.txt:level=fatal msg="failed to initialize the cluster: Could not update servicemonitor \"openshift-kube-apiserver-operator/kube-apiserver-operator\" (294 of 307): the server does not recognize this resource, check extension API servers: timed out waiting for the condition"

pr-logs/pull/openshift_installer/1567/pull-ci-openshift-installer-master-e2e-aws/5057/build-log.txt:level=fatal msg="failed to initialize the cluster: Could not update servicemonitor \"openshift-service-catalog-apiserver-operator/openshift-service-catalog-apiserver-operator\" (288 of 307): the server does not recognize this resource, check extension API servers: timed out waiting for the condition"

Comment 8 Jay Boyd 2019-04-09 20:14:42 UTC
I could be wrong about it being a totally different error (in the original PR the error message could be truncated -- maybe they are all "timed out waiting for the condition").  I'll do some more digging; I'm thinking the Prometheus CRDs haven't been created.
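
One way to check that (an assumed command, not something taken from the CI artifacts) is to ask the cluster whether the CRD backing ServiceMonitor objects exists yet:

  # If this CRD is absent, servicemonitor manifests cannot be applied.
  $ oc get crd servicemonitors.monitoring.coreos.com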

Comment 9 W. Trevor King 2019-04-09 20:20:29 UTC
> I could be wrong about it being a totally different error...

Feel free to move this back to VERIFIED and spin the new report off into its own bug if you suspect it is different.  We can always close it as a dup if it turns out to be the same after further digging (or stay here until we have more evidence that it's a different issue).

Comment 10 Jay Boyd 2019-04-09 21:01:25 UTC
For the failure specific to the service catalog apiserver operator (installer/pull/1567), I see the monitoring cluster operator is in a failing state, indicating it failed to roll out.  Therefore the ServiceMonitor CRD won't be available.  In https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/1567/pull-ci-openshift-installer-master-e2e-aws/5057/artifacts/e2e-aws/clusteroperators.json:

            "apiVersion": "config.openshift.io/v1",
            "kind": "ClusterOperator",
            "metadata": {
                "creationTimestamp": "2019-04-09T17:27:04Z",
                "generation": 1,
                "name": "monitoring",
                "resourceVersion": "27677",
                "selfLink": "/apis/config.openshift.io/v1/clusteroperators/monitoring",
                "uid": "b119f846-5aec-11e9-9843-12f409cedc68"
            },
            "spec": {},
            "status": {
                "conditions": [
                    {
                        "lastTransitionTime": "2019-04-09T17:37:11Z",
                        "message": "Failed to rollout the stack. Error: running task Updating Cluster Monitoring Operator failed: reconciling Cluster Monitoring Operator ServiceMonitor failed: creating ServiceMonitor object failed: the server could not find the requested resource (post servicemonitors.monitoring.coreos.com)",
                        "status": "True",
                        "type": "Failing"
                    },
                    {
                        "lastTransitionTime": "2019-04-09T17:32:06Z",
                        "status": "False",
                        "type": "Available"
                    },
                    {
                        "lastTransitionTime": "2019-04-09T17:52:27Z",
                        "message": "Rolling out the stack.",
                        "status": "True",
                        "type": "Progressing"
                    }
                ],



For openshift_installer/1568 the monitoring cluster operator isn't listed, i.e. it hasn't been started by the CVO.  I take that to mean monitoring hasn't been installed and therefore its CRDs aren't recognized.  There are other core failures identified in there indicating issues with the kube-apiserver, kube-controller-manager, and kube-scheduler static pods, and machine-config is also in a failure state.


For operator-framework_operator-lifecycle-manager/759 (https://storage.googleapis.com/origin-ci-test/pr-logs/pull/operator-framework_operator-lifecycle-manager/759/pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-aws-olm/1529/artifacts/e2e-aws-olm/clusteroperators.json), the monitoring cluster operator again isn't listed, and the scheduler is failing with a message about no static pods.


I don't understand why the servicemonitor error is reported by the installer while there are other cluster operators in a failed state that are (at least in two of these cases) most likely causing the servicemonitor error.  Is the servicemonitor error just one of several errors, but the only one reported?  From what I am seeing it is misleading and a red herring.

@Trevor: could this be an installer issue of not reporting all errors?  If not the installer, I'm inclined to think the CVO.

Comment 11 W. Trevor King 2019-04-09 21:10:20 UTC
> I don't understand why the servicemonitor error is reported by the installer while there are other cluster operators in a failed state that are (at least in two of these cases) most likely causing the servicemonitor error.  Is the servicemonitor error just one of several errors, but the only one reported?  From what I am seeing it is misleading and a red herring.

Thanks for digging :).  That makes sense to me, and I'll move this back to VERIFIED.  There is more discussion in bug 1691513 about improving logging in these situations, and work in progress in [1,2] to make the CVO more informative.  I'll check for issues with the core components you mention and follow up with them if necessary.

[1]: https://github.com/openshift/cluster-version-operator/pull/152
[2]: https://github.com/openshift/cluster-version-operator/pull/158

Comment 14 errata-xmlrpc 2019-06-04 10:44:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758

