Bug 1886111 - UpdatingopenshiftStateMetricsFailed: DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas
Summary: UpdatingopenshiftStateMetricsFailed: DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.7.0
Assignee: Sergiusz Urbaniak
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2020-10-07 16:49 UTC by slowrie
Modified: 2021-02-24 15:24 UTC
CC List: 10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:23:52 UTC
Target Upstream Version:
Embargoed:




Links
System                  ID                                          Status  Summary                                                             Last Updated
Github                  openshift openshift-state-metrics pull 61   closed  Bug 1886111: Revert "Merge pull request #59 from paulfantom/klog"  2021-01-18 09:08:22 UTC
Red Hat Product Errata  RHSA-2020:5633                              None    None                                                                2021-02-24 15:24:25 UTC

Description slowrie 2020-10-07 16:49:20 UTC
Description of problem:

at https://api.ci-op-9259g0x6-067ff.origin-ci-int-aws.dev.rhcloud.com:6443..."
level=info msg="API v1.19.0-rc.2.1075+6a59bc4c1d0117-dirty up"
level=info msg="Waiting up to 30m0s for bootstrapping to complete..."
level=info msg="Destroying the bootstrap resources..."
level=info msg="Waiting up to 40m0s for the cluster at https://api.ci-op-9259g0x6-067ff.origin-ci-int-aws.dev.rhcloud.com:6443 to initialize..."
E1007 14:44:45.487611      37 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: Get "https://api.ci-op-9259g0x6-067ff.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusterversions?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dversion&resourceVersion=22717&timeoutSeconds=359&watch=true": dial tcp 54.237.212.88:6443: connect: connection refused
E1007 14:48:32.230350      37 reflector.go:307] k8s.io/client-go/tools/watch/informerwatcher.go:146: Failed to watch *v1.ClusterVersion: Get "https://api.ci-op-9259g0x6-067ff.origin-ci-int-aws.dev.rhcloud.com:6443/apis/config.openshift.io/v1/clusterversions?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dversion&resourceVersion=23451&timeoutSeconds=469&watch=true": dial tcp 52.86.173.82:6443: connect: connection refused
level=info msg="Cluster operator insights Disabled is False with AsExpected: "
level=info msg="Cluster operator monitoring Available is False with : "
level=info msg="Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack."
level=error msg="Cluster operator monitoring Degraded is True with UpdatingopenshiftStateMetricsFailed: Failed to rollout the stack. Error: running task Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas"
level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: marketplace, monitoring"

Version-Release number of selected component (if applicable):

4.7

https://search.ci.openshift.org/?search=UpdatingopenshiftStateMetricsFailed&maxAge=48h&context=1&type=bug%2Bjunit&name=4.7&maxMatches=5&maxBytes=20971520&groupBy=job

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25585/pull-ci-openshift-origin-master-e2e-aws-disruptive/1313842655171448832

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1313852661115654144

Comment 1 W. Trevor King 2020-10-07 18:03:18 UTC
Digging into the promotion failure [1] from comment 0:

   level=error msg="Cluster operator monitoring Degraded is True with UpdatingopenshiftStateMetricsFailed: Failed to rollout the stack. Error: running task Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas" 
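
For reference, the "waiting for DeploymentRollout ... got N unavailable replicas" text is the monitoring operator's rollout gate giving up: it polls the Deployment until every replica reports available, and goes Degraded with the last failure reason once the deadline passes. A minimal sketch of that kind of check in Go with client-go (an illustration matched to the error text, not the verbatim cluster-monitoring-operator source; the helper name is hypothetical):

package rolloutcheck

import (
    "context"
    "fmt"
    "time"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/kubernetes"
)

// waitForDeploymentRollout polls the Deployment and, if the deadline expires
// first, reports the last reason it was not fully available.
func waitForDeploymentRollout(c kubernetes.Interface, ns, name string) error {
    var lastErr error
    err := wait.PollImmediate(5*time.Second, 5*time.Minute, func() (bool, error) {
        d, getErr := c.AppsV1().Deployments(ns).Get(context.TODO(), name, metav1.GetOptions{})
        if getErr != nil {
            return false, getErr
        }
        if d.Status.UnavailableReplicas != 0 {
            // The condition behind "got 1 unavailable replicas" above.
            lastErr = fmt.Errorf("got %d unavailable replicas", d.Status.UnavailableReplicas)
            return false, nil // keep polling until the timeout
        }
        return true, nil
    })
    if err != nil && lastErr != nil {
        return fmt.Errorf("waiting for DeploymentRollout of %s/%s: %v", ns, name, lastErr)
    }
    return err
}

A crash-looping pod never becomes available, so a gate like this is guaranteed to time out; the interesting question is why the pod is crashing.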

Looking at the Deployment:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1313852661115654144/artifacts/e2e-gcp/deployments.json | gunzip | jq -r '.items[] | select(.metadata.name == "openshift-state-metrics").status'
{
  "conditions": [
    {
      "lastTransitionTime": "2020-10-07T15:02:45Z",
      "lastUpdateTime": "2020-10-07T15:02:45Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2020-10-07T15:12:46Z",
      "lastUpdateTime": "2020-10-07T15:12:46Z",
      "message": "ReplicaSet \"openshift-state-metrics-7d5967f58\" has timed out progressing.",
      "reason": "ProgressDeadlineExceeded",
      "status": "False",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 9,
  "replicas": 1,
  "unavailableReplicas": 1,
  "updatedReplicas": 1
}

Looking at the pod:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1313852661115654144/artifacts/e2e-gcp/pods.json | jq -r '.items[] | select(.metadata.name | startswith("openshift-state-metrics-")).status.containerStatuses[] | select(.name == "openshift-state-metrics")'
{
  "containerID": "cri-o://ee31975808c3296ed0ebf072ae4d5377114ad9d284d01f22a6cd52813acbedf8",
  "image": "registry.svc.ci.openshift.org/ocp/4.7-2020-10-07-143852@sha256:c260f540eccbba3fb29f262061839e0f528c9d689a88ddd52c3e0c3c1a0dfba0",
  "imageID": "registry.svc.ci.openshift.org/ocp/4.7-2020-10-07-143852@sha256:c260f540eccbba3fb29f262061839e0f528c9d689a88ddd52c3e0c3c1a0dfba0",
  "lastState": {
    "terminated": {
      "containerID": "cri-o://ee31975808c3296ed0ebf072ae4d5377114ad9d284d01f22a6cd52813acbedf8",
      "exitCode": 2,
      "finishedAt": "2020-10-07T15:41:38Z",
      "reason": "Error",
      "startedAt": "2020-10-07T15:41:38Z"
    }
  },
  "name": "openshift-state-metrics",
  "ready": false,
  "restartCount": 12,
  "started": false,
  "state": {
    "waiting": {
      "message": "back-off 5m0s restarting failed container=openshift-state-metrics pod=openshift-state-metrics-7d5967f58-kp4wb_openshift-monitoring(e2f9178d-0e26-46bb-812e-afd9feb8e431)",
      "reason": "CrashLoopBackOff"
    }
  }
}

Pod logs:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1313852661115654144/artifacts/e2e-gcp/pods/openshift-monitoring_openshift-state-metrics-7d5967f58-kp4wb_openshift-state-metrics_previous.log
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x30 pc=0x13e3c44]

goroutine 1 [running]:
github.com/openshift/openshift-state-metrics/pkg/options.(*Options).AddFlags(0xc00020bf40)
	/go/src/github.com/openshift/openshift-state-metrics/pkg/options/options.go:44 +0x104
main.main()
	/go/src/github.com/openshift/openshift-state-metrics/main.go:46 +0xba

Ah.  So that's pretty clear.  Presumably the 'logtostderr' access needs adjusting after [2].  In the meantime, let's revert...

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-gcp-4.7/1313852661115654144
[2]: https://github.com/openshift/openshift-state-metrics/pull/59
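
The nil pointer at options.go:44 is consistent with AddFlags looking up a flag that, after the klog switch, is never registered: glog registers "logtostderr" on flag.CommandLine at import time, while klog only registers its flags when klog.InitFlags is called explicitly. A minimal sketch of that failing pattern (an illustration consistent with the stack trace, not the verbatim pkg/options/options.go source):

package main

import (
    goflag "flag"

    "github.com/spf13/pflag"
)

func main() {
    fs := pflag.NewFlagSet("openshift-state-metrics", pflag.ExitOnError)

    // With glog imported, "logtostderr" exists on flag.CommandLine, so the
    // Lookup below succeeds after this merge.
    fs.AddGoFlagSet(goflag.CommandLine)

    // With nothing registering "logtostderr", Lookup returns nil and this
    // dereference panics exactly like the pod log:
    //   invalid memory address or nil pointer dereference
    fs.Lookup("logtostderr").Value.Set("true")
}

Reverting the klog merge restores the import-time registration, which is what the linked pull 61 does until the flag access can be adjusted.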

Comment 8 errata-xmlrpc 2021-02-24 15:23:52 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

