Description of problem:
In Service Delivery we are observing 4.6 nightly clusters consistently failing to install, with the installer error "failed to initialize the cluster: Cluster operator machine-config is still updating".

Version-Release number of the following components:
openshift-v4.6.0-0.nightly-2020-09-04-064335

How reproducible:
Consistently: https://testgrid.k8s.io/redhat-osde2e-int-aws#osde2e-int-aws-e2e-osd-default-plus-two-nightly

Steps to Reproduce:
1. Provision an OSD cluster in Int using a 4.6 nightly.
2. The cluster sits in the installing state until the install times out.
3. Check the installer logs.

Actual results:
"failed to initialize the cluster: Cluster operator machine-config is still updating"

Additional Information:
https://testgrid.k8s.io/redhat-osde2e-int-aws#osde2e-int-aws-e2e-osd-default-plus-two-nightly

Example failure: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/osde2e-int-aws-e2e-osd-default-plus-two-nightly/1303423056479260672
Here is a snippet from an example install log:

```
time="2020-09-09T16:17:44Z" level=info msg="Waiting up to 30m0s for the cluster at https://api.osde2e-ukg38.m4m1.i1.devshift.org:6443 to initialize..."
time="2020-09-09T16:18:15Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 1% complete"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: downloading update"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 5% complete"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 12% complete"
time="2020-09-09T16:19:02Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 80% complete"
time="2020-09-09T16:19:17Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 94% complete"
time="2020-09-09T16:20:17Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 96% complete"
time="2020-09-09T16:20:32Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 97% complete"
time="2020-09-09T16:21:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 97% complete, waiting on authentication, cloud-credential, cluster-autoscaler, console, image-registry, ingress, kube-storage-version-migrator, machine-api, machine-config, monitoring"
time="2020-09-09T16:24:32Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 98% complete"
time="2020-09-09T16:25:02Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 98% complete, waiting on authentication, cloud-credential, console, image-registry, ingress, kube-storage-version-migrator, machine-config, monitoring"
time="2020-09-09T16:26:02Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 98% complete"
time="2020-09-09T16:27:17Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 99% complete"
time="2020-09-09T16:28:32Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.251.78:6443/.well-known/oauth-authorization-server endpoint data"
time="2020-09-09T16:33:02Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2"
time="2020-09-09T16:33:21Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2"
time="2020-09-09T16:39:02Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 100% complete, waiting on machine-config"
time="2020-09-09T16:44:47Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is still updating"
time="2020-09-09T16:47:44Z" level=info msg="Cluster operator insights Disabled is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired"
time="2020-09-09T16:47:44Z" level=info msg="Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired"
time="2020-09-09T16:47:44Z" level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.6.0-0.nightly-2020-09-09-111337"
time="2020-09-09T16:47:44Z" level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.6.0-0.nightly-2020-09-09-111337: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node ip-10-0-178-115.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9a757a7b1d612460c62cc443aeaa3926\\\\\\\" not found\\\", Node ip-10-0-172-77.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9a757a7b1d612460c62cc443aeaa3926\\\\\\\" not found\\\", Node ip-10-0-251-78.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9a757a7b1d612460c62cc443aeaa3926\\\\\\\" not found\\\"\", retrying"
time="2020-09-09T16:47:44Z" level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.6.0-0.nightly-2020-09-09-111337"
time="2020-09-09T16:47:44Z" level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is still updating"
```
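For anyone reproducing this, the `rendered-master-...` "not found" error above can be chased a bit further on a live cluster with something like the following (a sketch, assuming cluster-admin `oc` access; this is not something the installer does for you):

```
# Does the rendered MachineConfig the masters are asking for actually exist,
# and what does the master pool think its current/desired config is?
oc get machineconfig | grep rendered-master
oc get machineconfigpool master -o yaml
# List the MCO pods so you can pull controller/daemon logs from the relevant ones
oc -n openshift-machine-config-operator get pods
```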
Looking through the must-gather, I see in clusteroperators.yaml and clusterversion that auth degraded before the MCO errors above:

```
status:
  availableUpdates: null
  conditions:
  - lastTransitionTime: "2020-09-09T18:45:16Z"
    status: "False"
    type: Available
  - lastTransitionTime: "2020-09-09T19:06:12Z"
    message: 'Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: got ''404 Not Found'' status while trying to GET the OAuth well-known https://10.0.158.104:6443/.well-known/oauth-authorization-server endpoint data'
    reason: ClusterOperatorDegraded
    status: "True"
    type: Failing
  - lastTransitionTime: "2020-09-09T18:45:16Z"
    message: 'Unable to apply 4.6.0-0.nightly-2020-09-09-111337: the cluster operator authentication is degraded'
    reason: ClusterOperatorDegraded
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-09-09T18:45:16Z"
    message: 'Unable to retrieve available updates: currently reconciling cluster version 4.6.0-0.nightly-2020-09-09-111337 not found in the "stable-4.6" channel'
    reason: VersionNotFound
    status: "False"
```

I'll keep looking through the must-gather, but I'm passing this to the auth team to investigate the failure. Could you also let us know how this OSD deployment differs from a standard OCP deployment? And can you update your CI runs to include full must-gathers (as OCP CI does)? I'm unable to tell what errors you were seeing in past runs, and I see consistently failing runs clicking back through to May.
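On the must-gather request: for reference, a minimal sketch of what the CI job could run after a failed install to capture the same data OCP CI does (the destination paths here are placeholders, not the job's actual layout):

```
# Collect a full must-gather from the failed cluster and keep it with the job artifacts
oc adm must-gather --dest-dir="${ARTIFACT_DIR}/must-gather"
tar -czf "${ARTIFACT_DIR}/must-gather.tar.gz" -C "${ARTIFACT_DIR}" must-gather
```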
The MCO error looks like: https://bugzilla.redhat.com/show_bug.cgi?id=1835233
Adding to https://bugzilla.redhat.com/show_bug.cgi?id=1877481#c2: do you do any additional configuration at install time? Also, when did these auth/mco/etc. errors start appearing in your CI runs? (We have no must-gathers other than the single one attached here, so we can't tell.)
FYI, the auth operator reporting degraded is most commonly a canary rather than an indication that the operator itself is the source of the problem. In this case, the auth operator reporting `need at least 3 kube-apiservers, got 2` is an indication that one of the kube-apiservers is not available. During an install or upgrade, a common reason for that symptom is that one of the apiservers is being restarted as part of a staged rollout initiated in response to a configuration change. Currently each rollout of the 3 kube-apiservers can take up to 15 minutes, which is longer than in previous releases.

I don't see an obvious reason for failure other than the install not allowing enough time for the configuration changes to be applied via staged rollout. I suggest that whoever is responsible for the job configuration compare the timeout used for the common presubmits (e.g. e2e-gcp-upgrade) with that used for the periodic in question. If they differ substantially, the job timeout duration could be the problem.

Assigning to the apiserver experts in case there's something I've missed.
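For whoever picks this up, a quick sketch of how to see where the kube-apiserver staged rollout stands on a cluster in this state (assumes cluster access; the fields are the usual static-pod operator status, adjust if they differ):

```
# Which revision is each master on, and is one still converging toward the target?
oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'
# And the pods themselves, to see if one is Pending/restarting
oc get pods -n openshift-kube-apiserver -o wide
```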
> do you do any additional configuration at install time?

We do. What is the best way to get that to you -- should we share the install-config?

> Also can you update your CI runs to include full must-gathers

I'll try to make that happen.
Poking around in comment 6's must-gather:

```
$ for X in cluster-scoped-resources/core/nodes/*; do yaml2json < "${X}" | jq -r '.metadata.name as $name | .metadata.annotations["machine.openshift.io/machine"] as $machine | .status.conditions[] | select(.type == "Ready") | .lastTransitionTime + " " + .status + " " + $name + " " + $machine'; done | sort
2020-09-09T18:51:28Z True ip-10-0-160-77.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-master-1
2020-09-09T18:51:49Z True ip-10-0-226-2.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-master-2
2020-09-09T18:51:54Z True ip-10-0-158-104.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-master-0
2020-09-09T19:00:43Z True ip-10-0-254-204.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-worker-us-east-1a-2n8gt
2020-09-09T19:00:56Z True ip-10-0-247-111.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-worker-us-east-1a-rx4n4
2020-09-09T19:01:04Z True ip-10-0-246-132.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-worker-us-east-1a-p8kdj
2020-09-09T19:01:43Z True ip-10-0-133-135.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-worker-us-east-1a-bb77f

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-09-09T18:45:16Z Available=False :
2020-09-09T18:45:16Z Progressing=True ClusterOperatorDegraded: Unable to apply 4.6.0-0.nightly-2020-09-09-111337: the cluster operator authentication is degraded
2020-09-09T18:45:16Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.6.0-0.nightly-2020-09-09-111337 not found in the "stable-4.6" channel
2020-09-09T19:06:12Z Failing=True ClusterOperatorDegraded: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.158.104:6443/.well-known/oauth-authorization-server endpoint data

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators/authentication.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-09-09T18:51:56Z Available=False WellKnown_NotReady: WellKnownAvailable: The well-known endpoint is not yet available: need at least 3 kube-apiservers, got 2
2020-09-09T18:51:56Z Upgradeable=True AsExpected:
2020-09-09T18:53:56Z Degraded=True WellKnownReadyController_SyncError: WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2
2020-09-09T19:02:11Z Progressing=False AsExpected:

$ for X in namespaces/openshift-kube-apiserver/pods/kube-apiserver-*/kube-apiserver-*.yaml; do yaml2json < "${X}" | jq -r '.status.phase + " " + .metadata.name'; done
Pending kube-apiserver-ip-10-0-158-104.ec2.internal
Running kube-apiserver-ip-10-0-160-77.ec2.internal
Running kube-apiserver-ip-10-0-226-2.ec2.internal

$ yaml2json <namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-158-104.ec2.internal/kube-apiserver-ip-10-0-158-104.ec2.internal.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-09-09T18:56:23Z PodScheduled=True :
2020-09-09T19:06:18Z ContainersReady=False ContainersNotReady: containers with unready status: [kube-apiserver kube-apiserver-cert-syncer kube-apiserver-cert-regeneration-controller kube-apiserver-insecure-readyz kube-apiserver-check-endpoints]
2020-09-09T19:06:18Z Initialized=False ContainersNotInitialized: containers with incomplete status: [setup]
2020-09-09T19:06:18Z Ready=False ContainersNotReady: containers with unready status: [kube-apiserver kube-apiserver-cert-syncer kube-apiserver-cert-regeneration-controller kube-apiserver-insecure-readyz kube-apiserver-check-endpoints]

$ yaml2json <namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-158-104.ec2.internal/kube-apiserver-ip-10-0-158-104.ec2.internal.yaml | jq -r .status.initContainerStatuses
[
  {
    "name": "setup",
    "ready": "false",
    "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbbdc954cefdfc297e03ef3b9b36211b8f7378378a89e071585e70b1121161d2",
    "lastState": {},
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbbdc954cefdfc297e03ef3b9b36211b8f7378378a89e071585e70b1121161d2",
    "state": {
      "running": {
        "startedAt": "2020-09-09T19:06:19Z"
      }
    },
    "containerID": "cri-o://42324264d083b6652df85702d882fe50742241290088ec8cb17e1a9a6d60dadd",
    "restartCount": "0"
  }
]

$ cat namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-158-104.ec2.internal/setup/setup/logs/current.log
2020-09-09T19:06:19.539684924Z Fixing audit permissions.2020-09-09T19:06:19.541907798Z Waiting for port :6443 and :6080 to be released.2020-09-09T19:06:19.549586347Z .2020-09-09T19:06:20.560841176Z .2020-09-09T19:06:21.565801200Z .2020-09-09T19:06:22.570561340Z .2020-09-09T19:06:23.575362818Z .2020-09-09T19:06:24.583315221Z .2020-09-09T19:06:25.587639733Z .2020-09-09T19:06:26.592153247Z .2020-09-09T19:06:27.596626613Z .2020-09-09T19:06:28.601329499Z .2020-09-09T19:06:29.606262278Z .2020-09-09T19:06:30.610849437Z .2020-09-09T19:06:31.617428753Z .2020-09-09T19:06:32.622060791Z .2020-09-09T19:06:33.626518033Z .2020-09-09T19:06:34.632391083Z .2020-09-09T19:06:35.637264109Z .2020-09-09T19:06:36.641862868Z .2020-09-09T19:06:37.647128745Z .2020-09-09T19:06:38.652486458Z .2020-09-09T19:06:39.657341762Z .2020-09-09T19:06:40.662505710Z .2020-09-09T19:06:41.667058974Z .2020-09-09T19:06:42.671613051Z .2020-09-09T19:06:43.676154458Z .2020-09-09T19:06:44.697024774Z .2020-09-09T19:06:45.694892277Z .2020-09-09T19:06:46.699285060Z .2020-09-09T19:06:47.704090711Z .2020-09-09T19:06:48.708667064Z .2020-09-09T19:06:49.714727402Z .2020-09-09T19:06:50.720228979Z .2020-09-09T19:06:51.725158628Z .
```

So... something is hanging on to these ports on that node, and the setup container is just hanging out waiting for them to get free? Kernel issue? Or maybe not, because "Waiting for port :6443 and :6080 to be released" shows up in a dozen existing bugs, but none of them are still open, and none of them were reported against 4.6. Also, not sure what's writing these logs, but newlines are cheap ;). And it's also not clear to me why auth wants all three kube-apiserver pods up, I'd expect the loadbalancer to mean we only needed a single, healthy kube-apiserver pod to make consumers like the auth operator happy.
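If someone catches a live cluster in this state, here's a sketch of how I'd check what is still holding those ports on the node (I haven't verified this exact invocation against this cluster, so treat it as a starting point):

```
# What process is still bound to 6443/6080 on the affected master?
oc debug node/ip-10-0-158-104.ec2.internal -- chroot /host ss -tlnp | grep -E ':6443|:6080'
```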
> And it's also not clear to me why auth wants all three kube-apiserver pods up, I'd expect the loadbalancer to mean we only needed a single, healthy kube-apiserver pod to make consumers like the auth operator happy.

No, for oauth to work we have to make sure all of them serve the oauth .well-known endpoint.
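A quick sketch for checking that directly, using the master IPs from the must-gather above (`-k` because the serving cert won't be valid for a bare IP); each of them needs to serve the endpoint for the auth operator to be happy:

```
# Hit the oauth well-known endpoint on each kube-apiserver individually
for ip in 10.0.158.104 10.0.160.77 10.0.226.2; do
  curl -sk -o /dev/null -w "${ip} -> %{http_code}\n" \
    "https://${ip}:6443/.well-known/oauth-authorization-server"
done
```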
Also seems like lots of CI hits:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=need+at+least+3+kube-apiservers%2C+got+2&maxAge=24h&type=junit&groupBy=job' | grep 'failures match' | sort
pull-ci-openshift-cloud-credential-operator-master-e2e-aws-upgrade - 2 runs, 50% failed, 100% of failures match
...
pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-upgrade - 18 runs, 33% failed, 133% of failures match
rehearse-11729-pull-ci-openshift-cluster-etcd-operator-master-e2e - 2 runs, 50% failed, 100% of failures match
rehearse-11731-pull-ci-openshift-cluster-authentication-operator-master-e2e-upgrade - 6 runs, 33% failed, 50% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 9 runs, 89% failed, 13% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.6 - 9 runs, 22% failed, 50% of failures match
release-openshift-ocp-installer-e2e-ovirt-upgrade-4.5-stable-to-4.6-ci - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-disruptive-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci - 12 runs, 17% failed, 300% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 81 runs, 19% failed, 133% of failures match
release-openshift-origin-installer-e2e-azure-4.6 - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 4 runs, 100% failed, 25% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.6 - 4 runs, 100% failed, 25% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 21 runs, 38% failed, 100% of failures match
```

Not sure how many of those also failed install, but we are certainly making auth skittish in CI.
Root cause: OSD's clusters-service is setting `networkType: OpenshiftSDN`. Changing this to `networkType: OpenShiftSDN` results in a successful install. `OpenshiftSDN` is OK in 4.5, but not in 4.6. Why?
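For anyone checking whether their generated install-config has the same problem, a quick sketch (this is the standard installer `networking.networkType` field; per this bug, the lowercase-h spelling was tolerated in 4.5 but breaks 4.6):

```
# The value is case-sensitive:
#   networkType: OpenShiftSDN   <- installs fine
#   networkType: OpenshiftSDN   <- what clusters-service was emitting; 4.6 hangs on machine-config
grep -n 'networkType' install-config.yaml
```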
No longer an SD blocker, and caused by a bad install-config. We may open a fresh BZ to look into why `networkType: OpenshiftSDN` in the install-config breaks in 4.6 but not 4.5. Closing.
new bug = BZ 1877984
*** This bug has been marked as a duplicate of bug 1877984 ***