Description of problem:
In Service Delivery we are observing 4.6 nightly clusters consistently failing to install, with the installer error "failed to initialize the cluster: Cluster operator machine-config is still updating".

Version-Release number of the following components:
openshift-v4.6.0-0.nightly-2020-09-04-064335

How reproducible:
Consistently: https://testgrid.k8s.io/redhat-osde2e-int-aws#osde2e-int-aws-e2e-osd-default-plus-two-nightly

Steps to Reproduce:
1. Provision an OSD cluster in Int using a 4.6 nightly.
2. The cluster sits in the installing state until the install times out.
3. Check the installer logs.

Actual results:
"failed to initialize the cluster: Cluster operator machine-config is still updating"

Additional Information:
https://testgrid.k8s.io/redhat-osde2e-int-aws#osde2e-int-aws-e2e-osd-default-plus-two-nightly

Example failure: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/osde2e-int-aws-e2e-osd-default-plus-two-nightly/1303423056479260672
Here is a snippet from an example install log:

```
time="2020-09-09T16:17:44Z" level=info msg="Waiting up to 30m0s for the cluster at https://api.osde2e-ukg38.m4m1.i1.devshift.org:6443 to initialize..."
time="2020-09-09T16:18:15Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 1% complete"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: downloading update"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 5% complete"
time="2020-09-09T16:18:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 12% complete"
time="2020-09-09T16:19:02Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 80% complete"
time="2020-09-09T16:19:17Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 94% complete"
time="2020-09-09T16:20:17Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 96% complete"
time="2020-09-09T16:20:32Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 97% complete"
time="2020-09-09T16:21:47Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 97% complete, waiting on authentication, cloud-credential, cluster-autoscaler, console, image-registry, ingress, kube-storage-version-migrator, machine-api, machine-config, monitoring"
time="2020-09-09T16:24:32Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 98% complete"
time="2020-09-09T16:25:02Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 98% complete, waiting on authentication, cloud-credential, console, image-registry, ingress, kube-storage-version-migrator, machine-config, monitoring"
time="2020-09-09T16:26:02Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 98% complete"
time="2020-09-09T16:27:17Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 99% complete"
time="2020-09-09T16:28:32Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.251.78:6443/.well-known/oauth-authorization-server endpoint data"
time="2020-09-09T16:33:02Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2"
time="2020-09-09T16:33:21Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2"
time="2020-09-09T16:39:02Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.6.0-0.nightly-2020-09-09-111337: 100% complete, waiting on machine-config"
time="2020-09-09T16:44:47Z" level=debug msg="Still waiting for the cluster to initialize: Cluster operator machine-config is still updating"
time="2020-09-09T16:47:44Z" level=info msg="Cluster operator insights Disabled is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired"
time="2020-09-09T16:47:44Z" level=info msg="Cluster operator insights UploadDegraded is True with NotAuthorized: Reporting was not allowed: your Red Hat account is not enabled for remote support or your token has expired"
time="2020-09-09T16:47:44Z" level=info msg="Cluster operator machine-config Progressing is True with : Working towards 4.6.0-0.nightly-2020-09-09-111337"
time="2020-09-09T16:47:44Z" level=error msg="Cluster operator machine-config Degraded is True with RequiredPoolsFailed: Unable to apply 4.6.0-0.nightly-2020-09-09-111337: timed out waiting for the condition during syncRequiredMachineConfigPools: pool master has not progressed to latest configuration: configuration status for pool master is empty: pool is degraded because nodes fail with \"3 nodes are reporting degraded status on sync\": \"Node ip-10-0-178-115.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9a757a7b1d612460c62cc443aeaa3926\\\\\\\" not found\\\", Node ip-10-0-172-77.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9a757a7b1d612460c62cc443aeaa3926\\\\\\\" not found\\\", Node ip-10-0-251-78.ec2.internal is reporting: \\\"machineconfig.machineconfiguration.openshift.io \\\\\\\"rendered-master-9a757a7b1d612460c62cc443aeaa3926\\\\\\\" not found\\\"\", retrying"
time="2020-09-09T16:47:44Z" level=info msg="Cluster operator machine-config Available is False with : Cluster not available for 4.6.0-0.nightly-2020-09-09-111337"
time="2020-09-09T16:47:44Z" level=fatal msg="failed to initialize the cluster: Cluster operator machine-config is still updating"
```
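For anyone reproducing this, the `rendered-master-...` "not found" error above can be chased a bit further on a live cluster with something like the following (a sketch, assuming cluster-admin `oc` access; this is not something the installer does for you):

```
# Does the rendered MachineConfig the masters are asking for actually exist,
# and what does the master pool think its current/desired config is?
oc get machineconfig | grep rendered-master
oc get machineconfigpool master -o yaml
# List the MCO pods so you can pull controller/daemon logs from the relevant ones
oc -n openshift-machine-config-operator get pods
```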
Looking through the must-gather, I see in clusteroperators.yaml and clusterversion that auth degraded before the MCO errors above:

```
status:
  availableUpdates: null
  conditions:
  - lastTransitionTime: "2020-09-09T18:45:16Z"
    status: "False"
    type: Available
  - lastTransitionTime: "2020-09-09T19:06:12Z"
    message: 'Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: got ''404 Not Found'' status while trying to GET the OAuth well-known https://10.0.158.104:6443/.well-known/oauth-authorization-server endpoint data'
    reason: ClusterOperatorDegraded
    status: "True"
    type: Failing
  - lastTransitionTime: "2020-09-09T18:45:16Z"
    message: 'Unable to apply 4.6.0-0.nightly-2020-09-09-111337: the cluster operator authentication is degraded'
    reason: ClusterOperatorDegraded
    status: "True"
    type: Progressing
  - lastTransitionTime: "2020-09-09T18:45:16Z"
    message: 'Unable to retrieve available updates: currently reconciling cluster version 4.6.0-0.nightly-2020-09-09-111337 not found in the "stable-4.6" channel'
    reason: VersionNotFound
    status: "False"
```

I'll keep looking through the must-gather, but I'm passing this to the auth team to investigate the failure. Could you also let us know how this OSD deployment differs from a standard OCP deployment? And can you update your CI runs to include full must-gathers (as OCP CI does)? I'm unable to tell what errors you were seeing in past runs, and I see consistently failing runs clicking back through to May.
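On the must-gather request: for reference, a minimal sketch of what the CI job could run after a failed install to capture the same data OCP CI does (the destination paths here are placeholders, not the job's actual layout):

```
# Collect a full must-gather from the failed cluster and keep it with the job artifacts
oc adm must-gather --dest-dir="${ARTIFACT_DIR}/must-gather"
tar -czf "${ARTIFACT_DIR}/must-gather.tar.gz" -C "${ARTIFACT_DIR}" must-gather
```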
The MCO error looks like: https://bugzilla.redhat.com/show_bug.cgi?id=1835233
Adding to https://bugzilla.redhat.com/show_bug.cgi?id=1877481#c2: do you do any additional configuration at install time? Also, when did these auth/mco/etc. errors start appearing in your CI runs? (We have no must-gathers other than the single one attached here, so we can't tell.)
FYI, the auth operator reporting degraded is most commonly a canary rather than an indication that the operator itself is the source of the problem. In this case, the auth operator reporting `need at least 3 kube-apiservers, got 2` is an indication that one of the kube-apiservers is not available. During an install or upgrade, a common reason for that symptom is that one of the apiservers is being restarted as part of a staged rollout initiated in response to a configuration change. Currently each rollout of the 3 kube-apiservers can take up to 15 minutes, which is longer than in previous releases.

I don't see an obvious reason for failure other than the install not allowing enough time for the configuration changes to be applied via staged rollout. I suggest that whoever is responsible for the job configuration compare the timeout used for the common presubmits (e.g. e2e-gcp-upgrade) with that used for the periodic in question. If they differ substantially, the job timeout duration could be the problem.

Assigning to the apiserver experts in case there's something I've missed.
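For whoever picks this up, a quick sketch of how to see where the kube-apiserver staged rollout stands on a cluster in this state (assumes cluster access; the fields are the usual static-pod operator status, adjust if they differ):

```
# Which revision is each master on, and is one still converging toward the target?
oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" current="}{.currentRevision}{" target="}{.targetRevision}{"\n"}{end}'
# And the pods themselves, to see if one is Pending/restarting
oc get pods -n openshift-kube-apiserver -o wide
```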
> do you do any additional configuration at install time?

We do. What is the best way to get that to you -- should we share the install-config?

> Also can you update your CI runs to include full must-gathers

I'll try to make that happen.
Poking around in comment 6's must-gather:

```
$ for X in cluster-scoped-resources/core/nodes/*; do yaml2json < "${X}" | jq -r '.metadata.name as $name | .metadata.annotations["machine.openshift.io/machine"] as $machine | .status.conditions[] | select(.type == "Ready") | .lastTransitionTime + " " + .status + " " + $name + " " + $machine'; done | sort
2020-09-09T18:51:28Z True ip-10-0-160-77.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-master-1
2020-09-09T18:51:49Z True ip-10-0-226-2.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-master-2
2020-09-09T18:51:54Z True ip-10-0-158-104.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-master-0
2020-09-09T19:00:43Z True ip-10-0-254-204.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-worker-us-east-1a-2n8gt
2020-09-09T19:00:56Z True ip-10-0-247-111.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-worker-us-east-1a-rx4n4
2020-09-09T19:01:04Z True ip-10-0-246-132.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-worker-us-east-1a-p8kdj
2020-09-09T19:01:43Z True ip-10-0-133-135.ec2.internal openshift-machine-api/osde2e-ukg38-6nc77-worker-us-east-1a-bb77f

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusterversions/version.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-09-09T18:45:16Z Available=False :
2020-09-09T18:45:16Z Progressing=True ClusterOperatorDegraded: Unable to apply 4.6.0-0.nightly-2020-09-09-111337: the cluster operator authentication is degraded
2020-09-09T18:45:16Z RetrievedUpdates=False VersionNotFound: Unable to retrieve available updates: currently reconciling cluster version 4.6.0-0.nightly-2020-09-09-111337 not found in the "stable-4.6" channel
2020-09-09T19:06:12Z Failing=True ClusterOperatorDegraded: Cluster operator authentication is reporting a failure: WellKnownReadyControllerDegraded: got '404 Not Found' status while trying to GET the OAuth well-known https://10.0.158.104:6443/.well-known/oauth-authorization-server endpoint data

$ yaml2json <cluster-scoped-resources/config.openshift.io/clusteroperators/authentication.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-09-09T18:51:56Z Available=False WellKnown_NotReady: WellKnownAvailable: The well-known endpoint is not yet available: need at least 3 kube-apiservers, got 2
2020-09-09T18:51:56Z Upgradeable=True AsExpected:
2020-09-09T18:53:56Z Degraded=True WellKnownReadyController_SyncError: WellKnownReadyControllerDegraded: need at least 3 kube-apiservers, got 2
2020-09-09T19:02:11Z Progressing=False AsExpected:

$ for X in namespaces/openshift-kube-apiserver/pods/kube-apiserver-*/kube-apiserver-*.yaml; do yaml2json < "${X}" | jq -r '.status.phase + " " + .metadata.name'; done
Pending kube-apiserver-ip-10-0-158-104.ec2.internal
Running kube-apiserver-ip-10-0-160-77.ec2.internal
Running kube-apiserver-ip-10-0-226-2.ec2.internal

$ yaml2json <namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-158-104.ec2.internal/kube-apiserver-ip-10-0-158-104.ec2.internal.yaml | jq -r '.status.conditions[] | .lastTransitionTime + " " + .type + "=" + .status + " " + .reason + ": " + .message' | sort
2020-09-09T18:56:23Z PodScheduled=True :
2020-09-09T19:06:18Z ContainersReady=False ContainersNotReady: containers with unready status: [kube-apiserver kube-apiserver-cert-syncer kube-apiserver-cert-regeneration-controller kube-apiserver-insecure-readyz kube-apiserver-check-endpoints]
2020-09-09T19:06:18Z Initialized=False ContainersNotInitialized: containers with incomplete status: [setup]
2020-09-09T19:06:18Z Ready=False ContainersNotReady: containers with unready status: [kube-apiserver kube-apiserver-cert-syncer kube-apiserver-cert-regeneration-controller kube-apiserver-insecure-readyz kube-apiserver-check-endpoints]

$ yaml2json <namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-158-104.ec2.internal/kube-apiserver-ip-10-0-158-104.ec2.internal.yaml | jq -r .status.initContainerStatuses
[
  {
    "name": "setup",
    "ready": "false",
    "imageID": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbbdc954cefdfc297e03ef3b9b36211b8f7378378a89e071585e70b1121161d2",
    "lastState": {},
    "image": "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bbbdc954cefdfc297e03ef3b9b36211b8f7378378a89e071585e70b1121161d2",
    "state": {
      "running": {
        "startedAt": "2020-09-09T19:06:19Z"
      }
    },
    "containerID": "cri-o://42324264d083b6652df85702d882fe50742241290088ec8cb17e1a9a6d60dadd",
    "restartCount": "0"
  }
]

$ cat namespaces/openshift-kube-apiserver/pods/kube-apiserver-ip-10-0-158-104.ec2.internal/setup/setup/logs/current.log
2020-09-09T19:06:19.539684924Z Fixing audit permissions.2020-09-09T19:06:19.541907798Z Waiting for port :6443 and :6080 to be released.2020-09-09T19:06:19.549586347Z .2020-09-09T19:06:20.560841176Z .2020-09-09T19:06:21.565801200Z .2020-09-09T19:06:22.570561340Z .2020-09-09T19:06:23.575362818Z .2020-09-09T19:06:24.583315221Z .2020-09-09T19:06:25.587639733Z .2020-09-09T19:06:26.592153247Z .2020-09-09T19:06:27.596626613Z .2020-09-09T19:06:28.601329499Z .2020-09-09T19:06:29.606262278Z .2020-09-09T19:06:30.610849437Z .2020-09-09T19:06:31.617428753Z .2020-09-09T19:06:32.622060791Z .2020-09-09T19:06:33.626518033Z .2020-09-09T19:06:34.632391083Z .2020-09-09T19:06:35.637264109Z .2020-09-09T19:06:36.641862868Z .2020-09-09T19:06:37.647128745Z .2020-09-09T19:06:38.652486458Z .2020-09-09T19:06:39.657341762Z .2020-09-09T19:06:40.662505710Z .2020-09-09T19:06:41.667058974Z .2020-09-09T19:06:42.671613051Z .2020-09-09T19:06:43.676154458Z .2020-09-09T19:06:44.697024774Z .2020-09-09T19:06:45.694892277Z .2020-09-09T19:06:46.699285060Z .2020-09-09T19:06:47.704090711Z .2020-09-09T19:06:48.708667064Z .2020-09-09T19:06:49.714727402Z .2020-09-09T19:06:50.720228979Z .2020-09-09T19:06:51.725158628Z .
```

So... something is hanging on to these ports on that node, and the setup container is just hanging out waiting for them to get free? Kernel issue? Or maybe not, because "Waiting for port :6443 and :6080 to be released" shows up in a dozen existing bugs, but none of them are still open, and none of them were reported against 4.6. Also, not sure what's writing these logs, but newlines are cheap ;). And it's also not clear to me why auth wants all three kube-apiserver pods up, I'd expect the loadbalancer to mean we only needed a single, healthy kube-apiserver pod to make consumers like the auth operator happy.
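If someone catches a live cluster in this state, here's a sketch of how I'd check what is still holding those ports on the node (I haven't verified this exact invocation against this cluster, so treat it as a starting point):

```
# What process is still bound to 6443/6080 on the affected master?
oc debug node/ip-10-0-158-104.ec2.internal -- chroot /host ss -tlnp | grep -E ':6443|:6080'
```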
> And it's also not clear to me why auth wants all three kube-apiserver pods up, I'd expect the loadbalancer to mean we only needed a single, healthy kube-apiserver pod to make consumers like the auth operator happy.

No, for oauth to work we have to make sure all of them serve the oauth .well-known endpoint.
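A quick sketch for checking that directly, using the master IPs from the must-gather above (`-k` because the serving cert won't be valid for a bare IP); each of them needs to serve the endpoint for the auth operator to be happy:

```
# Hit the oauth well-known endpoint on each kube-apiserver individually
for ip in 10.0.158.104 10.0.160.77 10.0.226.2; do
  curl -sk -o /dev/null -w "${ip} -> %{http_code}\n" \
    "https://${ip}:6443/.well-known/oauth-authorization-server"
done
```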
Also seems like lots of CI hits:

```
$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=need+at+least+3+kube-apiservers%2C+got+2&maxAge=24h&type=junit&groupBy=job' | grep 'failures match' | sort
pull-ci-openshift-cloud-credential-operator-master-e2e-aws-upgrade - 2 runs, 50% failed, 100% of failures match
...
pull-ci-operator-framework-operator-lifecycle-manager-master-e2e-upgrade - 18 runs, 33% failed, 133% of failures match
rehearse-11729-pull-ci-openshift-cluster-etcd-operator-master-e2e - 2 runs, 50% failed, 100% of failures match
rehearse-11731-pull-ci-openshift-cluster-authentication-operator-master-e2e-upgrade - 6 runs, 33% failed, 50% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.6 - 9 runs, 89% failed, 13% of failures match
release-openshift-ocp-installer-e2e-gcp-rt-4.6 - 9 runs, 22% failed, 50% of failures match
release-openshift-ocp-installer-e2e-ovirt-upgrade-4.5-stable-to-4.6-ci - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-disruptive-4.6 - 1 runs, 100% failed, 100% of failures match
release-openshift-origin-installer-e2e-aws-upgrade-4.5-stable-to-4.6-ci - 12 runs, 17% failed, 300% of failures match
release-openshift-origin-installer-e2e-aws-upgrade - 81 runs, 19% failed, 133% of failures match
release-openshift-origin-installer-e2e-azure-4.6 - 2 runs, 50% failed, 100% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.5-stable-to-4.6-ci - 4 runs, 100% failed, 25% of failures match
release-openshift-origin-installer-e2e-azure-upgrade-4.6 - 4 runs, 100% failed, 25% of failures match
release-openshift-origin-installer-e2e-gcp-upgrade - 21 runs, 38% failed, 100% of failures match
```

Not sure how many of those also failed install, but we are certainly making auth skittish in CI.
Root cause: OSD's clusters-service is setting `networkType: OpenshiftSDN`. Changing this to `networkType: OpenShiftSDN` results in a successful install. `OpenshiftSDN` is OK in 4.5, but not in 4.6. Why?
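For anyone checking whether their generated install-config has the same problem, a quick sketch (this is the standard installer `networking.networkType` field; per this bug, the lowercase-h spelling was tolerated in 4.5 but breaks 4.6):

```
# The value is case-sensitive:
#   networkType: OpenShiftSDN   <- installs fine
#   networkType: OpenshiftSDN   <- what clusters-service was emitting; 4.6 hangs on machine-config
grep -n 'networkType' install-config.yaml
```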
No longer an SD blocker, and caused by a bad install-config. We may open a fresh BZ to look into why `networkType: OpenshiftSDN` in the install-config breaks in 4.6 but not 4.5. Closing.
new bug = BZ 1877984
*** This bug has been marked as a duplicate of bug 1877984 ***