Description of problem:
Nodes report machineconfiguration.openshift.io/state=Degraded, and it is not possible to establish an ssh connection with the cluster after deploy.

How reproducible:
1. Build 47.330 (ami-0a0438a5e62dc0f9f) + registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 (always)
2. Released version v0.14.0 (always)
3. Build 400.7.20190312.0 (ami-0eecbb884c8b35b1e) + registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 (partially): machineconfiguration.openshift.io/state=Done, but unable to establish an ssh connection

Steps to Reproduce:
1. Install the cluster.
2. Log in to the cluster.
3. Run "oc describe node | grep machineconfig".
4. Create the following yaml file to add your public ssh key to the remote machines:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 00-master-ssh-02
spec:
  config:
    passwd:
      users:
      - name: core
        sshAuthorizedKeys: |
          ssh-rsa

Then run "oc create -f 00-master-ssh-update-02.yaml".
5. Create the bastion ssh pod:
curl https://raw.githubusercontent.com/eparis/ssh-bastion/master/deploy/deploy.sh | bash
6. Try to establish a connection:
ssh -i ~/id_rsa -o StrictHostKeyChecking=no -o ProxyCommand='ssh -i ~/id_rsa -A -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -W %h:%p core@$(oc get service -n openshift-ssh-bastion ssh-bastion -o jsonpath="{.status.loadBalancer.ingress[0]..hostname}")' core.compute.internal

Actual results:
3.
[cloud-user@preserve-olnester-workstation openshift]$ oc describe node|grep machineconfig
machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
machineconfiguration.openshift.io/state=Degraded
machineconfiguration.openshift.io/currentConfig=master-a9186123f0b543f64cc430e1c11a9f67
machineconfiguration.openshift.io/desiredConfig=master-a9186123f0b543f64cc430e1c11a9f67
machineconfiguration.openshift.io/state=Degraded
machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
machineconfiguration.openshift.io/state=Degraded
machineconfiguration.openshift.io/currentConfig=master-a9186123f0b543f64cc430e1c11a9f67
machineconfiguration.openshift.io/desiredConfig=master-a9186123f0b543f64cc430e1c11a9f67
machineconfiguration.openshift.io/state=Degraded
machineconfiguration.openshift.io/currentConfig=master-a9186123f0b543f64cc430e1c11a9f67
machineconfiguration.openshift.io/desiredConfig=master-a9186123f0b543f64cc430e1c11a9f67
machineconfiguration.openshift.io/state=Degraded
machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
machineconfiguration.openshift.io/state=Degraded

6.
[cloud-user@preserve-olnester-workstation ~]$ ssh -i ~/id_rsa -o StrictHostKeyChecking=no -o ProxyCommand='ssh -i ~/id_rsa -A -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -W %h:%p core@$(oc get service -n openshift-ssh-bastion ssh-bastion -o jsonpath="{.status.loadBalancer.ingress[0]..hostname}")' core.compute.internal
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host

Expected results:
3. machineconfiguration.openshift.io/state=Done
6.
SSH connection is established

Additional info:
Here are the logs from machine-config-daemon:

[cloud-user@preserve-olnester-workstation openshift]$ oc logs machine-config-daemon-4tm5l -n openshift-machine-config-operator
I0313 11:17:05.160638 5791 start.go:52] Version: 4.0.16-1-dirty
I0313 11:17:05.161210 5791 start.go:88] starting node writer
I0313 11:17:05.166939 5791 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0313 11:17:05.246718 5791 daemon.go:175] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 (47.330)
I0313 11:17:05.249128 5791 daemon.go:247] Managing node: ip-10-0-133-188.us-east-2.compute.internal
I0313 11:17:05.261483 5791 node.go:39] Setting initial node config: worker-f774332d8e959cfd954ac6f98faa3660
I0313 11:17:05.277596 5791 start.go:146] Calling chroot("/rootfs")
I0313 11:17:05.277632 5791 run.go:22] Running captured: rpm-ostree status
I0313 11:17:05.305453 5791 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
    CustomOrigin: Managed by pivot tool
    Version: 47.330 (2019-02-23T04:17:13Z)

  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:aff3ff28e3caa6ffe171cabad0048ab140825059b53c5da1587f14c5597f6823
    CustomOrigin: Provisioned from oscontainer
    Version: 400.7.20190312.0 (2019-03-12T14:19:09Z)
I0313 11:17:05.305504 5791 daemon.go:477] In bootstrap mode
I0313 11:17:05.316144 5791 daemon.go:505] Current+desired config: worker-f774332d8e959cfd954ac6f98faa3660
I0313 11:17:05.345993 5791 daemon.go:613] No bootstrap pivot required; unlinking bootstrap node annotations
I0313 11:17:05.347989 5791 daemon.go:640] Validated on-disk state
I0313 11:17:05.348008 5791 daemon.go:672] In desired config worker-f774332d8e959cfd954ac6f98faa3660
I0313 11:17:05.348031 5791 start.go:165] Starting MachineConfigDaemon
I0313 11:17:05.348041 5791 daemon.go:314] Enabling Kubelet Healthz Monitor
E0313 11:18:00.502169 5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""
E0313 13:01:28.406586 5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=31, ErrCode=NO_ERROR, debug=""
E0313 13:09:00.971241 5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""
E0313 13:11:22.625361 5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""
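The annotation filter from step 3 of the reproduction can be exercised without a cluster. A minimal sketch, using sample lines copied from the actual results above as stand-in input (on a live cluster the input would come from `oc describe node` itself):

```shell
# Sample annotation lines as printed by `oc describe node | grep machineconfig`,
# copied from the degraded cluster above.
sample='machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
machineconfiguration.openshift.io/state=Degraded'

# Keep only the state annotation and strip the key, leaving just the value
# (prints "Degraded" for the sample above).
printf '%s\n' "$sample" | grep 'machineconfiguration.openshift.io/state=' | cut -d= -f2
```

A healthy node would print "Done" here instead, which is the expected result in step 3.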
I hit a similar issue in my cluster (47.330 + 4.0.0-0.nightly-2019-03-04-234414).

# oc logs machine-config-daemon-kl5f5 -n openshift-machine-config-operator
I0313 09:16:14.520551 21098 start.go:52] Version: 4.0.15-1-dirty
I0313 09:16:14.520825 21098 start.go:88] starting node writer
I0313 09:16:14.526389 21098 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0313 09:16:14.596006 21098 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
I0313 09:16:14.596812 21098 daemon.go:247] Managing node: ip-10-0-151-157.us-east-2.compute.internal
I0313 09:16:14.621930 21098 node.go:39] Setting initial node config: master-1fd62473a6bce230df3d90a1f6109081
I0313 09:16:14.644060 21098 start.go:146] Calling chroot("/rootfs")
I0313 09:16:14.644095 21098 run.go:22] Running captured: rpm-ostree status
I0313 09:16:14.671146 21098 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b
    CustomOrigin: Provisioned from oscontainer
    Version: 47.330 (2019-02-23T04:17:13Z)
I0313 09:16:14.671197 21098 daemon.go:477] In bootstrap mode
I0313 09:16:14.683197 21098 daemon.go:505] Current+desired config: master-1fd62473a6bce230df3d90a1f6109081
I0313 09:16:14.686404 21098 daemon.go:609] Bootstrap pivot required to: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
I0313 09:16:14.686534 21098 update.go:674] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: I0313 09:16:20.041972 21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:16:20Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:16:20.390391 21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:16:30.390660 21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:16:30Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:16:30.693978 21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:16:50.694228 21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:16:50Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:16:50.995356 21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:17:30.995530 21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:17:31Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:17:31.315975 21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:18:51.316209 21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:18:51Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:18:51.664982 21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: F0313 09:18:51.665013 21151 run.go:53] skopeo: timed out waiting for the condition
E0313 09:18:51.674651 21098 daemon.go:446] Fatal error checking initial state of node: Checking initial state: Failed to run pivot: error queuing start job; got failed
E0313 09:18:51.674908 21098 writer.go:91] Marking degraded due to: Checking initial state: Failed to run pivot: error queuing start job; got failed
I0313 09:18:51.717659 21098 daemon.go:448] Entering degraded state; going to sleep

This puts every machine into the Degraded state, and the cluster becomes unhealthy.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-04-234414   True        False         7h9m    Error while reconciling 4.0.0-0.nightly-2019-03-04-234414: the cluster operator machine-config is failing
> pivot.service: I0313 09:17:30.995530 21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
> pivot.service: time="2019-03-13T09:17:31Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"

Sounds like a pull-secret issue. Maybe like bug 1686556, where non-kubelet tooling isn't looking in the right place for the pull secret.
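If the pull-secret theory holds, the failure mode is that the credentials file pivot's skopeo call ends up reading has no auth entry for quay.io. A rough sketch of that check; the JSON below is a hypothetical stand-in for the real pull secret, and the grep is only a crude presence test, not a full JSON parse:

```shell
# Hypothetical .dockerconfigjson contents with an entry for the CI registry
# but none for quay.io (the real secret is cluster-specific).
pullsecret='{"auths":{"registry.svc.ci.openshift.org":{"auth":"Li4u"}}}'

# Crude presence test: does any auth entry mention quay.io?
if printf '%s' "$pullsecret" | grep -q '"quay.io"'; then
  echo "quay.io auth present"
else
  echo "no quay.io auth: skopeo inspect against quay.io would fail with unauthorized"
fi
```

With the stand-in secret above, the check reports the missing quay.io entry, which matches the "unauthorized" errors in the quoted pivot.service log.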
> I0313 09:16:14.596006 21098 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)

RHCOS 47.330 is too old. Try running with a newer RHCOS (I haven't looked up exactly which version pivot#41 landed in, though).
(In reply to W. Trevor King from comment #3)
> > I0313 09:16:14.596006 21098 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
>
> RHCOS 47.330 is too old. Try running with a newer RHCOS (I haven't looked
> up exactly which version pivot#41 landed in though).

Just like the initial report shows, the 400.7.20190312.0 rhcos build has no such error. This issue probably would not happen with a newer rhcos, but after I talked with Oleg, we can still reproduce this bug with the released beta2. How do we fix it for the released beta2? As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1677198#c8, QE is running beta2 release version -> beta3 pre-release candidate upgrade testing, and this issue is a big problem for that testing.
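For comment 3's "too old" point, the two builds seen in the logs can be compared mechanically; GNU sort's -V mode understands these dotted build strings, so the older build sorts first. A small sketch (the variable names and the echo message are illustrative, not output of any OpenShift tool):

```shell
# Build strings taken from the logs and comments above.
booted="47.330"           # RHCOS build reported by daemon.go:175
fixed="400.7.20190312.0"  # newer build that, per comment 4, avoids the error

# sort -V orders version-like strings; head -n1 yields the older of the two.
older=$(printf '%s\n%s\n' "$booted" "$fixed" | sort -V | head -n1)
if [ "$older" = "$booted" ] && [ "$booted" != "$fixed" ]; then
  echo "booted RHCOS $booted predates $fixed; consider a newer RHCOS build"
fi
```

This only flags the mismatch; it does not answer the beta2 question, since beta2 clusters are pinned to the older build.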