Bug 1688321 - Unhealthy status of machineconfiguration.openshift.io/state=Degraded
Summary: Unhealthy status of machineconfiguration.openshift.io/state=Degraded
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.1.0
Assignee: Steve Milner
QA Contact: Micah Abbott
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2019-03-13 14:35 UTC by Oleg Nesterov
Modified: 2019-03-14 18:47 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-03-14 18:47:01 UTC
Target Upstream Version:
Embargoed:




Links
System ID: Github openshift pivot pull 41 (Priority: None, Status: closed)
Summary: Use kubelet auth if available
Last Updated: 2021-02-02 21:41:08 UTC

Description Oleg Nesterov 2019-03-13 14:35:08 UTC
Description of problem:
Unhealthy status of machineconfiguration.openshift.io/state=Degraded and unable to establish ssh connection with cluster after deploy.

How reproducible:
1. Build 47.330 (ami-0a0438a5e62dc0f9f) + registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 (always)
2. Released version v0.14.0 (always)
3. Build 400.7.20190312.0 (ami-0eecbb884c8b35b1e) + registry.svc.ci.openshift.org/ocp/release:4.0.0-0.nightly-2019-03-06-074438 (partially): machineconfiguration.openshift.io/state=Done, but unable to establish an ssh connection

Steps to Reproduce:
1. Install cluster
2. Login to cluster
3. Run "oc describe node|grep machineconfig"
4. Create the following YAML file to add your public SSH key to the remote machines:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 00-master-ssh-02
spec:
  config:
    passwd:
      users:
      - name: core
        sshAuthorizedKeys:
        - ssh-rsa 

Run "oc create -f 00-master-ssh-update-02.yaml"
5. Create bastion ssh pod: "curl https://raw.githubusercontent.com/eparis/ssh-bastion/master/deploy/deploy.sh | bash"

6. Try to establish a connection:
ssh -i ~/id_rsa -o StrictHostKeyChecking=no -o ProxyCommand='ssh -i ~/id_rsa -A -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -W %h:%p core@$(oc get service -n openshift-ssh-bastion ssh-bastion -o jsonpath="{.status.loadBalancer.ingress[0]..hostname}")' core.compute.internal
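The step 6 one-liner is dense; it can be decomposed for inspection. In this sketch, `bastion.example.com` is a hypothetical stand-in for the load-balancer hostname returned by the `oc get service` lookup (which needs a live cluster), and the final command is printed rather than executed:

```shell
# Decomposition of the step 6 ssh command.
# BASTION is a placeholder for the real bastion hostname; on a live cluster:
#   BASTION=$(oc get service -n openshift-ssh-bastion ssh-bastion \
#     -o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
BASTION=bastion.example.com
# Inner hop: agent-forwarded connection through the bastion to the node port.
PROXY="ssh -i ~/id_rsa -A -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -W %h:%p core@$BASTION"
# Outer hop: ssh to the node's internal name via the ProxyCommand tunnel.
cmd="ssh -i ~/id_rsa -o StrictHostKeyChecking=no -o ProxyCommand='$PROXY' core.compute.internal"
echo "$cmd"
```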


Actual results:
3. [cloud-user@preserve-olnester-workstation openshift]$ oc describe node|grep machineconfig
                    machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/desiredConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/desiredConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/desiredConfig=master-a9186123f0b543f64cc430e1c11a9f67
                    machineconfiguration.openshift.io/state=Degraded
                    machineconfiguration.openshift.io/currentConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/desiredConfig=worker-17a762de663485fd7cca47e768e3c31e
                    machineconfiguration.openshift.io/state=Degraded

6. [cloud-user@preserve-olnester-workstation ~]$ ssh -i ~/id_rsa -o StrictHostKeyChecking=no -o ProxyCommand='ssh -i ~/id_rsa -A -o StrictHostKeyChecking=no -o ServerAliveInterval=30 -W %h:%p core@$(oc get service -n openshift-ssh-bastion ssh-bastion -o jsonpath="{.status.loadBalancer.ingress[0]..hostname}")' core.compute.internal
ssh_exchange_identification: Connection closed by remote host
ssh_exchange_identification: Connection closed by remote host
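The Degraded annotations in the step 3 output can be counted rather than eyeballed. A minimal grep sketch, run here against a captured sample so it works without a cluster; against a live cluster, pipe `oc describe node` into the same grep:

```shell
# Count Degraded machineconfig state annotations in node descriptions.
# `sample` is a captured excerpt for illustration; on a live cluster use:
#   oc describe node | grep -c 'machineconfiguration.openshift.io/state=Degraded'
sample='machineconfiguration.openshift.io/state=Degraded
machineconfiguration.openshift.io/state=Done
machineconfiguration.openshift.io/state=Degraded'
degraded=$(printf '%s\n' "$sample" | grep -c 'state=Degraded')
echo "degraded: $degraded"
```

A nonzero count means at least one machine-config-daemon has marked its node Degraded, as in the output above.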


Expected results:
3. machineconfiguration.openshift.io/state=Done
6. SSH connection is established 

Additional info:
Here are logs from  machine-config-daemon:

[cloud-user@preserve-olnester-workstation openshift]$ oc logs machine-config-daemon-4tm5l -n openshift-machine-config-operator
I0313 11:17:05.160638    5791 start.go:52] Version: 4.0.16-1-dirty
I0313 11:17:05.161210    5791 start.go:88] starting node writer
I0313 11:17:05.166939    5791 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0313 11:17:05.246718    5791 daemon.go:175] Booted osImageURL: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 (47.330)
I0313 11:17:05.249128    5791 daemon.go:247] Managing node: ip-10-0-133-188.us-east-2.compute.internal
I0313 11:17:05.261483    5791 node.go:39] Setting initial node config: worker-f774332d8e959cfd954ac6f98faa3660
I0313 11:17:05.277596    5791 start.go:146] Calling chroot("/rootfs")
I0313 11:17:05.277632    5791 run.go:22] Running captured: rpm-ostree status
I0313 11:17:05.305453    5791 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
              CustomOrigin: Managed by pivot tool
                   Version: 47.330 (2019-02-23T04:17:13Z)

  pivot://docker-registry-default.cloud.registry.upshift.redhat.com/redhat-coreos/maipo@sha256:aff3ff28e3caa6ffe171cabad0048ab140825059b53c5da1587f14c5597f6823
              CustomOrigin: Provisioned from oscontainer
                   Version: 400.7.20190312.0 (2019-03-12T14:19:09Z)
I0313 11:17:05.305504    5791 daemon.go:477] In bootstrap mode
I0313 11:17:05.316144    5791 daemon.go:505] Current+desired config: worker-f774332d8e959cfd954ac6f98faa3660
I0313 11:17:05.345993    5791 daemon.go:613] No bootstrap pivot required; unlinking bootstrap node annotations
I0313 11:17:05.347989    5791 daemon.go:640] Validated on-disk state
I0313 11:17:05.348008    5791 daemon.go:672] In desired config worker-f774332d8e959cfd954ac6f98faa3660
I0313 11:17:05.348031    5791 start.go:165] Starting MachineConfigDaemon
I0313 11:17:05.348041    5791 daemon.go:314] Enabling Kubelet Healthz Monitor
E0313 11:18:00.502169    5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=11, ErrCode=NO_ERROR, debug=""
E0313 13:01:28.406586    5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=31, ErrCode=NO_ERROR, debug=""
E0313 13:09:00.971241    5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""
E0313 13:11:22.625361    5791 streamwatcher.go:109] Unable to decode an event from the watch stream: http2: server sent GOAWAY and closed the connection; LastStreamID=1, ErrCode=NO_ERROR, debug=""

Comment 1 Johnny Liu 2019-03-13 16:36:26 UTC
I hit similar issue too in my cluster (47.330 + 4.0.0-0.nightly-2019-03-04-234414)

# oc logs machine-config-daemon-kl5f5 -n openshift-machine-config-operator
I0313 09:16:14.520551   21098 start.go:52] Version: 4.0.15-1-dirty
I0313 09:16:14.520825   21098 start.go:88] starting node writer
I0313 09:16:14.526389   21098 run.go:22] Running captured: chroot /rootfs rpm-ostree status --json
I0313 09:16:14.596006   21098 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
I0313 09:16:14.596812   21098 daemon.go:247] Managing node: ip-10-0-151-157.us-east-2.compute.internal
I0313 09:16:14.621930   21098 node.go:39] Setting initial node config: master-1fd62473a6bce230df3d90a1f6109081
I0313 09:16:14.644060   21098 start.go:146] Calling chroot("/rootfs")
I0313 09:16:14.644095   21098 run.go:22] Running captured: rpm-ostree status
I0313 09:16:14.671146   21098 daemon.go:577] State: idle
AutomaticUpdates: disabled
Deployments:
* pivot://registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b
              CustomOrigin: Provisioned from oscontainer
                   Version: 47.330 (2019-02-23T04:17:13Z)
I0313 09:16:14.671197   21098 daemon.go:477] In bootstrap mode
I0313 09:16:14.683197   21098 daemon.go:505] Current+desired config: master-1fd62473a6bce230df3d90a1f6109081
I0313 09:16:14.686404   21098 daemon.go:609] Bootstrap pivot required to: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
I0313 09:16:14.686534   21098 update.go:674] Updating OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: I0313 09:16:20.041972   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:16:20Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:16:20.390391   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:16:30.390660   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:16:30Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:16:30.693978   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:16:50.694228   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:16:50Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:16:50.995356   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:17:30.995530   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:17:31Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:17:31.315975   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: I0313 09:18:51.316209   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
pivot.service: time="2019-03-13T09:18:51Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"
pivot.service: W0313 09:18:51.664982   21151 run.go:45] skopeo failed: exit status 1; retrying...
pivot.service: F0313 09:18:51.665013   21151 run.go:53] skopeo: timed out waiting for the condition
E0313 09:18:51.674651   21098 daemon.go:446] Fatal error checking initial state of node: Checking initial state: Failed to run pivot: error queuing start job; got failed
E0313 09:18:51.674908   21098 writer.go:91] Marking degraded due to: Checking initial state: Failed to run pivot: error queuing start job; got failed
I0313 09:18:51.717659   21098 daemon.go:448] Entering degraded state; going to sleep


This puts all machines into the `Degraded` state, and the cluster becomes unhealthy.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.nightly-2019-03-04-234414   True        False         7h9m    Error while reconciling 4.0.0-0.nightly-2019-03-04-234414: the cluster operator machine-config is failing

Comment 2 W. Trevor King 2019-03-13 16:50:20 UTC
> pivot.service: I0313 09:17:30.995530   21151 run.go:16] Running: skopeo inspect docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181
> pivot.service: time="2019-03-13T09:17:31Z" level=fatal msg="Error reading manifest sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: unauthorized: access to the requested resource is not authorized"

Sounds like a pull-secret issue.  Maybe like bug 1686556, where non-kubelet tooling isn't looking in the right place for the pull secret.
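The linked pivot PR ("Use kubelet auth if available") suggests the shape of the fix: pivot's skopeo call was not supplying the node's pull credentials. A sketch of the idea, assuming (not confirmed by this report) that the kubelet's pull secret lives at /var/lib/kubelet/config.json on RHCOS; the command is printed rather than run, since executing it needs registry access:

```shell
# Assumption: the kubelet stores node pull credentials at this path on RHCOS.
# Passing it to skopeo via --authfile should avoid the "unauthorized:
# access to the requested resource is not authorized" manifest error.
AUTHFILE=/var/lib/kubelet/config.json
IMAGE='quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:399582f711226ab1a0e76d8928ec55436dea9f8dc60976c10790d308b9d92181'
cmd="skopeo inspect --authfile $AUTHFILE docker://$IMAGE"
echo "$cmd"
```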

Comment 3 W. Trevor King 2019-03-13 16:54:59 UTC
> I0313 09:16:14.596006   21098 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)

RHCOS 47.330 is too old.  Try running with a newer RHCOS (I haven't looked up exactly which version pivot#41 landed in though).

Comment 4 Johnny Liu 2019-03-13 17:27:17 UTC
(In reply to W. Trevor King from comment #3)
> > I0313 09:16:14.596006   21098 daemon.go:175] Booted osImageURL: registry.svc.ci.openshift.org/rhcos/maipo@sha256:1262533e31a427917f94babeef2774c98373409897863ae742ff04120f32f79b (47.330)
> 
> RHCOS 47.330 is too old.  Try running with a newer RHCOS (I haven't looked
> up exactly which version pivot#41 landed in though).

Just like the initial report, the 400.7.20190312.0 RHCOS build does not hit this error. This issue probably would not happen with a newer RHCOS, but after I talked with Oleg, we could still reproduce this bug with the released beta2. How can it be fixed for the released beta2? As I mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1677198#c8, QE is running beta2 release version -> beta3 pre-release candidate upgrade testing, and this issue seems to be a big problem for that testing.

