Bug 1702832

Summary: [upgrade] clusteroperator/machine-config changed Degraded to True: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 0, updated: 0, unavailable: 1)
Product: OpenShift Container Platform
Component: Machine Config Operator
Version: 4.1.0
Target Release: 4.1.0
Target Milestone: ---
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Keywords: BetaBlocker, TestBlocker, Upgrades
Reporter: W. Trevor King <wking>
Assignee: Antonio Murdaca <amurdaca>
QA Contact: Micah Abbott <miabbott>
CC: ccoleman, crawford, miabbott, piqin, sponnaga, trankin, wsun
Type: Bug
Last Closed: 2019-06-04 10:47:56 UTC
Attachments: MCD Logs

Description W. Trevor King 2019-04-24 21:45:38 UTC
Description of problem:

Upgrade jobs frequently (~44% over the past 24 hours) have errors like:

  $ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706 | grep machine-config | head -n1
  Apr 24 19:25:24.870 E clusteroperator/machine-config changed Degraded to True: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 1, updated: 1, unavailable: 1): Unable to apply 4.1.0-0.ci-2019-04-24-181848: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 1, updated: 1, unavailable: 1)

and:

  $ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706 | grep machine-config | head -n1
  Apr 24 19:22:27.698 E clusteroperator/machine-config changed Degraded to True: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 0, updated: 0, unavailable: 1): Failed to resync 4.1.0-0.ci-2019-04-24-165200 because: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 0, updated: 0, unavailable: 1)

I'm having difficulty drilling into this one, with nothing in:

  $ for LOG in $(curl -s https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706/artifacts/e2e-aws-upgrade/pods/ | grep -o 'https://[^"]*machine-config[^"]*\.gz'); do echo -e "\n\n$LOG\n\n"; curl -s "${LOG}" | gunzip; done | less -S

jumping out at me except some issues from OpenShift/Kubernetes API-server hiccups.  It's possible that that's all there is to it, but if so, it would be nice if the machine-config operator laid the blame more obviously in its ClusterOperator Degraded message (in jobs 706 and 707, those ClusterOperators are all happy again by the time the teardown container comes around to collect them).
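
For reference, on a live cluster something like this pulls out the current Degraded condition (a sketch; the jq path just walks the usual ClusterOperator status.conditions layout):

  $ oc get clusteroperator machine-config -o json | jq -r '.status.conditions[] | select(.type == "Degraded") | .lastTransitionTime + "\t" + .status + "\t" + .message'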

This may be a dup of bug 1695721, but without more details on the underlying cause, it's hard to know.

Comment 1 Antonio Murdaca 2019-04-24 22:52:09 UTC
(In reply to W. Trevor King from comment #0)
> Description of problem:
> 
> Upgrade jobs frequently (~44% over the past 24 hours) have errors like:
> 
>   $ curl -s
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-
> openshift-origin-installer-e2e-aws-upgrade/706 | grep machine-config | head
> -n1
>   Apr 24 19:25:24.870 E clusteroperator/machine-config changed Degraded to
> True: error pool master is not ready, retrying. Status: (pool degraded:
> false total: 3, ready 1, updated: 1, unavailable: 1): Unable to apply
> 4.1.0-0.ci-2019-04-24-181848: error pool master is not ready, retrying.
> Status: (pool degraded: false total: 3, ready 1, updated: 1, unavailable: 1)
> 
> and:
> 
>   $ curl -s
> https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-
> openshift-origin-installer-e2e-aws-upgrade/706 | grep machine-config | head
> -n1
>   Apr 24 19:22:27.698 E clusteroperator/machine-config changed Degraded to
> True: error pool master is not ready, retrying. Status: (pool degraded:
> false total: 3, ready 0, updated: 0, unavailable: 1): Failed to resync
> 4.1.0-0.ci-2019-04-24-165200 because: error pool master is not ready,
> retrying. Status: (pool degraded: false total: 3, ready 0, updated: 0,
> unavailable: 1)
> 
> I'm having difficulty drilling into this one, with nothing in:
> 
>   $ for LOG in $(curl -s
> https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-
> openshift-origin-installer-e2e-aws-upgrade/706/artifacts/e2e-aws-upgrade/
> pods/ | grep -o 'https://[^"]*machine-config[^"]*\.gz'); do echo -e
> "\n\n$LOG\n\n"; curl -s "${LOG}" | gunzip; done | less -S
> 
> jumping out at me except some issues from OpenShift/Kubernetes API-server
> hiccups.  It's possible that that's all there is to it, but if so, it would
> be nice if the machine-config operator laid the blame more obviously in its
> ClusterOperator Degraded message (in jobs 706 and 707, those ClusterOperators
> are all happy again by the time the teardown container comes around to
> collect them).

It would be perfect and awesome to have a way to bubble up the blame to some other operator, but since the MCO is really the last one (and loudest) in the initial chain, how would we know which component is actually failing so we could point the finger?

Comment 2 W. Trevor King 2019-04-24 23:54:30 UTC
> It would be perfect and awesome to have a way to bubble up the blame to some other operator, but since the MCO is really the last one (and loudest) in the initial chain, how would we know which component is actually failing so we could point the finger?

Well, highlighting the node you're marking invalid would at least help narrow the search ;).  By the time the cluster is getting torn down, all the nodes were happy again (or I'm looking in the wrong place):

  $ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706/artifacts/e2e-aws-upgrade/nodes.json | jq -r '[.items[] | .conditions = ([.status.conditions[] | {key: .type, value: .}] | from_entries) | .conditions.Ready.status + "\t" + .conditions.Ready.lastTransitionTime + "\t" + .metadata.name] | sort[]'
  True	2019-04-24T18:40:35Z	ip-10-0-140-230.ec2.internal
  True	2019-04-24T18:45:22Z	ip-10-0-163-80.ec2.internal
  True	2019-04-24T19:23:54Z	ip-10-0-141-28.ec2.internal
  True	2019-04-24T19:24:07Z	ip-10-0-175-155.ec2.internal
  True	2019-04-24T19:27:50Z	ip-10-0-149-41.ec2.internal
  True	2019-04-24T19:28:12Z	ip-10-0-158-68.ec2.internal

Comment 3 Antonio Murdaca 2019-04-25 11:57:23 UTC
(In reply to W. Trevor King from comment #2)
> > It would be perfect and awesome to have a way to bubble up the blame to some other operator, but since the MCO is really the last one (and loudest) in the initial chain, how would we know which component is actually failing so we could point the finger?
> 
> Well, highlighting the node you're marking invalid would at least help
> narrow the search ;).  By the time the cluster is getting torn down, all the
> nodes were happy again (or I'm looking in the wrong place):

The latest MCO code records the failing node in the ClusterOperator and MachineConfigPool objects when that happens, and logs it as well.
That's also a transient error: the MCO is just waiting for the master pool to roll out, and it reports Degraded in the meantime (as it should, right?). Once the pool reconciles, we flip Degraded=False.
If the error isn't transient, the MCO ClusterOperator and MachineConfigPool statuses should contain the bad node.
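
For example, something along these lines (a sketch; condition names per the current MCO code) should surface the offending node:

  $ oc get machineconfigpool master -o json | jq -r '.status.conditions[] | select(.type == "NodeDegraded" or .type == "RenderDegraded" or .type == "Degraded") | .type + "\t" + .status + "\t" + .message'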

> 
>   $ curl -s
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-
> installer-e2e-aws-upgrade/706/artifacts/e2e-aws-upgrade/nodes.json | jq -r
> '[.items[] | .conditions = ([.status.conditions[] | {key: .type, value: .}]
> | from_entries) | .conditions.Ready.status + "\t" +
> .conditions.Ready.lastTransitionTime + "\t" + .metadata.name] | sort[]'
>   True	2019-04-24T18:40:35Z	ip-10-0-140-230.ec2.internal
>   True	2019-04-24T18:45:22Z	ip-10-0-163-80.ec2.internal
>   True	2019-04-24T19:23:54Z	ip-10-0-141-28.ec2.internal
>   True	2019-04-24T19:24:07Z	ip-10-0-175-155.ec2.internal
>   True	2019-04-24T19:27:50Z	ip-10-0-149-41.ec2.internal
>   True	2019-04-24T19:28:12Z	ip-10-0-158-68.ec2.internal

The nodes all being Ready doesn't rule out the MCO being permanently degraded; the master pool, for instance, could have trouble rendering the MachineConfig for the pool.
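
To rule that out, something like this (a sketch) shows what the controller rendered and which config the master pool is targeting versus running:

  $ oc get machineconfig | grep rendered-master
  $ oc get machineconfigpool master -o jsonpath='{.spec.configuration.name}{"\n"}{.status.configuration.name}{"\n"}'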

Comment 6 Qin Ping 2019-05-05 10:00:27 UTC
Hit this issue when upgrading from 4.1.0-0.nightly-2019-05-04-070249 to 4.1.0-0.nightly-2019-05-04-210601.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-04-210601   True        True          125m    Unable to apply 4.1.0-0.nightly-2019-05-04-210601: the cluster operator machine-config has not yet successfully rolled out

$ oc get clusteroperator machine-config -ojson |jq .status
{
  "conditions": [
    {
      "lastTransitionTime": "2019-05-05T08:26:38Z",
      "message": "Cluster not available for 4.1.0-0.nightly-2019-05-04-210601",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2019-05-05T08:06:30Z",
      "message": "Working towards 4.1.0-0.nightly-2019-05-04-210601",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2019-05-05T08:26:38Z",
      "message": "Unable to apply 4.1.0-0.nightly-2019-05-04-210601: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)",
      "reason": "timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)",
      "status": "True",
      "type": "Degraded"
    }
  ],
  "extension": {
    "master": "pool is degraded because nodes fail with \"1 nodes are reporting degraded status on sync\": \"Node ip-172-31-133-232.us-east-2.compute.internal is reporting: \\\"unexpected on-disk state validating against rendered-master-8984b6368bfe1d46d30d7374a15a8261\\\"\"",
    "worker": "all 2 nodes are at latest configuration rendered-worker-947820780bdec5ffd4008a707cb83310"
  },
  "relatedObjects": [
    {
      "group": "",
      "name": "openshift-machine-config-operator",
      "resource": "namespaces"
    }
  ],
  "versions": [
    {
      "name": "operator",
      "version": "4.1.0-0.nightly-2019-05-04-070249"
    }
  ]
}

$ oc get nodes
NAME                                           STATUS                     ROLES    AGE     VERSION
ip-172-31-130-175.us-east-2.compute.internal   Ready                      worker   3h55m   v1.13.4+05c6cd644
ip-172-31-133-232.us-east-2.compute.internal   Ready,SchedulingDisabled   master   4h      v1.13.4+05c6cd644
ip-172-31-144-17.us-east-2.compute.internal    Ready                      worker   3h54m   v1.13.4+05c6cd644
ip-172-31-150-202.us-east-2.compute.internal   Ready                      master   4h      v1.13.4+05c6cd644
ip-172-31-163-24.us-east-2.compute.internal    Ready                      master   4h      v1.13.4+05c6cd644

kubelet log from the node ip-172-31-133-232.us-east-2.compute.internal
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.639154    1791 reconciler.go:181] operationExecutor.UnmountVolume started for volume "host" (UniqueName: "kubernetes.io/host-path/3cc8d4c5-6f15-11e9-a073-0addadce9b60-host") pod "3cc8d4c5-6f15-11e9-a073-0addadce9b60" (UID: "3cc8d4c5-6f15-11e9-a073-0addadce9b60")
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.639226    1791 reconciler.go:181] operationExecutor.UnmountVolume started for volume "default-token-cqg9d" (UniqueName: "kubernetes.io/secret/3cc8d4c5-6f15-11e9-a073-0addadce9b60-default-token-cqg9d") pod "3cc8d4c5-6f15-11e9-a073-0addadce9b60" (UID: "3cc8d4c5-6f15-11e9-a073-0addadce9b60")
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.639373    1791 operation_generator.go:688] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/3cc8d4c5-6f15-11e9-a073-0addadce9b60-host" (OuterVolumeSpecName: "host") pod "3cc8d4c5-6f15-11e9-a073-0addadce9b60" (UID: "3cc8d4c5-6f15-11e9-a073-0addadce9b60"). InnerVolumeSpecName "host". PluginName "kubernetes.io/host-path", VolumeGidValue ""
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.644324    1791 operation_generator.go:688] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/3cc8d4c5-6f15-11e9-a073-0addadce9b60-default-token-cqg9d" (OuterVolumeSpecName: "default-token-cqg9d") pod "3cc8d4c5-6f15-11e9-a073-0addadce9b60" (UID: "3cc8d4c5-6f15-11e9-a073-0addadce9b60"). InnerVolumeSpecName "default-token-cqg9d". PluginName "kubernetes.io/secret", VolumeGidValue ""
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.746346    1791 reconciler.go:301] Volume detached for volume "host" (UniqueName: "kubernetes.io/host-path/3cc8d4c5-6f15-11e9-a073-0addadce9b60-host") on node "ip-172-31-133-232.us-east-2.compute.internal" DevicePath ""
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.746396    1791 reconciler.go:301] Volume detached for volume "default-token-cqg9d" (UniqueName: "kubernetes.io/secret/3cc8d4c5-6f15-11e9-a073-0addadce9b60-default-token-cqg9d") on node "ip-172-31-133-232.us-east-2.compute.internal" DevicePath ""
May 05 09:10:18 ip-172-31-133-232 hyperkube[1791]: W0505 09:10:18.586968    1791 pod_container_deletor.go:75] Container "d45fe80d79bee1992c5d8dfbd9976d9d22211c2e535a808095c7db3aa8b4ba96" not found in pod's containers


$ oc describe nodes |grep machineconfig
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-947820780bdec5ffd4008a707cb83310
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-947820780bdec5ffd4008a707cb83310
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-master-8984b6368bfe1d46d30d7374a15a8261
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-ee069029e0286524133849a11ae49886
                    machineconfiguration.openshift.io/reason: unexpected on-disk state validating against rendered-master-8984b6368bfe1d46d30d7374a15a8261
                    machineconfiguration.openshift.io/state: Degraded
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-947820780bdec5ffd4008a707cb83310
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-947820780bdec5ffd4008a707cb83310
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-master-ee069029e0286524133849a11ae49886
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-ee069029e0286524133849a11ae49886
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
                    machineconfiguration.openshift.io/currentConfig: rendered-master-ee069029e0286524133849a11ae49886
                    machineconfiguration.openshift.io/desiredConfig: rendered-master-ee069029e0286524133849a11ae49886
                    machineconfiguration.openshift.io/reason: 
                    machineconfiguration.openshift.io/state: Done
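
For reference, the same per-node state can be collapsed into one line per node with something like this (same annotation keys as in the describe output above):

$ oc get nodes -o json | jq -r '.items[] | .metadata.name + "\t" + (.metadata.annotations["machineconfiguration.openshift.io/state"] // "") + "\t" + (.metadata.annotations["machineconfiguration.openshift.io/reason"] // "")'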

Comment 7 Antonio Murdaca 2019-05-05 18:43:24 UTC
This is a new one, not related to this BZ. The potential fix for what you're reporting is here: https://github.com/openshift/machine-config-operator/pull/704/commits/2510cff1c728a79d2cf3aeaa9ea6cda6cd8c6a6c

Comment 8 Antonio Murdaca 2019-05-05 18:47:10 UTC
I'm moving this BZ to MODIFIED; the BZ to follow for that last upgrade error is https://bugzilla.redhat.com/show_bug.cgi?id=1706606

Comment 11 Alex Crawford 2019-05-07 14:49:52 UTC
I just upgraded my cluster from 4.1.0-0.nightly-2019-05-06-011159 to 4.1.0-0.nightly-2019-05-06-073916 and I see MCO failing very similarly:

    Unable to apply 4.1.0-0.nightly-2019-05-06-073916: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 1)

Is 4.1.0-0.nightly-2019-05-06-073916 expected to have the fix?

Comment 12 Antonio Murdaca 2019-05-07 14:52:27 UTC
Alex, what you're getting is https://bugzilla.redhat.com/show_bug.cgi?id=1706606

Comment 13 Alex Crawford 2019-05-07 14:55:45 UTC
Created attachment 1565263 [details]
MCD Logs

I narrowed down which instance of MCD failed and have attached the logs. It looks like it's upset that the contents of /etc/crio/crio.conf have diverged from what it expected.
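
To see what it expected, the file contents can be pulled out of the rendered config the daemon was validating against, e.g. (a sketch; substitute the rendered-master name from the node's desiredConfig annotation, and note the contents come back as a URL-encoded data: URI):

$ oc get machineconfig <rendered-master-name> -o json | jq -r '.spec.config.storage.files[] | select(.path == "/etc/crio/crio.conf") | .contents.source'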

Comment 14 Micah Abbott 2019-05-07 18:46:53 UTC
Combing over the latest failures in the `release-openshift-origin-installer-e2e-aws-upgrade` job... it looks like the original failure isn't being seen anymore:

$ for i in 1266 1268 1276 1278 1280 1281 1284 1286 1290 1292 1294 1303; do curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/$i/ | grep machine-config | head -n1; done
May 07 01:40:01.831 E ns/openshift-machine-config-operator pod/machine-config-daemon-gkvkz node/ip-10-0-147-7.ec2.internal container=machine-config-daemon container exited with code -1 (Error): 
May 07 02:55:01.800 E ns/openshift-machine-config-operator pod/machine-config-operator-59669c4789-cm7pz node/ip-10-0-148-164.ec2.internal container=machine-config-operator container exited with code 2 (Error): 
May 07 09:54:56.482 E ns/openshift-machine-config-operator pod/machine-config-controller-69cdf48bf6-6xqsb node/ip-10-0-172-79.ec2.internal container=machine-config-controller container exited with code 2 (Error): 
May 07 11:26:22.635 E ns/openshift-machine-config-operator pod/machine-config-operator-5995bc78bb-b87t7 node/ip-10-0-132-201.ec2.internal container=machine-config-operator container exited with code 2 (Error): 
May 07 11:37:56.364 E ns/openshift-machine-config-operator pod/machine-config-operator-5995bc78bb-xzh4s node/ip-10-0-170-251.ec2.internal container=machine-config-operator container exited with code 2 (Error): 
May 07 13:04:19.003 E ns/openshift-machine-config-operator pod/machine-config-operator-5995bc78bb-nt4tf node/ip-10-0-161-194.ec2.internal container=machine-config-operator container exited with code 2 (Error): 
May 07 14:25:29.870 E ns/openshift-machine-config-operator pod/machine-config-daemon-grg5t node/ip-10-0-163-28.ec2.internal container=machine-config-daemon container exited with code 255 (Error): 
May 07 14:25:35.076 E ns/openshift-machine-config-operator pod/machine-config-daemon-w7zmh node/ip-10-0-167-63.ec2.internal container=machine-config-daemon container exited with code -1 (Error): 
May 07 17:25:04.008 E ns/openshift-machine-config-operator pod/machine-config-controller-676998fb78-9l5zp node/ip-10-0-138-208.ec2.internal container=machine-config-controller container exited with code 2 (Error): 


...which leads me to believe the 10m wait enabled in PR 663 is mitigating the original error.  Moving to VERIFIED.

Comment 16 errata-xmlrpc 2019-06-04 10:47:56 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758