Description of problem:

Upgrade jobs frequently (~44% over the past 24 hours) have errors like:

$ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706 | grep machine-config | head -n1
Apr 24 19:25:24.870 E clusteroperator/machine-config changed Degraded to True: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 1, updated: 1, unavailable: 1): Unable to apply 4.1.0-0.ci-2019-04-24-181848: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 1, updated: 1, unavailable: 1)

and:

$ curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/707 | grep machine-config | head -n1
Apr 24 19:22:27.698 E clusteroperator/machine-config changed Degraded to True: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 0, updated: 0, unavailable: 1): Failed to resync 4.1.0-0.ci-2019-04-24-165200 because: error pool master is not ready, retrying. Status: (pool degraded: false total: 3, ready 0, updated: 0, unavailable: 1)

I'm having difficulty drilling into this one. Nothing in:

$ for LOG in $(curl -s https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706/artifacts/e2e-aws-upgrade/pods/ | grep -o 'https://[^"]*machine-config[^"]*\.gz'); do echo -e "\n\n$LOG\n\n"; curl -s "${LOG}" | gunzip; done | less -S

jumps out at me except some issues caused by OpenShift/Kubernetes API-server hiccups. It's possible that that's all there is to it, but if so, it would be nice if the machine-config operator laid the blame more obviously in its ClusterOperator Degraded message (in jobs 706 and 707, those ClusterOperators are all happy again by the time the teardown container comes around to collect them). This may be a dup of bug 1695721, but without more details on the underlying cause, it's hard to know.
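For future triage, something like the following should pull the machine-config Degraded condition out of whatever ClusterOperator dump the teardown container collects (the clusteroperators.json artifact name below is a guess; substitute whatever the job actually writes):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706/artifacts/e2e-aws-upgrade/clusteroperators.json \
    | jq -r '.items[] | select(.metadata.name == "machine-config") | .status.conditions[] | select(.type == "Degraded") | [.status, .lastTransitionTime, .message] | @tsv'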
(In reply to W. Trevor King from comment #0)
> It's possible that that's all there is to it, but if so, it would be nice
> if the machine-config operator laid the blame more obviously in its
> ClusterOperator Degraded message (in jobs 706 and 707, those
> ClusterOperators are all happy again by the time the teardown container
> comes around to collect them).

It would be perfect and awesome to have a way to bubble up the blame to some other operator, but since the MCO is really the last one (and loudest) in the initial chain, how would we know who is really failing so we can point the finger?
> It would be perfect and awesome to have a way to bubble up the blame to
> some other operator, but since the MCO is really the last one (and loudest)
> in the initial chain, how would we know who is really failing so we can
> point the finger?

Well, highlighting the node you're marking invalid would at least help narrow the search ;). By the time the cluster is getting torn down, all the nodes were happy again (or I'm looking in the wrong place):

$ curl -s https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706/artifacts/e2e-aws-upgrade/nodes.json | jq -r '[.items[] | .conditions = ([.status.conditions[] | {key: .type, value: .}] | from_entries) | .conditions.Ready.status + "\t" + .conditions.Ready.lastTransitionTime + "\t" + .metadata.name] | sort[]'
True    2019-04-24T18:40:35Z    ip-10-0-140-230.ec2.internal
True    2019-04-24T18:45:22Z    ip-10-0-163-80.ec2.internal
True    2019-04-24T19:23:54Z    ip-10-0-141-28.ec2.internal
True    2019-04-24T19:24:07Z    ip-10-0-175-155.ec2.internal
True    2019-04-24T19:27:50Z    ip-10-0-149-41.ec2.internal
True    2019-04-24T19:28:12Z    ip-10-0-158-68.ec2.internal
(In reply to W. Trevor King from comment #2)
> Well, highlighting the node you're marking invalid would at least help
> narrow the search ;). By the time the cluster is getting torn down, all the
> nodes were happy again (or I'm looking in the wrong place):

The latest MCO code records the failing node in the ClusterOperator and MachineConfigPool objects when that happens, and logs it as well. That's also a transient error: the operator is just waiting for the master pool to roll out, so it reports Degraded (as it should, right?). Once that reconciles, we flip Degraded=False. If it's not transient, the MCO ClusterOperator and MachineConfigPool statuses should contain the bad node.

> $ curl -s
> https://storage.googleapis.com/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/706/artifacts/e2e-aws-upgrade/nodes.json
> | jq -r '[.items[] | .conditions = ([.status.conditions[] | {key: .type, value: .}]
> | from_entries) | .conditions.Ready.status + "\t" +
> .conditions.Ready.lastTransitionTime + "\t" + .metadata.name] | sort[]'
> True    2019-04-24T18:40:35Z    ip-10-0-140-230.ec2.internal
> True    2019-04-24T18:45:22Z    ip-10-0-163-80.ec2.internal
> True    2019-04-24T19:23:54Z    ip-10-0-141-28.ec2.internal
> True    2019-04-24T19:24:07Z    ip-10-0-175-155.ec2.internal
> True    2019-04-24T19:27:50Z    ip-10-0-149-41.ec2.internal
> True    2019-04-24T19:28:12Z    ip-10-0-158-68.ec2.internal

The nodes being up doesn't tell us whether the MCO is permanently degraded; the master pool, for instance, could have issues rendering the MachineConfig for the pool.
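A rough sketch of where to look for the bad node once it's reported (field names as of 4.1; adjust as needed):

# The MCO ClusterOperator extension carries the per-pool failure message:
$ oc get clusteroperator machine-config -o json | jq -r '.status.extension'

# The pool conditions name the degraded node(s):
$ oc get machineconfigpool master -o json \
    | jq -r '.status.conditions[] | select(.type == "NodeDegraded" or .type == "Degraded") | "\(.type)=\(.status): \(.message)"'

# And each node's MCD annotations record its state and reason:
$ oc get nodes -o json \
    | jq -r '.items[] | [.metadata.name, .metadata.annotations["machineconfiguration.openshift.io/state"], .metadata.annotations["machineconfiguration.openshift.io/reason"]] | @tsv'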
https://github.com/openshift/machine-config-operator/pull/663
Hit this issue when upgrading from 4.1.0-0.nightly-2019-05-04-070249 to 4.1.0-0.nightly-2019-05-04-210601.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.1.0-0.nightly-2019-05-04-210601   True        True          125m    Unable to apply 4.1.0-0.nightly-2019-05-04-210601: the cluster operator machine-config has not yet successfully rolled out

$ oc get clusteroperator machine-config -o json | jq .status
{
  "conditions": [
    {
      "lastTransitionTime": "2019-05-05T08:26:38Z",
      "message": "Cluster not available for 4.1.0-0.nightly-2019-05-04-210601",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2019-05-05T08:06:30Z",
      "message": "Working towards 4.1.0-0.nightly-2019-05-04-210601",
      "status": "True",
      "type": "Progressing"
    },
    {
      "lastTransitionTime": "2019-05-05T08:26:38Z",
      "message": "Unable to apply 4.1.0-0.nightly-2019-05-04-210601: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)",
      "reason": "timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)",
      "status": "True",
      "type": "Degraded"
    }
  ],
  "extension": {
    "master": "pool is degraded because nodes fail with \"1 nodes are reporting degraded status on sync\": \"Node ip-172-31-133-232.us-east-2.compute.internal is reporting: \\\"unexpected on-disk state validating against rendered-master-8984b6368bfe1d46d30d7374a15a8261\\\"\"",
    "worker": "all 2 nodes are at latest configuration rendered-worker-947820780bdec5ffd4008a707cb83310"
  },
  "relatedObjects": [
    {
      "group": "",
      "name": "openshift-machine-config-operator",
      "resource": "namespaces"
    }
  ],
  "versions": [
    {
      "name": "operator",
      "version": "4.1.0-0.nightly-2019-05-04-070249"
    }
  ]
}

$ oc get nodes
NAME                                           STATUS                     ROLES    AGE     VERSION
ip-172-31-130-175.us-east-2.compute.internal   Ready                      worker   3h55m   v1.13.4+05c6cd644
ip-172-31-133-232.us-east-2.compute.internal   Ready,SchedulingDisabled   master   4h      v1.13.4+05c6cd644
ip-172-31-144-17.us-east-2.compute.internal    Ready                      worker   3h54m   v1.13.4+05c6cd644
ip-172-31-150-202.us-east-2.compute.internal   Ready                      master   4h      v1.13.4+05c6cd644
ip-172-31-163-24.us-east-2.compute.internal    Ready                      master   4h      v1.13.4+05c6cd644

kubelet log from the node ip-172-31-133-232.us-east-2.compute.internal:

May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.639154 1791 reconciler.go:181] operationExecutor.UnmountVolume started for volume "host" (UniqueName: "kubernetes.io/host-path/3cc8d4c5-6f15-11e9-a073-0addadce9b60-host") pod "3cc8d4c5-6f15-11e9-a073-0addadce9b60" (UID: "3cc8d4c5-6f15-11e9-a073-0addadce9b60")
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.639226 1791 reconciler.go:181] operationExecutor.UnmountVolume started for volume "default-token-cqg9d" (UniqueName: "kubernetes.io/secret/3cc8d4c5-6f15-11e9-a073-0addadce9b60-default-token-cqg9d") pod "3cc8d4c5-6f15-11e9-a073-0addadce9b60" (UID: "3cc8d4c5-6f15-11e9-a073-0addadce9b60")
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.639373 1791 operation_generator.go:688] UnmountVolume.TearDown succeeded for volume "kubernetes.io/host-path/3cc8d4c5-6f15-11e9-a073-0addadce9b60-host" (OuterVolumeSpecName: "host") pod "3cc8d4c5-6f15-11e9-a073-0addadce9b60" (UID: "3cc8d4c5-6f15-11e9-a073-0addadce9b60"). InnerVolumeSpecName "host". PluginName "kubernetes.io/host-path", VolumeGidValue ""
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.644324 1791 operation_generator.go:688] UnmountVolume.TearDown succeeded for volume "kubernetes.io/secret/3cc8d4c5-6f15-11e9-a073-0addadce9b60-default-token-cqg9d" (OuterVolumeSpecName: "default-token-cqg9d") pod "3cc8d4c5-6f15-11e9-a073-0addadce9b60" (UID: "3cc8d4c5-6f15-11e9-a073-0addadce9b60"). InnerVolumeSpecName "default-token-cqg9d". PluginName "kubernetes.io/secret", VolumeGidValue ""
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.746346 1791 reconciler.go:301] Volume detached for volume "host" (UniqueName: "kubernetes.io/host-path/3cc8d4c5-6f15-11e9-a073-0addadce9b60-host") on node "ip-172-31-133-232.us-east-2.compute.internal" DevicePath ""
May 05 09:10:17 ip-172-31-133-232 hyperkube[1791]: I0505 09:10:17.746396 1791 reconciler.go:301] Volume detached for volume "default-token-cqg9d" (UniqueName: "kubernetes.io/secret/3cc8d4c5-6f15-11e9-a073-0addadce9b60-default-token-cqg9d") on node "ip-172-31-133-232.us-east-2.compute.internal" DevicePath ""
May 05 09:10:18 ip-172-31-133-232 hyperkube[1791]: W0505 09:10:18.586968 1791 pod_container_deletor.go:75] Container "d45fe80d79bee1992c5d8dfbd9976d9d22211c2e535a808095c7db3aa8b4ba96" not found in pod's containers

$ oc describe nodes | grep machineconfig
machineconfiguration.openshift.io/currentConfig: rendered-worker-947820780bdec5ffd4008a707cb83310
machineconfiguration.openshift.io/desiredConfig: rendered-worker-947820780bdec5ffd4008a707cb83310
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
machineconfiguration.openshift.io/currentConfig: rendered-master-8984b6368bfe1d46d30d7374a15a8261
machineconfiguration.openshift.io/desiredConfig: rendered-master-ee069029e0286524133849a11ae49886
machineconfiguration.openshift.io/reason: unexpected on-disk state validating against rendered-master-8984b6368bfe1d46d30d7374a15a8261
machineconfiguration.openshift.io/state: Degraded
machineconfiguration.openshift.io/currentConfig: rendered-worker-947820780bdec5ffd4008a707cb83310
machineconfiguration.openshift.io/desiredConfig: rendered-worker-947820780bdec5ffd4008a707cb83310
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
machineconfiguration.openshift.io/currentConfig: rendered-master-ee069029e0286524133849a11ae49886
machineconfiguration.openshift.io/desiredConfig: rendered-master-ee069029e0286524133849a11ae49886
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
machineconfiguration.openshift.io/currentConfig: rendered-master-ee069029e0286524133849a11ae49886
machineconfiguration.openshift.io/desiredConfig: rendered-master-ee069029e0286524133849a11ae49886
machineconfiguration.openshift.io/reason:
machineconfiguration.openshift.io/state: Done
This is a new issue, not the one this BZ is about. The potential fix for what you're reporting is here: https://github.com/openshift/machine-config-operator/pull/704/commits/2510cff1c728a79d2cf3aeaa9ea6cda6cd8c6a6c
I'm moving this BZ to MODIFIED. The BZ to follow for that last upgrade error is https://bugzilla.redhat.com/show_bug.cgi?id=1706606
I just upgraded my cluster from 4.1.0-0.nightly-2019-05-06-011159 to 4.1.0-0.nightly-2019-05-06-073916 and I see the MCO failing very similarly:

Unable to apply 4.1.0-0.nightly-2019-05-06-073916: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 1, updated: 1, unavailable: 1)

Is 4.1.0-0.nightly-2019-05-06-073916 expected to have the fix?
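For what it's worth, something like this should show which machine-config-operator commit a nightly was built from (the release image pullspec below is a guess at where the CI nightlies were published at the time; substitute whatever the release controller lists):

$ oc adm release info --commits registry.svc.ci.openshift.org/ocp/release:4.1.0-0.nightly-2019-05-06-073916 | grep machine-config-operator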
Alex, what you're getting is https://bugzilla.redhat.com/show_bug.cgi?id=1706606
Created attachment 1565263 [details]
MCD Logs

I narrowed down which instance of the MCD failed and have attached the logs. It looks like it's upset that the contents of /etc/crio/crio.conf have diverged from what it expected.
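For anyone else debugging this, something like the following should show the divergence the MCD is complaining about (a sketch only; it assumes the file's contents.source in the rendered MachineConfig is a plain URL-encoded data:, URL rather than base64):

# Extract /etc/crio/crio.conf as the rendered master MachineConfig defines it...
$ oc get machineconfig rendered-master-8984b6368bfe1d46d30d7374a15a8261 -o json \
    | jq -r '.spec.config.storage.files[] | select(.path == "/etc/crio/crio.conf") | .contents.source' \
    | sed 's/^data:,//' \
    | python3 -c 'import sys, urllib.parse; sys.stdout.write(urllib.parse.unquote(sys.stdin.read()))' \
    > crio.conf.rendered

# ...grab the file actually on the node...
$ oc debug node/ip-172-31-133-232.us-east-2.compute.internal -- cat /host/etc/crio/crio.conf > crio.conf.on-disk

# ...and diff the two.
$ diff -u crio.conf.rendered crio.conf.on-disk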
Combing over the latest failures in the `release-openshift-origin-installer-e2e-aws-upgrade` job, it looks like the original failure isn't being seen anymore:

$ for i in 1266 1268 1276 1278 1280 1281 1284 1286 1290 1292 1294 1303; do curl -s https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/$i/ | grep machine-config | head -n1; done
May 07 01:40:01.831 E ns/openshift-machine-config-operator pod/machine-config-daemon-gkvkz node/ip-10-0-147-7.ec2.internal container=machine-config-daemon container exited with code -1 (Error):
May 07 02:55:01.800 E ns/openshift-machine-config-operator pod/machine-config-operator-59669c4789-cm7pz node/ip-10-0-148-164.ec2.internal container=machine-config-operator container exited with code 2 (Error):
May 07 09:54:56.482 E ns/openshift-machine-config-operator pod/machine-config-controller-69cdf48bf6-6xqsb node/ip-10-0-172-79.ec2.internal container=machine-config-controller container exited with code 2 (Error):
May 07 11:26:22.635 E ns/openshift-machine-config-operator pod/machine-config-operator-5995bc78bb-b87t7 node/ip-10-0-132-201.ec2.internal container=machine-config-operator container exited with code 2 (Error):
May 07 11:37:56.364 E ns/openshift-machine-config-operator pod/machine-config-operator-5995bc78bb-xzh4s node/ip-10-0-170-251.ec2.internal container=machine-config-operator container exited with code 2 (Error):
May 07 13:04:19.003 E ns/openshift-machine-config-operator pod/machine-config-operator-5995bc78bb-nt4tf node/ip-10-0-161-194.ec2.internal container=machine-config-operator container exited with code 2 (Error):
May 07 14:25:29.870 E ns/openshift-machine-config-operator pod/machine-config-daemon-grg5t node/ip-10-0-163-28.ec2.internal container=machine-config-daemon container exited with code 255 (Error):
May 07 14:25:35.076 E ns/openshift-machine-config-operator pod/machine-config-daemon-w7zmh node/ip-10-0-167-63.ec2.internal container=machine-config-daemon container exited with code -1 (Error):
May 07 17:25:04.008 E ns/openshift-machine-config-operator pod/machine-config-controller-676998fb78-9l5zp node/ip-10-0-138-208.ec2.internal container=machine-config-controller container exited with code 2 (Error):

...which leads me to believe the 10m wait enabled in PR 663 is mitigating the original error. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758