Bug 1840385
Summary: MCD fails when adding replicas to MachineSet and error isn't visible enough

| Field | Value | Field | Value |
|---|---|---|---|
| Product | OpenShift Container Platform | Reporter | Naveen Malik <nmalik> |
| Component | Machine Config Operator | Assignee | Antonio Murdaca <amurdaca> |
| Status | CLOSED ERRATA | QA Contact | Michael Nguyen <mnguyen> |
| Severity | high | Docs Contact | |
| Priority | high | | |
| Version | 4.3.z | CC | agarcial, amurdaca, cattias, jaharrin, jbeakley, kgarriso, pbergene, tparikh |
| Target Milestone | --- | Keywords | ServiceDeliveryImpact |
| Target Release | 4.6.0 | | |
| Hardware | Unspecified | | |
| OS | Unspecified | | |
| Fixed In Version | | Doc Type | If docs needed, set a value |
| Doc Text | | Story Points | --- |
| Last Closed | 2020-10-27 16:01:40 UTC | Type | Bug |
Description
Naveen Malik
2020-05-26 20:29:54 UTC
Machines are stuck on "Provisioned", which means the infrastructure was created fine. The issue is either that the nodes are not able to authenticate themselves during bootstrap, or that something is wrong with the machine-config being passed. Can we please get the logs for the cluster-machine-approver pod and the output of `kubectl get csr`?

Unfortunately cluster-machine-approver is missing from must-gather. Attaching logs.

```
$ oc get csr
No resources found.
```

Created attachment 1693033 [details]
machine-approver-controller container logs
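
As an aside, the `oc get csr` check above can also be done programmatically. Below is a minimal client-go sketch of the same check, assuming kubeconfig access from the local machine; it is illustrative only and not part of the MCO or the machine-approver.

```go
// List CertificateSigningRequests and print their approval state,
// i.e. the same information as `oc get csr` above.
package main

import (
	"context"
	"fmt"
	"log"
	"path/filepath"

	certificatesv1 "k8s.io/api/certificates/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
	"k8s.io/client-go/util/homedir"
)

func main() {
	kubeconfig := filepath.Join(homedir.HomeDir(), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		log.Fatal(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		log.Fatal(err)
	}

	csrs, err := client.CertificatesV1().CertificateSigningRequests().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		log.Fatal(err)
	}
	if len(csrs.Items) == 0 {
		fmt.Println("No resources found.")
		return
	}
	for _, csr := range csrs.Items {
		// A CSR with no Approved/Denied condition is still pending,
		// which is what blocks a new machine from becoming a node.
		state := "Pending"
		for _, cond := range csr.Status.Conditions {
			switch cond.Type {
			case certificatesv1.CertificateApproved:
				state = "Approved"
			case certificatesv1.CertificateDenied:
				state = "Denied"
			}
		}
		fmt.Printf("%s\t%s\t%s\n", csr.Name, csr.Spec.SignerName, state)
	}
}
```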
All seems OK from the machine API and infrastructure point of view. Machines report "Provisioned" appropriately and there aren't even any pending CSRs. For some reason the software running within the instances has failed, which has prevented the instances from running a kubelet, requesting to authenticate themselves, and becoming nodes. From one of the machine-config-daemons:

```
ocp-v4.0-art-dev@sha256:b920c6ed07fa42565949ce61f14b3b9dd7bea64d2cfb13539fcea7068ea799f2
2020-05-19T13:06:46.445346629Z Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b920c6ed07fa42565949ce61f14b3b9dd7bea64d2cfb13539fcea7068ea799f2": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b920c6ed07fa42565949ce61f14b3b9dd7bea64d2cfb13539fcea7068ea799f2: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b920c6ed07fa42565949ce61f14b3b9dd7bea64d2cfb13539fcea7068ea799f2: error pinging docker registry quay.io: Get https://quay.io/v2/: EOF
2020-05-19T13:06:46.445346629Z W0519 13:06:46.222937 2996778 run.go:40] podman failed: exit status 125; retrying...
2020-05-19T13:08:06.702105132Z I0519 13:08:06.223106 2996778 run.go:16] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/
```

Would be good to surface this up to the Machine CR; good to know we have some indicator at least!

This happened while quay.io was having problems last week, May 19, so it's not surprising this is the issue. I will try to kick one of the stuck machines by deleting it and report findings.

New Machine came online with its EC2 instance and Node just fine. Perhaps we re-state this BZ as "the pull error on node provisioning is not surfaced to the Machine CR"?

(In reply to Naveen Malik from comment #8)
> New Machine came online with EC2 instance and Node just fine. Perhaps we
> re-state this BZ as the pull error on node provisioning is not surfaced to
> the Machine CR?

Given that, and the fact that the pull error is shadowing the real reason behind this BZ anyway and the domain is still MAO, I'm bringing this back to MAO for further triage, as the MCD should just pull that image once quay works.

> given that and the fact that the pull error is shadowing the real reason behind this BZ anyway and the domain is still MAO, I'm bringing this back to MAO for further triage as MCD should just pull that image once quay works
I'm not sure what you mean. The pull error is not shadowing the real reason; there is no other real reason than the pull error. There's no further triage needed here. This happened because quay was suffering an outage, and so the kubelet couldn't join the cluster. That's the root cause.

This highlights a product shortfall in signalling this failure in a meaningful manner. The operational knowledge for managing this config resides in the MCO. I'd suggest this component explore ways to put semantics, where it might make sense, on top of possible failures like this one and bubble those up to where they are more visible to users/consumers.
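
For illustration, here is a minimal sketch of the kind of bubbling-up being suggested. The `Condition` struct and `nodeDegraded` helper are hypothetical stand-ins, not the MCO's actual types; the point is only that the underlying failure text travels up into the condition message instead of being collapsed into a bare exit status.

```go
// Sketch: carry the real error, verbatim, into a degraded status condition.
package main

import (
	"errors"
	"fmt"
)

// Condition is a stand-in for a MachineConfigPool/ClusterOperator status condition.
type Condition struct {
	Type    string
	Status  string
	Reason  string
	Message string
}

// nodeDegraded builds the condition from whatever error the daemon hit, keeping
// the full error chain in the message so `oc get mcp -o yaml` shows the cause.
func nodeDegraded(node string, err error) Condition {
	return Condition{
		Type:    "NodeDegraded",
		Status:  "True",
		Reason:  "1 nodes are reporting degraded status on sync",
		Message: fmt.Sprintf("Node %s is reporting: %q", node, err.Error()),
	}
}

func main() {
	// Pretend this is the low-level failure coming out of the OS update path.
	pullErr := errors.New("error pinging docker registry quay.io: Get https://quay.io/v2/: EOF")
	// Wrapping (rather than replacing) the error preserves the cause end to end.
	osUpdateErr := fmt.Errorf("failed to run pivot: %w", pullErr)

	cond := nodeDegraded("ip-10-0-213-191.us-west-2.compute.internal", osUpdateErr)
	fmt.Printf("%s=%s: %s\n", cond.Type, cond.Status, cond.Message)
}
```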
(In reply to Alberto from comment #11)
> > given that and the fact that the pull error is shadowing the real reason behind this BZ anyway and the domain is still MAO, I'm bringing this back to MAO for further triage as MCD should just pull that image once quay works
>
> I'm not sure what you mean. The pull error is not shadowing the real reason;
> there is no other real reason than the pull error. There's no further triage
> needed here. This happened because quay was suffering an outage, and so the
> kubelet couldn't join the cluster. That's the root cause.
>
> This highlights a product shortfall in signalling this failure in a
> meaningful manner. The operational knowledge for managing this config
> resides in the MCO. I'd suggest this component explore ways to put
> semantics, where it might make sense, on top of possible failures like this
> one and bubble those up to where they are more visible to users/consumers.

Right, the MCO's cluster object should report this up to the CO itself. I'll move this back to try and better assess how we communicate the error up the stack, but since it was caused by the quay.io outage and is not a regression, I'm moving it to 4.6 as well. In other words, this error would have happened exactly the same way in any UPI environment or similar where the machine API is not involved, and it would have been even more hidden.
>Right, the MCO's cluster object should report this up to the CO itself, I'll move it back to try and assess better how we communicate the error up the stack but since it was because of the quay.io outage and not a regression, I'm moving this to 4.6 as well.
Thanks, that makes sense to me. We need to work together across all components to make the product better, exploring ways to make these failures more visible and signalling them bottom-up from the MCO.
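
As background on the log lines quoted earlier ("podman failed: exit status 125; retrying...") and the "(6 tries): timed out waiting for the condition" text that shows up in the verification output below: the daemon retries the image pull a bounded number of times and only then fails. A minimal sketch of that pattern follows, assuming hypothetical helper names; the real code lives in the MCD (run.go) and may differ in structure. The key design point is that the final timeout error wraps the last podman failure, so what gets reported upward still names the real cause.

```go
// Sketch: bounded retry around `podman pull` that preserves the last failure.
package main

import (
	"fmt"
	"os/exec"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// pullWithRetries keeps retrying `podman pull` and, if it never succeeds,
// returns the timeout error wrapped around the last podman failure so the
// caller (and ultimately the pool condition) sees why the pull failed.
func pullWithRetries(image string, tries int) error {
	var lastErr error
	attempt := 0
	interval := 10 * time.Second
	err := wait.Poll(interval, time.Duration(tries)*interval, func() (bool, error) {
		attempt++
		out, err := exec.Command("podman", "pull", "-q", image).CombinedOutput()
		if err != nil {
			lastErr = fmt.Errorf("running podman pull -q %s failed: %s: %w", image, string(out), err)
			fmt.Printf("podman failed: %v; retrying...\n", err)
			return false, nil // not done yet, keep polling
		}
		return true, nil
	})
	if err != nil {
		// err is wait.ErrWaitTimeout, i.e. "timed out waiting for the condition".
		return fmt.Errorf("failed to run command podman (%d tries): %v: %v", attempt, err, lastErr)
	}
	return nil
}

func main() {
	// The digest below is a placeholder, not a real release image.
	image := "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:0000000000000000000000000000000000000000000000000000000000000000"
	if err := pullWithRetries(image, 6); err != nil {
		fmt.Println(err)
	}
}
```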
Adding UpcomingSprint as this won't make the current sprint. We'll try to work on this bug in the next sprint.

The cluster from the must-gather reconciled eventually, and we can see from the logs that the pull worked (quay worked at some point), so the worker pool reconciled successfully and cleared any previous error that we _were reporting_. Unfortunately, what we report to the MCP and the MCO Cluster object is just:

```
failed to run pivot: failed to start machine-config-daemon-host.service: exit status 1
```

which doesn't help us understand that "pulling from quay.io" was the issue. I've attached a PR that aims at bubbling up any error we encounter during an OS update; that should take care of reporting the correct error. I want to just stress the obvious: we clear the error reporting in the MCO if the reconcile was successful (as in the must-gather), but I think we all agree on that behaviour.

RED HAT CONFIDENTIAL

Verified on 4.6.0-0.nightly-2020-07-07-083718

```
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-07-07-083718   True        False         5h28m   Cluster version is 4.6.0-0.nightly-2020-07-07-083718
```

1. Modify the osimageurl to have a different sha256

```
$ oc -n openshift-machine-config-operator edit cm/machine-config-osimageurl
```

2. Wait for errors to occur during the upgrade and watch the MCD log

```
$ oc -n openshift-machine-config-operator logs machine-config-daemon-26hck -c machine-config-daemon
...snip...
I0707 20:17:42.354418  221218 run.go:18] Running: podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4
Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: Error reading manifest sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown: manifest unknown
W0707 20:17:43.479307  221218 run.go:44] podman failed: running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4 failed: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: Error reading manifest sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown: manifest unknown : exit status 125; retrying...
```
3. Verify the error is rolled up to the Machine Config Pool

```
$ oc get mcp/worker -o yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: "2020-07-07T14:31:10Z"
  generation: 7
  labels:
    machineconfiguration.openshift.io/mco-built-in: ""
  managedFields:
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:machineconfiguration.openshift.io/mco-built-in: {}
      f:spec:
        .: {}
        f:configuration: {}
        f:machineConfigSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:machineconfiguration.openshift.io/role: {}
        f:nodeSelector:
          .: {}
          f:matchLabels:
            .: {}
            f:node-role.kubernetes.io/worker: {}
        f:paused: {}
    manager: machine-config-operator
    operation: Update
    time: "2020-07-07T14:31:10Z"
  - apiVersion: machineconfiguration.openshift.io/v1
    fieldsType: FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name: {}
          f:source: {}
      f:status:
        .: {}
        f:conditions: {}
        f:configuration:
          .: {}
          f:name: {}
          f:source: {}
        f:degradedMachineCount: {}
        f:machineCount: {}
        f:observedGeneration: {}
        f:readyMachineCount: {}
        f:unavailableMachineCount: {}
        f:updatedMachineCount: {}
    manager: machine-config-controller
    operation: Update
    time: "2020-07-07T20:09:29Z"
  name: worker
  resourceVersion: "145806"
  selfLink: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  uid: 28a56f99-651b-4d2d-baed-bd1a3f29071d
spec:
  configuration:
    name: rendered-worker-8e7f1827931649f08fb55d3ce0c698e0
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  machineConfigSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/worker: ""
  paused: false
status:
  conditions:
  - lastTransitionTime: "2020-07-07T14:31:58Z"
    message: ""
    reason: ""
    status: "False"
    type: RenderDegraded
  - lastTransitionTime: "2020-07-07T20:05:49Z"
    message: ""
    reason: ""
    status: "False"
    type: Updated
  - lastTransitionTime: "2020-07-07T20:05:49Z"
    message: All nodes are updating to rendered-worker-8e7f1827931649f08fb55d3ce0c698e0
    reason: ""
    status: "True"
    type: Updating
  - lastTransitionTime: "2020-07-07T20:09:29Z"
    message: 'Node ip-10-0-213-191.us-west-2.compute.internal is reporting: "failed to run pivot: failed to run pivot: exit status 1: failed to run command podman (6 tries): timed out waiting for the condition: running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4 failed: Error: error pulling image \"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4\": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: Error reading manifest sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown: manifest unknown\n: exit status 125"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-07-07T20:09:29Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
  configuration:
    name: rendered-worker-f2257ee53059916ea51a9652971094c5
    source:
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 00-worker
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-container-runtime
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 01-worker-kubelet
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-generated-registries
    - apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfig
      name: 99-worker-ssh
  degradedMachineCount: 1
  machineCount: 3
  observedGeneration: 7
  readyMachineCount: 0
  unavailableMachineCount: 1
  updatedMachineCount: 0
```

4. Verified the error is rolled up to the clusteroperator

```
$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h43m
cloud-credential                           4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h53m
cluster-autoscaler                         4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h41m
config-operator                            4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h48m
console                                    4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h33m
csi-snapshot-controller                    4.6.0-0.nightly-2020-07-07-083718   True        False         False      13m
dns                                        4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h47m
etcd                                       4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h45m
image-registry                             4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h36m
ingress                                    4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h36m
insights                                   4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h41m
kube-apiserver                             4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h44m
kube-controller-manager                    4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h44m
kube-scheduler                             4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h46m
kube-storage-version-migrator              4.6.0-0.nightly-2020-07-07-083718   True        False         False      13m
machine-api                                4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h38m
machine-approver                           4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h44m
machine-config                             4.6.0-0.nightly-2020-07-07-083718   False       False         True       3m59s
marketplace                                4.6.0-0.nightly-2020-07-07-083718   True        False         False      12m
monitoring                                 4.6.0-0.nightly-2020-07-07-083718   True        False         False      149m
network                                    4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h48m
node-tuning                                4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h48m
openshift-apiserver                        4.6.0-0.nightly-2020-07-07-083718   True        False         True       5h43m
openshift-controller-manager               4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h42m
openshift-samples                          4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h41m
operator-lifecycle-manager                 4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h47m
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h47m
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-07-07-083718   True        False         False      13m
service-ca                                 4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h48m
storage                                    4.6.0-0.nightly-2020-07-07-083718   True        False         False      5h42m
```

```
$ oc describe co/machine-config
Name:         machine-config
Namespace:
Labels:       <none>
Annotations:  exclude.release.openshift.io/internal-openshift-hosted: true
API Version:  config.openshift.io/v1
Kind:         ClusterOperator
Metadata:
  Creation Timestamp:  2020-07-07T14:25:54Z
  Generation:          1
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:exclude.release.openshift.io/internal-openshift-hosted:
      f:spec:
      f:status:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2020-07-07T14:25:54Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        f:conditions:
        f:extension:
          .:
          f:lastSyncError:
          f:master:
          f:worker:
        f:relatedObjects:
        f:versions:
    Manager:         machine-config-operator
    Operation:       Update
    Time:            2020-07-07T20:16:02Z
  Resource Version:  148084
  Self Link:         /apis/config.openshift.io/v1/clusteroperators/machine-config
  UID:               6fde0884-7b00-4d7a-8a5e-5d36915d1aed
Spec:
Status:
  Conditions:
    Last Transition Time:  2020-07-07T14:32:25Z
    Message:               Cluster version is 4.6.0-0.nightly-2020-07-07-083718
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2020-07-07T20:15:46Z
    Message:               Failed to resync 4.6.0-0.nightly-2020-07-07-083718 because: timed out waiting for the condition during syncRequiredMachineConfigPools: error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 0, updated: 0, unavailable: 1)
    Reason:                RequiredPoolsFailed
    Status:                True
    Type:                  Degraded
    Last Transition Time:  2020-07-07T20:15:46Z
    Message:               Cluster not available for 4.6.0-0.nightly-2020-07-07-083718
    Status:                False
    Type:                  Available
    Last Transition Time:  2020-07-07T14:32:24Z
    Reason:                AsExpected
    Status:                True
    Type:                  Upgradeable
  Extension:
    Last Sync Error:  error pool master is not ready, retrying. Status: (pool degraded: true total: 3, ready 2, updated: 2, unavailable: 1)
    Master:           pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ip-10-0-211-227.us-west-2.compute.internal is reporting: \"failed to run pivot: failed to run pivot: exit status 1: failed to run command podman (6 tries): timed out waiting for the condition: running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4 failed: Error: error pulling image \\\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4\\\": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: Error reading manifest sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown: manifest unknown\\n: exit status 125\""
    Worker:           pool is degraded because nodes fail with "1 nodes are reporting degraded status on sync": "Node ip-10-0-213-191.us-west-2.compute.internal is reporting: \"failed to run pivot: failed to run pivot: exit status 1: failed to run command podman (6 tries): timed out waiting for the condition: running podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4 failed: Error: error pulling image \\\"quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4\\\": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4: Error reading manifest sha256:d36a8e91d2affa7999b0da6d30e01360f9727b7d2cbfd250d0f7d5f385314ca4 in quay.io/openshift-release-dev/ocp-v4.0-art-dev: manifest unknown: manifest unknown\\n: exit status 125\""
  Related Objects:
    Group:
    Name:      openshift-machine-config-operator
    Resource:  namespaces
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigpools
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  controllerconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  kubeletconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  containerruntimeconfigs
    Group:     machineconfiguration.openshift.io
    Name:
    Resource:  machineconfigs
  Versions:
    Name:     operator
    Version:  4.6.0-0.nightly-2020-07-07-083718
Events:  <none>
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196