Description of problem:
Install an IPI GCP cluster; a machine is stuck in the 'Provisioned' phase and never becomes Running.

Version-Release number of selected component (if applicable):
4.6.0-0.nightly-2020-07-19-093912

How reproducible:
Sometimes

Steps to Reproduce:
1. Install an IPI GCP cluster.
2. Check machines and nodes. The machine phase never becomes 'Running'; it is stuck in the 'Provisioned' state. In the YAML output of the machine, status.addresses includes entries with type=InternalDNS.
3. oc get csr — no CSRs were pending.

Actual results:
$ oc get node
NAME                                                        STATUS   ROLES    AGE     VERSION
zhsun720gcp1-t9dp6-master-0.c.openshift-qe.internal         Ready    master   4h17m   v1.18.3+a34fde4
zhsun720gcp1-t9dp6-master-1.c.openshift-qe.internal         Ready    master   4h17m   v1.18.3+a34fde4
zhsun720gcp1-t9dp6-master-2.c.openshift-qe.internal         Ready    master   4h17m   v1.18.3+a34fde4
zhsun720gcp1-t9dp6-worker-a-x9rgs.c.openshift-qe.internal   Ready    worker   3h51m   v1.18.3+a34fde4
zhsun720gcp1-t9dp6-worker-c-bwx5l.c.openshift-qe.internal   Ready    worker   3h50m   v1.18.3+a34fde4

$ oc get machine
NAME                                PHASE         TYPE            REGION        ZONE            AGE
zhsun720gcp1-t9dp6-master-0         Running       n1-standard-4   us-central1   us-central1-a   4h21m
zhsun720gcp1-t9dp6-master-1         Running       n1-standard-4   us-central1   us-central1-b   4h21m
zhsun720gcp1-t9dp6-master-2         Running       n1-standard-4   us-central1   us-central1-c   4h21m
zhsun720gcp1-t9dp6-worker-a-x9rgs   Running       n1-standard-4   us-central1   us-central1-a   3h58m
zhsun720gcp1-t9dp6-worker-b-hgnxz   Provisioned   n1-standard-4   us-central1   us-central1-b   3h58m
zhsun720gcp1-t9dp6-worker-c-bwx5l   Running       n1-standard-4   us-central1   us-central1-c   3h58m

status:
  addresses:
  - address: 10.0.32.3
    type: InternalIP
  - address: zhsun720gcp1-t9dp6-worker-b-hgnxz.us-central1-b.c.openshift-qe.internal
    type: InternalDNS
  - address: zhsun720gcp1-t9dp6-worker-b-hgnxz.c.openshift-qe.internal
    type: InternalDNS
  - address: zhsun720gcp1-t9dp6-worker-b-hgnxz
    type: InternalDNS
  lastUpdated: "2020-07-20T03:38:22Z"
  phase: Provisioned
  providerStatus:
    conditions:
    - lastProbeTime: "2020-07-20T03:38:00Z"
      lastTransitionTime: "2020-07-20T03:38:00Z"
      message: machine successfully created
      reason: MachineCreationSucceeded
      status: "True"
      type: MachineCreated
    instanceId: zhsun720gcp1-t9dp6-worker-b-hgnxz
    instanceState: RUNNING
  metadata: {}

Expected results:
The machine becomes Running and joins the cluster.

Additional info:
must-gather: http://file.rdu.redhat.com/~zhsun/must-gather.local.4255250240848515852.tar.gz
There is a cluster in case you need to debug. https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/102720/artifact/workdir/install-dir/auth/kubeconfig/*view*/
FYI for anyone who comes to look at this: the namespaces folder in the must-gather has the folder structure but no files, so we don't have any of the logs from the controllers.
> Jul 20 03:38:45 zhsun720gcp1-t9dp6-worker-b-hgnxz.c.openshift-qe.internal systemd[1]: Starting Machine Config Daemon Pull...
> Jul 20 03:39:16 zhsun720gcp1-t9dp6-worker-b-hgnxz.c.openshift-qe.internal sh[1687]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: can't talk to a V1 docker registry

So this is a (possibly transient) systemd service failure. For this scenario, where the machine hangs in Provisioned, you should be seeing alerts triggered by the machine API:
https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L13-L23
Also, FWIW, Machine Health Checks exist to mitigate this.

In terms of action items:
- We are considering dumping the instance logs somewhere to make failures like this visible: https://issues.redhat.com/browse/OCPCLOUD-909
- I'd expect the MCO to be more protective against possibly transient issues, e.g. to have some retry logic here: https://github.com/openshift/machine-config-operator/blob/617b6544eb885f48b99f21d09a6b7254d50cb003/templates/common/_base/units/machine-config-daemon-pull.service.yaml#L19
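For a oneshot unit, retry logic of the kind suggested above has to live in the pull command itself. As a rough illustration of the idea only (a hypothetical `retry` helper; the attempt count, delay, and the commented-out pull command are made up, not the MCO's actual logic):

```shell
#!/bin/sh
# Hypothetical retry wrapper for a oneshot pull step. This is a sketch of
# the idea, not the actual machine-config-daemon-pull.service contents.
retry() {
    attempts=$1; shift
    delay=$1; shift
    n=1
    while true; do
        if "$@"; then
            return 0              # command succeeded
        fi
        if [ "$n" -ge "$attempts" ]; then
            echo "giving up after $n attempts: $*" >&2
            return 1              # exhausted retries; systemd sees the failure
        fi
        echo "attempt $n failed; retrying in ${delay}s..." >&2
        sleep "$delay"
        n=$((n + 1))
    done
}

# Usage sketch: retry the image pull up to 5 times, 10s apart, so a
# transient registry hiccup does not leave the node stuck in Provisioned:
# retry 5 10 podman pull "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:<digest>"
```

Note the wrapper still fails after exhausting its attempts, so a genuinely persistent error (like the V1-registry one above) would still surface as a unit failure rather than being masked.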
I'm moving this to MCO now for visibility and because of my last point above. Feel free to move it back to cloud. We'll keep investigating how to make these failures more visible.
(In reply to Alberto from comment #5)
> > Jul 20 03:38:45 zhsun720gcp1-t9dp6-worker-b-hgnxz.c.openshift-qe.internal systemd[1]: Starting Machine Config Daemon Pull...
> > Jul 20 03:39:16 zhsun720gcp1-t9dp6-worker-b-hgnxz.c.openshift-qe.internal sh[1687]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: can't talk to a V1 docker registry
>
> So this is a (possibly transient) systemd failure.
> For this scenario where the machine hangs as provisioned you should be seeing alerts triggered by the machine API.
> https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L13-L23
> Also fwiw Machine Health Checks are to mitigate this.
>
> In terms of action items:
> We are considering possibly dumping the instance logs somewhere to make this visible: https://issues.redhat.com/browse/OCPCLOUD-909
>
> I'd expect the MCO to be more protective against possibly transient issues, e.g. to have some retry logic here:
> https://github.com/openshift/machine-config-operator/blob/617b6544eb885f48b99f21d09a6b7254d50cb003/templates/common/_base/units/machine-config-daemon-pull.service.yaml#L19

The MCO has retry logic for transient network errors during image operations. Unfortunately, what you're getting:

> can't talk to a V1 docker registry

is not a transient error. We do still retry, though.
As for visibility, we've landed https://github.com/openshift/machine-config-operator/pull/1868 which makes the error visible at the MCP/MCO level for easier debugging. Alberto, I'm moving this back to you to close (I imagine).
> It's not a transient error. We do still retry tho.

OK, cool — could you please point me to where in the code the retry happens? I'd consider "can't talk to a docker registry" a possibly transient error, including the registry being temporarily unavailable. Sunzhaohua, can you please share the must-gather logs and validate that this is visible at the MCP/MCO level?
@Alberto the cluster was destroyed, and the namespaces folder in the must-gather has the folder structure but no files, so I couldn't check. Once we hit this again, I will share the logs. http://file.rdu.redhat.com/~zhsun/must-gather.local.4255250240848515852.tar.gz
OK, thanks Sunzhaohua.
The VMs show as created in the cloud console, and there are no pending CSRs. This is likely an Ignition failure. Moving to the MCO team.
On a second look at the failure in https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c4: machine-config-daemon-pull.service is a oneshot unit and doesn't retry on failure. I'm wondering if that's what makes the nodes just hang there.
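For reference, a oneshot unit can't simply set Restart=on-failure (older systemd versions reject any Restart= value other than "no" for Type=oneshot), so a mitigation would have to loop inside the command itself. A hypothetical sketch of that shape (not the real machine-config-daemon-pull.service; the image reference and retry values are placeholders):

```ini
# Hypothetical sketch only; NOT the actual machine-config-daemon-pull.service.
[Unit]
Description=Machine Config Daemon Pull (sketch)
After=network-online.target
Wants=network-online.target

[Service]
Type=oneshot
# Loop the pull inside ExecStart so a transient registry failure does not
# leave the unit permanently failed and the node stuck in Provisioned.
ExecStart=/bin/sh -c 'for i in 1 2 3 4 5; do \
    podman pull quay.io/openshift-release-dev/example@sha256:<digest> && exit 0; \
    sleep 10; \
  done; exit 1'
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
```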
> we'd need a broken (alive) cluster to grab what's happening on the workers like https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c4 - is this 100% reproducible also so I can directly try?

From our CI test results, it is not 100% reproducible, but I will leave you a live cluster if I encounter it again.
(In reply to Antonio Murdaca from comment #14)
> On a second look of the failure in
> https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c4 - the
> machine-config-daemon-pull.service is oneshot and it doesn't retry on
> failure - wondering if that makes the nodes just hang in there

It does. At least, this is what we observed on OpenStack in [1], where master-1 never joined the cluster. machine-config-daemon-pull.service failed with:

Sep 01 07:36:35 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: Starting Machine Config Daemon Pull...
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 sh[1674]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ed41be071be9414b8893c28411a76d35c13989192d9f2c1735695312526f12": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ed41be071be9414b8893c28411a76d35c13989192d9f2c1735695312526f12: unable to pull image: Error writing blob: error storing blob to file "/var/tmp/storage730560957/5": read tcp 10.0.3.76:56530->13.225.198.38:443: read: connection reset by peer
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: machine-config-daemon-pull.service: Main process exited, code=exited, status=125/n/a
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: machine-config-daemon-pull.service: Failed with result 'exit-code'.
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: Failed to start Machine Config Daemon Pull.
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: machine-config-daemon-pull.service: Consumed 6.462s CPU time

Likely a quay.io hiccup that could have been mitigated by retrying the image pull.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.6/1300691929096261632/artifacts/e2e-openstack/
dang, thanks Martin, I'll follow up
*** Bug 1875678 has been marked as a duplicate of this bug. ***
Closing this via https://bugzilla.redhat.com/show_bug.cgi?id=1870343#c7, as we're effectively fixing this by shipping the PR in that other bug. Hopefully it's the same root cause; otherwise this will be reopened. *** This bug has been marked as a duplicate of bug 1870343 ***