Bug 1859428

| Field | Value |
|---|---|
| Summary | [gcp] Machine stuck at 'Provisioned' phase |
| Product | OpenShift Container Platform |
| Component | Machine Config Operator |
| Version | 4.6 |
| Target Release | 4.6.0 |
| Target Milestone | --- |
| Status | CLOSED DUPLICATE |
| Severity | low |
| Priority | medium |
| Reporter | sunzhaohua <zhsun> |
| Assignee | Antonio Murdaca <amurdaca> |
| QA Contact | Michael Nguyen <mnguyen> |
| CC | jiajliu, m.andre, mgugino, walters, wjiang |
| Keywords | Reopened, UpcomingSprint |
| Flags | agarcial: needinfo+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Doc Type | If docs needed, set a value |
| Clones | 1879826 (view as bug list) |
| Bug Blocks | 1879826 |
| Last Closed | 2020-09-08 12:40:01 UTC |
Description (sunzhaohua, 2020-07-22 02:04:56 UTC)
There is a cluster in case you need to debug: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/102720/artifact/workdir/install-dir/auth/kubeconfig/*view*/

FYI for anyone who comes to look at this: the namespaces folder in the must-gather has the folder structure but no files, so we don't have any of the logs from the controllers.

> Jul 20 03:38:45 zhsun720gcp1-t9dp6-worker-b-hgnxz.c.openshift-qe.internal systemd[1]: Starting Machine Config Daemon Pull...
> Jul 20 03:39:16 zhsun720gcp1-t9dp6-worker-b-hgnxz.c.openshift-qe.internal sh[1687]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: can't talk to a V1 docker registry

So this is a (possibly transient) systemd failure. For this scenario, where the machine hangs as Provisioned, you should be seeing alerts triggered by the machine API: https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L13-L23
Also, fwiw, Machine Health Checks are there to mitigate this.

In terms of action items:

We are considering dumping the instance logs somewhere to make this visible: https://issues.redhat.com/browse/OCPCLOUD-909

I'd expect the MCO to be more protective against possibly transient issues, e.g. to have some retry logic here: https://github.com/openshift/machine-config-operator/blob/617b6544eb885f48b99f21d09a6b7254d50cb003/templates/common/_base/units/machine-config-daemon-pull.service.yaml#L19

I'm moving this to the MCO now for visibility, and considering my last point above. Feel free to move back to cloud. We'll keep investigating how to make these failures more visible.
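For context on the Machine Health Checks mentioned above: a machine that stays in 'Provisioned' (its node never appears) can be remediated automatically by a MachineHealthCheck with a node startup timeout. Below is a minimal sketch, assuming a worker pool selected by the standard machine-role label; the timeouts and maxUnhealthy value are illustrative choices, not settings taken from this cluster.

```yaml
# Minimal sketch of a MachineHealthCheck that remediates machines whose node
# never joins within nodeStartupTimeout, which is the failure mode in this bug.
# Selector labels, timeouts, and maxUnhealthy are illustrative assumptions.
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-stuck-provisioned
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  # If the node backing a machine never shows up, remediate after 30 minutes.
  nodeStartupTimeout: 30m
  # Never remediate more than 40% of the targeted machines at once.
  maxUnhealthy: 40%
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```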
(In reply to Alberto from comment #5)
> So this is a (possibly transient) systemd failure. For this scenario, where the machine hangs as Provisioned, you should be seeing alerts triggered by the machine API: https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L13-L23
> Also, fwiw, Machine Health Checks are there to mitigate this.
>
> In terms of action items:
>
> We are considering dumping the instance logs somewhere to make this visible: https://issues.redhat.com/browse/OCPCLOUD-909
>
> I'd expect the MCO to be more protective against possibly transient issues, e.g. to have some retry logic here: https://github.com/openshift/machine-config-operator/blob/617b6544eb885f48b99f21d09a6b7254d50cb003/templates/common/_base/units/machine-config-daemon-pull.service.yaml#L19

The MCO has retry logic for transient network errors during image operations. Unfortunately, what you're getting:

> unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: can't talk to a V1 docker registry

It's not a transient error. We do still retry, though. As for visibility, we've landed https://github.com/openshift/machine-config-operator/pull/1868, which makes the error visible at the MCP/MCO level for easier debugging.

Alberto, I'm moving this back to you to close (I imagine).

> It's not a transient error. We do still retry, though.
Ok, cool. Could you please point me to where in the code the retry happens?
I'd consider "can't talk to a docker registry" a possibly transient error, including the case where the registry is temporarily unavailable.
Sunzhaohua, can you please share the must-gather logs and validate this is visible at the MCP/MCO level?
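For readers trying to picture "visible at the MCP/MCO level": the idea is that a node-level failure surfaces as a degraded condition on the MachineConfigPool, where `oc get mcp` / `oc describe mcp` can see it. The snippet below is a purely hypothetical illustration of such a condition; it is not output from this cluster and is not tied to what PR 1868 actually emits.

```yaml
# Hypothetical illustration only: a MachineConfigPool carrying a degraded
# condition with a node-level error message. Condition types follow the
# machineconfiguration.openshift.io/v1 API; reason and message text are invented.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
status:
  conditions:
    - type: NodeDegraded
      status: "True"
      reason: OneOrMoreNodesAreDegraded
      message: 'node X: failed to pull image "quay.io/...": can''t talk to a V1 docker registry'
    - type: Degraded
      status: "True"
```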
@Alberto the cluster was destroyed, and the namespaces folder in the must-gather has the folder structure but no files, so I couldn't check. Once we hit this again I will share logs. http://file.rdu.redhat.com/~zhsun/must-gather.local.4255250240848515852.tar.gz

Ok, thanks sunzhaohua.

VMs were created in the cloud console. No pending CSRs. This is likely an ignition failure. Moving to MCO team.

On a second look at the failure in https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c4, the machine-config-daemon-pull.service is oneshot and it doesn't retry on failure; wondering if that makes the nodes just hang in there.

We'd need a broken (alive) cluster to grab what's happening on the workers, like https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c4. Also, is this 100% reproducible, so I can try directly?
From our CI test results, it does not look 100% reproducible, but I will leave you a live cluster if I encounter it again.
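On the oneshot/no-retry point from comment #14: one way to make such a unit resilient to a registry hiccup is to wrap the pull in a bounded retry loop inside the unit itself. The sketch below is written as a MachineConfig-style systemd unit entry; it is not the actual machine-config-daemon-pull.service template, and the retry count, delay, authfile path, and image placeholder are all assumptions.

```yaml
# Sketch only, not the real MCO template: a oneshot pull unit that retries a
# few times before giving up, so a transient quay.io failure does not leave
# the node stuck in 'Provisioned'.
systemd:
  units:
    - name: machine-config-daemon-pull.service
      enabled: true
      contents: |
        [Unit]
        Description=Machine Config Daemon Pull (sketch with naive retry)
        Wants=network-online.target
        After=network-online.target

        [Service]
        Type=oneshot
        # Placeholder image reference; the real unit is templated by the MCO.
        Environment=MCD_PULL_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
        # "$$" keeps systemd from expanding the variable; bash resolves it from
        # the environment set above. Retry up to 5 times, 30 seconds apart.
        ExecStart=/bin/bash -c 'for i in 1 2 3 4 5; do podman pull --authfile=/var/lib/kubelet/config.json "$${MCD_PULL_IMAGE}" && exit 0; sleep 30; done; exit 1'

        [Install]
        WantedBy=multi-user.target
```

Keeping the retry inside ExecStart preserves the Type=oneshot semantics, rather than relying on systemd-level Restart= handling, which historically has restrictions for oneshot units.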
(In reply to Antonio Murdaca from comment #14)
> On a second look at the failure in https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c4, the machine-config-daemon-pull.service is oneshot and it doesn't retry on failure; wondering if that makes the nodes just hang in there.

It does. At least, this is what we observed on OpenStack in [1], where master-1 never joined the cluster. machine-config-daemon-pull.service failed with:

Sep 01 07:36:35 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: Starting Machine Config Daemon Pull...
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 sh[1674]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ed41be071be9414b8893c28411a76d35c13989192d9f2c1735695312526f12": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ed41be071be9414b8893c28411a76d35c13989192d9f2c1735695312526f12: unable to pull image: Error writing blob: error storing blob to file "/var/tmp/storage730560957/5": read tcp 10.0.3.76:56530->13.225.198.38:443: read: connection reset by peer
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: machine-config-daemon-pull.service: Main process exited, code=exited, status=125/n/a
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: machine-config-daemon-pull.service: Failed with result 'exit-code'.
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: Failed to start Machine Config Daemon Pull.
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: machine-config-daemon-pull.service: Consumed 6.462s CPU time

Likely a quay.io hiccup that could have been mitigated with a retry on image pulling.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.6/1300691929096261632/artifacts/e2e-openstack/

Dang, thanks Martin, I'll follow up.

*** Bug 1875678 has been marked as a duplicate of this bug. ***

Closing this with https://bugzilla.redhat.com/show_bug.cgi?id=1870343#c7, as we're effectively fixing this by shipping the PR in that other bug. Hopefully it is the same root cause, or this will be reopened.

*** This bug has been marked as a duplicate of bug 1870343 ***