Bug 1859428

| Field | Value |
|---|---|
| Summary | [gcp] Machine stuck at 'Provisioned' phase |
| Product | OpenShift Container Platform |
| Component | Machine Config Operator |
| Version | 4.6 |
| Target Release | 4.6.0 |
| Target Milestone | --- |
| Status | CLOSED DUPLICATE |
| Severity | low |
| Priority | medium |
| Reporter | sunzhaohua <zhsun> |
| Assignee | Antonio Murdaca <amurdaca> |
| QA Contact | Michael Nguyen <mnguyen> |
| CC | jiajliu, m.andre, mgugino, walters, wjiang |
| Keywords | Reopened, UpcomingSprint |
| Flags | agarcial: needinfo+ |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Doc Type | If docs needed, set a value |
| Clones | 1879826 (view as bug list) |
| Bug Blocks | 1879826 |
| Last Closed | 2020-09-08 12:40:01 UTC |
Description (sunzhaohua, 2020-07-22 02:04:56 UTC)
There is a cluster in case you need to debug: https://mastern-jenkins-csb-openshift-qe.cloud.paas.psi.redhat.com/job/Launch%20Environment%20Flexy/102720/artifact/workdir/install-dir/auth/kubeconfig/*view*/

FYI for anyone who comes to look at this: the namespaces folder in the must-gather has the folder structure but no files, so we don't have any of the logs from the controllers.

> Jul 20 03:38:45 zhsun720gcp1-t9dp6-worker-b-hgnxz.c.openshift-qe.internal systemd[1]: Starting Machine Config Daemon Pull...
> Jul 20 03:39:16 zhsun720gcp1-t9dp6-worker-b-hgnxz.c.openshift-qe.internal sh[1687]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: can't talk to a V1 docker registry

So this is a (possibly transient) systemd failure. For this scenario, where the machine hangs as Provisioned, you should be seeing alerts triggered by the machine API: https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L13-L23
Also, fwiw, Machine Health Checks are there to mitigate this.

In terms of action items:

We are considering dumping the instance logs somewhere to make this visible: https://issues.redhat.com/browse/OCPCLOUD-909

I'd expect the MCO to be more protective against possibly transient issues, e.g. to have some retry logic here: https://github.com/openshift/machine-config-operator/blob/617b6544eb885f48b99f21d09a6b7254d50cb003/templates/common/_base/units/machine-config-daemon-pull.service.yaml#L19

I'm moving this to the MCO now for visibility, and considering my last point above. Feel free to move back to cloud. We'll keep investigating how to make these failures more visible.
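For context on the Machine Health Checks mentioned above: a machine that stays in 'Provisioned' (its node never appears) can be remediated automatically by a MachineHealthCheck with a node startup timeout. Below is a minimal sketch, assuming a worker pool selected by the standard machine-role label; the timeouts and maxUnhealthy value are illustrative choices, not settings taken from this cluster.

```yaml
# Minimal sketch of a MachineHealthCheck that remediates machines whose node
# never joins within nodeStartupTimeout, which is the failure mode in this bug.
# Selector labels, timeouts, and maxUnhealthy are illustrative assumptions.
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-stuck-provisioned
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  # If the node backing a machine never shows up, remediate after 30 minutes.
  nodeStartupTimeout: 30m
  # Never remediate more than 40% of the targeted machines at once.
  maxUnhealthy: 40%
  unhealthyConditions:
    - type: Ready
      status: Unknown
      timeout: 300s
    - type: Ready
      status: "False"
      timeout: 300s
```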
(In reply to Alberto from comment #5)
> So this is a (possibly transient) systemd failure. For this scenario, where the machine hangs as Provisioned, you should be seeing alerts triggered by the machine API: https://github.com/openshift/machine-api-operator/blob/master/install/0000_90_machine-api-operator_04_alertrules.yaml#L13-L23
> Also, fwiw, Machine Health Checks are there to mitigate this.
>
> In terms of action items:
>
> We are considering dumping the instance logs somewhere to make this visible: https://issues.redhat.com/browse/OCPCLOUD-909
>
> I'd expect the MCO to be more protective against possibly transient issues, e.g. to have some retry logic here: https://github.com/openshift/machine-config-operator/blob/617b6544eb885f48b99f21d09a6b7254d50cb003/templates/common/_base/units/machine-config-daemon-pull.service.yaml#L19

The MCO has retry logic for transient network errors during image operations. Unfortunately, what you're getting:

> unable to pull image: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:67b0c6c90d55d1ada864daae1a0ce2b5204785ba72851f314295c4f8814688aa: can't talk to a V1 docker registry

It's not a transient error. We do still retry, though. As for visibility, we've landed https://github.com/openshift/machine-config-operator/pull/1868, which makes the error visible at the MCP/MCO level for easier debugging.

Alberto, I'm moving this back to you to close (I imagine).

> It's not a transient error. We do still retry, though.
Ok, cool. Could you please point me to where in the code the retry happens?
I'd consider "can't talk to a docker registry" a possibly transient error, including the case where the registry is temporarily unavailable.
Sunzhaohua, can you please share the must-gather logs and validate this is visible at the MCP/MCO level?
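For readers trying to picture "visible at the MCP/MCO level": the idea is that a node-level failure surfaces as a degraded condition on the MachineConfigPool, where `oc get mcp` / `oc describe mcp` can see it. The snippet below is a purely hypothetical illustration of such a condition; it is not output from this cluster and is not tied to what PR 1868 actually emits.

```yaml
# Hypothetical illustration only: a MachineConfigPool carrying a degraded
# condition with a node-level error message. Condition types follow the
# machineconfiguration.openshift.io/v1 API; reason and message text are invented.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: worker
status:
  conditions:
    - type: NodeDegraded
      status: "True"
      reason: OneOrMoreNodesAreDegraded
      message: 'node X: failed to pull image "quay.io/...": can''t talk to a V1 docker registry'
    - type: Degraded
      status: "True"
```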
@Alberto the cluster was destroyed, and the namespaces folder in the must-gather has the folder structure but no files, so I couldn't check. Once we hit this again I will share logs. http://file.rdu.redhat.com/~zhsun/must-gather.local.4255250240848515852.tar.gz

Ok, thanks sunzhaohua.

VMs were created in the cloud console. No pending CSRs. This is likely an ignition failure. Moving to MCO team.

On a second look at the failure in https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c4, the machine-config-daemon-pull.service is oneshot and it doesn't retry on failure; wondering if that makes the nodes just hang in there.

We'd need a broken (alive) cluster to grab what's happening on the workers, like https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c4. Also, is this 100% reproducible, so I can try directly?
From our CI test results, it does not look 100% reproducible, but I will leave you a live cluster if I encounter it again.
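On the oneshot/no-retry point from comment #14: one way to make such a unit resilient to a registry hiccup is to wrap the pull in a bounded retry loop inside the unit itself. The sketch below is written as a MachineConfig-style systemd unit entry; it is not the actual machine-config-daemon-pull.service template, and the retry count, delay, authfile path, and image placeholder are all assumptions.

```yaml
# Sketch only, not the real MCO template: a oneshot pull unit that retries a
# few times before giving up, so a transient quay.io failure does not leave
# the node stuck in 'Provisioned'.
systemd:
  units:
    - name: machine-config-daemon-pull.service
      enabled: true
      contents: |
        [Unit]
        Description=Machine Config Daemon Pull (sketch with naive retry)
        Wants=network-online.target
        After=network-online.target

        [Service]
        Type=oneshot
        # Placeholder image reference; the real unit is templated by the MCO.
        Environment=MCD_PULL_IMAGE=quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:...
        # "$$" keeps systemd from expanding the variable; bash resolves it from
        # the environment set above. Retry up to 5 times, 30 seconds apart.
        ExecStart=/bin/bash -c 'for i in 1 2 3 4 5; do podman pull --authfile=/var/lib/kubelet/config.json "$${MCD_PULL_IMAGE}" && exit 0; sleep 30; done; exit 1'

        [Install]
        WantedBy=multi-user.target
```

Keeping the retry inside ExecStart preserves the Type=oneshot semantics, rather than relying on systemd-level Restart= handling, which historically has restrictions for oneshot units.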
(In reply to Antonio Murdaca from comment #14)
> On a second look at the failure in https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c4, the machine-config-daemon-pull.service is oneshot and it doesn't retry on failure; wondering if that makes the nodes just hang in there.

It does. At least, this is what we observed on OpenStack in [1], where master-1 never joined the cluster. machine-config-daemon-pull.service failed with:

Sep 01 07:36:35 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: Starting Machine Config Daemon Pull...
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 sh[1674]: Error: error pulling image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ed41be071be9414b8893c28411a76d35c13989192d9f2c1735695312526f12": unable to pull quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:34ed41be071be9414b8893c28411a76d35c13989192d9f2c1735695312526f12: unable to pull image: Error writing blob: error storing blob to file "/var/tmp/storage730560957/5": read tcp 10.0.3.76:56530->13.225.198.38:443: read: connection reset by peer
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: machine-config-daemon-pull.service: Main process exited, code=exited, status=125/n/a
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: machine-config-daemon-pull.service: Failed with result 'exit-code'.
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: Failed to start Machine Config Daemon Pull.
Sep 01 07:37:33 k80b2cxr-12e28-mtfdr-master-1 systemd[1]: machine-config-daemon-pull.service: Consumed 6.462s CPU time

Likely a quay.io hiccup that could have been mitigated with a retry on image pulling.

[1] https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-openstack-4.6/1300691929096261632/artifacts/e2e-openstack/

Dang, thanks Martin, I'll follow up.

*** Bug 1875678 has been marked as a duplicate of this bug. ***

Closing this with https://bugzilla.redhat.com/show_bug.cgi?id=1870343#c7, as we're effectively fixing this by shipping the PR in that other bug. Hopefully it is the same root cause, or this will be reopened.

*** This bug has been marked as a duplicate of bug 1870343 ***