Bug 1779375
| Summary: | Azure flake: OS Provisioning for VM ... did not finish in the allotted time | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Installer | Assignee: | aos-install |
| Installer sub component: | openshift-installer | QA Contact: | Gaoyun Pei <gpei> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | adahiya, bbreard, bleanhar, dahernan, dustymabe, ffranz, imcleod, jligon, miabbott, nstielau, oarribas, pkanthal, pravesh, rkshirsa, walters |
| Version: | 4.3.0 | Keywords: | Reopened, UpcomingSprint |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-09 18:56:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2019-12-03 20:40:36 UTC
We'll reopen this if it shows up more often.

Reopening. I understand that this is an account/provider issue and not really related to the OpenShift installer, but is there anything we can do from Red Hat?

How did you solve the issue? Facing it on an OCP 4.3 IPI Azure installation.

Setting to low priority as this is the first instance of the timeout being reported since the BZ was opened originally; targeting 4.6 as a possible fix target.

@wking do you recall any specifics about how this BZ was resolved for the Azure CI jobs?

I still think that this is just a "sometimes the provider is slow" issue, as described in comment 0. Colin closed this in comment 2 citing a low flake rate. I don't think that we actually did anything. I think we should just leave this around to track the fact that we know it exists, expect the flake rate to be low, expect this to be an Azure-side issue, and think we might be able to work around it (by increasing timeouts or some such) if it happens more often (and if for whatever reason we think a longer Azure-side delay would be reasonable). But that approach doesn't work as well now that folks are supposed to kick open bugs every sprint, so <shrug>, folks can do whatever they want with it. Also, why is everyone whispering in private comments? I don't see anything sensitive above.

A reinstallation from scratch works, so the RCA is unclear; longer timeouts would probably help.

I have tried in my local environment and I was not able to reproduce the issue anymore.

Hmm, are we pruning job logs? Checking [1], it seems like 507 and surrounds are gone and we start at 515? Anyhow, I don't think there was all that much in the original logs beyond the excerpt I quoted in comment 0. I dunno why Abhinav moved this from the installer to RHCOS; I also don't see what RHCOS would have to do with this. I still think this is the provider being slow, and the installer saying any of:

a. We will fail when the provider is that slow, because it's unacceptable and folks can delete and retry. WONTFIX
b. We could wait for longer when the provider is that slow, but it happens so rarely that we are unlikely to ever get around to that. DEFERRED
c. We could wait for longer when the provider is that slow; will get a PR up eventually. Until then... NEW.

Moving back to the installer based on that. If I'm just missing something and this is really an RHCOS thing, please explain why when kicking back over :)

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/?marker=logs%2frelease-openshift-ocp-installer-e2e-azure-4.3%2f500%2f

(In reply to W. Trevor King from comment #8)
> Hmm, are we pruning job logs? Checking [1], it seems like 507 and surrounds
> are gone and we start at 515? Anyhow, I don't think there was all that much
> in the original logs beyond the excerpt I quoted in comment 0. I dunno why
> Abhinav moved this from the installer to RHCOS; I also don't see what RHCOS
> would have to do with this.

Azure requires the OS to provide a signal when it has booted, and the machine is marked as provisioning until then. Since the control-plane machines' Ignition waits to get its config from the bootstrap host, any failure where the machine-config-server on the bootstrap host cannot serve the Ignition config manifests as this error.

> OS Provisioning for VM 'ci-op-gf3bnp64-cbd67-tpz55-master-2' did not finish in the allotted time

The error makes it clear that the OS failed to provision, so it's not an installer or Azure bug.

> I still think this is the provider being slow,
> and the installer saying any of:
>
> a. We will fail when the provider is that slow, because it's unacceptable
> and folks can delete and retry. WONTFIX
> b. We could wait for longer when the provider is that slow, but it happens
> so rarely that we are unlikely to ever get around to that. DEFERRED
> c. We could wait for longer when the provider is that slow; will get a PR up
> eventually. Until then... NEW.
>
> Moving back to the installer based on that. If I'm just missing something
> and this is really an RHCOS thing, please explain why when kicking back over
> :)
>
> [1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/?marker=logs%2frelease-openshift-ocp-installer-e2e-azure-4.3%2f500%2f

Is there anywhere in the provisioning process where we could add a check to prove the problem is caused by the provider?

> Since the control-plane machines' Ignition waits to get its config from the bootstrap host, any failure where the machine-config-server on the bootstrap host cannot serve the Ignition config manifests as this error.

Ahh, that makes sense to me. CI search [2] turns up [3] and a few other recent examples. That job was testing 4.4.0-0.nightly-2020-05-30-022631 [4], and it died with:

```
time="2020-05-30T03:28:43Z" level=debug msg="module.master.azurerm_virtual_machine.master[1]: Still creating... [40m40s elapsed]"
time="2020-05-30T03:28:44Z" level=debug msg="2020/05/30 03:28:44 [DEBUG] module.master.azurerm_virtual_machine.master[1]: apply errored, but we're indicating that via the Error pointer rather than returning it: Code=\"OSProvisioningTimedOut\" Message=\"OS Provisioning for VM 'ci-op-yyghnnkj-761c8-7qk5z-master-1' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\\r\\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \\r\\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \""
time="2020-05-30T03:28:44Z" level=debug msg="2020/05/30 03:28:44 [ERROR] module.master: eval: *terraform.EvalApplyPost, err: Code=\"OSProvisioningTimedOut\" Message=\"OS Provisioning for VM 'ci-op-yyghnnkj-761c8-7qk5z-master-1' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\\r\\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \\r\\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \""
time="2020-05-30T03:28:44Z" level=debug msg="2020/05/30 03:28:44 [ERROR] module.master: eval: *terraform.EvalSequence, err: Code=\"OSProvisioningTimedOut\" Message=\"OS Provisioning for VM 'ci-op-yyghnnkj-761c8-7qk5z-master-1' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\\r\\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \\r\\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \""
time="2020-05-30T03:28:44Z" level=error
time="2020-05-30T03:28:44Z" level=error msg="Error: Code=\"OSProvisioningTimedOut\" Message=\"OS Provisioning for VM 'ci-op-yyghnnkj-761c8-7qk5z-master-1' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\\r\\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \\r\\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \""
time="2020-05-30T03:28:44Z" level=error
time="2020-05-30T03:28:44Z" level=error msg=" on ../tmp/openshift-install-764857706/master/master.tf line 81, in resource \"azurerm_virtual_machine\" \"master\":"
time="2020-05-30T03:28:44Z" level=error msg=" 81: resource \"azurerm_virtual_machine\" \"master\" {"
```

But we don't have console logs for those machines [6] (which would be a CI-template fix), and the installer doesn't try to gather bootstrap logs when it dies this way in Terraform (which would be an installer fix). Absent one or both of those, it's not clear to me how RHCOS folks would root-cause this.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/570/pull-ci-openshift-machine-api-operator-master-e2e-azure/987
[2]: https://search.svc.ci.openshift.org/?search=OS+Provisioning+for+VM.*did+not+finish+in+the+allotted+time&maxAge=336h&type=bug%2Bjunit&name=azure
[3]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1700
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1700/build-log.txt
[5]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1700/artifacts/e2e-azure/installer/.openshift_install.log
[6]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1700/artifacts/e2e-azure/nodes/

I'd be supportive of cloning this bug to track the installer fixes to make this more debuggable. Abhinav, I'm curious if you think that's reasonable here.
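As an aside, the `OSProvisioningTimedOut` signature quoted above is easy to test for mechanically when triaging install logs by hand. A minimal sketch (the function name, sample string, and approach are illustrative only, not part of any installer or CI tooling):

```python
import re

# The Terraform error code seen in the log excerpt above. The installer log
# backslash-escapes the quotes around the code, so the backslashes are optional
# in the pattern to also match unescaped occurrences.
PATTERN = re.compile(r'Code=\\?"OSProvisioningTimedOut\\?"')

def hit_provisioning_timeout(log_text: str) -> bool:
    """Return True if the install-log text contains the Azure OS-provisioning timeout."""
    return bool(PATTERN.search(log_text))

# Example against one line of the excerpt above:
sample = (
    'time="2020-05-30T03:28:44Z" level=error msg="Error: Code=\\"OSProvisioningTimedOut\\" '
    'Message=\\"OS Provisioning for VM \'ci-op-yyghnnkj-761c8-7qk5z-master-1\' '
    'did not finish in the allotted time.\\""'
)
print(hit_provisioning_timeout(sample))  # True
```

This only distinguishes "provider reported a provisioning timeout" from other Terraform failures; it says nothing about why the OS never signaled readiness, which is exactly the gap the console-log and gather proposals above are meant to fill.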
So comment #9 shows that this is probably a provisioning problem on the OS side, but comment #11 explains that we don't have the necessary logs for RHCOS to debug it. There may be an OS-level problem here, but without the necessary logs we can't do much with this problem. I'm gonna keep this as a low-priority BZ for 4.6, hoping that we get an enhancement to the installer (re: comment #12) that can deliver us more logs.

I remember seeing this issue a number of times back when I was working on Azure UPI, and it would happen *a lot* more in 4.4 GA and the first z-streams of 4.4 than in 4.3. Surprisingly, I found the logs from an 'openshift-install gather' of one of these failures still lying here on my machine. :-) It's attached; hopefully it helps somehow.

From Fabiano's log-bundle:
* log-bundle-20200430122444/failed-units.txt
> * afterburn-sshkeys loaded failed failed Afterburn (SSH Keys)
Probably a red herring, but it would be nice if we could clean this up.
* log-bundle-20200430122444/control-plane/10.0.0.5,10.0.0.6,10.0.0.7/
This looks like you used comma-delimited IPs instead of:
$ openshift-install gather bootstrap --bootstrap whatever --master 10.0.0.5 --master 10.0.0.6 --master 10.0.0.7
Previous discussion in [1,2]. Upshot here is that we didn't attempt to gather from the actual control-plane machines.
* This is certainly suspicious:
$ grep -o 'Error executing MCO.*' bootstrap/journals/bootkube.log | uniq -c
809 Error executing MCO: unknown flag: --cluster-etcd-operator-image
Looks like the installer grew it in [3]. Possibly you were just pointing a 4.4 installer at a 4.3 payload [4]. Checking the CVO pod:
$ grep SOURCE_GIT bootstrap/pods/e9ad8851b53c.inspect
"SOURCE_GIT_URL=https://github.com/openshift/cluster-version-operator",
"SOURCE_GIT_COMMIT=ac81ab4e700e3d7d01cfeb8dae5a21aa28b6c3dc",
"SOURCE_GIT_TAG=v1.0.0-172-gac81ab4",
And then, in the CVO repo:
$ git --no-pager log --oneline origin/release-4.4 | grep ac81ab4
$ git --no-pager log --oneline origin/release-4.3 | grep ac81ab4
ac81ab4 Merge pull request #340 from jottofar/backport-signature-verify
So unless we also have that installer/release mismatch going on in CI, I don't think this log bundle helps :/.
[1]: https://github.com/openshift/release/pull/4734#issue-307109872
[2]: https://github.com/openshift/openshift-docs/pull/17196
[3]: https://github.com/openshift/installer/pull/2730
[4]: https://github.com/openshift/installer/pull/2730#issuecomment-561935476
> * afterburn-sshkeys loaded failed failed Afterburn (SSH Keys)

This is a red herring. All it means is that Afterburn did not find SSH keys in the Azure metadata. That's it. https://github.com/coreos/afterburn/blob/master/systemd/afterburn-sshkeys%40.service.in The failure is fine, IMO.

@wking how can we the Installer versions checked in CI? This is not looking promising that it's an RHCOS issue.

> All it means is that Afterburn did not find SSH keys in the Azure metadata.

If that's acceptable (and from your wording it seems like it is), can we adjust the unit to exit 0 and say "I have successfully loaded all zero keys you asked for" instead of failing? Otherwise it's up to folks with less context than the unit authors to determine whether the failure is serious or acceptable.

> how can we the Installer versions checked in CI?

I am having trouble parsing this. Can you rephrase?

Afterburn is already disabled in 4.6 since https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/972 landed. I'll backport that to at least 4.5 and probably 4.4. (er, more precisely afterburn-sshkeys@)

https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/1046 merged in regard to the `afterburn-sshkeys@` unit.

Do we expect to get a usable log bundle without some legwork? I still think the comment 11 debugging proposals are both potential paths forward. [1] is precedent for gathering console logs; I'd expect someone familiar with Azure could create a similar step for Azure. And if we don't want to actually land code in the installer to log-bundle automatically, we could hack up an Azure step to manually run an installer gather on some/all install failures.
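For the suggestion above about having the unit not report failure when there are simply no keys: one low-tech way that does not even require an Afterburn code change would be a systemd drop-in that reclassifies the exit status. This is only a sketch, not what was actually shipped; the drop-in name, and the assumption that the unit exits with status 1 in the no-keys case, are both hypothetical:

```ini
# Hypothetical drop-in: /etc/systemd/system/afterburn-sshkeys@.service.d/tolerate-no-keys.conf
# Assumes (unverified) that the unit exits 1 when the platform metadata has no SSH keys.
[Service]
# Treat that exit status as success, so "zero keys found" no longer shows up
# in failed-units.txt and doesn't distract people triaging install failures.
SuccessExitStatus=1
```

The cleaner fix is of course inside Afterburn itself (exit 0 on "no keys"), which is effectively what disabling `afterburn-sshkeys@` in 4.6 sidesteps entirely.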
This is certainly still happening reasonably often in CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=OS+Provisioning+for+VM.*did+not+finish+in+the+allotted+time&maxAge=336h&type=junit&name=release-openshift-ocp-.*azure&groupBy=job' | grep 'failures match'
release-openshift-ocp-installer-e2e-azure-4.6 - 154 runs, 44% failed, 4% of failures match
release-openshift-ocp-installer-e2e-azure-serial-4.3 - 43 runs, 23% failed, 10% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.4 - 14 runs, 64% failed, 11% of failures match
release-openshift-ocp-installer-e2e-azure-4.3 - 45 runs, 33% failed, 7% of failures match
release-openshift-ocp-installer-e2e-azure-serial-4.4 - 46 runs, 33% failed, 7% of failures match

[1]: https://github.com/openshift/release/tree/39557830d861800802fa78033edcb70783691274/ci-operator/step-registry/gather/aws-console

> The CoreOS team (including myself) has expertise around how RHCOS interacts with Azure at the hypervisor level.

The value of "hacking" something together for solving this bug is eclipsed by the value of having a true feature in the installer for capturing the logs (in Azure-speak, enabling boot diagnostics). I'm sending this over to the Installer to enable that feature in the installer, since doing that legwork will yield long-term results that will make debugging boot failures much easier for all teams. Once we have those logs, I am more than happy to take a look.
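For a rough sense of the absolute numbers behind the CI search percentages above, the figures can simply be multiplied out. A small sketch (the figures are copied from the search output; the helper function is illustrative, not part of the search tooling):

```python
# Estimate how many runs hit the OSProvisioningTimedOut flake, given
# "N runs, F% failed, M% of failures match" lines from the CI search output.
def matching_runs(runs: int, pct_failed: float, pct_match: float) -> float:
    """Approximate number of runs whose failure matched the search string."""
    return runs * (pct_failed / 100) * (pct_match / 100)

# (runs, % failed, % of failures matching), per job, over the 14-day window:
jobs = {
    "release-openshift-ocp-installer-e2e-azure-4.6": (154, 44, 4),
    "release-openshift-ocp-installer-e2e-azure-serial-4.3": (43, 23, 10),
    "release-openshift-ocp-installer-e2e-azure-ovn-4.4": (14, 64, 11),
    "release-openshift-ocp-installer-e2e-azure-4.3": (45, 33, 7),
    "release-openshift-ocp-installer-e2e-azure-serial-4.4": (46, 33, 7),
}

for job, (runs, failed, match) in jobs.items():
    print(f"{job}: ~{matching_runs(runs, failed, match):.1f} matching runs")
```

So each job sees on the order of one to a few hits per two-week window: a low flake rate in relative terms, but a steady trickle in absolute terms.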
The installer already captures logs when bootstrapping fails.
Here RHCOS is failing to boot: on Azure, OS provisioning must complete before a machine is accepted as running, which is not the case on any other cloud. The installer has no way of knowing that the machine isn't running because of this special case. Personally, I feel we should try to keep everything similar across platforms, because this error condition is too difficult to handle correctly.
As for adding boot logs to the installer gather, the RHCOS team is free to work with the installer team to incorporate them into the bootstrap bundle.
https://issues.redhat.com/browse/CORS-1490 tracks having the installer treat an OS-provisioning timeout as a bootstrap failure. https://github.com/openshift/installer/pull/4020 is in flight with a proposal to add log gathering after (some?) Terraform errors.

*** Bug 1878009 has been marked as a duplicate of this bug. ***