Bug 1779375
| Summary: | Azure flake: OS Provisioning for VM ... did not finish in the allotted time | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | W. Trevor King <wking> |
| Component: | Installer | Assignee: | aos-install |
| Installer sub component: | openshift-installer | QA Contact: | Gaoyun Pei <gpei> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | adahiya, bbreard, bleanhar, dahernan, dustymabe, ffranz, imcleod, jligon, miabbott, nstielau, oarribas, pkanthal, pravesh, rkshirsa, walters |
| Version: | 4.3.0 | Keywords: | Reopened, UpcomingSprint |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-09 18:56:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
W. Trevor King
2019-12-03 20:40:36 UTC
We'll reopen this if it shows up more often.

Reopening. I understand that this is an account/provider issue and not really related to the OpenShift installer, but is there anything we can do from Red Hat?

How did you solve the issue? Facing it on an OCP 4.3 IPI Azure installation.

Setting to low priority as this is the first instance of the timeout being reported since the BZ was opened originally; targeting 4.6 as a possible fix target.

@wking do you recall any specifics about how this BZ was resolved for the Azure CI jobs?

I still think that this is just a "sometimes the provider is slow" issue, as described in comment 0. Colin closed this in comment 2 citing a low flake rate. I don't think that we actually did anything. I think we should just leave this around to track the fact that we know it exists, expect the flake rate to be low, expect this to be an Azure-side issue, and think we might be able to work around it (by increasing timeouts or some such) if it happens more often (and if for whatever reason we think a longer Azure-side delay would be reasonable). But that approach doesn't work as well now that folks are supposed to kick open bugs every sprint, so <shrug>, folks can do whatever they want with it. Also, why is everyone whispering in private comments? I don't see anything sensitive above.

A reinstallation from scratch works, so the RCA is unclear; longer timeouts would probably help.

I have tried in my local environment and I was not able to reproduce the issue anymore.

Hmm, are we pruning job logs? Checking [1], it seems like 507 and surrounds are gone and we start at 515? Anyhow, I don't think there was all that much in the original logs beyond the excerpt I quoted in comment 0. I dunno why Abhinav moved this from the installer to RHCOS; I also don't see what RHCOS would have to do with this. I still think this is the provider being slow, and the installer saying any of:

a. We will fail when the provider is that slow, because it's unacceptable and folks can delete and retry. WONTFIX
b. We could wait for longer when the provider is that slow, but it happens so rarely that we are unlikely to ever get around to that. DEFERRED
c. We could wait for longer when the provider is that slow; will get a PR up eventually. Until then... NEW.

Moving back to the installer based on that. If I'm just missing something and this is really an RHCOS thing, please explain why when kicking back over :)

[1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/?marker=logs%2frelease-openshift-ocp-installer-e2e-azure-4.3%2f500%2f

(In reply to W. Trevor King from comment #8)
> Hmm, are we pruning job logs? Checking [1], it seems like 507 and surrounds
> are gone and we start at 515? Anyhow, I don't think there was all that much
> in the original logs beyond the excerpt I quoted in comment 0. I dunno why
> Abhinav moved this from the installer to RHCOS; I also don't see what RHCOS
> would have to do with this.

Azure requires the OS to provide a signal when it has booted, and the machine is marked as provisioning until then. Since the control-plane machines' Ignition waits to get its config from the bootstrap host, any failure where the machine-config-server on the bootstrap host cannot serve the Ignition config manifests as this error.

> OS Provisioning for VM 'ci-op-gf3bnp64-cbd67-tpz55-master-2' did not finish in the allotted time

The error makes it clear that the OS failed to provision, so it's not an installer or Azure bug.

> I still think this is the provider being slow,
> and the installer saying any of:
>
> a. We will fail when the provider is that slow, because it's unacceptable
> and folks can delete and retry. WONTFIX
> b. We could wait for longer when the provider is that slow, but it happens
> so rarely that we are unlikely to ever get around to that. DEFERRED
> c. We could wait for longer when the provider is that slow; will get a PR up
> eventually. Until then... NEW.
>
> Moving back to the installer based on that. If I'm just missing something
> and this is really an RHCOS thing, please explain why when kicking back over
> :)
>
> [1]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.3/?marker=logs%2frelease-openshift-ocp-installer-e2e-azure-4.3%2f500%2f

Is there anywhere in the provisioning process where we could add a check to prove the problem is caused by the provider?

> Since the control-plane machines' Ignition waits to get its config from the bootstrap host, any failure where the machine-config-server on the bootstrap host cannot serve the Ignition config manifests as this error.

Ahh, that makes sense to me. CI search [2] turns up [3] and a few other recent examples. That job was testing 4.4.0-0.nightly-2020-05-30-022631 [4], and it died with:

```
time="2020-05-30T03:28:43Z" level=debug msg="module.master.azurerm_virtual_machine.master[1]: Still creating... [40m40s elapsed]"
time="2020-05-30T03:28:44Z" level=debug msg="2020/05/30 03:28:44 [DEBUG] module.master.azurerm_virtual_machine.master[1]: apply errored, but we're indicating that via the Error pointer rather than returning it: Code=\"OSProvisioningTimedOut\" Message=\"OS Provisioning for VM 'ci-op-yyghnnkj-761c8-7qk5z-master-1' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\\r\\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \\r\\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \""
time="2020-05-30T03:28:44Z" level=debug msg="2020/05/30 03:28:44 [ERROR] module.master: eval: *terraform.EvalApplyPost, err: Code=\"OSProvisioningTimedOut\" Message=\"OS Provisioning for VM 'ci-op-yyghnnkj-761c8-7qk5z-master-1' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\\r\\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \\r\\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \""
time="2020-05-30T03:28:44Z" level=debug msg="2020/05/30 03:28:44 [ERROR] module.master: eval: *terraform.EvalSequence, err: Code=\"OSProvisioningTimedOut\" Message=\"OS Provisioning for VM 'ci-op-yyghnnkj-761c8-7qk5z-master-1' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\\r\\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \\r\\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \""
time="2020-05-30T03:28:44Z" level=error
time="2020-05-30T03:28:44Z" level=error msg="Error: Code=\"OSProvisioningTimedOut\" Message=\"OS Provisioning for VM 'ci-op-yyghnnkj-761c8-7qk5z-master-1' did not finish in the allotted time. The VM may still finish provisioning successfully. Please check provisioning state later. Also, make sure the image has been properly prepared (generalized).\\r\\n * Instructions for Windows: https://azure.microsoft.com/documentation/articles/virtual-machines-windows-upload-image/ \\r\\n * Instructions for Linux: https://azure.microsoft.com/documentation/articles/virtual-machines-linux-capture-image/ \""
time="2020-05-30T03:28:44Z" level=error
time="2020-05-30T03:28:44Z" level=error msg=" on ../tmp/openshift-install-764857706/master/master.tf line 81, in resource \"azurerm_virtual_machine\" \"master\":"
time="2020-05-30T03:28:44Z" level=error msg=" 81: resource \"azurerm_virtual_machine\" \"master\" {"
```

But we don't have console logs for those machines [6] (which would be a CI-template fix), and the installer doesn't try to gather bootstrap logs when it dies this way in Terraform (which would be an installer fix). Absent one or both of those, it's not clear to me how RHCOS folks would root-cause this.

[1]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_machine-api-operator/570/pull-ci-openshift-machine-api-operator-master-e2e-azure/987
[2]: https://search.svc.ci.openshift.org/?search=OS+Provisioning+for+VM.*did+not+finish+in+the+allotted+time&maxAge=336h&type=bug%2Bjunit&name=azure
[3]: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1700
[4]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1700/build-log.txt
[5]: https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1700/artifacts/e2e-azure/installer/.openshift_install.log
[6]: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.4/1700/artifacts/e2e-azure/nodes/

I'd be supportive of cloning this bug to track the installer fixes to make this more debuggable. Abhinav, I'm curious if you think that's reasonable here.
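As an aside, the `OSProvisioningTimedOut` signature quoted above is easy to test for mechanically when triaging install logs by hand. A minimal sketch (the function name, sample string, and approach are illustrative only, not part of any installer or CI tooling):

```python
import re

# The Terraform error code seen in the log excerpt above. The installer log
# backslash-escapes the quotes around the code, so the backslashes are optional
# in the pattern to also match unescaped occurrences.
PATTERN = re.compile(r'Code=\\?"OSProvisioningTimedOut\\?"')

def hit_provisioning_timeout(log_text: str) -> bool:
    """Return True if the install-log text contains the Azure OS-provisioning timeout."""
    return bool(PATTERN.search(log_text))

# Example against one line of the excerpt above:
sample = (
    'time="2020-05-30T03:28:44Z" level=error msg="Error: Code=\\"OSProvisioningTimedOut\\" '
    'Message=\\"OS Provisioning for VM \'ci-op-yyghnnkj-761c8-7qk5z-master-1\' '
    'did not finish in the allotted time.\\""'
)
print(hit_provisioning_timeout(sample))  # True
```

This only distinguishes "provider reported a provisioning timeout" from other Terraform failures; it says nothing about why the OS never signaled readiness, which is exactly the gap the console-log and gather proposals above are meant to fill.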
So comment #9 shows that this is probably a provisioning problem on the OS side, but comment #11 explains that we don't have the necessary logs for RHCOS to debug it. There may be an OS-level problem here, but without the necessary logs we can't do much with this problem. I'm gonna keep this as a low-priority BZ for 4.6, hoping that we get an enhancement to the installer (re: comment #12) that can deliver us more logs.

I remember seeing this issue a number of times back when I was working on Azure UPI, and it would happen *a lot* more in 4.4 GA and the first z-streams of 4.4 than in 4.3. Surprisingly, I found the logs from an 'openshift-install gather' of one of these failures still lying here on my machine. :-) It's attached; hopefully it helps somehow.

From Fabiano's log-bundle:
* log-bundle-20200430122444/failed-units.txt
> * afterburn-sshkeys loaded failed failed Afterburn (SSH Keys)
Probably a red herring, but it would be nice if we could clean this up.
* log-bundle-20200430122444/control-plane/10.0.0.5,10.0.0.6,10.0.0.7/
This looks like you used comma-delimited IPs instead of:
$ openshift-install gather bootstrap --bootstrap whatever --master 10.0.0.5 --master 10.0.0.6 --master 10.0.0.7
Previous discussion in [1,2]. Upshot here is that we didn't attempt to gather from the actual control-plane machines.
* This is certainly suspicious:
$ grep -o 'Error executing MCO.*' bootstrap/journals/bootkube.log | uniq -c
809 Error executing MCO: unknown flag: --cluster-etcd-operator-image
Looks like the installer grew it in [3]. Possibly you were just pointing a 4.4 installer at a 4.3 payload [4]. Checking the CVO pod:
$ grep SOURCE_GIT bootstrap/pods/e9ad8851b53c.inspect
"SOURCE_GIT_URL=https://github.com/openshift/cluster-version-operator",
"SOURCE_GIT_COMMIT=ac81ab4e700e3d7d01cfeb8dae5a21aa28b6c3dc",
"SOURCE_GIT_TAG=v1.0.0-172-gac81ab4",
And then, in the CVO repo:
$ git --no-pager log --oneline origin/release-4.4 | grep ac81ab4
$ git --no-pager log --oneline origin/release-4.3 | grep ac81ab4
ac81ab4 Merge pull request #340 from jottofar/backport-signature-verify
So unless we also have that installer/release mismatch going on in CI, I don't think this log bundle helps :/.
[1]: https://github.com/openshift/release/pull/4734#issue-307109872
[2]: https://github.com/openshift/openshift-docs/pull/17196
[3]: https://github.com/openshift/installer/pull/2730
[4]: https://github.com/openshift/installer/pull/2730#issuecomment-561935476
> * afterburn-sshkeys loaded failed failed Afterburn (SSH Keys)

This is a red herring. All it means is that Afterburn did not find SSH keys in the Azure metadata. That's it. https://github.com/coreos/afterburn/blob/master/systemd/afterburn-sshkeys%40.service.in The failure is fine, IMO.

@wking how can we the Installer versions checked in CI? This is not looking promising that it's an RHCOS issue.

> All it means is that Afterburn did not find SSH keys in the Azure metadata.

If that's acceptable (and from your wording it seems like it is), can we adjust the unit to exit 0 and say "I have successfully loaded all zero keys you asked for" instead of failing? Otherwise it's up to folks with less context than the unit authors to determine whether the failure is serious or acceptable.

> how can we the Installer versions checked in CI?

I am having trouble parsing this. Can you rephrase?

Afterburn is already disabled in 4.6 since https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/972 landed. I'll backport that to at least 4.5 and probably 4.4. (er, more precisely afterburn-sshkeys@)

https://gitlab.cee.redhat.com/coreos/redhat-coreos/-/merge_requests/1046 merged in regard to the `afterburn-sshkeys@` unit.

Do we expect to get a usable log bundle without some legwork? I still think the comment 11 debugging proposals are both potential paths forward. [1] is precedent for gathering console logs; I'd expect someone familiar with Azure could create a similar step for Azure. And if we don't want to actually land code in the installer to log-bundle automatically, we could hack up an Azure step to manually run an installer gather on some/all install failures.
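For the suggestion above about having the unit not report failure when there are simply no keys: one low-tech way that does not even require an Afterburn code change would be a systemd drop-in that reclassifies the exit status. This is only a sketch, not what was actually shipped; the drop-in name, and the assumption that the unit exits with status 1 in the no-keys case, are both hypothetical:

```ini
# Hypothetical drop-in: /etc/systemd/system/afterburn-sshkeys@.service.d/tolerate-no-keys.conf
# Assumes (unverified) that the unit exits 1 when the platform metadata has no SSH keys.
[Service]
# Treat that exit status as success, so "zero keys found" no longer shows up
# in failed-units.txt and doesn't distract people triaging install failures.
SuccessExitStatus=1
```

The cleaner fix is of course inside Afterburn itself (exit 0 on "no keys"), which is effectively what disabling `afterburn-sshkeys@` in 4.6 sidesteps entirely.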
This is certainly still happening reasonably often in CI:

$ w3m -dump -cols 200 'https://search.ci.openshift.org/?search=OS+Provisioning+for+VM.*did+not+finish+in+the+allotted+time&maxAge=336h&type=junit&name=release-openshift-ocp-.*azure&groupBy=job' | grep 'failures match'
release-openshift-ocp-installer-e2e-azure-4.6 - 154 runs, 44% failed, 4% of failures match
release-openshift-ocp-installer-e2e-azure-serial-4.3 - 43 runs, 23% failed, 10% of failures match
release-openshift-ocp-installer-e2e-azure-ovn-4.4 - 14 runs, 64% failed, 11% of failures match
release-openshift-ocp-installer-e2e-azure-4.3 - 45 runs, 33% failed, 7% of failures match
release-openshift-ocp-installer-e2e-azure-serial-4.4 - 46 runs, 33% failed, 7% of failures match

[1]: https://github.com/openshift/release/tree/39557830d861800802fa78033edcb70783691274/ci-operator/step-registry/gather/aws-console

> The CoreOS team (including myself) has expertise around how RHCOS interacts with Azure at the hypervisor level.

The value of "hacking" something together for solving this bug is eclipsed by the value of having a true feature in the installer for capturing the logs (in Azure-speak, enabling boot diagnostics). I'm sending this over to the Installer to enable that feature in the installer, since doing that legwork will yield long-term results that will make debugging boot failures much easier for all teams. Once we have those logs, I am more than happy to take a look.
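For a rough sense of the absolute numbers behind the CI search percentages above, the figures can simply be multiplied out. A small sketch (the figures are copied from the search output; the helper function is illustrative, not part of the search tooling):

```python
# Estimate how many runs hit the OSProvisioningTimedOut flake, given
# "N runs, F% failed, M% of failures match" lines from the CI search output.
def matching_runs(runs: int, pct_failed: float, pct_match: float) -> float:
    """Approximate number of runs whose failure matched the search string."""
    return runs * (pct_failed / 100) * (pct_match / 100)

# (runs, % failed, % of failures matching), per job, over the 14-day window:
jobs = {
    "release-openshift-ocp-installer-e2e-azure-4.6": (154, 44, 4),
    "release-openshift-ocp-installer-e2e-azure-serial-4.3": (43, 23, 10),
    "release-openshift-ocp-installer-e2e-azure-ovn-4.4": (14, 64, 11),
    "release-openshift-ocp-installer-e2e-azure-4.3": (45, 33, 7),
    "release-openshift-ocp-installer-e2e-azure-serial-4.4": (46, 33, 7),
}

for job, (runs, failed, match) in jobs.items():
    print(f"{job}: ~{matching_runs(runs, failed, match):.1f} matching runs")
```

So each job sees on the order of one to a few hits per two-week window: a low flake rate in relative terms, but a steady trickle in absolute terms.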
The installer already captures logs when bootstrapping fails.
Here RHCOS is failing to boot: on Azure, OS provisioning must complete before a machine is accepted as running, which is not the case on any other cloud. The installer has no way of knowing that the machine isn't running because of this special case. Personally, I feel we should try to keep everything similar across platforms, because this error condition is too difficult to handle correctly.
As for adding boot logs to the installer gather, the RHCOS team is free to work with the installer team to incorporate them into the bootstrap bundle.
https://issues.redhat.com/browse/CORS-1490 tracks having the installer treat an OS-provisioning timeout as a bootstrap failure. https://github.com/openshift/installer/pull/4020 is in flight with a proposal to add log gathering after (some?) Terraform errors.

*** Bug 1878009 has been marked as a duplicate of this bug. ***