Bug 1853068 - Azure: Afterburn boot check-in requests fail after boot diagnostics is enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.6.0
Assignee: Luca BRUNO
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On:
Blocks: 1186913
 
Reported: 2020-07-01 20:20 UTC by Micah Abbott
Modified: 2020-10-27 16:12 UTC (History)
9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Manual changes to a VM during provisioning alter the incarnation number. Consequence: Afterburn ready-state reporting fails due to an incarnation number mismatch. Fix: Afterburn now fetches the goalstate just in time, using the fresh incarnation number when posting the ready-state. Result: Ready-state is correctly posted.
Clone Of:
Environment:
Last Closed: 2020-10-27 16:11:46 UTC
Target Upstream Version:
lucab: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2020:4196 0 None None None 2020-10-27 16:12:05 UTC

Description Micah Abbott 2020-07-01 20:20:11 UTC
Copied from https://issues.redhat.com/browse/RFE-1006?focusedCommentId=14177572&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14177572

```
From the log of the case.

    [ 20.418628] systemd[1]: Starting CoreOS Firstboot encryption of root device...
    [ 20.469573] coreos-cryptfs[944]: coreos-cryptfs: Fetching clevis config
    [ 20.537685] coreos-cryptfs[944]: coreos-cryptfs: Detected provided Clevis config
    [ 20.548567] coreos-cryptfs[944]: {
    [ 20.552368] coreos-cryptfs[944]: "t": 2,
    [ 20.564200] coreos-cryptfs[944]: "pins": {
    [ 20.569717] coreos-cryptfs[944]: "tang": [
    [ 20.578522] coreos-cryptfs[944]: {
    [ 20.583604] coreos-cryptfs[944]: "url": "http://tangsp01...."

    .....

    [ ***] (2 of 2) A start job is running for…root device (1min 16s / 30min 14s)
    [ 82.365317] coreos-cryptfs[944]: Progress: 27.7%, ETA 02:00, 804 MiB written, speed 17.4 MiB/s

The problem is that the afterburn job does not check whether cryptfs is still ongoing; it checks node availability, which fails.
How about adding an "After=cryptfs" ordering to afterburn, so that afterburn runs only after a successful cryptfs run?

    [ 85.812874] afterburn[935]: Jun 24 08:26:05.425 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #10
    [ 85.826176] afterburn[935]: Error: checking-in instance boot to cloud provider
    [ 85.854686] afterburn[935]: Caused by: timed out
    [ 85.868445] afterburn[935]: Caused by: POST failed: 410 Gone
    [FAILED] Failed to start Afterburn (Check In - from the initramfs).
    [ 85.875034] systemd[1]: rhcos-afterburn-checkin.service: Main process exited, code=exited, status=1/FAILURE
    [ 85.883975] systemd[1]: rhcos-afterburn-checkin.service: Failed with result 'exit-code'.

    Starting CoreOS Firstboot encryption of root device...
    [ 20.395839] systemd[1]: Started Ignition (disks).
    [ 20.412084] ignition[934]: Ignition finished successfully
    [ 20.418628] systemd[1]: Starting CoreOS Firstboot encryption of root device...
    [ 20.469573] coreos-cryptfs[944]: coreos-cryptfs: Fetching clevis config
```

Comment 1 Colin Walters 2020-07-01 21:23:08 UTC
I think they mean the other way around - we should have afterburn-checkin run early.  And it should do so now:

```
# Since we don't care about the actual success of the boot
# from the OS perspective, check in as soon as we are able.
After=network.target
```
And that's the sole ordering we have there. Oh, but the FIPS code does:

```
# We want to run to completion before check-in: if we immediately reboot for
# FIPS, it'd be more appropriate to say we're up on that second boot. Also on
# some platforms (like Azure), check-in "ejects" the user-data drive, so
# Ignition can't fetch the config again.
Before=rhcos-afterburn-checkin.service
```

And that's going to order it later... but it still looks like they *should* run concurrently.
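Putting the two quoted fragments together, the effective ordering looks roughly like this (a sketch assembled from the comments above, not copied verbatim from the shipped units):

```
# rhcos-afterburn-checkin.service (excerpt): only ordered after networking
[Unit]
After=network.target

# rhcos-fips.service (excerpt): runs to completion before check-in
[Unit]
Before=rhcos-afterburn-checkin.service
```

So check-in is ordered after rhcos-fips.service, but it has no ordering relation to coreos-encrypt.service at all, which is why the two can run concurrently.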

Comment 2 Aleksandar Lazic (RHAc) 2020-07-02 05:57:15 UTC
I assume this is the code for the Azure check-in: https://github.com/coreos/afterburn/blob/master/src/providers/azure/mod.rs#L501

AFAIK Azure shows the machine as ready only after a successful boot, which can only happen after encryption.

Which of the services is running during the "Starting CoreOS Firstboot encryption of root device" phase? I assume it is
https://github.com/coreos/afterburn/blob/master/systemd/afterburn-firstboot-checkin.service

Does "afterburn-firstboot-checkin.service" perform a check that the machine is up and running?

Comment 4 Luca BRUNO 2020-07-02 08:55:55 UTC
My understanding of the log snippet in this ticket is that we are observing the result of "coreos-encrypt.service" and "rhcos-afterburn-checkin.service" not having any explicit ordering between them (this does not seem problematic per se).

Both services are declared to run after "rhcos-fips.service". All the services are of type oneshot. As such, once "rhcos-fips.service" has successfully completed, both the above ones are free to run. Logs indeed confirm they are running in parallel.

Additionally Afterburn is trying to check in, but receiving a "410 Gone" from the cloud endpoint (I have never encountered this behavior on Azure before, so it's a worrisome red flag).
However, "rhcos-afterburn-checkin.service" does not have any "OnFailure" settings, so I wouldn't expect this failure alone to abort the boot in the initramfs (the log snippet seems to confirm this too, but it's truncated).

So a few overall questions to move this forward:
 * Are all the 10 attempts by Afterburn failing with 410? If so, are the same check-in failures observed even when installing without encryption, on the exact same environment?
 * Is there anything relevant that should be noted about this Azure instance (e.g. is it on an experimental region, Gov cloud, Stack or Stack Hub, etc)?
 * How is the boot process actually failing? From the logs, it doesn't seem to abort due to the Afterburn failure.
 * Is this node being installed in FIPS mode perhaps?
 * [@behoward] Is it ok to keep the two services starting in parallel? Or is there an expectation of relative ordering between them, in either direction?

Comment 5 Aleksandar Lazic (RHAc) 2020-07-02 10:04:11 UTC
> * Are all the 10 attempts by Afterburn failing with 410? 

That's all entries in the log about afterburn.

```
egrep after tang_clevis_nbdbe_coreos.log
[    **] (2 of 2) A start job is running for…n of root device (42s / 30min 14s)[   48.593890] afterburn[935]: Jun 24 08:25:28.206 WARN Failed to get fabric address from DHCP: timed out
[   48.629332] afterburn[935]: Jun 24 08:25:28.206 INFO Using fallback address
[   48.635527] afterburn[935]: Jun 24 08:25:28.219 INFO Fetching http://168.63.129.16/?comp=versions: Attempt #1
[   48.643519] afterburn[935]: Jun 24 08:25:28.224 INFO Fetch successful
[   48.648844] afterburn[935]: Jun 24 08:25:28.224 INFO Fetching http://168.63.129.16/machine/?comp=goalstate: Attempt #1
[   48.657571] afterburn[935]: Jun 24 08:25:28.228 INFO Fetch successful
[   48.663368] afterburn[935]: Jun 24 08:25:28.237 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #1
[   ***] (2 of 2) A start job is running for…n of root device (43s / 30min 14s)[   49.636538] afterburn[935]: Jun 24 08:25:29.249 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #2
[   51.655089] afterburn[935]: Jun 24 08:25:31.267 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #3
[     *] (2 of 2) A start job is running for…n of root device (49s / 30min 14s)[   55.674572] afterburn[935]: Jun 24 08:25:35.287 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #4
[ ***  ] (2 of 2) A start job is running for…n of root device (54s / 30min 14s)[   60.699585] afterburn[935]: Jun 24 08:25:40.312 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #5
[**    ] (1 of 2) A start job is running for…om the initramfs) (59s / no limit)[   65.721953] afterburn[935]: Jun 24 08:25:45.334 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #6
[   ***] (2 of 2) A start job is running for… root device (1min 4s / 30min 14s)[   70.735569] afterburn[935]: Jun 24 08:25:50.348 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #7
[   ***] (2 of 2) A start job is running for… root device (1min 9s / 30min 14s)[   75.762143] afterburn[935]: Jun 24 08:25:55.374 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #8
[***   ] (1 of 2) A start job is running for…e initramfs) (1min 14s / no limit)[   80.787438] afterburn[935]: Jun 24 08:26:00.400 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #9
[***   ] (1 of 2) A start job is running for…e initramfs) (1min 19s / no limit)[   85.812874] afterburn[935]: Jun 24 08:26:05.425 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #10
[   85.826176] afterburn[935]: Error: checking-in instance boot to cloud provider
[   85.854686] afterburn[935]: Caused by: timed out
[   85.868445] afterburn[935]: Caused by: POST failed: 410 Gone
[FAILED] Failed to start Afterburn (Check In - from the initramfs).
[   85.875034] systemd[1]: rhcos-afterburn-checkin.service: Main process exited, code=exited, status=1/FAILURE
[   85.883975] systemd[1]: rhcos-afterburn-checkin.service: Failed with result 'exit-code'.
See 'systemctl status rhcos-afterburn-checkin.service' for details.
```

> * Is there anything relevant that should be noted about this Azure instance (e.g. is it on an experimental region, Gov cloud, Stack or Stack Hub, etc)?
It's in Germany West Central.

> * How is the boot process actually failing? From the logs, it doesn't seem to abort due to the Afterburn failure.

```
[***   ] (1 of 2) A start job is running for…e initramfs) (1min 19s / no limit)
[   85.812874] afterburn[935]: Jun 24 08:26:05.425 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #10
[   85.826176] afterburn[935]: Error: checking-in instance boot to cloud provider
[   85.854686] afterburn[935]: Caused by: timed out
[   85.868445] afterburn[935]: Caused by: POST failed: 410 Gone
[FAILED] Failed to start Afterburn (Check In - from the initramfs).
[   85.875034] systemd[1]: rhcos-afterburn-checkin.service: Main process exited, code=exited, status=1/FAILURE
[   85.883975] systemd[1]: rhcos-afterburn-checkin.service: Failed with result 'exit-code'.
[   85.903753] systemd[1]: Failed to start Afterburn (Check In - from the initramfs).
See 'systemctl status rhcos-afterburn-checkin.service' for details.
[   85.914847] systemd[1]: Dependency failed for Ignition (files).
[DEPEND] Dependency failed for Ignition (files).
[   85.944587] systemd[1]: Dependency failed for Ignition Complete.
[DEPEND] Dependency failed for Ignition Complete.
[DEPEND] Dependency failed for Initrd Default Target.
[  OK  ] Stopped target Initrd Root Device.

> * Is this node being installed in FIPS mode perhaps?
How can I verify this?

The whole log is available in the customer case 02684784

Comment 6 Luca BRUNO 2020-07-10 14:04:25 UTC
> * Is this node being installed in FIPS mode perhaps?

Self-answering here: no.
Logs mention "rhcos-fips[917]: FIPS mode not requested".

Comment 7 Luca BRUNO 2020-07-10 14:25:30 UTC
I've been poking at Azure (Germany West Central) with an RHCOS 4.4.z image for a bit now, but I've been unable to trigger a "410 Gone" response from the cloud WireServer (168.63.129.16).

My initial guess was that RHCOS could be mistakenly trying to check in multiple times during the same boot, and getting back a 410 on the non-first attempts. However, this is not the case: the WireServer seems to be fine with multiple subsequent check-ins (I tested this scenario manually).

Digging through MS documentation pages, I see they have an FAQ [0] entry covering this case:

> 5. Why am I getting the error 500 Internal Server Error or 410 Resource Gone?
>
> Retry your request based on exponential back off system or other methods described in Transient fault handling. If the issue persists create a support issue in Azure portal for the VM.

The client logic in Afterburn does retry POST requests, with backoff. However, we consistently (10x) see the server turning them down.

[0] https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service#known-issues-and-faq
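The retry-with-backoff pattern that FAQ entry recommends can be sketched as follows. This is a minimal illustration, not Afterburn's actual implementation; `post_health` is a hypothetical stand-in for the WireServer health POST:

```python
import time

def post_with_backoff(post_health, max_attempts=10, base_delay=1.0, cap=30.0,
                      sleep=time.sleep):
    """Retry a POST with exponential backoff, as the Azure FAQ suggests.

    post_health() should return True on success and False on a retryable
    error (e.g. 410 Gone or 500). Returns the number of attempts it took,
    or raises RuntimeError once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        if post_health():
            return attempt + 1
        # Exponential backoff: 1s, 2s, 4s, ... capped at `cap` seconds.
        sleep(min(base_delay * (2 ** attempt), cap))
    raise RuntimeError("POST failed after %d attempts" % max_attempts)
```

With the capped exponential schedule above, 10 attempts span several minutes rather than ~1 minute, which would ride out a short WireServer outage; the trade-off is a slower failure when the endpoint is genuinely unreachable.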

Comment 9 Aleksandar Lazic (RHAc) 2020-07-10 21:41:16 UTC
From my point of view, Afterburn should not check machine availability until the root encryption is done.
Can you make the afterburn service run after a successful run of coreos-cryptfs?

Comment 10 Luca BRUNO 2020-07-15 12:06:11 UTC
> From my point of view, Afterburn should not check machine availability until the root encryption is done.
> Can you make the afterburn service run after a successful run of coreos-cryptfs?

Afterburn does not check for availability; it only reports (i.e. POSTs to Azure) that the machine booted into userspace.

The semantics of "booted into userspace" are quite fuzzy, and for OpenShift specifically the architects requested that it happen early in the initramfs, so that the openshift-installer can follow the rest of the process.

So in general no, we won't be pushing the check-in logic to happen later, due to requirements/use cases from higher up.

Additionally, moving this logic forward or backward in time doesn't change the fact that the cloud platform is returning an error on all attempts, which is the issue at hand.

Comment 11 Aleksandar Lazic (RHAc) 2020-07-15 12:19:24 UTC
@luca: You are right; sorry, I had misunderstood Afterburn.

Comment 12 Luca BRUNO 2020-07-15 12:45:47 UTC
One potentially interesting hint from https://github.com/Azure/azure-sdk-for-go/issues/1848: the WireServer routinely goes down for maintenance, with a p90 monthly downtime of up to 3 minutes (!).

Our retry logic goes through 10 failed attempts in a shorter timespan (~1 minute).

If this is indeed what we are seeing here (pure speculation), we are observing a 1-in-15000 chance for a newly created Azure instance to fail due to WireServer downtime (i.e. 0.007% based on their p90).

Assuming spaced recurring maintenance, retrying the same install at a luckier time of day should work.
On our side, we could try to lower the chances of failure by retrying a bit longer, at the expense of pathologically long install times during infrastructure maintenance.
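The back-of-the-envelope odds can be checked quickly. This assumes a 30-day month and that the ~1-minute retry window lands uniformly at random; it ignores partial overlaps with the downtime window, so it is an estimate, not an exact figure:

```python
# p90 WireServer downtime per month, per the linked azure-sdk-for-go issue.
downtime_min = 3.0
month_min = 30 * 24 * 60            # 43200 minutes in a 30-day month
p_failure = downtime_min / month_min
print("one in %d, i.e. %.4f%%" % (month_min / downtime_min, p_failure * 100))
# → one in 14400, i.e. 0.0069%
```

That lands close to the "1 in 15000" / 0.007% figure quoted above.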

Comment 14 Luca BRUNO 2020-07-16 15:16:13 UTC
Followup from the reporter: it looks like the 410 errors correlate with human intervention (changing the VM config and activating boot diagnostics while the VM had already started provisioning), and do not appear without such intervention.

I've noted down some future work from previous discoveries at https://github.com/coreos/afterburn/issues/459.

I'm now going back to see if I can reproduce the 410 too (by perturbing VM config while provisioning is ongoing), and also checking whether relaxing Afterburn timeouts helps going past the errors.

Comment 15 Luca BRUNO 2020-07-21 14:09:31 UTC
Playing a bit of a guessing game and trial-and-error, I think I've traced the root cause down to https://github.com/coreos/afterburn/issues/461. It will be fixed upstream and then make its way into a future RHCOS release.

From what I've seen this only triggers if some instance settings are changed in the timespan between machine creation and boot reporting. The plain workaround is not to tweak settings while provisioning is in flight (e.g. by enabling boot diagnostics directly when creating the instance).

Comment 16 Aleksandar Lazic (RHAc) 2020-07-21 23:20:44 UTC
The customer had activated the serial console during provisioning, which could cause the 410 response. The machine provisioned perfectly and the filesystems were encrypted when the serial console was not activated.

After the reboot, the machine stops because the Clevis client can't reach the Tang server: no network is brought up before the Clevis client runs.

Maybe this commit will solve the network issue in the initramfs?
https://github.com/coreos/fedora-coreos-tracker/issues/394#issuecomment-604598128

Comment 18 Radomir Ludva 2020-07-22 11:03:54 UTC
What's missing is the kernel argument "rd.neednet=1"; they have now applied it to the nodes via a MachineConfig.
Once the config is rendered, Tang/Clevis works as expected. The disk is encrypted and can be unlocked on a following reboot.

Can you please confirm if this is a supported solution and setup?
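For reference, such a MachineConfig could look like the following sketch (the object name and the role label are illustrative; apply the analogous object for whichever role needs Tang unlocking):

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-kargs-neednet
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - rd.neednet=1
```

This tells dracut to bring networking up in the initramfs, so the Clevis client can reach the Tang server before the root device is unlocked.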

Comment 19 Luca BRUNO 2020-08-10 08:39:27 UTC
The ready-state logic in Afterburn has been augmented in https://github.com/coreos/afterburn/pull/468 to take care of the increasing incarnation number.

It has been released as part of 4.5.0, see https://github.com/coreos/afterburn/releases/tag/v4.5.0.

The release has already been packaged and will be included in OCP 4.6 (starting with RHCOS 46.82.202008071340).
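The gist of the fix in coreos/afterburn#468, sketched in Python rather than Afterburn's actual Rust (`fetch_goalstate` and `post_ready` are hypothetical stand-ins for the WireServer calls): instead of reusing a goalstate fetched earlier in boot, the ready-state POST fetches a fresh goalstate, so the incarnation number it reports matches the one the server currently expects even if manual VM changes bumped it mid-provisioning:

```python
def check_in(fetch_goalstate, post_ready):
    """Post ready-state using a just-in-time goalstate.

    fetch_goalstate() returns a dict with at least an 'incarnation' key;
    post_ready(goalstate) reports the VM as ready using that goalstate's
    (fresh) incarnation number. Returns the incarnation that was posted.
    """
    # Fetching right before posting means any incarnation bump caused by
    # manual VM changes during provisioning is picked up here, avoiding
    # the 410 Gone mismatch seen in this bug.
    goalstate = fetch_goalstate()
    post_ready(goalstate)
    return goalstate["incarnation"]
```

With the old approach, a goalstate cached before the user enabled boot diagnostics would carry a stale incarnation, and the WireServer would reject the ready-state POST with 410 Gone.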

Comment 23 Colin Walters 2020-09-28 18:49:43 UTC
OK so basically we believe the initial diagnosis of a conflict between afterburn and rhcos-encrypt was wrong; the core confusion here is this:
https://bugzilla.redhat.com/show_bug.cgi?id=1853068#c18

Basically yes, you should add a MachineConfig to set `rd.neednet=1` when using Tang.

Now while investigating this, we discovered afterburn could fail when boot diagnostics is enabled after boot.

Since Tang is reported as working, let's keep this BZ solely about boot diagnostics.

So to verify this then, a manual test of:

- launch either a current 4.6 openshift cluster or just a single RHCOS node in Azure
- enable boot diagnostics
- reboot
- afterburn checkin should succeed

or so?

Comment 25 Micah Abbott 2020-10-08 18:51:59 UTC
I'm going to mark this Verified: SanityOnly against 4.6.0-rc.0

Using the steps from comment #23

- installed cluster
- checked worker node for evidence of check-in
- navigated Azure portal; enabled boot diagnostics on the same node
- rebooted worker node
- validated node came up successfully
- validated check-in was successful
- validated boot diagnostics are usable

Comment 27 errata-xmlrpc 2020-10-27 16:11:46 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196

