Copied from https://issues.redhat.com/browse/RFE-1006?focusedCommentId=14177572&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14177572

From the log of the case:

```
[ 20.395839] systemd[1]: Started Ignition (disks).
[ 20.412084] ignition[934]: Ignition finished successfully
[ 20.418628] systemd[1]: Starting CoreOS Firstboot encryption of root device...
[ 20.469573] coreos-cryptfs[944]: coreos-cryptfs: Fetching clevis config
[ 20.537685] coreos-cryptfs[944]: coreos-cryptfs: Detected provided Clevis config
[ 20.548567] coreos-cryptfs[944]: {
[ 20.552368] coreos-cryptfs[944]:   "t": 2,
[ 20.564200] coreos-cryptfs[944]:   "pins": {
[ 20.569717] coreos-cryptfs[944]:     "tang": [
[ 20.578522] coreos-cryptfs[944]:       {
[ 20.583604] coreos-cryptfs[944]:         "url": "http://tangsp01...."
.....
[ ***] (2 of 2) A start job is running for…root device (1min 16s / 30min 14s)
[ 82.365317] coreos-cryptfs[944]: Progress: 27.7%, ETA 02:00, 804 MiB written, speed 17.4 MiB/s
```

The problem is that the afterburn job does not check whether cryptfs is still ongoing; it checks the node availability, which fails. How about adding an `After=` on cryptfs to afterburn, so that afterburn runs only after a successful cryptfs run?

```
[ 85.812874] afterburn[935]: Jun 24 08:26:05.425 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #10
[ 85.826176] afterburn[935]: Error: checking-in instance boot to cloud provider
[FAILED] Failed to start Afterburn (Check In - from the initramfs).
[ 85.854686] afterburn[935]: Caused by: timed out
[ 85.868445] afterburn[935]: Caused by: POST failed: 410 Gone
[ 85.875034] systemd[1]: rhcos-afterburn-checkin.service: Main process exited, code=exited, status=1/FAILURE
[ 85.883975] systemd[1]: rhcos-afterburn-checkin.service: Failed with result 'exit-code'.
```
I think they mean the other way around - we should have afterburn-checkin run early. And it should do so now:

```
# Since we don't care about the actual success of the boot
# from the OS perspective, check in as soon as we are able.
After=network.target
```

And that's the sole ordering we have there. Oh, but the FIPS code does:

```
# We want to run to completion before check-in: if we immediately reboot for
# FIPS, it'd be more appropriate to say we're up on that second boot. Also on
# some platforms (like Azure), check-in "ejects" the user-data drive, so
# Ignition can't fetch the config again.
Before=rhcos-afterburn-checkin.service
```

And that's going to order it later...but it still looks like they *should* run concurrently.
I assume that this is the code for the Azure check-in: https://github.com/coreos/afterburn/blob/master/src/providers/azure/mod.rs#L501

AFAIK Azure only shows the machine in the ready state after a successful boot, which (with root encryption enabled) can only happen after the encryption is done. Which of the services is running at the "Starting CoreOS Firstboot encryption of root device" phase? I assume https://github.com/coreos/afterburn/blob/master/systemd/afterburn-firstboot-checkin.service. Does that unit perform a check that the machine is up and running?
My understanding of the log snippet in this ticket is that we are observing the result of "coreos-encrypt.service" and "rhcos-afterburn-checkin.service" not having any explicit ordering between them (this does not seem problematic per se). Both services are declared to run after "rhcos-fips.service", and all of them are of type oneshot. As such, once "rhcos-fips.service" has successfully completed, both of the above are free to run; the logs indeed confirm they are running in parallel.

Additionally, Afterburn is trying to check in but receiving a "410 Gone" from the cloud endpoint (I have never encountered this behavior on Azure before, so it's a worrisome red flag). However, "rhcos-afterburn-checkin.service" does not have any "OnFailure" settings, so I wouldn't expect this failure alone to abort the boot in the initramfs (the log snippet seems to confirm this too, but it's truncated).

So a few overall questions to move this forward:

* Are all the 10 attempts by Afterburn failing with 410? If so, are the same check-in failures observed even when installing without encryption, on the exact same environment?
* Is there anything relevant that should be noted about this Azure instance (e.g. is it in an experimental region, Gov cloud, Stack or Stack Hub, etc)?
* How is the boot process actually failing? From the logs, it doesn't seem to abort due to the Afterburn failure.
* Is this node being installed in FIPS mode perhaps?
* [@behoward] Is it ok to keep the two services starting in parallel? Or is there an expectation of relative ordering between them, in either direction?
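For reference, if a relative ordering between the two oneshot units were ever desired, it could be expressed with a systemd drop-in. This is purely an illustration of the mechanics (the drop-in path and filename are mine), not a recommendation:

```
# /etc/systemd/system/rhcos-afterburn-checkin.service.d/10-after-encrypt.conf
# Hypothetical drop-in: delay the check-in until root encryption has finished.
# Because both units are Type=oneshot, After= here means "after
# coreos-encrypt.service has run to completion", not merely "after it started".
[Unit]
After=coreos-encrypt.service
```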
> * Are all the 10 attempts by Afterburn failing with 410?

That's all the entries in the log about afterburn:

```
egrep after tang_clevis_nbdbe_coreos.log
[ **] (2 of 2) A start job is running for…n of root device (42s / 30min 14s)
[ 48.593890] afterburn[935]: Jun 24 08:25:28.206 WARN Failed to get fabric address from DHCP: timed out
[ 48.629332] afterburn[935]: Jun 24 08:25:28.206 INFO Using fallback address
[ 48.635527] afterburn[935]: Jun 24 08:25:28.219 INFO Fetching http://168.63.129.16/?comp=versions: Attempt #1
[ 48.643519] afterburn[935]: Jun 24 08:25:28.224 INFO Fetch successful
[ 48.648844] afterburn[935]: Jun 24 08:25:28.224 INFO Fetching http://168.63.129.16/machine/?comp=goalstate: Attempt #1
[ 48.657571] afterburn[935]: Jun 24 08:25:28.228 INFO Fetch successful
[ 48.663368] afterburn[935]: Jun 24 08:25:28.237 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #1
[ ***] (2 of 2) A start job is running for…n of root device (43s / 30min 14s)
[ 49.636538] afterburn[935]: Jun 24 08:25:29.249 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #2
[ 51.655089] afterburn[935]: Jun 24 08:25:31.267 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #3
[ *] (2 of 2) A start job is running for…n of root device (49s / 30min 14s)
[ 55.674572] afterburn[935]: Jun 24 08:25:35.287 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #4
[ *** ] (2 of 2) A start job is running for…n of root device (54s / 30min 14s)
[ 60.699585] afterburn[935]: Jun 24 08:25:40.312 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #5
[** ] (1 of 2) A start job is running for…om the initramfs) (59s / no limit)
[ 65.721953] afterburn[935]: Jun 24 08:25:45.334 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #6
[ ***] (2 of 2) A start job is running for… root device (1min 4s / 30min 14s)
[ 70.735569] afterburn[935]: Jun 24 08:25:50.348 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #7
[ ***] (2 of 2) A start job is running for… root device (1min 9s / 30min 14s)
[ 75.762143] afterburn[935]: Jun 24 08:25:55.374 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #8
[*** ] (1 of 2) A start job is running for…e initramfs) (1min 14s / no limit)
[ 80.787438] afterburn[935]: Jun 24 08:26:00.400 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #9
[*** ] (1 of 2) A start job is running for…e initramfs) (1min 19s / no limit)
[ 85.812874] afterburn[935]: Jun 24 08:26:05.425 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #10
[ 85.826176] afterburn[935]: Error: checking-in instance boot to cloud provider
[FAILED] Failed to start Afterburn (Check In - from the initramfs).
[ 85.854686] afterburn[935]: Caused by: timed out
[ 85.868445] afterburn[935]: Caused by: POST failed: 410 Gone
[ 85.875034] systemd[1]: rhcos-afterburn-checkin.service: Main process exited, code=exited, status=1/FAILURE
[ 85.883975] systemd[1]: rhcos-afterburn-checkin.service: Failed with result 'exit-code'.
See 'systemctl status rhcos-afterburn-checkin.service' for details.
```

> * Is there anything relevant that should be noted about this Azure instance (e.g. is it on an experimental region, Gov cloud, Stack or Stack Hub, etc)?

It's in Germany West Central.

> * How is the boot process actually failing? From the logs, it doesn't seem to abort due to the Afterburn failure.
```
[*** ] (1 of 2) A start job is running for…e initramfs) (1min 19s / no limit)
[ 85.812874] afterburn[935]: Jun 24 08:26:05.425 INFO Posting http://168.63.129.16/machine/?comp=health: Attempt #10
[ 85.826176] afterburn[935]: Error: checking-in instance boot to cloud provider
[FAILED] Failed to start Afterburn (Check In - from the initramfs).
[ 85.854686] afterburn[935]: Caused by: timed out
[ 85.868445] afterburn[935]: Caused by: POST failed: 410 Gone
[ 85.875034] systemd[1]: rhcos-afterburn-checkin.service: Main process exited, code=exited, status=1/FAILURE
[ 85.883975] systemd[1]: rhcos-afterburn-checkin.service: Failed with result 'exit-code'.
[ 85.903753] systemd[1]: Failed to start Afterburn (Check In - from the initramfs).
See 'systemctl status rhcos-afterburn-checkin.service' for details.
[ 85.914847] systemd[1]: Dependency failed for Ignition (files).
[DEPEND] Dependency failed for Ignition (files).
[ 85.944587] systemd[1]: Dependency failed for Ignition Complete.
[DEPEND] Dependency failed for Ignition Complete.
[DEPEND] Dependency failed for Initrd Default Target.
[  OK  ] Stopped target Initrd Root Device.
```

> * Is this node being installed in FIPS mode perhaps?

How can I verify this? The whole log is available in the customer case 02684784.
> * Is this node being installed in FIPS mode perhaps? Self-answering here: no. Logs mention "rhcos-fips[917]: FIPS mode not requested".
I've been poking at Azure (Germany West Central) with an RHCOS 4.4.z image for a bit now, but I've been unable to trigger a "410 Gone" response from the cloud WireServer (168.63.129.16).

My initial guess was that RHCOS could be mistakenly trying to check in multiple times during the same boot, and getting back a 410 on the non-first attempts. However this is not the case, as the WireServer seems to be fine with multiple subsequent check-ins (I tested this scenario manually).

Digging through MS documentation pages, I see they have an FAQ [0] entry covering this case:

> 5. Why am I getting the error 500 Internal Server Error or 410 Resource Gone?
>
> Retry your request based on exponential back off system or other methods described in Transient fault handling. If the issue persists create a support issue in Azure portal for the VM.

The client logic in Afterburn does retry POST requests, with backoff. However, we consistently (10x) see the server turning them down.

[0] https://docs.microsoft.com/en-us/azure/virtual-machines/windows/instance-metadata-service#known-issues-and-faq
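For reference, the retry-with-backoff pattern mentioned here can be sketched as below. This is a generic illustration (the function name and parameters are mine), not Afterburn's actual implementation:

```rust
use std::thread;
use std::time::Duration;

/// Retry a fallible operation up to `max_attempts` times, sleeping between
/// attempts and doubling the delay each time (exponential backoff).
/// Returns the first success, or the error from the final attempt.
fn retry_with_backoff<T, E>(
    max_attempts: u32,
    initial_delay: Duration,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut delay = initial_delay;
    for attempt in 1..=max_attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) if attempt == max_attempts => return Err(e),
            Err(_) => {
                thread::sleep(delay);
                delay *= 2; // exponential backoff
            }
        }
    }
    unreachable!("the loop always returns within max_attempts iterations")
}

fn main() {
    // Simulate an endpoint that fails twice with "410 Gone", then succeeds.
    let mut calls = 0;
    let result = retry_with_backoff(10, Duration::from_millis(1), || {
        calls += 1;
        if calls < 3 { Err("410 Gone") } else { Ok("check-in accepted") }
    });
    assert_eq!(result, Ok("check-in accepted"));
    assert_eq!(calls, 3);
}
```

The failure mode in this ticket is the case where every attempt errors out: the final `Err` (here, the 410) is what bubbles up and fails the unit.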
Afterburn should not check the machine availability until the root encryption is done, from my point of view. Can you make the afterburn service run only after a successful run of coreos-cryptfs?
> Afterburn should not check the machine availability until the root encryption is done, from my point of view.
> Can you tell the afterburn service to run after successful run of coreos-cryptfs?

Afterburn does not check for availability; it only reports (i.e. POSTs to Azure) that the machine booted into userspace. The semantics of what "booted into userspace" means are quite fuzzy, and for OpenShift specifically the architects requested it to be done early in the initramfs, so that the openshift-installer can follow the rest of the process. So in general no, we won't be pushing the check-in logic to happen later on, due to requirements/use cases from higher up.

Additionally, moving this logic forward or backward in time doesn't change the fact that the cloud platform is returning an error on all attempts, which is the issue at hand.
@luca You are right - sorry, I had misunderstood afterburn.
One potentially interesting hint from https://github.com/Azure/azure-sdk-for-go/issues/1848: the WireServer routinely goes down for maintenance, with a p90 monthly downtime of up to 3 minutes (!). Our retry logic goes through its 10 failed attempts in a shorter timespan (~1 minute).

If this is indeed what we are seeing here (pure speculation), we are observing a roughly 1-in-15000 chance for a newly created Azure instance to fail due to WireServer downtime (i.e. ~0.007%, based on their p90). Assuming spaced recurring maintenance, retrying the same install at a luckier time of day should work. On our side, we could try to lower the chances of failure by retrying for a bit longer, at the expense of pathologically long install times during infrastructure maintenance.
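A quick back-of-the-envelope check of the estimate above, assuming the p90 figure of 3 minutes of downtime per 30-day month and that the ~1 minute retry window must land entirely inside the downtime to fail:

```rust
fn main() {
    // Assumption from the linked issue: up to 3 minutes of WireServer
    // downtime per month (p90).
    let downtime_min = 3.0_f64;
    let month_min = 30.0 * 24.0 * 60.0; // 43200 minutes in a 30-day month

    // Fraction of the month during which a boot's check-in window could
    // fall entirely into the maintenance window.
    let failure_prob = downtime_min / month_min;

    println!("probability: {:.4}%", failure_prob * 100.0);
    println!("roughly one in {:.0}", 1.0 / failure_prob);
}
```

This works out to about 0.0069%, i.e. one in ~14400, consistent with the "1 out of 15000 / 0.007%" rounding in the comment.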
Followup from the reporter: it looks like 410 errors correlate with human intervention (changing VM config and activating Boot-Diagnostics while the VM already started provisioning), and do not appear without such intervention. I've noted down some future work from previous discoveries at https://github.com/coreos/afterburn/issues/459. I'm now going back to see if I can reproduce the 410 too (by perturbing VM config while provisioning is ongoing), and also checking whether relaxing Afterburn timeouts helps going past the errors.
Playing a bit of a guessing game and trial-and-error, I think I've traced the root cause of this down to https://github.com/coreos/afterburn/issues/461. It will be fixed upstream and then make its way into a future RHCOS release.

From what I've seen, this only triggers if some instance settings are changed in the timespan between machine creation and boot reporting. The plain workaround is not to tweak settings while provisioning is in flight (e.g. by enabling boot diagnostics directly when creating the instance).
The customer had activated the serial console at provisioning time, which could cause the 410 response. The machine was perfectly provisioned and the filesystems were encrypted when the serial console was not activated.

After the reboot, the machine stops because the clevis client can't reach the tang server: no network is brought up before the clevis client runs. Maybe this commit will solve the network issue in the initramfs? https://github.com/coreos/fedora-coreos-tracker/issues/394#issuecomment-604598128
What's missing is the kernel argument "rd.neednet=1"; they have now also applied this to the nodes via a MachineConfig. Once the config is rendered, Tang/Clevis works as expected: the disk is encrypted and can be unlocked on a subsequent reboot. Can you please confirm whether this is a supported solution and setup?
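For illustration, such a MachineConfig might look like the sketch below (the object name and role label are examples; `kernelArguments` is the relevant field):

```
# Example MachineConfig adding rd.neednet=1 to worker nodes, so that
# networking is brought up in the initramfs before Clevis tries to reach
# the Tang server. The metadata name and role label are illustrative.
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-worker-rd-neednet
  labels:
    machineconfiguration.openshift.io/role: worker
spec:
  kernelArguments:
    - rd.neednet=1
```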
The ready-state logic in Afterburn has been augmented in https://github.com/coreos/afterburn/pull/468 to take care of the increasing incarnation number. It has been released as part of 4.5.0, see https://github.com/coreos/afterburn/releases/tag/v4.5.0. The release has already been packaged and will be included in OCP 4.6 (starting with RHCOS 46.82.202008071340).
OK so basically we believe the initial diagnosis of a conflict between afterburn and rhcos-encrypt was wrong; the core confusion here is this: https://bugzilla.redhat.com/show_bug.cgi?id=1853068#c18 Basically yes, you should add a MachineConfig to set `rd.neednet=1` when using Tang. Now while investigating this, we discovered afterburn could fail when boot diagnostics is enabled after boot. Since Tang is reported as working, let's keep this BZ to solely be about boot diagnostics. So to verify this then, a manual test of: - launch either a current 4.6 openshift cluster or just a single RHCOS node in Azure - enable boot diagnostics - reboot - afterburn checkin should succeed or so?
I'm going to mark this Verified: SanityOnly against 4.6.0-rc.0, using the steps from comment #23:

- installed cluster
- checked worker node for evidence of check-in
- navigated the Azure portal; enabled boot diagnostics on the same node
- rebooted the worker node
- validated the node comes up successfully
- validated the check-in was successful
- validated boot diagnostics are usable
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196