Bug 2037626

Summary:	Unable to fetch Ignition file when scaling up RHEL worker nodes on a cluster with Tang disk encryption enabled
Product:	OpenShift Container Platform	Reporter:	jima
Component:	Installer	Assignee:	Brent Barbachem <bbarbach>
Installer sub component:	openshift-ansible	QA Contact:	Gaoyun Pei <gpei>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	high	CC:	aos-bugs, bbarbach, jerzhang, mkrejci, padillon, zzlotnik
Version:	4.9
Target Milestone:	---
Target Release:	4.11.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2022-08-10 10:41:16 UTC	Type:	Bug

Description jima 2022-01-06 07:15:27 UTC

Version:
openshift-ansible-4.9.0-202109101042.p0.git.4d833d3.assembly.stream.el7.noarch.rpm


Platform: vsphere

Please specify:
Cluster with Tang disk encryption enabled

What happened?
Installed a cluster with Tang disk encryption enabled, then scaled up a RHEL worker node. ansible-playbook failed at the task below with the error "500 Internal Server Error":
TASK [openshift_node : Fetch bootstrap ignition file locally] ******************
Thursday 06 January 2022  11:19:02 +0800 (0:00:00.920)       0:05:06.235 ****** 
FAILED - RETRYING: Fetch bootstrap ignition file locally (60 retries left).
...
FAILED - RETRYING: Fetch bootstrap ignition file locally (1 retries left).
fatal: [172.31.249.68]: FAILED! => {"attempts": 60, "changed": false, "connection": "close", "content": "", "content_length": "0", "date": "Thu, 06 Jan 2022 03:30:47 GMT", "elapsed": 0, "msg": "Status code was 500 and not [200]: HTTP Error 500: Internal Server Error", "path": "/tmp/ansible.eamoe2ns/bootstrap.ign", "redirected": false, "status": 500, "url": "https://api-int.jima0106a.qe.devcluster.openshift.com:22623/config/worker"}

We found that after adding an Accept header to the ansible.builtin.uri module call in the failed task, the playbook completes successfully.

- name: Fetch bootstrap ignition file locally
  uri:
    url: "{{ openshift_node_bootstrap_endpoint }}"
    dest: "{{ temp_dir.path }}/bootstrap.ign"
    validate_certs: false
    headers:
      Accept: "application/vnd.coreos.ignition+json; version=3.2.0"  # newly added header
    http_agent: "Ignition/0.35.0"
  delay: 10
  retries: 60
  register: bootstrap_ignition
  until:
  - bootstrap_ignition.status is defined
  - bootstrap_ignition.status == 200

We also tried curl on the RHEL worker machine and hit the same issue: without the Accept header the request returns 500, and with it the request returns 200.

[root@jima0106a-d6b8t-rhel-0 source]# curl -kI https://api-int.jima0106a.qe.devcluster.openshift.com:22623/config/worker -vvvv 
*   Trying 172.31.248.175...
* TCP_NODELAY set
* Connected to api-int.jima0106a.qe.devcluster.openshift.com (172.31.248.175) port 22623 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=system:machine-config-server
*  start date: Jan  6 02:20:07 2022 GMT
*  expire date: Jan  4 02:20:07 2032 GMT
*  issuer: OU=openshift; CN=root-ca
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> HEAD /config/worker HTTP/1.1
> Host: api-int.jima0106a.qe.devcluster.openshift.com:22623
> User-Agent: curl/7.61.1
> Accept: */*
> 
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/1.1 500 Internal Server Error
HTTP/1.1 500 Internal Server Error
< Content-Length: 0
Content-Length: 0
< Date: Thu, 06 Jan 2022 05:50:29 GMT
Date: Thu, 06 Jan 2022 05:50:29 GMT

< 
* Connection #0 to host api-int.jima0106a.qe.devcluster.openshift.com left intact

[root@jima0106a-d6b8t-rhel-0 source]# curl -H "Accept: application/vnd.coreos.ignition+json; version=3.2.0" -kI https://api-int.jima0106a.qe.devcluster.openshift.com:22623/config/worker -vvvv
*   Trying 172.31.248.175...
* TCP_NODELAY set
* Connected to api-int.jima0106a.qe.devcluster.openshift.com (172.31.248.175) port 22623 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: /etc/pki/tls/certs/ca-bundle.crt
  CApath: none
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, [no content] (0):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server accepted to use http/1.1
* Server certificate:
*  subject: CN=system:machine-config-server
*  start date: Jan  6 02:20:07 2022 GMT
*  expire date: Jan  4 02:20:07 2032 GMT
*  issuer: OU=openshift; CN=root-ca
*  SSL certificate verify result: unable to get local issuer certificate (20), continuing anyway.
* TLSv1.3 (OUT), TLS app data, [no content] (0):
> HEAD /config/worker HTTP/1.1
> Host: api-int.jima0106a.qe.devcluster.openshift.com:22623
> User-Agent: curl/7.61.1
> Accept: application/vnd.coreos.ignition+json; version=3.2.0
> 
* TLSv1.3 (IN), TLS handshake, [no content] (0):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS app data, [no content] (0):
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Content-Length: 316627
Content-Length: 316627
< Content-Type: application/json
Content-Type: application/json
< Date: Thu, 06 Jan 2022 05:51:09 GMT
Date: Thu, 06 Jan 2022 05:51:09 GMT

< 
* Connection #0 to host api-int.jima0106a.qe.devcluster.openshift.com left intact
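
The two curl probes above can also be scripted. The following is a minimal sketch, not part of the product code: the function names `build_request` and `probe` are hypothetical, and the user agent string is copied from the playbook task. Note that urllib sends no Accept header at all when none is given, whereas curl sends `Accept: */*`.

```python
import ssl
import urllib.error
import urllib.request

IGNITION_MEDIA_TYPE = "application/vnd.coreos.ignition+json"


def build_request(url, spec_version=None):
    """Build a HEAD request; add the Ignition Accept header only if a version is given."""
    # Same agent string the playbook task sends via http_agent.
    headers = {"User-Agent": "Ignition/0.35.0"}
    if spec_version is not None:
        headers["Accept"] = f"{IGNITION_MEDIA_TYPE}; version={spec_version}"
    return urllib.request.Request(url, headers=headers, method="HEAD")


def probe(url, spec_version=None):
    """Return the HTTP status code, skipping certificate checks like `curl -k`."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with urllib.request.urlopen(build_request(url, spec_version), context=ctx) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is what we want to compare.
        return err.code
```

Calling `probe(url)` and `probe(url, "3.2.0")` against the `/config/worker` endpoint would be expected to mirror the 500 and 200 responses shown in the transcripts above.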

Attached the logs of running ansible-playbook with -vvv.

What did you expect to happen?
Scaling up a RHEL worker machine should succeed on a cluster with Tang disk encryption enabled.

How to reproduce it (as minimally and precisely as possible)?
On a cluster with Tang disk encryption enabled, scaling up RHEL nodes or Windows nodes always fails.

Anything else we need to know?
The issue also happens when scaling up Windows nodes.

Comment 2 Patrick Dillon 2022-01-13 19:15:05 UTC

This looks like an error in the Machine Config Server (it returns a 500 error when queried for bootstrap.ign). Can you attach a must-gather log bundle or the Machine Config Server logs?

Comment 3 Matthew Staebler 2022-01-13 21:24:08 UTC

Tang disk encryption is only supported starting in ignition config version 3.2.0. The request coming from the ansible playbook is not requesting an explicit ignition config version. The default served by the machine-config-server is something older than version 3.2.0, so the machine-config-server is balking at creating the ignition config. It seems to me that the machine-config-server should serve an ignition config using the necessary config version when the request is not explicitly requesting a config version.

As such, this seems like a machine-config-server bug to me. Feel free to send it back if you feel that the requester should instead need to know the details of what ignition config version is required.

Comment 6 jima 2022-01-17 01:50:47 UTC

Hi Zack,

Sorry, it seems I didn't describe the issue clearly in the Description.

In the product code, the task "Fetch bootstrap ignition file locally" is defined as below. No Accept header is defined, and it hits the error in Comment 4.

- name: Fetch bootstrap ignition file locally
  uri:
    url: "{{ openshift_node_bootstrap_endpoint }}"
    dest: "{{ temp_dir.path }}/bootstrap.ign"
    validate_certs: false
    http_agent: "Ignition/0.35.0"
  delay: 10
  retries: 60
  register: bootstrap_ignition
  until:
  - bootstrap_ignition.status is defined
  - bootstrap_ignition.status == 200

After adding the Accept header as below, it works.

- name: Fetch bootstrap ignition file locally
  uri:
    url: "{{ openshift_node_bootstrap_endpoint }}"
    dest: "{{ temp_dir.path }}/bootstrap.ign"
    validate_certs: false
    headers:
      Accept: application/vnd.coreos.ignition+json; version=3.2.0 
    http_agent: "Ignition/0.35.0"
  delay: 10
  retries: 60
  register: bootstrap_ignition
  until:
  - bootstrap_ignition.status is defined
  - bootstrap_ignition.status == 200

The Ansible debug log is attached in comment 1; please check.

Comment 7 Yu Qi Zhang 2022-01-18 01:26:59 UTC

So a few things here:

1. The expected behaviour (MCO serving spec 2) is in place for historical reasons to ease transition to spec 3. We have opened a new card for changing the default: https://issues.redhat.com/browse/MCO-154

2. I think generally, we would like anything calling Ignition/MCS directly to provide the necessary Accept header. For one thing, this can help prevent future bugs where a version bump brings in unexpected behaviour while the caller is not ready to switch to the new version. (I guess a counterpoint is that, without specifying the version, the ansible repo will not need updates when the MCO changes spec versions, which may or may not be ideal. In any case, all versions should always be supported. In this case, Tang only works on 3.2+, so a system that still uses spec 2 would not be able to parse it.)

I am going to send back to the installer team for consideration of adding that header to the ansible playbooks to fix this for existing versions, and the MCO will work on changing the default for future releases.

Comment 8 Matthew Staebler 2022-01-18 02:32:52 UTC

For (2), it is acceptable to me that your preference is that the caller provide an accept header. But, in the absence of that, as a consumer of your API, I take on the risk of receiving whichever version you choose to send to me. What I do not expect is that, when I tell your API that I do not care which version I get, your server responds that it cannot give me a response because the server cannot generate the version that the server itself chose to use. I expect the server to generate the config using whichever version that it has to use. If that is a version that the client cannot ultimately use, then it is the responsibility of the client to look at the version in the response and determine that.

Nevertheless, we will look into making the ansible more robust, although it is not clear to me yet how.

@jerzhang Can you remind me which versions of ignition config are supported in which versions of OpenShift?

Comment 9 Matthew Staebler 2022-01-18 02:34:49 UTC

I am marking this as a non-blocker since there is a workaround whereby the user running the ansible playbook edits the playbook to pass accept headers when requesting the ignition config from the MCO.

Comment 10 Matthew Staebler 2022-01-18 10:43:24 UTC

@jerzhang What is the behavior of the MCS when the accept headers include both v2.2 and v3.2? Would the MCS serve v3.2 (1) always, (2) only when v3.2 is required, or (3) still try to serve v2.2 and fail?

Comment 11 Yu Qi Zhang 2022-01-18 17:42:01 UTC

> I take on the risk of receiving whichever version you choose to send to me. What I do not expect is that, when I tell your API that I do not care which version I get, your server responds that it cannot give me a response because the server cannot generate the version that the server itself chose to use. I expect the server to generate the config using whichever version that it has to use. If that is a version that the client cannot ultimately use, then it is the responsibility of the client to look at the version in the response and determine that.

That is also an improvement we can look to make: if the spec contains fields that are not in prior versions (in this case, Tang encryption), serve the latest.

> Can you remind me which versions of ignition config are supported in which versions of OpenShift?

OCP Version 4.1->4.4 supports 2.2
OCP Version 4.5 supports 2.2, 3.0
OCP Version 4.6 supports 2.2, 3.0, 3.1
OCP Version 4.7+ supports 2.2, 3.0, 3.1, 3.2 (3.2 added Tang support)

For details, please see section "1. Ignition (spec) version to OCP version" of https://docs.google.com/document/d/1HfGU-kZdogPb2EBnEVWU1ZLLsX05vkBhwI4JrQxUtkE/edit#
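
For anyone who wants to use the table above from a script, here is a small sketch that encodes it as a lookup. The function names are hypothetical, the data is only what is listed in this comment, and treating every minor version from 4.7 onward the same way is an assumption based on the "4.7+" row.

```python
def supported_specs(ocp_minor):
    """Ignition spec versions served for OCP 4.<ocp_minor>, oldest first."""
    if ocp_minor < 1:
        raise ValueError("the table starts at OCP 4.1")
    if ocp_minor <= 4:
        return ["2.2"]
    if ocp_minor == 5:
        return ["2.2", "3.0"]
    if ocp_minor == 6:
        return ["2.2", "3.0", "3.1"]
    # Assumption: "4.7+" applies unchanged to all later minor versions.
    return ["2.2", "3.0", "3.1", "3.2"]


def supports_tang(ocp_minor):
    """Tang encryption needs spec 3.2, which the table lists from 4.7 onward."""
    return "3.2" in supported_specs(ocp_minor)


def newest_spec(ocp_minor):
    """Newest spec version available for the given OCP minor version."""
    return supported_specs(ocp_minor)[-1]
```

For the 4.9 cluster in this report, `newest_spec(9)` gives `"3.2"`, which is the version the workaround header asks for.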

> What is the behavior of the MCS when the accept headers include both v2.2 and v3.2? Would the MCS serve v3.2 (1) always, (2) only when v3.2 is required, or (3) still try to serve v2.2 and fail?

I think we just match ignition behaviour on this. We don't have that kind of nuance. If you for some reason provide both, it just takes the first section it sees that has "vnd.coreos.ignition+json" and serves it, meaning that if your header lists 2.2 first, it will serve 2.2, and 3.2 if it's first.
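
To make the "first matching section wins" behaviour concrete, here is a rough model of it. This is not the MCS source code; the function name `requested_spec` is hypothetical, and ignoring q-values and any other Accept parameters is an assumption made to keep the sketch short.

```python
import re

IGNITION_MEDIA_TYPE = "application/vnd.coreos.ignition+json"


def requested_spec(accept_header):
    """Return the version of the first Ignition media type in the header, or None."""
    for part in accept_header.split(","):
        media_type, _, params = part.strip().partition(";")
        if media_type.strip() != IGNITION_MEDIA_TYPE:
            continue
        # The first Ignition entry decides; later entries are never considered.
        match = re.search(r"version\s*=\s*([0-9.]+)", params)
        return match.group(1) if match else None
    return None
```

Under this model, a header that lists 3.2.0 before 2.2.0 resolves to 3.2.0 and the reverse order resolves to 2.2.0, with no fallback to the second entry.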

Comment 12 Matthew Staebler 2022-01-18 19:01:02 UTC

> I think we just match ignition behaviour on this. We don't have that kind of nuance. If you for some reason provide both, it just takes the first section it sees that has "vnd.coreos.ignition+json" and serves it, meaning that if your header lists 2.2 first, it will serve 2.2, and 3.2 if it's first.

@jerzhang If we list the versions in decreasing order, will the MCS skip versions that it does not recognize? Or will the MCS balk on unrecognized versions?

In other words, can we add the accept header "application/vnd.coreos.ignition+json;version=3.2.0, application/vnd.coreos.ignition+json; version=2.2.0" and let MCS pick 3.2.0 on OpenShift 4.7+ and 2.2.0 on earlier versions?

Comment 13 Yu Qi Zhang 2022-01-19 18:53:12 UTC

This wouldn't work. In 4.6, for example, this would error because only 3.1/2.2 is supported, and any other headers would cause an error, I am pretty sure.

Comment 14 Scott Dodson 2022-01-19 19:20:46 UTC

MCS has to serve up archaic Ignition because we don't lifecycle RHCOS boot media. However, openshift-ansible always runs an MCD that matches the cluster version[1], so we know that it should always support the latest Ignition spec available in that version of MCS and MCD. How about hardcoding the version for now, but extending the MCD to accept an argument and emit its Ignition spec version, so that we can then request the correct version without updating code in the future?

[1] https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node/tasks/apply_machine_config.yml#L40-L51

Comment 15 Patrick Dillon 2022-03-22 17:43:50 UTC

The installer team can resolve this bz by implementing the solution in comment 14: hardcode the value now and await future work to determine the appropriate Ignition version programmatically.

Comment 21 errata-xmlrpc 2022-08-10 10:41:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069