1899289 – [4.6.z] Need to set GODEBUG=x509ignoreCN=0 in initrd

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	RHCOS
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	high
Severity:	medium
Target Milestone:	---
Target Release:	4.6.z
Assignee:	Nikita Dubrovskii (IBM)
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:	non-multi-arch, bootimage
Depends On:	1886134
Blocks:	1899176

Reported:	2020-11-18 20:06 UTC by Micah Abbott
Modified:	2020-11-30 16:46 UTC
CC List:	15 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:	1886134
Environment:
Last Closed:	2020-11-30 16:46:09 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:5115	0	None	None	None	2020-11-30 16:46:29 UTC

Description Micah Abbott 2020-11-18 20:06:39 UTC

+++ This bug was initially created as a clone of Bug #1886134 +++

This bug was initially created as a copy of Bug #1882191

I am copying this bug because: 

The root cause of this bug is that they had a registry with a certificate that fails validation due to this change. The bug that this was copied from resolves part of that problem by setting the GODEBUG environment variable system-wide via systemd. However, that still leaves one gap: Ignition, which runs in the initrd and may similarly interface with external endpoints that present certificates that fail this validation.

Therefore, we should set the environment variable in the initrd as well.

This seems like a relatively small gap to close, so I don't believe that this should be a 4.6 GA blocker, but it'd be nice to get it fixed in early 4.6.z.



Description of problem:
Performing an OCP 4.6 installation in a restricted network on z/VM fails.  The 

Version-Release number of selected component (if applicable):
RHCOS 4.6.0-0.nightly-s390x-2020-09-10-112115
OCP 4.6.0-0.nightly-s390x-2020-09-22-223822

How reproducible:
Consistently

Steps to Reproduce:
1. Follow steps to configure the mirror host on bastion:
https://docs.openshift.com/container-platform/4.5/installing/install_config/installing-restricted-networks-preparations.html
2. Install cluster using restricted network steps:
https://docs.openshift.com/container-platform/4.5/installing/installing_bare_metal/installing-restricted-networks-bare-metal.html#installing-restricted-networks-bare-metal
3. IPL the bootstrap and cluster nodes.

Actual results: Bootstrap, master and worker nodes all start.  However, the master nodes never become Ready:

[root@OSPAMGR2 ~]# oc get nodes
NAME                                   STATUS     ROLES    AGE     VERSION
master-0.ospamgr2-sep22.zvmocp.notld   NotReady   master   4h1m    v1.19.0+8a39924
master-1.ospamgr2-sep22.zvmocp.notld   NotReady   master   3h56m   v1.19.0+8a39924
master-2.ospamgr2-sep22.zvmocp.notld   NotReady   master   3h48m   v1.19.0+8a39924

This prevents the worker nodes from starting.  The bootkube.service reports this:

Sep 23 23:02:41 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[19435]: E0923 23:02:41.319432       1 reflector.go:251] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to watch *v1.Pod: Get "https://localhost:6443/api/v1/pods?watch=true": dial tcp [::1]:6443: connect: connection refused
Sep 23 23:02:42 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[19435]: E0923 23:02:42.325119       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get "https://localhost:6443/api/v1/pods": dial tcp [::1]:6443: connect: connection refused
Sep 23 23:02:43 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[19435]: E0923 23:02:43.327963       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get "https://localhost:6443/api/v1/pods": dial tcp [::1]:6443: connect: connection refused
Sep 23 23:02:44 bootstrap-0.ospamgr2-sep22.zvmocp.notld bootkube.sh[19435]: E0923 23:02:44.332599       1 reflector.go:134] github.com/openshift/cluster-bootstrap/pkg/start/status.go:66: Failed to list *v1.Pod: Get "https://localhost:6443/api/v1/pods": dial tcp [::1]:6443: connect: connection refused


Expected results: Master and worker nodes start successfully


Additional info:

--- Additional comment from Scott Dodson on 2020-10-07 17:44:28 UTC ---

Sorry, I forgot to copy/paste what "this change" is. I'm referring to https://golang.google.cn/doc/go1.15#commonname
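
For anyone checking whether a given environment has the workaround in effect: GODEBUG is a comma-separated list of name=value pairs, so x509ignoreCN=0 may appear alongside other settings. A minimal sketch of such a check follows. The function name godebug_allows_cn is hypothetical and is not part of RHCOS, Ignition, or Go.

```shell
# Hypothetical helper: succeeds (exit status 0) when the GODEBUG string
# passed as $1 contains the entry x509ignoreCN=0.
godebug_allows_cn() {
    # Wrap the value in commas so the first and last entries of the
    # comma-separated list match the same pattern as middle entries.
    case ",${1-}," in
        *,x509ignoreCN=0,*) return 0 ;;
        *) return 1 ;;
    esac
}
```

It can be called as godebug_allows_cn "${GODEBUG-}" and the exit status tested in an if statement. An unset or empty GODEBUG is reported as not having the setting.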

--- Additional comment from Eric Paris on 2020-10-07 18:00:20 UTC ---

This bug has set a target release without specifying a severity. As part of triage when determining the importance of bugs a severity should be specified. Since these bugs have not been properly triaged we are removing the target release. Teams will need to add a severity before setting the target release again.

--- Additional comment from Scott Dodson on 2020-10-07 19:19:25 UTC ---

https://github.com/openshift/machine-config-operator/pull/2141#issuecomment-704989651 is where the discussion as to this need arose

--- Additional comment from Steve Milner on 2020-10-07 19:36:02 UTC ---

See the note on zipl in https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/8/html/managing_monitoring_and_updating_the_kernel/configuring-kernel-command-line-parameters_managing-monitoring-and-updating-the-kernel#understanding-kernel-command-line-parameters_configuring-kernel-command-line-parameters

It _may_ make sense to set this directly in the unit or Ignition code rather than setting it in initrd in general.

--- Additional comment from Micah Abbott on 2020-10-21 14:00:32 UTC ---

@Benjamin do you think it is reasonable to set the GODEBUG variable for just Ignition in the initrd?

--- Additional comment from Micah Abbott on 2020-10-21 14:00:52 UTC ---

Setting UpcomingSprint keyword as there are other higher priority tasks and issues being worked on.

--- Additional comment from Benjamin Gilbert on 2020-10-21 19:42:23 UTC ---

Yes, I do.

--- Additional comment from Colin Walters on 2020-11-11 22:52:55 UTC ---

xref https://github.com/openshift/oc/pull/628#issuecomment-725698791

Note this requires a bootimage update; we already have a request for one to pull in the fix for https://github.com/coreos/fedora-coreos-config/pull/733 too.

Comment 4 Michael Nguyen 2020-11-24 16:36:06 UTC

I don't have access to z systems but I verified that the dracut module is in 4.6.0-0.nightly-2020-11-22-160856 and RHCOS 46.82.202011210620-0 has the environment variable set.


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-11-22-160856   True        False         67s     Cluster version is 4.6.0-0.nightly-2020-11-22-160856
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
ip-10-0-130-200.us-west-2.compute.internal   Ready    master   26m   v1.19.0+43983cd
ip-10-0-137-216.us-west-2.compute.internal   Ready    worker   16m   v1.19.0+43983cd
ip-10-0-176-34.us-west-2.compute.internal    Ready    master   25m   v1.19.0+43983cd
ip-10-0-189-58.us-west-2.compute.internal    Ready    worker   16m   v1.19.0+43983cd
ip-10-0-196-76.us-west-2.compute.internal    Ready    master   26m   v1.19.0+43983cd
ip-10-0-209-11.us-west-2.compute.internal    Ready    worker   17m   v1.19.0+43983cd

$ oc debug node/ip-10-0-130-200.us-west-2.compute.internal 
Starting pod/ip-10-0-130-200us-west-2computeinternal-debug ...
To use host binaries, run `chroot /host`
If you don't see a command prompt, try pressing enter.

sh-4.2# chroot /host
sh-4.4# cat /usr/lib/dracut/modules.d/10ignition-godebug/*
# https://bugzilla.redhat.com/show_bug.cgi?id=1886134
# Because Ignition which runs in the initrd may interface with external endpoints,
# we should set the environment variable in the initrd
[Manager]
DefaultEnvironment=GODEBUG=x509ignoreCN=0
#!/bin/bash
# -*- mode: shell-script; indent-tabs-mode: nil; sh-basic-offset: 4; -*-
# ex: ts=8 sw=4 sts=4 et filetype=sh

depends() {
    echo systemd
}

install() {
    inst_simple "$moddir/10-default-env-godebug.conf" \
        "/etc/systemd/system.conf.d/10-default-env-godebug.conf"
}
sh-4.4# exit
exit
sh-4.2# exit
exit

Removing debug pod ...


Entering emergency mode. Exit the shell to continue.
Type "journalctl" to view system logs.
You might want to save "/run/initramfs/rdsosreport.txt" to a USB stick or /boot
after mounting them and attach it to a bug report.


:/# 
:/# 
:/# env
DRACUT_SYSTEMD=1
rflags=
INVOCATION_ID=1df0a0730de5400dbc5a297437e483eb
hook=emergency
PWD=/
root=
fstype=auto
HOME=/
JOURNAL_STREAM=9:13527
UDEVVERSION=239
hookdir=/lib/dracut/hooks
NEWROOT=/sysroot
DEBUG_MEM_LEVEL=0
action=Boot
TERM=vt220
GODEBUG=x509ignoreCN=0
SHLVL=1
RD_DEBUG=no
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin
PS1=:${PWD}# 
_rdshell_name=dracut
_=/usr/bin/env
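
The manual check above could be scripted roughly as follows. This is only a sketch: the function name dropin_sets_godebug is hypothetical, and the match is deliberately loose, looking only for a DefaultEnvironment= line in the form shown in the drop-in above.

```shell
# Hypothetical helper: succeeds when the systemd drop-in at path $1 has an
# uncommented DefaultEnvironment= line that sets GODEBUG=x509ignoreCN=0.
dropin_sets_godebug() {
    # Anchoring at the start of the line skips commented-out settings.
    grep -Eq '^DefaultEnvironment=.*GODEBUG=x509ignoreCN=0' "$1"
}
```

From the initrd emergency shell it would be pointed at the path the dracut module installs to, for example dropin_sets_godebug /etc/systemd/system.conf.d/10-default-env-godebug.conf, assuming grep is available in that environment.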

Comment 5 Jonathan Lebon 2020-11-26 16:32:35 UTC

Fixed by https://github.com/openshift/installer/pull/4422.

Comment 7 errata-xmlrpc 2020-11-30 16:46:09 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.6 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:5115
