Bug 1978041

Summary: Proxy environment setting is missing in MCO env file
Product: OpenShift Container Platform
Reporter: Yunfei Jiang <yunjiang>
Component: Machine Config Operator
Assignee: Yu Qi Zhang <jerzhang>
Status: CLOSED ERRATA
QA Contact: Rio Liu <rioliu>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 4.6.z
CC: bleanhar, jerzhang, jhou, jialiu, jima, rioliu, schoudha, wking
Target Milestone: ---
Keywords: Regression, TestBlocker, UpgradeBlocker
Target Release: 4.6.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: UpdateRecommendationsBlocked
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-14 07:16:58 UTC Type: Bug
Bug Depends On: 1920027    
Bug Blocks:    

Description Yunfei Jiang 2021-07-01 02:31:52 UTC
Description:
When installing a cluster behind a proxy, the proxy settings are missing from the MCO environment file, which causes the bootstrap process to fail.

On a master machine:
[core@ip-10-0-76-55 ~]$ cat /etc/systemd/system/kubelet.service.d/10-mco-default-env.conf
[Service]
Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"

This change may have been introduced by https://github.com/openshift/machine-config-operator/pull/2632
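
For anyone triaging a similar install, the check on the drop-in shown above can be scripted. The following is a minimal sketch, not part of the MCO: it assumes the proxy settings would appear as HTTP_PROXY, HTTPS_PROXY and NO_PROXY entries on Environment= lines of 10-mco-default-env.conf, and it approximates systemd's quoting rules with Python's shlex. The function names are hypothetical.

```python
import shlex

# Assumption: these are the variable names a proxied cluster would carry
# in the kubelet drop-in; adjust if your environment uses other names.
PROXY_VARS = ("HTTP_PROXY", "HTTPS_PROXY", "NO_PROXY")


def parse_environment(dropin_text):
    """Collect NAME=value pairs from Environment= lines of a drop-in."""
    env = {}
    for raw in dropin_text.splitlines():
        line = raw.strip()
        if not line.startswith("Environment="):
            continue
        # shlex only approximates systemd's quoting, which is enough here.
        for token in shlex.split(line[len("Environment="):]):
            name, sep, value = token.partition("=")
            if sep:
                env[name] = value
    return env


def missing_proxy_vars(dropin_text):
    """Return the proxy variable names that the drop-in does not set."""
    env = parse_environment(dropin_text)
    return [name for name in PROXY_VARS if name not in env]


# Contents of the drop-in as reported in this bug.
REPORTED = """[Service]
Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"
"""

print(missing_proxy_vars(REPORTED))
```

On a node, the same functions could be fed the text of /etc/systemd/system/kubelet.service.d/10-mco-default-env.conf; a non-empty result on a proxied cluster would match the symptom described here.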


[core@ip-10-0-76-55 ~]$ systemctl status machine-config-daemon-firstboot.service
● machine-config-daemon-firstboot.service - Machine Config Daemon Firstboot
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-firstboot.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
Condition: start condition failed at Wed 2021-06-30 02:27:59 UTC; 1h 14min ago
           └─ ConditionPathExists=/etc/ignition-machine-config-encapsulated.json was not met

[core@ip-10-0-76-55 ~]$ journalctl -f -u machine-config-daemon-firstboot.service
-- Logs begin at Wed 2021-06-30 02:17:16 UTC. --
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.042073    2012 rpm-ostree.go:261] Running captured: rpm-ostree status --json
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.099933    2012 rpm-ostree.go:184] Current origin is not custom
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.827752    2012 rpm-ostree.go:211] Pivoting to: 46.82.202106211840-0 (e0c0c734343efcd6b24cc771bcbad4beb8fbd556bd3b34df266f7b046fff956f)
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.827771    2012 rpm-ostree.go:243] Executing rebase from repo path /run/mco-machine-os-content/os-content-888034288/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b141fabe2c3589194d625ccfd7ce503c55c1833cbab238174c26e1148225bcba and checksum e0c0c734343efcd6b24cc771bcbad4beb8fbd556bd3b34df266f7b046fff956f
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.827779    2012 rpm-ostree.go:261] Running captured: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-888034288/srv/repo:e0c0c734343efcd6b24cc771bcbad4beb8fbd556bd3b34df266f7b046fff956f --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b141fabe2c3589194d625ccfd7ce503c55c1833cbab238174c26e1148225bcba --custom-origin-description Managed by machine-config-operator
Jun 30 02:27:37 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:37.165464    2012 update.go:1678] initiating reboot: Completing firstboot provisioning to rendered-master-39d7b607b214247392cd5bb37907d15f
Jun 30 02:27:37 ip-10-0-76-55 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=killed, status=15/TERM
Jun 30 02:27:37 ip-10-0-76-55 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'signal'.
Jun 30 02:27:37 ip-10-0-76-55 systemd[1]: Stopped Machine Config Daemon Firstboot.
Jun 30 02:27:37 ip-10-0-76-55 systemd[1]: machine-config-daemon-firstboot.service: Consumed 16.882s CPU time

Version:
4.6.0-0.nightly-2021-06-25-031210

Platform:
all

What happened?
Installation failed at the bootstrap phase.

What did you expect to happen?
Install cluster successfully.

How to reproduce it (as minimally and precisely as possible)?
Install a cluster behind a proxy.

Comment 4 W. Trevor King 2021-07-01 03:18:51 UTC
> Version:
> 4.6.0-0.nightly-2021-06-25-031210

To drop in a more recognizable string, that's the one we used for 4.6.37 [1].

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.6.37

Comment 5 W. Trevor King 2021-07-01 03:24:39 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.  The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.  Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug.  When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label.  The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?  Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

Comment 6 W. Trevor King 2021-07-01 03:59:28 UTC
Setting depends on 4.7.0's bug 1920027.  That bug had three PRs:

* [2363] handling zero-length dropins/units.  This made it back to 4.6.21 [1].
* [2365] separating dropins for the kubelet.  Tried to take this back to 4.6.z in February, but it didn't apply cleanly [2].
* [2378] separating dropins for CRI-O.  Tried to take this back to 4.6.z in February, but it didn't apply cleanly [3].

But with bug 1926944 bringing [2632] into 4.6.37, we now have multiple dropins in the same file in 4.6, and need manual backports of 2365 and 2378, which can happen in a single PR or two linked from this bug.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1933075#c8
[2]: https://github.com/openshift/machine-config-operator/pull/2365#issuecomment-785991276
[3]: https://github.com/openshift/machine-config-operator/pull/2378#issuecomment-785991657
[2363]: https://github.com/openshift/machine-config-operator/pull/2363
[2365]: https://github.com/openshift/machine-config-operator/pull/2365
[2378]: https://github.com/openshift/machine-config-operator/pull/2378
[2632]: https://github.com/openshift/machine-config-operator/pull/2632

Comment 16 errata-xmlrpc 2021-07-14 07:16:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2641

Comment 17 W. Trevor King 2021-08-18 22:25:28 UTC
I dunno if we got comment 5's requested impact statement here, but we'd tombstoned 4.6.37 on this bug back in July [1], so no need for an impact statement now.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/902#event-4965638902