Bug 1978041 - Proxy environment setting is missing in MCO env file
Summary: Proxy environment setting is missing in MCO env file
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6.z
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.6.z
Assignee: Yu Qi Zhang
QA Contact: Rio Liu
URL:
Whiteboard: UpdateRecommendationsBlocked
Depends On: 1920027
Blocks:
 
Reported: 2021-07-01 02:31 UTC by Yunfei Jiang
Modified: 2021-08-18 22:25 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-14 07:16:58 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
Github openshift/machine-config-operator pull 2652 (open): "Bug 1978041: templates: split unit dropins into separate files" (last updated 2021-07-01 03:58:37 UTC)
Red Hat Product Errata RHBA-2021:2641 (last updated 2021-07-14 07:17:12 UTC)

Description Yunfei Jiang 2021-07-01 02:31:52 UTC
Description:
When installing a cluster behind a proxy, the proxy environment settings are missing from the MCO-rendered env file, which causes the bootstrap process to fail.

On a master machine:
[core@ip-10-0-76-55 ~]$ cat /etc/systemd/system/kubelet.service.d/10-mco-default-env.conf
[Service]
Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"

This change was likely introduced by https://github.com/openshift/machine-config-operator/pull/2632
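
For comparison, on a cluster installed behind a proxy the MCO default-env dropin is expected to also carry the cluster-wide proxy variables, roughly along these lines (the proxy values below are placeholders, not taken from this cluster):

[Service]
Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"
# expected but missing on this node: proxy environment from the cluster Proxy config
Environment="HTTP_PROXY=http://proxy.example.com:3128"
Environment="HTTPS_PROXY=http://proxy.example.com:3128"
Environment="NO_PROXY=.cluster.local,.svc,localhost,127.0.0.1"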


[core@ip-10-0-76-55 ~]$ systemctl status machine-config-daemon-firstboot.service
● machine-config-daemon-firstboot.service - Machine Config Daemon Firstboot
   Loaded: loaded (/etc/systemd/system/machine-config-daemon-firstboot.service; enabled; vendor preset: enabled)
   Active: inactive (dead)
Condition: start condition failed at Wed 2021-06-30 02:27:59 UTC; 1h 14min ago
           └─ ConditionPathExists=/etc/ignition-machine-config-encapsulated.json was not met

[core@ip-10-0-76-55 ~]$ journalctl -f -u machine-config-daemon-firstboot.service
-- Logs begin at Wed 2021-06-30 02:17:16 UTC. --
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.042073    2012 rpm-ostree.go:261] Running captured: rpm-ostree status --json
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.099933    2012 rpm-ostree.go:184] Current origin is not custom
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.827752    2012 rpm-ostree.go:211] Pivoting to: 46.82.202106211840-0 (e0c0c734343efcd6b24cc771bcbad4beb8fbd556bd3b34df266f7b046fff956f)
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.827771    2012 rpm-ostree.go:243] Executing rebase from repo path /run/mco-machine-os-content/os-content-888034288/srv/repo with customImageURL pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b141fabe2c3589194d625ccfd7ce503c55c1833cbab238174c26e1148225bcba and checksum e0c0c734343efcd6b24cc771bcbad4beb8fbd556bd3b34df266f7b046fff956f
Jun 30 02:27:24 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:24.827779    2012 rpm-ostree.go:261] Running captured: rpm-ostree rebase --experimental /run/mco-machine-os-content/os-content-888034288/srv/repo:e0c0c734343efcd6b24cc771bcbad4beb8fbd556bd3b34df266f7b046fff956f --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b141fabe2c3589194d625ccfd7ce503c55c1833cbab238174c26e1148225bcba --custom-origin-description Managed by machine-config-operator
Jun 30 02:27:37 ip-10-0-76-55 machine-config-daemon[2012]: I0630 02:27:37.165464    2012 update.go:1678] initiating reboot: Completing firstboot provisioning to rendered-master-39d7b607b214247392cd5bb37907d15f
Jun 30 02:27:37 ip-10-0-76-55 systemd[1]: machine-config-daemon-firstboot.service: Main process exited, code=killed, status=15/TERM
Jun 30 02:27:37 ip-10-0-76-55 systemd[1]: machine-config-daemon-firstboot.service: Failed with result 'signal'.
Jun 30 02:27:37 ip-10-0-76-55 systemd[1]: Stopped Machine Config Daemon Firstboot.
Jun 30 02:27:37 ip-10-0-76-55 systemd[1]: machine-config-daemon-firstboot.service: Consumed 16.882s CPU time

Version:
4.6.0-0.nightly-2021-06-25-031210

Platform:
all

What happened?
Installation failed during the bootstrap phase.

What did you expect to happen?
The cluster installs successfully.

How to reproduce it (as minimally and precisely as possible)?
Install a cluster behind a proxy.
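
"Behind a proxy" here means a cluster-wide proxy configured at install time; a minimal install-config.yaml stanza along these lines is enough to hit it (hostname, port, and noProxy value are placeholders):

proxy:
  httpProxy: http://proxy.example.com:3128
  httpsProxy: http://proxy.example.com:3128
  noProxy: .example.internal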

Comment 4 W. Trevor King 2021-07-01 03:18:51 UTC
> Version:
> 4.6.0-0.nightly-2021-06-25-031210

To drop in a more recognizable string, that's the one we used for 4.6.37 [1].

[1]: https://amd64.ocp.releases.ci.openshift.org/releasestream/4-stable/release/4.6.37

Comment 5 W. Trevor King 2021-07-01 03:24:39 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.  The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.  Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug.  When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label.  The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?  Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this; we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z or 4.y.z to 4.y.z+1

Comment 6 W. Trevor King 2021-07-01 03:59:28 UTC
Setting depends on 4.7.0's bug 1920027.  That bug had three PRs:

* [2363] handling zero-length dropins/units.  This made it back to 4.6.21 [1].
* [2365] separating dropins for the kubelet.  Tried to take this back to 4.6.z in February, but it didn't apply cleanly [2].
* [2378] separating dropins for CRI-O.  Tried to take this back to 4.6.z in February, but it didn't apply cleanly [3].

But with bug 1926944 bringing [2632] into 4.6.37, we now have multiple dropins in the same file in 4.6, and need manual backports of 2365 and 2378, which can happen in a single PR or two linked from this bug.
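
To make the intent of the split concrete: each piece of environment should end up in its own dropin file, so rendering one piece can no longer clobber another. A rough sketch of the resulting layout (file names are illustrative only, not copied from the PRs):

/etc/systemd/system/kubelet.service.d/10-mco-default-env.conf
    [Service]
    Environment="GODEBUG=x509ignoreCN=0,madvdontneed=1"

/etc/systemd/system/kubelet.service.d/20-mco-proxy-env.conf   (illustrative name)
    [Service]
    Environment="HTTP_PROXY=http://proxy.example.com:3128"
    Environment="HTTPS_PROXY=http://proxy.example.com:3128"
    Environment="NO_PROXY=.cluster.local,.svc,localhost"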

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1933075#c8
[2]: https://github.com/openshift/machine-config-operator/pull/2365#issuecomment-785991276
[3]: https://github.com/openshift/machine-config-operator/pull/2378#issuecomment-785991657
[2363]: https://github.com/openshift/machine-config-operator/pull/2363
[2365]: https://github.com/openshift/machine-config-operator/pull/2365
[2378]: https://github.com/openshift/machine-config-operator/pull/2378
[2632]: https://github.com/openshift/machine-config-operator/pull/2632

Comment 16 errata-xmlrpc 2021-07-14 07:16:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.38 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2641

Comment 17 W. Trevor King 2021-08-18 22:25:28 UTC
I dunno if we got comment 5's requested impact statement here, but we'd tombstoned 4.6.37 on this bug back in July [1], so no need for an impact statement now.

[1]: https://github.com/openshift/cincinnati-graph-data/pull/902#event-4965638902

