Bug 1981549 - Machine-config daemon does not recover from broken Proxy configuration
Summary: Machine-config daemon does not recover from broken Proxy configuration
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.10.0
Assignee: Yu Qi Zhang
QA Contact: Sergio
URL:
Whiteboard:
Duplicates: 1981146 2070930
Depends On:
Blocks: 2071686 2071689
 
Reported: 2021-07-12 19:09 UTC by W. Trevor King
Modified: 2023-09-18 00:28 UTC
CC List: 11 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 2071686
Environment:
Last Closed: 2022-03-12 04:35:46 UTC
Target Upstream Version:
Embargoed:


Links:
- GitHub openshift/machine-config-operator pull 2800 (open): Bug 1981549: lib/resourcemerge: handle container env var deletions (last updated 2021-10-12 20:43:03 UTC)
- Red Hat Product Errata RHSA-2022:0056 (last updated 2022-03-12 04:36:15 UTC)

Description W. Trevor King 2021-07-12 19:09:18 UTC
In a cluster updating from 4.7.9 to 4.7.11, the machine-config operator is stuck with [1]:

  - lastTransitionTime: "2021-07-12T16:02:10Z"
    message: 'Unable to apply 4.7.11: timed out waiting for the condition during syncRequiredMachineConfigPools:
      error pool master is not ready, retrying. Status: (pool degraded: true total:
      3, ready 0, updated: 0, unavailable: 1)'
    reason: RequiredPoolsFailed
    status: "True"
    type: Degraded
  ...
  extension:
    master: 'pool is degraded because nodes fail with "1 nodes are reporting degraded
      status on sync": "Node ip-10-0-198-230.ec2.internal is reporting: \"failed to
      run command nice (6 tries): timed out waiting for the condition: running nice
      -- ionice -c 3 podman pull -q --authfile /var/lib/kubelet/config.json quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d
      failed: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d:
      error pinging docker registry quay.io: Get \\\"https://quay.io/v2/\\\": proxyconnect
      tcp: dial tcp: lookup clusterhttpsproxy.example.com on 10.0.128.2:53: no such
      host\\n: exit status 125\""'

It seems the Proxy config object had the bogus clusterhttpsproxy.example.com value inserted and subsequently removed, but the machine-config daemon was unable to recover after the bogus insertion.  Machine-config daemon pod YAML is in [2] and logs in [3].  The machine-config DaemonSet still has the broken Proxy environment variables [4].  The machine-config ControllerConfig proxy entry appears to be clear:

  $ yaml2json <cluster-scoped-resources/machineconfiguration.openshift.io/controllerconfigs/machine-config-controller.yaml  | jq -r .spec.proxy  
  null

So I would expect the machine-config operator to be clearing the proxy environment variables off the DaemonSet [5], but that does not seem to be happening.

This is similar to bug 1928581, except in this case, the update wedged before the network operator pod went down, so it should be easier to recover from.

[1]: https://gist.github.com/abutcher/ce403a9ab355fe51dce8f549a8d2ae5d
[2]: https://gist.github.com/abutcher/9928a877af6d97d5e624501143a1d6ea
[3]: https://gist.github.com/abutcher/34cb0d5e4998ff27215290391b20b21f
[4]: https://gist.github.com/abutcher/92dea431695b483c4ed08911445a1a60
[5]: https://github.com/openshift/machine-config-operator/blob/e3863b02b7403342cdf0f981889e8c3cfc2d86bb/manifests/machineconfigdaemon/daemonset.yaml#L40-L53
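
For anyone triaging a similar cluster, the mismatch described above can be confirmed with two quick reads: the cluster Proxy spec on one side and the env vars still present in the MCD DaemonSet on the other.  A sketch, using the standard MCO resource names (the grep pattern is just illustrative):

  $ oc get proxy cluster -o jsonpath='{.spec}{"\n"}'
  $ oc -n openshift-machine-config-operator get daemonset machine-config-daemon -o yaml | grep -A 1 '_PROXY'

If the first command shows no proxy settings while the second still prints HTTP_PROXY/HTTPS_PROXY entries, the DaemonSet is carrying stale values the operator never cleaned up.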

Comment 2 W. Trevor King 2021-07-12 20:57:35 UTC
Attempting to repair, we manually removed the HTTP_PROXY and HTTPS_PROXY environment variables from the MCD DaemonSet.  The MCD update rolled out, but the two impacted nodes were then stuck with logs like:

  I0712 20:16:36.854515  317114 daemon.go:1088] Validating against current config rendered-master-8b288a4d95457df6a503d6516d2c6971
  E0712 20:16:36.854563  317114 writer.go:135] Marking Degraded due to: unexpected on-disk state validating against rendered-master-8b288a4d95457df6a503d6516d2c6971: expected target osImageURL "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d", have "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:d94c4dfe32d6b9321b61e221c35559e22cf6e96f75bef0345fe978530fff3e02"

Following [1], we touched /run/machine-config-daemon-force on the compute node to see if that would help, but it did not:

  E0712 20:36:16.210890  716066 writer.go:135] Marking Degraded due to: failed to update OS to quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d : with stdout output: : error running rpm-ostree rebase --experimental /srv/repo:f15456562794ed50b10e59645d97aae9fbeccad57ed23e3ccaa0fb1a021369cd --custom-origin-url pivot://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:e850d731de8eb871c5ec632e3e750e7e82e03e61c82b056f5c2b8af200e15a4d --custom-origin-description Managed by machine-config-operator: error: opendir(/srv/repo): No such file or directory
  : exit status 1

Full logs for that MCD in [2].

[1]: https://github.com/openshift/machine-config-operator/pull/2265/files
[2]: https://gist.github.com/abutcher/31e478f47f7961f99ddd024153e08d38
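
For reference, one way to perform the same manual steps with stock oc commands is sketched below.  This is only the mechanical equivalent of what was tried above, not a full recovery procedure (in this cluster it still left the osImageURL mismatch unresolved); <node> is a placeholder, and the trailing "-" on oc set env removes a variable:

  # Strip the stale proxy variables from the MCD DaemonSet (normally the operator should do this itself)
  $ oc -n openshift-machine-config-operator set env daemonset/machine-config-daemon HTTP_PROXY- HTTPS_PROXY- NO_PROXY-

  # Touch the force file on the affected node so the MCD skips on-disk state validation on its next sync
  $ oc debug node/<node> -- chroot /host touch /run/machine-config-daemon-force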

Comment 3 W. Trevor King 2021-07-12 22:13:26 UTC
Bug 1981146 sounds very similar to this one, possibly one bug should be closed as a dup of the other.

Comment 4 W. Trevor King 2021-07-13 14:48:43 UTC
The bug in the MCO code is that the DaemonSet management depends on ensureContainer, which currently ensures that the in-cluster resource contains at least the env vars the MCO wants, but does not remove any additional env vars that may already be present in the in-cluster resource [1].  So when you set the Proxy properties, the MCO injects the new env vars into the DaemonSet.  When you then clear the Proxy properties, the MCO no longer requires the DaemonSet to contain the proxy env vars, but instead of removing them it just says "huh, dunno what those are about; I'll let them stay".  We fixed something similar, but not quite the same, for the CVO in bug 1951339, and have [2] open about unifying the code-bases.

[1]: https://github.com/openshift/machine-config-operator/blob/a5421965cf4e440b94bd1a4774072c517291fda0/lib/resourcemerge/core.go#L102-L117
[2]: https://issues.redhat.com/browse/GRPA-3832
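
To make the intended fix concrete: the env-var merge needs "exactly these" semantics rather than "at least these".  A minimal Go sketch of that idea follows; this is not the actual MCO lib/resourcemerge code, the names are made up for illustration, and it compares only Name/Value (ignoring ValueFrom) to keep it short:

package main

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// ensureEnvVars makes existing match required exactly: required vars are added
// or updated, and vars that are no longer required are deleted.  It returns
// true when existing was modified.  The old behavior only added/updated, which
// is why stale HTTP_PROXY/HTTPS_PROXY entries survived after the cluster Proxy
// config was cleared.
func ensureEnvVars(existing *[]corev1.EnvVar, required []corev1.EnvVar) bool {
    modified := false

    // Add or update the env vars the operator wants.
    for _, req := range required {
        found := false
        for i := range *existing {
            if (*existing)[i].Name != req.Name {
                continue
            }
            found = true
            if (*existing)[i].Value != req.Value {
                (*existing)[i].Value = req.Value
                modified = true
            }
            break
        }
        if !found {
            *existing = append(*existing, req)
            modified = true
        }
    }

    // Delete env vars the operator no longer wants, e.g. proxy vars after the
    // cluster Proxy config was cleared.
    kept := (*existing)[:0]
    for _, cur := range *existing {
        wanted := false
        for _, req := range required {
            if req.Name == cur.Name {
                wanted = true
                break
            }
        }
        if wanted {
            kept = append(kept, cur)
        } else {
            modified = true
        }
    }
    *existing = kept

    return modified
}

func main() {
    have := []corev1.EnvVar{
        {Name: "NODE_NAME", Value: "ip-10-0-198-230.ec2.internal"},
        {Name: "HTTP_PROXY", Value: "http://clusterhttpsproxy.example.com:3128"},
    }
    want := []corev1.EnvVar{{Name: "NODE_NAME", Value: "ip-10-0-198-230.ec2.internal"}}

    fmt.Println(ensureEnvVars(&have, want)) // true
    fmt.Println(have)                       // only NODE_NAME remains
}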

Comment 8 Yu Qi Zhang 2021-10-19 15:27:48 UTC
*** Bug 1981146 has been marked as a duplicate of this bug. ***

Comment 15 Sergio 2021-11-04 09:49:47 UTC
Verified using an IPI AWS cluster with image 4.10.0-0.nightly-2021-10-30-025206

When we configure a proxy "proxy1" and later remove the proxy settings, a new MC is rendered without the proxy and the proxy variables are removed correctly from the MCD pods. When we then configure "proxy2", it is applied properly too. So we can consider the BZ verified.

However, we have observed that when we configure "proxy1" and then reconfigure it to "proxy2" instead of removing it first, the reconfiguration never happens, even after waiting 16 hours. This may be the result of being impacted by https://bugzilla.redhat.com/show_bug.cgi?id=2005694, but it does not look like a mere performance problem (a delayed reconfiguration).

To go from "proxy1" to "proxy2", we currently need to configure "proxy1" -> remove "proxy1" -> configure "proxy2".

Nevertheless, the behavior described in this BZ is fixed, so we move the status to VERIFIED.

Comment 16 Sergio 2021-11-04 10:49:06 UTC
Verification steps:

1. Configure a proxy in the cluster

oc edit proxy cluster
...
  spec:
    httpProxy: http://user:pass@proxy-fake:1111
    httpsProxy: http://user:pass@proxy-fake:1111
    noProxy: test.no-proxy.com
    trustedCA:
      name: ""

2. Verify that the proxy has been added to the MCD pods' environment variables
$ oc get pods -o yaml machine-config-daemon-6snvk | grep env -A 9
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    - name: HTTP_PROXY
      value: http://user:pass@proxy-fake:1111
    - name: HTTPS_PROXY
      value: http://user:pass@proxy-fake:1111

3. Remove the proxy from the cluster

oc edit proxy cluster
...
spec:
  trustedCA:
    name: ""

4. Verify that the proxy has been removed from the MCD pods' environment variables

$ oc get pods machine-config-daemon-g72rv -o yaml | grep env -A 9
...
    env:
    - name: NODE_NAME
      valueFrom:
        fieldRef:
          apiVersion: v1
          fieldPath: spec.nodeName
    image: quay.io/openshift-release-d.....
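
A possibly handy variant of steps 2 and 4 is to check every MCD pod in one shot instead of picking a single pod; this assumes the DaemonSet's usual k8s-app=machine-config-daemon label, and the jsonpath expression is illustrative:

$ oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[*].env[*].name}{"\n"}{end}'

After the proxy is removed, none of the listed pods should report HTTP_PROXY, HTTPS_PROXY, or NO_PROXY.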

Comment 17 Sergio 2021-11-04 14:47:55 UTC
The proxy1 -> proxy2 reconfiguration issue is a variation of this BZ https://bugzilla.redhat.com/show_bug.cgi?id=1981549. It will be tracked and fixed in this BZ.

Comment 18 Sergio 2021-11-05 08:32:19 UTC
I'm sorry, I made a mistake in my previous comment and put the wrong link.

The proxy1 -> proxy2 reconfiguration issue is actually tracked in this BZ https://bugzilla.redhat.com/show_bug.cgi?id=2005694

Comment 21 errata-xmlrpc 2022-03-12 04:35:46 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

Comment 22 Pablo Alonso Rodriguez 2022-04-04 14:32:05 UTC
*** Bug 2070930 has been marked as a duplicate of this bug. ***

Comment 23 Red Hat Bugzilla 2023-09-18 00:28:16 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days

