Created attachment 1741709 [details]
installation log

Version:

$ openshift-install version
openshift-install 4.7.0-0.nightly-2020-12-21-131655
built from commit 6abb3b5a8b687ee38b6c96368c77a305f6f0b563
release image quay.io/openshift-release-dev/ocp-release-nightly@sha256:e5373e096ae81a2372bad8309a28fcc2a9f04b36295ff5e82329e4b5fc6afa7b

Platform: vSphere

Please specify:
* UPI (semi-manual installation on customized infrastructure)

What happened?

During installation all nodes reached the Ready state, but at the end I saw:

# oc get machineconfigpool,nodes -o wide
NAME                                                         CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
machineconfigpool.machineconfiguration.openshift.io/master   rendered-master-afecc5afb2668b5cc4f60f4b3fe96214   True      False      False      3              3                   3                     0                      3h48m
machineconfigpool.machineconfiguration.openshift.io/worker   rendered-worker-c255d81fe61657c3062da9e2cfbaee99   False     True       True       3              0                   0                     1                      3h48m

NAME                   STATUS                     ROLES    AGE     VERSION           INTERNAL-IP    EXTERNAL-IP    OS-IMAGE                                                        KERNEL-VERSION                CONTAINER-RUNTIME
node/compute-0         Ready,SchedulingDisabled   worker   3h43m   v1.20.0+87544c5   10.1.160.171   10.1.160.171   Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
node/compute-1         Ready                      worker   3h43m   v1.20.0+87544c5   10.1.160.173   10.1.160.173   Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
node/compute-2         Ready                      worker   3h42m   v1.20.0+87544c5   10.1.160.148   10.1.160.148   Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
node/control-plane-0   Ready                      master   3h50m   v1.20.0+87544c5   10.1.160.182   10.1.160.182   Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
node/control-plane-1   Ready                      master   3h50m   v1.20.0+87544c5   10.1.160.176   10.1.160.176   Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39
node/control-plane-2   Ready                      master   3h50m   v1.20.0+87544c5   10.1.160.190   10.1.160.190   Red Hat Enterprise Linux CoreOS 47.83.202012190438-0 (Ootpa)   4.18.0-240.8.1.el8_3.x86_64   cri-o://1.20.0-0.rhaos4.7.gitd388528.el8.39

.openshift_install.log is attached; the must-gather data will also be uploaded.

What did you expect to happen?

The installation to finish successfully.
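For anyone hitting the same symptom, the worker pool's NodeDegraded condition and the machine-config-daemon log on the affected node usually point at the failing step. A minimal sketch of the relevant queries (generic, not taken from this report; the daemon pod name is a placeholder):

# Show the worker pool's conditions; the NodeDegraded message names the failing node and unit
$ oc get machineconfigpool worker -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'

# Inspect the machine-config-daemon on the degraded node for the underlying error
$ oc -n openshift-machine-config-operator get pods -o wide | grep compute-2
$ oc -n openshift-machine-config-operator logs <machine-config-daemon-pod-on-compute-2> -c machine-config-daemon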
Created attachment 1741712 [details] must-gather data
This bug lacks any info as to what's been done to debug the situation, so I'm resetting the severity until the engineering teams have been able to triage it.
After applying a MachineConfig (worker-chrony-configuration), compute-2 has the DEGRADED flag set to True:

$ oc get MachineConfigPool worker
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-25d1d136587a3c56ee93aa85b1d49eb8   False     True       True       3              0                   0                     1                      7d

compute-2 is reporting "Unit file nodeip-configuration.service does not exist":

$ oc get machineconfigpool worker -o yaml
  - lastTransitionTime: "2020-12-28T12:19:44Z"
    message: 'Node compute-2 is reporting: "error enabling unit: Failed to enable unit:
      Unit file nodeip-configuration.service does not exist.\n"'
    reason: 1 nodes are reporting degraded status on sync
    status: "True"
    type: NodeDegraded
  - lastTransitionTime: "2020-12-28T12:19:44Z"
    message: ""
    reason: ""
    status: "True"
    type: Degraded
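The MachineConfig that triggered the rollout is not attached to this report; a worker chrony MachineConfig of this kind typically looks like the sketch below (the name matches the comment above, but the file contents and encoding are illustrative placeholders):

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-chrony-configuration
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - path: /etc/chrony.conf
        mode: 420
        overwrite: true
        contents:
          # Placeholder: base64 of the desired chrony.conf goes here
          source: data:text/plain;charset=utf-8;base64,<base64-encoded chrony.conf>

The chrony content itself is likely incidental: applying any worker MachineConfig triggers a re-render and node sync, and the sync appears to be where the missing nodeip-configuration.service unit surfaces.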
The nodeip-configuration.service deployment was modified here: https://github.com/openshift/machine-config-operator/commit/5c2d529bf1abc9c7cbc01dcfc7814c3a59092676

Moving to MCO.
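For background on why the error reads the way it does: a rendered MachineConfig enables a systemd unit through its systemd stanza, and if a unit is marked enabled but no unit file is delivered (either inline in the config or shipped in the OS image), the machine-config-daemon's enable step fails with exactly the message seen on compute-2. A minimal illustrative stanza, not the actual rendered config from this cluster:

spec:
  config:
    ignition:
      version: 3.2.0
    systemd:
      units:
      - name: nodeip-configuration.service
        enabled: true
        # If no "contents" are provided here and the unit file is not already
        # present on the host, enabling the unit fails with
        # "Unit file nodeip-configuration.service does not exist."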
*** Bug 1909642 has been marked as a duplicate of this bug. ***
*** Bug 1909570 has been marked as a duplicate of this bug. ***
Verified on 4.7.0-0.nightly-2021-01-13-124141: the UPI-on-vSphere installation completes and the MachineConfig update is successful.

$ ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-01-13-124141   True        False         68m     Cluster version is 4.7.0-0.nightly-2021-01-13-124141

$ ./oc get machineconfig
NAME                                               GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
00-master                                          69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
00-worker                                          69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
01-master-container-runtime                        69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
01-master-kubelet                                  69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
01-worker-container-runtime                        69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
01-worker-kubelet                                  69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
99-master-generated-registries                     69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
99-master-ssh                                                                                 3.1.0             95m
99-worker-generated-registries                     69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
99-worker-ssh                                                                                 3.1.0             95m
rendered-master-cbdfc843feae448daa9c23e9abfb02bb   69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
rendered-master-eed4433615759c0700eb64332f014044   69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             66m
rendered-worker-9c06b1f59d48afdad6cfdd5e0d466eeb   69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             91m
rendered-worker-bf723a438aae52dd35bff4f2720019eb   69ac8b941b0f29d3cfdfced35aded406d75bc84a   3.2.0             66m

$ ./oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-eed4433615759c0700eb64332f014044   True      False      False      3              3                   3                     0                      92m
worker   rendered-worker-bf723a438aae52dd35bff4f2720019eb   True      False      False      2              2                   2                     0                      92m
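To repeat the verification on another cluster, the rollout and per-node state can be watched with standard commands; a short sketch (the node name is illustrative):

# Watch the worker pool roll out after applying a test MachineConfig
$ oc get mcp worker -w

# Check the per-node machine-config-daemon state once the pool reports Updated
$ oc describe node compute-2 | grep machineconfiguration.openshift.io/state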
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Based on the blocker+ status for the child bug 1940585, we're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.

Sample answers are provided to give more context, and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this, we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
Never got the impact statement requested in comment 19, but I don't think we blocked anything on this, and it's been a while, so that's unlikely to change going forward. If folks are still bumping into the issue and think edges need blocking, please restore the UpgradeBlocker keyword.