The RHEL7 workers scaleup playbook succeeds, but after the RHCOS nodes are removed the worker machine config pool never completely rolls out.

Across 177 runs and 22 jobs (80.23% failed), matched 81.69% of failing runs and 36.36% of jobs.

CI Search: https://search.ci.openshift.org/?search=error%3A+timed+out+waiting+for+the+condition+on+machineconfigpools%2Fworker&maxAge=48h&context=1&type=bug%2Bjunit&name=e2e-aws-workers-rhel7&maxMatches=5&maxBytes=20971520&groupBy=job

Example job: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2428/pull-ci-openshift-machine-config-operator-master-e2e-aws-workers-rhel7/1365006063505510400

Build log snippet (after the scaleup.yml playbook completes successfully):

+ echo '2021-02-25 19:32:11+00:00 - Deleting CoreOS machinesets'
2021-02-25 19:32:11+00:00 - Deleting CoreOS machinesets
+ mapfile -t COREOS_MACHINE_SETS
++ oc get machinesets --namespace openshift-machine-api
++ grep worker
++ grep -v rhel
++ awk '{print $1}'
+ [[ 2 != 0 ]]
+ oc delete machinesets --namespace openshift-machine-api ci-op-7q1m0tqc-8eab8-vmr4x-worker-us-west-1a ci-op-7q1m0tqc-8eab8-vmr4x-worker-us-west-1b
machineset.machine.openshift.io "ci-op-7q1m0tqc-8eab8-vmr4x-worker-us-west-1a" deleted
machineset.machine.openshift.io "ci-op-7q1m0tqc-8eab8-vmr4x-worker-us-west-1b" deleted
++ date -u --rfc-3339=seconds
+ echo '2021-02-25 19:32:12+00:00 - Waiting for CoreOS nodes to be removed'
2021-02-25 19:32:12+00:00 - Waiting for CoreOS nodes to be removed
+ oc wait node --for=delete --timeout=10m --selector node.openshift.io/os_id=rhcos,node-role.kubernetes.io/worker
node/ip-10-0-149-56.us-west-1.compute.internal condition met
node/ip-10-0-156-31.us-west-1.compute.internal condition met
node/ip-10-0-208-64.us-west-1.compute.internal condition met
++ date -u --rfc-3339=seconds
+ echo '2021-02-25 19:34:32+00:00 - Waiting for worker machineconfigpool to update'
2021-02-25 19:34:32+00:00 - Waiting for worker machineconfigpool to update
+ oc wait machineconfigpool/worker --for=condition=Updated=True --timeout=10m
error: timed out waiting for the condition on machineconfigpools/worker

Example from MCD pod logs:

2021-02-25T19:42:02.027958844Z I0225 19:42:02.027484 23752 start.go:108] Version: machine-config-daemon-4.6.0-202006240615.p0-678-g4bda8ca2-dirty (4bda8ca28522b3f2e663e8e27f56e79884e8d6ce)
2021-02-25T19:42:02.030583723Z I0225 19:42:02.030550 23752 start.go:121] Calling chroot("/rootfs")
2021-02-25T19:42:02.067087660Z I0225 19:42:02.067030 23752 start.go:97] Copied self to /run/bin/machine-config-daemon on host
2021-02-25T19:42:02.070540459Z I0225 19:42:02.070489 23752 metrics.go:105] Registering Prometheus metrics
2021-02-25T19:42:02.070671155Z I0225 19:42:02.070630 23752 metrics.go:110] Starting metrics listener on 127.0.0.1:8797
2021-02-25T19:42:02.072647740Z I0225 19:42:02.072611 23752 update.go:1901] Starting to manage node: ip-10-0-130-229.us-west-1.compute.internal
2021-02-25T19:42:02.075268617Z I0225 19:42:02.075229 23752 rpm-ostree.go:261] Running captured: journalctl --list-boots
2021-02-25T19:42:02.076415481Z I0225 19:42:02.076381 23752 daemon.go:682] Detected a new login session: New session 1 of user ec2-user.
2021-02-25T19:42:02.076415481Z I0225 19:42:02.076403 23752 daemon.go:683] Login access is discouraged!
Applying annotation: machineconfiguration.openshift.io/ssh
2021-02-25T19:42:02.078529860Z I0225 19:42:02.078484 23752 daemon.go:871] journalctl --list-boots:
2021-02-25T19:42:02.078529860Z 0 2d64c5efd7cc4f1d988178be598026e4 Thu 2021-02-25 19:29:38 UTC—Thu 2021-02-25 19:42:02 UTC
2021-02-25T19:42:02.078529860Z I0225 19:42:02.078511 23752 rpm-ostree.go:261] Running captured: systemctl list-units --state=failed --no-legend
2021-02-25T19:42:02.082326587Z I0225 19:42:02.082286 23752 daemon.go:886] systemd service state: OK
2021-02-25T19:42:02.082326587Z I0225 19:42:02.082307 23752 daemon.go:620] Starting MachineConfigDaemon
2021-02-25T19:42:02.082326587Z I0225 19:42:02.082313 23752 daemon.go:590] Guarding against sigterm signal
2021-02-25T19:42:02.082369207Z I0225 19:42:02.082332 23752 daemon.go:627] Enabling Kubelet Healthz Monitor
2021-02-25T19:42:02.182675828Z W0225 19:42:02.182578 23752 daemon.go:648] Got an error from auxiliary tools: error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ip-10-0-130-229.us-west-1.compute.internal" not found
2021-02-25T19:42:02.182675828Z I0225 19:42:02.182628 23752 daemon.go:649] Shutting down MachineConfigDaemon
2021-02-25T19:42:02.182757582Z F0225 19:42:02.182731 23752 helpers.go:147] error: cannot apply annotation for SSH access due to: unable to update node "nil": node "ip-10-0-130-229.us-west-1.compute.internal" not found

Node ip-10-0-130-229.us-west-1.compute.internal Ready state is True.
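So the daemon hits the "node not found" error while applying the SSH annotation, logs it as fatal, and shuts down, even though the node itself is Ready; that would leave the node's config state unreported and the pool stuck short of Updated. When this reproduces, something along the following lines should show what the pool is waiting on and which workers never report Done (the machineconfiguration.openshift.io/state node annotation is assumed from the usual MCO layout, not taken from this job):

  # Worker pool conditions and the message behind the stalled Updated condition
  oc get machineconfigpool worker -o jsonpath='{range .status.conditions[*]}{.type}={.status}: {.message}{"\n"}{end}'

  # Per-node MCD state annotation on the remaining (RHEL) workers
  oc get nodes -l node-role.kubernetes.io/worker -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.annotations.machineconfiguration\.openshift\.io/state}{"\n"}{end}'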
Looking at the job failure history, it appears this PR introduced a change that causes the failure on RHEL nodes: the merge just before this PR passed e2e-aws-workers-rhel7, and this PR did not. https://github.com/openshift/machine-config-operator/pull/2395
Hmm, I wonder if the MCD is failing to terminate because of some weird RHEL upgrade limbo state. Passing to Ben to take a look.
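If it is a stuck or crash-looping daemon rather than a one-off error, the restart counts and the previous-container logs of the daemon pod on the affected node should show whether it keeps dying on the same annotation failure. A possible check (the k8s-app=machine-config-daemon label and the machine-config-daemon container name are assumed from the usual daemonset; the pod name is a placeholder):

  # Daemon pods with their node assignment and restart counts
  oc -n openshift-machine-config-operator get pods -l k8s-app=machine-config-daemon -o wide

  # Logs from the previously crashed container of the pod running on ip-10-0-130-229
  oc -n openshift-machine-config-operator logs <mcd-pod-on-ip-10-0-130-229> -c machine-config-daemon --previous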
This is a 4.8 blocker since it is a regression.
Dup of Bug 1933772

*** This bug has been marked as a duplicate of bug 1933772 ***