Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 2021971

Summary: Azure intermittently failing to bring up matching number of nodes for number of machines
Product: OpenShift Container Platform
Component: Machine Config Operator
Sub component: Machine Config Operator
Version: 4.10
Reporter: Devan Goodwin <dgoodwin>
Assignee: Devan Goodwin <dgoodwin>
QA Contact: Rio Liu <rioliu>
Docs Contact:
Status: CLOSED WORKSFORME
Severity: medium
Priority: medium
CC: aos-bugs, dcbw, deads, jerzhang, kgarriso, mcambria, mkrejci, skumari, walters, wking
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-16 18:19:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Devan Goodwin 2021-11-10 13:36:57 UTC
In CI, we recently added checks before running tests to ensure that we have a matching number of machines and nodes, as there are tests which require the specified capacity.

In Azure, we're seeing problems where sometimes we're short one worker node.
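
For anyone reproducing the check by hand, a rough sketch (not the exact CI test code) of how to compare the two counts and spot the machine that never became a node:

# Count Machines vs Nodes; the CI check expects these to match.
oc get machines.machine.openshift.io -n openshift-machine-api --no-headers | wc -l
oc get nodes --no-headers | wc -l

# A machine that never produced a node usually sits in a non-Running phase with no nodeRef:
oc get machines.machine.openshift.io -n openshift-machine-api \
  -o custom-columns=NAME:.metadata.name,PHASE:.status.phase,NODE:.status.nodeRef.name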

Using: https://search.ci.openshift.org/?search=6+Machines+-+.*Nodes&maxAge=168h&context=1&type=bug%2Bjunit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Over the past 7 days we see this error message impacting ~4% of all Azure runs and ~7.2% of failed Azure runs.

Impact is highest on upgrade jobs, for example periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade, which is run very frequently and is experiencing: 516 runs, 97% failed, 12% of failures match = 11% impact

Sample prow job:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458344287241506816

Comment 1 David Eads 2021-11-11 14:18:57 UTC
This is also the source of the "error: timed out waiting for the condition on clusteroperators/dns" failure (seen in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458707667336826880 for example).

In this test the details show:

  clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6."


So a slow machine start appears to be the root cause. This check waits for 10 minutes *after* the installer has finished, roughly an hour after the install started.
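
For illustration, a couple of commands (a sketch, assuming normal oc access to the cluster under test) that surface the same condition:

# The dns clusteroperator reports Progressing=True with "Have 5 available DNS pods, want 6."
oc get clusteroperator dns -o yaml

# The DNS daemonset runs one pod per node, so a missing node shows up as a missing pod.
oc -n openshift-dns get pods -o wide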

Comment 2 David Eads 2021-11-11 16:48:22 UTC
Looking deeper into the run, it's not actually a slow machine start. We appear to have a couple of cases:

1. the node never requests a CSR (this is a pretty deep failure)
2. the image pull for DNS (and many other pods) is failing, which causes the clusteroperator to stay pending for a long time

https://docs.google.com/spreadsheets/d/14SjEN3be0-r9lJNTmoD4TtOpdAQlPgWoRG1ekPTHnHY/edit#gid=0 has details.

I think the more severe problem is #1.  Unfortunately, our CI runs require a running kubelet to get the kubelet logs.
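
For case #1, a hedged cluster-side check that the node never requested a client certificate (sketch only; the CSR name below is illustrative):

# A node that never joins typically never files a node-bootstrapper client CSR.
oc get csr -o wide

# Inspect an individual request while debugging:
oc describe csr csr-xxxxx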

Comment 3 Devan Goodwin 2021-11-22 15:43:30 UTC
We got a couple of hits over the weekend in the job where I added journal gathering from all machines via the ssh bastion.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872

In the gather-extra artifacts on that prow job, machine ci-op-hlz3byr5-253f3-t4cm6-worker-centralus3-qbfgm shows the status.Phase = Provisioned that we've seen in most other cases. The machine's internal IP is 10.0.128.4 and the gathered journal log is at: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/machine-journal-logs/10.0.128.4-journal.log

In that system journal log we can see a lot of:

Nov 19 17:17:05 ci-op-hlz3byr5-253f3-t4cm6-worker-centralus3-qbfgm sh[1797]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1936375501b246ab84d334eb129bbb41d3e0547a5ddc237e973fe98bc49f8c1c: error pinging docker registry quay.io: Get "https://quay.io/v2/": dial tcp 50.16.140.223:443: i/o timeout

These errors range from 17:17 to 17:46, which is the point at which we gave up waiting for the 6th node to appear.
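
A rough way to pull these failures out of the gathered journal (assuming the log linked above has been downloaded locally as 10.0.128.4-journal.log):

# Count the registry timeouts and show a few examples.
grep -c 'i/o timeout' 10.0.128.4-journal.log
grep 'error pinging docker registry quay.io' 10.0.128.4-journal.log | head -n 3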

Dan Winship refers to this message as "almost always an MTU issue" here: https://bugzilla.redhat.com/show_bug.cgi?id=2011939#c1. We closed that bug after the test involved appeared to recover, but I think the suspected networking MTU issue on Azure is actually alive and (un)well.

It is unusual that this is virtually always affecting just 1 of 6 machines, very, very rarely 2.

Comment 7 Devan Goodwin 2021-11-23 18:48:54 UTC
Via Andrew Stoycos on slack, we're drifting towards MCO for this at the moment. The configure-ovs script (https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/configure-ovs-network.yaml) appears to have never run on the problematic worker, and thus underlying host networking is never configured. The machine is also the only one which was never rebooted.
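
A hedged way to verify that on a node or against the gathered journal (the unit name is inferred from the MCO template filename and is an assumption here):

# Did configure-ovs ever run on this boot?
journalctl -u ovs-configuration.service --no-pager | head
grep 'configure-ovs.sh' 10.0.128.4-journal.log | head

# A machine that rebooted through the firstboot pivot shows more than one boot:
journalctl --list-boots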

See https://bugzilla.redhat.com/show_bug.cgi?id=2021971#c3 for the details on the specific prow job we're using to debug.

Comment 8 Yu Qi Zhang 2021-11-23 19:35:40 UTC
To summarize a slack thread we had on this:

The script https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/units/ovs-configuration.service.yaml#L7 is conditional on non-firstboot, so what you described so far is working as intended. The bootstrap flow is something like this:

The bootimage is booted. It has a stub ignition to fetch the full config from an MCS endpoint (all 3 workers in the job did so successfully on the same version), and the full config contains all the actual files/units that should be running on the actual system, including the ovs script. Then the MCD-firstboot service (not the pod) runs, pulls the necessary image, pivots the node to the correct version, and reboots.
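
A rough way to see that gating on a live node (standard systemctl invocations; the exact condition is whatever the template above renders):

# Dump the rendered unit to see how it is ordered/conditioned relative to firstboot.
systemctl cat ovs-configuration.service

# ConditionResult/After/ActiveState show whether and why the unit ran on this boot.
systemctl show ovs-configuration.service -p ConditionResult -p After -p ActiveState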

The node in question from Devan's link fails because the MCD pull (the service that pulls the updated os image, and pre-req for MCD-firstboot) failed. See bottom of: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/machine-journal-logs/10.0.128.4-journal.log

vs https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/machine-journal-logs/10.0.128.5-journal.log

So the question is why that particular worker node, which should have the same configs as the other ones during booting, is failing to access quay to pull the image.
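
To compare the good and bad nodes, a sketch assuming both journals linked above are saved locally; the unit names for the MCD pull/firstboot path are an assumption:

for f in 10.0.128.4-journal.log 10.0.128.5-journal.log; do
  echo "== $f =="
  # The bad node should show the pull failing; the good node should show the pivot and reboot.
  grep -E 'machine-config-daemon-(pull|firstboot)' "$f" | tail -n 5
done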

Comment 9 Dan Williams 2021-11-24 18:20:46 UTC
Quick analysis of https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/machine-journal-logs/10.0.128.4-journal.log shows that we are nowhere near OpenShift network plugin bringup yet. This is pure host-level networking, so any issue would be in the cloud provider connectivity for that VM, or with NetworkManager configuring the machine. NM just does DHCP here though...

Comment 10 Devan Goodwin 2021-11-29 13:09:19 UTC
We've got some more hits over the weekend.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1465233192083525632

Worker Machine 10.0.128.4 stuck in Provisioning state, all others in Running.

The journal was successfully gathered, as was the ip and route info: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1465233192083525632/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/ssh-bastion-gather/

The ip addr output is certainly quite different from that of all the other workers.
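
For reference, the per-node files in that directory correspond to what these commands produce on each host; on a worker where configure-ovs succeeded, the primary IP is expected to have moved to br-ex rather than staying on eth0 (a hedged expectation, not confirmed for this run):

ip addr show
ip route show
nmcli c show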

Comment 11 mcambria@redhat.com 2021-11-29 14:40:23 UTC
It doesn't look like br-ex was created (but the script did run):

Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: + echo 'OVS configuration successfully reverted'
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: OVS configuration successfully reverted
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: + nmcli c show
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: NAME              UUID                                  TYPE      DEVICE
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: Wired Connection  1a77922c-1a75-4ac4-9e8c-b161e4b36227  ethernet  eth0
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: + ovs-vsctl show
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: 9f4d1f69-ff33-4316-bac8-7927e1503c0b
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]:     ovs_version: "2.15.2"
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: + ip route
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: default via 10.0.0.1 dev eth0 proto dhcp metric 100
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: 10.0.0.0/17 dev eth0 proto kernel scope link src 10.0.0.6 metric 100
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: 168.63.129.16 via 10.0.0.1 dev eth0 proto dhcp metric 100
Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: 169.254.169.254 via 10.0.0.1 dev eth0 proto dhcp metric 100
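
A hedged way to confirm the missing bridge directly on such a node (ovs-vsctl br-exists returns non-zero when the bridge does not exist; the NM connection names are the ones configure-ovs normally creates, listed here as an assumption):

# Expect a non-zero exit code here, matching the empty ovs-vsctl show output above.
ovs-vsctl br-exists br-ex; echo "br-exists rc=$?"

# On a successfully configured node, nmcli shows connections like br-ex / ovs-if-br-ex,
# not just the single "Wired Connection" seen above.
nmcli -g NAME,TYPE,DEVICE c show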

Comment 19 Devan Goodwin 2022-05-16 18:19:00 UTC
It looks like this is resolved in 4.11; at least there are no hits in the last week.

A few for 4.9 and 4.10: https://search.ci.openshift.org/?search=Timed+out+waiting+for+node+count&maxAge=168h&context=1&type=junit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Probably safe to close for now, we'll reopen if it resurfaces.

Comment 20 Colin Walters 2022-05-16 19:00:36 UTC
This was almost certainly fixed by https://github.com/openshift/machine-config-operator/pull/3070