Bug 2021971
| Summary: | Azure intermittently failing to bring up matching number of nodes for number of machines | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Devan Goodwin <dgoodwin> |
| Component: | Machine Config Operator | Assignee: | Devan Goodwin <dgoodwin> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED WORKSFORME | Docs Contact: | |
| Severity: | medium | ||
| Priority: | medium | CC: | aos-bugs, dcbw, deads, jerzhang, kgarriso, mcambria, mkrejci, skumari, walters, wking |
| Version: | 4.10 | ||
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-05-16 18:19:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
Description Devan Goodwin 2021-11-10 13:36:57 UTC
This is also the source of the "error: timed out waiting for the condition on clusteroperators/dns" failure (seen in https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1458707667336826880 for example). In this test the details show clusteroperator/dns is progressing: DNS "default" reports Progressing=True: "Have 5 available DNS pods, want 6." So slow machine start appears to be the root cause. This check waits for 10 minutes *after* the installer has finished, and about an hour after the install started.

Looking deeper into the run, it's not actually a slow machine start. We appear to have a couple of cases:

1. The node never requests a CSR (this is a pretty deep failure).
2. The image pull for DNS (and many other pods) is failing, which leaves the clusteroperator pending for a long time.

https://docs.google.com/spreadsheets/d/14SjEN3be0-r9lJNTmoD4TtOpdAQlPgWoRG1ekPTHnHY/edit#gid=0 has details. I think the more severe problem is #1. Unfortunately, our CI runs require a running kubelet to get the kubelet logs.

We got a couple of hits over the weekend in the job where I added journal gathering from all machines via the ssh bastion: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872

In the gather-extra artifacts on that prow job, machine ci-op-hlz3byr5-253f3-t4cm6-worker-centralus3-qbfgm shows the status.Phase = Provisioned that we've seen in most other cases. The machine's internal IP is 10.0.128.4 and the gathered journal log is at: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/machine-journal-logs/10.0.128.4-journal.log

In that system journal log we can see a lot of:

    Nov 19 17:17:05 ci-op-hlz3byr5-253f3-t4cm6-worker-centralus3-qbfgm sh[1797]: Error: Error initializing source docker://quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1936375501b246ab84d334eb129bbb41d3e0547a5ddc237e973fe98bc49f8c1c: error pinging docker registry quay.io: Get "https://quay.io/v2/": dial tcp 50.16.140.223:443: i/o timeout

These errors range from 17:17 to 17:46, which is the point at which we gave up waiting for the 6th node to appear. Dan Winship describes this message as "almost always an MTU issue" here: https://bugzilla.redhat.com/show_bug.cgi?id=2011939#c1. We closed that bug after the test involved appeared to recover, but I think the suspected networking MTU issue on Azure is actually alive and (un)well. It is unusual that this virtually always affects just 1 of 6 machines, and very, very rarely 2.

Via Andrew Stoycos on slack, we're drifting towards MCO for this at the moment. The configure-ovs script (https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/files/configure-ovs-network.yaml) appears to have never run on the problematic worker, so the underlying host networking is never configured. The machine is also the only one which was never rebooted. See https://bugzilla.redhat.com/show_bug.cgi?id=2021971#c3 for the details on the specific prow job we're using to debug.
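For future occurrences, here is a rough sketch of how the symptom above can be confirmed from the cluster side. It assumes `oc` and `jq` are available and pointed at the affected cluster; the `.status.phase` path is an assumption about the Machine API schema rather than something taken from this run:

```bash
# Machines vs. nodes: in this bug one Machine stays in phase "Provisioned"
# and never produces a matching Node.
oc get machines -n openshift-machine-api -o wide
oc get nodes -o wide

# List Machines whose phase never progressed past Provisioned
# (field name is an assumption about the Machine API status schema).
oc get machines -n openshift-machine-api -o json \
  | jq -r '.items[] | select(.status.phase == "Provisioned") | .metadata.name'

# Case 1 above would show up as a CSR that is never requested for the node.
oc get csr

# Case 2 shows up in the machine's journal (gathered via the ssh bastion here)
# as repeated registry pull failures like the quay.io i/o timeout quoted above.
grep -i 'error pinging docker registry' 10.0.128.4-journal.log | head
```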
To summarize a slack thread we had on this: the script at https://github.com/openshift/machine-config-operator/blob/master/templates/common/_base/units/ovs-configuration.service.yaml#L7 is conditional on non-firstboot, so what you described so far is working as intended. The bootstrap flow is roughly:

1. The bootimage is booted. It has a stub ignition config that fetches the full config from an MCS endpoint (all 3 workers in this job did so successfully, on the same version). The full config contains all the actual files/units that should run on the real system, including the ovs script.
2. The MCD-firstboot service (not the pod) then runs, pulls the necessary image, pivots the node to the correct version, and reboots.

The node in question from Devan's link fails because the MCD pull (the service that pulls the updated OS image, and a prerequisite for MCD-firstboot) failed. Compare the bottom of https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/machine-journal-logs/10.0.128.4-journal.log with https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/machine-journal-logs/10.0.128.5-journal.log. So the question is why that particular worker node, which should have the same configs as the other ones during boot, fails to reach quay to pull the image.

A quick analysis of https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1461737680877391872/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/machine-journal-logs/10.0.128.4-journal.log shows that we are nowhere near OpenShift network plugin bringup yet. This is pure host-level networking, so any issue would be in the cloud provider connectivity for that VM, or in NetworkManager configuring the machine. NM just does DHCP here, though...

We've got some more hits over the weekend: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1465233192083525632. Worker machine 10.0.128.4 is stuck in the Provisioning state while all others are in Running. The journal was successfully gathered, as was ip and route info: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-upgrade-from-stable-4.9-e2e-azure-upgrade/1465233192083525632/artifacts/e2e-azure-upgrade/openshift-e2e-test/artifacts/ssh-bastion-gather/. The `ip addr` output is certainly quite different from that of all the other workers.
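To make the comparison between the two journals above easier to reproduce, here is a minimal sketch of the checks being described. The unit names (machine-config-daemon-pull.service, machine-config-daemon-firstboot.service, ovs-configuration.service) come from the MCO templates linked in this thread, but treat them and the grep patterns as assumptions for any particular release:

```bash
JOURNAL=10.0.128.4-journal.log   # one of the journals gathered via the ssh bastion

# Did the firstboot image pull succeed? On the stuck node it keeps failing
# with the quay.io i/o timeout quoted earlier.
grep -E 'machine-config-daemon-pull|Error initializing source' "$JOURNAL" | tail -n 5

# Did the node pivot and reboot? The stuck node is the only one that never
# rebooted. "-- Reboot --" is the separator journalctl prints between boots,
# assuming the gather dumped all boots into one file.
grep -c -- '-- Reboot --' "$JOURNAL"

# Did configure-ovs run? It is gated on non-firstboot, so it should only show
# up after the pivot/reboot; on the stuck node it never appears at all.
grep -E 'ovs-configuration|configure-ovs\.sh' "$JOURNAL" | head -n 5
```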
It doesn't look like br-ex was created (but the script did run):

    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: + echo 'OVS configuration successfully reverted'
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: OVS configuration successfully reverted
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: + nmcli c show
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: NAME UUID TYPE DEVICE
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: Wired Connection 1a77922c-1a75-4ac4-9e8c-b161e4b36227 ethernet eth0
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: + ovs-vsctl show
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: 9f4d1f69-ff33-4316-bac8-7927e1503c0b
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: ovs_version: "2.15.2"
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: + ip route
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: default via 10.0.0.1 dev eth0 proto dhcp metric 100
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: 10.0.0.0/17 dev eth0 proto kernel scope link src 10.0.0.6 metric 100
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: 168.63.129.16 via 10.0.0.1 dev eth0 proto dhcp metric 100
    Nov 29 08:37:29 ci-op-4nbtw992-253f3-p97wl-master-1 configure-ovs.sh[1341]: 169.254.169.254 via 10.0.0.1 dev eth0 proto dhcp metric 100

It looks like this is resolved in 4.11; at least there are no hits in the last week, and only a few for 4.9 and 4.10: https://search.ci.openshift.org/?search=Timed+out+waiting+for+node+count&maxAge=168h&context=1&type=junit&name=azure&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Probably safe to close for now; we'll reopen if it resurfaces. This was almost certainly fixed by https://github.com/openshift/machine-config-operator/pull/3070.
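If this does resurface, here is a minimal sketch of the same host-level checks shown in the excerpt above, run directly on a suspect node (for example via `oc debug node/<name>` or the ssh bastion). The expectations in the comments reflect how configure-ovs is supposed to leave the host, not anything verified beyond the logs in this bug:

```bash
# After a successful configure-ovs run the connection list should include a
# br-ex entry; the failing excerpt above only shows "Wired Connection" on eth0.
nmcli -f NAME,TYPE,DEVICE connection show

# The OVS database should contain the br-ex bridge; above, `ovs-vsctl show`
# returns nothing but the bare ovs_version line.
ovs-vsctl br-exists br-ex && echo "br-ex present" || echo "br-ex missing"

# With OVN-Kubernetes the default route is expected to move from eth0 to br-ex;
# in the excerpt it is still on eth0.
ip route show default
```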