Bug 1941107 - CI ignition failures
Summary: CI ignition failures
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: RHCOS
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Micah Abbott
QA Contact: Michael Nguyen
URL:
Whiteboard:
Depends On: 1940551
Blocks: 1940972
 
Reported: 2021-03-20 00:56 UTC by Michael Gugino
Modified: 2021-10-15 13:40 UTC
CC List: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1940551
Environment:
Last Closed: 2021-10-15 13:40:21 UTC
Target Upstream Version:
Embargoed:



Description Michael Gugino 2021-03-20 00:56:02 UTC
+++ This bug was initially created as a clone of Bug #1940551 +++

https://search.ci.openshift.org/?search=Managed+cluster+should+have+same+number+of+Machines+and+Nodes&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Looking at the last 7 days, somewhere between 1 and 3% of jobs on metal, vsphere, ovirt, and openstack fail to get all the nodes they expect.  These jobs are on environments with slightly lower reliability than aws/gcp/azure, so some level of failure is expected, but it also points towards potential sources of error on our side that simply haven't been identified and fixed.

I would like metal / vsphere investigated to determine whether there is a consistent set of reasons that isn't related to operational concerns from the folks running those environments (metal is slightly more concerning).  If we can show that the bulk of those failures are accounted for by infrastructure issues, we can drop the severity of this to medium.  However, reducing those sources of failure will also make those teams happier, and we may have gaps that have crept in over time.

Setting as high until we determine that this is not a single systemic issue on these platforms.

--- Additional comment from Clayton Coleman on 2021-03-18 15:38:24 UTC ---

A much smaller number of failures have happened on AWS / Azure / GCP, so this is probably not common to all platforms (unless there is a race condition triggered by slower provisioning, or there are edge cases where we aren't tolerant of non-cloud errors in the provider stack, etc.)

Comment 1 Michael Gugino 2021-03-20 00:59:14 UTC
I've been investigating the above CI query looking for failures.  I found quite a few related to CSRs, especially on vSphere.  Those are separate issues from this and have their own bugs filed.

I'm seeing random failures of nodes to join the cluster:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2458/pull-ci-openshift-machine-config-operator-master-e2e-aws/1372953783738306560

Machine-api looks good and the instance is running, but we never receive a CSR request.  This is 1 of 3 workers not coming up as it should.  I've seen this on a few other jobs (I don't have links handy at the moment), but it's more than just this one.  Either the node is failing to run Ignition or the kubelet is failing to create the CSR request.
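
For anyone chasing this against a live cluster, the "no CSR" symptom can be confirmed with a few standard checks; a sketch, nothing here is specific to the job above:

```
# Compare machines against nodes to find the worker that never joined
oc get machines -n openshift-machine-api -o wide
oc get nodes

# If the machine is Running but missing from the node list, check whether
# its kubelet ever submitted a certificate signing request
oc get csr
```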

Comment 3 Yu Qi Zhang 2021-03-22 14:25:13 UTC
The two jobs display similar characteristics, but I can't really tell if the fundamental error is the same.  For the first job, I can see 3 requests to the MCS, which is expected:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2458/pull-ci-openshift-machine-config-operator-master-e2e-aws/1372953783738306560/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-server-g6ch4_machine-config-server.log

has 2 and

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2458/pull-ci-openshift-machine-config-operator-master-e2e-aws/1372953783738306560/artifacts/e2e-aws/gather-extra/artifacts/pods/openshift-machine-config-operator_machine-config-server-jbz5q_machine-config-server.log

has 1

The console is the only information we have for the node that didn't come up, and it's stuck on the fetch stage:

https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2458/pull-ci-openshift-machine-config-operator-master-e2e-aws/1372953783738306560/artifacts/e2e-aws/gather-aws-console/artifacts/i-0f7db99fe5ffda4e4

So either it's still working (unlikely) or it failed during the Ignition stage.

In the second case, https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1372727043803844608/artifacts/e2e-aws-upgrade/gather-aws-console/artifacts/i-0563ef58253b69588

The node seems to be stuck on networking, and because it's an upgrade job, the logs from the initial MCS are lost.

Going to pass to RHCOS team to see if they have more insight on those specific failures. I suspect there isn't enough info for us to work off of, though
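
For reference, the MCS request counts above can be tallied from the gathered artifacts with something like the following; a sketch that assumes the two machine-config-server logs linked above were downloaded locally, and the grep pattern is a guess at the request log line, so it may need adjusting:

```
# Count worker config requests served by each machine-config-server pod
grep -c 'config/worker' openshift-machine-config-operator_machine-config-server-g6ch4_machine-config-server.log
grep -c 'config/worker' openshift-machine-config-operator_machine-config-server-jbz5q_machine-config-server.log
```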

Comment 4 Micah Abbott 2021-03-23 18:46:32 UTC
(In reply to Yu Qi Zhang from comment #3)

> Going to pass to RHCOS team to see if they have more insight on those
> specific failures. I suspect there isn't enough info for us to work off of,
> though

Jerry is right about the AWS jobs that he linked to - there's just not enough info to make a determination about what is going on.

One failure looks like the metadata service isn't reachable:

```
[   50.785875] ignition[709]: INFO     : PUT http://169.254.169.254/latest/api/token: attempt #13
[   50.795771] ignition[709]: INFO     : PUT error: Put "http://169.254.169.254/latest/api/token": dial tcp 169.254.169.254:80: connect: network is unreachable
```
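
If a similarly failing instance were still reachable, the same IMDSv2 sequence could be exercised by hand from a debug shell; this is just a sketch mirroring the PUT/GET calls Ignition makes, not something we can run against the already-terminated CI instance:

```
# Request an IMDSv2 session token (the PUT Ignition is retrying above),
# then use it to read the instance user-data
TOKEN=$(curl -s -X PUT "http://169.254.169.254/latest/api/token" \
  -H "X-aws-ec2-metadata-token-ttl-seconds: 300")
curl -s -H "X-aws-ec2-metadata-token: $TOKEN" \
  "http://169.254.169.254/latest/user-data" | head
```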

The fetch stage should run after `network-online.target`, but it isn't clear why the fetch stage can't get the config

```
[*     ] A start job is running for Ignition (fetch) (12s / no limit)
[**    ] A start job is running for Ignition (fetch) (12s / no limit)
[***   ] A start job is running for Ignition (fetch) (13s / no limit)
[ ***  ] A start job is running for Ignition (fetch) (13s / no limit)
[  *** ] A start job is running for Ignition (fetch) (14s / no limit)
[   ***] A start job is running for Ignition (fetch) (14s / no limit)
[    **] A start job is running for Ignition (fetch) (15s / no limit)
```

I'll tag in some more Ignition experts, but I suspect we would need more concrete data.

@Jonathan @Benjamin @Sohan...any additional insight?


As for the original report about "metal, vsphere, ovirt, and openstack", we'd need to dig into those specific failures to see if anything RHCOS-related is involved.

Comment 5 Benjamin Gilbert 2021-03-23 18:57:06 UTC
> The fetch stage should run after `network-online.target`, but it isn't clear why the fetch stage can't get the config

The fetch stage does not run after network-online.target, because "the network is up" isn't well-defined.  For example, the system might have a routable IP address, but not on the NIC that can talk to the provisioning server.  Therefore, Ignition intentionally doesn't wait for networking, and just retries fetches until it can successfully connect to the server.
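
For anyone who wants to verify that ordering themselves, here is a rough sketch run from the initramfs emergency shell (where the fetch stage actually lives):

```
# Show the unit file, including its After=/Before= ordering
systemctl cat ignition-fetch.service

# List the units the fetch stage is ordered after; network-online.target
# should not appear here
systemctl list-dependencies --after ignition-fetch.service
```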

Comment 6 Jonathan Lebon 2021-03-23 19:05:34 UTC
FWIW, it does look like Ignition successfully fetches at least the user-data config and the config it points to:

[   11.116103] ignition[713]: INFO     : PUT http://169.254.169.254/latest/api/token: attempt #3
[   11.122601] ignition[713]: INFO     : PUT result: OK
[   11.127588] ignition[713]: DEBUG    : parsed url from cmdline: ""
[   11.132557] ignition[713]: INFO     : no config URL provided
[   11.137048] ignition[713]: INFO     : reading system config file "/usr/lib/ignition/user.ign"
[   11.143808] ignition[713]: INFO     : no config at "/usr/lib/ignition/user.ign"
[   11.149927] ignition[713]: INFO     : PUT http://169.254.169.254/latest/api/token: attempt #1
[   11.156917] NetworkManager[741]: <info>  [1616174105.3624] device (lo): carrier: link connected
[   11.163872] ignition[713]: INFO     : PUT result: OK
[   11.168097] ignition[713]: INFO     : GET http://169.254.169.254/2019-10-01/user-data: attempt #1
[   11.175014] ignition[713]: INFO     : GET result: OK
[   11.179128] ignition[713]: DEBUG    : parsing config with SHA512: 825b9f4c41179e090ec3beaea442aa50e10f1f534de2d18934909e3c4c752c78463ee93aa35159bbe46e4b45789587e88ff6137a496e8703e7f0625aadc35327
[   11.191730] ignition[713]: INFO     : Adding "root-ca" to list of CAs
[   11.196667] ignition[713]: INFO     : GET https://api-int.**************-c4a31.origin-ci-int-aws.dev.rhcloud.com:22623/config/worker: attempt #1
...
[   11.299216] ignition[713]: INFO     : GET result: OK

So it's actually not clear to me why Ignition is hanging. There don't seem to be any pending requests that it's retrying.
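
A quick way to double-check that from the gathered console log; a sketch that assumes the console artifact linked in comment 3 was saved locally under its instance ID:

```
# List every Ignition fetch attempt and its outcome; an "attempt #N" line
# with no later "result"/"error" line would show where it is stuck
grep -E 'ignition\[[0-9]+\].*(GET|PUT|result|error)' i-0f7db99fe5ffda4e4
```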

Comment 7 Michael Gugino 2021-03-23 20:27:48 UTC
I didn't realize console bits were available!

Looking at this log, DHCP fails: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1372727043803844608/artifacts/e2e-aws-upgrade/gather-aws-console/artifacts/i-0563ef58253b69588

[   55.012683] NetworkManager[738]: <warn>  [1616119652.9599] dhcp4 (ens3): request timed out

This one DHCP works: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2458/pull-ci-openshift-machine-config-operator-master-e2e-aws/1372953783738306560/artifacts/e2e-aws/gather-aws-console/artifacts/i-0f7db99fe5ffda4e4

[   11.531169] NetworkManager[741]: <info>  [1616174105.4450] dhcp4 (ens3): state changed unknown -> bound
[   11.538089] NetworkManager[741]: <info>  [1616174105.5092] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
[   11.547858] NetworkManager[741]: <info>  [1616174105.5094] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
[   11.691068] NetworkManager[741]: <info>  [1616174105.5094] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
[   11.701320] NetworkManager[741]: <info>  [1616174105.5096] manager: NetworkManager state is now CONNECTED_LOCAL
[   11.708913] NetworkManager[741]: <info>  [1616174105.5100] manager: NetworkManager state is now CONNECTED_SITE
[   11.716415] NetworkManager[741]: <info>  [1616174105.5101] policy: set 'Wired Connection' (ens3) as default for IPv4 routing and DNS
[   11.725014] NetworkManager[741]: <info>  [1616174105.5101] policy: set-hostname: set hostname to 'ip-10-0-180-96' (from DHCPv4)
[   11.733459] NetworkManager[741]: <info>  [1616174105.5105] device (ens3): Activation: successful, device activated.
[   11.740883] NetworkManager[741]: <info>  [1616174105.5106] manager: NetworkManager state is now CONNECTED_GLOBAL
[   12.063622] NetworkManager[741]: <info>  [1616174106.4564] device (ens3): carrier: link connected
[   12.070491] NetworkManager[741]: <info>  [1616174106.4566] device (ens3): DHCPv4 lease renewal requested
[   12.079106] NetworkManager[741]: <info>  [1616174106.4723] dhcp4 (ens3): canceled DHCP transaction
[   12.086134] NetworkManager[741]: <info>  [1616174106.4723] dhcp4 (ens3): state changed bound -> done
[   12.093412] NetworkManager[741]: <info>  [1616174106.4724] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
[   12.102402] NetworkManager[741]: <info>  [1616174106.4725] manager: startup complete
[   12.109164] NetworkManager[741]: <info>  [1616174106.4725] quitting now that startup is complete
[   12.117306] NetworkManager[741]: <info>  [1616174106.4729] dhcp4 (ens3): canceled DHCP transaction
[  OK  ] Started dracut initqueue hook.
[   12.125719] NetworkManager[741]: <info>  [1616174106.4729] dhcp4 (ens3): state changed unknown -> done
[   12.134442] NetworkManager[741]: <info>  [1616174106.4729] device (ens3): DHCPv4: trying to acquire a new lease within 90 seconds

But then it tries to acquire a new lease?

Comment 8 Micah Abbott 2021-03-26 21:19:42 UTC
(In reply to Benjamin Gilbert from comment #5)
> > The fetch stage should run after `network-online.target`, but it isn't clear why the fetch stage can't get the config
> 
> The fetch stage does not run after network-online.target, because "the
> network is up" isn't well-defined.  For example, the system might have a
> routable IP address, but not on the NIC that can talk to the provisioning
> server.  Therefore, Ignition intentionally doesn't wait for networking, and
> just retries fetches until it can successfully connect to the server.

Oops, it is `network.target`

https://github.com/coreos/ignition/blob/master/dracut/30ignition/ignition-fetch.service#L22

(In reply to Michael Gugino from comment #7)
> I didn't realize console bits were available!
> 
> Looking at this log, DHCP fails:
> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.8-upgrade-from-stable-4.7-e2e-aws-upgrade/1372727043803844608/artifacts/e2e-aws-upgrade/gather-aws-console/artifacts/i-0563ef58253b69588
> 
> [   55.012683] NetworkManager[738]: <warn>  [1616119652.9599] dhcp4 (ens3): request timed out

This one kind of looks like it is just the infra not responding to the DHCP request, but I am not a DHCP/NetworkManager expert.


> This one DHCP works:
> https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/2458/pull-ci-openshift-machine-config-operator-master-e2e-aws/1372953783738306560/artifacts/e2e-aws/gather-aws-console/artifacts/i-0f7db99fe5ffda4e4
> 
> [   11.531169] NetworkManager[741]: <info>  [1616174105.4450] dhcp4 (ens3): state changed unknown -> bound
> [   11.538089] NetworkManager[741]: <info>  [1616174105.5092] device (ens3): state change: ip-config -> ip-check (reason 'none', sys-iface-state: 'managed')
> [   11.547858] NetworkManager[741]: <info>  [1616174105.5094] device (ens3): state change: ip-check -> secondaries (reason 'none', sys-iface-state: 'managed')
> [   11.691068] NetworkManager[741]: <info>  [1616174105.5094] device (ens3): state change: secondaries -> activated (reason 'none', sys-iface-state: 'managed')
> [   11.701320] NetworkManager[741]: <info>  [1616174105.5096] manager: NetworkManager state is now CONNECTED_LOCAL
> [   11.708913] NetworkManager[741]: <info>  [1616174105.5100] manager: NetworkManager state is now CONNECTED_SITE
> [   11.716415] NetworkManager[741]: <info>  [1616174105.5101] policy: set 'Wired Connection' (ens3) as default for IPv4 routing and DNS
> [   11.725014] NetworkManager[741]: <info>  [1616174105.5101] policy: set-hostname: set hostname to 'ip-10-0-180-96' (from DHCPv4)
> [   11.733459] NetworkManager[741]: <info>  [1616174105.5105] device (ens3): Activation: successful, device activated.
> [   11.740883] NetworkManager[741]: <info>  [1616174105.5106] manager: NetworkManager state is now CONNECTED_GLOBAL
> [   12.063622] NetworkManager[741]: <info>  [1616174106.4564] device (ens3): carrier: link connected
> [   12.070491] NetworkManager[741]: <info>  [1616174106.4566] device (ens3): DHCPv4 lease renewal requested
> [   12.079106] NetworkManager[741]: <info>  [1616174106.4723] dhcp4 (ens3): canceled DHCP transaction
> [   12.086134] NetworkManager[741]: <info>  [1616174106.4723] dhcp4 (ens3): state changed bound -> done
> [   12.093412] NetworkManager[741]: <info>  [1616174106.4724] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
> [   12.102402] NetworkManager[741]: <info>  [1616174106.4725] manager: startup complete
> [   12.109164] NetworkManager[741]: <info>  [1616174106.4725] quitting now that startup is complete
> [   12.117306] NetworkManager[741]: <info>  [1616174106.4729] dhcp4 (ens3): canceled DHCP transaction
> [  OK  ] Started dracut initqueue hook.
> [   12.125719] NetworkManager[741]: <info>  [1616174106.4729] dhcp4 (ens3): state changed unknown -> done
> [   12.134442] NetworkManager[741]: <info>  [1616174106.4729] device (ens3): DHCPv4: trying to acquire a new lease within 90 seconds
> 
> But then it tries to acquire a new lease?


Not sure what to make of those logs, but perhaps our friend @Beniamino from the NetworkManager team can help us here.

I suspect that DEBUG logs from NetworkManager would be most useful, which will likely be difficult (impossible?) to retrieve from our automated CI jobs.
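
If we do get to instrument a job, bumping NetworkManager verbosity could look roughly like the sketch below; the drop-in path is illustrative, and the NetworkManager copy running in the initramfs (where the DHCP timeout happened) would instead need a kernel argument or an image change:

```
# At runtime, on a node that is still reachable:
nmcli general logging level TRACE domains ALL

# Or persistently for the real root, via a drop-in delivered through
# Ignition / MachineConfig (illustrative file name):
cat <<'EOF' > /etc/NetworkManager/conf.d/99-trace-logging.conf
[logging]
level=TRACE
domains=ALL
EOF
```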

Comment 9 Beniamino Galvani 2021-03-29 15:16:23 UTC
Hi, this:

[   12.093412] NetworkManager[741]: <info>  [1616174106.4724] dhcp4 (ens3): activation: beginning transaction (timeout in 45 seconds)
[   12.102402] NetworkManager[741]: <info>  [1616174106.4725] manager: startup complete
[   12.109164] NetworkManager[741]: <info>  [1616174106.4725] quitting now that startup is complete
[   12.117306] NetworkManager[741]: <info>  [1616174106.4729] dhcp4 (ens3): canceled DHCP transaction
[   12.134442] NetworkManager[741]: <info>  [1616174106.4729] device (ens3): DHCPv4: trying to acquire a new lease within 90 seconds

is a clean termination. The last message about new lease acquisition in 90 seconds is misleading, as NM is shutting down and stops touching the device. I will prepare a patch to remove the message.

In the other log DHCP succeeds and NM also terminates cleanly.

Comment 10 Beniamino Galvani 2021-04-13 06:49:04 UTC
Fix for the wrong "trying to acquire a new lease" log message:

https://gitlab.freedesktop.org/NetworkManager/NetworkManager/-/merge_requests/802

Comment 11 Micah Abbott 2021-04-30 21:02:15 UTC
We still don't have enough data to determine what is causing these intermittent failures; we'd likely need to instrument the CI jobs to collect debug/trace messages from NetworkManager if we think it could be a networking issue.

Comment 12 Micah Abbott 2021-10-15 13:40:21 UTC
Looking at the latest search for failures attributed to "Managed cluster should have same number of Machines and Nodes", the failures appear to be affecting Azure jobs more frequently than others.

https://search.ci.openshift.org/?search=Managed+cluster+should+have+same+number+of+Machines+and+Nodes&maxAge=168h&context=1&type=bug%2Bjunit&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The couple of Azure-specific jobs that I skimmed didn't show any evidence of the nodes themselves not coming up; they got through the Ignition stage and started cri-o/kubelet successfully.


I'm not denying there is a problem here, but it doesn't appear to have a clear root cause that we can act on.

Barring any conclusive data, I feel it is better to mark this one as CLOSED/DEFERRED until we have something more actionable to work on.

