Bug 1693865 - CI Operator does not expose pods deleted due to evictions well
Summary: CI Operator does not expose pods deleted due to evictions well
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Test Infrastructure
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Petr Muller
QA Contact: libra bugs
URL:
Whiteboard:
Duplicates: 1707458
Depends On:
Blocks:
 
Reported: 2019-03-28 20:08 UTC by Aravindh Puthiyaparambil
Modified: 2020-10-27 16:03 UTC
CC: 15 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:27:58 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/ci-operator issue 228 (closed): Failures to delete already-deleted artifacts containers (last updated 2021-02-03 23:15:52 UTC)
Github openshift/oc pull 623 (closed): Bug 1891555: Create Windows version information (last updated 2021-02-03 23:15:52 UTC)
Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:28:10 UTC)

Description Aravindh Puthiyaparambil 2019-03-28 20:08:24 UTC
Description of problem:
PR e2es and image jobs are failing with:

2019/03/28 19:16:49 Create release image registry.svc.ci.openshift.org/ci-op-g0f7xrtp/release:latest
2019/03/28 19:17:51 error: unable to signal to artifacts container to terminate in pod release-latest, triggering deletion: could not run remote command: unable to upgrade connection: container not found ("artifacts")
2019/03/28 19:17:51 error: unable to retrieve artifacts from pod release-latest: could not read gzipped artifacts: unable to upgrade connection: container not found ("artifacts")
2019/03/28 19:17:57 Ran for 5m37s
error: could not run steps: release "release-latest" failed: pod release-latest was already deleted

Example: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_console/1346/pull-ci-openshift-console-master-e2e-aws/396/#operator-create-the-release-image-containing-all-images-built-by-this-job

Comment 1 Ben Parees 2019-03-29 21:27:17 UTC
I think this is our most common CI/infra failure right now.

Comment 2 Steve Kuznetsov 2019-04-03 15:18:41 UTC
This was root-caused to a slew of evictions on nodes that could not remove unused image layers during image garbage collection. We upgraded the kubelets in our node pool and verified that the new kubelet correctly removed the stubborn layers, so this should no longer be an issue.

Comment 3 Antonio Murdaca 2019-04-17 09:05:47 UTC
Looks like this is happening again? pod release-latest was already deleted

Comment 6 Steve Kuznetsov 2019-04-30 16:02:27 UTC
While the original issue at hand was evictions from the kubelet, this JIRA issue tracks the work to handle the messaging in these cases better:
https://jira.coreos.com/browse/DPTP-366

Comment 7 Steve Kuznetsov 2019-05-01 16:37:59 UTC
Lowering priority: this bug is now scoped to the poor messaging, and the underlying high-priority issue is fixed.

Comment 8 Dan Mace 2019-05-01 18:53:37 UTC
Lots of these in the past 24hrs:

https://search.svc.ci.openshift.org/?search=pod+release-latest+was+already+deleted&maxAge=24h&context=2&type=all
Found 37 results

Comment 9 Steve Kuznetsov 2019-05-02 16:06:56 UTC
Yes, there is some underlying issue with nodes evicting pods due to ephemeral storage pressure. This is just a symptom; I will open a separate BZ for that issue.

Comment 10 Steve Kuznetsov 2019-05-08 15:24:55 UTC
*** Bug 1707458 has been marked as a duplicate of this bug. ***

Comment 12 Petr Muller 2019-05-17 15:50:32 UTC
This BZ actually covers multiple different situations, and the problem stems
from the fact that ci-operator's PodStep does not survive when something
deletes the Pod on which ci-operator is waiting: it simply bails out [1].

Of course, there may be multiple reasons why the deletions happen: evictions
(see above) or manual interventions (unlikely in production), but I believe
the reason we see these in production is that in some cases one ci-operator
instance deletes Pods on which other ci-operator instances are waiting.

I managed to reproduce a single race in ci-operator, which I believe is the
main culprit of the failures in production. When multiple ci-operator
instances want to execute the same pod (as is the case with these
release-latest pods), just a single pod is created and then all ci-operator
instances wait for it to finish. All ci-operators then fetch the artifacts
from the pod using the `artifacts` container in it. The problem is that we
allow the `artifacts` container to terminate 5 seconds [2,3] after a
ci-operator instance finishes its artifact grab, so the remaining instances
have these 5 seconds to finish their job. In some cases this is not enough,
and they attempt to interact with the terminated container, resulting in an
error. This causes ci-operator to delete the whole pod, which makes the other
ci-operator instances hit the error even sooner.

There may be other cases where ci-operator deletes its pods (one of them is
the interruption handler, but that should not happen often in production, or
at least not in a single namespace).

So the options we have are:

1. Prolong the 5s deadline. This is not an actual fix: the race will still
exist, but it should make the problem manifest less often.
2. Make the `artifacts` container aware that there are multiple ci-operators
needing it. Currently the container terminates once a `/tmp/done` file is
removed from it; we would need to extend this "signal" to something like a
barrier (a rough sketch of this idea follows the reference links below).
3. Make ci-operator more robust and retry the whole pod step in this
situation. This has the advantage of also handling other deletion causes,
like evictions.

[1] https://github.com/openshift/ci-operator/blob/5d370fdee1072b27704423e0245f2c332af77cec/pkg/steps/template.go#L508
[2] https://github.com/openshift/ci-operator/blob/master/pkg/steps/artifacts.go#L393
[3] https://github.com/openshift/ci-operator/blob/master/pkg/steps/artifacts.go#L322
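To make the race concrete, here is a rough Go sketch (not the actual ci-operator code) of the termination signal described above, plus the barrier variant from option 2. The marker file names, the consumer count, and the 5-second interval are assumptions taken from this comment, not from the real implementation.

// Sketch only: illustrates the "done file" termination signal and a
// possible barrier extension; file names and counts are hypothetical.
package main

import (
	"fmt"
	"os"
	"time"
)

// waitForSingleDone mimics today's behaviour as described above: the
// artifacts container exits as soon as the shared /tmp/done marker has been
// removed by any ci-operator instance, so the remaining instances race
// against its shutdown.
func waitForSingleDone() {
	for {
		if _, err := os.Stat("/tmp/done"); os.IsNotExist(err) {
			return // first consumer finished; everyone else has seconds left
		}
		time.Sleep(5 * time.Second)
	}
}

// waitForAllConsumers sketches the barrier idea (option 2): each ci-operator
// instance drops its own /tmp/done-<id> marker once it has fetched its
// artifacts, and the container only terminates after all expected instances
// have checked in.
func waitForAllConsumers(expected int) {
	for {
		done := 0
		for i := 0; i < expected; i++ {
			if _, err := os.Stat(fmt.Sprintf("/tmp/done-%d", i)); err == nil {
				done++
			}
		}
		if done >= expected {
			return
		}
		time.Sleep(5 * time.Second)
	}
}

func main() {
	// e.g. three ci-operator instances sharing the release-latest pod
	waitForAllConsumers(3)
}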

Comment 13 Petr Muller 2019-05-17 16:26:34 UTC
I misread a piece of the code: the 5s I mentioned is not a delay before termination but the delay between checks, so the actual window for other ci-operator instances is even shorter.

Comment 14 Petr Muller 2019-05-22 12:00:07 UTC
https://github.com/openshift/ci-operator/pull/348 is a band-aid that should reduce occurrences of the race to a minimum. We saw the last occurrence before the merge (May 20 2019 23:15:55 CEST) and none since, so the band-aid helped (we used to have 5-10 hits each day).
https://github.com/openshift/ci-operator/pull/351 improved the message for these situations (not just the race, but any event where a pod is deleted out from under ci-operator). Instead of:

error: could not run steps: release "release-latest" failed: pod release-latest was already deleted

this would be the output:

error: could not wait for pod 'release-latest': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)
error: pod was deleted while ci-operator step was waiting for it

For "proper" architectural fixes, two JIRAs were created:

- https://jira.coreos.com/browse/DPTP-411 Fix the race where artifacts pod may terminate while other ci-operators still need it
- https://jira.coreos.com/browse/DPTP-412 Make ci-operator PodStep more resilient to pod deletions 

We are monitoring the hits in Sentry and will prioritize these JIRAs accordingly.
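
For illustration, here is a minimal client-go sketch of the DPTP-412 direction, not ci-operator's actual PodStep: a wait loop that distinguishes "the pod was deleted out from under us" from other failures so it can surface the clearer message above, or let a caller retry the step. The function and error names are hypothetical, and a recent client-go (context-aware Get) is assumed.

// Sketch only: a pod wait loop that treats deletion as a distinct,
// retryable failure mode. Names here are hypothetical.
package steps

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	kerrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// errPodDeleted marks the specific failure mode this bug is about.
var errPodDeleted = fmt.Errorf("pod was deleted while the step was waiting for it")

func waitForPod(ctx context.Context, client kubernetes.Interface, ns, name string) error {
	for {
		pod, err := client.CoreV1().Pods(ns).Get(ctx, name, metav1.GetOptions{})
		if kerrors.IsNotFound(err) {
			// The pod was evicted or deleted by another ci-operator instance.
			// A caller implementing option 3 could detect this with
			// errors.Is(err, errPodDeleted) and re-create the pod instead of failing.
			return fmt.Errorf("could not wait for pod %q: it is no longer present on the cluster "+
				"(usually a result of a race or resource pressure; re-running the job should help): %w",
				name, errPodDeleted)
		}
		if err != nil {
			return err
		}
		switch pod.Status.Phase {
		case corev1.PodSucceeded, corev1.PodFailed:
			return nil
		}
		time.Sleep(5 * time.Second)
	}
}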


Because the issue should no longer happen often enough to hamper CI service efficiency, and the message is no longer confusing, I am moving this to ON_QA.

Comment 16 errata-xmlrpc 2019-10-16 06:27:58 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

