Description of problem: PR e2e and image jobs are failing with:

2019/03/28 19:16:49 Create release image registry.svc.ci.openshift.org/ci-op-g0f7xrtp/release:latest
2019/03/28 19:17:51 error: unable to signal to artifacts container to terminate in pod release-latest, triggering deletion: could not run remote command: unable to upgrade connection: container not found ("artifacts")
2019/03/28 19:17:51 error: unable to retrieve artifacts from pod release-latest: could not read gzipped artifacts: unable to upgrade connection: container not found ("artifacts")
2019/03/28 19:17:57 Ran for 5m37s
error: could not run steps: release "release-latest" failed: pod release-latest was already deleted

Example: https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_console/1346/pull-ci-openshift-console-master-e2e-aws/396/#operator-create-the-release-image-containing-all-images-built-by-this-job
I think this is our most common CI/infra failure right now.
This was root-caused to a slew of evictions on nodes that could not remove unused image layers during image garbage collection. We upgraded the kubelets in our node pool and verified that the new kubelet correctly removes the stubborn layers, so this should no longer be an issue.
Looks like this is happening again:

pod release-latest was already deleted
Seeing this cause 2% of failures today. Some recent PRs where this is happening:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22643/pull-ci-openshift-origin-master-e2e-aws-serial/5298/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22643/pull-ci-openshift-origin-master-e2e-aws-image-registry/431/
This is still happening in 42/2621 runs over the last 48h:

https://search.svc.ci.openshift.org/?search=was+already+deleted&maxAge=48h&context=2&type=all
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22706/pull-ci-openshift-origin-master-e2e-aws-serial/5466/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22707/pull-ci-openshift-origin-master-images/5602/
While the original issue was evictions from the kubelet, this JIRA issue tracks the work to improve the messaging in these cases: https://jira.coreos.com/browse/DPTP-366
Lowering priority: this bug is now scoped to the poor messaging, and the underlying issue that was high priority is fixed.
Lots of these in the past 24 hours (37 results): https://search.svc.ci.openshift.org/?search=pod+release-latest+was+already+deleted&maxAge=24h&context=2&type=all
Yes, there is some underlying issue with nodes evicting pods based on ephemeral-storage pressure. This is just a symptom; I will open a separate BZ for that issue.
*** Bug 1707458 has been marked as a duplicate of this bug. ***
I was acting build cop today and saw this issue roughly 17 times in the past 24 hours. Some recent PRs include:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/160/pull-ci-openshift-cluster-network-operator-master-e2e-aws-upgrade/84/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/160/pull-ci-openshift-cluster-network-operator-master-images/825/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/746/pull-ci-openshift-machine-config-operator-master-images/2422/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/746/pull-ci-openshift-machine-config-operator-master-e2e-aws/3619/
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/openshift_machine-config-operator/746/pull-ci-openshift-machine-config-operator-master-e2e-aws-upgrade/369/
This BZ actually covers multiple different situations. The problem stems from the fact that ci-operator's PodStep does not survive when something deletes the Pod on which ci-operator is waiting; it simply bails out [1]. There may be multiple reasons why such deletions happen: evictions (see above), manual interventions (unlikely in production), but I believe the reason we see these in production is that in some cases one ci-operator instance deletes Pods on which other ci-operator instances are waiting.

I managed to reproduce a single race in ci-operator, which I believe is the main culprit of the failures in production. When multiple ci-operator instances want to execute the same pod (as is the case with these release-latest pods), just a single one is created and then all ci-operator instances wait for it to finish. All instances then fetch the artifacts from the pod using the `artifacts` container in it. The problem is that we allow the `artifacts` container to terminate 5 seconds [2,3] after a ci-operator instance finishes its artifact grab, so the remaining instances have only these 5 seconds to finish their job. In some cases this is not enough, and they attempt to interact with the terminated container, resulting in an error. This causes ci-operator to delete the whole pod, which makes the other ci-operator instances hit an error even sooner.

There may be other cases where ci-operator deletes its pods (one of them is the interruption handler, but that should not happen often in production, or at least not in a single namespace).

So the options we have:

1. Prolong the 5s deadline. This is not an actual fix: the race will still exist, but it should make the problem manifest less often.
2. Make the `artifacts` container aware that there are multiple ci-operators needing it. Currently the container terminates once a `/tmp/done` file is removed from it; we would need to extend this "signal" to something like a barrier.
3. Make ci-operator more robust and retry the whole pod step in this situation. This has the advantage of also handling other deletion causes, like evictions.

[1] https://github.com/openshift/ci-operator/blob/5d370fdee1072b27704423e0245f2c332af77cec/pkg/steps/template.go#L508
[2] https://github.com/openshift/ci-operator/blob/master/pkg/steps/artifacts.go#L393
[3] https://github.com/openshift/ci-operator/blob/master/pkg/steps/artifacts.go#L322
I misread a piece of the code: the 5s I mentioned is not a delay before termination but the interval between checks, so the actual window for other ci-operator instances is even shorter.
https://github.com/openshift/ci-operator/pull/348 is a band-aid that should reduce occurrences of the race to a minimum. We saw the last occurrence before the merge (May 20 2019 23:15:55 CEST) and none since, so the band-aid helped (we used to have 5-10 hits each day).

https://github.com/openshift/ci-operator/pull/351 improved the message for these situations (not just the race, but any event where a pod is deleted under ci-operator's hand). Instead of:

error: could not run steps: release "release-latest" failed: pod release-latest was already deleted

the output will be:

error: could not wait for pod 'release-latest': it is no longer present on the cluster (usually a result of a race or resource pressure. re-running the job should help)
error: pod was deleted while ci-operator step was waiting for it

For "proper" architectural fixes, two JIRAs were created:

- https://jira.coreos.com/browse/DPTP-411 Fix the race where the artifacts pod may terminate while other ci-operators still need it
- https://jira.coreos.com/browse/DPTP-412 Make ci-operator PodStep more resilient to pod deletions

We are monitoring the hits in Sentry and will prioritize these JIRAs accordingly. Because the issue should no longer happen often enough to hamper CI service efficiency and the message is no longer confusing, moving to ON_QA.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:2922