Description of problem:
Pods transition from Failed to Succeeded.

Version-Release number of selected component (if applicable):
Seen in https://github.com/openshift/origin/pull/22387 (but I recall earlier cases as well).

How reproducible:
Flaky.

Actual results:
https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22387/pull-ci-openshift-origin-master-e2e-aws/6085

diff:
object.ObjectMeta.ResourceVersion:
  a: "87909"
  b: "87951"
object.Status.Phase:
  a: "Failed"
  b: "Succeeded"
object.Status.Conditions[0].Reason:
  a: ""
  b: "PodCompleted"
object.Status.Conditions[1].Reason:
  a: "ContainersNotReady"
  b: "PodCompleted"
object.Status.Conditions[1].Message:
  a: "containers with unready status: [deployment]"
  b: ""
object.Status.Conditions[2].Reason:
  a: "ContainersNotReady"
  b: "PodCompleted"
object.Status.Conditions[2].Message:
  a: "containers with unready status: [deployment]"
  b: ""

Expected results:
Stay Failed.

Additional info:
Clayton saw it here yesterday: https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/109/#openshift-tests-featuredeploymentconfig-deploymentconfigs-keep-the-deployer-pod-invariant-valid-conformance-should-deal-with-config-change-in-case-the-deployment-is-still-running-suiteopenshiftconformanceparallelminimal

diff:
object.ObjectMeta.ResourceVersion:
  a: "77980"
  b: "78041"
object.Status.Phase:
  a: "Failed"
  b: "Succeeded"
object.Status.Conditions[0].Reason:
  a: ""
  b: "PodCompleted"
object.Status.Conditions[1].Reason:
  a: "ContainersNotReady"
  b: "PodCompleted"
object.Status.Conditions[1].Message:
  a: "containers with unready status: [deployment]"
  b: ""
object.Status.Conditions[2].Reason:
  a: "ContainersNotReady"
  b: "PodCompleted"
object.Status.Conditions[2].Message:
  a: "containers with unready status: [deployment]"
  b: ""
I do not agree. I want to know why it's happening before it gets kicked out. I see no evidence that this isn't failing because of a serious bug in the platform. Show me this is a trivial flake, and I'll be OK with bumping it out. But if this is a bug in how the kubelet works that violates the guarantee stateful applications rely on, then it is most definitely not acceptable to defer it.
Ryan, I hate assigning this to you, since we tried to figure this out for weeks about a year ago and got nowhere. But please take a look if you can; it is currently a blocker. I'm not sure how we could prove it is not "a serious bug in the platform".
This is an almost two-year-old bug upstream, and we have released countless OpenShift releases during that period. There is no data on how to reproduce the error, which is why the issue has stagnated upstream. Claiming this issue is a blocker now is disingenuous to the history of the bug.
Adding buildcop to the whiteboard, as "[Feature:DeploymentConfig] deploymentconfigs when run iteratively [Conformance] should only deploy the last deployment" failures are showing up near the top of the `deck-build-log-hot` list.
https://github.com/kubernetes/kubernetes/issues/58711#issuecomment-491615546

It looks like what happens is:

1. API deletion is observed.
2. Some containers begin to shut down due to syncPod, exiting with code != 0.
3. The dispatchWork loop invokes podIsTerminated, which returns true because all containers have exited.
4. TerminatePod is then called, which clears container status (this is the bug).
5. The next status manager sync sees that the container status holds the default values (exit code == 0) and is terminated, and tells the API that the pod succeeded.

The fix is to properly merge the termination state of a container in TerminatePod.

The impact of this bug is that anyone using pods as primitives in a higher-order chain who needs to interrupt the action of a pod might see the pod as successful instead of failed if they race against TerminatePod. This makes the pod phase unreliable, which could mean a higher-level orchestrator (in this case the OpenShift deployment config, but also jobs, cron jobs, pods as a directed-graph entry in tools like Argo, or a Knative build) could mistake a failure for a success.
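To make step 5 concrete, here is a minimal Go sketch of why clearing container status flips the computed phase. The types and function are simplified stand-ins, not the actual kubelet code:

package main

import "fmt"

// containerStatus is a stripped-down stand-in for the kubelet's container
// status record; only the exit code matters for this illustration.
type containerStatus struct {
	Name     string
	ExitCode int32
}

// phaseFor mirrors the "all containers have terminated" branch of pod phase
// computation: any non-zero exit code means Failed, otherwise Succeeded.
func phaseFor(statuses []containerStatus) string {
	for _, s := range statuses {
		if s.ExitCode != 0 {
			return "Failed"
		}
	}
	return "Succeeded"
}

func main() {
	// What the kubelet actually observed: the deployer container was
	// interrupted and exited non-zero.
	observed := []containerStatus{{Name: "deployment", ExitCode: 137}}
	fmt.Println(phaseFor(observed)) // Failed

	// What the status manager sees after TerminatePod clears the statuses:
	// a default-valued record whose exit code is 0.
	cleared := []containerStatus{{Name: "deployment"}}
	fmt.Println(phaseFor(cleared)) // Succeeded -- the Failed -> Succeeded flip
}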
The most common place this will happen is during drain (drain deletes pods). So this could impact an upgrade of the control plane:

1. The KAO starts rolling out a new version N+1 to 3 master nodes via installer pods.
2. A drain is invoked on one of the master nodes.
3. The installer pod is terminated before it has completed (we need to verify the installer pod exits with exit code != 0 when SIGTERM is sent before it has completed).
4. Instead of reporting failure, the kubelet reports success.
5. The KAO thinks all three masters are at version N+1, but one of them is at N.

Therefore the urgent priority is justified, because this could result in a required config change failing in production systems. If we do not fix it in 4.1.0, we will need to fix it ASAP and get all 4.1.0 users upgraded to a release with the fix ASAP (a control plane can degrade).
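To illustrate step 5: any rollout logic that treats a Succeeded installer pod as proof that a node is at the new revision inherits the bad phase. The helper below is hypothetical, not actual operator code; it only shows where the trust in pod phase sits:

package rollout

import corev1 "k8s.io/api/core/v1"

// installerSucceeded stands in for the check a revision rollout makes before
// recording a node as updated. If the kubelet misreports a Failed installer
// pod as Succeeded, the node is recorded at revision N+1 even though the
// installer was interrupted and the node is really still at N.
func installerSucceeded(installerPod *corev1.Pod) bool {
	return installerPod.Status.Phase == corev1.PodSucceeded
}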
Clayton is trying to verify the fix that removes TerminatePod:
https://github.com/openshift/origin/commit/4228b5460dbb34e78d01b5bb7729308fc1f6a6b7
https://github.com/kubernetes/kubernetes/issues/58711#issuecomment-492082500

We also have a potential tight reproducer. It is a race, so no reproducer will hit it every time, but we can increase the likelihood. We need this to verify (within reason) any proposed fix.
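The referenced reproducer isn't shown here, but a race-based reproducer for this would generally have the following shape (the image, namespace, timings, and names below are assumptions for illustration, not the actual reproducer): create a pod whose container exits non-zero on SIGTERM, delete it while it is running, and flag any observation of phase Succeeded.

package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()
	pods := client.CoreV1().Pods("default")

	for i := 0; ; i++ {
		name := fmt.Sprintf("phase-race-%d", i)
		pod := &corev1.Pod{
			ObjectMeta: metav1.ObjectMeta{Name: name},
			Spec: corev1.PodSpec{
				RestartPolicy: corev1.RestartPolicyNever,
				Containers: []corev1.Container{{
					Name:  "c",
					Image: "busybox",
					// Exit non-zero when the deletion's SIGTERM arrives.
					Command: []string{"sh", "-c", "trap 'exit 1' TERM; sleep 3600 & wait"},
				}},
			},
		}
		if _, err := pods.Create(ctx, pod, metav1.CreateOptions{}); err != nil {
			panic(err)
		}
		time.Sleep(10 * time.Second) // give the container time to start

		grace := int64(5)
		_ = pods.Delete(ctx, name, metav1.DeleteOptions{GracePeriodSeconds: &grace})

		// Watch the phase until the pod object disappears; Failed is expected,
		// Succeeded means the TerminatePod race was hit.
		for {
			p, err := pods.Get(ctx, name, metav1.GetOptions{})
			if err != nil {
				break // pod is gone; move on to the next iteration
			}
			if p.Status.Phase == corev1.PodSucceeded {
				fmt.Printf("BUG: %s reported Succeeded after a non-zero exit\n", name)
			}
			time.Sleep(200 * time.Millisecond)
		}
	}
}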
Because this intersects with drain in 4.1, the potential for catastrophic data loss on a customer system (a deployment config that requires a hook pod to execute correctly before rolling out a new version of a line-of-business system), or for failure to correctly update the control plane (the Kube apiserver rolled out to 2 of 3 nodes while the operator thinks it's rolled out to all 3 and thus allows a potentially dangerous upgrade to go ahead), requires that we fix this with all urgency. It needs soak time, so if we can't line up a fix today or early tomorrow, we may want to make this a 4.1.1 required upgrade.
Upstream PR: https://github.com/kubernetes/kubernetes/pull/77870
Origin PR: https://github.com/openshift/origin/pull/22834
The master Origin PR has merged. 4.1 pick: https://github.com/openshift/origin/pull/22840
Nice work tracking this down and fixing it. We need to follow up with admission that prevents such changes.
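A rough sketch of what such a check could enforce, purely as an assumption (this is not the actual apiserver validation code): reject any status update that moves a pod out of a terminal phase.

package validation

import corev1 "k8s.io/api/core/v1"

// leavesTerminalPhase reports whether a status update tries to move a pod out
// of a terminal phase (Failed or Succeeded). Admission/validation could reject
// such updates so a Failed pod can never be rewritten as Succeeded.
func leavesTerminalPhase(oldPod, newPod *corev1.Pod) bool {
	old := oldPod.Status.Phase
	terminal := old == corev1.PodFailed || old == corev1.PodSucceeded
	return terminal && newPod.Status.Phase != old
}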
Pending promotion jobs are https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/1342 and https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.2/259 . Looks like those are the only blockers:

$ curl -s https://prow.svc.ci.openshift.org/data.js | jq -r '.[].job | select(. | contains("machine-os-content"))' | sort | uniq -c
     90 release-promote-openshift-machine-os-content-e2e-aws-4.1
     90 release-promote-openshift-machine-os-content-e2e-aws-4.2
Here's the image they're testing:

$ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Name:        registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Digest:      sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62
Media Type:  application/vnd.docker.distribution.manifest.v2+json
Created:     40m ago
Image Size:  623.9MB
OS:          linux
Arch:        amd64
Entrypoint:  /noentry
Labels:      com.coreos.ostree-commit=f44f4f27cb99175f2fe998dac5cbab0f8178f18db8a18a04f9ff88022558a9fb
             version=410.8.20190516.0

which you can see from the build logs too:

$ curl -s 'https://deck-ci.svc.ci.openshift.org/log?job=release-promote-openshift-machine-os-content-e2e-aws-4.1&id=1342' | head -n1
Will promote sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62, current is sha256:7b712447669fc18e8a343ba5e1c271c6e298e49d5cc391413087e0d608980d73
Promotion jobs passed, and here we have it:

$ oc image info registry.svc.ci.openshift.org/ocp/4.1:machine-os-content | head -n2
Name:    registry.svc.ci.openshift.org/ocp/4.1:machine-os-content
Digest:  sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62

$ oc image info registry.svc.ci.openshift.org/ocp/4.2:machine-os-content | head -n2
Name:    registry.svc.ci.openshift.org/ocp/4.2:machine-os-content
Digest:  sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62
*** Bug 1705708 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758