Bug 1694087

Summary: [reliability] When a RestartNever pod is deleted, a failed container can be reported as having succeeded
Product: OpenShift Container Platform
Reporter: Tomáš Nožička <tnozicka>
Component: Node
Assignee: Ryan Phillips <rphillips>
Status: CLOSED ERRATA
QA Contact: Sunil Choudhary <schoudha>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 4.1.0
CC: aos-bugs, calfonso, ccoleman, gblomqui, jokerman, mmccomas, sjenning, wking, wsun
Target Milestone: ---
Keywords: BetaBlocker
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-06-04 10:46:34 UTC
Type: Bug

Description Tomáš Nožička 2019-03-29 13:41:01 UTC
Description of problem:

Pods transition from Failed to Succeeded.


Version-Release number of selected component (if applicable):

seen in https://github.com/openshift/origin/pull/22387
(but I recall earlier cases as well)


How reproducible:

flaky


Actual results:

https://openshift-gce-devel.appspot.com/build/origin-ci-test/pr-logs/pull/22387/pull-ci-openshift-origin-master-e2e-aws/6085

diff: 
object.ObjectMeta.ResourceVersion:
  a: "87909"
  b: "87951"
object.Status.Phase:
  a: "Failed"
  b: "Succeeded"
object.Status.Conditions[0].Reason:
  a: ""
  b: "PodCompleted"
object.Status.Conditions[1].Reason:
  a: "ContainersNotReady"
  b: "PodCompleted"
object.Status.Conditions[1].Message:
  a: "containers with unready status: [deployment]"
  b: ""
object.Status.Conditions[2].Reason:
  a: "ContainersNotReady"
  b: "PodCompleted"
object.Status.Conditions[2].Message:
  a: "containers with unready status: [deployment]"
  b: ""


Expected results:

The pod should stay Failed.

Additional info:

Comment 3 Tomáš Nožička 2019-05-02 07:23:09 UTC
Clayton saw it here yesterday

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/109/#openshift-tests-featuredeploymentconfig-deploymentconfigs-keep-the-deployer-pod-invariant-valid-conformance-should-deal-with-config-change-in-case-the-deployment-is-still-running-suiteopenshiftconformanceparallelminimal

diff: 
object.ObjectMeta.ResourceVersion:
  a: "77980"
  b: "78041"
object.Status.Phase:
  a: "Failed"
  b: "Succeeded"
object.Status.Conditions[0].Reason:
  a: ""
  b: "PodCompleted"
object.Status.Conditions[1].Reason:
  a: "ContainersNotReady"
  b: "PodCompleted"
object.Status.Conditions[1].Message:
  a: "containers with unready status: [deployment]"
  b: ""
object.Status.Conditions[2].Reason:
  a: "ContainersNotReady"
  b: "PodCompleted"
object.Status.Conditions[2].Message:
  a: "containers with unready status: [deployment]"
  b: ""

Comment 5 Clayton Coleman 2019-05-09 14:50:07 UTC
I do not agree.  I want to know why it’s happening before it gets kicked out.  I see no evidence that this isn’t failing because of a serious bug in the platform.  Show me this is a trivial flake, and I’ll be ok bumping it out.  But if this is a bug in how Kubelet works that violates the guarantee stateful applications rely on then it is most definitely not acceptable to defer it.

Comment 6 Seth Jennings 2019-05-09 16:56:51 UTC
Ryan, I hate assigning this to you, since we tried to figure this out for weeks about a year ago and got nowhere. But if you could take a look, it is currently a blocker. I'm not sure how we could prove it is not "a serious bug in the platform".

Comment 7 Ryan Phillips 2019-05-10 13:46:57 UTC
This is almost a two-year-old bug upstream, and we have shipped countless OpenShift releases during that time. There is no data on how to reproduce the error, which is why the issue has stagnated upstream. Claiming this issue is a blocker now is disingenuous to the history of the bug.

Comment 8 Gabe Montero 2019-05-10 13:48:53 UTC
Adding buildcop to the whiteboard, as "[Feature:DeploymentConfig] deploymentconfigs when run iteratively [Conformance] should only deploy the last deployment" failures
are showing up near the top of the `deck-build-log-hot` list.

Comment 9 Clayton Coleman 2019-05-12 18:06:05 UTC
https://github.com/kubernetes/kubernetes/issues/58711#issuecomment-491615546

It looks like what happens is:

1. The API deletion is observed.
2. Some containers begin to shut down due to syncPod, exiting with code != 0.
3. The dispatchWork loop invokes podIsTerminated, which returns true because all containers have exited.
4. TerminatePod is then called, which clears container status (this is the bug).
5. The next status manager sync sees that the container status holds the default values (exit code == 0) and is terminated, and tells the API server that the pod succeeded.

The fix is to properly merge termination state of a container in TerminatePod.

The impact of this bug is that anyone using pods as primitives in a higher-order chain who needs to interrupt the action of a pod might see the pod reported as successful instead of failed if they race against TerminatePod.  This makes the pod phase unreliable, which means a higher-level orchestrator (in this case the OpenShift deployment config, but also jobs, cron jobs, pods as a directed-graph entry in tools like Argo, or a Knative build) could mistake a failure for a success.
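
For illustration, here is a minimal Go sketch of the merge behavior described in steps 4-5.  It is not the kubelet code or the upstream patch; the helper name terminateContainerStatuses and the simplified flow are assumptions.  The point is that a terminated state already reported by the runtime must be preserved rather than overwritten with an empty (exit code 0) terminated state:

// Illustrative sketch only, not the upstream patch: when marking a pod's
// containers terminated, keep any terminated state the runtime already
// reported so a non-zero exit code is not replaced by an empty (exit code 0)
// terminated state.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

// terminateContainerStatuses is a hypothetical helper mirroring what
// TerminatePod should do: force every container into a terminated state,
// but merge rather than clear statuses that are already terminated.
func terminateContainerStatuses(statuses []v1.ContainerStatus) {
	for i := range statuses {
		if statuses[i].State.Terminated != nil {
			// Already carries the real exit code and reason; leave it alone.
			continue
		}
		statuses[i].State = v1.ContainerState{
			Terminated: &v1.ContainerStateTerminated{},
		}
	}
}

func main() {
	statuses := []v1.ContainerStatus{{
		Name: "deployment",
		State: v1.ContainerState{
			Terminated: &v1.ContainerStateTerminated{ExitCode: 1, Reason: "Error"},
		},
	}}
	terminateContainerStatuses(statuses)
	// Prints 1: the failure survives, so the status manager would not
	// report the pod as Succeeded.
	fmt.Println(statuses[0].State.Terminated.ExitCode)
}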

Comment 10 Clayton Coleman 2019-05-12 19:08:07 UTC
The most common place this will happen is during drain (drain deletes pods).  So this could impact an upgrade of the control plane:

1. KAO starts rolling out a new version N+1 to 3 master nodes via installer pods
2. A drain is invoked on one of the master nodes
3. The installer pod is terminated before it has completed (we need to verify that the installer pod exits with exit code != 0 when SIGTERM is sent before it has completed; see the sketch at the end of this comment)
4. Instead of reporting failure, the Kubelet reports success
5. The KAO thinks all three masters are at version N+1, but one of them is at N

Therefore, the urgent priority is justified, because this could result in a required config change failing in production systems. If we do not fix it in 4.1.0, we will need to fix it ASAP and get all 4.1.0 users upgraded to a release with the fix ASAP (a control plane can degrade).
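
As a hedged illustration of the verification mentioned in step 3, here is a minimal Go sketch of a process that exits non-zero when SIGTERM arrives before its work is done.  The 60-second sleep stands in for the real installer work; nothing here is the actual installer code:

// Hypothetical sketch of the behavior to verify: exit non-zero if SIGTERM
// arrives before the work completes.
package main

import (
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	sigs := make(chan os.Signal, 1)
	signal.Notify(sigs, syscall.SIGTERM)

	done := make(chan struct{})
	go func() {
		time.Sleep(60 * time.Second) // stand-in for the real installer work
		close(done)
	}()

	select {
	case <-sigs:
		os.Exit(1) // interrupted before completion: report failure
	case <-done:
		os.Exit(0) // work finished: report success
	}
}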

Comment 11 Seth Jennings 2019-05-14 15:12:39 UTC
Clayton is trying to verify the fix that removes TerminatePod

https://github.com/openshift/origin/commit/4228b5460dbb34e78d01b5bb7729308fc1f6a6b7

https://github.com/kubernetes/kubernetes/issues/58711#issuecomment-492082500

We also have a potential tight reproducer.  It is a race, so no reproducer will hit it every time, but we can increase the likelihood.  We need this to verify (within reason) any proposed fix.
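
The reproducer itself is not spelled out in this comment; the following is only a hypothetical client-go sketch of its general shape (pod name, namespace, image, and timings are all assumptions): run a restartPolicy=Never pod whose container exits non-zero, delete the pod around the time the container exits to widen the race window, and watch whether the phase flips from Failed to Succeeded:

// Hypothetical reproducer sketch, not the reproducer referenced above.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", filepath.Join(os.Getenv("HOME"), ".kube", "config"))
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}
	ctx := context.Background()
	pods := client.CoreV1().Pods("default")

	// RestartNever pod whose only container fails after a short sleep.
	pod := &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "terminate-race"},
		Spec: corev1.PodSpec{
			RestartPolicy: corev1.RestartPolicyNever,
			Containers: []corev1.Container{{
				Name:    "deployment",
				Image:   "busybox",
				Command: []string{"sh", "-c", "sleep 5; exit 1"},
			}},
		},
	}
	if _, err := pods.Create(ctx, pod, metav1.CreateOptions{}); err != nil {
		panic(err)
	}

	// Watch phase changes; a kubelet with the bug may report Failed and
	// then Succeeded after the delete below.
	w, err := pods.Watch(ctx, metav1.ListOptions{FieldSelector: "metadata.name=terminate-race"})
	if err != nil {
		panic(err)
	}
	go func() {
		for ev := range w.ResultChan() {
			if p, ok := ev.Object.(*corev1.Pod); ok {
				fmt.Println(ev.Type, p.Status.Phase)
			}
		}
	}()

	// Delete the pod around the time the container exits to widen the race.
	time.Sleep(5 * time.Second)
	if err := pods.Delete(ctx, "terminate-race", metav1.DeleteOptions{}); err != nil {
		panic(err)
	}
	time.Sleep(30 * time.Second)
}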

Comment 12 Clayton Coleman 2019-05-14 15:24:16 UTC
Because this intersects with drain in 4.1, the potential for catastrophic data loss for a customer system (a deployment config that requires a hook pod to execute correctly before rolling out a new version of a line-of-business system), or for a failure to correctly update the control plane (the Kube apiserver rolled out to 2 of 3 nodes while the operator thinks it has rolled out to all 3, thus allowing a potentially dangerous upgrade to go ahead), requires that we fix this with all urgency.  It needs soak time, so if we can't line up a fix today or early tomorrow, we may want to make this a 4.1.1 required upgrade.

Comment 13 Seth Jennings 2019-05-14 15:58:31 UTC
Upstream PR:
https://github.com/kubernetes/kubernetes/pull/77870

Comment 14 Seth Jennings 2019-05-14 16:52:10 UTC
Origin PR:
https://github.com/openshift/origin/pull/22834

Comment 15 Seth Jennings 2019-05-15 15:55:47 UTC
The master Origin PR has merged.

4.1 pick:
https://github.com/openshift/origin/pull/22840

Comment 16 Tomáš Nožička 2019-05-15 16:21:10 UTC
Nice work tracking this down and fixing it. We need to follow up with an admission check preventing such phase changes.
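
No admission mechanism is specified here; as a rough sketch of the rule such a check could enforce (the function and wiring are hypothetical, not an existing admission plugin), the core idea is to reject a status update that moves a pod out of the terminal Failed phase:

// Hypothetical validation sketch, not an existing admission plugin: reject a
// pod status update that rewrites a terminal Failed phase as Succeeded.
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
)

func validatePhaseTransition(oldPod, newPod *v1.Pod) error {
	if oldPod.Status.Phase == v1.PodFailed && newPod.Status.Phase != v1.PodFailed {
		return fmt.Errorf("pod %s/%s: phase %q is terminal and may not change to %q",
			newPod.Namespace, newPod.Name, oldPod.Status.Phase, newPod.Status.Phase)
	}
	return nil
}

func main() {
	oldPod := &v1.Pod{Status: v1.PodStatus{Phase: v1.PodFailed}}
	newPod := &v1.Pod{Status: v1.PodStatus{Phase: v1.PodSucceeded}}
	// Prints a validation error: Failed -> Succeeded would be rejected.
	fmt.Println(validatePhaseTransition(oldPod, newPod))
}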

Comment 19 W. Trevor King 2019-05-16 21:49:49 UTC
Pending promotion jobs are https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.1/1342 and https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-promote-openshift-machine-os-content-e2e-aws-4.2/259 .  Looks like those are the only blockers:

$ curl -s https://prow.svc.ci.openshift.org/data.js | jq -r '.[].job | select(. | contains("machine-os-content"))' | sort | uniq -c
     90 release-promote-openshift-machine-os-content-e2e-aws-4.1
     90 release-promote-openshift-machine-os-content-e2e-aws-4.2

Comment 20 W. Trevor King 2019-05-16 21:53:18 UTC
Here's the image they're testing:

$ oc image info registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Name:       registry.svc.ci.openshift.org/rhcos/machine-os-content:latest
Digest:     sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62
Media Type: application/vnd.docker.distribution.manifest.v2+json
Created:    40m ago
Image Size: 623.9MB
OS:         linux
Arch:       amd64
Entrypoint: /noentry
Labels:     com.coreos.ostree-commit=f44f4f27cb99175f2fe998dac5cbab0f8178f18db8a18a04f9ff88022558a9fb
            version=410.8.20190516.0

which you can see from the build logs too:

$ curl -s 'https://deck-ci.svc.ci.openshift.org/log?job=release-promote-openshift-machine-os-content-e2e-aws-4.1&id=1342' | head -n1
Will promote sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62, current is sha256:7b712447669fc18e8a343ba5e1c271c6e298e49d5cc391413087e0d608980d73

Comment 21 W. Trevor King 2019-05-16 23:24:53 UTC
Promotion jobs passed, and here we have it:

$ oc image info registry.svc.ci.openshift.org/ocp/4.1:machine-os-content | head -n2
Name:       registry.svc.ci.openshift.org/ocp/4.1:machine-os-content
Digest:     sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62
$ oc image info registry.svc.ci.openshift.org/ocp/4.2:machine-os-content | head -n2
Name:       registry.svc.ci.openshift.org/ocp/4.2:machine-os-content
Digest:     sha256:05c856bb59bf5bda754e34937958dfaf941c0e8cfc289fa0f1deca0d18411b62

Comment 24 Seth Jennings 2019-05-20 20:07:23 UTC
*** Bug 1705708 has been marked as a duplicate of this bug. ***

Comment 26 errata-xmlrpc 2019-06-04 10:46:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758