Bug 1933760

Summary: pod should not transition Running->Pending
Product: OpenShift Container Platform
Component: Node
Sub component: Kubelet
Version: 4.8
Reporter: Ryan Phillips <rphillips>
Assignee: Ryan Phillips <rphillips>
QA Contact: Weinan Liu <weinliu>
Status: CLOSED CURRENTRELEASE
Severity: medium
Priority: medium
Keywords: DeliveryBlocker
CC: aos-bugs, bparees, cglombek, cholman, deads, dgoodwin, obulatov, tsweeney, vnarsing, wking
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Environment: [sig-node] pods should never transition back to pending
Type: Bug
Last Closed: 2022-09-16 19:15:58 UTC

Description Ryan Phillips 2021-03-01 16:27:09 UTC
Description of problem:
There is a test in openshift/origin cmd/openshift-tests/e2e.go called 'pods should never transition back to pending'. The test flags pods that illegally transitioned from Running->Pending. The prow upgrade job shuts down and reboots a node, after which the transition to Pending is observed; on a reboot this is not an illegal transition. We need to check the event stream for a recent reboot and, if one happened, clear the corresponding items in the failure variable (line: 521). We only want to clear the variable if the reboot is close to the failure events - within roughly 5-10 minutes.
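
A minimal sketch of that windowed filtering, assuming hypothetical stand-in types for the test events (this is not the actual openshift-tests code):

package main

import (
	"fmt"
	"time"
)

// testEvent is a hypothetical stand-in for the monitor events the synthetic
// test walks over; only the fields needed for this sketch are included.
type testEvent struct {
	At      time.Time
	Node    string
	Message string
}

// filterRebootAdjacentFailures drops Running->Pending failures that occurred
// within `window` of a reboot event on the same node, since that transition
// is expected when a node restarts.
func filterRebootAdjacentFailures(failures, reboots []testEvent, window time.Duration) []testEvent {
	var kept []testEvent
	for _, f := range failures {
		excused := false
		for _, r := range reboots {
			if r.Node == f.Node && absDuration(f.At.Sub(r.At)) <= window {
				excused = true
				break
			}
		}
		if !excused {
			kept = append(kept, f)
		}
	}
	return kept
}

func absDuration(d time.Duration) time.Duration {
	if d < 0 {
		return -d
	}
	return d
}

func main() {
	now := time.Now()
	failures := []testEvent{{At: now, Node: "worker-0", Message: "pod moved back to Pending"}}
	reboots := []testEvent{{At: now.Add(-2 * time.Minute), Node: "worker-0", Message: "node rebooted"}}
	// The failure is within 10 minutes of a reboot on the same node, so it is excused.
	fmt.Println(len(filterRebootAdjacentFailures(failures, reboots, 10*time.Minute)))
}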

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.7-stable-to-4.8-ci/1366335507746787328

Slack: https://coreos.slack.com/archives/CK1AE4ZCK/p1614610010087200

Comment 1 Elana Hashman 2021-03-10 21:54:01 UTC
The tests were split since this bug was filed; changes should target pkg/synthetictests/kubelet.go in openshift/origin.

Comment 2 Ryan Phillips 2021-03-10 23:16:25 UTC
*** Bug 1937503 has been marked as a duplicate of this bug. ***

Comment 3 Ryan Phillips 2021-03-12 15:34:12 UTC
*** Bug 1935855 has been marked as a duplicate of this bug. ***

Comment 4 Ben Parees 2021-03-16 15:34:22 UTC
Raising severity: this is failing most or all 4.8 e2e upgrade jobs and blocking PRs.

Comment 5 Elana Hashman 2021-03-16 19:36:48 UTC
Not a blocker, this is just a substandard test case. This lifecycle check for pods is not guaranteed on node reboots, which is what seems to be causing the flakes.

I will try to get a fix out by EOD.

Comment 6 Elana Hashman 2021-03-17 00:01:09 UTC
Current investigation status:

What is causing these tests to fail? My hypothesis is that we recently made a change that prevents pods from going into the Unknown state when nodes reboot.

Previously, this condition would not occur on node reboots because the pods would get marked as Unknown. Hence, the corresponding test events would be filtered out: https://github.com/openshift/origin/blob/bd4fce643b74c1a623b2d12d8cae5d03557b0962/pkg/monitor/pod.go#L82-L102

Now, it seems that checking for pods previously being in Unknown state is not enough. We'd need to figure out if the node they were on was rebooted recently.

Because this is an invariant, we have very little context available: the only state we can access are the test events that were emitted (not to be confused with cluster events) https://github.com/openshift/origin/blob/8d133f8da551068d984922c59efdfceb6f950a77/pkg/synthetictests/kubelet.go#L73
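
As a rough illustration of what the invariant has to work with (the event type and scan below are hypothetical stand-ins, not the linked code), all it can do is replay the emitted per-pod phase observations and flag a Running observation followed directly by a Pending one:

package main

import "fmt"

// phaseEvent is a hypothetical stand-in for the per-pod phase observations
// the synthetic test has access to (test events, not cluster Events).
type phaseEvent struct {
	Pod   string
	Phase string // "Pending", "Running", "Unknown", ...
}

// findIllegalTransitions flags pods observed going Running -> Pending with no
// intervening observation (such as Unknown) in the emitted event stream.
func findIllegalTransitions(events []phaseEvent) []string {
	last := map[string]string{}
	var violations []string
	for _, e := range events {
		if prev, ok := last[e.Pod]; ok && prev == "Running" && e.Phase == "Pending" {
			violations = append(violations, e.Pod)
		}
		last[e.Pod] = e.Phase
	}
	return violations
}

func main() {
	events := []phaseEvent{
		{Pod: "loki/loki-promtail-52qsd", Phase: "Running"},
		{Pod: "loki/loki-promtail-52qsd", Phase: "Pending"},
	}
	fmt.Println(findIllegalTransitions(events)) // [loki/loki-promtail-52qsd]
}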

Comment 7 Elana Hashman 2021-03-18 18:56:12 UTC
Slack thread: https://coreos.slack.com/archives/CK1AE4ZCK/p1615995334010600

Clayton has gathered some data with https://github.com/openshift/origin/pull/25978 - for example,

ns/openshift-cluster-node-tuning-operator pod/tuned-v6rzf node/ci-op-ccmvz3y2-db044-wclms-worker-b-5rmxf - pod moved back to Pending: old status: {
  "phase": "Running",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-17T22:52:36Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-17T23:08:16Z"
    },
    {
      "type": "ContainersReady",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-17T22:52:39Z"
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-17T22:52:36Z"
    }
  ],
  "hostIP": "10.0.32.2",
  "podIP": "10.0.32.2",
  "podIPs": [
    {
      "ip": "10.0.32.2"
    }
  ],
  "startTime": "2021-03-17T22:52:36Z",
  "containerStatuses": [
    {
      "name": "tuned",
      "state": {
        "running": {
          "startedAt": "2021-03-17T22:52:38Z"
        }
      },
      "lastState": {},
      "ready": true,
      "restartCount": 0,
      "image": "registry.build02.ci.openshift.org/ci-op-ccmvz3y2/stable@sha256:03211964f7e6e1128c3279b99c13d658c130b62e79cfdce132ca3336bf64adad",
      "imageID": "registry.build02.ci.openshift.org/ci-op-ccmvz3y2/stable-initial@sha256:03211964f7e6e1128c3279b99c13d658c130b62e79cfdce132ca3336bf64adad",
      "containerID": "cri-o://550821e0f85583a14ea12fd14642e3f904cfff97035ebcf5c8df120c1c240185",
      "started": true
    }
  ],
  "qosClass": "Burstable"
}


The container status was Running, and after the node reboot the pod appears to have moved directly back to Pending with no intermediate terminating state.

We know crio wipes the container status on node reboot. It seems that there isn't a chance for the terminating statuses to get propagated to the API server, causing the appearance of an invalid state change. If that's the case, graceful kubelet shutdown (https://github.com/kubernetes/kubernetes/pull/99735) should fix this when we get the next rebase of 1.21 into 4.8. (https://github.com/openshift/kubernetes/pull/624)

Let me find the CI run logs and confirm that for this particular pod, the pod did correctly terminate according to the node logs.

For next steps, I'm inclined to wait and see if the rebase fixes this.

Comment 8 Elana Hashman 2021-03-18 22:01:09 UTC
As Clayton noted, while it is likely that the graceful shutdown manager will fix this in CI, it doesn't address the case where a node doesn't get a chance to gracefully shut down, in which case pods could still end up in this state.

In the kubelet, we check for illegal phase transitions here: https://github.com/openshift/kubernetes/blob/e1bc274ae64069e21201a87e4505e68a433c6a9f/pkg/kubelet/kubelet_pods.go#L1519-L1532
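
As a rough paraphrase (a sketch under my own naming, not the linked kubelet code), the kind of check in question prevents a pod from leaving a terminal phase:

package sketch

// reconcilePhase is a simplified paraphrase of the guard discussed in this
// bug: once a terminal phase has been reported for a pod, the kubelet keeps
// it there instead of letting a freshly computed status move the pod
// backwards. Only leaving a terminal phase is treated as illegal.
func reconcilePhase(apiPhase, computedPhase string) string {
	if apiPhase == "Succeeded" || apiPhase == "Failed" {
		return apiPhase // terminal phases are sticky
	}
	return computedPhase
}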

I'm going to see if I can reproduce this.


For reference, we fixed a similar issue here: https://github.com/kubernetes/kubernetes/issues/58711 https://github.com/kubernetes/kubernetes/pull/88440

Comment 9 Elana Hashman 2021-03-18 23:52:34 UTC
Let me trace the code paths for this failure example:

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/25978/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1372337843204001792

ns/loki pod/loki-promtail-52qsd node/ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j - pod moved back to Pending:
old status: {
  "phase": "Running",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-18T01:24:42Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-18T02:30:32Z"
    },
    {
      "type": "ContainersReady",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-18T01:25:09Z"
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-18T01:24:42Z"
    }
  ],
  "hostIP": "10.0.32.3",
  "podIP": "10.128.2.5",
  "podIPs": [
    {
      "ip": "10.128.2.5"
    }
  ],
  "startTime": "2021-03-18T01:24:42Z",
  "containerStatuses": [
    {
      "name": "promtail",
      "state": {
        "running": {
          "startedAt": "2021-03-18T01:24:50Z"
        }
      },
      "lastState": {},
      "ready": true,
      "restartCount": 0,
      "image": "docker.io/grafana/promtail:2.1.0",
      "imageID": "docker.io/grafana/promtail@sha256:4443d937395b25fa6591ade8ed06dae4f520bfe4d281b5686369b704be81b571",
      "containerID": "cri-o://892de8c3f5496bc14e4eb9cd9b5027758378997259480432efb09e9c497a84ce",
      "started": true
    }
  ],
  "qosClass": "BestEffort"
}
new status: {
  "phase": "Pending",
  "conditions": [
    {
      "type": "Initialized",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-18T01:24:42Z"
    },
    {
      "type": "Ready",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-18T02:30:32Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [promtail]"
    },
    {
      "type": "ContainersReady",
      "status": "False",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-18T02:32:20Z",
      "reason": "ContainersNotReady",
      "message": "containers with unready status: [promtail]"
    },
    {
      "type": "PodScheduled",
      "status": "True",
      "lastProbeTime": null,
      "lastTransitionTime": "2021-03-18T01:24:42Z"
    }
  ],
  "hostIP": "10.0.32.3",
  "startTime": "2021-03-18T01:24:42Z",
  "containerStatuses": [
    {
      "name": "promtail",
      "state": {
        "waiting": {
          "reason": "ContainerCreating"
        }
      },
      "lastState": {},
      "ready": false,
      "restartCount": 0,
      "image": "grafana/promtail:2.1.0",
      "imageID": "",
      "started": false
    }
  ],
  "qosClass": "BestEffort"
}


Following kubelet logs from the journal... machine reboots and crio wipes:

Mar 18 02:32:14.244817 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j systemd[1]: crio-wipe.service: Succeeded.
Mar 18 02:32:14.245402 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j systemd[1]: Started CRI-O Auto Update Script.
Mar 18 02:32:14.245526 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j systemd[1]: crio-wipe.service: Consumed 791ms CPU time

Kubelet starts:

Mar 18 02:32:14.517622 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j systemd[1]: Started Open Container Initiative Daemon.
Mar 18 02:32:14.520050 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j systemd[1]: Starting Kubernetes Kubelet...


Kubelet starts first syncloop which includes pod:

Mar 18 02:32:20.884663 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j hyperkube[1577]: I0318 02:32:20.884589    1577 kubelet.go:1920] SyncLoop (ADD, "api"): "gcp-pd-csi-driver-node-hwc7g_openshift-cluster-csi-drivers(f2d47217-fc13-4e4a-98ab-11306dc8104c), ingress-canary-sj4ww_openshift-ingress-canary(73476a21-c936-4c75-8f9d-528be9c7fb00), sdn-cvw7z_openshift-sdn(7c3dd7e3-fbaa-49d7-af39-872fbe18f144), network-check-target-c98qg_openshift-network-diagnostics(f8e47426-d587-445f-b926-5abf32b056d3), loki-promtail-52qsd_loki(eab32d4a-5cfc-426e-a061-1c4e481cabac), tuned-fcvt8_openshift-cluster-node-tuning-operator(9a3e0a8b-c414-4d5f-8eb1-f00d01cf31e6), machine-config-daemon-qm8s2_openshift-machine-config-operator(b8106134-e717-453b-8c82-c3978eace98f), node-ca-48b7s_openshift-image-registry(d7ee8680-ddcf-45ba-947a-521046aaf37f), ds1-9nqw2_e2e-k8s-sig-apps-daemonset-upgrade-4178(fdd1d375-1263-4e53-98a1-68cffb52df50), dns-default-jpf6d_openshift-dns(9a88ca33-9c0c-48b2-a7da-cfbcd6ff6b17), network-metrics-daemon-nl8dx_openshift-multus(967811dc-dc63-4775-b6af-8e9e64bd6f67), ovs-x92f6_openshift-sdn(6874f9f2-77cd-4c8b-8f62-a08a8d105843), node-exporter-r6rx7_openshift-monitoring(b1cedf25-d1b4-4bf7-bfeb-c4a7aee21549), multus-mc8nx_openshift-multus(688aec6c-9227-4328-8740-1411781a4962)"

Mounts happen:

Mar 18 02:32:21.037996 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j hyperkube[1577]: I0318 02:32:21.037927    1577 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "journal" (UniqueName: "kubernetes.io/host-path/eab32d4a-5cfc-426e-a061-1c4e481cabac-journal") pod "loki-promtail-52qsd" (UID: "eab32d4a-5cfc-426e-a061-1c4e481cabac")
[...]
Mar 18 02:32:21.174702 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j hyperkube[1577]: I0318 02:32:21.174677    1577 operation_generator.go:672] MountVolume.SetUp succeeded for volume "grafanacom-secrets" (UniqueName: "kubernetes.io/secret/eab32d4a-5cfc-426e-a061-1c4e481cabac-grafanacom-secrets") pod "loki-promtail-52qsd" (UID: "eab32d4a-5cfc-426e-a061-1c4e481cabac")

Sandbox starts to run pod:

Mar 18 02:32:21.245272 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j hyperkube[1577]: I0318 02:32:21.244374    1577 kuberuntime_manager.go:443] No sandbox for pod "loki-promtail-52qsd_loki(eab32d4a-5cfc-426e-a061-1c4e481cabac)" can be found. Need to start a new one
Mar 18 02:32:21.245374 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j crio[1540]: time="2021-03-18 02:32:21.245143212Z" level=info msg="Running pod sandbox: loki/loki-promtail-52qsd/POD" id=506d4a85-9cfa-44cf-8f1c-ed8cf7c48341 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Mar 18 02:32:22.074787 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j crio[1540]: time="2021-03-18 02:32:22.074650506Z" level=info msg="Got pod network &{Name:loki-promtail-52qsd Namespace:loki ID:fcd637ce82b96f54e5e4defa3de89aab60cbf1674f541ee225914adcb1a54035 NetNS:/var/run/netns/1773eca7-2dd3-483d-a921-ffbdd3fc48ff Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}"

Pod update gets sent to KAS:

Mar 18 02:32:26.527944 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j hyperkube[1577]: I0318 02:32:26.527617    1577 kubelet.go:1927] SyncLoop (UPDATE, "api"): "loki-promtail-52qsd_loki(eab32d4a-5cfc-426e-a061-1c4e481cabac)"

Pod actually finishes starting:

Mar 18 02:32:26.528304 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j crio[1540]: 2021-03-18T02:32:26Z [verbose] Add: loki:loki-promtail-52qsd:eab32d4a-5cfc-426e-a061-1c4e481cabac:(openshift-sdn):eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/1773eca7-2dd3-483d-a921-ffbdd3fc48ff"}],"ips":[{"version":"4","interface":0,"address":"10.128.2.2/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.128.2.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Mar 18 02:32:26.528304 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j crio[1540]: I0318 02:32:26.504287    1762 event.go:278] Event(v1.ObjectReference{Kind:"Pod", Namespace:"loki", Name:"loki-promtail-52qsd", UID:"eab32d4a-5cfc-426e-a061-1c4e481cabac", APIVersion:"v1", ResourceVersion:"54306", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.128.2.2/23]
Mar 18 02:32:26.529406 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j crio[1540]: time="2021-03-18 02:32:26.529344564Z" level=info msg="Got pod network &{Name:loki-promtail-52qsd Namespace:loki ID:fcd637ce82b96f54e5e4defa3de89aab60cbf1674f541ee225914adcb1a54035 NetNS:/var/run/netns/1773eca7-2dd3-483d-a921-ffbdd3fc48ff Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}]}"
Mar 18 02:32:27.047823 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j crio[1540]: time="2021-03-18 02:32:27.045477614Z" level=info msg="Ran pod sandbox fcd637ce82b96f54e5e4defa3de89aab60cbf1674f541ee225914adcb1a54035 with infra container: loki/loki-promtail-52qsd/POD" id=506d4a85-9cfa-44cf-8f1c-ed8cf7c48341 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Mar 18 02:32:27.083270 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j crio[1540]: time="2021-03-18 02:32:27.081517190Z" level=info msg="Creating container: loki/loki-promtail-52qsd/promtail" id=baa2b510-8c2a-485d-9b22-01a44bbeab10 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Mar 18 02:32:27.501073 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j crio[1540]: time="2021-03-18 02:32:27.498528304Z" level=info msg="Created container 573d8a018f6d27149d4e92b59f3807382dba18913a055c728917cec8d819ba92: loki/loki-promtail-52qsd/promtail" id=baa2b510-8c2a-485d-9b22-01a44bbeab10 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Mar 18 02:32:27.680309 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j crio[1540]: time="2021-03-18 02:32:27.679444241Z" level=info msg="Started container 573d8a018f6d27149d4e92b59f3807382dba18913a055c728917cec8d819ba92: loki/loki-promtail-52qsd/promtail" id=69a80189-a07a-4bc4-8a04-53e6e8c763ab name=/runtime.v1alpha2.RuntimeService/StartContainer
Mar 18 02:32:27.926651 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j hyperkube[1577]: I0318 02:32:27.925660    1577 kubelet.go:1958] SyncLoop (PLEG): "loki-promtail-52qsd_loki(eab32d4a-5cfc-426e-a061-1c4e481cabac)", event: &pleg.PodLifecycleEvent{ID:"eab32d4a-5cfc-426e-a061-1c4e481cabac", Type:"ContainerStarted", Data:"573d8a018f6d27149d4e92b59f3807382dba18913a055c728917cec8d819ba92"}
Mar 18 02:32:27.926651 ci-op-bj11sxqh-db044-zrw8w-worker-b-wlp4j hyperkube[1577]: I0318 02:32:27.925733    1577 kubelet.go:1958] SyncLoop (PLEG): "loki-promtail-52qsd_loki(eab32d4a-5cfc-426e-a061-1c4e481cabac)", event: &pleg.PodLifecycleEvent{ID:"eab32d4a-5cfc-426e-a061-1c4e481cabac", Type:"ContainerStarted", Data:"fcd637ce82b96f54e5e4defa3de89aab60cbf1674f541ee225914adcb1a54035"}



I'd have expected the old container statuses to persist between the old and new phases; even if they are wiped locally from the CRI, they should have been present on the pod spec from the API server when we started the pod sync loop. So the next step is to figure out why they are missing.

Relevant code is all here: https://github.com/openshift/kubernetes/blob/e1bc274ae64069e21201a87e4505e68a433c6a9f/pkg/kubelet/kubelet_pods.go#L1521-L1522

Comment 10 Elana Hashman 2021-03-19 19:46:19 UTC
Updating title to reflect that all pods are affected, not just static ones.

We believe this issue was introduced by https://github.com/kubernetes/kubernetes/pull/94087 since we mainly started noticing it late in 4.7/on 4.8. It's also affecting 4.6, but not 4.5 (only on upgrade runs for 4.5): https://search.ci.openshift.org/?search=pods+should+never+transition+back+to+pending&maxAge=336h&context=1&type=junit&name=4.5&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

That change landed mid-cycle for 4.7 and was cherry-picked back to 4.6 to fix a correctness bug on reboots: https://bugzilla.redhat.com/show_bug.cgi?id=1868645


Our hypothesis is that we're hitting some sort of race condition at node startup time. Since crio wipes the previous container statuses, we usually get them from the API Server before generating a new status, and then the statuses transition as expected. But if we don't get the statuses, the kubelet doesn't know that the pod was previously running, which causes it to be marked as Pending, and we see the illegal transition. This seems to be an accounting/API sync issue.

Since the issue seems to stem from k/k#94087, it's possible that https://github.com/kubernetes/kubernetes/pull/99336, which refactors and speeds up the initial node sync, may help with this.

I am going to run a cluster with v=5 to dump pod statuses and confirm if this is happening: https://github.com/openshift/kubernetes/blob/e1bc274ae64069e21201a87e4505e68a433c6a9f/pkg/kubelet/status/status_manager.go#L162-L176

Comment 11 Elana Hashman 2021-03-19 23:19:17 UTC
I've spent all afternoon staring at a v=5 journal on a node I restarted with no real results. I'm not actually sure if I managed to repro but I did get a better sense of the control flow of the status logic.

One thing that did surprise me is that we only get the "kubelet nodes not sync" message on the initial kubelet run; it wasn't present even once after the reboot. I'm not sure if this means anything.

Comment 12 Elana Hashman 2021-03-26 23:06:40 UTC
No progress since my last update. Once the patch above merges upstream, I plan to backport it in order to fix https://bugzilla.redhat.com/show_bug.cgi?id=1927263

Right now the merge is pretty difficult; it should be easier with the 1.21-beta1 rebase. I will also need to build and run the patch locally against the OpenShift test suite, as kubelet changes are ignored in CI.

Comment 13 Ryan Phillips 2021-03-31 15:06:35 UTC
*** Bug 1942746 has been marked as a duplicate of this bug. ***

Comment 15 Elana Hashman 2021-04-22 19:48:51 UTC
Dropping severity to High as this does not cause an unrecoverable outage or breakage. Leaving priority at Urgent. I hope to have a working patch by EOD. Testing at https://github.com/openshift/kubernetes/pull/705

Comment 16 Elana Hashman 2021-04-23 22:42:40 UTC
Made some great progress with the PR above.

Test failure: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/705/pull-ci-openshift-kubernetes-master-e2e-gcp-upgrade/1385651731089592320

ns/kube-system pod/apiserver-watcher-ci-op-kgnyjl15-3bd46-fk8jd-master-2 node/ci-op-kgnyjl15-3bd46-fk8jd-master-2 - invariant violation (bug): static pod should not transition Running->Pending with same UID


Kubelet journal link: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/705/pull-ci-openshift-kubernetes-master-e2e-gcp-upgrade/1385651731089592320/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/nodes/ci-op-kgnyjl15-3bd46-fk8jd-master-2/journal

Debug logs from where this pod started, and was initialized as Pending:

Apr 23 20:30:37.802503 ci-op-kgnyjl15-3bd46-fk8jd-master-2 hyperkube[1587]: I0423 20:30:37.802483    1587 kubelet.go:462] "Kubelet nodes not sync"
Apr 23 20:30:37.802672 ci-op-kgnyjl15-3bd46-fk8jd-master-2 hyperkube[1587]: I0423 20:30:37.802615    1587 kubelet_pods.go:1481] "pod phases" running=0 succeeded=0 unknown=0 stopped=0 waiting=1 pendingInit=0
Apr 23 20:30:37.802672 ci-op-kgnyjl15-3bd46-fk8jd-master-2 hyperkube[1587]: I0423 20:30:37.802662    1587 kubelet_pods.go:1547] "Got phase for pod" pod="kube-system/apiserver-watcher-ci-op-kgnyjl15-3bd46-fk8jd-master-2" phase=Pending


We have waiting=1, everything else set to 0 for this static pod. 


For a non-static pod:

ns/openshift-multus pod/multus-krf87 node/ci-op-kgnyjl15-3bd46-fk8jd-master-2 - pod moved back to Pending

Apr 23 20:31:45.254437 ci-op-kgnyjl15-3bd46-fk8jd-master-2 hyperkube[1587]: I0423 20:31:45.254374    1587 kubelet_pods.go:1481] "pod phases" running=0 succeeded=0 unknown=0 stopped=0 waiting=1 pendingInit=1
Apr 23 20:31:45.254437 ci-op-kgnyjl15-3bd46-fk8jd-master-2 hyperkube[1587]: I0423 20:31:45.254395    1587 kubelet_pods.go:1547] "Got phase for pod" pod="openshift-multus/multus-krf87" phase=Pending

waiting=1, pendingInit=1


We did not anticipate the case where a pod was previously running (unbeknownst to the kubelet) and is now waiting/pendingInit post-reboot: https://github.com/openshift/kubernetes/blob/e1bc274ae64069e21201a87e4505e68a433c6a9f/pkg/kubelet/kubelet_pods.go#L1469-L1472
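
To make the counters in those debug lines concrete, here is a simplified approximation of the phase derivation (my sketch, not the actual getPhase code): any container still waiting or pending initialization pulls the pod back to Pending, regardless of what was running before the reboot.

package sketch

// podCounts mirrors the counters printed in the debug lines above.
type podCounts struct {
	running, succeeded, unknown, stopped, waiting, pendingInit int
}

// derivePhase is a simplified approximation of the phase calculation under
// discussion, not the real kubelet code.
func derivePhase(c podCounts) string {
	switch {
	case c.pendingInit > 0 || c.waiting > 0:
		return "Pending"
	case c.running > 0:
		return "Running"
	case c.unknown > 0:
		return "Unknown"
	case c.stopped > 0:
		return "Succeeded" // or Failed, depending on exit codes
	default:
		return "Pending"
	}
}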

Comment 17 Elana Hashman 2021-04-23 23:37:58 UTC
With crio-wipe, getPhase doesn't have enough information available to determine that the pod was previously running. For any pods that begin execution before the node is marked as Ready (e.g. static pods or pods with special tolerations), it's possible we'd want to change this logic.

We could check it one level up, in generateAPIPodStatus: https://github.com/openshift/kubernetes/blob/e1bc274ae64069e21201a87e4505e68a433c6a9f/pkg/kubelet/kubelet_pods.go#L1520-L1530

However, this code disagrees with our current tests. Upstream says that only a transition back from terminated is considered illegal. Our tests say that Running -> Pending is illegal.

If this is where we've isolated the problem to, I'm inclined to say that the upstream code should be authoritative here, and that our test is wrong.


However, on Slack, Clayton suggested: https://coreos.slack.com/archives/CK1AE4ZCK/p1619109977129600?thread_ts=1619109911.127600&cid=CK1AE4ZCK

> reboot should not clear lastState in a graceful reboot shutdown state for a daemonset

I double-checked, and the daemonset pods which are affected by the illegal phase transition still don't seem to have a previous status:

Apr 23 20:31:27.077547 ci-op-kgnyjl15-3bd46-fk8jd-master-2 hyperkube[1587]: I0423 20:31:27.077530    1587 kubelet.go:462] "Kubelet nodes not sync"
Apr 23 20:31:27.494442 ci-op-kgnyjl15-3bd46-fk8jd-master-2 hyperkube[1587]: W0423 20:31:27.494398    1587 conversion.go:111] Could not get instant cpu stats: cumulative stats decrease
Apr 23 20:31:27.862101 ci-op-kgnyjl15-3bd46-fk8jd-master-2 hyperkube[1587]: I0423 20:31:27.862053    1587 kubelet.go:1951] "SyncLoop ADD" source="api" pods=[openshift-etcd/etcd-ci-op-kgnyjl15-3bd46-fk8jd-master-2 openshift-dns/dns-default-gczz7 openshift-kube-scheduler/openshift-kube-scheduler-ci-op-kgnyjl15-3bd46-fk8jd-master-2 openshift-network-diagnostics/network-check-target-q4ptp openshift-dns/node-resolver-54rxf openshift-cluster-node-tuning-operator/tuned-t9wj2 openshift-multus/multus-krf87 openshift-kube-controller-manager/revision-pruner-9-ci-op-kgnyjl15-3bd46-fk8jd-master-2 openshift-etcd/revision-pruner-4-ci-op-kgnyjl15-3bd46-fk8jd-master-2 openshift-kube-apiserver/kube-apiserver-ci-op-kgnyjl15-3bd46-fk8jd-master-2 kube-system/apiserver-watcher-ci-op-kgnyjl15-3bd46-fk8jd-master-2 openshift-cluster-csi-drivers/gcp-pd-csi-driver-node-pm7p8 openshift-kube-apiserver/revision-pruner-10-ci-op-kgnyjl15-3bd46-fk8jd-master-2 openshift-controller-manager/controller-manager-2qqr6 openshift-multus/multus-admission-controller-qr77z openshift-kube-controller-manager/kube-controller-manager-ci-op-kgnyjl15-3bd46-fk8jd-master-2 loki/loki-promtail-ps96h openshift-image-registry/node-ca-xlbqh e2e-k8s-sig-apps-daemonset-upgrade-7896/ds1-78f7p openshift-machine-config-operator/machine-config-server-657gg openshift-sdn/sdn-v7dpc openshift-sdn/ovs-p2pgr openshift-multus/network-metrics-daemon-znqf2 openshift-monitoring/node-exporter-c2nf5 openshift-machine-config-operator/machine-config-daemon-zkmcm openshift-sdn/sdn-controller-pccmk openshift-kube-scheduler/revision-pruner-10-ci-op-kgnyjl15-3bd46-fk8jd-master-2]


(Note timestamp is earlier than the 20:31:45.254437 for the multus pod example above.)

That seems like it could potentially be addressed by the shutdown manager or another bug fix. Maybe we're hitting a race condition for the daemonset pods?

This line makes me think we may be hitting one: https://github.com/openshift/kubernetes/blob/e1bc274ae64069e21201a87e4505e68a433c6a9f/pkg/kubelet/kubelet_pods.go#L1519

Comment 18 Clayton Coleman 2021-04-27 20:58:23 UTC
> With crio-wipe, getPhase doesn't have enough information available to determine that the pod was previously running. For any pods that begin execution before the node is marked as Ready (e.g. static pods or pods with special tolerations), it's possible we'd want to change this logic.

If crio-wipe is causing a problem (losing status in a way that impacts workloads) it's definitely something we need to fix - either by not crio-wiping (which is not part of Kube expected behavior), or by fixing the kubelet.

> However, this code disagrees with our current tests. Upstream says that only a transition back from terminated is considered illegal. Our tests say that Running -> Pending is illegal.

Since all the tests here were written by either myself, Ryan, Derek, or Seth, I can say for certain that upstream is us here :)

I think this is a gap.  As a user of kube (wearing my kube architect hat) I do not expect to see lastTerminationState wiped on a clean or unclean reboot.  It's worse for users (they see the transition) and for automation (you stop knowing that the pod ran previously) and it's unnecessary (in order to launch a daemonset pod the kubelet has to get the pod from the apiserver - so if the kubelet sees that previous status was foo, kubelet wiping it is wrong).

I *think* the correct behavior for non-static pods is that when we do an initial calculation of the effective phase, the previous status of the pod (from the apiserver) is relevant and needs to be merged with whatever we have on the kubelet at the time (which in the crio-wipe case is nothing).  Note that daemonset (as mentioned in slack) is not special, it just so happens in our controlled shutdown everything is drained.  In an uncontrolled restart with crio-wipe normal pods would be impacted, and the bug is because of crio-wipe changing the user-facing behavior.

For static pods... we don't need the apiserver's status, nor do we want to wait.  However, at some point we have to write the mirror pod back to the apiserver.  When we merge that status, I expected the lastTerminationState of the current mirror pod to reflect the previous state in the apiserver, if the mirror pod is effectively the same (we didn't change the container, although there's an argument as a human I *want* to see that we could explore, since that would make small changes to static pods more understandable).  As with all static pod behavior, we must be very careful not to take an accidental dependency on the apiserver status for the pod to start, though, so we should be deliberate here.

Comment 19 Elana Hashman 2021-04-29 20:01:55 UTC
> For static pods... we don't need the apiserver's status, nor do we want to wait.  However, at some point we have to write the mirror pod back to the apiserver.  When we merge that status, I expected the lastTerminationState of the current mirror pod to reflect the previous state in the apiserver, if the mirror pod is effectively the same (we didn't change the container, although there's an argument as a human I *want* to see that we could explore, since that would make small changes to static pods more understandable).  As with all static pod behavior, we must be very careful not to take an accidental dependency on the apiserver status for the pod to start, though, so we should be deliberate here.

I'm currently puzzled by why the static pod UIDs don't change between boots. I'd expect them to update. https://github.com/openshift/kubernetes/pull/718 had no effect on the test flaking.


> I *think* the correct behavior for non-static pods is that when we do an initial calculation of the effective phase, the previous status of the pod (from the apiserver) is relevant and needs to be merged with whatever we have on the kubelet at the time (which in the crio-wipe case is nothing).  Note that daemonset (as mentioned in slack) is not special, it just so happens in our controlled shutdown everything is drained.  In an uncontrolled restart with crio-wipe normal pods would be impacted, and the bug is because of crio-wipe changing the user-facing behavior.

Currently trying to get a better idea of what's racing here. I feel like, since we've waited to sync with the API server to admit the daemonset pods, we *should* have the status, so maybe we're hitting a race. Testing out https://github.com/openshift/kubernetes/pull/705 to try to get more data on what we're seeing on that code path.

Comment 20 Elana Hashman 2021-04-30 21:39:46 UTC
Okay, I think I've narrowed things down!

With this debugging statement, I was able to get a lot more logs: https://github.com/openshift/kubernetes/pull/705/files#diff-e81aa7518bebe9f4412cb375a9008b3481b19ec3e851d3187b3021ee94148f0dR1521

Let's trace one of the known reset containers, loki/loki-promtail-fh4dg: "ns/loki pod/loki-promtail-fh4dg node/ci-op-1iq03rrb-3bd46-q55l9-master-1 - pod moved back to Pending" on https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/705/pull-ci-openshift-kubernetes-master-e2e-gcp-upgrade/1387858384518123520

Pod gets added back after reboot:

Apr 29 22:38:45.061921 ci-op-1iq03rrb-3bd46-q55l9-master-1 hyperkube[1617]: I0429 22:38:45.061885    1617 kubelet.go:1944] "SyncLoop ADD" source="api" pods=[openshift-multus/network-metrics-daemon-6gknh ... loki/loki-promtail-fh4dg ... openshift-machine-config-operator/machine-config-daemon-qrrzg]

We generate a pod status:

Apr 29 22:38:45.066206 ci-op-1iq03rrb-3bd46-q55l9-master-1 hyperkube[1617]: I0429 22:38:45.066169    1617 kubelet_pods.go:1521] "Generating pod status" pod="loki/loki-promtail-fh4dg"
podStatus=&{ID:ccbb9c70-fe9f-4a2f-9a21-72d46b96af55 Name: Namespace: IPs:[] ContainerStatuses:[] SandboxStatuses:[]}
containerStatuses=[
{Name:oauth-proxy
 State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-04-29 21:10:46 +0000 UTC,} Terminated:nil}
 LastTerminationState:{Waiting:nil Running:nil Terminated:nil}
 Ready:true
 RestartCount:0
 Image:quay.io/openshift/origin-oauth-proxy:4.7
 ImageID:quay.io/openshift/origin-oauth-proxy@sha256:82719931600d90a8c3eb26b1632cc99020bc12921fa6ad13c9d4929fe2d0f025
 ContainerID:cri-o://c4fb73f9792b91a663120b8cfeaea17d09582b0d41f1717c37e8a3a3615673d9
 Started:0xc001a99c3d}
{Name:promtail
 State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-04-29 21:10:37 +0000 UTC,} Terminated:nil}
 LastTerminationState:{Waiting:nil Running:nil Terminated:nil}
 Ready:true
 RestartCount:0
 Image:quay.io/vrutkovs/promtail:2.2.1
 ImageID:quay.io/vrutkovs/promtail@sha256:abcec23ad2c6915fbe8f636ab7ec855301ad1748f027fafb6de7fa1369024983
 ContainerID:cri-o://28aa88267b5aa31e1282ba2e7d470142931d5a872d9c85061fdd6e351b2a512a
 Started:0xc001a99c3e}]

We see that we clearly have two running containers. But then, somehow, we calculate this as two waiting containers:

Apr 29 22:38:45.066329 ci-op-1iq03rrb-3bd46-q55l9-master-1 hyperkube[1617]: I0429 22:38:45.066243    1617 kubelet_pods.go:1481] "Current pod phases" running=0 succeeded=0 unknown=0 stopped=0 waiting=2 pendingInit=0

Thus, pod gets set to Pending:

Apr 29 22:38:45.066329 ci-op-1iq03rrb-3bd46-q55l9-master-1 hyperkube[1617]: I0429 22:38:45.066268    1617 kubelet_pods.go:1539] "Got phase for pod" pod="loki/loki-promtail-fh4dg" phase=Pending


Looks like the pod status is racing underneath between the two checks; the only possible way to get containers set as Waiting is for them to have a non-nil Status.Waiting. I'll see if I can figure out where it's changing. I'm suspicious that something's happening in https://github.com/kubernetes/kubernetes/blob/ef9d6f8c934ccb89d763e2ff3d459205acbbcda0/pkg/kubelet/kubelet_pods.go#L1578-L1581


I also found that the podCache is only intended to store local container runtime state: https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/container/cache.go#L130-L135

This means that we'll need to rely on the status manager for any remote state. It looks clear from the above that we are able to retrieve the API Server statuses, but something is happening to them and they're getting dropped.

Comment 21 Elana Hashman 2021-05-03 23:04:38 UTC
Confirmed from extra debug logs:

The static pods transitioning back to Pending did not have a chance to sync any statuses from the API Server, so they are unaware that they were previously running, i.e. they have no previous statuses to work with.


For a daemonset pod, confirmed the statuses are racing:

Log from https://github.com/openshift/kubernetes/pull/705/files#diff-e81aa7518bebe9f4412cb375a9008b3481b19ec3e851d3187b3021ee94148f0dR1522 

containerStatuses=[{Name:oauth-proxy State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-04-30 22:35:37 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:quay.io/openshift/origin-oauth-proxy:4.7 ImageID:quay.io/openshift/origin-oauth-proxy@sha256:82719931600d90a8c3eb26b1632cc99020bc12921fa6ad13c9d4929fe2d0f025 ContainerID:cri-o://954399340a9d9dc1b6285e606a3c32e2957ebb1a9a2952adca2bc060fa1fb8b2 Started:0xc000e52a05} {Name:promtail State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-04-30 22:35:11 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:quay.io/vrutkovs/promtail:2.2.1 ImageID:quay.io/vrutkovs/promtail@sha256:abcec23ad2c6915fbe8f636ab7ec855301ad1748f027fafb6de7fa1369024983 ContainerID:cri-o://2b42d4958888e779664e166c4fc33a5cf6a00a0fcb541171f52e7bdb64dc8ec4 Started:0xc000e52a06}]

Log a few lines later: https://github.com/openshift/kubernetes/pull/705/files#diff-e81aa7518bebe9f4412cb375a9008b3481b19ec3e851d3187b3021ee94148f0dR1417

containerStatuses=[{Name:oauth-proxy State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/openshift/origin-oauth-proxy:4.7 ImageID: ContainerID: Started:<nil>} {Name:promtail State:{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,} Running:nil Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:false RestartCount:0 Image:quay.io/vrutkovs/promtail:2.2.1 ImageID: ContainerID: Started:<nil>}]

So the statuses are changing underneath us and wiping the previous running status from the API server. Will look into convertStatusToAPIStatus next.

Comment 22 Elana Hashman 2021-05-07 21:10:31 UTC
Test patch up on https://github.com/openshift/kubernetes/pull/705

I think I found the missing fallthrough case that's causing the problem.

If that works, I'll update the unit tests and propose upstream.

Comment 24 Elana Hashman 2021-05-15 01:29:16 UTC
I've been reading through the implementation history of this code, since it has been nearly unchanged for the past 5+ years, and stumbled upon https://github.com/kubernetes/kubernetes/issues/21125

git blame says this code has really only changed twice since that issue was filed, in https://github.com/kubernetes/kubernetes/pull/54593 and https://github.com/kubernetes/kubernetes/pull/54597


With my latest debug code, I am seeing all the previously running container statuses from the API get reset to the Waiting state (https://github.com/openshift/kubernetes/blame/69980073e96fdae619512e9898a4e159a92119e0/pkg/kubelet/kubelet_pods.go#L1729), but I would expect them to keep the old statuses given L1742-3, so I think there is a subtle bug. I pushed more debug logs to dig into on Monday...

(latest logs https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_kubernetes/705/pull-ci-openshift-kubernetes-master-e2e-gcp-upgrade/1392566207206395904/artifacts/e2e-gcp-upgrade/gather-extra/artifacts/nodes/ci-op-dtjhxnz6-3bd46-bnsh2-master-0/journal )

Comment 25 Elana Hashman 2021-05-20 21:12:49 UTC
Okay, finally got to the bottom of this. The issue is this code:

https://github.com/openshift/kubernetes/blob/88a3e8c4e43f08c8a4ed22bb6cb61cf0eb273bb2/pkg/kubelet/kubelet_pods.go#L1742-L1747

We are throwing away the old statuses because the previous containers' termination state never got set, at least for the daemonset pods I'm looking at. For example:

May 15 01:31:49.432125 ci-op-ltqkm11x-3bd46-5t5nk-master-1 hyperkube[1633]: I0515 01:31:49.426386    1633 kubelet_pods.go:1732] "old statuses map" pod="openshift-sdn/sdn-pnvwk" oldStatuses=map[kube-rbac-proxy:{Name:kube-rbac-proxy State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-05-15 01:11:02 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.build02.ci.openshift.org/ci-op-ltqkm11x/stable@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65 ImageID:registry.build02.ci.openshift.org/ci-op-ltqkm11x/stable-initial@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65 ContainerID:cri-o://eb39a4e345d90064785d993139627e57f4bd935c0a2184865c553e1e0aafa183 Started:0xc00175cd55} sdn:{Name:sdn State:{Waiting:nil Running:&ContainerStateRunning{StartedAt:2021-05-15 01:11:02 +0000 UTC,} Terminated:nil} LastTerminationState:{Waiting:nil Running:nil Terminated:nil} Ready:true RestartCount:0 Image:registry.build02.ci.openshift.org/ci-op-ltqkm11x/stable@sha256:bb6eadf0b21fab80c9be60d5ac941bc781719874929574965e860e98b36793c7 ImageID:registry.build02.ci.openshift.org/ci-op-ltqkm11x/stable-initial@sha256:bb6eadf0b21fab80c9be60d5ac941bc781719874929574965e860e98b36793c7 ContainerID:cri-o://8a82055c12314f4a75c6e0ce95a49a70d0f5cf739ca273e12ccd8907e25da719 Started:0xc00175cd56}]
May 15 01:31:49.432125 ci-op-ltqkm11x-3bd46-5t5nk-master-1 hyperkube[1633]: I0515 01:31:49.426573    1633 kubelet_pods.go:1742] "setting old container status" pod="openshift-sdn/sdn-pnvwk" container={Name:sdn ...}
May 15 01:31:49.446793 ci-op-ltqkm11x-3bd46-5t5nk-master-1 hyperkube[1633]: I0515 01:31:49.426653    1633 kubelet_pods.go:1742] "setting old container status" pod="openshift-sdn/sdn-pnvwk" container={Name:kube-rbac-proxy ...}
May 15 01:31:49.454549 ci-op-ltqkm11x-3bd46-5t5nk-master-1 hyperkube[1633]: I0515 01:31:49.426724    1633 kubelet_pods.go:1763] "reset statuses are" pod="openshift-sdn/sdn-pnvwk" statuses=map[kube-rbac-proxy:&ContainerStatus{Name:kube-rbac-proxy,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:registry.build02.ci.openshift.org/ci-op-ltqkm11x/stable@sha256:3869910c1e208b125bdecd4ac2d8b2cae42efe221c704491b86aa9b18ce95a65,ImageID:,ContainerID:,Started:nil,} sdn:&ContainerStatus{Name:sdn,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:registry.build02.ci.openshift.org/ci-op-ltqkm11x/stable@sha256:bb6eadf0b21fab80c9be60d5ac941bc781719874929574965e860e98b36793c7,ImageID:,ContainerID:,Started:nil,}]

Logs from https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_kubernetes/705/pull-ci-openshift-kubernetes-master-e2e-gcp-upgrade/1393347450973982720
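
To illustrate the pattern visible in those log lines, here is a pared-down hypothetical of the merge behaviour being described (stand-in types, not the kubelet's actual convertToAPIContainerStatuses): an old status only survives as a LastTerminationState when it actually recorded a termination, so a container last seen Running before an abrupt reboot is reset to a fresh Waiting status and its history is lost.

package sketch

// containerState and containerStatus are pared-down stand-ins for the API
// types, with just enough fields to show the merge pattern described above.
type containerState struct {
	Running    bool
	Terminated bool
	Waiting    string // waiting reason, e.g. "ContainerCreating"
}

type containerStatus struct {
	Name                 string
	State                containerState
	LastTerminationState containerState
}

// resetStatus builds the fresh status for a container after a reboot. The old
// (API server) state is only carried forward as LastTerminationState when it
// recorded a termination; a previously Running state is simply dropped.
func resetStatus(old *containerStatus, name string) containerStatus {
	fresh := containerStatus{
		Name:  name,
		State: containerState{Waiting: "ContainerCreating"},
	}
	if old != nil && old.State.Terminated {
		fresh.LastTerminationState = old.State
	}
	return fresh
}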

Comment 26 Elana Hashman 2021-05-22 00:40:37 UTC
For the static pods, I'm only seeing 3/5 have the phase transition issue. For the two that don't have it (apiserver/etcd), I see the following errors:

May 20 23:30:34.789963 ci-op-y66rclst-3bd46-t7dw2-master-0 hyperkube[1655]: I0520 23:30:34.785768    1655 status_manager.go:576] "Pod was deleted and then recreated, skipping status update" pod="openshift-kube-apiserver/kube-apiserver-ci-op-y66rclst-3bd46-t7dw2-master-0" oldPodUID=ff20e830-1eae-48a6-aa10-9bf955964f0d podUID=c800bf0b-a552-466f-983f-404dda77196a
May 20 23:30:34.912742 ci-op-y66rclst-3bd46-t7dw2-master-0 hyperkube[1655]: I0520 23:30:34.912685    1655 status_manager.go:576] "Pod was deleted and then recreated, skipping status update" pod="openshift-etcd/etcd-ci-op-y66rclst-3bd46-t7dw2-master-0" oldPodUID=a5835318-f92e-4f1d-9a79-38091ed9afb4 podUID=af9a1212-a3bf-490e-a804-dc7e13dccdf3

I wonder why this did not apply to the other static pods?


For the non-static pods that regressed:

ns/openshift-multus pod/multus-8nkkl node/ci-op-y66rclst-3bd46-t7dw2-master-0 - pod moved back to Pending

This one appears to have had at least one container terminate (init containers?) while another's status came back as running, and the phase collapses down to:

May 20 23:30:35.867883 ci-op-y66rclst-3bd46-t7dw2-master-0 hyperkube[1655]: I0520 23:30:35.849042    1655 kubelet_pods.go:1482] "Current pod phases" running=1 succeeded=0 unknown=0 stopped=0 waiting=0 pendingInit=6
May 20 23:30:35.867883 ci-op-y66rclst-3bd46-t7dw2-master-0 hyperkube[1655]: I0520 23:30:35.849058    1655 kubelet_pods.go:1540] "Got phase for pod" pod="openshift-multus/multus-8nkkl" phase=Pending

Since there are containers pendingInit, the "running" bit gets totally ignored.

Comment 27 Elana Hashman 2021-05-24 22:46:49 UTC
https://github.com/openshift/library-go/commit/235599ab1db5b1d08cdcac3b151c330f10ba13c4 appears to be why static pods retain the same UID after reboots. https://github.com/kubernetes/kubernetes/pull/87461 might make it no longer necessary.

For daemonsets, the patch I attempted for retaining old statuses worked for ~70% of the pods, but didn't work for the ones with init containers. Between some of the refactoring work Clayton is working on (where we've found much of the kubelet just ignores the existence of init containers) and the graceful shutdown work, I am hoping this will mostly be taken care of. I will work to upstream my patch and add some unit tests for that code.

Comment 28 Elana Hashman 2021-06-09 22:37:41 UTC
Continuing to dig.

My previous patch helped, but started causing kubelet.go unit tests to fail, so I needed to rethink the approach. In the meantime, I added some initContainer getPhase coverage.

I believe the issue is that this section of code[0] *should* trigger and set an Unknown previous termination status, but for some reason does not. It also has a special check on initContainers that seems incomplete, as we don't actually iterate through initContainerStatuses anywhere; this could explain why we're having issues with daemonsets that have initContainers. I'm going to try to figure out what is causing this edge case with more debugging, and add unit test scenarios to catch the case once I figure it out.

[0]: https://github.com/openshift/kubernetes/blob/c5e3b15168178cd15aef264fae773cb8fc556c7c/pkg/kubelet/kubelet_pods.go#L1768-L1773

Comment 29 Elana Hashman 2021-06-12 00:32:15 UTC
Assuming my latest test patch worked, WIP PR up at https://github.com/kubernetes/kubernetes/pull/102821

Comment 30 Elana Hashman 2021-06-14 23:13:26 UTC
Update: CI showed a large reduction in pending pods with this patch (from >100 down to 22).

Running a full suite of OpenShift CI once more with https://github.com/openshift/kubernetes/pull/808

The upstream PR is as above; I added unit tests and marked it ready for review. We will backport when ready, or we can consider taking this in a 4.9 rebase.

Comment 31 Elana Hashman 2021-07-14 22:29:29 UTC
I believe at this point we're waiting for this to land in a 4.9 rebase; the changes have merged upstream. From there, I believe there may still be some remaining phase failures that we will need to triage and decide whether they are still in the scope of this BZ.

Comment 34 Elana Hashman 2021-09-29 22:47:55 UTC
*** Bug 2009090 has been marked as a duplicate of this bug. ***

Comment 35 Devan Goodwin 2021-10-07 18:29:22 UTC
Any updates on this? There are still quite a few hits for this out there: https://search.ci.openshift.org/?search=illegally+transitioned+to+Pending&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Should this be in POST?

Comment 36 Elana Hashman 2021-11-01 20:27:16 UTC
This still shows up as a flake in CI, but it's debatable whether the pods that are still affected are actually hitting a bug. The remaining pods running into this are static pods with identical UIDs and some occasional edge cases. I'm not actually sure the test invariant is what we want, especially after the pod lifecycle refactor that landed in 1.23.

Comment 37 David Eads 2021-11-02 15:08:03 UTC
Pods have a lifecycle which goes from running -> terminated -> pending.  Going directly from running -> pending means that either
 1. the previous instance is still running
 2. at some point the pod was terminated

We build operators and other controllers on top of this lifecycle. The invariant (which has been part of the lifecycle since 1.0, so far as I know) is long-standing, and it's not clear what the motivation for changing such an invariant would be. While a refactor may have broken this invariant and we caught it late, the change still appears to break a valuable aspect of the pod lifecycle that layers above have been able to build on top of for some time.

The fact that the victims are static pods reduces the scope of the problem, but increases the importance of getting these aspects correct, since those pods are likely to be critical, likely to conflict with multiple instances (this has corrupted on-disk data in previous releases), and definitely have external controllers built around them.

Comment 38 Elana Hashman 2021-11-02 20:51:33 UTC
Per above comments on the bug, I've identified two types of pods that are still affected by the issue:

1) Static pods with fixed UIDs
2) Daemonset pods with init containers (so even if previous container statuses are located, since the init container needs to run again, the pod will go from Running -> Pending)


I believe these are the only two kinds of pods still affected after the attached patch landed upstream and got rebased into OpenShift, although it would be good for me to double check and confirm.

In both cases, it seems valid to me that the pods go from Running to Pending after a node reboot. In 2), I wouldn't expect a restarted pod with no running containers, which needs to run its init containers again, to be described as Running, but if that's not the consensus, it can be patched easily enough. I believe we wanted to see if that case would be addressed by graceful node shutdowns.

Comment 39 Clayton Coleman 2021-11-08 14:34:57 UTC
For 1, the correct behavior of a static pod on reboot + crio-wipe is a clean-slate Pending, because by definition a static pod has no state.

For 2:

1. We added crio-wipe (possibly without completely thinking through the implications of clearing the pod state for init containers)
2. crio-wipe is resetting init containers for pods that remain on the node
3. which causes kubelet to be unable to confirm the init containers have already run to completion, which it MUST do in that state (correct behavior)
4. kubelet recognizes that a pod waiting for initialization is by definition pending (because of crio-wipe resetting its state)
5. user observes running -> pending -> running over a controlled reboot + crio-wipe

So the core issue identified here originally is no longer at play - kubelet is doing the correct thing in scenario 2.

What is not clear is whether crio-wipe should be deleting init containers in ALL cases. Do we have a design for the introduction of crio-wipe that covers the implications of the wipe and the justifications for and against? We need to establish which of these things is true:

1. Wiping init containers is good, because in controlled drain + reboot scenarios we want to return the node to a clean state, and undrained daemonset init containers may have captured previous state incorrectly
2. Wiping init containers is bad, because init containers are intended to be part of the contract between file system initialization and the pod and generally have to be "run-once" and it makes no sense to make it wipe init containers more easily
3. Leaving daemonset pods on nodes during drain+reboot is wrong - because we want to completely reset the state of a node, and not deleting those pods means that users can't see that the node changed

We should make sure we work through what the right outcome is (or accumulate others) and move that to a new bug based on the decision.

Comment 40 Elana Hashman 2021-11-12 20:05:36 UTC
David, Clayton, Ryan and I met today to discuss the above scenario. We agreed that the following 5 scenarios should be supported:

1. Any deleted pod will never go Running -> Pending.
2. A non-daemonset API pod will never go Running -> Pending.
3. Static pods (with the same UID) are allowed to go Running -> Pending if the node rebooted.
4. If a daemonset pod doesn't have init containers, it will never go Running -> Pending.
5. Daemonset pods with init containers are allowed to go Running -> Pending if the node rebooted.

David will update the invariant tests in openshift/origin to reflect this consensus.

Clayton will update the documentation on init containers and behaviour on reboots to reflect this.

I will add an e2e test to upstream Kubernetes to validate this behaviour on a reboot with crio-wipe (or the containerd equivalent).
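
As a rough sketch of what the relaxed invariant from the five scenarios above might look like (hypothetical types and helper, not the actual openshift/origin change):

package sketch

// podInfo is a hypothetical summary of what the synthetic test knows about a
// pod when it observes a Running -> Pending transition.
type podInfo struct {
	Deleted           bool
	IsStatic          bool
	IsDaemonSet       bool
	HasInitContainers bool
	NodeRebooted      bool
}

// transitionAllowed encodes the five scenarios above: only static pods, and
// daemonset pods with init containers, may go Running -> Pending, and only
// when their node rebooted; deleted pods and ordinary API pods never may.
func transitionAllowed(p podInfo) bool {
	if p.Deleted {
		return false
	}
	if p.IsStatic && p.NodeRebooted {
		return true
	}
	if p.IsDaemonSet && p.HasInitContainers && p.NodeRebooted {
		return true
	}
	return false
}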


Once all this is completed, this BZ can be closed.

Comment 42 Elana Hashman 2022-01-07 22:34:47 UTC
I filed an issue to track the e2e test in upstream: https://github.com/kubernetes/kubernetes/issues/106390

I will sync back with David and Clayton about their action items; I am not sure whether this is completed yet.