Bug 2009090

Summary: Increased CI reports of "illegally transitioned to Pending"
Product: OpenShift Container Platform
Reporter: Devan Goodwin <dgoodwin>
Component: Node
Assignee: Elana Hashman <ehashman>
Node sub component: Kubelet
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: medium
Priority: medium
CC: aos-bugs, ehashman, wking
Version: 4.9   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-09-29 22:47:55 UTC
Type: Bug

Description Devan Goodwin 2021-09-29 22:31:30 UTC
While working to improve the CI signal, the Technical Release Team has come across frequent, suspicious reports of pods illegally transitioning to Pending.

See search results: 

https://search.ci.openshift.org/?search=illegally+transitioned+to+Pending&maxAge=48h&context=1&type=bug%2Bjunit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

The problem has been seen before (https://bugzilla.redhat.com/show_bug.cgi?id=1933760, which links to https://github.com/kubernetes/kubernetes/pull/102821), but the supposed fix should be present in both 4.9 and 4.10.

Comment 2 Elana Hashman 2021-09-29 22:34:24 UTC
Present in 4.9, see e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1997478#c25

> Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.504009    1388 kubelet_pods.go:1484] "Pod attempted illegal phase transition" pod="openshift-network-diagnostics/network-check-arget-xqqn8" originalStatusPhase=Failed apiStatusPhase=Pending apiStatus="&PodStatus{Phase:Pending,Conditions:[]PodCondition{},Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:network-check-target-container,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da24fe6e3f9bf5bf398ea11d384c9ef4d5e964e827408f5e7eedffa3e67e2b26,ImageID:,ContainerID:,Started:nil,},},QOSClass:Burstable,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"
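
For triage across many CI runs, it can help to pull just the pod name and the two phases out of lines like the one above. The following is a minimal sketch, assuming the key=value layout seen in that single quoted log line holds for other occurrences; `FIELD_RE` and `parse_illegal_transition` are hypothetical names, not part of any kubelet or CI tooling.

```python
import re

# Assumed layout: key=value pairs where the value is either bare or double-quoted.
FIELD_RE = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')

MARKER = '"Pod attempted illegal phase transition"'


def parse_illegal_transition(line):
    """Return the pod and phase fields from a kubelet log line.

    Returns None if the line does not contain the marker message.
    """
    _, found, rest = line.partition(MARKER)
    if not found:
        return None
    # Only the text after the marker holds the structured fields.
    fields = {key: value.strip('"') for key, value in FIELD_RE.findall(rest)}
    return {
        "pod": fields.get("pod"),
        "original": fields.get("originalStatusPhase"),
        "api": fields.get("apiStatusPhase"),
    }
```

Applied to the line quoted above, this would report the pod together with `original` set to Failed and `api` set to Pending, which is the terminal-to-Pending pattern this bug is about.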

Comment 3 Elana Hashman 2021-09-29 22:37:24 UTC
Note that https://bugzilla.redhat.com/show_bug.cgi?id=1933760 is an unrelated bug, in which some pods accidentally get reset to Pending from Running. Those error messages stem from event processing as seen by the API server in the OpenShift tests.

In this case, we are seeing the kubelet directly log "Pod attempted illegal phase transition", which is invoked here: https://github.com/openshift/kubernetes/blob/f181eb2582e1649676395f25710c14b427b3369c/pkg/kubelet/kubelet_pods.go#L1480-L1484

This log message is only possible when a terminal pod (failed or succeeded) attempts a phase transition.
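
The condition described above can be illustrated with a small sketch. This is not the kubelet code (which is Go and lives at the link above), only a Python model of the rule as stated here; `reconcile_phase` and `TERMINAL_PHASES` are hypothetical names, and keeping the terminal phase when the transition is rejected is an assumption of this sketch.

```python
# Phases from which a pod is not expected to move, per the comment above.
TERMINAL_PHASES = {"Failed", "Succeeded"}


def reconcile_phase(original_phase, new_phase):
    """Return (phase_to_report, illegal).

    illegal is True only when a pod already in a terminal phase
    attempts to change to a different phase, which is the case in
    which "Pod attempted illegal phase transition" would be logged.
    """
    if original_phase in TERMINAL_PHASES and new_phase != original_phase:
        # Assumption: the terminal phase wins and the new phase is discarded.
        return original_phase, True
    return new_phase, False
```

With the values from the log line in comment 2, `reconcile_phase("Failed", "Pending")` flags the transition as illegal, whereas a Running pod moving to Pending would not be flagged by this check at all, which is why it is a different symptom from 1933760.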

Comment 4 Elana Hashman 2021-09-29 22:47:55 UTC
Ah, no, disregard the comment above; I had my wires crossed. This does not appear to have any higher impact on 4.10 than it did on 4.9 or 4.8. That bug still has not been fixed. The test is marked as flaky, and this shouldn't actually have any impact on the pods themselves (only on the test accounting). Marking as a dupe of 1933760.

We had one patch go in that significantly improved the symptoms we were seeing in 1933760, but it's not fully fixed. The static pod case may not be fixable, and it's unclear whether it's a genuine bug.

I will file a separate bug for the phenomenon in https://bugzilla.redhat.com/show_bug.cgi?id=2009090#c3

*** This bug has been marked as a duplicate of bug 1933760 ***