Bug 2029470 - Recover from suddenly appearing old operand revision WAS: kube-scheduler-operator test failure: Nodes not achieving new revision
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-scheduler
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: Jan Chaloupka
QA Contact: RamaKasturi
URL:
Whiteboard:
Duplicates: 2042579
Depends On: 2028595
Blocks: 2029468
 
Reported: 2021-12-06 15:21 UTC by Stefan Schimanski
Modified: 2022-08-10 10:40 UTC
CC List: 5 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 2028595
Environment:
Last Closed: 2022-08-10 10:40:31 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-kube-apiserver-operator pull 1300 0 None Merged bug 2029470: update library-go to get rapid installer pod fixes 2022-02-08 16:34:05 UTC
Github openshift cluster-kube-controller-manager-operator pull 597 0 None Merged bug 2029470: update library-go to get rapid installer pod fixes 2022-02-03 09:05:54 UTC
Github openshift cluster-kube-scheduler-operator pull 407 0 None Merged bug 2029470: update library-go to get rapid installer pod fixes 2022-02-03 09:05:56 UTC
Github openshift library-go pull 1300 0 None Merged wait for other installers to finish 2022-02-01 10:46:25 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:40:47 UTC

Description Stefan Schimanski 2021-12-06 15:21:59 UTC
+++ This bug was initially created as a clone of Bug #2028595 +++

Spotted in CI run: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1466323246101041152

Test failure:

operator conditions kube-scheduler (0s)
Operator progressing (NodeInstaller): NodeInstallerProgressing: 1 nodes are at revision 0; 1 nodes are at revision 4; 1 nodes are at revision 6; 0 nodes have achieved new revision 7

Digging in we find that the kube-scheduler-operator is reporting:

2021-12-02T08:50:48.957393046Z I1202 08:50:48.957356       1 installer_controller.go:512] "ip-10-0-129-204.us-east-2.compute.internal" is in transition to 7, but has not made progress because waiting for static pod of revision 7, found 6

This repeats for 20 minutes until:

2021-12-02T09:10:11.339203742Z I1202 09:10:11.339165       1 installer_controller.go:512] "ip-10-0-129-204.us-east-2.compute.internal" is in transition to 7, but has not made progress because waiting for static pod of revision 7, found 6

At which point the test has failed.
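
For context, here is a minimal Go sketch (illustrative only, not the actual library-go code; all names are made up) of the comparison behind the "waiting for static pod of revision 7, found 6" message: the installer controller reads the revision label of the node's running static pod (visible in the mirror pod YAML further below) and compares it with the node's target revision.

```go
package main

import (
	"fmt"
	"strconv"
)

// staticPodRevision extracts the integer revision from a static pod's
// "revision" label, as seen in the mirror pod YAML from the must-gather.
func staticPodRevision(labels map[string]string) (int, error) {
	v, ok := labels["revision"]
	if !ok {
		return 0, fmt.Errorf("static pod has no revision label")
	}
	return strconv.Atoi(v)
}

// nodeProgress reports whether a node has reached its target revision,
// mirroring the shape of the installer controller's log message.
func nodeProgress(podLabels map[string]string, target int) string {
	found, err := staticPodRevision(podLabels)
	if err != nil {
		return err.Error()
	}
	if found == target {
		return fmt.Sprintf("node is at revision %d", target)
	}
	return fmt.Sprintf("waiting for static pod of revision %d, found %d", target, found)
}

func main() {
	// Labels copied from the revision-6 mirror pod in the must-gather.
	labels := map[string]string{"app": "openshift-kube-scheduler", "revision": "6"}
	fmt.Println(nodeProgress(labels, 7))
	// Output: waiting for static pod of revision 7, found 6
}
```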

In the relevant must-gather: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.10-e2e-aws-serial/1466323246101041152/artifacts/e2e-aws-serial/gather-must-gather/artifacts/must-gather.tar

The installer pod claims to have written the revision 7 static pod manifest at:

2021-12-02T08:50:26.659271543Z I1202 08:50:26.659252       1 cmd.go:489] Writing static pod manifest "/etc/kubernetes/manifests/kube-scheduler-pod.yaml" ...
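
As an aside, a minimal hypothetical sketch of what such a write can look like; the real installer (cmd.go in library-go) does considerably more, and whether it uses exactly this write-then-rename pattern is an assumption here. The rename step matters because it is atomic on the same filesystem, so the kubelet's file watcher only ever observes a complete manifest.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// writeStaticPodManifest writes a rendered static pod manifest into the
// kubelet's manifest directory via a temp file plus atomic rename.
func writeStaticPodManifest(dir, name string, manifest []byte) error {
	tmp, err := os.CreateTemp(dir, name+".tmp-*")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // no-op once the rename has succeeded
	if _, err := tmp.Write(manifest); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	// Rename is atomic on the same filesystem, so the kubelet sees either
	// the old manifest or the complete new one, never a partial file.
	return os.Rename(tmp.Name(), filepath.Join(dir, name))
}

func main() {
	dir := os.TempDir() // stand-in for /etc/kubernetes/manifests
	manifest := []byte("apiVersion: v1\nkind: Pod\n")
	if err := writeStaticPodManifest(dir, "kube-scheduler-pod.yaml", manifest); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("Writing static pod manifest %q ... done\n", filepath.Join(dir, "kube-scheduler-pod.yaml"))
}
```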

But in the must-gather we see that the running pod is still revision 6 well over 20 minutes later:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubectl.kubernetes.io/default-logs-container: kube-scheduler
    kubernetes.io/config.hash: cce59ead72804895d31dba4208b395aa
    kubernetes.io/config.mirror: cce59ead72804895d31dba4208b395aa
    kubernetes.io/config.seen: "2021-12-02T08:50:47.930740314Z"
    kubernetes.io/config.source: file
    target.workload.openshift.io/management: '{"effect": "PreferredDuringScheduling"}'
  creationTimestamp: "2021-12-02T08:50:47Z"
  labels:
    app: openshift-kube-scheduler
    revision: "6"
    scheduler: "true"
...


In the kubelet logs for the relevant master we see:

 Dec 02 08:50:47.935456 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:47.935427    1371 status_manager.go:611] "Pod was deleted and then recreated, skipping status update" pod="openshift-kube-scheduler/openshift-kube-scheduler-ip-10-0-129-204.us-east-2.compute.internal" oldPodUID=f11e314508a00e4ea9bf37d76b82b162 podUID=cce59ead72804895d31dba4208b395aa

It appears the kubelet is not respecting the new static pod definition.



Looking at CI as a whole, and assuming this test failure is typically the same underlying failure, it appears to be involved in roughly 3.5% of all failed 4.10 jobs: https://search.ci.openshift.org/?search=nodes+have+achieved+new+revision&maxAge=48h&context=1&type=junit&name=4.10&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

--- Additional comment from Jan Chaloupka on 2021-12-02 19:41:55 CET ---

Pulling the must-gather and checking kubelet_service.log/crio_service.log from registry-ci-openshift-org-ocp-4-10-2021-12-02-081913-sha256-e78391c0bf790fea89fc8267f6e4a4b0497430642a2c329f2268d8eae23f465c/host_service_logs/masters.

kubelet_service.log for installer-6-ip-10-0-129-204.us-east-2.compute.internal:
```
Dec 02 08:50:14.597970 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:14.597891    1371 kubelet.go:2076] "SyncLoop ADD" source="api" pods=[openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal]
Dec 02 08:50:14.915700 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:14.915568    1371 kuberuntime_manager.go:483] "No sandbox for pod can be found. Need to start a new one" pod="openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal"
Dec 02 08:50:20.195393 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:20.195360    1371 kubelet.go:2092] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal]
Dec 02 08:50:46.897346 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:46.897307    1371 kubelet.go:2092] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal]
Dec 02 08:50:47.074528 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:47.074499    1371 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal" event=&{ID:e613e2cb-e2e5-4a33-bac5-3d046411b324 Type:ContainerStarted Data:922f082d193b55071f363a1d3d387946a955bea3e8614e1577658fd621498d14}
Dec 02 08:50:48.086749 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:48.086714    1371 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal" event=&{ID:e613e2cb-e2e5-4a33-bac5-3d046411b324 Type:ContainerStarted Data:0376b6c22d4639e39aac9a93f2e1895a517a1ef09dc781fd3f3ee1f70d6f8bf5}
Dec 02 08:50:48.086904 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:48.086878    1371 kuberuntime_container.go:720] "Killing container with a grace period" pod="openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal" podUID=e613e2cb-e2e5-4a33-bac5-3d046411b324 containerName="installer" containerID="cri-o://0376b6c22d4639e39aac9a93f2e1895a517a1ef09dc781fd3f3ee1f70d6f8bf5" gracePeriod=30
Dec 02 08:50:49.093406 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:49.093345    1371 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal" event=&{ID:e613e2cb-e2e5-4a33-bac5-3d046411b324 Type:ContainerDied Data:0376b6c22d4639e39aac9a93f2e1895a517a1ef09dc781fd3f3ee1f70d6f8bf5}
Dec 02 08:50:49.093406 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:49.093367    1371 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal" event=&{ID:e613e2cb-e2e5-4a33-bac5-3d046411b324 Type:ContainerDied Data:922f082d193b55071f363a1d3d387946a955bea3e8614e1577658fd621498d14}
Dec 02 08:50:49.152892 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:49.152862    1371 kubelet.go:2092] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal]
Dec 02 08:50:49.154890 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:49.154519    1371 kubelet.go:2086] "SyncLoop REMOVE" source="api" pods=[openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal]
```

kubelet_service.log for installer-7-ip-10-0-129-204.us-east-2.compute.internal:
```
Dec 02 08:50:24.001183 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:24.001152    1371 kubelet.go:2076] "SyncLoop ADD" source="api" pods=[openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal]
Dec 02 08:50:24.328780 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:24.328746    1371 kuberuntime_manager.go:483] "No sandbox for pod can be found. Need to start a new one" pod="openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal"
Dec 02 08:50:25.539685 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:25.530416    1371 kubelet.go:2083] "SyncLoop UPDATE" source="api" pods=[openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal]
Dec 02 08:50:26.000898 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:26.000826    1371 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal" event=&{ID:d2d64d26-6c1f-43f5-a080-e224dccbcb1c Type:ContainerStarted Data:2e5a71fe0e6f9c14aad0bb4fec8c6afe94638337b0d94a50af70c019af1c5f81}
Dec 02 08:50:27.006226 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:27.006015    1371 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal" event=&{ID:d2d64d26-6c1f-43f5-a080-e224dccbcb1c Type:ContainerDied Data:18aecb076f180ca3220d5c90a4c6b8fc75e00bae87cae450430f5e4075a8e0e4}
Dec 02 08:50:28.014048 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:28.014008    1371 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal" event=&{ID:d2d64d26-6c1f-43f5-a080-e224dccbcb1c Type:ContainerDied Data:2e5a71fe0e6f9c14aad0bb4fec8c6afe94638337b0d94a50af70c019af1c5f81}
Dec 02 08:50:28.014456 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:28.014171    1371 kuberuntime_manager.go:502] "No ready sandbox for pod can be found. Need to start a new one" pod="openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal"
Dec 02 09:10:59.764265 ip-10-0-129-204 hyperkube[1371]: I1202 09:10:59.764236    1371 logs.go:319] "Finished parsing log file" path="/var/log/pods/openshift-kube-scheduler_installer-7-ip-10-0-129-204.us-east-2.compute.internal_d2d64d26-6c1f-43f5-a080-e224dccbcb1c/installer/0.log"
```

crio_service.log for installer-6-ip-10-0-129-204.us-east-2.compute.internal:
```
Dec 02 08:50:14.916182 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:14.916118428Z" level=info msg="Running pod sandbox: openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal/POD" id=83fe7ed6-c9b7-4f00-9afc-760212ac3aed name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 02 08:50:14.937758 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:14.937725376Z" level=info msg="Got pod network &{Name:installer-6-ip-10-0-129-204.us-east-2.compute.internal Namespace:openshift-kube-scheduler ID:922f082d193b55071f363a1d3d387946a955bea3e8614e1577658fd621498d14 UID:e613e2cb-e2e5-4a33-bac5-3d046411b324 NetNS:/var/run/netns/426e0e6e-e178-4a28-819c-3665ec20eba2 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Dec 02 08:50:14.937758 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:14.937753811Z" level=info msg="Adding pod openshift-kube-scheduler_installer-6-ip-10-0-129-204.us-east-2.compute.internal to CNI network \"multus-cni-network\" (type=multus)"
Dec 02 08:50:46.898928 ip-10-0-129-204 crio[1340]: 2021-12-02T08:50:46Z [verbose] Add: openshift-kube-scheduler:installer-6-ip-10-0-129-204.us-east-2.compute.internal:e613e2cb-e2e5-4a33-bac5-3d046411b324:openshift-sdn(openshift-sdn):eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/426e0e6e-e178-4a28-819c-3665ec20eba2"}],"ips":[{"version":"4","interface":0,"address":"10.129.0.51/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.129.0.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Dec 02 08:50:46.898928 ip-10-0-129-204 crio[1340]: I1202 08:50:46.873425   22770 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-scheduler", Name:"installer-6-ip-10-0-129-204.us-east-2.compute.internal", UID:"e613e2cb-e2e5-4a33-bac5-3d046411b324", APIVersion:"v1", ResourceVersion:"15668", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.129.0.51/23] from openshift-sdn
Dec 02 08:50:46.899435 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:46.899395831Z" level=info msg="Got pod network &{Name:installer-6-ip-10-0-129-204.us-east-2.compute.internal Namespace:openshift-kube-scheduler ID:922f082d193b55071f363a1d3d387946a955bea3e8614e1577658fd621498d14 UID:e613e2cb-e2e5-4a33-bac5-3d046411b324 NetNS:/var/run/netns/426e0e6e-e178-4a28-819c-3665ec20eba2 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Dec 02 08:50:46.899528 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:46.899512888Z" level=info msg="Checking pod openshift-kube-scheduler_installer-6-ip-10-0-129-204.us-east-2.compute.internal for CNI network multus-cni-network (type=multus)"
Dec 02 08:50:46.907676 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:46.907616868Z" level=info msg="Ran pod sandbox 922f082d193b55071f363a1d3d387946a955bea3e8614e1577658fd621498d14 with infra container: openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal/POD" id=83fe7ed6-c9b7-4f00-9afc-760212ac3aed name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 02 08:50:46.912470 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:46.912447861Z" level=info msg="Creating container: openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal/installer" id=07941e6b-9315-4f7c-b9a2-7e55166f53c4 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Dec 02 08:50:47.232347 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:47.232302366Z" level=info msg="Created container 0376b6c22d4639e39aac9a93f2e1895a517a1ef09dc781fd3f3ee1f70d6f8bf5: openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal/installer" id=07941e6b-9315-4f7c-b9a2-7e55166f53c4 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Dec 02 08:50:47.250378 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:47.250319005Z" level=info msg="Started container" PID=24486 containerID=0376b6c22d4639e39aac9a93f2e1895a517a1ef09dc781fd3f3ee1f70d6f8bf5 description=openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal/installer id=1b2bf1bc-90e7-4917-8640-61b6a79dee4b name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=922f082d193b55071f363a1d3d387946a955bea3e8614e1577658fd621498d14
Dec 02 08:50:48.265838 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:48.261799038Z" level=info msg="Stopped container 0376b6c22d4639e39aac9a93f2e1895a517a1ef09dc781fd3f3ee1f70d6f8bf5: openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal/installer" id=793729ab-ee6b-4c7a-bc34-dc3876e1f157 name=/runtime.v1alpha2.RuntimeService/StopContainer
Dec 02 08:50:48.270138 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:48.270108504Z" level=info msg="Got pod network &{Name:installer-6-ip-10-0-129-204.us-east-2.compute.internal Namespace:openshift-kube-scheduler ID:922f082d193b55071f363a1d3d387946a955bea3e8614e1577658fd621498d14 UID:e613e2cb-e2e5-4a33-bac5-3d046411b324 NetNS:/var/run/netns/426e0e6e-e178-4a28-819c-3665ec20eba2 Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Dec 02 08:50:48.270328 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:48.270288839Z" level=info msg="Deleting pod openshift-kube-scheduler_installer-6-ip-10-0-129-204.us-east-2.compute.internal from CNI network \"multus-cni-network\" (type=multus)"
Dec 02 08:50:48.361546 ip-10-0-129-204 crio[1340]: 2021-12-02T08:50:48Z [verbose] Del: openshift-kube-scheduler:installer-6-ip-10-0-129-204.us-east-2.compute.internal:e613e2cb-e2e5-4a33-bac5-3d046411b324:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
Dec 02 08:50:49.131160 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:49.130557683Z" level=info msg="Removed container 0376b6c22d4639e39aac9a93f2e1895a517a1ef09dc781fd3f3ee1f70d6f8bf5: openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal/installer" id=7409cfa9-6dda-46be-8531-50548021deb8 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
```

crio_service.log for installer-7-ip-10-0-129-204.us-east-2.compute.internal:
```
Dec 02 08:50:24.329304 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:24.329240449Z" level=info msg="Running pod sandbox: openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal/POD" id=6d3f070e-3405-40fd-88db-1dcb3de2d0b3 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 02 08:50:24.349843 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:24.349805929Z" level=info msg="Got pod network &{Name:installer-7-ip-10-0-129-204.us-east-2.compute.internal Namespace:openshift-kube-scheduler ID:2e5a71fe0e6f9c14aad0bb4fec8c6afe94638337b0d94a50af70c019af1c5f81 UID:d2d64d26-6c1f-43f5-a080-e224dccbcb1c NetNS:/var/run/netns/8fb53701-b7c0-47c5-b9ac-cc066ab157d9 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Dec 02 08:50:24.349843 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:24.349843781Z" level=info msg="Adding pod openshift-kube-scheduler_installer-7-ip-10-0-129-204.us-east-2.compute.internal to CNI network \"multus-cni-network\" (type=multus)"
Dec 02 08:50:25.523921 ip-10-0-129-204 crio[1340]: 2021-12-02T08:50:25Z [verbose] Add: openshift-kube-scheduler:installer-7-ip-10-0-129-204.us-east-2.compute.internal:d2d64d26-6c1f-43f5-a080-e224dccbcb1c:openshift-sdn(openshift-sdn):eth0 {"cniVersion":"0.3.1","interfaces":[{"name":"eth0","sandbox":"/var/run/netns/8fb53701-b7c0-47c5-b9ac-cc066ab157d9"}],"ips":[{"version":"4","interface":0,"address":"10.129.0.50/23"}],"routes":[{"dst":"0.0.0.0/0","gw":"10.129.0.1"},{"dst":"224.0.0.0/4"},{"dst":"10.128.0.0/14"}],"dns":{}}
Dec 02 08:50:25.523921 ip-10-0-129-204 crio[1340]: I1202 08:50:25.508458   23113 event.go:282] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-kube-scheduler", Name:"installer-7-ip-10-0-129-204.us-east-2.compute.internal", UID:"d2d64d26-6c1f-43f5-a080-e224dccbcb1c", APIVersion:"v1", ResourceVersion:"15767", FieldPath:""}): type: 'Normal' reason: 'AddedInterface' Add eth0 [10.129.0.50/23] from openshift-sdn
Dec 02 08:50:25.526438 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:25.526398521Z" level=info msg="Got pod network &{Name:installer-7-ip-10-0-129-204.us-east-2.compute.internal Namespace:openshift-kube-scheduler ID:2e5a71fe0e6f9c14aad0bb4fec8c6afe94638337b0d94a50af70c019af1c5f81 UID:d2d64d26-6c1f-43f5-a080-e224dccbcb1c NetNS:/var/run/netns/8fb53701-b7c0-47c5-b9ac-cc066ab157d9 Networks:[] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Dec 02 08:50:25.526722 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:25.526702613Z" level=info msg="Checking pod openshift-kube-scheduler_installer-7-ip-10-0-129-204.us-east-2.compute.internal for CNI network multus-cni-network (type=multus)"
Dec 02 08:50:25.549754 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:25.549721898Z" level=info msg="Ran pod sandbox 2e5a71fe0e6f9c14aad0bb4fec8c6afe94638337b0d94a50af70c019af1c5f81 with infra container: openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal/POD" id=6d3f070e-3405-40fd-88db-1dcb3de2d0b3 name=/runtime.v1alpha2.RuntimeService/RunPodSandbox
Dec 02 08:50:25.555095 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:25.555065782Z" level=info msg="Creating container: openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal/installer" id=2bb0c487-9c13-4682-ab52-3923421b3926 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Dec 02 08:50:25.964434 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:25.964384878Z" level=info msg="Created container 18aecb076f180ca3220d5c90a4c6b8fc75e00bae87cae450430f5e4075a8e0e4: openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal/installer" id=2bb0c487-9c13-4682-ab52-3923421b3926 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Dec 02 08:50:25.997697 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:25.990253651Z" level=info msg="Started container" PID=23202 containerID=18aecb076f180ca3220d5c90a4c6b8fc75e00bae87cae450430f5e4075a8e0e4 description=openshift-kube-scheduler/installer-7-ip-10-0-129-204.us-east-2.compute.internal/installer id=9412e0dc-aae4-409d-9de8-c6844b5dd043 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=2e5a71fe0e6f9c14aad0bb4fec8c6afe94638337b0d94a50af70c019af1c5f81
Dec 02 08:50:27.013027 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:27.012998220Z" level=info msg="Got pod network &{Name:installer-7-ip-10-0-129-204.us-east-2.compute.internal Namespace:openshift-kube-scheduler ID:2e5a71fe0e6f9c14aad0bb4fec8c6afe94638337b0d94a50af70c019af1c5f81 UID:d2d64d26-6c1f-43f5-a080-e224dccbcb1c NetNS:/var/run/netns/8fb53701-b7c0-47c5-b9ac-cc066ab157d9 Networks:[{Name:multus-cni-network Ifname:eth0}] RuntimeConfig:map[multus-cni-network:{IP: MAC: PortMappings:[] Bandwidth:<nil> IpRanges:[]}] Aliases:map[]}"
Dec 02 08:50:27.013251 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:27.013221890Z" level=info msg="Deleting pod openshift-kube-scheduler_installer-7-ip-10-0-129-204.us-east-2.compute.internal from CNI network \"multus-cni-network\" (type=multus)"
Dec 02 08:50:27.251672 ip-10-0-129-204 crio[1340]: 2021-12-02T08:50:27Z [verbose] Del: openshift-kube-scheduler:installer-7-ip-10-0-129-204.us-east-2.compute.internal:d2d64d26-6c1f-43f5-a080-e224dccbcb1c:openshift-sdn:eth0 {"cniVersion":"0.3.1","name":"openshift-sdn","type":"openshift-sdn"}
```

Notice the following lines and the time they were logged:

```
Dec 02 08:50:14.597970 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:14.597891    1371 kubelet.go:2076] "SyncLoop ADD" source="api" pods=[openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal]
Dec 02 08:50:14.915700 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:14.915568    1371 kuberuntime_manager.go:483] "No sandbox for pod can be found. Need to start a new one" pod="openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal"
Dec 02 08:50:20.195393 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:20.195360    1371 kubelet.go:2092] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal]
Dec 02 08:50:46.897346 ip-10-0-129-204 hyperkube[1371]: I1202 08:50:46.897307    1371 kubelet.go:2092] "SyncLoop DELETE" source="api" pods=[openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal]
Dec 02 08:50:47.250378 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:47.250319005Z" level=info msg="Started container" PID=24486 containerID=0376b6c22d4639e39aac9a93f2e1895a517a1ef09dc781fd3f3ee1f70d6f8bf5 description=openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal/installer id=1b2bf1bc-90e7-4917-8640-61b6a79dee4b name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=922f082d193b55071f363a1d3d387946a955bea3e8614e1577658fd621498d14
Dec 02 08:50:48.265838 ip-10-0-129-204 crio[1340]: time="2021-12-02 08:50:48.261799038Z" level=info msg="Stopped container 0376b6c22d4639e39aac9a93f2e1895a517a1ef09dc781fd3f3ee1f70d6f8bf5: openshift-kube-scheduler/installer-6-ip-10-0-129-204.us-east-2.compute.internal/installer" id=793729ab-ee6b-4c7a-bc34-dc3876e1f157 name=/runtime.v1alpha2.RuntimeService/StopContainer
```

08:50:14.597891 - installer-6-ip-10-0-129-204.us-east-2.compute.internal created
08:50:14.915568 - installer-6-ip-10-0-129-204.us-east-2.compute.internal waiting for its sandbox
08:50:20.195360 - installer-6-ip-10-0-129-204.us-east-2.compute.internal requested to be deleted
08:50:24.001152 - installer-7-ip-10-0-129-204.us-east-2.compute.internal created
08:50:25.997697 - installer-7-ip-10-0-129-204.us-east-2.compute.internal installer container started to run
08:50:27.013251 - installer-7-ip-10-0-129-204.us-east-2.compute.internal installer container stopped running/got deleted
08:50:28.014171 - installer-7-ip-10-0-129-204.us-east-2.compute.internal waiting for its sandbox
08:50:47.250319 - installer-6-ip-10-0-129-204.us-east-2.compute.internal installer container started to run
08:50:48.265838 - installer-6-ip-10-0-129-204.us-east-2.compute.internal installer container stopped running

Based on the timing, the revision 7 installer ran and wrote the static pod manifests (per its container logs). However, it was followed by the revision 6 installer, which rewrote the manifests back to revision 6. This leaves the cluster in a state where a revision 7 pod is expected but will never be created, even though the installer controller sees the revision 7 installer as having completed successfully.
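
The eventual fix (library-go pull 1300, "wait for other installers to finish", linked above) targets exactly this race. A minimal Go sketch of the idea, not the actual implementation: an installer pod for a new revision only proceeds once installer pods of all other revisions have reached a terminal phase, so a slow old installer can no longer overwrite a newer manifest.

```go
package main

import "fmt"

// installerPod is a simplified stand-in for an installer pod's state.
type installerPod struct {
	revision int
	phase    string // "Pending", "Running", "Succeeded", "Failed"
}

// otherInstallersFinished reports whether every installer pod belonging to
// a revision other than target has reached a terminal phase.
func otherInstallersFinished(pods []installerPod, target int) bool {
	for _, p := range pods {
		if p.revision == target {
			continue
		}
		if p.phase != "Succeeded" && p.phase != "Failed" {
			return false
		}
	}
	return true
}

func main() {
	// The situation from the timeline above: the revision-6 installer is
	// still alive when revision 7 wants to roll out.
	pods := []installerPod{{revision: 6, phase: "Running"}}
	const target = 7
	if !otherInstallersFinished(pods, target) {
		fmt.Printf("delaying installer-%d: waiting for other installers to finish\n", target)
		return
	}
	fmt.Printf("starting installer-%d\n", target)
}
```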

--- Additional comment from Devan Goodwin on 2021-12-06 14:39:15 CET ---

Great info, thanks Jan; this looks like a pretty serious scheduler bug. Moving to kube-scheduler, as my kubelet guess was apparently incorrect.

Comment 1 Jan Chaloupka 2022-01-07 12:19:18 UTC
Due to higher-priority tasks I have not been able to resolve this issue in time. Moving to the next sprint.

Comment 2 Jan Chaloupka 2022-01-19 20:00:52 UTC
*** Bug 2042579 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2022-08-10 10:40:31 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

