Created attachment 1778593 [details]
kubelet.log

Description of problem:
kube-apiserver fails to start once the operator replaces the bootstrap static pod with the real OCP pod.

Version-Release number of selected component (if applicable):
4.8.0-fc0

How reproducible:
Happens occasionally, not 100% of the time.

Steps to Reproduce:
1. Install SNO using https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters or use https://github.com/eranco74/bootstrap-in-place-poc
2. Wait for the bootstrap kube-apiserver static pod to be replaced by the operator.

Actual results:
[root@vm-11-46 core]# crictl ps -a | grep kube-apiserver
91c2e440e0b59   39f312f5f06f9b367c5b242dfdedc2e496ea2bcd7075f0cfee3f246480f2bb98                                                             About an hour ago   Exited    kube-apiserver-operator          1   d637eccd49012
7c917bc61f3f0   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2177994f1b0d139349f91719b3b55f220892f73ef4fe2d7417bfff321b0ae3cd       About an hour ago   Running   kube-apiserver-insecure-readyz   0   c7cee32686a23
2866a498e8065   cc3c89a8e169fc56abf9155faae231c4830a73ff0cba8f407c0d9188ade6e227                                                             About an hour ago   Exited    kube-apiserver                   0   c7cee32686a23

Expected results:
The kube-apiserver pod starts and keeps running.

Additional info:
Created attachment 1778597 [details] kube-apiserver.log
Created attachment 1778598 [details] crio.log
Created attachment 1778888 [details] kube-apiserver audit.log
kube-apiserver-operator exiting is probably the issue. Please attach the log. I see it running for some time:

May 02 13:28:03 vm-11-45.lab.eng.tlv2.redhat.com hyperkube[1840]: I0502 13:28:03.529215    1840 kubelet.go:1984] "SyncLoop (PLEG): event for pod" pod="openshift-kube-apiserver-operator/kube-apiserver-operator-c57588fc-r6hbl" event=&{ID:3eb38758-c28e-42d8-ab0b-fcb5d0a5fe72 Type:ContainerStarted Data:f52ece4441134048326be6f61a1df90a8ff34f7586b1bf6737a01a5d99ae5cb3}
May 02 13:28:35 vm-11-45.lab.eng.tlv2.redhat.com hyperkube[8289]: I0502 13:28:35.557779    8289 kubelet.go:1984] "SyncLoop (PLEG): event for pod" pod="openshift-kube-apiserver-operator/kube-apiserver-operator-c57588fc-r6hbl" event=&{ID:3eb38758-c28e-42d8-ab0b-fcb5d0a5fe72 Type:ContainerStarted Data:f52ece4441134048326be6f61a1df90a8ff34f7586b1bf6737a01a5d99ae5cb3}
May 02 13:29:53 vm-11-45.lab.eng.tlv2.redhat.com hyperkube[8289]: I0502 13:29:53.955937    8289 scope.go:111] "RemoveContainer" containerID="f52ece4441134048326be6f61a1df90a8ff34f7586b1bf6737a01a5d99ae5cb3"
May 02 13:30:22 vm-11-45.lab.eng.tlv2.redhat.com hyperkube[8289]: I0502 13:30:22.095178    8289 kubelet.go:1984] "SyncLoop (PLEG): event for pod" pod="openshift-kube-apiserver-operator/kube-apiserver-operator-c57588fc-r6hbl" event=&{ID:3eb38758-c28e-42d8-ab0b-fcb5d0a5fe72 Type:ContainerDied Data:f52ece4441134048326be6f61a1df90a8ff34f7586b1bf6737a01a5d99ae5cb3}
Created attachment 1778904 [details] kube-apiserver-operator-0.log
Created attachment 1778905 [details] kube-apiserver-operator-1.log
Reproduced, your hunch about the `setup` container was correct. This is from the setup container pod (hard to find without the API...):

2021-05-03T14:05:30.177252876+00:00 stdout P Fixing audit permissions.
2021-05-03T14:05:30.180475386+00:00 stdout P Waiting for port :6443 and :6080 to be released.
2021-05-03T14:05:30.183950595+00:00 stdout P .
2021-05-03T14:05:31.188598661+00:00 stdout P .
2021-05-03T14:05:32.193115194+00:00 stdout P .
2021-05-03T14:05:33.197805882+00:00 stdout P .

It seems the port wasn't released:

[root@test-infra-cluster-assisted-installer-master-0 core]# netstat -ntlp | egrep "6443|6080"
tcp6       0      0 :::6080                 :::*                    LISTEN      2685/cluster-kube-a
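For reference, the setup init container's behavior is roughly the following — a minimal sketch reconstructed from the log output above, not the exact script shipped in the kube-apiserver static pod manifest (the audit-log path and the precise port test are assumptions):

```
#!/bin/bash
# Rough reconstruction of the setup init container, inferred from its
# log output; the real script lives in the static pod manifest and may
# differ in detail.
echo -n "Fixing audit permissions."
chmod 0700 /var/log/kube-apiserver

echo -n "Waiting for port :6443 and :6080 to be released."
# This is the loop that prints dots forever when the stale
# kube-apiserver-insecure-readyz container keeps :6080 bound.
while ss -Htnl '( sport = :6443 or sport = :6080 )' | grep -q .; do
  echo -n "."
  sleep 1
done
```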
Found the same issue on a VM with IPv6:

[core@master-0-0 ~]$ netstat -tulpn | grep 6008
(Not all processes could be identified, non-owned process info will not be shown, you would have to be root to see it all.)
[core@master-0-0 ~]$ sudo netstat -tulpn | grep 6008
[core@master-0-0 ~]$ sudo crictl logs 773530268d02d
Fixing audit permissions.
[core@master-0-0 ~]$ sudo crictl logs 15329118936a1
Fixing audit permissions.Waiting for port :6443 and :6080 to be released..........................................................................................................
[core@master-0-0 ~]$ sudo netstat -tulpn | grep 6080
tcp6       0      0 :::6080                 :::*                    LISTEN      3468/cluster-kube-a

(Note: the first two greps use 6008, a typo for 6080, which is why they returned nothing.)

Stopping the insecure-readyz container allowed the installation to move further.

This was seen with version 4.8.0-fc.0.
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-06-210840   True        False         3h23m   Cluster version is 4.8.0-0.nightly-2021-05-06-210840

# oc get no
NAME      STATUS   ROLES           AGE     VERSION
sno-0-0   Ready    master,worker   3h46m   v1.21.0-rc.0+291e731

$ oc debug node/sno-0-0
sh-4.4# crictl ps -a | grep kube-apiserver
12dccd622bcb3   150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8   3 hours ago   Running   kube-apiserver-operator                       4   80843ff7042b4
7c2d7378a1b43   150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8   3 hours ago   Running   kube-apiserver-check-endpoints                0   86dc6aa64c7c3
5951f9661c78b   150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8   3 hours ago   Running   kube-apiserver-insecure-readyz                0   86dc6aa64c7c3
e492a0b5cdcf6   150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8   3 hours ago   Running   kube-apiserver-cert-regeneration-controller   0   86dc6aa64c7c3
bb769aa2f7514   150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8   3 hours ago   Running   kube-apiserver-cert-syncer                    0   86dc6aa64c7c3
b74c304ed7efb   1cf193fc9385133ea3e2c636885fa0596ba77f138cd68125730bddd873f5ff5b   3 hours ago   Running   kube-apiserver                                0   86dc6aa64c7c3
3cdab6f3a4541   150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8   3 hours ago   Exited    kube-apiserver-operator                       3   80843ff7042b4

sh-4.4# netstat -ntlp | egrep "6443|6080"
tcp6       0      0 :::6080                 :::*                    LISTEN      174717/cluster-kube
tcp6       0      0 :::6443                 :::*                    LISTEN      174538/kube-apiserv

From the above, all containers of kube-apiserver and kube-apiserver-operator have been running for at least three hours, and ports 6443 and 6080 are held by the expected processes, so moving the bug to VERIFIED.
Still seeing this issue in 4.8.0-fc.3: kube-apiserver is not running when installing SNO. It happened in 49 of our 1030 SNO cluster provisioning runs.

```
# crictl ps --all | grep apiserver
0c5fcf8e22366   3edcc8028cb4df5a3699537d12bb099d43bc4e13a3c30e11efecdef83d97f5a4                                                         6 hours ago   Running   openshift-apiserver-operator     1   3643e4560b448
37631caa1da67   150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                         6 hours ago   Running   kube-apiserver-operator          1   91a6be88dc7f6
1b0d8b9f55274   3edcc8028cb4df5a3699537d12bb099d43bc4e13a3c30e11efecdef83d97f5a4                                                         6 hours ago   Exited    openshift-apiserver-operator     0   3643e4560b448
54c1130538464   150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                         6 hours ago   Exited    kube-apiserver-operator          0   91a6be88dc7f6
979fd97440c4d   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c17486313a82d91260da08357f5a3139723ad50ecd1078fdff5ad4b99f40a6a   6 hours ago   Running   kube-apiserver-insecure-readyz   0   9405df913fcf2
b0375f1e1d6fe   1cf193fc9385133ea3e2c636885fa0596ba77f138cd68125730bddd873f5ff5b                                                         6 hours ago   Exited    kube-apiserver                   0   9405df913fcf2
```
The same symptom is recurring:

- after bootstrap, the cluster-kube-apiserver-operator (ckao) updates /etc/kubernetes/manifests/kube-apiserver-pod.yaml to the post-bootstrap apiserver pod definition
- the kubelet kills the kube-apiserver container but leaves the kube-apiserver-insecure-readyz container running and bound to port 6080
- the new pod's init container blocks indefinitely waiting for port 6080 to be released, preventing launch of the new apiserver

I had assumed the error occurred more frequently than 49/1030, and that the fix passing QA verification indicated the issue was the use of different namespaces for the bootstrap and post-bootstrap apiserver pods. I realize now that the problem is likely a more subtle issue with how the kubelet handles updates to static pods. Reassigning to the node team for further investigation.
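For anyone triaging, the stuck state can be confirmed on the node with something like this (an assumed diagnostic, not taken from the report; run as root):

```
# Check whether the old insecure-readyz container outlived the manifest
# swap and still owns port 6080, which blocks the new pod's setup
# init container.
crictl ps --name kube-apiserver-insecure-readyz
ss -tlnp '( sport = :6080 )'
```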
OVN-built SNO clusters hit the same issue as well; we witnessed 2 of 60 clusters fail in one test. You can get the cluster API responsive in the same fashion as mentioned earlier, by deleting the kube-apiserver-insecure-readyz container to free up the ports (a sketch follows below). OVN clusters do seem to have more containers up and running, including etcd, which I believe was not witnessed on the SNO clusters with OpenShiftSDN.
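A sketch of that workaround, run as root on the affected node. The container name matches the listings above, and the crictl subcommands are standard, but treat this as an illustration rather than a vetted procedure:

```
# Stop and remove the leftover insecure-readyz container so ports
# 6080/6443 are released and the new pod's setup init container can
# finish.
CID=$(crictl ps -q --name kube-apiserver-insecure-readyz)
crictl stop "$CID"
crictl rm "$CID"
```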
$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-13-101614   True        False         3h53m   Cluster version is 4.8.0-0.nightly-2021-06-13-101614

$ oc get nodes -o wide
NAME                                       STATUS   ROLES           AGE     VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-59-58.us-east-2.compute.internal   Ready    master,worker   4h10m   v1.21.0-rc.0+120883f   10.0.59.58    <none>        Red Hat Enterprise Linux CoreOS 48.84.202106120900-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-8.rhaos4.8.git7e41de0.el8

$ oc get pods -n openshift-kube-apiserver
NAME                                                         READY   STATUS      RESTARTS   AGE
installer-2-ip-10-0-59-58.us-east-2.compute.internal         0/1     Completed   0          4h9m
installer-3-ip-10-0-59-58.us-east-2.compute.internal         0/1     Completed   0          4h7m
installer-4-ip-10-0-59-58.us-east-2.compute.internal         0/1     Completed   0          4h6m
installer-5-ip-10-0-59-58.us-east-2.compute.internal         0/1     Completed   0          4h5m
installer-6-ip-10-0-59-58.us-east-2.compute.internal         0/1     Completed   0          4h4m
installer-7-ip-10-0-59-58.us-east-2.compute.internal         0/1     Completed   0          4h1m
kube-apiserver-ip-10-0-59-58.us-east-2.compute.internal      5/5     Running     0          3h56m
revision-pruner-6-ip-10-0-59-58.us-east-2.compute.internal   0/1     Completed   0          4h1m
revision-pruner-7-ip-10-0-59-58.us-east-2.compute.internal   0/1     Completed   0          3h54m

sh-4.4# crictl ps -a | grep kube-apiserver
35d1b0ffc201d   ce3ad50558f1f253ca768163cb79008aea45cfe13b3eea02d19c31e878e3a523   4 hours ago   Running   kube-apiserver-operator                       3   f013c89d8326f
6368cd78253d2   ce3ad50558f1f253ca768163cb79008aea45cfe13b3eea02d19c31e878e3a523   4 hours ago   Running   kube-apiserver-check-endpoints                0   eebadc52288f7
e43750a8e69ba   ce3ad50558f1f253ca768163cb79008aea45cfe13b3eea02d19c31e878e3a523   4 hours ago   Running   kube-apiserver-insecure-readyz                0   eebadc52288f7
39e387d917234   ce3ad50558f1f253ca768163cb79008aea45cfe13b3eea02d19c31e878e3a523   4 hours ago   Running   kube-apiserver-cert-regeneration-controller   0   eebadc52288f7
a0dcb777e19fa   ce3ad50558f1f253ca768163cb79008aea45cfe13b3eea02d19c31e878e3a523   4 hours ago   Running   kube-apiserver-cert-syncer                    0   eebadc52288f7
5cf88d67e1a49   0276ab94cc8e4bdf5ca5abfd8c8af757e871cb56668073e4508e9f22f3dae029   4 hours ago   Running   kube-apiserver                                0   eebadc52288f7
a25cfb24f4aae   ce3ad50558f1f253ca768163cb79008aea45cfe13b3eea02d19c31e878e3a523   4 hours ago   Exited    kube-apiserver-operator                       2   f013c89d8326f

sh-4.4# netstat -ntlp | egrep "6443|6080"
tcp6       0      0 :::6443                 :::*                    LISTEN      86436/kube-apiserve
tcp6       0      0 :::6080                 :::*                    LISTEN      86765/cluster-kube-
While scale-testing SNO installations, the issue reproduced (multiple times) with 4.8.0-fc.9:

[root@sno01026 core]# crictl ps -a | grep kube-apiserver
3cd17620dbcf4   974fb7506a86cacbca89c7400a6376014937efa54ca42ee4d841eb7c8a347a17                                                         10 hours ago   Exited    kube-apiserver-operator          0    75820b8c9e0e6
5897782113bd5   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a904afec0717bae17a923d552ccc706a2c3b611d8bef812c61cf6fcae37e243   10 hours ago   Running   kube-apiserver-insecure-readyz   0    eef75e5837035
82dfce86d87e4   114ce83a67d55310c214de1cf069948d11bb4f674ce29623f87f1cbf43dbbf91                                                         10 hours ago   Exited    kube-apiserver                   0    eef75e5837035

[root@sno01026 core]# crictl ps -a | grep setup
7cd5b34a2f4d9   114ce83a67d55310c214de1cf069948d11bb4f674ce29623f87f1cbf43dbbf91                                                         2 minutes ago   Exited   setup                            89   714b13d7d4d2e
11f8d736e3306   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:82e283359e4209d649feecc8fee3b3663e205c1537cbd98d921f003770714662   10 hours ago    Exited   setup                            0    eef75e5837035

[root@sno01026 core]# crictl logs 7cd5b34a2f4d9
Fixing audit permissions.Waiting for port :6443 and :6080 to be released..........................................................................................................
The port being used is 6080:

[root@sno01026 core]# netstat -ntlp | egrep "6080|6443"
tcp6       0      0 :::6080                 :::*                    LISTEN      2810/cluster-kube-a
The process is still running...

[root@sno01026 core]# ps -ef | grep 2810
root        2810    2798  0 00:46 ?        00:00:03 cluster-kube-apiserver-operator insecure-readyz --insecure-port=6080 --delegate-url=https://localhost:6443/readyz
root      455717  455416  0 11:52 pts/0    00:00:00 grep --color=auto 2810
The pod is running as well:

[root@sno01026 core]# crictl ps | grep kube-apiserver
5897782113bd5   quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:1a904afec0717bae17a923d552ccc706a2c3b611d8bef812c61cf6fcae37e243   11 hours ago   Running   kube-apiserver-insecure-readyz   0   eef75e5837035
Checked by running multiple SNO installations with 4.9.0-0.nightly-2021-06-20-015701:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.0-0.nightly-2021-06-20-015701   True        False         84m     Cluster version is 4.9.0-0.nightly-2021-06-20-015701

$ oc get nodes -o wide
NAME                                        STATUS   ROLES           AGE    VERSION                INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                CONTAINER-RUNTIME
ip-10-0-57-251.us-east-2.compute.internal   Ready    master,worker   110m   v1.21.0-rc.0+120883f   10.0.57.251   <none>        Red Hat Enterprise Linux CoreOS 48.84.202106162024-0 (Ootpa)   4.18.0-305.3.1.el8_4.x86_64   cri-o://1.21.1-9.rhaos4.8.gitdfcd2b6.el8

$ oc get pods -n openshift-kube-apiserver
NAME                                                           READY   STATUS      RESTARTS   AGE
installer-10-ip-10-0-57-251.us-east-2.compute.internal         0/1     Completed   0          89m
installer-4-ip-10-0-57-251.us-east-2.compute.internal          0/1     Completed   0          104m
installer-5-ip-10-0-57-251.us-east-2.compute.internal          0/1     Completed   0          104m
installer-6-ip-10-0-57-251.us-east-2.compute.internal          0/1     Completed   0          98m
installer-7-ip-10-0-57-251.us-east-2.compute.internal          0/1     Completed   0          98m
installer-8-ip-10-0-57-251.us-east-2.compute.internal          0/1     Completed   0          90m
installer-9-ip-10-0-57-251.us-east-2.compute.internal          0/1     Completed   0          90m
kube-apiserver-ip-10-0-57-251.us-east-2.compute.internal       5/5     Running     0          85m
revision-pruner-10-ip-10-0-57-251.us-east-2.compute.internal   0/1     Completed   0          84m
revision-pruner-5-ip-10-0-57-251.us-east-2.compute.internal    0/1     Completed   0          99m
revision-pruner-7-ip-10-0-57-251.us-east-2.compute.internal    0/1     Completed   0          92m

$ oc debug node/ip-10-0-57-251.us-east-2.compute.internal
Starting pod/ip-10-0-57-251us-east-2computeinternal-debug ...
...
sh-4.4# crictl ps -a | grep kube-apiserver
b0ab23ad959bf   07ed1912e5129c7a31d357971e36f570346d3b7be688ed0daf7bfa8a6f5899a2   About an hour ago   Running   kube-apiserver-check-endpoints                0   3695ae008ff38
e39937eb50152   07ed1912e5129c7a31d357971e36f570346d3b7be688ed0daf7bfa8a6f5899a2   About an hour ago   Running   kube-apiserver-insecure-readyz                0   3695ae008ff38
d063270fc7437   07ed1912e5129c7a31d357971e36f570346d3b7be688ed0daf7bfa8a6f5899a2   About an hour ago   Running   kube-apiserver-cert-regeneration-controller   0   3695ae008ff38
57c5586bb7a08   07ed1912e5129c7a31d357971e36f570346d3b7be688ed0daf7bfa8a6f5899a2   About an hour ago   Running   kube-apiserver-cert-syncer                    0   3695ae008ff38
5cfb6b61f1aba   ff68e4d95a5dec3517731fed9e921cf1d0ea592193b8e4125c24a72ccb445e8b   About an hour ago   Running   kube-apiserver                                0   3695ae008ff38
87b22000447fc   07ed1912e5129c7a31d357971e36f570346d3b7be688ed0daf7bfa8a6f5899a2   About an hour ago   Running   kube-apiserver-operator                       3   26756935b4dca
1edc68ef6c5e2   07ed1912e5129c7a31d357971e36f570346d3b7be688ed0daf7bfa8a6f5899a2   2 hours ago         Exited    kube-apiserver-operator                       2   26756935b4dca

sh-4.4# netstat -ntlp | egrep "6443|6080"
tcp6       0      0 :::6080                 :::*                    LISTEN      114963/cluster-kube
tcp6       0      0 :::6443                 :::*                    LISTEN      114818/kube-apiserv
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759