Bug 1956081 - kube-apiserver setup fails while installing SNO due to port being in use [NEEDINFO]
Summary: kube-apiserver setup fails while installing SNO due to port being in use
Keywords:
Status: ASSIGNED
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Ryan Phillips
QA Contact: Sunil Choudhary
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-05-02 15:58 UTC by Eran Cohen
Modified: 2021-05-14 18:30 UTC
CC: 14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:
Target Upstream Version:
rphillips: needinfo? (ercohen)


Attachments
kubelet.log (14.73 MB, text/plain) - 2021-05-02 15:58 UTC, Eran Cohen
kube-apiserver.log (1.84 MB, text/plain) - 2021-05-02 16:03 UTC, Eran Cohen
crio.log (594.15 KB, text/plain) - 2021-05-02 16:03 UTC, Eran Cohen
kube-apiserver audit.log (16.81 MB, text/plain) - 2021-05-03 09:38 UTC, Eran Cohen
kube-apiserver-operator-0.log (418.64 KB, text/plain) - 2021-05-03 10:38 UTC, Eran Cohen
kube-apiserver-operator-1.log (3.78 MB, text/plain) - 2021-05-03 10:39 UTC, Eran Cohen


Links
Github: openshift/cluster-kube-apiserver-operator pull 1125 (open) - Bug 1956081: Update bootstrap manifest namespace for SNO compatibility - last updated 2021-05-03 17:52:55 UTC

Internal Links: 1956455

Description Eran Cohen 2021-05-02 15:58:44 UTC
Created attachment 1778593 [details]
kubelet.log

Description of problem:
kube-apiserver fails to start once the operator replaces the bootstrap static pod with the real OCP pod.

Version-Release number of selected component (if applicable):
4.8.0-fc0

How reproducible:
Intermittent; it happened a few times, but not on 100% of installs.

Steps to Reproduce:
1. Install SNO using https://qaprodauth.cloud.redhat.com/openshift/assisted-installer/clusters or use https://github.com/eranco74/bootstrap-in-place-poc
2. Wait for the bootstrap kube-apiserver static pod to be replaced by the operator; one way to watch for this is shown below.
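
To observe step 2 on the node, something along these lines works (illustrative only; any equivalent filter does):

```
# watch for the bootstrap kube-apiserver static pod being replaced by the operator-rendered one
watch -n 5 "sudo crictl ps -a | grep kube-apiserver"
```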

Actual results:
[root@vm-11-46 core]# crictl ps -a | grep kube-apiserver
91c2e440e0b59       39f312f5f06f9b367c5b242dfdedc2e496ea2bcd7075f0cfee3f246480f2bb98                                                                                                     About an hour ago   Exited              kube-apiserver-operator                       1                   d637eccd49012
7c917bc61f3f0       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2177994f1b0d139349f91719b3b55f220892f73ef4fe2d7417bfff321b0ae3cd                                               About an hour ago   Running             kube-apiserver-insecure-readyz                0                   c7cee32686a23
2866a498e8065       cc3c89a8e169fc56abf9155faae231c4830a73ff0cba8f407c0d9188ade6e227                                                                                                     About an hour ago   Exited              kube-apiserver                                0                   c7cee32686a23

Expected results:
Expected the pod to start running

Additional info:

Comment 1 Eran Cohen 2021-05-02 16:03:10 UTC
Created attachment 1778597 [details]
kube-apiserver.log

Comment 2 Eran Cohen 2021-05-02 16:03:39 UTC
Created attachment 1778598 [details]
crio.log

Comment 5 Eran Cohen 2021-05-03 09:38:08 UTC
Created attachment 1778888 [details]
kube-apiserver audit.log

Comment 6 Stefan Schimanski 2021-05-03 09:46:11 UTC
kube-apiserver-operator exiting is probably the issue. Please attach the log. I see it running for some time:

May 02 13:28:03 vm-11-45.lab.eng.tlv2.redhat.com hyperkube[1840]: I0502 13:28:03.529215    1840 kubelet.go:1984] "SyncLoop (PLEG): event for pod" pod="openshift-kube-apiserver-operator/kube-apiserver-operator-c57588fc-r6hbl" event=&{ID:3eb38758-c28e-42d8-ab0b-fcb5d0a5fe72 Type:ContainerStarted Data:f52ece4441134048326be6f61a1df90a8ff34f7586b1bf6737a01a5d99ae5cb3}
May 02 13:28:35 vm-11-45.lab.eng.tlv2.redhat.com hyperkube[8289]: I0502 13:28:35.557779    8289 kubelet.go:1984] "SyncLoop (PLEG): event for pod" pod="openshift-kube-apiserver-operator/kube-apiserver-operator-c57588fc-r6hbl" event=&{ID:3eb38758-c28e-42d8-ab0b-fcb5d0a5fe72 Type:ContainerStarted Data:f52ece4441134048326be6f61a1df90a8ff34f7586b1bf6737a01a5d99ae5cb3}
May 02 13:29:53 vm-11-45.lab.eng.tlv2.redhat.com hyperkube[8289]: I0502 13:29:53.955937    8289 scope.go:111] "RemoveContainer" containerID="f52ece4441134048326be6f61a1df90a8ff34f7586b1bf6737a01a5d99ae5cb3"
May 02 13:30:22 vm-11-45.lab.eng.tlv2.redhat.com hyperkube[8289]: I0502 13:30:22.095178    8289 kubelet.go:1984] "SyncLoop (PLEG): event for pod" pod="openshift-kube-apiserver-operator/kube-apiserver-operator-c57588fc-r6hbl" event=&{ID:3eb38758-c28e-42d8-ab0b-fcb5d0a5fe72 Type:ContainerDied Data:f52ece4441134048326be6f61a1df90a8ff34f7586b1bf6737a01a5d99ae5cb3}

Comment 8 Eran Cohen 2021-05-03 10:38:55 UTC
Created attachment 1778904 [details]
kube-apiserver-operator-0.log

Comment 9 Eran Cohen 2021-05-03 10:39:31 UTC
Created attachment 1778905 [details]
kube-apiserver-operator-1.log

Comment 10 Eran Cohen 2021-05-03 14:09:46 UTC
Reproduced; your hunch about the `setup` container was correct.
This is from the setup container's log (hard to find without the API...):

2021-05-03T14:05:30.177252876+00:00 stdout P Fixing audit permissions.
2021-05-03T14:05:30.180475386+00:00 stdout P Waiting for port :6443 and :6080 to be released.
2021-05-03T14:05:30.183950595+00:00 stdout P .
2021-05-03T14:05:31.188598661+00:00 stdout P .
2021-05-03T14:05:32.193115194+00:00 stdout P .
2021-05-03T14:05:33.197805882+00:00 stdout P .

It seems that the port wasn't released:

[root@test-infra-cluster-assisted-installer-master-0 core]# netstat -ntlp | egrep "6443|6080" 
tcp6       0      0 :::6080                 :::*                    LISTEN      2685/cluster-kube-a
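
For context, the "Waiting for port..." dots come from the setup container's wait loop. A minimal sketch of that loop, assuming it roughly matches what the operator-rendered static pod runs (the exact script in the manifest may differ), is:

```
# approximation of the setup container's startup script, not the actual manifest contents
echo -n "Fixing audit permissions."
chmod 0700 /var/log/kube-apiserver

echo -n "Waiting for port :6443 and :6080 to be released."
# loop as long as anything is still listening on either port
while [ -n "$(ss -Htan '( sport = :6443 or sport = :6080 )')" ]; do
  echo -n "."
  sleep 1
done
```

Since the leftover kube-apiserver-insecure-readyz container keeps :6080 bound, this loop never exits and the new kube-apiserver container never starts.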

Comment 12 Constantin Vultur 2021-05-04 08:52:55 UTC
Found the same issue on a VM with IPv6.

[core@master-0-0 ~]$ netstat -tulpn | grep 6008 
(Not all processes could be identified, non-owned process info
 will not be shown, you would have to be root to see it all.)
[core@master-0-0 ~]$ sudo netstat -tulpn | grep 6008 
[core@master-0-0 ~]$ sudo crictl logs 773530268d02d
Fixing audit permissions.[core@master-0-0 ~]$ sudo crictl logs 15329118936a1
Fixing audit permissions.Waiting for port :6443 and :6080 to be released..........................................................................................................[core@master-0-0 ~]$ 
[core@master-0-0 ~]$ 
[core@master-0-0 ~]$ sudo netstat -tulpn | grep 6080
tcp6       0      0 :::6080                 :::*                    LISTEN      3468/cluster-kube-a 
[core@master-0-0 ~]$ 

Stopping the insecure-readyz container allowed the installation to move further.
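
For reference, the workaround amounts to something like the following (the container ID is a placeholder; look it up with crictl ps first):

```
sudo crictl ps | grep kube-apiserver-insecure-readyz   # find the leftover readyz container ID
sudo crictl stop <container-id>                        # stop it so ports 6080/6443 are freed
sudo netstat -ntlp | egrep "6443|6080"                 # confirm nothing is still listening
```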

This was seen with version: 4.8.0-fc.0

Comment 13 Ke Wang 2021-05-07 06:06:41 UTC
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-05-06-210840   True        False         3h23m   Cluster version is 4.8.0-0.nightly-2021-05-06-210840

# oc get no
NAME      STATUS   ROLES           AGE     VERSION
sno-0-0   Ready    master,worker   3h46m   v1.21.0-rc.0+291e731

$ oc debug node/sno-0-0

sh-4.4# crictl ps -a | grep kube-apiserver
12dccd622bcb3       150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                             3 hours ago         Running             kube-apiserver-operator                       4                   80843ff7042b4
7c2d7378a1b43       150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                             3 hours ago         Running             kube-apiserver-check-endpoints                0                   86dc6aa64c7c3
5951f9661c78b       150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                             3 hours ago         Running             kube-apiserver-insecure-readyz                0                   86dc6aa64c7c3
e492a0b5cdcf6       150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                             3 hours ago         Running             kube-apiserver-cert-regeneration-controller   0                   86dc6aa64c7c3
bb769aa2f7514       150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                             3 hours ago         Running             kube-apiserver-cert-syncer                    0                   86dc6aa64c7c3
b74c304ed7efb       1cf193fc9385133ea3e2c636885fa0596ba77f138cd68125730bddd873f5ff5b                                                             3 hours ago         Running             kube-apiserver                                0                   86dc6aa64c7c3
3cdab6f3a4541       150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                             3 hours ago         Exited              kube-apiserver-operator                       3                   80843ff7042b4


sh-4.4# netstat -ntlp | egrep "6443|6080" 
tcp6       0      0 :::6080                 :::*                    LISTEN      174717/cluster-kube 
tcp6       0      0 :::6443                 :::*                    LISTEN      174538/kube-apiserv 

From the above, we can see that all the kube-apiserver and kube-apiserver-operator containers have been running for at least three hours, and none of the related ports are in use by other processes, so moving the bug to VERIFIED.

Comment 14 hanzhang 2021-05-11 22:25:52 UTC
We can still see this issue in 4.8.0-fc.3: kube-apiserver is not running when installing SNO.

It happened in 49 of our 1030 SNO cluster provisioning attempts.

```
# crictl ps --all | grep apiserver
0c5fcf8e22366       3edcc8028cb4df5a3699537d12bb099d43bc4e13a3c30e11efecdef83d97f5a4                                                         6 hours ago          Running             openshift-apiserver-operator                  1                   3643e4560b448
37631caa1da67       150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                         6 hours ago          Running             kube-apiserver-operator                       1                   91a6be88dc7f6
1b0d8b9f55274       3edcc8028cb4df5a3699537d12bb099d43bc4e13a3c30e11efecdef83d97f5a4                                                         6 hours ago          Exited              openshift-apiserver-operator                  0                   3643e4560b448
54c1130538464       150619bf7992913dc2ff940280fe02e51de8f3a5c74f7109bf41c9cf4716fbf8                                                         6 hours ago          Exited              kube-apiserver-operator                       0                   91a6be88dc7f6
979fd97440c4d       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:3c17486313a82d91260da08357f5a3139723ad50ecd1078fdff5ad4b99f40a6a   6 hours ago          Running             kube-apiserver-insecure-readyz                0                   9405df913fcf2
b0375f1e1d6fe       1cf193fc9385133ea3e2c636885fa0596ba77f138cd68125730bddd873f5ff5b                                                         6 hours ago          Exited              kube-apiserver                                0                   9405df913fcf2

```

Comment 15 Maru Newby 2021-05-11 22:54:20 UTC
The same symptom is recurring:

 - after bootstrap, ckao updates /etc/kubernetes/manifests/kube-apiserver-pod.yaml to the post-bootstrap apiserver pod definition
 - kubelet kills the kube-apiserver container but leaves the kube-apiserver-insecure-readyz container running and bound to port 6080
 - the new pod's init blocks indefinitely waiting for port 6080 to be released, preventing launch of the new apiserver

I had assumed the error was more frequent than 49/1030, and took the fix passing QA verification as an indication that the use of different namespaces for the bootstrap and post-bootstrap apiserver pods was the issue. I now realize the problem is likely a more subtle issue with how the kubelet handles updates to static pods. Reassigning to the node team for further investigation.
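
For anyone else hitting this, a rough way to confirm the stale binding on an affected node (a diagnostic sketch; the container ID is a placeholder and crictl inspect field names can vary by runtime) is to map the process holding :6080 back to the leftover readyz container:

```
sudo ss -ltnp 'sport = :6080'                            # which process still owns 6080 after the manifest swap
sudo crictl ps -a --name kube-apiserver-insecure-readyz # the readyz container left over from the bootstrap pod
sudo crictl inspect <container-id> | grep -iE 'sandboxid|"pid"'   # tie it back to the old pod sandbox and host PID
```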

Comment 21 Alex Krzos 2021-05-12 21:03:43 UTC
OVN-based SNO clusters have the same issue as well; we witnessed 2 of 60 clusters fail in one test. You can get the cluster API responsive in the same fashion as mentioned earlier, by deleting the kube-apiserver-insecure-readyz container to free up the ports. OVN clusters do seem to have more containers up and running, including etcd, which I believe was not witnessed on the SNO clusters with OpenShiftSDN.

