Bug 1851066

Summary: 4.4: A restarted kube-apiserver doesn't wait for the port to be available; crashloops
Product: OpenShift Container Platform Reporter: Stefan Schimanski <sttts>
Component: kube-apiserver Assignee: Stefan Schimanski <sttts>
Status: CLOSED ERRATA QA Contact: Ke Wang <kewang>
Severity: high Docs Contact:
Priority: high    
Version: 4.4 CC: aos-bugs, cscribne, kewang, mfojtik, openshift-bugzilla-robot, smalleni, vlaad, wking, xxia, zyu
Target Milestone: ---   
Target Release: 4.4.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1837992 Environment:
Last Closed: 2020-07-06 20:47:17 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1837992    
Bug Blocks:    

Comment 1 W. Trevor King 2020-06-25 14:41:43 UTC
Will this get a 4.4 backport of [1], to match the 4.6 bug 1837992?

[1]: https://github.com/openshift/origin/pull/25002

Comment 2 W. Trevor King 2020-06-25 16:16:14 UTC
Dropping the dup bug 1851071 from the blocker set, now that bug 1837992 is back to targeting 4.5.0.

Comment 6 Xingxing Xia 2020-06-28 02:49:31 UTC
The backport should also include the code for bug 1844288, but the PR above (https://github.com/openshift/cluster-kube-apiserver-operator/pull/891) does not. Moving directly to Assigned for that reason.

Comment 7 Ke Wang 2020-06-29 08:20:40 UTC
Bug 1851831, with PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/892, includes the related code.

Comment 9 Ke Wang 2020-07-01 04:52:09 UTC
This bug caused cluster installation to fail with the error “Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition”. To solve this, PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/892 needs to land in the latest payload.

Comment 10 Xingxing Xia 2020-07-01 07:43:17 UTC
*** Bug 1851831 has been marked as a duplicate of this bug. ***

Comment 11 Xingxing Xia 2020-07-01 07:44:53 UTC
I'm closing bug 1851831 as a dup of bug 1851066 and moving 1851066 to Assigned because:
1. The errata with the 1851066 code cannot be shipped without the 1851831 code; otherwise customers may hit the issue in 1851831. Bug 1851066#c6 moved it to Assigned for just this reason.
2. So far, the errata's new candidate build 4.4.0-0.nightly-2020-06-30-153030, as shown in bug 1851066#c8, still does not include the 1851831 code. Neither does the latest nightly, 4.4.0-0.nightly-2020-07-01-051030.

So I'm closing 1851831 as a dup of 1851066, expecting the 1851831 PR link attached to 1851066 to be included in the errata, as 1851066#c6 wanted, rather than keeping separate bugs that cause a release gap (which brings trouble to both ART and the QE errata owner) and a potential customer issue.

Comment 13 Scott Dodson 2020-07-01 12:41:27 UTC
Adding https://github.com/openshift/cluster-kube-apiserver-operator/pull/892 to the linked PRs so that when/if we decide this needs to go to 4.3.z we have a complete set of PRs necessary. Though I think at this point it doesn't make sense to backport to 4.3.

Comment 15 Ke Wang 2020-07-02 02:31:31 UTC
Verification steps and output:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-07-01-085659   True        False         16m     Cluster version is 4.4.0-0.nightly-2020-07-01-085659

$ kubeapiserver_pod=$(oc get pod -n openshift-kube-apiserver | grep kube-apiserver | head -1 | awk '{print $1}')

$ oc get pods -n openshift-kube-apiserver $kubeapiserver_pod -o yaml | grep -n -C8 'Waiting for port :6443'
26-spec:
27-  containers:
28-  - args:
29-    - |-
30-      if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then
31-        echo "Copying system trust bundle"
32-        cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
33-      fi
34:      echo -n "Waiting for port :6443 to be released."
35-      tries=0
36-      while [ -n "$(ss -Htan '( sport = 6443 )')" ]; do
37-        echo -n "."
38-        sleep 1
39-        (( tries += 1 ))
40-        if [[ "${tries}" -gt 105 ]]; then
41-          echo "timed out waiting for port :6443 to be released"
42-          exit 1
--
--
186-  dnsPolicy: ClusterFirst
187-  enableServiceLinks: true
188-  hostNetwork: true
189-  initContainers:
190-  - args:
191-    - |
192-      echo -n "Fixing audit permissions."
193-      chmod 0700 /var/log/kube-apiserver
194:      echo -n "Waiting for port :6443 and :6080 to be released."
195-      while [ -n "$(ss -Htan '( sport = 6443 or sport = 6080 )')" ]; do
196-        echo -n "."
197-        sleep 1
198-      done
199-    command:
200-    - /usr/bin/timeout
201-    - "105"
202-    - /bin/bash

The fix works as expected; moving the bug to VERIFIED.
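
The wait loop shown in the pod spec can be sketched as a standalone shell function. This is only an illustration, not the operator's actual script: the names port_in_use and wait_for_port_release and the max_tries parameter are hypothetical. Only the ss probe, the one-second sleep, and the value 105 are taken from the output in this comment.

```shell
# Sketch only: function names and parameters are illustrative assumptions.

# Succeed while any TCP socket on this host still uses the given source port.
# Uses the same `ss -Htan` filter that appears in the pod spec.
port_in_use() {
  local port="$1"
  [ -n "$(ss -Htan "( sport = ${port} )")" ]
}

# Poll once per second until the port is free; give up after max_tries polls.
wait_for_port_release() {
  local port="$1" max_tries="${2:-105}" tries=0
  echo -n "Waiting for port :${port} to be released."
  while port_in_use "${port}"; do
    if [ "${tries}" -ge "${max_tries}" ]; then
      echo
      echo "timed out waiting for port :${port} to be released" >&2
      return 1
    fi
    echo -n "."
    sleep 1
    tries=$((tries + 1))
  done
  echo
  return 0
}
```

Under these assumptions, a container entrypoint could run something like `wait_for_port_release 6443 || exit 1` before starting the server, so a restart does not race the previous process for the listening socket.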

Comment 16 Ke Wang 2020-07-02 02:38:33 UTC
To check for crashloops: RESTARTS is 0 for every pod, as expected.

$ oc get pods -A -l apiserver
NAMESPACE                  NAME                                                         READY   STATUS    RESTARTS   AGE
openshift-apiserver        apiserver-7c75bb7d79-2z4sr                                   1/1     Running   0          38m
openshift-apiserver        apiserver-7c75bb7d79-s46zv                                   1/1     Running   0          38m
openshift-apiserver        apiserver-7c75bb7d79-vkthh                                   1/1     Running   0          37m
openshift-kube-apiserver   kube-apiserver-ip-10-0-.-75....compute.internal              4/4     Running   0          28m
openshift-kube-apiserver   kube-apiserver-ip-10-0-.-222....compute.internal             4/4     Running   0          24m
openshift-kube-apiserver   kube-apiserver-ip-10-0-.-7....compute.internal               4/4     Running   0          29m
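
The same RESTARTS check could be scripted rather than read by eye. A minimal sketch, assuming the six-column layout printed by `oc get pods -A` in this comment (NAMESPACE, NAME, READY, STATUS, RESTARTS, AGE); count_restarted_pods is a hypothetical helper, not an oc subcommand.

```shell
# Sketch only: count_restarted_pods is a hypothetical helper.

# Read `oc get pods -A` style rows on stdin and print how many pods
# report a non-zero value in the fifth (RESTARTS) column.
count_restarted_pods() {
  awk '$1 != "NAMESPACE" && $5 + 0 > 0 { n++ } END { print n + 0 }'
}
```

It would be used as `oc get pods -A -l apiserver | count_restarted_pods`; any result other than 0 means at least one apiserver pod has restarted.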

Comment 19 errata-xmlrpc 2020-07-06 20:47:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2786