Bug 1851066 - 4.4: A restarted kube-apiserver doesn't wait for the port to be available; crashloops
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.4.z
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Duplicates: 1851831
Depends On: 1837992
Blocks:
 
Reported: 2020-06-25 14:14 UTC by Stefan Schimanski
Modified: 2020-07-06 20:47 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1837992
Environment:
Last Closed: 2020-07-06 20:47:17 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-kube-apiserver-operator pull 891 | 0 | None | closed | Bug 1851066: 4.4: use "ss" instead of "lsof" to check port; check on container start | 2020-10-29 20:34:42 UTC
Github | openshift cluster-kube-apiserver-operator pull 892 | 0 | None | closed | [release-4.4] Bug 1851831: static pod: don't wait for 6080 in apiserver container | 2020-10-29 20:34:42 UTC
Github | openshift origin pull 25216 | 0 | None | closed | Bug 1851066: images/hyperkube: install iproute | 2020-10-29 20:34:43 UTC
Red Hat Product Errata | RHBA-2020:2786 | 0 | None | None | None | 2020-07-06 20:47:39 UTC

Comment 1 W. Trevor King 2020-06-25 14:41:43 UTC
Will this get a 4.4 backport of [1], to match the 4.6 bug 1837992?

[1]: https://github.com/openshift/origin/pull/25002

Comment 2 W. Trevor King 2020-06-25 16:16:14 UTC
Dropping the dup bug 1851071 from the blocker set, now that bug 1837992 is back to targeting 4.5.0.

Comment 6 Xingxing Xia 2020-06-28 02:49:31 UTC
The backport should also include the code from bug 1844288, but the PR above, https://github.com/openshift/cluster-kube-apiserver-operator/pull/891, does not. Moving directly to Assigned for that reason.

Comment 7 Ke Wang 2020-06-29 08:20:40 UTC
Bug 1851831, with PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/892, includes the related code.

Comment 9 Ke Wang 2020-07-01 04:52:09 UTC
This bug caused cluster installation to fail with the error “Bootstrap failed to complete: failed to wait for bootstrapping to complete: timed out waiting for the condition”. To resolve this, PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/892 needs to land in the latest payload.

Comment 10 Xingxing Xia 2020-07-01 07:43:17 UTC
*** Bug 1851831 has been marked as a duplicate of this bug. ***

Comment 11 Xingxing Xia 2020-07-01 07:44:53 UTC
I'm closing bug 1851831 as a duplicate of bug 1851066 and moving 1851066 to Assigned because:
1. The errata carrying the 1851066 code cannot ship without the 1851831 code; otherwise customers may hit the issue described in 1851831. Bug 1851066#c6 moved this bug to Assigned for the same reason.
2. So far, the errata's new candidate build 4.4.0-0.nightly-2020-06-30-153030, shown in bug 1851066#c8, still does not include the 1851831 code. Neither does the latest nightly, 4.4.0-0.nightly-2020-07-01-051030.

So I'm closing 1851831 as a duplicate of 1851066, expecting the 1851831 PR linked in 1851066 to be included in the errata, as 1851066#c6 requested. Keeping separate bugs would cause a release gap (which creates trouble for both ART and the QE errata owner) and a potential customer issue.

Comment 13 Scott Dodson 2020-07-01 12:41:27 UTC
Adding https://github.com/openshift/cluster-kube-apiserver-operator/pull/892 to the linked PRs so that when/if we decide this needs to go to 4.3.z we have a complete set of PRs necessary. Though I think at this point it doesn't make sense to backport to 4.3.

Comment 15 Ke Wang 2020-07-02 02:31:31 UTC
Verification steps are below:

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-07-01-085659   True        False         16m     Cluster version is 4.4.0-0.nightly-2020-07-01-085659

$ kubeapiserver_pod=$(oc get pod -n openshift-kube-apiserver | grep kube-apiserver | head -1 | awk '{print $1}')

$ oc get pods -n openshift-kube-apiserver $kubeapiserver_pod -o yaml | grep -n -C8 'Waiting for port :6443'
26-spec:
27-  containers:
28-  - args:
29-    - |-
30-      if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then
31-        echo "Copying system trust bundle"
32-        cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
33-      fi
34:      echo -n "Waiting for port :6443 to be released."
35-      tries=0
36-      while [ -n "$(ss -Htan '( sport = 6443 )')" ]; do
37-        echo -n "."
38-        sleep 1
39-        (( tries += 1 ))
40-        if [[ "${tries}" -gt 105 ]]; then
41-          echo "timed out waiting for port :6443 to be released"
42-          exit 1
--
--
186-  dnsPolicy: ClusterFirst
187-  enableServiceLinks: true
188-  hostNetwork: true
189-  initContainers:
190-  - args:
191-    - |
192-      echo -n "Fixing audit permissions."
193-      chmod 0700 /var/log/kube-apiserver
194:      echo -n "Waiting for port :6443 and :6080 to be released."
195-      while [ -n "$(ss -Htan '( sport = 6443 or sport = 6080 )')" ]; do
196-        echo -n "."
197-        sleep 1
198-      done
199-    command:
200-    - /usr/bin/timeout
201-    - "105"
202-    - /bin/bash

The fix works as expected; moving the bug to VERIFIED.
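
For readers who want to try the wait logic outside the pod, here is a minimal sketch of the container's loop as a reusable bash function. It assumes `ss` (from the iproute package) is installed. The names `port_in_use` and `wait_for_port_release`, and the `max_tries` and `interval` parameters, are hypothetical and do not come from the operator; only the `ss -Htan` filter and the 105-try limit are taken from the output above.

```shell
#!/usr/bin/env bash
# Sketch only: function and parameter names are illustrative assumptions,
# not part of cluster-kube-apiserver-operator.

# Succeeds while any TCP socket still has the given source port.
# ss flags: -H no header, -t TCP, -a all states, -n numeric.
port_in_use() {
  [ -n "$(ss -Htan "( sport = $1 )")" ]
}

# Polls until the port is free; returns 1 once more than max_tries polls fail.
wait_for_port_release() {
  local port=$1 max_tries=${2:-105} interval=${3:-1} tries=0
  echo -n "Waiting for port :${port} to be released."
  while port_in_use "${port}"; do
    echo -n "."
    sleep "${interval}"
    (( tries += 1 ))
    if [[ "${tries}" -gt "${max_tries}" ]]; then
      echo "timed out waiting for port :${port} to be released"
      return 1
    fi
  done
  echo
}
```

A call such as `wait_for_port_release 6443 || exit 1` would mirror the container args. The init container in the output above bounds the same kind of loop differently, by running bash under `/usr/bin/timeout 105` rather than counting tries.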

Comment 16 Ke Wang 2020-07-02 02:38:33 UTC
To check for crashloops: RESTARTS is 0 for every pod, as expected.

$ oc get pods -A -l apiserver
NAMESPACE                  NAME                                                         READY   STATUS    RESTARTS   AGE
openshift-apiserver        apiserver-7c75bb7d79-2z4sr                                   1/1     Running   0          38m
openshift-apiserver        apiserver-7c75bb7d79-s46zv                                   1/1     Running   0          38m
openshift-apiserver        apiserver-7c75bb7d79-vkthh                                   1/1     Running   0          37m
openshift-kube-apiserver   kube-apiserver-ip-10-0-.-75....compute.internal              4/4     Running   0          28m
openshift-kube-apiserver   kube-apiserver-ip-10-0-.-222....compute.internal             4/4     Running   0          24m
openshift-kube-apiserver   kube-apiserver-ip-10-0-.-7....compute.internal               4/4     Running   0          29m
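
The same check can be scripted rather than read by eye. The following is a rough sketch, assuming the tabular output format shown above with a RESTARTS header column; `pods_with_restarts` is a hypothetical helper name, not an `oc` feature.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `oc get pods -A` table output on stdin and
# prints NAMESPACE/NAME for each pod whose RESTARTS value is non-zero.
# Exits 1 if any such pod is found, 0 otherwise.
pods_with_restarts() {
  awk '
    NR == 1 {
      # Find the RESTARTS column position from the header row.
      for (i = 1; i <= NF; i++) if ($i == "RESTARTS") col = i
      next
    }
    col && $col + 0 > 0 {
      print $1 "/" $2 ": " $col " restarts"
      found = 1
    }
    END { exit found ? 1 : 0 }
  '
}
```

Under those assumptions, `oc get pods -A -l apiserver | pods_with_restarts` would print nothing and exit 0 for the listing above.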

Comment 19 errata-xmlrpc 2020-07-06 20:47:17 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2786

