Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1837992

Summary: A restarted kube-apiserver doesn't wait for the port to be available; crashloops
Product: OpenShift Container Platform
Component: kube-apiserver
Version: 4.5
Target Milestone: ---
Target Release: 4.5.0
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Hardware: Unspecified
OS: Unspecified
Reporter: Casey Callendrello <cdc>
Assignee: Stefan Schimanski <sttts>
QA Contact: Ke Wang <kewang>
CC: aos-bugs, dblack, mfojtik, smalleni, wking, xxia
Clones: 1851066, 1851071 (view as bug list)
Last Closed: 2020-07-13 17:40:28 UTC
Type: Bug
Bug Blocks: 1851066, 1851071, 1851915

Description Casey Callendrello 2020-05-20 10:43:23 UTC
Description of problem:
If the kube-apiserver container is restarted, the port (:6443) isn't yet available, so it crash-loops a few times until it is finally able to start.

This causes pod logs to be lost, obscuring the real reason for the restart.

An init container already handles this case when the whole pod is restarted; we should handle the same case for plain container restarts. A minimal sketch of the wait-loop approach follows.
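
For illustration only, a minimal sketch of the kind of wait loop involved, assuming ss(8) is available in the container image (the script actually shipped is quoted in the pod YAML in comment 6):

  # Sketch only: block until nothing is still bound to :6443 or :6080.
  echo -n "Waiting for port :6443 and :6080 to be released."
  while [ -n "$(ss -Htan '( sport = 6443 or sport = 6080 )')" ]; do
    echo -n "."
    sleep 1
  done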

Version-Release number of selected component (if applicable):


How reproducible: easy


Steps to Reproduce:
1. Kill the kube-apiserver container.
2. Watch the pod logs (one way to do this is sketched below).
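
One way to carry out these steps, sketched with placeholder names (crictl usage assumes an RHCOS node running CRI-O; <master-node> and <container-id> are placeholders):

$ oc debug node/<master-node>
sh-4.4# chroot /host
sh-4.4# crictl ps --name kube-apiserver    # find the running container ID
sh-4.4# crictl stop <container-id>         # kill the kube-apiserver container

Then, from another terminal, watch the logs:

$ oc logs -n openshift-kube-apiserver -f kube-apiserver-<master-node>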


I should have a fix for this shortly.

Comment 1 Jacob Tanenbaum 2020-05-20 16:32:54 UTC
*** Bug 1834908 has been marked as a duplicate of this bug. ***

Comment 2 Stefan Schimanski 2020-05-28 13:58:52 UTC
This has been in the merge queue all day, with infra issues blocking the merge.

Comment 6 Ke Wang 2020-06-01 07:33:25 UTC
Verified with OCP 4.5.0-0.nightly-2020-05-30-025738; checked the PR changes in the kube-apiserver pod:

$ kubeapiserver_pod=$(oc get pod -n openshift-kube-apiserver | grep kube-apiserver | head -1 | awk '{print $1}')

$ oc get pods -n openshift-kube-apiserver $kubeapiserver_pod -o yaml | grep -n -C8 'Waiting for port :6443 and :6080 to be released'
344-spec:
345-  containers:
346-  - args:
347-    - |-
348-      if [ -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt ]; then
349-        echo "Copying system trust bundle"
350-        cp -f /etc/kubernetes/static-pod-certs/configmaps/trusted-ca-bundle/ca-bundle.crt /etc/pki/ca-trust/extracted/pem/tls-ca-bundle.pem
351-      fi
352:      echo -n "Waiting for port :6443 and :6080 to be released."
353-      tries=0
354-      while [ -n "$(ss -Htan '( sport = 6443 or sport = 6080 )')" ]; do
355-        echo -n "."
356-        sleep 1
357-        (( tries += 1 ))
358-        if [[ "${tries}" -gt 105 ]]; then
359-          echo "timed out waiting for port :6443 and :6080 to be released"
360-          exit 1
--
504-  dnsPolicy: ClusterFirst
505-  enableServiceLinks: true
506-  hostNetwork: true
507-  initContainers:
508-  - args:
509-    - |
510-      echo -n "Fixing audit permissions."
511-      chmod 0700 /var/log/kube-apiserver
512:      echo -n "Waiting for port :6443 and :6080 to be released."
513-      while [ -n "$(ss -Htan '( sport = 6443 or sport = 6080 )')" ]; do
514-        echo -n "."
515-        sleep 1
516-      done
517-    command:
518-    - /usr/bin/timeout
519-    - "105"
520-    - /bin/bash
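
Note the two variants in the YAML above: the main container's script bounds its own wait (the tries counter makes it exit 1 after 105 attempts), while the init container's wait is bounded externally by the /usr/bin/timeout 105 wrapper around /bin/bash.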


Redeployed the kube-apiserver pod and checked that it works as expected.

$ oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "forced test 1" } ]'

$ oc logs -n openshift-kube-apiserver $kubeapiserver_pod | grep 'Waiting for port :6443 and :6080 to be released'
Waiting for port :6443 and :6080 to be released.

From the above, we can see the kube-apiserver waited for the ports to be released without crash-looping, so moving the bug to VERIFIED.

Comment 7 W. Trevor King 2020-06-25 16:13:13 UTC
This bug was verified with a 4.5 nightly when the target was 4.5.0.  It's also attached to a 4.5 errata.  Moving the target back to 4.5.0, reversing Stefan's change from the 4th.

Comment 8 W. Trevor King 2020-06-25 16:13:46 UTC
*** Bug 1851071 has been marked as a duplicate of this bug. ***

Comment 9 errata-xmlrpc 2020-07-13 17:40:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409