Bug 1963730

Summary: kube-apiserver failed to load SNI cert and key
Product: OpenShift Container Platform
Reporter: Shay Ulmer <sulmer>
Component: kube-apiserver
Assignee: Abu Kashem <akashem>
Status: CLOSED ERRATA
QA Contact: Ke Wang <kewang>
Severity: high
Docs Contact:
Priority: high
Version: 4.8
CC: aos-bugs, apaladug, dhellmann, dramseur, ercohen, lszaszki, mbargenq, mfojtik, pawankum, rfreiman, rogbas, sdodson, sttts, wlewis, xxia
Target Milestone: ---
Keywords: Reopened, ServiceDeliveryImpact
Target Release: 4.8.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: tag-ci
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-10-15 15:30:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1971624    
Bug Blocks: 2108622    
Attachments:
cert-syncer (flags: none)

Description Shay Ulmer 2021-05-23 19:16:23 UTC
Description of problem:
The cluster was upgraded to 4.8.0-fc.3 14 days ago.
I got paged for a ClusterOperatorDegraded alert (kube-apiserver ClusterOperator).
Upon investigation, I found that one of the kube-apiserver pods was in a CrashLoopBackOff state.
The logs showed:

Error: failed to load SNI cert and key: open /etc/kubernetes/static-pod-certs/secrets/user-serving-cert-000/tls.crt: no such file or directory

The issue resolved itself after 4-6 hours.
I'm attaching a must-gather which should contain the relevant logs.
Must-gather drive link:
https://drive.google.com/file/d/1QlrWV8336H5wl_T3-r_8SK_6RRIh564x/view?usp=sharing

Version-Release number of selected component (if applicable):
4.8.0-fc.3

How reproducible:
Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Lukasz Szaszkiewicz 2021-05-24 09:53:44 UTC
I am increasing the priority. An outage of 6 hours is a lot. It is worth investigating, just to make sure we don't have a bug.

Comment 2 Lukasz Szaszkiewicz 2021-05-27 09:43:04 UTC
I have looked into the attached must-gather. It appears that the required resources for installing a kube-apiserver were missing. Despite that, the cluster-kube-apiserver-operator decided to install a new revision (178).

That caused kube-apiserver-ip-10-163-204-142.ec2.internal to fail with "Error: failed to load SNI cert and key: missing content for serving cert "sni-serving-cert::/etc/kubernetes/static-pod-certs/secrets/internal-loadbalancer-serving-certkey/tls.crt::/etc/kubernetes/static-pod-certs/secrets/internal-loadbalancer-serving-certkey/tls.key"

The other API servers were on the previous revision 177.

Relevant logs from the operator:

InstallerController reconciliation failed: missing required resources: [configmaps: aggregator-client-ca,check-endpoints-kubeconfig,client-ca,control-plane-node-kubeconfig, configmaps: bound-sa-token-signing-certs-178,config-178,etcd-serving-ca-178,kube-apiserver-audit-policies-178,kube-apiserver-cert-syncer-kubeconfig-178,kube-apiserver-pod-178,kubelet-serving-ca-178,sa-token-signing-certs-178]


I am not sure how I could check why the required resources were missing, but we should probably fix the operator not to install a new revision when required resources are missing.

Comment 3 Wally 2021-05-27 13:37:51 UTC
Setting blocker- given low severity.

Comment 4 Lukasz Szaszkiewicz 2021-05-27 14:43:40 UTC
It turns out that we do actually check for required resources before installing a new pod. Something must have overwritten the secret; it looks like a race between the cert rotation controller and a different component.
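
For context, a minimal, hypothetical sketch of the kind of pre-install check described here might look as follows. This is not the operator's actual code: the function name, signature, and the "<name>-<revision>" suffix convention are assumptions based on the log message in comment 2, and the unsuffixed resources from that message would be checked the same way.

package precheck

import (
    "context"
    "fmt"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
)

// missingRequiredResources returns the revision-suffixed configmaps and secrets
// that do not exist yet. Only when the returned slice is empty would an
// installer pod for the new revision be created. (Illustrative sketch only,
// not the cluster-kube-apiserver-operator's code.)
func missingRequiredResources(ctx context.Context, client kubernetes.Interface,
    ns string, revision int, configMaps, secrets []string) []string {

    var missing []string
    for _, name := range configMaps {
        n := fmt.Sprintf("%s-%d", name, revision)
        if _, err := client.CoreV1().ConfigMaps(ns).Get(ctx, n, metav1.GetOptions{}); apierrors.IsNotFound(err) {
            missing = append(missing, "configmaps/"+n)
        }
    }
    for _, name := range secrets {
        n := fmt.Sprintf("%s-%d", name, revision)
        if _, err := client.CoreV1().Secrets(ns).Get(ctx, n, metav1.GetOptions{}); apierrors.IsNotFound(err) {
            missing = append(missing, "secrets/"+n)
        }
    }
    return missing
}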

Comment 5 Eran Cohen 2021-05-27 16:35:59 UTC
It seems that if this scenario happens on single-node OpenShift, the cluster will never recover.
Any idea why the cert rotation controller failed to place the new secret?

Comment 8 Stefan Schimanski 2021-05-31 13:22:19 UTC
Continuing with the analysis from comment 2:

1. The secret internal-loadbalancer-serving-certkey is written to disk by the cert-syncer container in the kube-apiserver static pod.

2. The cert-syncer uses the kube-apiserver-cert-syncer-kubeconfig kubeconfig:

    name: kube-apiserver-cert-syncer
    image: {{.OperatorImage}}
    imagePullPolicy: IfNotPresent
    terminationMessagePolicy: FallbackToLogsOnError
    command: ["cluster-kube-apiserver-operator", "cert-syncer"]
    args:
      - --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig
      - --namespace=$(POD_NAMESPACE)
      - --destination-dir=/etc/kubernetes/static-pod-certs

3. The kube-apiserver-cert-syncer-kubeconfig file talks to localhost:6443:

   apiVersion: v1
   kind: ConfigMap
   metadata:
     name: kube-apiserver-cert-syncer-kubeconfig
     namespace: openshift-kube-apiserver
   data:
     kubeconfig: |
       apiVersion: v1
       clusters:
         - cluster:
             certificate-authority: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-server-ca/ca-bundle.crt
             server: https://localhost:6443
             tls-server-name: localhost-recovery
           name: loopback

4. In the cert-syncer logs on node 10.163.204.142 (where kube-apiserver crash-loops), we see

   2021-05-23T12:52:59.963579828Z E0523 12:52:59.963544       1 reflector.go:138] k8s.io/client-go.0/tools/cache/reflector.go:167: Failed to watch *v1.Secret: failed to list *v1.Secret: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-apiserver/secrets?limit=500&resourceVersion=0": dial tcp [::1]:6443: connect: connection refused

Hence, the kube-apiserver and the cert-syncer appear to be deadlocked: the cert-syncer cannot read the secret because the local kube-apiserver is down, and the kube-apiserver cannot start because the secret files the cert-syncer would write are missing on disk.
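
To make the circular dependency concrete, here is a minimal sketch, under assumptions, of what a syncer like this has to do; it is not the real cert-syncer code. It reads the secret through the kubeconfig shown above, which points at https://localhost:6443, and writes the keys into the destination directory. While the local kube-apiserver is crash-looping, the Get fails with the same "connection refused" seen in the log, so the files on disk are never refreshed, and the kube-apiserver in turn cannot start without them.

package main

import (
    "context"
    "fmt"
    "os"
    "path/filepath"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    // The same kubeconfig the cert-syncer container is started with; its
    // "server" is https://localhost:6443, i.e. the kube-apiserver on this node.
    kubeconfig := "/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig"
    cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
    if err != nil {
        panic(err)
    }
    client, err := kubernetes.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // This request goes to localhost:6443; while the local kube-apiserver is
    // crash-looping it fails with "connection refused", as in the log above.
    secret, err := client.CoreV1().Secrets("openshift-kube-apiserver").Get(
        context.TODO(), "internal-loadbalancer-serving-certkey", metav1.GetOptions{})
    if err != nil {
        fmt.Fprintf(os.Stderr, "cannot read serving-cert secret: %v\n", err)
        os.Exit(1)
    }

    // Write tls.crt / tls.key into the directory the kube-apiserver reads from.
    dir := "/etc/kubernetes/static-pod-certs/secrets/internal-loadbalancer-serving-certkey"
    if err := os.MkdirAll(dir, 0o755); err != nil {
        panic(err)
    }
    for name, data := range secret.Data {
        if err := os.WriteFile(filepath.Join(dir, name), data, 0o600); err != nil {
            panic(err)
        }
    }
}

In the real static pod the syncer keeps retrying, but it cannot make progress until something else restores a valid file on disk, which would be consistent with the cluster only recovering hours later.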

Comment 9 Stefan Schimanski 2021-05-31 13:45:57 UTC
The installer pod also creates the internal-loadbalancer-serving-certkey files:

    - --cert-secrets=internal-loadbalancer-serving-certkey

For whatever reason, the revision 178 installer logs are missing on the 10-163-204-142 node (we have them for revision 177).

Comment 10 Stefan Schimanski 2021-05-31 15:03:08 UTC
I have reviewed the installer source code, but without luck. The events show that the installer finished cleanly, so we must assume that it created the file, as the secret clearly existed at that point in time. The secret is not optional, so the installer would have failed if it had not been there. In theory, though, it could have been empty (= no files).

The kubelet doesn't show anything about mount problems, so it is still unclear why the internal-loadbalancer-serving-certkey files do not exist in the filesystem of the kube-apiserver container.

Comment 12 Stefan Schimanski 2021-05-31 18:32:14 UTC
From comment 11 I got the relevant installer lines, and they look innocent, as expected:

I0523 11:16:44.672461       1 cmd.go:220] Creating directory "/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/internal-loadbalancer-serving-certkey"
I0523 11:16:44.672479       1 cmd.go:225] Writing secret manifest "/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/internal-loadbalancer-serving-certkey/tls.key"
I0523 11:16:44.672533       1 cmd.go:225] Writing secret manifest "/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/internal-loadbalancer-serving-certkey/tls.crt"

Comment 13 Lukasz Szaszkiewicz 2021-06-01 10:32:24 UTC
Created attachment 1788453 [details]
cert-syncer

Comment 14 Lukasz Szaszkiewicz 2021-06-01 10:39:55 UTC
I think we have found it. It appears to be a race between the installer pod and the cert-syncer container. Both were writing to the same path around the same time ~11:16:44. See the attached files.

Since there is no coordination between the two processes, and the default write method truncates the file before writing, the race may have left the content of the file empty.

We are going to fix the installer and the cert-syncer to use a different path.

Comment 18 Ke Wang 2021-07-19 07:54:05 UTC
This bug's PR is dev-approved but not yet merged, so I'm following issue DPTP-660 to do pre-merge verification for the QE pre-merge verification goal of issue OCPQE-815, using the bot to build an OCP image with the PR and installing a cluster. Here are the verification steps:

1. I did a quick search of CI tests from the past 14 days [1]; there are no related error logs apart from bugs that already exist.

[1] https://search.ci.openshift.org/?search=load+SNI+cert+and+key&maxAge=336h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job


2. I ran a script to force a kube-apiserver rollout every 30 minutes; each rollout results in a write of the secret manifest files:
$ cat test.sh
# Force a kube-apiserver rollout every 30 minutes, 18 times in total.
for i in {1..18}; do
    oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
    sleep 1800
done
# After the rollouts, collect the secret-write and SNI error logs from every master node.
masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')
for node in $masters; do echo $node; oc debug no/$node -- chroot /host bash -c "grep -nr 'Writing secret manifest.*internal-loadbalancer-serving-certkey' /var/log/"; done > write-secret.txt
for node in $masters; do echo $node; oc debug no/$node -- chroot /host bash -c "grep -nr 'failed to load SNI cert and key' /var/log/"; done > error.txt

$ grep 'fail' error.txt
no results found.

3. I also checked the 'Writing secret manifest' operation in another cluster and observed that the installer pod and the cert-syncer container performed 'Writing secret manifest' in order, without any race.

4. I have been watching all of our recent e2e and upgrade CI tests and no reproductions have been found; the conditions that cause this bug are really rare. The easiest way to prevent race conditions caused by read-modify-write operations is to make those operations atomic, and an atomic way to write the file is already used in this bug's PR (a minimal sketch of that pattern follows below).
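
For illustration, here is a minimal sketch of the atomic-write pattern referred to in point 4: write the data to a temporary file in the same directory and then rename it over the destination, so that a reader racing with the writer sees either the old or the new complete file, never a truncated one. This is an illustrative sketch, not the code from the PR.

package main

import (
    "fmt"
    "os"
    "path/filepath"
)

// writeFileAtomic writes data to path without ever exposing a truncated file:
// the content goes to a temporary file first, which is then renamed over the
// destination. rename(2) within one filesystem is atomic.
func writeFileAtomic(path string, data []byte, perm os.FileMode) error {
    tmp, err := os.CreateTemp(filepath.Dir(path), filepath.Base(path)+".tmp-")
    if err != nil {
        return err
    }
    defer os.Remove(tmp.Name()) // no-op once the rename has succeeded

    if _, err := tmp.Write(data); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Chmod(perm); err != nil {
        tmp.Close()
        return err
    }
    if err := tmp.Sync(); err != nil { // flush to disk before publishing
        tmp.Close()
        return err
    }
    if err := tmp.Close(); err != nil {
        return err
    }
    return os.Rename(tmp.Name(), path)
}

func main() {
    // Hypothetical usage with dummy data and a scratch path.
    if err := writeFileAtomic("/tmp/tls.crt", []byte("-----BEGIN CERTIFICATE-----\n..."), 0o600); err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
}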

Based on the four points above, I think the bug was fixed with the PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/1153.

The bug is pre-merge verified. After the PR gets merged, the bug will be moved to VERIFIED by the bot automatically or, if that does not work, by me manually.

Comment 19 pawankum 2021-07-22 06:30:02 UTC
(In reply to Ke Wang from comment #18)

Hello Wang,

I have a customer (IHAC) who is hitting a similar issue: out of three masters, kube-apiserver is running successfully on two nodes but failing on one node. Is there any workaround for this kind of scenario until the 4.8 upgrade? I tried the script above with no luck, tried copying the certs from the other masters, and tried rebooting the VM too.


Regards,
Pawan

Comment 20 Ke Wang 2021-08-03 07:04:34 UTC
Hi pawankum, what OCP release are you using? I tested an OCP image built with the PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/1153, and I just checked that the PR has landed in the 4.8 release. You can give 4.8 a try.

Comment 23 Ke Wang 2021-08-11 02:31:21 UTC
Per comment 18, moving the bug to VERIFIED.

Comment 25 errata-xmlrpc 2021-08-16 18:32:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.5 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3121

Comment 26 Rogerio Bastos 2021-08-19 21:11:01 UTC
A similar issue has been identified in a customer cluster on ARO (Azure Red Hat OpenShift) while rolling out a cluster upgrade to 4.6.36. Restarting the pod does not solve the issue, and it continues crash-looping.


I0819 21:01:00.118440      17 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.200.76.6:2379  <nil> 0 <nil>} {https://10.200.76.7:2379  <nil> 0 <nil>} {https://10.200.76.9:2379  <nil> 0 <nil>} {https://localhost:2379  <nil> 0 <nil>}]
I0819 21:01:00.127683      17 store.go:1378] Monitoring apiservices.apiregistration.k8s.io count at <storage-prefix>//apiregistration.k8s.io/apiservices
Error: missing content for serving cert "aggregator-proxy-cert::/etc/kubernetes/static-pod-certs/secrets/aggregator-client/tls.crt::/etc/kubernetes/static-pod-certs/secrets/aggregator-client/tls.key"
I0819 21:01:00.141321       1 main.go:198] Termination finished with exit code 1
I0819 21:01:00.141415       1 main.go:151] Deleting termination lock file "/var/log/kube-apiserver/.terminating"



I noticed that a PR (https://github.com/openshift/cluster-kube-apiserver-operator/pull/1145) addressing the issue was merged in June 2021 and released in 4.8.5. Can we expect it to be backported to 4.6.z?

Comment 27 DRamseur 2021-10-14 23:42:02 UTC
@lszaszki  - What needs to be done to get this backported to 4.6.z?

Comment 28 Scott Dodson 2021-10-15 15:30:27 UTC
We NEVER re-open a bug that's been CLOSED ERRATA. If that was done because you want to see this backported, then you should clone the bug. If it was done because you believe that the bug is not fixed in the version stated (4.8.5), then please open a new bug and reference this one.

Comment 29 Anand Paladugu 2021-11-17 18:24:34 UTC
Team, I have an OCP 4.6 customer that, during an upgrade (to 4.7.36), is seeing the kube-apiserver pod on one master fail with the following:

I1111 16:51:18.430740      20 dynamic_serving_content.go:111] Loaded a new cert/key pair for "sni-serving-cert::/etc/kubernetes/static-pod-resources/secrets/localhost-recovery-serving-certkey/tls.crt::/etc/kubernetes/static-pod-resources/secrets/localhost-recovery-serving-certkey/tls.key"
Error: failed to load SNI cert and key: tls: failed to parse private key


Restarting the pod did not help, and the above certs on the failing master compare fine with those on the other masters and are valid. Is this possibly a race condition?

Comment 30 Red Hat Bugzilla 2023-09-15 01:34:13 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 365 days