Description of problem:

The cluster was upgraded to 4.8.0-fc.3 14 days ago. I got paged for a ClusterOperatorDegraded alert (kube-apiserver ClusterOperator). Upon investigation, I found that one of the kube-apiserver pods was in a CrashLoopBackOff state. The logs showed:

  Error: failed to load SNI cert and key: open /etc/kubernetes/static-pod-certs/secrets/user-serving-cert-000/tls.crt: no such file or directory

The issue resolved itself after 4-6 hours. I'm attaching a must-gather which should contain the interesting logs.

Must-gather drive link: https://drive.google.com/file/d/1QlrWV8336H5wl_T3-r_8SK_6RRIh564x/view?usp=sharing

Version-Release number of selected component (if applicable): 4.8.0-fc.3

How reproducible: Unknown

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
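For anyone triaging the same alert, a minimal sketch of the checks that lead to this finding (the pod name is a placeholder for the affected master):

  $ oc get clusteroperator kube-apiserver
  $ oc get pods -n openshift-kube-apiserver
  # inspect the previous container log of the crash-looping static pod
  $ oc logs -n openshift-kube-apiserver kube-apiserver-<failing-master> -c kube-apiserver --previous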
I am increasing the priority. An outage of 6 hours is a lot. Worth investigating just to make sure we don't have a bug.
I have looked into the attached must-gather. It appears that the required resources for installing an API server were missing. Despite that, the cluster-kube-apiserver-operator decided to install a new revision (178). That caused kube-apiserver-ip-10-163-204-142.ec2.internal to fail with:

  Error: failed to load SNI cert and key: missing content for serving cert "sni-serving-cert::/etc/kubernetes/static-pod-certs/secrets/internal-loadbalancer-serving-certkey/tls.crt::/etc/kubernetes/static-pod-certs/secrets/internal-loadbalancer-serving-certkey/tls.key"

The other API servers were still on the previous revision, 177.

Relevant logs from the operator:

  InstallerController reconciliation failed: missing required resources: [configmaps: aggregator-client-ca,check-endpoints-kubeconfig,client-ca,control-plane-node-kubeconfig, configmaps: bound-sa-token-signing-certs-178,config-178,etcd-serving-ca-178,kube-apiserver-audit-policies-178,kube-apiserver-cert-syncer-kubeconfig-178,kube-apiserver-pod-178,kubelet-serving-ca-178,sa-token-signing-certs-178]

I am not sure how I could check why the required resources were missing, but we should probably fix the operator to not install a new revision when things are missing.
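For completeness, a rough sketch of how this can be cross-checked on a live cluster (the must-gather contains the equivalent resource dumps):

  # per-node current/target revisions as seen by the operator
  $ oc get kubeapiserver cluster -o jsonpath='{range .status.nodeStatuses[*]}{.nodeName}{" "}{.currentRevision}{" -> "}{.targetRevision}{"\n"}{end}'
  # do the revision-178 resources actually exist?
  $ oc get configmaps,secrets -n openshift-kube-apiserver | grep -- '-178'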
Setting blocker- given low sev.
It turns out that we do actually check for required resources before installing a new pod. Something must have overwritten the secret, which points to a race between the cert rotation controller and a different component.
It seems that if this scenario happens on single-node OpenShift, it will never recover. Any idea why the cert rotation controller failed to place the new secret?
Continuing with the analysis of #2:

1. The secret internal-loadbalancer-serving-certkey is written to disk by the cert-syncer container in the kube-apiserver static pod.

2. The cert-syncer uses the kube-apiserver-cert-syncer-kubeconfig kubeconfig:

     name: kube-apiserver-cert-syncer
     image: {{.OperatorImage}}
     imagePullPolicy: IfNotPresent
     terminationMessagePolicy: FallbackToLogsOnError
     command: ["cluster-kube-apiserver-operator", "cert-syncer"]
     args:
       - --kubeconfig=/etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-cert-syncer-kubeconfig/kubeconfig
       - --namespace=$(POD_NAMESPACE)
       - --destination-dir=/etc/kubernetes/static-pod-certs

3. The kube-apiserver-cert-syncer-kubeconfig file talks to localhost:6443:

     apiVersion: v1
     kind: ConfigMap
     metadata:
       name: kube-apiserver-cert-syncer-kubeconfig
       namespace: openshift-kube-apiserver
     data:
       kubeconfig: |
         apiVersion: v1
         clusters:
         - cluster:
             certificate-authority: /etc/kubernetes/static-pod-resources/configmaps/kube-apiserver-server-ca/ca-bundle.crt
             server: https://localhost:6443
             tls-server-name: localhost-recovery
           name: loopback

4. In the cert-syncer logs on node 10.163.204.142 (where kube-apiserver crash-loops), we see:

     2021-05-23T12:52:59.963579828Z E0523 12:52:59.963544 1 reflector.go:138] k8s.io/client-go.0/tools/cache/reflector.go:167: Failed to watch *v1.Secret: failed to list *v1.Secret: Get "https://localhost:6443/api/v1/namespaces/openshift-kube-apiserver/secrets?limit=500&resourceVersion=0": dial tcp [::1]:6443: connect: connection refused

Hence, kube-apiserver and the cert-syncer appear to be in a deadlock: the cert-syncer needs the local kube-apiserver to read the secret, and the kube-apiserver needs the secret file on disk to start.
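A quick way to see both sides of that deadlock on a live cluster, sketched with a placeholder node name:

  # cert-syncer side: cannot reach the local apiserver
  $ oc logs -n openshift-kube-apiserver kube-apiserver-<node> -c kube-apiserver-cert-syncer --tail=20
  # apiserver side: dies on the missing SNI cert before it can serve
  $ oc logs -n openshift-kube-apiserver kube-apiserver-<node> -c kube-apiserver --previous | tail -5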
The installer pod also creates the internal-loadbalancer-serving-certkey files:

  - --cert-secrets=internal-loadbalancer-serving-certkey

For whatever reason we are missing the revision 178 installer logs on the 10-163-204-142 node (we have them for 177).
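For the record, installer pod names follow the installer-<revision>-<node> convention, so the missing logs would normally be retrievable along these lines (assuming the pod object or its on-disk logs still exist):

  $ oc get pods -n openshift-kube-apiserver | grep installer-178
  $ oc logs -n openshift-kube-apiserver installer-178-ip-10-163-204-142.ec2.internal
  # or, once the pod object is gone, from the pod logs kept on the node:
  $ oc debug node/ip-10-163-204-142.ec2.internal -- chroot /host sh -c \
      "grep -r 'internal-loadbalancer-serving-certkey' /var/log/pods/openshift-kube-apiserver_installer-178-*/"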
I have reviewed the installer source code, but without luck. The events show that the installer finished cleanly, so we must assume that it created the file, which clearly existed at that point in time. The secret is not optional, so the installer would have failed if it had not been there. In theory, it could have been empty (= no files). Kubelet doesn't show anything about mount problems. So it is still unclear why the internal-loadbalancer-serving-certkey files do not exist in the file system of the kube-apiserver container.
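To distinguish "missing" from "present but empty" on the node, something like the following should work (assuming the host path that the installer logs in the next comment also use):

  $ oc debug node/ip-10-163-204-142.ec2.internal -- chroot /host \
      ls -la /etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/internal-loadbalancer-serving-certkey/
  # and compare against what the API currently holds
  $ oc get secret internal-loadbalancer-serving-certkey -n openshift-kube-apiserver \
      -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -enddate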
From c11 I got the relevant installer lines, and they look innocent as expected:

  I0523 11:16:44.672461 1 cmd.go:220] Creating directory "/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/internal-loadbalancer-serving-certkey"
  I0523 11:16:44.672479 1 cmd.go:225] Writing secret manifest "/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/internal-loadbalancer-serving-certkey/tls.key"
  I0523 11:16:44.672533 1 cmd.go:225] Writing secret manifest "/etc/kubernetes/static-pod-resources/kube-apiserver-certs/secrets/internal-loadbalancer-serving-certkey/tls.crt"
Created attachment 1788453 (cert-syncer)
I think we have found it. It appears to be a race between the installer pod and the cert-syncer container: both were writing to the same path at around the same time, ~11:16:44 (see the attached files). Since there is no coordination between the two processes and the default write method truncates the file first, the race may have left the content of the file empty. We are going to fix the installer and the cert-syncer to use a different path.
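To illustrate the failure mode and the usual remedy in shell terms (just a sketch of the general pattern, not the actual PR; new-tls.crt is a placeholder source file): truncate-then-write lets a concurrent reader, or the other writer, observe an empty file, whereas writing to a temp file and renaming it into place is atomic.

  dest=/etc/kubernetes/static-pod-certs/secrets/internal-loadbalancer-serving-certkey/tls.crt

  # racy: the file is truncated before the new bytes land, so a reader
  # (or the competing writer) can see an empty tls.crt in between
  cat new-tls.crt > "$dest"

  # atomic: write a temp file in the same directory, then rename it into
  # place; readers see either the old or the new content, never an empty file
  tmp=$(mktemp "$(dirname "$dest")/.tls.crt.XXXXXX")
  cat new-tls.crt > "$tmp"
  mv -f "$tmp" "$dest"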
This bug's PR is dev-approved but not yet merged, so I'm following issue DPTP-660 to do the pre-merge verification for the QE pre-merge verification goal of issue OCPQE-815, using the bot to build an OCP image with the PR and installing a cluster. Here are the verification steps:

1. A quick search of CI tests from the past 14 days [1] shows no related error logs, apart from bugs that already exist.

[1] https://search.ci.openshift.org/?search=load+SNI+cert+and+key&maxAge=336h&context=1&type=all&name=&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

2. I ran a script that forces a kube-apiserver rollout every 30 minutes; each rollout results in secret manifest file writes:

  $ cat test.sh
  for i in {1..18}; do
      oc patch kubeapiserver/cluster --type=json -p '[ {"op": "replace", "path": "/spec/forceRedeploymentReason", "value": "roll-'"$( date --rfc-3339=ns )"'"} ]'
      sleep 1800
  done

  masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')
  for node in $masters; do echo $node; oc debug no/$node -- chroot /host bash -c "grep -nr 'Writing secret manifest.*internal-loadbalancer-serving-certkey' /var/log/"; done > write-secret.txt
  for node in $masters; do echo $node; oc debug no/$node -- chroot /host bash -c "grep -nr 'failed to load SNI cert and key' /var/log/"; done > error.txt

  $ grep 'fail' error.txt
  no results found.

3. I also checked the 'Writing secret manifest' operations in another cluster and observed that the installer pod and the cert-syncer container wrote in order, without any race.

4. I have watched all of our recent e2e and upgrade CI tests and found no reproductions; the conditions that cause this bug are really rare. The easiest way to prevent race conditions caused by read-modify-write sequences is to make such operations atomic, and an atomic file write is already used in this bug's PR.

Based on the four points above, I think the bug is fixed by the PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/1153.

The bug is pre-merge verified. After the PR gets merged, the bug will be moved to VERIFIED by the bot automatically or, if that doesn't work, by me manually.
(In reply to Ke Wang from comment #18)

Hello Wang,

IHAC who is hitting a similar issue: out of three masters, kube-apiserver is running successfully on two nodes but failing on one. Is there any workaround for this kind of scenario until the 4.8 upgrade? I tried this script but had no luck. I also tried copying the certs from the other masters and rebooting the VM.

Regards,
Pawan
Hi pawankum, what OCP release are you using? I tested an OCP image built with the PR https://github.com/openshift/cluster-kube-apiserver-operator/pull/1153, and I just checked that the PR has landed in the 4.8 release. You can give 4.8 a try.
Per comment 18, moving the bug to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.8.5 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2021:3121
A similar issue has been identified in a customer cluster on ARO (Azure Red Hat OpenShift) while rolling out a cluster upgrade to 4.6.36. Restarting the pod does not solve the issue and it keeps crash-looping.

  I0819 21:01:00.118440 17 endpoint.go:68] ccResolverWrapper: sending new addresses to cc: [{https://10.200.76.6:2379 <nil> 0 <nil>} {https://10.200.76.7:2379 <nil> 0 <nil>} {https://10.200.76.9:2379 <nil> 0 <nil>} {https://localhost:2379 <nil> 0 <nil>}]
  I0819 21:01:00.127683 17 store.go:1378] Monitoring apiservices.apiregistration.k8s.io count at <storage-prefix>//apiregistration.k8s.io/apiservices
  Error: missing content for serving cert "aggregator-proxy-cert::/etc/kubernetes/static-pod-certs/secrets/aggregator-client/tls.crt::/etc/kubernetes/static-pod-certs/secrets/aggregator-client/tls.key"
  I0819 21:01:00.141321 1 main.go:198] Termination finished with exit code 1
  I0819 21:01:00.141415 1 main.go:151] Deleting termination lock file "/var/log/kube-apiserver/.terminating"

I noticed that a PR (https://github.com/openshift/cluster-kube-apiserver-operator/pull/1145) addressing the issue was merged in Jun-2021 and released in 4.8.5. Can we expect it to be backported to 4.6.z?
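If it helps the backport discussion, one way to check which cluster-kube-apiserver-operator commit a given z-stream ships (the release pullspec below is just an example) is:

  $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.6.36-x86_64 --commits \
      | grep cluster-kube-apiserver-operator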
@lszaszki - What needs to be done to get this backported to 4.6.z?
We NEVER re-open a bug that has been CLOSED ERRATA. If that was done because you want to see this backported, then you should clone the bug. If it was done because you believe the bug is not fixed in the stated version (4.8.5), please open a new bug and reference this one.
Team, I have an OCP 4.6 customer that, during an upgrade to 4.7.36, is seeing the kube-apiserver pod on one master fail with the following:

  I1111 16:51:18.430740 20 dynamic_serving_content.go:111] Loaded a new cert/key pair for "sni-serving-cert::/etc/kubernetes/static-pod-resources/secrets/localhost-recovery-serving-certkey/tls.crt::/etc/kubernetes/static-pod-resources/secrets/localhost-recovery-serving-certkey/tls.key"
  Error: failed to load SNI cert and key: tls: failed to parse private key

Restarting the pod did not help, the certs above compare fine between the failing master and the other masters, and they are valid. Is this possibly a race condition?
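One extra check that may help: verify that the key file the apiserver actually loads on the failing master parses. The container path in the error maps to a revisioned directory on the host, so a find is the safest way to locate it (this assumes the openssl CLI is available in the host chroot; otherwise copy the file off-node first):

  $ oc debug node/<failing-master> -- chroot /host sh -c \
      "find /etc/kubernetes/static-pod-resources -path '*localhost-recovery-serving-certkey/tls.key' \
       -print -exec openssl pkey -noout -in {} \;"
  # openssl pkey prints nothing and exits 0 when the key parses; a parse error here
  # would point at the on-disk copy rather than the secret stored in the cluster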