Bug 2026387 - node tuning operator metrics endpoint serving old certificates after certificate rotation
Summary: node tuning operator metrics endpoint serving old certificates after certificate rotation
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Tuning Operator
Version: 4.7
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.10.0
Assignee: dagray
QA Contact: liqcui
URL:
Whiteboard:
Depends On:
Blocks: 2033652
 
Reported: 2021-11-24 14:26 UTC by Andreas Nowak
Modified: 2022-03-10 16:30 UTC
CC: 4 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-03-10 16:30:34 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-node-tuning-operator pull 297 0 None open Bug 2026387: Handle certificate rotation in pkg/metrics/server.go 2021-12-16 20:33:35 UTC
Red Hat Product Errata RHSA-2022:0056 0 None None None 2022-03-10 16:30:48 UTC

Comment 2 liqcui 2021-12-23 12:32:21 UTC
Verified Result:
 oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-21-130047   True        False         304d    Cluster version is 4.10.0-0.nightly-2021-12-21-130047

After changing the NTP server time and syncing the time on the master and worker nodes, no CSR was generated.
[root@ip-10-0-26-251 .ssh]# oc get nodes
NAME                                         STATUS   ROLES    AGE    VERSION
ip-10-0-132-119.us-east-2.compute.internal   Ready    worker   304d   v1.22.1+6859754
ip-10-0-135-42.us-east-2.compute.internal    Ready    master   304d   v1.22.1+6859754
ip-10-0-166-194.us-east-2.compute.internal   Ready    master   304d   v1.22.1+6859754
ip-10-0-191-218.us-east-2.compute.internal   Ready    worker   304d   v1.22.1+6859754
ip-10-0-194-98.us-east-2.compute.internal    Ready    worker   304d   v1.22.1+6859754
ip-10-0-214-31.us-east-2.compute.internal    Ready    master   304d   v1.22.1+6859754

The certificate didn't rotate automatically, and the kubelet couldn't start up, failing with the error below:
Oct 23 12:26:05 ip-10-0-135-42 hyperkube[1588]: E1023 12:26:05.793101    1588 transport.go:112] kubernetes.io/kube-apiserver-client-kubelet: Current certificate is expired
transport.go:112] "No valid client certificate is found but the server is not responsive. A restart may be necessary to retrieve new initial credentials." lastCertificateAvailabilityTime="2021-12-23 12:08:55.149157194 +0000 UTC m=+0.584949896" shutdownThreshold="5m0s"
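The "Current certificate is expired" condition in this log can be checked directly with `openssl x509 -checkend`. A self-contained sketch using a throwaway certificate (on a node one would point `-in` at the kubelet's client cert, typically `/var/lib/kubelet/pki/kubelet-client-current.pem` — that path is an assumption, not taken from this report):

```shell
# Generate a throwaway cert valid for one day, then test whether it has
# expired; -checkend N exits non-zero if the cert expires within N seconds.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=expiry-demo" \
  -keyout /tmp/demo.key -out /tmp/demo.crt -days 1 2>/dev/null
if openssl x509 -in /tmp/demo.crt -noout -checkend 0 >/dev/null; then
  result="certificate still valid"
else
  result="certificate expired"
fi
echo "$result"
rm -f /tmp/demo.key /tmp/demo.crt
```

An exit status of 1 from `-checkend 0` corresponds to the already-expired state the kubelet reports here.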

No pending CSR was found.
oc get csr
NAME                                             AGE   SIGNERNAME                            REQUESTOR                                                                         REQUESTEDDURATION   CONDITION
system:openshift:openshift-authenticator-b69rj   37m   kubernetes.io/kube-apiserver-client   system:serviceaccount:openshift-authentication-operator:authentication-operator   <none>              Approved
system:openshift:openshift-monitoring-7qdqs      37m   kubernetes.io/kube-apiserver-client   system:serviceaccount:openshift-monitoring:cluster-monitoring-operator            <none>              Approved

Comment 4 liqcui 2021-12-31 12:44:14 UTC
Failed to reproduce on AWS; opened a new bug to track this issue: https://bugzilla.redhat.com/show_bug.cgi?id=2036361

I tried to reproduce the BZ several times; the kubelet couldn't start up after certificate rotation, and the OpenShift console couldn't load or log in, throwing a 401 error.

Console Error:
{"error":"server_error","error_description":"The authorization server encountered an unexpected condition that prevented it from fulfilling the request.","state":"39a034bf"}

My main test steps:
1. On NTP server:
[root@ip-10-0-31-53 ec2-user]# cat /etc/chrony.conf 
driftfile /var/lib/chrony/drift
makestep 1.0 3
allow 10.0.0.0/12
local stratum 1
logdir /var/log/chrony
manual
[root@ip-10-0-31-53 ec2-user]# systemctl restart chronyd

2. Create MachineConfiguration.
[ocpadmin@ec2-18-217-45-133 nto]$ oc create -f master-mc.yaml 
machineconfig.machineconfiguration.openshift.io/99-master-chrony created
[ocpadmin@ec2-18-217-45-133 nto]$ oc create -f worker-mc.yaml 
machineconfig.machineconfiguration.openshift.io/99-worker-chrony created
[ocpadmin@ec2-18-217-45-133 nto]$ cat worker-mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 99-worker-chrony
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,pool%20ip-10-0-31-53.us-east-2.compute.internal%20iburst%20%0Adriftfile%20%2Fvar%2Flib%2Fchrony%2Fdrift%0Amakestep%201.0%203%0Artcsync%0Alogdir%20%2Fvar%2Flog%2Fchrony%0A
        mode: 420
        overwrite: true
        path: /etc/chrony.conf
[ocpadmin@ec2-18-217-45-133 nto]$ cat master-mc.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: master
  name: 99-master-chrony
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
      - contents:
          source: data:,pool%20ip-10-0-31-53.us-east-2.compute.internal%20iburst%20%0Adriftfile%20%2Fvar%2Flib%2Fchrony%2Fdrift%0Amakestep%201.0%203%0Artcsync%0Alogdir%20%2Fvar%2Flog%2Fchrony%0A
        mode: 420
        overwrite: true
        path: /etc/chrony.conf
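The `source` field in both MachineConfigs is a percent-encoded `data:` URL; decoding it shows the chrony.conf that Ignition writes to the nodes (a quick sketch using python3's `urllib.parse.unquote`):

```shell
# Decode the percent-encoded Ignition payload to see the resulting
# /etc/chrony.conf (this is the payload from the YAML above).
encoded='pool%20ip-10-0-31-53.us-east-2.compute.internal%20iburst%20%0Adriftfile%20%2Fvar%2Flib%2Fchrony%2Fdrift%0Amakestep%201.0%203%0Artcsync%0Alogdir%20%2Fvar%2Flog%2Fchrony%0A'
decoded=$(python3 -c 'import sys, urllib.parse; sys.stdout.write(urllib.parse.unquote(sys.argv[1]))' "$encoded")
echo "$decoded"
```

The decoded file points the cluster nodes at the test NTP server (`ip-10-0-31-53...`), so the clock jump made on that server in step 3 propagates to the cluster.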

3. Change NTP Server Date
[root@ip-10-0-31-53 ec2-user]# date
Mon Oct 31 01:34:15 EDT 2022

4. Restart chronyd and the kubelet on the master servers.

5. Check that the CSRs have been approved.
[root@ip-10-0-31-53 ec2-user]# oc get csr
NAME                                             AGE   SIGNERNAME                            REQUESTOR                                                                         REQUESTEDDURATION   CONDITION
system:openshift:openshift-authenticator-998s5   14m   kubernetes.io/kube-apiserver-client   system:serviceaccount:openshift-authentication-operator:authentication-operator   <none>              Approved
system:openshift:openshift-monitoring-mn9xj      14m   kubernetes.io/kube-apiserver-client   system:serviceaccount:openshift-monitoring:cluster-monitoring-operator            <none>              Approved

The kubelet couldn't start up; failed to reproduce.

Is there any specific hardware on which the customer deploys OCP?

Comment 9 liqcui 2022-01-07 12:58:47 UTC
Thanks to Andreas and David for the suggestions.

I have verified on my cluster as below:

[ocpadmin@ec2-18-217-45-133 ~]$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-07-004348   True        False         41m     Cluster version is 4.10.0-0.nightly-2022-01-07-004348


oc describe service/node-tuning-operator  -n openshift-cluster-node-tuning-operator | grep Endpoints
Endpoints:         10.129.0.21:60000

export METRICS_ENDPOINT="10.129.0.21:60000"

oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null" | tee openssl_output_before.txt

oc debug node/liqcui-vmc410-bqjzp-worker-f5k78  -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_before.txt
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 11:04:47 2022 GMT
notAfter=Jan  7 11:04:48 2024 GMT
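The `notBefore`/`notAfter` pair printed here is standard `openssl x509 -dates` output; the same inspection works on any PEM certificate. A throwaway example mirroring the roughly two-year window seen on the metrics endpoint (this is a local sketch, not the cluster certificate):

```shell
# Print the validity window of a certificate; -days 730 mimics the ~2-year
# window served by node-tuning-operator-tls above.
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=dates-demo" \
  -keyout /tmp/dd.key -out /tmp/dd.crt -days 730 2>/dev/null
dates=$(openssl x509 -in /tmp/dd.crt -noout -dates)
echo "$dates"
rm -f /tmp/dd.key /tmp/dd.crt
```

Comparing this pair before and after deleting the secret is what demonstrates the rotation below: a changed `notBefore` means the endpoint is serving a freshly issued certificate.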

$ oc get pods -n openshift-cluster-node-tuning-operator
NAME                                            READY   STATUS    RESTARTS   AGE
cluster-node-tuning-operator-57878bc8f5-66p6r   1/1     Running   0          54m
tuned-4tdqr                                     1/1     Running   0          44m
tuned-gxxm4                                     1/1     Running   0          51m
tuned-mjnnh                                     1/1     Running   0          44m
tuned-tz4bz                                     1/1     Running   0          51m
tuned-wxvvp                                     1/1     Running   0          51m


$ oc get secret -n openshift-cluster-node-tuning-operator
NAME                                           TYPE                                  DATA   AGE
builder-dockercfg-np4g6                        kubernetes.io/dockercfg               1      48m
builder-token-qsv2z                            kubernetes.io/service-account-token   4      48m
builder-token-wc2pj                            kubernetes.io/service-account-token   4      48m
cluster-node-tuning-operator-dockercfg-fp2b5   kubernetes.io/dockercfg               1      48m
cluster-node-tuning-operator-token-9b4zs       kubernetes.io/service-account-token   4      48m
cluster-node-tuning-operator-token-md65z       kubernetes.io/service-account-token   4      57m
default-dockercfg-kvwd8                        kubernetes.io/dockercfg               1      48m
default-token-4gz8m                            kubernetes.io/service-account-token   4      58m
default-token-c6tld                            kubernetes.io/service-account-token   4      48m
deployer-dockercfg-xxkwc                       kubernetes.io/dockercfg               1      48m
deployer-token-76xsd                           kubernetes.io/service-account-token   4      48m
deployer-token-fjkmk                           kubernetes.io/service-account-token   4      48m
node-tuning-operator-tls                       kubernetes.io/tls                     2      55m
tuned-dockercfg-55gms                          kubernetes.io/dockercfg               1      48m
tuned-token-hc8vm                              kubernetes.io/service-account-token   4      57m
tuned-token-sfswb                              kubernetes.io/service-account-token   4      48m
[ocpadmin@ec2-18-217-45-133 ~]$ oc delete secret node-tuning-operator-tls -n openshift-cluster-node-tuning-operator
secret "node-tuning-operator-tls" deleted


oc logs cluster-node-tuning-operator-57878bc8f5-66p6r -n openshift-cluster-node-tuning-operator | tail -10
I0107 11:10:22.425041       1 trace.go:205] Trace[691670059]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (07-Jan-2022 11:09:52.423) (total time: 30001ms):
Trace[691670059]: ---"Objects listed" error:Get "https://172.30.0.1:443/apis/apps/v1/namespaces/openshift-cluster-node-tuning-operator/daemonsets?resourceVersion=12519": dial tcp 172.30.0.1:443: i/o timeout 30001ms (11:10:22.424)
Trace[691670059]: [30.001192828s] [30.001192828s] END
E0107 11:10:22.425061       1 reflector.go:138] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.DaemonSet: failed to list *v1.DaemonSet: Get "https://172.30.0.1:443/apis/apps/v1/namespaces/openshift-cluster-node-tuning-operator/daemonsets?resourceVersion=12519": dial tcp 172.30.0.1:443: i/o timeout
I0107 11:12:34.929912       1 controller.go:595] created profile liqcui-vmc410-bqjzp-worker-f5k78 [openshift-node]
I0107 11:12:36.653687       1 controller.go:595] created profile liqcui-vmc410-bqjzp-worker-s4c5k [openshift-node]
I0107 12:01:06.917646       1 server.go:144] cert and key changed, need to restart the server.
I0107 12:01:06.917698       1 server.go:107] restarting metrics server to rotate certificates
I0107 12:01:06.917706       1 server.go:60] stopping metrics server
I0107 12:01:07.991194       1 server.go:51] starting metrics server
[ocpadmin@ec2-18-217-45-133 ~]$ 
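The log above shows the fix from PR 297 in action: the operator notices the mounted cert/key pair changed and restarts the metrics server. The detection step can be illustrated with a shell sketch (a hash comparison; the real implementation in pkg/metrics/server.go is Go, so this is only an analogy, not the operator's code):

```shell
# Simulate rotation detection: snapshot a hash of the serving cert/key pair,
# rotate the files, and compare -- a change means the server must be
# restarted to begin serving the new certificate.
dir=$(mktemp -d)
printf 'CERT-v1' > "$dir/tls.crt"; printf 'KEY-v1' > "$dir/tls.key"
snapshot() { cat "$dir/tls.crt" "$dir/tls.key" | sha256sum; }
last=$(snapshot)
printf 'CERT-v2' > "$dir/tls.crt"   # simulate service-ca writing a rotated cert
if [ "$(snapshot)" != "$last" ]; then
  msg="cert and key changed, need to restart the server."
fi
echo "$msg"
rm -rf "$dir"
```

Without a check like this, the server keeps the certificate it loaded at startup, which is exactly the stale-certificate symptom this bug describes.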

oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_after.txt
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 12:00:05 2022 GMT
notAfter=Jan  7 12:00:06 2024 GMT

oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null" | tee openssl_output_after.txt

oc get secret/node-tuning-operator-tls -o json -n openshift-cluster-node-tuning-operator | jq -r '.data | ."tls.crt"' | base64 -d |  sed '/-END CERTIFICATE-/q' > cert_secret_after.txt
[ocpadmin@ec2-18-217-45-133 ~]$  diff cert_after.txt cert_secret_after.txt

oc delete secret node-tuning-operator-tls -n openshift-cluster-node-tuning-operator
secret "node-tuning-operator-tls" deleted
[ocpadmin@ec2-18-217-45-133 ~]$ date
Fri Jan  7 12:12:23 UTC 2022
[ocpadmin@ec2-18-217-45-133 ~]$ oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_after.txt2
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 12:12:20 2022 GMT
notAfter=Jan  7 12:12:21 2024 GMT

##########################################
[ocpadmin@ec2-18-217-45-133 ~]$ oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_after.txt
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 12:25:07 2022 GMT
notAfter=Jan  7 12:25:08 2024 GMT

[ocpadmin@ec2-18-217-45-133 ~]$ oc get nodes
NAME                               STATUS   ROLES    AGE    VERSION
liqcui-vmc410-bqjzp-master-0       Ready    master   107m   v1.22.1+6859754
liqcui-vmc410-bqjzp-master-1       Ready    master   107m   v1.22.1+6859754
liqcui-vmc410-bqjzp-master-2       Ready    master   107m   v1.22.1+6859754
liqcui-vmc410-bqjzp-worker-f5k78   Ready    worker   97m    v1.22.1+6859754
liqcui-vmc410-bqjzp-worker-s4c5k   Ready    worker   97m    v1.22.1+6859754
[ocpadmin@ec2-18-217-45-133 ~]$ oc delete secret/signing-key -n openshift-service-ca
secret "signing-key" deleted
The certificate was rotated by deleting the signing-key secret:
oc debug node/liqcui-vmc410-bqjzp-worker-f5k78 -- /bin/bash -c "/host/bin/openssl s_client -connect $METRICS_ENDPOINT 2>/dev/null </dev/null | openssl x509 -noout -dates" | tee cert_dates_after.txt
Starting pod/liqcui-vmc410-bqjzp-worker-f5k78-debug ...
To use host binaries, run `chroot /host`
notBefore=Jan  7 12:52:00 2022 GMT
notAfter=Jan  7 12:52:01 2024 GMT

Comment 12 errata-xmlrpc 2022-03-10 16:30:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056

