Bug 1784052 - maintaining the certificates can get wedged if someone manually messes up the secrets we write to
Summary: maintaining the certificates can get wedged if someone manually messes up the...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: kube-apiserver
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 4.4.0
Assignee: Stefan Schimanski
QA Contact: Ke Wang
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2019-12-16 15:14 UTC by David Eads
Modified: 2020-05-04 11:20 UTC (History)
3 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:20:19 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift library-go pull 649 0 None closed bug 1784052: delete secrets if immutable fields don't match 2020-10-26 21:29:09 UTC
Red Hat Product Errata RHBA-2020:0581 0 None None None 2020-05-04 11:20:43 UTC

Description David Eads 2019-12-16 15:14:59 UTC
A loop for maintaining the certificates can get wedged if someone manually messes up the secrets we write to.


```
    "conditions": [
      {
        "type": "Degraded",
        "status": "True",
        "lastTransitionTime": "2019-12-14T11:01:33Z",
        "reason": "CertRotation_AggregatorProxyClientCert__RotationError",
        "message": "CertRotation_AggregatorProxyClientCert_Degraded: Secret \"aggregator-client-signer\" is invalid: type: Invalid value: \"kubernetes.io/tls\": field is immutable"
      },

```

I suspect this is caused by user naughtiness, but we should still stomp.

Comment 3 Ke Wang 2020-01-14 09:24:04 UTC
Hi David, I am working on this bug, bug I got less from the bug and openshift library-go pull 649. Could you give me some hints, How to verify for this change?  Thanks.

Comment 4 Ke Wang 2020-01-15 09:45:28 UTC
From the https://github.com/openshift/library-go/pull/649 file changes, to verify the code change, need to apply secret, right? 

I suppose just like this,  here are my verification steps,

oc new-project myprj

oc new-app ruby~https://github.com/sclorg/ruby-ex.git

oc get secret
NAME                       TYPE                                  DATA   AGE
builder-dockercfg-46l87    kubernetes.io/dockercfg               1      5h56m
builder-token-8jwkx        kubernetes.io/service-account-token   4      5h56m
builder-token-pxh4n        kubernetes.io/service-account-token   4      5h56m
default-dockercfg-z9k85    kubernetes.io/dockercfg               1      5h56m
default-token-p9m6z        kubernetes.io/service-account-token   4      5h56m
default-token-whjws        kubernetes.io/service-account-token   4      5h56m
deployer-dockercfg-9skpg   kubernetes.io/dockercfg               1      5h56m
deployer-token-5dbqm       kubernetes.io/service-account-token   4      5h56m
deployer-token-rx7ms       kubernetes.io/service-account-token   4      5h56m

Select one secret and export it to a yaml file.
oc get secrets default-token-whjws -o yaml > default-token-whjws.yaml

Manually messed up the secrets file, add a useless item test: "xxxx', see below,
...
kind: Secret    
metadata:    
  annotations:    
    kubernetes.io/service-account.name: default    
    kubernetes.io/service-account.uid: fe373075-955b-4224-83b1-21f616fa77c1    
  creationTimestamp: "2020-01-15T03:09:06Z"    
  name: default-token-whjws    
  namespace: myprj    
  test: "xxxx"    
  resourceVersion: "38142"    
  selfLink: /api/v1/namespaces/myprj/secrets/default-token-whjws    
  uid: ec5454f2-b12d-4103-a3b8-9aae37e579ee    
type: kubernetes.io/service-account-token

Tried to apply the change to secret,
$ oc apply -f default-token-whjws.yaml
Warning: oc apply should be used on resource created by either oc create --save-config or oc apply
secret/default-token-whjws configured

Checked the changed secret,  found that the change was deleted.
oc get secrets default-token-whjws -o yaml
kind: Secret    
...
  creationTimestamp: "2020-01-15T03:09:06Z"    
  name: default-token-whjws    
  namespace: myprj     
  resourceVersion: "38142"    
  selfLink: /api/v1/namespaces/myprj/secrets/default-token-whjws    
  uid: ec5454f2-b12d-4103-a3b8-9aae37e579ee    
type: kubernetes.io/service-account-token

I am not sure if the behavior is as expected as bug fix, and I don't know how to check DeleteEvent for this behavior.  Could you give me some advise about this?  Thanks.

Comment 5 Xingxing Xia 2020-01-17 08:53:03 UTC
Ke, I admit this bug indeed is not easy to verify if we are not familiar with code like Dev. I tried to read the code https://github.com/openshift/library-go/pull/649/files#diff-f8cab7977ffc2a35816f633d9f8f166dL210 under " func ApplySecret ":
	existing, err := client.Secrets(required.Namespace).Get(required.Name, metav1.GetOptions{}) ## get secret
...
	existingCopy := existing.DeepCopy() ## make an object copy
...
	existingCopy.Type = required.Type   ## copy Type
...
 	actual, err := client.Secrets(required.Namespace).Update(existingCopy)  ## update secret with the copied object
...
 	if !strings.Contains(err.Error(), "field is immutable") {  ## if the copied object changed immutable field
...
	deleteErr := client.Secrets(required.Namespace).Delete(existingCopy.Name, nil)  ## delete it
	reportDeleteEvent(recorder, existingCopy, deleteErr)  ## report the delete
...

After knowing above logic, in latest 4.4 env, let's try:
oc get secret aggregator-client-signer -n openshift-kube-apiserver-operator -o yaml > secret.yaml
sed -i "s/type: SecretTypeTLS/type: Opaque/" secret.yaml

Since this secret is cluster-auto-managed/created secret, we need to try below several times until " secret/aggregator-client-signer created "
oc delete secret aggregator-client-signer -n openshift-kube-apiserver-operator; oc create -f secret.yaml
-n openshift-kube-apiserver-operator
secret "aggregator-client-signer" deleted
Error from server (AlreadyExists): error when creating "secret.yaml": secrets "aggregator-client-signer" already exists                
[xxia 2020-01-17 16:10:58 my]$ oc delete secret aggregator-client-signer -n openshift-kube-apiserver-operator; oc create -f secret.yaml
-n openshift-kube-apiserver-operator
secret "aggregator-client-signer" deleted
secret/aggregator-client-signer created

oc logs deployment/kube-apiserver-operator -n openshift-kube-apiserver-operator > kas-o.log
vi kas-o.log ## you will see " field is immutable ", SecretDeleted, SecretCreated
I0117 08:11:08.328368       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"2433149b-30c2-4470-a74a-a08bf63812d3", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'SecretUpdateFailed' Failed to update Secret/aggregator-client-signer -n openshift-kube-apiserver-operator: Secret "aggregator-client-signer" is invalid: type: Invalid value: "kubernetes.io/tls": field is immutable
I0117 08:11:10.336023       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"2433149b-30c2-4470-a74a-a08bf63812d3", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretDeleted' Deleted Secret/aggregator-client-signer -n openshift-kube-apiserver-operator
...
I0117 08:11:12.131759       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"2433149b-30c2-4470-a74a-a08bf63812d3", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretCreated' Created Secret/aggregator-client-signer -n openshift-kube-apiserver-operator because it was missing

oc get secret aggregator-client-signer -n openshift-kube-apiserver-operator
NAME                       TYPE                DATA   AGE
aggregator-client-signer   kubernetes.io/tls   2      4s  ## SecretCreated, Type is not above "Opaque"

Comment 6 Xingxing Xia 2020-01-17 09:16:58 UTC
Now that David had test code https://github.com/openshift/cluster-kube-apiserver-operator/pull/703/files , you could also verify this bug via the test code if you agree the test code can cover the fix:
$ git clone <the cluster-kube-apiserver-operator repo> # or git pull if already cloned
$ git checkout <the-branch-that-includes-the-pr>
$ cd <the PR test file's directory>
$ export KUBECONFIG=<kubeconfig path>
go test -v -run TestCertRotationStompOnBadType
=== RUN   TestCertRotationStompOnBadType
Found configuration for host https://...snipped...:6443.
--- PASS: TestCertRotationStompOnBadType (3.30s)
PASS
ok      github.com/openshift/cluster-kube-apiserver-operator/test/e2e   3.313s

Comment 7 Ke Wang 2020-01-20 07:14:06 UTC
Verified with OCP 4.4.0-0.nightly-2020-01-19-223035.

Tried the xxia's first way, 
$ oc get secret aggregator-client-signer -n openshift-kube-apiserver-operator -o yaml > secret.yaml
$ sed -i "s/type: SecretTypeTLS/type: Opaque/" secret.yaml

Tried below CLI several times until " secret/aggregator-client-signer created "
$ oc delete secret aggregator-client-signer -n openshift-kube-apiserver-operator;oc create -f secret.yaml -n openshift-kube-apiserver-operator
secret "aggregator-client-signer" deleted
secret/aggregator-client-signer created

Export the log to file and check the related events,
$ oc logs deployment/kube-apiserver-operator -n openshift-kube-apiserver-operator > kad-o.log

We can see the events -> Secret "aggregator-client-signer" has a invalid field which is immutable, we messed up it by creating from secret.yaml -> Deleted Secret/aggregator-client-signer -> Created Secret/aggregator-client-signer, just like the workflow in PR openshift library-go pull 649.

687 I0120 06:49:24.385134       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"1d7746e1-edfa-    4c9c-97ba-4ddb87caf8ad", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Warning' reason: 'SecretUpdateFailed' Failed to update Secret/aggregator-client-signer -n opensh    ift-kube-apiserver-operator: Secret "aggregator-client-signer" is invalid: type: Invalid value: "kubernetes.io/tls": field is immutable                                                   
688 I0120 06:49:24.394424       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"1d7746e1-edfa-    4c9c-97ba-4ddb87caf8ad", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretDeleted' Deleted Secret/aggregator-client-signer -n openshift-kube-apiser    ver-operator
689 I0120 06:49:24.406008       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"1d7746e1-edfa-    4c9c-97ba-4ddb87caf8ad", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'CABundleUpdateRequired' "kube-apiserver-aggregator-client-ca" in "openshift-con    fig-managed" requires a new cert
690 I0120 06:49:24.406031       1 event.go:281] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-apiserver-operator", Name:"kube-apiserver-operator", UID:"1d7746e1-edfa-    4c9c-97ba-4ddb87caf8ad", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'SecretCreated' Created Secret/aggregator-client-signer -n openshift-kube-apiser    ver-operator because it was missing


Another way verify from related test code:
Prerequisites:
One openshift ENV is ready.
Install oc client on your test-box.

$ git clone <the cluster-kube-apiserver-operator repo> # or git pull if already cloned
$ cd <the PR test file's directory>
$ ls
certrotation_test.go  encryption_test.go  featuregate_upgradeable_test.go  operator_test.go  trace.out  user_certs_test.go  user_client_ca_test.go  user_cors_test.go

$ export KUBECONFIG=<kubeconfig path>

First run, it takes long time for compiling go packages, be patience.
$ go test -v -run TestCertRotationStompOnBadType
=== RUN   TestCertRotationStompOnBadType
Found configuration for host https://api...openshift.com:6443.
--- PASS: TestCertRotationStompOnBadType (1.40s)
PASS
ok  	github.com/openshift/cluster-kube-apiserver-operator/test/e2e	1.412s

Comment 9 errata-xmlrpc 2020-05-04 11:20:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581


Note You need to log in before you can comment on or make changes to this bug.