Bug 1871713

Summary: [AWS 4.6 upgrade] Upgrade failed while CCO in manual mode, Error: secret "aws-cloud-credentials" not found
Product: OpenShift Container Platform
Component: Cloud Credential Operator
Version: 4.6
Target Release: 4.5.z
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: high
Keywords: Reopened, Upgrades
Reporter: Yunfei Jiang <yunjiang>
Assignee: Devan Goodwin <dgoodwin>
QA Contact: Yunfei Jiang <yunjiang>
CC: aos-bugs, dgoodwin, gshereme, hekumar, lwan, sdodson, wking
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2020-10-19 14:54:24 UTC
Clones: 1879628
Bug Depends On: 1879628

Description Yunfei Jiang 2020-08-24 05:43:53 UTC
With CCO configured in manual mode, upgrading the cluster from 4.5.6 to a 4.6 nightly build failed because the Secret for openshift-cluster-csi-drivers was missing:

./oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.6     True        True          3h39m   Unable to apply 4.6.0-0.nightly-2020-08-18-165040: the cluster operator storage has not yet successfully rolled out

./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
storage                                    4.6.0-0.nightly-2020-08-18-165040   False       True          False      22h

./oc get pod -n openshift-cluster-csi-drivers
NAME                                             READY   STATUS                       RESTARTS   AGE
aws-ebs-csi-driver-controller-6c679787fd-4th6m   4/5     CreateContainerConfigError   0          22h
aws-ebs-csi-driver-node-cl7rz                    3/3     Running                      0          22h
aws-ebs-csi-driver-node-jc22w                    3/3     Running                      0          22h
aws-ebs-csi-driver-node-pcmwz                    3/3     Running                      0          22h
aws-ebs-csi-driver-node-r7g75                    3/3     Running                      0          22h
aws-ebs-csi-driver-node-rkhn5                    3/3     Running                      0          22h
aws-ebs-csi-driver-node-zdbnc                    3/3     Running                      0          22h
aws-ebs-csi-driver-operator-f574b4569-w2tmc      1/1     Running                      2          22h

./oc describe pod/aws-ebs-csi-driver-controller-6c679787fd-4th6m -n openshift-cluster-csi-drivers
Events:
  Type     Reason  Age                     From                                                Message
  ----     ------  ----                    ----                                                -------
  Warning  Failed  7m32s (x6145 over 22h)  kubelet, ip-10-0-51-226.us-east-2.compute.internal  Error: secret "aws-cloud-credentials" not found


> CredentialsRequests (CRs) compared between 4.5 and 4.6:

> CredentialsRequest for 4.6
cloud-credential-operator-iam-ro     24h
cloud-credential-operator-s3         23h  <-- new in 4.6, but it does not affect the upgrade
openshift-cluster-csi-drivers        23h  <-- new in 4.6; the upgrade fails if this Secret is missing
openshift-image-registry             24h
openshift-image-registry-azure       24h
openshift-image-registry-gcs         24h
openshift-image-registry-openstack   24h
openshift-ingress                    24h
openshift-ingress-azure              24h
openshift-ingress-gcp                24h
openshift-machine-api-aws            24h
openshift-machine-api-azure          24h
openshift-machine-api-gcp            24h
openshift-machine-api-openstack      24h
openshift-machine-api-ovirt          24h
openshift-machine-api-vsphere        24h
openshift-network                    24h

> CredentialsRequest for 4.5
cloud-credential-operator-iam-ro     53m
openshift-image-registry             64m
openshift-image-registry-azure       65m
openshift-image-registry-gcs         65m
openshift-image-registry-openstack   64m
openshift-ingress                    65m
openshift-ingress-azure              65m
openshift-ingress-gcp                64m
openshift-machine-api-aws            65m
openshift-machine-api-azure          65m
openshift-machine-api-gcp            64m
openshift-machine-api-openstack      65m
openshift-machine-api-ovirt          64m
openshift-machine-api-vsphere        64m
openshift-network                    65m
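
The two listings above can also be compared mechanically. The following is a minimal sketch, assuming each listing has been saved to a file without its header row; the function name and the file names in the usage note are hypothetical and not part of the original report.

```shell
# new_credentials_requests OLD_LIST NEW_LIST
# Prints the names (first column) that appear in NEW_LIST but not in OLD_LIST.
# Each file holds one CredentialsRequest per line, name first, as pasted above.
new_credentials_requests() {
    old_names=$(mktemp) || return 1
    new_names=$(mktemp) || return 1
    # Keep only the name column and sort it; comm requires sorted input.
    awk 'NF { print $1 }' "$1" | LC_ALL=C sort -u > "$old_names"
    awk 'NF { print $1 }' "$2" | LC_ALL=C sort -u > "$new_names"
    # -13 hides lines unique to the old list and lines common to both,
    # leaving only the names that are new.
    LC_ALL=C comm -13 "$old_names" "$new_names"
    rm -f "$old_names" "$new_names"
}
```

Called as `new_credentials_requests cr-4.5.txt cr-4.6.txt` on the two listings above, it prints cloud-credential-operator-s3 and openshift-cluster-csi-drivers.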


After providing the Secret for openshift-cluster-csi-drivers, the cluster upgraded to 4.6.0-0.nightly-2020-08-18-165040 successfully.

cat <<EOF >csi.yaml
apiVersion: v1
kind: Secret
metadata:
  name: aws-cloud-credentials
  namespace: openshift-cluster-csi-drivers
data:
  aws_access_key_id:  <HIDDEN>
  aws_secret_access_key: <HIDDEN>
EOF

./oc create -f csi.yaml
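
Values under `data` in a Kubernetes Secret must be base64-encoded. The following is a minimal sketch of rendering the same manifest from plain-text values; the function name `render_aws_secret` is hypothetical, and the Secret name and namespace are the ones used in this report.

```shell
# render_aws_secret ACCESS_KEY_ID SECRET_ACCESS_KEY
# Prints a Secret manifest like csi.yaml with both values base64-encoded.
render_aws_secret() {
    # printf '%s' avoids encoding a trailing newline; tr removes the line
    # wraps that some base64 implementations insert into long output.
    key_id_b64=$(printf '%s' "$1" | base64 | tr -d '\n')
    secret_b64=$(printf '%s' "$2" | base64 | tr -d '\n')
    cat <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: aws-cloud-credentials
  namespace: openshift-cluster-csi-drivers
data:
  aws_access_key_id: ${key_id_b64}
  aws_secret_access_key: ${secret_b64}
EOF
}
```

For example, `render_aws_secret "$AWS_ACCESS_KEY_ID" "$AWS_SECRET_ACCESS_KEY" > csi.yaml` would produce a file suitable for `./oc create -f csi.yaml`, assuming those two environment variables hold the credentials.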

./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-08-18-165040   True        False         5m4s    Cluster version is 4.6.0-0.nightly-2020-08-18-165040

./oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      6m10s
cloud-credential                           4.6.0-0.nightly-2020-08-18-165040   True        False         False      23h
cluster-autoscaler                         4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
config-operator                            4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
console                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      17m
csi-snapshot-controller                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      23m
dns                                        4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
etcd                                       4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
image-registry                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      25h
ingress                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      23h
insights                                   4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
kube-apiserver                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
kube-controller-manager                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
kube-scheduler                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
kube-storage-version-migrator              4.6.0-0.nightly-2020-08-18-165040   True        False         False      18m
machine-api                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
machine-approver                           4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
machine-config                             4.6.0-0.nightly-2020-08-18-165040   True        False         False      6m37s
marketplace                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      17m
monitoring                                 4.6.0-0.nightly-2020-08-18-165040   True        False         False      9m39s
network                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
node-tuning                                4.6.0-0.nightly-2020-08-18-165040   True        False         False      23h
openshift-apiserver                        4.6.0-0.nightly-2020-08-18-165040   True        False         False      165m
openshift-controller-manager               4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
openshift-samples                          4.6.0-0.nightly-2020-08-18-165040   True        False         False      2m25s
operator-lifecycle-manager                 4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
operator-lifecycle-manager-catalog         4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
operator-lifecycle-manager-packageserver   4.6.0-0.nightly-2020-08-18-165040   True        False         False      10m
service-ca                                 4.6.0-0.nightly-2020-08-18-165040   True        False         False      26h
storage                                    4.6.0-0.nightly-2020-08-18-165040   True        False         False      13m




Version-Release number of the following components: 
4.6.0-0.nightly-2020-08-18-165040
 
How reproducible: 
Always 
 
Steps to Reproduce: 
1. Create a 4.5 cluster with CCO in manual mode, refer to https://github.com/openshift/cloud-credential-operator/blob/master/docs/mode-manual-creds.md
2. Upgrade to a 4.6 nightly build.

Actual results: 
The storage operator upgrade failed with Error: secret "aws-cloud-credentials" not found

Expected results:
The storage operator upgrades successfully.

Additional info:
It would be better to notify the user before a cluster is upgraded to 4.6 while CCO is in manual mode.

Comment 1 Devan Goodwin 2020-08-24 12:36:02 UTC
Nice catch, Yunfei. It looks like storage is incorrectly using the root/admin cloud credential, which is often admin level and will not be present in manual mode.

In-cluster components that need cloud credentials should create a CredentialsRequest in the openshift-cloud-credential-operator namespace containing the exact permissions the component needs; this CredentialsRequest would then be included in the release image. Documentation here: https://github.com/openshift/cloud-credential-operator

Moving to storage.

Comment 3 Devan Goodwin 2020-08-24 15:20:38 UTC
I misread; this looks more like what happens when you are in manual mode and upgrade without pre-creating the credential needed by the new release image. This should be covered by the documentation.

Yunfei, did you perform the documented steps for a manual mode cluster upgrade? This would entail pre-creating namespaces and secrets from the release image you intend to upgrade to.

However, after talking with Hemant, this isn't even going to work, because they create the CredentialsRequest dynamically in the operator itself, so it is not carried in the release payload. This means that our auditability story of all CredentialsRequests being in the payload is not accurate, and upgrades will break for users in manual mode. We are keeping this bug open to address this issue.

We hope to have tooling to automate this in the future, but it's not there yet; users opting into manual mode must perform these steps themselves.
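
As a rough illustration of the manual step, and assuming the release payload manifests have already been extracted into a local directory (for example with `oc adm release extract`), the CredentialsRequest manifests can be located with a search like the one below. The function name is hypothetical. As noted above, a CredentialsRequest that an operator creates dynamically will not show up in such a listing.

```shell
# list_credentials_requests DIR
# Prints the path of every file under DIR that declares a CredentialsRequest.
list_credentials_requests() {
    find "$1" -type f -exec \
        grep -lE '^kind:[[:space:]]*CredentialsRequest' {} + | LC_ALL=C sort
}
```

Each file it prints names the namespace and Secret that the corresponding component expects to find after the upgrade.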

Comment 4 Yunfei Jiang 2020-08-25 12:39:20 UTC
Hello Devan,

>> Yunfei did you perform the documented steps for a manual mode cluster upgrade?

I'm not sure whether [1] and [2] are the documents you mentioned above.
And as you mentioned, the CR is not in the release payload, so users cannot extract it and do not know that the cluster needs a new Secret.

I tried to create a Secret for openshift-cluster-csi-drivers on the 4.5 cluster before upgrading, but it failed:
`Error from server (NotFound): error when creating "csi.yaml": namespaces "openshift-cluster-csi-drivers" not found`

What I expect:
1. Document the detailed upgrade process (the important step is providing the corresponding Secret before upgrading) in a more visible location, e.g. the Upgrade chapter of the official documentation or the Release Notes, instead of a component document such as [1] or [2].
2. Perform a pre-check before upgrading, such as Secret validation, and stop the upgrade if the requirements are not met.

[1] https://github.com/openshift/cloud-credential-operator/blob/master/docs/mode-manual-creds.md#upgrades
[2] https://github.com/openshift/cloud-credential-operator/blob/master/README.md

Comment 5 Devan Goodwin 2020-08-25 13:04:51 UTC
Yes, you likely need to create the namespace as well, and then the secret.

For (1), I will work with the docs team if the org approves that this is the best path forward; I think it's our only option.

For (2), there is a plan for upgrade preflight checks, but it missed 4.6. Hopefully it will land in 4.7 and we can use it for 4.8 and beyond.
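
A minimal sketch of that ordering, assuming a csi.yaml manifest like the one in the description; the function name is hypothetical, and only standard `oc get`, `oc create namespace`, and `oc create -f` calls are used.

```shell
# precreate_csi_secret MANIFEST
# Creates the namespace if it does not exist yet, then creates the Secret.
precreate_csi_secret() {
    manifest=$1
    ns=openshift-cluster-csi-drivers
    # On a 4.5 cluster this namespace is absent, which is what caused the
    # NotFound error quoted in Comment 4.
    oc get namespace "$ns" >/dev/null 2>&1 \
        || oc create namespace "$ns" \
        || return 1
    oc create -f "$manifest"
}
```

It would be invoked as `precreate_csi_secret csi.yaml` before starting the upgrade.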

Comment 7 Devan Goodwin 2020-08-28 16:35:17 UTC
It has come to our attention that we can protect against this by pushing code to 4.5 that, in manual mode, checks for the existence of the new credential secrets we know are coming in 4.6 and, if they are missing, sets Upgradeable=False to indicate that the user needs to take action. Upgradeable=False only blocks 4.y upgrades; it will not block a 4.5.z upgrade.
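
The idea can be approximated from the command line. The sketch below is only an illustration of the check, not the operator's actual implementation; the function name is hypothetical, and the list of required secrets is passed in by the caller rather than assumed.

```shell
# upgradeable_for_manual_mode NAMESPACE/NAME...
# Prints Upgradeable=True if every listed Secret exists, otherwise
# Upgradeable=False, reporting each missing Secret on stderr.
upgradeable_for_manual_mode() {
    status=True
    for ref in "$@"; do
        ns=${ref%%/*}     # text before the first slash
        name=${ref#*/}    # text after the first slash
        if ! oc get secret "$name" -n "$ns" >/dev/null 2>&1; then
            echo "missing secret: $ref" >&2
            status=False
        fi
    done
    echo "Upgradeable=$status"
}
```

For the Secret from this report, the call would be `upgradeable_for_manual_mode openshift-cluster-csi-drivers/aws-cloud-credentials`.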

Re-opening the bugzilla to use for this purpose.

Comment 8 Devan Goodwin 2020-09-10 12:11:38 UTC
Fix for this is actively underway, PR coming soon.

Comment 10 Scott Dodson 2020-09-23 17:31:32 UTC
I'm bumping this to urgent because I want to ensure that this makes the current merge window so that we can tie the minimum 4.5 to 4.6 version to this release.

Comment 12 Yunfei Jiang 2020-09-28 07:17:30 UTC
Verification FAILED.
version: 4.5.0-0.nightly-2020-09-26-194704

According to https://bugzilla.redhat.com/show_bug.cgi?id=1879628#c2 , I ran the following tests:

>> case 1 [PASS]
1. install 4.5 with credentials mode Manual
2. checked Upgradeable=False
3. create secrets for s3 and csi, checked cloud-credential => Upgradeable=True
4. delete s3 or csi secret, checked cloud-credential => Upgradeable=False
5. re-create s3 or csi secret, checked cloud-credential => Upgradeable=True
6. Upgrade to 4.6 successfully

>> case 2 [FAILED]
1. install 4.5 with default credentials (no `credentialsMode` in install-config.yaml)
2. checked cloud-credential => Upgradeable=True
3. checked annotations of aws-creds => "mint"
4. removed aws-creds
5. checked cloud-credential => Upgradeable=True (should be Upgradeable=False)
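
The Upgradeable checks in both cases can be read directly from the cloud-credential ClusterOperator. A minimal sketch, assuming `oc` is logged in to the cluster under test; the function name is hypothetical.

```shell
# cco_upgradeable
# Prints the status (True/False/Unknown) of the Upgradeable condition
# reported by the cloud-credential ClusterOperator.
cco_upgradeable() {
    oc get clusteroperator cloud-credential \
        -o jsonpath='{.status.conditions[?(@.type=="Upgradeable")].status}'
}
```

An empty result would mean the operator has not reported an Upgradeable condition.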

Comment 13 wang lin 2020-09-28 08:16:46 UTC
In version 4.5, the Upgradeable status does not change to False immediately when the root creds are removed.
We need to wait a long time for the next CCO reconcile, or force a reconcile by adding an annotation to the CloudCredential object.

Comment 14 Devan Goodwin 2020-09-28 13:23:22 UTC
I spoke with our pillar lead, Scott Dodson. He feels we should file a new bug for the missing immediate update when the root cred is removed or restored. I request that we consider this one verified, and I will work on the other issue separately. Does this sound OK?

Comment 17 wang lin 2020-09-29 01:55:59 UTC
The upgrade function for manual mode has been fixed. It can be marked "VERIFIED" from my side; I will wait for Yunfei's input.

Comment 18 Yunfei Jiang 2020-09-29 02:29:06 UTC
Hello @Devan,

I agree with you. The original problem was that we could not upgrade the cluster smoothly with CCO configured in manual mode. We have now tested this feature/process, and it works well.

For case 2 in Comment 12, we will file a new bz to track the issue. Thanks.

Marking this bug as VERIFIED.

Comment 21 errata-xmlrpc 2020-10-19 14:54:24 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5.15 bug fix update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4228