Bug 1779531

Summary: [3.11] Dynamic provisioning with Ceph RBD failed with error "failed to get admin secret" while the secret exist
Product: OpenShift Container Platform Reporter: Liang Xia <lxia>
Component: StorageAssignee: Jan Safranek <jsafrane>
Status: CLOSED WONTFIX QA Contact: Liang Xia <lxia>
Severity: medium Docs Contact:
Priority: medium    
Version: 3.11.0CC: aos-bugs, bfuru, jsafrane
Target Milestone: ---Keywords: Regression
Target Release: 3.11.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2019-12-10 14:54:13 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1719686    

Description Liang Xia 2019-12-04 07:03:15 UTC
Description of problem:
Dynamic provisioning with Ceph RBD failed with error,
Failed to provision volume with StorageClass "ceph-block": failed to get admin secret from ["lxiap"/"cephrbd-secret"]: failed to get secret from ["lxiap"/"cephrbd-secret"]
but the secret exist in the namespace.


Version-Release number of selected component (if applicable):
openshift v3.11.157
kubernetes v1.11.0+d4cacc0


How reproducible:
Always

Steps to Reproduce:
1. Create the Ceph secret.
2. Create the storage class.
3. Create a PVC, and check the PVC.

Actual results:
Provisioning failed with error failed to get admin secret.

Expected results:
Provisioning succeed.



PV Dump:
No PV provisioned.

PVC Dump:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    openshift.io/generated-by: OpenShiftNewApp
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/rbd
  creationTimestamp: "2019-12-04T04:03:22Z"
  finalizers:
  - kubernetes.io/pvc-protection
  labels:
    app: mysql-persistent
    template: mysql-persistent-template
  name: mysql
  namespace: lxiap
  resourceVersion: "20180"
  selfLink: /api/v1/namespaces/lxiap/persistentvolumeclaims/mysql
  uid: 03893253-164b-11ea-8189-fa163ee74be8
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
  storageClassName: ceph-block
status:
  phase: Pending

StorageClass Dump (if StorageClass used by PV/PVC):
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
  creationTimestamp: "2019-12-04T04:04:06Z"
  name: ceph-block
  resourceVersion: "20292"
  selfLink: /apis/storage.k8s.io/v1/storageclasses/ceph-block
  uid: 1dd48646-164b-11ea-8189-fa163ee74be8
parameters:
  adminId: admin
  adminSecretName: cephrbd-secret
  adminSecretNamespace: lxiap
  monitors: 10.0.149.138:6789
  pool: rbd
  userId: admin
  userSecretName: cephrbd-secret
provisioner: kubernetes.io/rbd
reclaimPolicy: Delete
volumeBindingMode: Immediate


Additional info:
# oc describe pvc mysql
Name:          mysql
Namespace:     lxiap
StorageClass:  ceph-block
Status:        Pending
Volume:        
Labels:        app=mysql-persistent
               template=mysql-persistent-template
Annotations:   openshift.io/generated-by=OpenShiftNewApp
               volume.beta.kubernetes.io/storage-provisioner=kubernetes.io/rbd
Finalizers:    [kubernetes.io/pvc-protection]
Capacity:      
Access Modes:  
Events:
  Type     Reason              Age                 From                         Message
  ----     ------              ----                ----                         -------
  Warning  ProvisioningFailed  46m (x3 over 46m)   persistentvolume-controller  Failed to provision volume with StorageClass "ceph-block": failed to get admin secret from ["default"/"cephrbd-secret"]: failed to get secret from ["default"/"cephrbd-secret"]
  Warning  ProvisioningFailed  46m                 persistentvolume-controller  storageclass.storage.k8s.io "ceph-block" not found
  Warning  ProvisioningFailed  1m (x184 over 45m)  persistentvolume-controller  Failed to provision volume with StorageClass "ceph-block": failed to get admin secret from ["lxiap"/"cephrbd-secret"]: failed to get secret from ["lxiap"/"cephrbd-secret"]

# oc get secret cephrbd-secret -n lxiap
NAME             TYPE      DATA      AGE
cephrbd-secret   Opaque    1         1h
# oc get secret cephrbd-secret -n default
NAME             TYPE      DATA      AGE
cephrbd-secret   Opaque    1         2h

Comment 1 Jan Safranek 2019-12-06 15:12:31 UTC
Liang, since you used "Regression" keyword, do you remember in what OCP release it was working? I don't recall any recent changes in this area. In addition, can you see any suspicious errors in kube-controller-manager logs? Especially any permission denied.

Comment 2 Liang Xia 2019-12-09 06:45:40 UTC
(In reply to Jan Safranek from comment #1)
> Liang, since you used "Regression" keyword, do you remember in what OCP
> release it was working? I don't recall any recent changes in this area. In
> addition, can you see any suspicious errors in kube-controller-manager logs?
> Especially any permission denied.

We haven't touched this for a long time, so no exact time/build info when this starts broken.

kube-system-master-controllers-lxia-bz1719686master-etcd-1.log
I1209 01:25:12.294770       1 graph_builder.go:586] GraphBuilder process object: v1/PersistentVolumeClaim, namespace lxiap, name mysql, uid a7c1d05d-1a4c-11ea-814d-fa163eb6601c, event type add
I1209 01:25:12.298274       1 pv_controller_base.go:171] enqueued "lxiap/mysql" for sync
I1209 01:25:12.301748       1 pv_controller_base.go:357] claimWorker[lxiap/mysql]
I1209 01:25:12.301803       1 pv_controller_base.go:507] storeObjectUpdate: adding claim "lxiap/mysql", version 30138
I1209 01:25:12.301889       1 pv_controller.go:238] synchronizing PersistentVolumeClaim[lxiap/mysql]: phase: Pending, bound to: "", bindCompleted: false, boundByController: false
I1209 01:25:12.301985       1 pv_controller.go:336] synchronizing unbound PersistentVolumeClaim[lxiap/mysql]: no volume found
I1209 01:25:12.303683       1 pvc_protection_controller.go:256] Got event on PVC lxiap/mysql
I1209 01:25:12.314764       1 factory.go:100] Adding deployment config lxiap/mysql
I1209 01:25:12.314920       1 deploymentconfig_controller.go:88] Reconciling lxiap/mysql
I1209 01:25:12.315093       1 deploymentconfig_controller.go:156] Postponing rollout #0 for DeploymentConfig lxiap/mysql because of invalid or unresolved image for container #0 (name=mysql)
I1209 01:25:12.316137       1 image_trigger_controller.go:363] Started syncing resource "deploymentconfigs.apps.openshift.io/lxiap/mysql"
I1209 01:25:12.317311       1 deploymentconfigs.go:247] Deployment config mysql has changed
I1209 01:25:12.324236       1 graph_builder.go:586] GraphBuilder process object: apps.openshift.io/v1/DeploymentConfig, namespace lxiap, name mysql, uid a7c7263a-1a4c-11ea-814d-fa163eb6601c, event type add
I1209 01:25:12.337728       1 pv_controller.go:1352] provisionClaim[lxiap/mysql]: started
I1209 01:25:12.342238       1 pv_controller.go:1597] scheduleOperation[provision-lxiap/mysql[a7c1d05d-1a4c-11ea-814d-fa163eb6601c]]
I1209 01:25:12.342463       1 pv_controller.go:1365] provisionClaimOperation [lxiap/mysql] started, class: "cephrbdprovisioner"
I1209 01:25:12.388127       1 factory.go:108] Updating deployment config lxiap/mysql
I1209 01:25:12.389144       1 graph_builder.go:586] GraphBuilder process object: apps.openshift.io/v1/DeploymentConfig, namespace lxiap, name mysql, uid a7c7263a-1a4c-11ea-814d-fa163eb6601c, event type update
I1209 01:25:12.389447       1 image_trigger_controller.go:365] Finished syncing resource "deploymentconfigs.apps.openshift.io/lxiap/mysql" (73.313645ms)
I1209 01:25:12.389600       1 image_trigger_controller.go:363] Started syncing resource "deploymentconfigs.apps.openshift.io/lxiap/mysql"
I1209 01:25:12.389789       1 deploymentconfigs.go:244] Deployment config mysql has not changed
I1209 01:25:12.389905       1 image_trigger_controller.go:365] Finished syncing resource "deploymentconfigs.apps.openshift.io/lxiap/mysql" (345.038µs)
I1209 01:25:12.394777       1 deploymentconfig_controller.go:507] Error syncing deployment config lxiap/mysql: Operation cannot be fulfilled on deploymentconfigs.apps.openshift.io "mysql": the object has been modified; please apply your changes to the latest version and try again
I1209 01:25:12.394805       1 deploymentconfig_controller.go:88] Reconciling lxiap/mysql
I1209 01:25:12.394955       1 deploymentconfig_controller.go:574] Rolling out initial deployment for lxiap/mysql as it now have images available
I1209 01:25:12.402807       1 graph_builder.go:586] GraphBuilder process object: v1/PersistentVolumeClaim, namespace lxiap, name mysql, uid a7c1d05d-1a4c-11ea-814d-fa163eb6601c, event type update
I1209 01:25:12.402845       1 pv_controller_base.go:171] enqueued "lxiap/mysql" for sync
I1209 01:25:12.402857       1 pv_controller_base.go:357] claimWorker[lxiap/mysql]
I1209 01:25:12.402867       1 pv_controller_base.go:535] storeObjectUpdate updating claim "lxiap/mysql" with version 30141
I1209 01:25:12.402877       1 pv_controller.go:238] synchronizing PersistentVolumeClaim[lxiap/mysql]: phase: Pending, bound to: "", bindCompleted: false, boundByController: false
I1209 01:25:12.402895       1 pv_controller_base.go:535] storeObjectUpdate updating claim "lxiap/mysql" with version 30141
I1209 01:25:12.402903       1 pv_controller.go:336] synchronizing unbound PersistentVolumeClaim[lxiap/mysql]: no volume found
I1209 01:25:12.402911       1 pv_controller.go:1352] provisionClaim[lxiap/mysql]: started
I1209 01:25:12.402916       1 pv_controller.go:1597] scheduleOperation[provision-lxiap/mysql[a7c1d05d-1a4c-11ea-814d-fa163eb6601c]]
I1209 01:25:12.402926       1 pv_controller.go:1608] operation "provision-lxiap/mysql[a7c1d05d-1a4c-11ea-814d-fa163eb6601c]" is already running, skipping
I1209 01:25:12.402941       1 pvc_protection_controller.go:256] Got event on PVC lxiap/mysql
I1209 01:25:12.407295       1 graph_builder.go:586] GraphBuilder process object: apps.openshift.io/v1/DeploymentConfig, namespace lxiap, name mysql, uid a7c7263a-1a4c-11ea-814d-fa163eb6601c, event type update
I1209 01:25:12.407957       1 deploymentconfig_controller.go:88] Reconciling lxiap/mysql
I1209 01:25:12.408005       1 deploymentconfig_controller.go:574] Rolling out initial deployment for lxiap/mysql as it now have images available
I1209 01:25:12.408237       1 factory.go:108] Updating deployment config lxiap/mysql
E1209 01:25:12.409213       1 rbd.go:1101] failed to get secret from ["default"/"cephrbd-secret"]
I1209 01:25:12.409426       1 pv_controller.go:1476] failed to provision volume for claim "lxiap/mysql" with StorageClass "cephrbdprovisioner": failed to get admin secret from ["default"/"cephrbd-secret"]: failed to get secret from ["default"/"cephrbd-secret"]
I1209 01:25:12.409937       1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"lxiap", Name:"mysql", UID:"a7c1d05d-1a4c-11ea-814d-fa163eb6601c", APIVersion:"v1", ResourceVersion:"30141", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' Failed to provision volume with StorageClass "cephrbdprovisioner": failed to get admin secret from ["default"/"cephrbd-secret"]: failed to get secret from ["default"/"cephrbd-secret"]

Comment 3 Jan Safranek 2019-12-09 11:24:48 UTC
I looked at the cluster at lxia-bz1719686master-etcd-1.int.1209-som.qe.rhcloud.com and cephrbd-secret had wrong type:

apiVersion: v1
data:
  key: QVFEUnJLTlZiRWV2Q2hBQUVtUkMrcFcvS0JWSHhhMHcvUE9JTEE9PQo=
kind: Secret
metadata:
  creationTimestamp: "2019-12-09T03:12:43Z"
  name: cephrbd-secret
  namespace: default
  resourceVersion: "4760"
  selfLink: /api/v1/namespaces/default/secrets/cephrbd-secret
  uid: c3fb2ec3-1a31-11ea-814d-fa163eb6601c
type: Opaque


Should be:

type: kubernetes.io/rbd


With this change, RBD provisioning works. I don't know who created the secret, was it installer? It needs to be fixed then.

We can improve the error message by backporting https://github.com/kubernetes/kubernetes/pull/72033:

> failed to get admin secret from ["default"/"cephrbd-secret"]: failed to get secret from ["default"/"cephrbd-secret"]: Cannot get secret of type Opaque

Comment 6 Liang Xia 2019-12-10 02:31:12 UTC
(In reply to Jan Safranek from comment #3)
> With this change, RBD provisioning works. I don't know who created the
> secret, was it installer? It needs to be fixed then.

QE have fixed the templates and scripts.

> 
> We can improve the error message by backporting
> https://github.com/kubernetes/kubernetes/pull/72033:
> 
> > failed to get admin secret from ["default"/"cephrbd-secret"]: failed to get secret from ["default"/"cephrbd-secret"]: Cannot get secret of type Opaque

That would be helpful.

Comment 7 Jan Safranek 2019-12-10 14:54:13 UTC
(In reply to Liang Xia from comment #6)
> > 
> > We can improve the error message by backporting
> > https://github.com/kubernetes/kubernetes/pull/72033:
> > 
> > > failed to get admin secret from ["default"/"cephrbd-secret"]: failed to get secret from ["default"/"cephrbd-secret"]: Cannot get secret of type Opaque
> 
> That would be helpful.

I got second thoughts, this is 3.11 but and we backport only important fixes there. Otherwise we're going to backport everything.

We fixed our documentation, that should be enough for now.