Bug 1507628
| Summary: | GlusterFS registry PVC not binding when default StorageClass specified | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Hongkai Liu <hongkliu> |
| Component: | Installer | Assignee: | Jose A. Rivera <jarrpa> |
| Status: | CLOSED ERRATA | QA Contact: | Johnny Liu <jialiu> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.7.0 | CC: | ansverma, aos-bugs, hongkliu, jokerman, mifiedle, mmccomas |
| Target Milestone: | --- | | |
| Target Release: | 3.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-30 19:09:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description Hongkai Liu 2017-10-30 18:26:38 UTC
It might be related to the fact that the default storage class on the cluster with AWS instances is aws gb2.

Hongkai Liu:
(In reply to Hongkai Liu from comment #4)
> It might be related to the fact that the default storage class on the
> cluster with AWS instances is aws gb2.

Correction: gp2

Jose A. Rivera:
To start, I'll note that use of this playbook has not been tested and is currently not supported. That doesn't mean I won't try to figure this out. :)

Please provide the output of "oc describe pvc registry-glusterfs-claim". Also provide the output of "oc get pv" and "oc describe pv registry-glusterfs-volume" if said volume exists. Thanks.

Hongkai Liu:
Please check the attachment: terminal output. That probably rings some bells already; at least, the output of "oc describe pvc registry-glusterfs-claim" is in there.

BTW, what is the difference between config.yml and registry.yml? The README says the latter has the same behavior as the former, which confuses me. :)

Jose A. Rivera:
Ah, right, I did actually read through that. I was evidently unable to keep the huge volume of info between the three attachments straight. :) Still, the other outputs would help.

See if you can do the following:

1. Get a YAML definition for the registry-glusterfs-claim PVC: "oc get pvc registry-glusterfs-claim -o yaml"
2. Delete the current PVC.
3. Remove the metadata except for the name.
4. Provide a storageClassName parameter of "" (empty string).
5. "oc create" the modified PVC YAML file.

config.yml will set up a GlusterFS cluster managed by heketi and (by default) create a StorageClass that will use it. registry.yml will set up a GlusterFS cluster managed by heketi without a StorageClass (by default) AND it will create a volume that is intended for use as storage for a hosted registry. registry.yml uses all the same Ansible as config.yml with slightly different defaults and then adds a few more tasks on top of that.

Hongkai Liu:
Today I did not see the 2nd PVC, registry-glusterfs-claim, but registry-claim is still there, so I ran the commands against that PVC. It seems that PVC is bound after the modification.

How do we fix the playbook? And can we expect the PVC to be attached to the docker-registry pod after running the registry.yml playbook, or does it just create the PVC and leave us to attach it manually?
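The empty string in step 4 is the crux of the bug: when spec.storageClassName is omitted, the DefaultStorageClass admission plugin fills in the cluster's default class (gp2 on this AWS cluster), so the claim waits on the EBS dynamic provisioner instead of binding to the pre-created GlusterFS PV. An explicit "" asks for a PV with no class. Side by side, trimmed to the relevant fields:

```yaml
# As the playbook creates it: storageClassName omitted, so the
# DefaultStorageClass admission plugin injects "gp2" (which is why the
# retrieved object below shows storageClassName: gp2) and the claim
# sits in Pending waiting on the AWS EBS provisioner.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-claim
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
---
# With an explicit empty string: no class is injected, and the claim
# binds statically to the pre-provisioned GlusterFS PV.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-claim
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: ""
```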
Hongkai Liu:

```
$ oc get pvc
NAME             STATUS    VOLUME    CAPACITY   ACCESSMODES   STORAGECLASS   AGE
registry-claim   Pending                                      gp2            10m

$ oc get pvc registry-claim -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  annotations:
    volume.beta.kubernetes.io/storage-provisioner: kubernetes.io/aws-ebs
  creationTimestamp: 2017-10-31T15:20:03Z
  name: registry-claim
  namespace: default
  resourceVersion: "15700"
  selfLink: /api/v1/namespaces/default/persistentvolumeclaims/registry-claim
  uid: f809fb84-be4e-11e7-820e-02431c970084
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: gp2
status:
  phase: Pending

$ oc get pvc registry-claim -o yaml > registry-claim.yaml
$ vi registry-claim.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: registry-claim
spec:
  accessModes:
  - ReadWriteMany
  resources:
    requests:
      storage: 5Gi
  storageClassName: ""

$ oc delete pvc registry-claim
persistentvolumeclaim "registry-claim" deleted
$ oc create -f registry-claim.yaml
persistentvolumeclaim "registry-claim" created

$ oc get pvc -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    annotations:
      pv.kubernetes.io/bind-completed: "yes"
      pv.kubernetes.io/bound-by-controller: "yes"
    creationTimestamp: 2017-10-31T15:35:46Z
    name: registry-claim
    namespace: default
    resourceVersion: "17488"
    selfLink: /api/v1/namespaces/default/persistentvolumeclaims/registry-claim
    uid: 2a459d9a-be51-11e7-820e-02431c970084
  spec:
    accessModes:
    - ReadWriteMany
    resources:
      requests:
        storage: 5Gi
    storageClassName: ""
    volumeName: registry-volume
  status:
    accessModes:
    - ReadWriteMany
    capacity:
      storage: 5Gi
    phase: Bound
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

$ oc get pv -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: PersistentVolume
  metadata:
    annotations:
      pv.kubernetes.io/bound-by-controller: "yes"
    creationTimestamp: 2017-10-31T15:20:01Z
    name: registry-volume
    namespace: ""
    resourceVersion: "17486"
    selfLink: /api/v1/persistentvolumes/registry-volume
    uid: f6e5d6d0-be4e-11e7-820e-02431c970084
  spec:
    accessModes:
    - ReadWriteMany
    capacity:
      storage: 5Gi
    claimRef:
      apiVersion: v1
      kind: PersistentVolumeClaim
      name: registry-claim
      namespace: default
      resourceVersion: "17484"
      uid: 2a459d9a-be51-11e7-820e-02431c970084
    glusterfs:
      endpoints: glusterfs-registry-endpoints
      path: glusterfs-registry-volume
    persistentVolumeReclaimPolicy: Retain
  status:
    phase: Bound
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""
```

Jose A. Rivera:
The general fix is to provide a storageClassName of "" (empty string) in the playbooks. I do not know off the top of my head how to do this. The infra storage goes through a lot of layers, so I'd have to look carefully and see where the fix should go. I'm updating the BZ title to reflect the specific problem. I'll try to have someone look into this this week.

Hongkai Liu:
Hi Jose, do you have any update on this? Thanks.

Jose A. Rivera:
Not at this time. I'll try to get to it this week. Next week I'll be traveling, so I can't guarantee my time then.

Hongkai Liu:
(In reply to Jose A. Rivera from comment #12)
> Not at this time. I'll try to get to it this week. Next week I'll be
> traveling, so I can't guarantee my time then.

Understood. Thanks for the update.

Jose A. Rivera:
I'm having difficulty replicating this issue. Can you still hit it on the latest OCP 3.10 builds? If so, can you give me the output of "oc get pvc <PVC> -o yaml", where PVC is the name of one of the non-binding PVCs?

Hongkai Liu:
Hi Jose, I do not have a working env for 3.10 yet. Do you think it also makes sense to give it a try on 3.9? Thanks.

Jose A. Rivera:
Definitely does.

Hongkai Liu:
Cool. I will test it today with 3.9.
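Conceptually, the playbook-side fix Jose outlines above is for the PVC template to emit an explicit storageClassName, including when the value is the empty string, rather than omitting the field. A hypothetical Jinja2 fragment along those lines (the "claim" variable and its attributes are illustrative stand-ins, not the actual openshift-ansible code):

```yaml
# Hypothetical sketch only: "claim" is an illustrative variable, not the
# real openshift-ansible data structure.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: "{{ claim.name }}"
spec:
  accessModes:
  - "{{ claim.access_mode }}"
  resources:
    requests:
      storage: "{{ claim.capacity }}"
{% if claim.storageclass is defined %}
  # Emitted even when the value is "", which is what keeps the default
  # StorageClass from being injected into the registry claim.
  storageClassName: "{{ claim.storageclass }}"
{% endif %}
```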
Hongkai Liu:
The playbook failed now. Will attach the ansible log and inventory file later.

```
$ git log --oneline -1
b2999772f (HEAD -> release-3.9, tag: openshift-ansible-3.9.17-1, origin/release-3.9) Automatic commit of package [openshift-ansible] release [3.9.17-1].

# yum list installed | grep openshift
atomic-openshift.x86_64    3.9.13-1.git.0.e0acf74.el7

$ dnf list ansible
Last metadata expiration check: 25 days, 3:21:13 ago on Sat 10 Mar 2018 12:19:47 PM UTC.
Installed Packages
ansible.noarch    2.4.3.0-1.fc27    @updates

$ ansible-playbook -i /tmp/2.file openshift-ansible/playbooks/openshift-glusterfs/registry.yml

TASK [openshift_persistent_volumes : include_tasks] **************************************************************************
task path: /home/fedora/openshift-ansible/roles/openshift_persistent_volumes/tasks/main.yml:39
included: /home/fedora/openshift-ansible/roles/openshift_persistent_volumes/tasks/pvc.yml for ec2-34-215-64-176.us-west-2.compute.amazonaws.com

TASK [openshift_persistent_volumes : Deploy PersistentVolumeClaim definitions] ***********************************************
task path: /home/fedora/openshift-ansible/roles/openshift_persistent_volumes/tasks/pvc.yml:2
fatal: [ec2-34-215-64-176.us-west-2.compute.amazonaws.com]: FAILED! => {
    "changed": false,
    "msg": "AnsibleUndefinedVariable: 'dict object' has no attribute 'storageclass'"
}
```

Jose A. Rivera:
Ah, I think I get it. Created the following PR: https://github.com/openshift/openshift-ansible/pull/7778

Hongkai Liu:
Can you backport it to the 3.9 branch as well? I am not sure what is going to happen when I run the master playbook against a 3.9 cluster. Thanks.

Jose A. Rivera:
I'll backport it once it gets merged to master.

Jose A. Rivera:
Cherry-pick PR created: https://github.com/openshift/openshift-ansible/pull/7782

Hongkai Liu:
The failing task passed. However, the PVC does not seem to be attached to the registry pod.

Background of the test: the existing cluster had its registry pod using aws-s3 as the storage backend before the playbook ran.

```
$ git log --oneline -1
62289a155 (HEAD -> bz1507628) GlusterFS: Fix missing parameter for registry PVC

root@ip-172-31-30-135: ~ # oc get pod
NAME                       READY     STATUS    RESTARTS   AGE
docker-registry-1-4f4rl    1/1       Running   0          7h
registry-console-1-p4pkc   1/1       Running   0          7h
router-1-q8wq9             1/1       Running   0          7h

root@ip-172-31-30-135: ~ # oc get pvc
NAME                       STATUS    VOLUME                      CAPACITY   ACCESS MODES   STORAGECLASS   AGE
registry-glusterfs-claim   Bound     registry-glusterfs-volume   100Gi      RWX                           7m

root@ip-172-31-30-135: ~ # oc volumes docker-registry-1-4f4rl
error: resource(s) were provided, but no name, label selector, or --all flag specified

root@ip-172-31-30-135: ~ # oc volumes pod docker-registry-1-4f4rl
pods/docker-registry-1-4f4rl
  empty directory as registry-storage
    mounted at /registry
  secret/registry-certificates as registry-certificates
    mounted at /etc/secrets
  secret/registry-config as docker-config
    mounted at /etc/registry
  secret/registry-token-lzqtl as registry-token-lzqtl
    mounted at /var/run/secrets/kubernetes.io/serviceaccount
```

Jose A. Rivera:
Hey, progress! Can you provide the log output from the successful run? I haven't dealt with a registry that was using S3 storage before, so I don't know how that is going to work... the Ansible I wrote assumes that the registry is writing all data to /registry inside the container, hence if swapcopy is on I just do an rsync from that directory to a local directory where the GlusterFS volume is mounted. I'm not sure how this would be done with S3. We may need to just not support that for now.
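With the claim bound but the registry pod still mounting an emptyDir as registry-storage, attaching the volume by hand comes down to swapping that volume on the deployment config. A sketch, using the default dc name and the claim name from the session above (untested here; it triggers a new registry deployment):

```
# Replace the emptyDir registry-storage volume with the bound GlusterFS
# claim; this rolls out a new docker-registry deployment.
$ oc set volume dc/docker-registry --add --overwrite \
    --name=registry-storage --type=persistentVolumeClaim \
    --claim-name=registry-glusterfs-claim
```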
Hongkai Liu:
I did not collect the ansible log the first time. I had to delete the glusterfs project and then rerun the playbook; I did not know how else to recover. Hopefully the rerun can reveal the details you want to see; let me know otherwise. The log of the 2nd run will be attached.

From what I know, aws s3 is configured for the registry by a configMap mounted at /etc/registry/config.yml. I believe the playbook has to change that configMap if we want to support that case (maybe even for other cases).

Jose A. Rivera:
....ohhh, I see what's going on. I never had the chance to really flesh out the swap and swapcopy implementations, so I don't think they will work as they are now. However, at this point we have resolved the initial problem of the BZ: you now have a GlusterFS volume, PV, and PVC ready to be used by an integrated registry. The catch is that you have to manually perform the volume swap and any data copying yourself. This is, for now, working as designed. If you want to pursue that feature further, I'd ask you to create a new RFE BZ for it and we'll take it in as we have time.

Hongkai Liu:
Hi Jose, I might have misunderstood you here. The PV and PVC are created and bound, BUT the PVC is not attached to the registry pod; it is an "empty directory". See my comment 26 above.

Jose A. Rivera:
Yes. The user is left to attach the PVC to the pods themselves. This is as designed. Your original problem was merely that the PVC was stuck in Pending status. The swap and swapcopy options are not officially documented anywhere (the GitHub README doesn't count!), have not been tested by me, and are thus not supported.

Hongkai Liu:
OK, I see now. Probably we can use the same logic for the configuration file of the registry. I am fine with both cases. Thanks for the fix.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816