Bug 1388868 - Large numbers of unnecessary volumes created when automatic endpoints creation fails
Status: CLOSED ERRATA
Product: OpenShift Container Platform
Classification: Red Hat
Component: Storage
Version: 3.4.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Assigned To: hchen
QA Contact: Jianwei Hou
Depends On:
Blocks:
Reported: 2016-10-26 06:40 EDT by Jianwei Hou
Modified: 2017-04-19 22:52 EDT
CC List: 7 users

See Also:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-18 07:46:23 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


External Trackers:
  Tracker ID: Red Hat Product Errata RHBA-2017:0066
  Priority: normal
  Status: SHIPPED_LIVE
  Summary: Red Hat OpenShift Container Platform 3.4 RPM Release Advisory
  Last Updated: 2017-01-18 12:23:26 EST

Description Jianwei Hou 2016-10-26 06:40:16 EDT
Description of problem:
Set up heketi and create a StorageClass for dynamic provisioning. When the endpoints/service creation failed, lots of unnecessary volumes were provisioned. On the heketi server, disk space was quickly used up because the provisioner kept retrying the provision and creating volume after volume.
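For context, a StorageClass of the kind described here, using the GlusterFS/heketi provisioner, looks roughly like the sketch below. The class name "glusterprovisioner" is taken from the error message in comment 8; the resturl and secret values are made-up placeholders:

  apiVersion: storage.k8s.io/v1beta1
  kind: StorageClass
  metadata:
    name: glusterprovisioner
  provisioner: kubernetes.io/glusterfs
  parameters:
    # URL of the heketi REST server that provisions the volumes
    resturl: "http://heketi.example.com:8080"
    # heketi credentials; secret name/namespace are placeholders
    restuser: "admin"
    secretNamespace: "default"
    secretName: "heketi-secret"

Every PVC that references this class makes heketi create a new GlusterFS volume, which is why a failure later in the flow can leak volumes on the heketi side.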

Version-Release number of selected component (if applicable):
openshift v3.4.0.15+9c963ec
kubernetes v1.4.0+776c994
etcd 3.1.0-alpha.1

How reproducible:
Always

Steps to Reproduce:
1. Set up heketi and create a StorageClass. In my setup, I was using HOSTNAMEs for node.hostnames.storage in the heketi topology. The correct configuration is IPs, because the provisioner uses this field to create the endpoints (a topology sketch follows these steps).
2. Create a PVC to provision a volume. At this point the endpoints, service, and PV are not created, but a volume is created on the storage server.
3. Leave it for a while, then go to the heketi server and list volumes.
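For step 1, a heketi topology.json fragment of the shape in question is sketched below; all names, IPs, and devices are made up. The point is that node.hostnames.storage must hold IPs, since the provisioner copies these values verbatim into the addresses of the Endpoints object:

  {
    "clusters": [
      {
        "nodes": [
          {
            "node": {
              "hostnames": {
                "manage": ["node1.example.com"],
                "storage": ["192.168.10.11"]
              },
              "zone": 1
            },
            "devices": ["/dev/sdb"]
          }
        ]
      }
    ]
  }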

Actual results:
After step 3, lots of volumes have been created and disk space is used up.

Expected results:
On provisioning failure, none of the to-be-provisioned resources (volume, endpoints, service, PV) should be left behind.

Additional info:
Comment 1 Humble Chirammal 2016-10-26 08:47:48 EDT
The fix for this issue (https://github.com/kubernetes/kubernetes/pull/35285) is in the merge queue of upstream Kubernetes. I will backport the patch to OCP as soon as it's done.
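The general shape of such a fix is to roll the volume back when the follow-up objects cannot be created. A minimal Go sketch of that pattern follows; the types and helper names are hypothetical stand-ins, not the actual code from the PR:

  package main

  import (
      "errors"
      "fmt"
  )

  // Hypothetical stand-ins for the provisioner's heketi client.
  type volume struct{ name string }

  type client struct{ endpointsBroken bool }

  func (c *client) createVolume(sizeGiB int) (*volume, error) {
      return &volume{name: fmt.Sprintf("vol_%dGiB", sizeGiB)}, nil
  }

  func (c *client) createEndpointService(v *volume) error {
      if c.endpointsBroken {
          return errors.New("failed to create endpoint/service")
      }
      return nil
  }

  func (c *client) deleteVolume(v *volume) error { return nil }

  // provision rolls back the freshly created volume if the
  // Endpoints/Service cannot be created, so controller retries
  // do not leak one orphaned volume per attempt.
  func (c *client) provision(sizeGiB int) (*volume, error) {
      v, err := c.createVolume(sizeGiB)
      if err != nil {
          return nil, err
      }
      if err := c.createEndpointService(v); err != nil {
          if delErr := c.deleteVolume(v); delErr != nil {
              return nil, fmt.Errorf("endpoint/service creation failed: %v; cleanup also failed: %v", err, delErr)
          }
          return nil, fmt.Errorf("endpoint/service creation failed: %v (volume rolled back)", err)
      }
      return v, nil
  }

  func main() {
      c := &client{endpointsBroken: true}
      if _, err := c.provision(5); err != nil {
          fmt.Println("provision failed cleanly:", err)
      }
  }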
Comment 2 Bradley Childs 2016-10-31 14:32:46 EDT
merged upstream, waiting on OSE PR
Comment 3 Humble Chirammal 2016-11-02 05:17:25 EDT
(In reply to Bradley Childs from comment #2)
> merged upstream, waiting on OSE PR

I have filed https://github.com/openshift/origin/pull/11722
Comment 6 Troy Dawson 2016-11-04 14:45:42 EDT
This has been merged into OSE and is in OSE v3.4.0.22 or newer.
Comment 8 Jianwei Hou 2016-11-07 01:06:50 EST
Verified on 
openshift v3.4.0.22+5c56720
kubernetes v1.4.0+776c994
etcd 3.1.0-rc.0

1. Made endpoints creation fail; the provisioner reported: Failed to provision volume with StorageClass "glusterprovisioner": glusterfs: create volume err: failed to create endpoint/service <nil>.

2. Went to the heketi server and listed volumes: found no volumes there. Listing repeatedly, I saw one volume get created and then immediately deleted.

Considering this is an edge case that only happens with a wrong heketi topology configuration, the above fix is acceptable. Marking this as verified.
Comment 10 errata-xmlrpc 2017-01-18 07:46:23 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0066
Comment 11 penehyba 2017-04-19 09:40:02 EDT
(In reply to Jianwei Hou from comment #8)
> Verified on 
> openshift v3.4.0.22+5c56720
> kubernetes v1.4.0+776c994
> etcd 3.1.0-rc.0
> 
> 1. Made endpoints creation fail; the provisioner reported: Failed to
> provision volume with StorageClass "glusterprovisioner": glusterfs: create
> volume err: failed to create endpoint/service <nil>.
> 
> 2. Went to the heketi server and listed volumes: found no volumes there.
> Listing repeatedly, I saw one volume get created and then immediately
> deleted.
> 
> Considering this is an edge case that only happens with a wrong heketi
> topology configuration, the above fix is acceptable. Marking this as
> verified.

Hi, I am experiencing the same behavior (as originally stated) on:
OpenShift Master:
    v3.5.0.53
Kubernetes Master:
    v1.5.2+43a9be4 

I have a heketi server addressed by a StorageClass (volumetype=replicate:3).
After running create -f pvc.yaml, the PVC switches to the Pending state.
Several volumes are created, none of them is connected to the PVC, and all space is used up.
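For reference, the pvc.yaml in question would be an ordinary claim against such a class; the claim name and size below are made up, the class name is reused from comment 8 for illustration, and on Kubernetes 1.5 the class is still selected via the beta annotation rather than a spec field:

  apiVersion: v1
  kind: PersistentVolumeClaim
  metadata:
    name: gluster-claim
    annotations:
      # pre-1.6 way of selecting a StorageClass
      volume.beta.kubernetes.io/storage-class: glusterprovisioner
  spec:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 5Gi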

In the PVC description I encountered messages like:
- Token used before issued (~ in heketi.json I added exact iat and exp values to prevent this)

- No space (~ no more space for the next volume causes this whole bug)

- failed to create endpoint/service <nil> (~ I think IP vs. gluster node name causes this; via /etc/hosts it is unable to recognize the node when creating the endpoint, why?)

- Id not found (~ was it overloaded?)

- Host 'IP' is not in 'Peer in Cluster' state (~ I see it is: both name and IP are there)

My questions: what was your "wrong" topology configuration?
Does it make sense to try this on a version that does not yet contain the fix:
https://github.com/kubernetes/kubernetes/commit/fc62687b2c4924c9f1b95c7d1314787bc7b7cada

PS: I tried replica:2 and a one-node solution ... it creates and deletes volumes one by one ... but with a secret it uses up all space (volumes persist but are not bound).
Comment 12 Jianwei Hou 2017-04-19 22:52:27 EDT
@penehyba I had my GlusterFS hosted on EC2. I found that when using hostnames (in topology.json, node.hostnames.storage), my endpoints could not be created while lots of volumes were being created; they quickly used up the space. However, after the fix, even if the endpoints are not created, the volume should be immediately deleted. After I replaced the hostnames with private_dns_name, the endpoints could be created and I did not see this issue again.

At the time I reported the bug, the volumetype parameter in the StorageClass was not supported yet.
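On versions where it is supported, that parameter goes into the StorageClass parameters block, e.g. (fragment only; values are illustrative):

  parameters:
    resturl: "http://heketi.example.com:8080"
    # replica-3 volumes; supported by the glusterfs provisioner in later releases
    volumetype: "replicate:3"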
