Bug 1369977

Summary: PVC fails to bind
Product: OpenShift Container Platform
Component: Storage
Version: 3.2.1
Severity: medium
Priority: unspecified
Status: CLOSED WORKSFORME
Reporter: Bob Kozdemba <bkozdemb>
Assignee: Jan Safranek <jsafrane>
QA Contact: Jianwei Hou <jhou>
CC: aos-bugs, bkozdemb, eparis, hchen, jsafrane, mturansk, plarsen, wmeng
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Last Closed: 2016-09-29 14:13:35 UTC
Type: Bug
Attachments:
- output of 'journalctl -u atomic-openshift-master -le'
- journalctl-level5.txt
- pvc
- pv
- journalctl -u atomic-openshift-master -le

Description Bob Kozdemba 2016-08-24 23:12:10 UTC
Description of problem:
Fresh 3.2.1.13 install. PVC status remains in "pending" state.

Version-Release number of selected component (if applicable):
atomic-openshift-3.2.1.13-1.git.0.e438b0e.el7.x86_64

How reproducible: 100%


Steps to Reproduce:
1. Install OpenShift and configure an NFS server
2. Create a PV
3. Create a PVC (example definitions below)
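
For reference, PV and PVC definitions along the lines of steps 2 and 3 (a minimal sketch; the NFS server hostname, export path, and size are illustrative, not copied from this cluster):

apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv01
spec:
  capacity:
    storage: 5Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Recycle
  nfs:
    # illustrative NFS export, not the actual server from this report
    server: nfs.example.com
    path: /exports/pv01
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim1
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi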

Actual results:
PVC status remains in "pending" state.

Expected results:
PVC should bind to PV.
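
That is, `oc get pvc` should show something like the following once binding succeeds (a sketch; capacity and age will differ):

NAME      STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
claim1    Bound     pv01      5Gi        RWO           1m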

Additional info:
journalctl output:

Aug 24 19:03:12 master.ose.lab atomic-openshift-master[1039]: E0824 19:03:12.455157    1039 persistentvolume_claim_binder_controller.go:149] PVClaimBinder could not add claim claim1: Error updating claim with VolumeName pv01: PersistentVolumeClaim "claim1" is invalid: spec: Forbidden: spec is immutable once a claim has been bound to a volume
Aug 24 19:06:59 master.ose.lab atomic-openshift-master[1039]: E0824 19:06:59.099308    1039 persistentvolume_claim_binder_controller.go:162] PVClaimBinder could not update claim claim1: Error updating claim with VolumeName pv01: PersistentVolumeClaim "claim1" is invalid: spec: Forbidden: spec is immutable once a claim has been bound to a volume

Comment 1 Jan Safranek 2016-08-25 12:28:59 UTC
Bob, please attach this information:

- output of `oc get pv -o yaml`
- output of `oc get pvc -o yaml` 
- full logs of openshift master from creation of the PV to the reported failure, preferably with log level 5.


@mturansk, does it ring a bell?

Comment 2 Bob Kozdemba 2016-08-25 14:32:53 UTC
Created attachment 1194047 [details]
output of 'journalctl -u atomic-openshift-master -le'

Comment 3 Bob Kozdemba 2016-08-25 14:35:47 UTC
This morning I powered up the VM and the PVC from last night did eventually bind. So I deleted it and created a new one, and the new PVC has been in pending status for several minutes (~11 minutes). If it eventually succeeds w/o a reboot, I'll report back.

Comment 4 Bob Kozdemba 2016-08-25 14:48:05 UTC
How can I increase the log level for the master service?

Comment 5 Jan Safranek 2016-08-25 14:57:06 UTC
I think you need to add or edit this line in /etc/sysconfig/openshift-master:

OPTIONS=--loglevel=5
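
After editing that file, the master presumably has to be restarted for the new option to take effect; a sketch of the follow-up steps (the unit name is the one already used in this bug):

# restart the master so the new log level takes effect
systemctl restart atomic-openshift-master
# then collect the verbose log for attachment
journalctl -u atomic-openshift-master -le > journalctl-level5.txt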

Comment 6 Jan Safranek 2016-08-25 14:58:42 UTC
And when dealing with binding problems, please wait at least 15 minutes, which is the periodic sync interval that should retry any failed binding. It's much better in 3.3, where the interval is, I think, 15 seconds and things recover from errors much faster.
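
A quick way to check whether the periodic sync eventually picks a stuck claim up is to watch it for at least one full interval, e.g. (a sketch, using the claim name from the description):

# watch the claim for at least one sync period (~15 min on 3.2)
oc get pvc claim1 --watch

# events on the claim may also show the repeated binding attempts
oc describe pvc claim1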

Comment 7 Bob Kozdemba 2016-08-25 15:07:25 UTC
Created attachment 1194056 [details]
journalctl-level5.txt

This is from a different system that exhibits the same behaviour.

Comment 8 hchen 2016-08-25 19:31:08 UTC
The error message "spec is immutable once a claim has been bound to a volume" during the PVC binder's addClaim() is a sign of a race condition that is fixed in Kubernetes 1.3 (OpenShift 3.3).

Comment 9 Bob Kozdemba 2016-08-25 19:52:23 UTC
(In reply to Jan Safranek from comment #6)
> And when dealing with binding problems, please wait at least 15 minutes,
> which is the periodic sync interval that should retry any failed binding.
> It's much better in 3.3, where the interval is, I think, 15 seconds and
> things recover from errors much faster.

Still pending after 4 hours. I'll try a reboot.

[cloud-user@master ~]$ oc get pvc
NAME      STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
claim1    Pending   pv0001    0                        4h
[cloud-user@master ~]$

Comment 10 Bob Kozdemba 2016-08-25 22:21:36 UTC
(In reply to Bob Kozdemba from comment #9)
> (In reply to Jan Safranek from comment #6)
> > And when dealing with binding problems, please wait at least 15 minutes,
> > which is the periodic sync interval that should retry any failed binding.
> > It's much better in 3.3, where the interval is, I think, 15 seconds and
> > things recover from errors much faster.
> 
> Still pending after 4 hours. I'll try a reboot.
> 
> [cloud-user@master ~]$ oc get pvc
> NAME      STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
> claim1    Pending   pv0001    0                        4h
> [cloud-user@master ~]$

The reboot did not help, but deleting and re-creating the PV and PVC worked. This environment is hosted on Ravello/Google. I will try a fresh install on my libvirt laptop and report back.
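
For the record, the workaround amounted to something like this (a sketch; the YAML file names are illustrative, and the PVC may need the right project selected first):

# delete the stuck claim and the volume it was half-bound to
oc delete pvc claim1
oc delete pv pv0001

# re-create both from their definitions
oc create -f pv.yaml
oc create -f pvc.yaml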

Comment 11 Jan Safranek 2016-08-26 10:43:10 UTC
Please also attach the output of:
$ oc get pv -o yaml
$ oc get pvc -o yaml

Comment 12 Bob Kozdemba 2016-08-26 12:40:16 UTC
Created attachment 1194332 [details]
pvc

Comment 13 Bob Kozdemba 2016-08-26 12:40:44 UTC
Created attachment 1194333 [details]
pv

Comment 14 Jan Safranek 2016-08-26 12:47:30 UTC
These are the *input* PV/PVCs; I need to see their full status and spec *after* the bug occurs (i.e. once you can see "Error updating claim with VolumeName" in the logs).

Really, please run
$ oc get pv -o yaml
$ oc get pvc -o yaml

Comment 15 Bob Kozdemba 2016-08-26 19:19:23 UTC
(In reply to Jan Safranek from comment #14)
> These are the *input* PV/PVCs; I need to see their full status and spec
> *after* the bug occurs (i.e. once you can see "Error updating claim with
> VolumeName" in the logs).
> 
> Really, please run
> $ oc get pv -o yaml
> $ oc get pvc -o yaml

Sorry about that. Next time it fails, I'll post the output.

Comment 16 Bob Kozdemba 2016-08-26 22:10:11 UTC
Created attachment 1194504 [details]
journalctl -u atomic-openshift-master -le

Comment 17 Bob Kozdemba 2016-08-26 22:12:47 UTC
Just had a failure.
# oc get pv pv02 -o yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  creationTimestamp: 2016-08-26T19:35:10Z
  name: pv02
  resourceVersion: "9906"
  selfLink: /api/v1/persistentvolumes/pv02
  uid: 337e1ff8-6bc4-11e6-a36d-525400f2f8cb
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 10Gi
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: logging-es-1
    namespace: logging
    resourceVersion: "9904"
    uid: 0222db92-6bd9-11e6-8283-525400f2f8cb
  nfs:
    path: /storage/pv/pv02
    server: allinone.koz.laptop
  persistentVolumeReclaimPolicy: Recycle
status:
  phase: Bound
[root@allinone ~]# 

[root@allinone ~]# oc get pvc logging-es-1 -o yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  creationTimestamp: 2016-08-26T22:04:06Z
  name: logging-es-1
  namespace: logging
  resourceVersion: "9904"
  selfLink: /api/v1/namespaces/logging/persistentvolumeclaims/logging-es-1
  uid: 0222db92-6bd9-11e6-8283-525400f2f8cb
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  volumeName: logging-volume
status:
  phase: Pending
[root@allinone ~]#

Comment 18 Jan Safranek 2016-08-29 13:44:29 UTC
Thanks for the logs!

Summary:

- pv02 is recycled ~5 minutes before it's bound to logging-es-1. This should not matter; noting it here just in case it does.

- "Synchronizing PersistentVolumeClaim[logging-es-1] for binding" tries to bind pv02 to logging-es-1. pv02 is bound just fine (valid claimref + status), however PVC logging-es-1 already has volumeName: logging-volume. No clue how it got here.

Bob, what is the PV named "logging-volume"? I don't see it in the logs, nor do I see PVC logging-es-1 getting bound to it.

This is exactly the situation we solved with the volume controller rewrite in 3.3; however, I've never seen it "live" in 3.2. It should be pretty hard to get a PVC that is half-bound to a PV, yet you seem to reproduce it pretty regularly. Can I get access to your cluster? Or can you attach all PVs/PVCs and the full master log (from startup), even if it is long?
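
To make the half-bound state easy to spot in such a dump, something like this would work (a sketch; assumes `-o jsonpath` is available in this oc version, otherwise plain `-o yaml` does the job). A claim in the state described above has spec.volumeName set while status.phase is still Pending:

# dump everything for attachment
oc get pv -o yaml > all-pv.yaml
oc get pvc --all-namespaces -o yaml > all-pvc.yaml

# one line per claim: namespace, name, volumeName, phase
oc get pvc --all-namespaces -o jsonpath='{range .items[*]}{.metadata.namespace}{"\t"}{.metadata.name}{"\t"}{.spec.volumeName}{"\t"}{.status.phase}{"\n"}{end}'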

Comment 19 Jan Safranek 2016-08-29 13:48:38 UTC
cc @mark, does it ring a bell? How can a PVC be bound to a PV (i.e. pvc.Spec.VolumeName is set) and still have pvc.Status.Phase = Pending in the old volume controller? It should be Bound.

Comment 20 Peter Larsen 2016-08-29 13:57:09 UTC
(In reply to Jan Safranek from comment #19)
> cc @mark, does it ring a bell? How can a PVC be bound to a PV (i.e.
> pvc.Spec.VolumeName is set) and still have pvc.Status.Phase = Pending in the
> old volume controller? It should be Bound.

According to the error, there's a problem writing the data to etcd: Forbidden: spec is immutable once a claim has been bound to a volume. This would indicate a locking/timestamp issue, or a race condition where multiple threads have different versions of the record being written out of sequence.

Comment 21 Bob Kozdemba 2016-08-30 17:02:45 UTC
(In reply to Jan Safranek from comment #18)
> Thanks for the logs!
> 
> Summary:
> 
> - pv02 is recycled ~5 minutes before it's bound to logging-es-1. This should
> not matter; noting it here just in case it does.
> 
> - "Synchronizing PersistentVolumeClaim[logging-es-1] for binding" tries to
> bind pv02 to logging-es-1. pv02 is bound just fine (valid claimRef +
> status); however, PVC logging-es-1 already has volumeName: logging-volume.
> No clue how it got there.
> 
> Bob, what is the PV named "logging-volume"? I don't see it in the logs, nor
> do I see PVC logging-es-1 getting bound to it.
> 
> This is exactly the situation we solved with the volume controller rewrite
> in 3.3; however, I've never seen it "live" in 3.2. It should be pretty hard
> to get a PVC that is half-bound to a PV, yet you seem to reproduce it pretty
> regularly. Can I get access to your cluster? Or can you attach all PVs/PVCs
> and the full master log (from startup), even if it is long?

Hi Jan:
I've been able to reproduce the problem in a cloud VM and on my laptop. Since I'm traveling for the next few days, it would be best if I give you access to the cloud VM. Send your public SSH key to bkozdemb and I'll set you up.

Bob

Comment 22 Jan Safranek 2016-09-29 14:13:35 UTC
I asked privately for access to the machines on Sep 9th, and Bob has not been able to reproduce the issue since then. I am closing this bug; feel free to reopen it when you stumble upon it again.