Bug 1365867
| Summary: | Ceph: Unable to mount volumes for pod : rbd: image is locked by other nodes | ||
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Josep 'Pep' Turro Mauri <pep> |
| Component: | Storage | Assignee: | hchen |
| Status: | CLOSED ERRATA | QA Contact: | Jianwei Hou <jhou> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | ||
| Version: | 3.2.0 | CC: | aos-bugs, bchilds, eparis, hchen, jsafrane, michael.morello, misalunk, schamilt, smunilla |
| Target Milestone: | --- | Keywords: | NeedsTestCase, Reopened |
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: |
Cause: When an OpenShift node crashes before unmapping an rbd volume, the advisory lock held on the rbd volume is not released, which prevents other nodes from mapping it.
Consequence: The rbd volume cannot be used by other nodes unless the advisory lock is manually removed.
Fix: If no rbd client is using the rbd volume, the advisory lock is removed automatically.
Result: The rbd volume can be used by other nodes without manually removing the lock.
|
Story Points: | --- |
| Clone Of: | Environment: | ||
| Last Closed: | 2017-08-10 05:15:47 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 1267746 | ||
|
Description
Josep 'Pep' Turro Mauri
2016-08-10 11:32:04 UTC
Huamin, can you please take a look?

This is a known issue; there are a card and k8s/openshift issues tracking it [1]. The plan is to use attach/detach so that the controller on the master initiates the detach call that releases the lock.
[1] https://trello.com/c/Y1j4dTBO/131-bug-the-ceph-rbd-volume-plugin-seems-to-hold-a-lock-when-the-container-fails

Upstream fix is proposed at https://github.com/kubernetes/kubernetes/pull/33660

33660 depends on the following:
https://github.com/kubernetes/kubernetes/pull/35433
https://github.com/kubernetes/kubernetes/pull/35434

*** Bug 1409237 has been marked as a duplicate of this bug. ***

Still reproducible in openshift v3.6.86
Steps:
1. Create StorageClass for rbd provisioner.
2. Create a PVC that dynamically provisions a PV, create a ReplicationController(rc=1).
3. After the Pod is running, stop the node service on its node.
4. A new Pod is created on another node but is stuck in status 'ContainerCreating'. The old Pod becomes 'Unknown'. The new Pod becomes 'Running' only after the original node has recovered or the lock has been manually removed.
# oc get pods
NAME READY STATUS RESTARTS AGE
rbdpd-8n8zd 0/1 ContainerCreating 0 8m
rbdpd-xwn25 1/1 Unknown 0 1h
# oc describe pod rbdpd-8n8zd
Name: rbdpd-8n8zd
Namespace: jhou
Security Policy: restricted
Node: ip-172-18-6-78.ec2.internal/172.18.6.78
Start Time: Fri, 02 Jun 2017 16:12:37 +0800
Labels: app=rbd
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"jhou","name":"rbdpd","uid":"31b871f0-475a-11e7-8550-0e259545e72a","api...
openshift.io/scc=restricted
Status: Pending
IP:
Controllers: ReplicationController/rbdpd
Containers:
myfrontend:
Container ID:
Image: jhou/hello-openshift
Image ID:
Port: 80/TCP
State: Waiting
Reason: ContainerCreating
Ready: False
Restart Count: 0
Environment: <none>
Mounts:
/mnt/rbd from pvol (rw)
/var/run/secrets/kubernetes.io/serviceaccount from default-token-xk75q (ro)
Conditions:
Type Status
Initialized True
Ready False
PodScheduled True
Volumes:
pvol:
Type: PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
ClaimName: rbdc
ReadOnly: false
default-token-xk75q:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-xk75q
Optional: false
QoS Class: BestEffort
Node-Selectors: <none>
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
9m 9m 1 default-scheduler Normal Scheduled Successfully assigned rbdpd-8n8zd to ip-172-18-6-78.ec2.internal
9m 1m 12 kubelet, ip-172-18-6-78.ec2.internal Warning FailedMount MountVolume.SetUp failed for volume "kubernetes.io/rbd/3d88d0cc-476b-11e7-8550-0e259545e72a-pvc-3874b558-4757-11e7-8550-0e259545e72a" (spec.Name: "pvc-3874b558-4757-11e7-8550-0e259545e72a") pod "3d88d0cc-476b-11e7-8550-0e259545e72a" (UID: "3d88d0cc-476b-11e7-8550-0e259545e72a") with: rbd: image kubernetes-dynamic-pvc-387acbcd-4757-11e7-8550-0e259545e72a is locked by other nodes
7m 56s 4 kubelet, ip-172-18-6-78.ec2.internal Warning FailedMount Unable to mount volumes for pod "rbdpd-8n8zd_jhou(3d88d0cc-476b-11e7-8550-0e259545e72a)": timeout expired waiting for volumes to attach/mount for pod "jhou"/"rbdpd-8n8zd". list of unattached/unmounted volumes=[pvol]
7m 56s 4 kubelet, ip-172-18-6-78.ec2.internal Warning FailedSync Error syncing pod, skipping: timeout expired waiting for volumes to attach/mount for pod "jhou"/"rbdpd-8n8zd". list of unattached/unmounted volumes=[pvol]
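The image name needed for the `rbd lock list` call can be read from the FailedMount event. A minimal Python sketch, assuming the message keeps the wording shown in the events above; the helper name `locked_image` is hypothetical and not part of any OpenShift or Ceph tool:

```python
import re
from typing import Optional

# Matches the kubelet error text seen in the FailedMount event above.
_LOCKED_RE = re.compile(r"rbd: image (\S+) is locked by other nodes")


def locked_image(event_message: str) -> Optional[str]:
    """Return the rbd image name from a FailedMount message, or None."""
    match = _LOCKED_RE.search(event_message)
    return match.group(1) if match else None
```

Messages that do not mention a locked image, such as the timeout events, yield `None`.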
[root@ip-172-18-3-46 ~]# rbd lock list kubernetes-dynamic-pvc-387acbcd-4757-11e7-8550-0e259545e72a
There is 1 exclusive lock on this image.
Locker ID Address
client.4193 kubelet_lock_magic_ip-172-18-1-167.ec2.internal 172.18.1.167:0/1037989
[root@ip-172-18-3-46 ~]# rbd lock remove kubernetes-dynamic-pvc-387acbcd-4757-11e7-8550-0e259545e72a kubelet_lock_magic_ip-172-18-1-167.ec2.internal client.4193
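The manual workaround in the two commands above can be sketched in Python as follows. The sketch assumes the `rbd lock list` output has the three-column layout shown above and that lock IDs contain no whitespace; `RbdLock`, `parse_lock_list` and `remove_command` are hypothetical names used only for illustration. It builds the argument list for `rbd lock remove` but does not run it:

```python
from typing import List, NamedTuple


class RbdLock(NamedTuple):
    locker: str   # e.g. "client.4193"
    lock_id: str  # e.g. "kubelet_lock_magic_<hostname>"
    address: str  # e.g. "172.18.1.167:0/1037989"


def parse_lock_list(output: str) -> List[RbdLock]:
    """Parse the text printed by `rbd lock list <image>`."""
    locks = []
    for line in output.splitlines():
        fields = line.split()
        # Data rows have three fields and a locker such as "client.4193";
        # this skips the summary line and the "Locker ID Address" header.
        if len(fields) == 3 and fields[0].startswith("client."):
            locks.append(RbdLock(*fields))
    return locks


def remove_command(image: str, lock: RbdLock) -> List[str]:
    """Build `rbd lock remove <image> <lock-id> <locker>` as an argv list."""
    return ["rbd", "lock", "remove", image, lock.lock_id, lock.locker]
```

Returning an argv list instead of executing the command leaves the decision to break a lock with the operator, since removing a lock that is still in use risks concurrent writers on the image.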
Verified on openshift v3.6.106. With the node down (I shut it down), the replication controller creates a Pod on another functional node and the Pod becomes Running. The rbd lock does not prevent other nodes from mounting.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHEA-2017:1716

This is resolved through rbd attach/detach refactoring.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
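The Doc Text describes the fix as removing the advisory lock when no rbd client is using the volume. That rule can be sketched in Python as below. This is an illustration of the rule only, not the upstream Kubernetes code; `locks_to_break` and its parameters are hypothetical names, and how the watcher list is obtained (for example from `rbd status <image>`) is an assumption left outside the function:

```python
from typing import List, Sequence, Tuple

# Lock ID prefix seen in the `rbd lock list` output in this report.
KUBELET_LOCK_PREFIX = "kubelet_lock_magic_"

Lock = Tuple[str, str]  # (lock_id, locker), e.g. ("kubelet_lock_magic_host", "client.4193")


def locks_to_break(locks: Sequence[Lock], watchers: Sequence[str], this_node: str) -> List[Lock]:
    """Return the kubelet locks that may be removed before mapping the image.

    If any client still watches the image it is treated as in use and
    nothing is returned. Locks without the kubelet prefix and the lock
    of the calling node are never returned.
    """
    if watchers:
        return []
    own_lock_id = KUBELET_LOCK_PREFIX + this_node
    return [
        lock for lock in locks
        if lock[0].startswith(KUBELET_LOCK_PREFIX) and lock[0] != own_lock_id
    ]
```

With the stale lock from this report and an empty watcher list, the function returns that lock. With any watcher present it returns an empty list, so the image stays protected while the original node is still using it.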