Using provision_demo.sh (OpenShift Demo for Coolstore MSA), trying to remove an OpenShift namespace:

* The namespace deletion pends because the pods aren't being deleted
* The pods aren't being deleted because the namespace is being deleted

Version-Release number of selected component (if applicable):
3.6

How reproducible:
Believed 100% for Coolstore MSA

Steps to Reproduce:
1. Deploy OpenShift 3.6 with CNS
2. Deploy Coolstore MSA https://github.com/jbossdemocentral/coolstore-microservice using provision_demo.sh
3. Delete Coolstore MSA using provision_demo.sh

Actual results:
ci-ocuser namespace deletion hangs in status Terminating
pod deletion hangs in status Terminating

Expected results:
Correct pod/namespace deletion

Additional info:
sosreports to follow in private attachments
Uncertain whether it is tied to CNS
Happened for 2 pods on separate nodes
Possible workaround: remove the underlying containers?
CNS = Container Native Storage
(In reply to Thom Carlin from comment #6) > CNS = Container Native Storage This doesn't address the full list of questions I've raised, so I'm re-adding the needinfo flag.
Likely Workaround: 1) Delete associated pods 2) systemctl restart atomic-openshift-node on affected nodes
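As shell commands, that workaround is roughly the following (a sketch only; pod and namespace names are placeholders):

# 1) Delete the pods associated with the stuck namespace
oc delete pod <pod_name> -n <namespace>

# 2) On each node where a stuck pod was scheduled, restart the node service
systemctl restart atomic-openshift-node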
Updated the cluster to the latest GA. Errors during deployment:

"[...] docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod <pod_name> Unexpected command output nsenter: cannot open : No such file or directory"

Investigating whether this is a related issue.
Created attachment 1325618 [details] Stuck pod YAML Attaching a YAML dump of a pod stuck Terminating.
Created attachment 1325619 [details] Stuck pod container inspection Attaching a docker inspection of a stuck pod's container.
Reassigning to the pod team to investigate further. Please reach out to me if there's any additional context I can provide to jumpstart the ongoing investigation. Thom reproduced in a multi-node cluster to which he gave me access, and I attached the only new information I obtained. Increasing the controller service log verbosity might be useful if the problem isn't apparent from the provided state dumps alone. I know we've dealt with many pod stuck terminating issues in the past, so my hope is a clue will stand out to somebody with more experience around those bugs...
I took a look at a recreate by Thom.

Node logs show:

kubelet.go:1832] SyncLoop (DELETE, "api"): "gogs-1-kgj3x_ci-ocuser(2a523fa3-989b-11e7-bcc6-001a4a160152)"
kuberuntime_container.go:568] Killing container "docker://d0f7d3851b9dd751a0a05806a0da077935317baae40b405cc8f1c2448728c529" with 30 second grace period

then

kubelet.go:1460] error killing pod: failed to "KillContainer" for "gogs" with KillContainerError: "rpc error: code = 4 desc = context deadline exceeded"

and

docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "gogs-1-kgj3x_ci-ocuser": Unexpected command output nsenter: cannot open : No such file or directory

The container was in crash loop backoff before the (attempted) deletion. The SyncLoop DELETE coming in from the api source is good. The timeout on KillContainer is not good.

All pod containers are in a terminal state in the pod status, but the pod is still in phase Running. I'm pretty sure this is what is blocking the deletion. It seems like the code path that adjusts the phase is being aborted early due to the error.
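For anyone looking at a recreate, a quick sketch of how to see that mismatch (the pod name is the one from the log lines above; adjust as needed):

# Pod-level phase recorded in the pod status (stays Running even though the containers have exited)
oc get pod gogs-1-kgj3x -n ci-ocuser -o jsonpath='{.status.phase}{"\n"}'

# Per-container states in the same status block
oc get pod gogs-1-kgj3x -n ci-ocuser -o yaml | grep -A 5 containerStatuses

# What docker reports on the node
docker ps -a | grep gogs-1-kgj3x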
Created attachment 1326128 [details] pods.txt more stuck pod listing and yaml
Worth noting that restarting the node process on the node where the stuck pods reside DOES clear the issue. Might be an additional indicator of where the problem lies.
Joel, mind taking a look at this?
*** Bug 1502129 has been marked as a duplicate of this bug. ***
I have reproduced this on my development cluster. I'm updating the title to reflect the nature of the bug.

Here are the steps to reproduce:

0. For all created resources mentioned below, save the resource definition to a file, then use "oc create -f <file>" to create it.

1. Deploy CNS to a cluster. The documentation link above explains how to do this, but I will note that a cluster must have at least 3 nodes with block devices that can be used for CNS for it to successfully deploy.

2. Create a storage class that uses CNS's heketi storage to auto-provision volumes. Create a resource using the following (adjusting the URL to match the heketi service in your cluster as created by the CNS deploy process):

---
apiVersion: storage.k8s.io/v1beta1
kind: StorageClass
metadata:
  name: glusterfs-storage
provisioner: kubernetes.io/glusterfs
parameters:
  resturl: "http://heketi-storage-project.router.default.svc.cluster.local"

3. Create a test project:

oc new-project test

4. Create a PVC in that project like this one:

---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: mypv
  annotations:
    volume.beta.kubernetes.io/storage-class: glusterfs-storage
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi

5. Create a pod to use that PVC, like this one:

---
apiVersion: v1
kind: Pod
metadata:
  name: busybox-pod
spec:
  containers:
  - image: gcr.io/google_containers/busybox
    name: busybox
    command: ["/bin/sh"]
    args: ["-c", "sleep 3600"]
    volumeMounts:
    - mountPath: /mnt/glusterfs
      name: glusterfsvol
  restartPolicy: Never
  volumes:
  - name: glusterfsvol
    persistentVolumeClaim:
      claimName: mypv

6. After the pod is running, delete the project:

oc delete project test
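After step 6, the stuck state can be observed with something like the following (a sketch for observation only; output will vary):

oc get project test                  # stays in Terminating
oc get pods -n test                  # busybox-pod stays in Terminating
oc get pod busybox-pod -n test -o yaml | grep deletionTimestamp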
Thanks Joel. I did do this, and it didn't take much time. The openshift-installer with the CNS options worked on the first try!

Anyway, I created persistent mysql from the template, then deleted the project. Recreated immediately. The pod containers, both infra and mysql, are exited according to docker. The pod remains in Terminating state.

openshift v3.7.0-0.143.7
kubernetes v1.7.0+80709908fd

Logs of interest on the node:

atomic-openshift-node[22969]: E1022 22:15:44.730885 22969 glusterfs.go:148] glusterfs: failed to get endpoints pvc-10fdb95e-b797-11e7-a0aa-fa163ecf5bde[an empty namespace may not be set when a resource name is provided]
atomic-openshift-node[22969]: E1022 22:15:44.730915 22969 reconciler.go:367] Could not construct volume information: MountVolume.NewMounter failed for volume "kubernetes.io/glusterfs/21c1d04a-b797-11e7-a0aa-fa163ecf5bde-pvc-10fdb95e-b797-11e7-a0aa-fa163ecf5bde" (spec.Name: "pvc-10fdb95e-b797-11e7-a0aa-fa163ecf5bde") pod "21c1d04a-b797-11e7-a0aa-fa163ecf5bde" (UID: "21c1d04a-b797-11e7-a0aa-fa163ecf5bde") with: an empty namespace may not be set when a resource name is provided
atomic-openshift-node[22969]: I1022 22:16:17.105298 22969 reconciler.go:186] operationExecutor.UnmountVolume started for volume "mysql-data" (UniqueName: "kubernetes.io/glusterfs/21c1d04a-b797-11e7-a0aa-fa163ecf5bde-pvc-10fdb95e-b797-11e7-a0aa-fa163ecf5bde") pod "21c1d04a-b797-11e7-a0aa-fa163ecf5bde" (UID: "21c1d04a-b797-11e7-a0aa-fa163ecf5bde")
atomic-openshift-node[22969]: E1022 22:16:17.105552 22969 nestedpendingoperations.go:264] Operation for "\"kubernetes.io/glusterfs/21c1d04a-b797-11e7-a0aa-fa163ecf5bde-pvc-10fdb95e-b797-11e7-a0aa-fa163ecf5bde\" (\"21c1d04a-b797-11e7-a0aa-fa163ecf5bde\")" failed. No retries permitted until 2017-10-22 22:18:19.105535004 -0400 EDT (durationBeforeRetry 2m2s). Error: UnmountVolume.TearDown failed for volume "mysql-data" (UniqueName: "kubernetes.io/glusterfs/21c1d04a-b797-11e7-a0aa-fa163ecf5bde-pvc-10fdb95e-b797-11e7-a0aa-fa163ecf5bde") pod "21c1d04a-b797-11e7-a0aa-fa163ecf5bde" (UID: "21c1d04a-b797-11e7-a0aa-fa163ecf5bde") : Error checking if path exists: stat /var/lib/origin/openshift.local.volumes/pods/21c1d04a-b797-11e7-a0aa-fa163ecf5bde/volumes/kubernetes.io~glusterfs/pvc-10fdb95e-b797-11e7-a0aa-fa163ecf5bde: transport endpoint is not connected

So the hang-up is in the volume detach. It's as if we removed something as part of the namespace deletion that is needed for the successful detaching of the volume. I think that thing is an Endpoints resource that CNS seems to create:

$ oc get ep
NAME                      ENDPOINTS                                       AGE
glusterfs-dynamic-mysql   10.42.10.207:1,10.42.10.211:1,10.42.10.214:1    24s
mariadb                   <none>                                          29s

I bet the glusterfs-dynamic-mysql endpoint is being deleted and then the node is unable to successfully detach the volume.
Additionally, if I delete the mysql dc and not the project, the deletion is successful. This supports the conclusion that the removal of the glusterfs-dynamic* ep is the problem, as deleting the dc leaves the ep intact.
UnmountVolume.TearDown tries to stat the gluster mount point and errors with "transport endpoint is not connected" due to the endpoint being deleted. https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/volume/util/util.go#L84 This blocks the umount later in that function, causing the overall volume detach to fail and the pod termination to hang. I logged into the node and unmounted the volume manually. The pod terminated within 2m (the volume cleanup retry interval).
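For completeness, the manual recovery as run on the node (a sketch; the path follows the pattern from the logs above and is not a literal value):

# df reports the broken gluster mount explicitly
df 2>&1 | grep 'Transport endpoint is not connected'

# Unmount it; the kubelet's periodic teardown retry (~2m) then succeeds and the pod terminates
umount /var/lib/origin/openshift.local.volumes/pods/<pod_uid>/volumes/kubernetes.io~glusterfs/<pvc_name>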
IMO, this is a design flaw in CNS. An endpoint should not be created for each PVC in each project. On top of being racy, it doesn't scale and stores a lot of redundant endpoint data for every PVC. Those endpoints must be deployed by the proxy on every node in the cluster (think of the iptables rules!). Sending to storage for evaluation.

One idea I had is to just create one service/endpoint in a privileged project like openshift-storage or openshift-cns, and all the nodes can use that service for discovery of the gluster endpoints when mounting gluster volumes. That way the endpoint resource is bound to the lifetime of the cluster rather than the lifetime of each PVC or project. It also reduces the per-PVC work the provisioner must do.
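To make that idea concrete, a hedged sketch of what a single shared endpoints/service pair in a dedicated project could look like (the project name, resource names and IPs are illustrative only; the IPs mirror the endpoint output shown earlier, nothing here is an existing CNS resource, and the provisioner would still have to be taught to point PVs at it):

# as cluster-admin
oc new-project openshift-storage
oc create -f - <<EOF
apiVersion: v1
kind: Endpoints
metadata:
  name: glusterfs-cluster
  namespace: openshift-storage
subsets:
- addresses:
  - ip: 10.42.10.207
  - ip: 10.42.10.211
  - ip: 10.42.10.214
  ports:
  - port: 1
---
apiVersion: v1
kind: Service
metadata:
  name: glusterfs-cluster
  namespace: openshift-storage
spec:
  ports:
  - port: 1
EOF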
I've replicated the issue and the workaround, which is to unmount the glusterfs volume. I created a simple script which walks through all provisioned glusterfs volumes and matches them against the ones detected on the node; if it finds a glusterfs volume that is not found with 'oc get pv', it's assumed bad and we unmount it. This works reliably for me (c). Use at your own risk.. :)

https://raw.githubusercontent.com/mglantz/ocp36-azure-cns/master/scripts/fixCnsVol.sh
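The linked script is the authoritative version; below is only a rough sketch of the same idea (compare the gluster mounts present on the node against the PVs the cluster still knows about, and unmount anything no longer listed), assuming the node has a working oc login and that all gluster mounts live under the kubelet's volume directory:

#!/bin/bash
# Gluster fuse mounts currently present on this node (kubelet-managed ones only)
mounted=$(mount -t fuse.glusterfs | awk '{print $3}' | grep 'kubernetes.io~glusterfs')

# PV names the cluster still knows about
known_pvs=$(oc get pv -o name | sed 's|.*/||')

for mnt in $mounted; do
    pv=$(basename "$mnt")
    if ! echo "$known_pvs" | grep -qx "$pv"; then
        echo "Unmounting stale gluster volume: $mnt"
        umount "$mnt"
    fi
done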
Can we please raise severity? Without a workaround this bug constitutes a NO-GO for using CNS. One of the main reasons to run containers is the easy to create/destroy. This bug prevents that.
(In reply to Magnus Glantz from comment #29) > Can we please raise severity? Without a workaround this bug constitutes a > NO-GO for using CNS. One of the main reasons to run containers is the easy > to create/destroy. This bug prevents that. *the ease to create/destroy.
Caught the error earlier and found a new message:

Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"nexus-3-gv6q2.14f6b7f71e7a3da5", GenerateName:"", Namespace:"ci-jenkins", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"ci-jenkins", Name:"nexus-3-gv6q2", UID:"dd1f08b7-c31f-11e7-92cf-001a4a160152", APIVersion:"v1", ResourceVersion:"16025437", FieldPath:"spec.containers{nexus}"}, Reason:"Killing", Message:"Killing container with id docker://nexus:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"<<node_name>>"}, FirstTimestamp:v1.Time{Time:time.Time{sec:63646193797, nsec:25512869, loc:(*time.Location)(0xc048f00)}}, LastTimestamp:v1.Time{Time:time.Time{sec:63646193797, nsec:25512869, loc:(*time.Location)(0xc048f00)}}, Count:1, Type:"Normal"}': 'events "nexus-3-gv6q2.14f6b7f71e7a3da5" is forbidden: unable to create new content in namespace ci-jenkins because it is being terminated.' (will not retry!)

followed by:

W1113 13:16:39.408329 57711 docker_sandbox.go:263] NetworkPlugin cni failed on the status hook for pod "nexus-3-gv6q2_ci-jenkins": Unexpected command output nsenter: cannot open /proc/39156/ns/net: No such file or directory
I1113 13:16:39.450107 57711 kubelet.go:1850] SyncLoop (PLEG): "nexus-3-gv6q2_ci-jenkins(dd1f08b7-c31f-11e7-92cf-001a4a160152)", event: &pleg.PodLifecycleEvent{ID:"dd1f08b7-c31f-11e7-92cf-001a4a160152", Type:"ContainerDied", Data:"e11f000efbfe746e79dbfe7fa4999e459e7d4c01e1fd4838f46687c2c7649424"}

Changed component to CNS
After restart of the service:

Orphaned pod "dd1f08b7-c31f-11e7-92cf-001a4a160152" found, but volume paths are still present on disk.

Extrapolating from comment 27:

df | egrep dd1f08b7-c31f-11e7-92cf-001a4a160152
umount /var/lib/origin/openshift.local.volumes/pods/dd1f08b7-c31f-11e7-92cf-001a4a160152/volumes/kubernetes.io~glusterfs/pvc-78925268-bf42-11e7-96cd-001a4a160152

Once the volume is unmounted, the pod is removed automatically.
Once all the pods are removed, the namespace is removed automatically.

==========================================================================

Better way:

oc get pods -n ci-jenkins gogs-1-jtnzx --template '{{ .metadata.uid }}'

On the node where the pod is running:

df <<pod_uuid>>
df: /var/lib/origin/openshift.local.volumes/pods/65988e62-c321-11e7-92cf-001a4a160152/volumes/kubernetes.io~glusterfs/pvc-a77487a1-bf43-11e7-96cd-001a4a160152: Transport endpoint is not connected

umount /var/lib/origin/openshift.local.volumes/pods/65988e62-c321-11e7-92cf-001a4a160152/volumes/kubernetes.io~glusterfs/pvc-a77487a1-bf43-11e7-96cd-001a4a160152

The latter way does NOT require the service to be restarted. Note it will take a while for everything to propagate through....
Humble: Please look into this at a Kubernetes level.
I went through the details of this bz and would like to share my thoughts here.

An obvious workaround exists for this issue, which is:
------------------------------------------------------
Unmounting the gluster volumes used by the problematic pods from the nodes where the application pods are running.

RCA:
-------

Here is the code flow:

UnmountPath()
  -> PathExists(), which calls "stat()" on the glusterfs fuse mount path

which returns: "Transport endpoint is not connected"

So, I would like to ask:

Are the CNS gluster pods/gluster server containers (which host the gluster volumes) in the SAME project/namespace that is being deleted?

That said, if I follow the bz description, are the gluster server containers running in namespace "ci-ocuser"?

In C#22, are the gluster pods/containers running in project "test"?

At first look, this issue does NOT seem to be related to the wiping of the endpoint/service at project/namespace deletion time.
(In reply to Humble Chirammal from comment #37)
> I went through the details of this bz and would like to share my thoughts here.
>
> An obvious workaround exists for this issue, which is:
> ------------------------------------------------------
> Unmounting the gluster volumes used by the problematic pods from the nodes
> where the application pods are running.

The problem here is that you have to have some control loop running on the node looking for this condition, and there is no clear way to find "problematic pods", i.e. pods that are stuck in Terminating because the gluster mounts aren't unmounting properly.

You'd basically be implementing part of the kubelet's volume manager outside the kubelet for one volume type. Not great.

> RCA:
> -------
>
> Here is the code flow:
>
> UnmountPath()
>   -> PathExists(), which calls "stat()" on the glusterfs fuse mount path
>
> which returns: "Transport endpoint is not connected"
>
> So, I would like to ask:
>
> Are the CNS gluster pods/gluster server containers (which host the gluster
> volumes) in the SAME project/namespace that is being deleted?

The gluster server pods are not in the project being deleted. However, the endpoint resources that support DNS resolution of the gluster servers on the nodes DO exist in the project being deleted, and this is what leads to this issue.

> That said, if I follow the bz description, are the gluster server containers
> running in namespace "ci-ocuser"?
>
> In C#22, are the gluster pods/containers running in project "test"?
>
> At first look, this issue does NOT seem to be related to the wiping of the
> endpoint/service at project/namespace deletion time.

Hopefully my comments show this to be the proper description. The overall fix is to decouple the resolution of the gluster servers from the project lifetime. The node should ALWAYS be able to resolve the gluster servers, independent of the presence (or absence) of any namespaced resource.
(In reply to Seth Jennings from comment #38)
> (In reply to Humble Chirammal from comment #37)
> > I went through the details of this bz and would like to share my thoughts here.
> >
> > An obvious workaround exists for this issue, which is:
> > ------------------------------------------------------
> > Unmounting the gluster volumes used by the problematic pods from the nodes
> > where the application pods are running.
>
> The problem here is that you have to have some control loop running on the
> node looking for this condition, and there is no clear way to find
> "problematic pods", i.e. pods that are stuck in Terminating because the
> gluster mounts aren't unmounting properly.

I put this in as an interim workaround, definitely not a solution.

> You'd basically be implementing part of the kubelet's volume manager outside
> the kubelet for one volume type. Not great.
>
> > RCA:
> > -------
> >
> > Here is the code flow:
> >
> > UnmountPath()
> >   -> PathExists(), which calls "stat()" on the glusterfs fuse mount path
> >
> > which returns: "Transport endpoint is not connected"
> >
> > So, I would like to ask:
> >
> > Are the CNS gluster pods/gluster server containers (which host the gluster
> > volumes) in the SAME project/namespace that is being deleted?
>
> The gluster server pods are not in the project being deleted. However, the
> endpoint resources that support DNS resolution of the gluster servers on the
> nodes DO exist in the project being deleted, and this is what leads to this
> issue.

Fine, let me expand on why I still think the endpoint is not the issue.

As in the above code path, the code does a 'stat()' on the mount point and receives the above-mentioned error. The first question that comes to my mind is: does 'stat()' make use of the endpoint/service to produce the result? I don't think so. As per my understanding, the glusterfs fuse client communicates directly with the brick servers when 'stat()' is issued, as the fuse client has the brick server info in its volfile.

If this assumption is correct, I can think of 2 possibilities leading to the error:

Either:
1) the brick servers/processes are down
Or:
2) the glusterfs fuse client process is down.

By asking for information on the glusterfs pods or brick servers and where they exist, I was probing the first possibility mentioned above. As it is confirmed that the gluster pods are up and running in a different namespace, I believe it is most likely not possibility 1. We could thoroughly check the brick process status while the issue exists, though.

The second possibility is that the glusterfs fuse client process is down. We could look at the process status of the fuse client (glusterfs) on the node where the application pod is running. For example, there should be one process running similar to the below:

/usr/sbin/glusterfs --volfile-server=172.17.0.1 --volfile-id=/vol2 /mnt

If this process is down, I would still expect the 'transport endpoint is not connected' error message. I will also invite glusterfs fuse engineers to this bugzilla to confirm or deny this thought on the mentioned error message.

Unfortunately our lab servers are down this week, so I don't have a reproducer at my end; it would have been easier if there was one.
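A hedged sketch of checking both possibilities from the node hosting the stuck pod (the project name, pod UID and PVC name are placeholders):

# Possibility 1: are the gluster server pods (and hence the brick processes) up?
oc get pods -n <gluster_project> -o wide

# Possibility 2: is the glusterfs fuse client process for the volume still running on this node?
ps -ef | grep '[g]lusterfs --volfile-server'

# Reproduce the failing call from the code flow above: stat() on the fuse mount path
stat /var/lib/origin/openshift.local.volumes/pods/<pod_uid>/volumes/kubernetes.io~glusterfs/<pvc_name>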
We are also trying to come up with a common mechanism to resolve stale mounts across different FSs like NFS, GlusterFS, etc. in upstream kube. I will keep you posted. The patch is in review.
(In reply to Humble Chirammal from comment #41)
> We are also trying to come up with a common mechanism to resolve stale
> mounts across different FSs like NFS, GlusterFS, etc. in upstream kube. I
> will keep you posted. The patch is in review.

It has been merged and I am backporting this patch to Origin via https://github.com/openshift/origin/pull/18225
The PR in question was merged yesterday and the fix should be available in OCP 3.9 builds. It would be really appreciated if anyone could test this against the latest OCP 3.9 builds. I am moving this bugzilla to MODIFIED status.
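For anyone picking up the verification, a sketch of a quick re-test on a 3.9 build with CNS deployed (this mirrors the reproducer earlier in this bug; the file and project names are placeholders):

oc new-project cns-term-test
oc create -f pvc.yaml        # PVC using the glusterfs storage class, as in the reproducer above
oc create -f pod.yaml        # pod mounting that PVC
oc delete project cns-term-test

# Expected with the fix: the pod leaves Terminating and the project disappears within a few minutes
oc get pods -n cns-term-test
oc get project cns-term-test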
*** Bug 1545222 has been marked as a duplicate of this bug. ***
Will the fix also be backported to 3.7, 3.8? Seems a bit harsh that there would be no fix to this until 3.9. From where I'm standing this bug is severely handicapping CNS.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:0639