Description of problem:
Customer is running a containerized installation of OCP 3.6 with CNS glusterfs and glusterblock. When trying to scale down the hawkular-cassandra pod, which has glusterblock storage mounted, the pod gets stuck in the "Terminating" state.
This was discovered while performing OS upgrades according to the documentation:
https://docs.openshift.com/container-platform/3.6/install_config/upgrading/os_upgrades.html
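For reference, the glusterblock-backed claim behind the pod can be confirmed with something like the following (a hedged sketch; the PVC and StorageClass names are illustrative defaults for a metrics + CNS deployment and may differ in this environment):
oc get pvc -n openshift-infra
oc describe pvc metrics-cassandra-1 -n openshift-infra    # illustrative PVC name
oc get storageclass glusterfs-storage-block -o yaml       # illustrative StorageClass name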
Version-Release number of selected component (if applicable):
Containerized OCP 3.6 as per documentation: https://docs.openshift.com/container-platform/3.6/install_config/install/advanced_install.html
CNS 3.6 deployed with cns-deploy as per documentation: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/container-native_storage_for_openshift_container_platform/#idm139668599809168
Metrics deployed with glusterblock backed storage as per documentation: https://docs.openshift.com/container-platform/3.6/install_config/cluster_metrics.html
with the following parameters in the inventory file:
openshift_metrics_install_metrics=false
openshift_metrics_hawkular_hostname=hawkular-metrics.apps.example.com
openshift_metrics_hawkular_replicas=1
openshift_metrics_cassandra_replicas=1
openshift_metrics_cassandra_limits_memory=2G
openshift_hosted_metrics_deployer_prefix=docker-registry-default.registry.example.com:443/openshift3/
openshift_hosted_metrics_deployer_version=v3.6.173.0.83
openshift_metrics_cassandra_nodeselector={"region":"infra"}
openshift_metrics_hawkular_nodeselector={"region":"infra"}
openshift_metrics_heapster_nodeselector={"region":"infra"}
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=50Gi
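For completeness, the deployment was driven by the metrics playbook from the cluster metrics documentation linked above, roughly as follows (a sketch; the inventory path is an assumption):
ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-metrics.yml   # inventory path is an assumption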
How reproducible:
2/2 for this customer
Steps to Reproduce:
1. Install components as described above
2. Find the node that is running the hawkular-cassandra pod:
# oc get pods -n openshift-infra -o wide
NAME                         READY     STATUS    RESTARTS   AGE       IP        NODE
hawkular-cassandra-1-qwert   1/1       Running   0          1d        1.2.3.4   <node>
...
3. Mark the node as unschedulable:
# oc adm manage-node <node> --schedulable=false
4. Drain the node:
# oc adm drain <node> --force --delete-local-data --ignore-daemonsets
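Optionally, the pod status can be watched from a second terminal while the drain runs:
# oc get pods -n openshift-infra -w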
Actual results:
The Cassandra pod is stuck in the Terminating state, seemingly forever:
# oc get pods -n openshift-infra
NAME                         READY     STATUS        RESTARTS   AGE
hawkular-cassandra-1-wj08g   0/1       Terminating   2          1d
hawkular-metrics-wngsq       0/1       Running       357        1d
heapster-gsgmk               0/1       Running       276        1d
The node produces errors such as:
Mar 26 15:58:03 node.example.com dockerd-current[44602]: time="2018-03-26T15:58:03.417790295+02:00" level=error msg="Handler for POST /v1.24/containers/4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251/stop returned error: Container 4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251 is already stopped"
Mar 26 15:58:03 node.example.com dockerd-current[44602]: E0326 15:58:03.418535 44969 remote_runtime.go:109] StopPodSandbox "4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "hawkular-cassandra-1-wj08g_openshift-infra" network: cni config uninitialized
Mar 26 15:58:03 node.example.com atomic-openshift-node[44787]: E0326 15:58:03.418535 44969 remote_runtime.go:109] StopPodSandbox "4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "hawkular-cassandra-1-wj08g_openshift-infra" network: cni config uninitialized
Mar 26 15:58:03 node.example.com atomic-openshift-node[44787]: E0326 15:58:03.418589 44969 kubelet.go:1460] error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
Mar 26 15:58:03 node.example.com atomic-openshift-node[44787]: E0326 15:58:03.418601 44969 pod_workers.go:182] Error syncing pod ecbda575-30f6-11e8-ad10-001a4a160755 ("hawkular-cassandra-1-wj08g_openshift-infra(ecbda575-30f6-11e8-ad10-001a4a160755)"), skipping: error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
Mar 26 15:58:03 node.example.com dockerd-current[44602]: E0326 15:58:03.418589 44969 kubelet.go:1460] error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
Mar 26 15:58:03 node.example.com dockerd-current[44602]: E0326 15:58:03.418601 44969 pod_workers.go:182] Error syncing pod ecbda575-30f6-11e8-ad10-001a4a160755 ("hawkular-cassandra-1-wj08g_openshift-infra(ecbda575-30f6-11e8-ad10-001a4a160755)"), skipping: error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
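The "cni config uninitialized" messages suggest the node's SDN/CNI configuration has been torn down while the pod still needs a network teardown. The node-side state can be checked with something like the following (a hedged sketch; service names assume an enterprise atomic-openshift containerized node):
ls -l /etc/cni/net.d/                               # the openshift-sdn CNI config should be present when the SDN is healthy
systemctl status atomic-openshift-node openvswitch  # on a containerized node both run as containers wrapped by systemd units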
This is addressed in: https://access.redhat.com/solutions/3241891
However, since this is a containerized installation, the sdn-ovs package is not installed by the installer:
# less /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/main.yml
...
- name: Install sdn-ovs package
  package:
    name: "{{ openshift.common.service_type }}-sdn-ovs{{ openshift_pkg_version | oo_image_tag_to_rpm_version(include_dash=True) }}"
    state: present
  when: openshift.common.use_openshift_sdn and not openshift.common.is_containerized | bool
...
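Whether the package is present on a given node can be checked directly (a hedged check; the package name assumes the default atomic-openshift service type, on origin it would be origin-sdn-ovs):
rpm -q atomic-openshift-sdn-ovs   # expected to report "package ... is not installed" on containerized nodes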
Expected results:
The pod should terminate without issues.
Additional info:
We have run into this issue twice and would most probably be able to reproduce it again. Last time we worked around this by restarting the node, which terminated the pod, but that is obviously not a feasible solution in the long run.
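For reference, the workaround boils down to something like the following (a hedged sketch; the original workaround was simply restarting the node, while restarting only the node service or force-deleting the pod object are untested alternatives, and none of them addresses the underlying CNI teardown failure):
systemctl restart atomic-openshift-node                                                 # on the affected node; a full host restart is what was actually done
oc delete pod hawkular-cassandra-1-wj08g -n openshift-infra --grace-period=0 --force    # force-remove the stuck pod from the API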
Comment 20 (Red Hat Bugzilla, 2023-09-15 00:07:10 UTC):
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days