Description of problem:
Customer is running a containerized installation of OCP 3.6 with CNS glusterfs and glusterblock. When trying to scale down the hawkular-cassandra pod, which has glusterblock storage mounted, the pod gets stuck in the "Terminating" state.
This was discovered while performing OS upgrades according to the documentation:
https://docs.openshift.com/container-platform/3.6/install_config/upgrading/os_upgrades.html
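For reference, the glusterblock-backed claim behind the pod can be confirmed with something like the following (a hedged sketch; the PVC and StorageClass names are illustrative defaults for a metrics + CNS deployment and may differ in this environment):
oc get pvc -n openshift-infra
oc describe pvc metrics-cassandra-1 -n openshift-infra    # illustrative PVC name
oc get storageclass glusterfs-storage-block -o yaml       # illustrative StorageClass name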
Version-Release number of selected component (if applicable):
Containerized OCP 3.6 as per documentation: https://docs.openshift.com/container-platform/3.6/install_config/install/advanced_install.html
CNS 3.6 deployed with cns-deploy as per documentation: https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/container-native_storage_for_openshift_container_platform/#idm139668599809168
Metrics deployed with glusterblock backed storage as per documentation: https://docs.openshift.com/container-platform/3.6/install_config/cluster_metrics.html
with the following parameters in the inventory file:
openshift_metrics_install_metrics=false
openshift_metrics_hawkular_hostname=hawkular-metrics.apps.example.com
openshift_metrics_hawkular_replicas=1
openshift_metrics_cassandra_replicas=1
openshift_metrics_cassandra_limits_memory=2G
openshift_hosted_metrics_deployer_prefix=docker-registry-default.registry.example.com:443/openshift3/
openshift_hosted_metrics_deployer_version=v3.6.173.0.83
openshift_metrics_cassandra_nodeselector={"region":"infra"}
openshift_metrics_hawkular_nodeselector={"region":"infra"}
openshift_metrics_heapster_nodeselector={"region":"infra"}
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=50Gi
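For completeness, the deployment was driven by the metrics playbook from the cluster metrics documentation linked above, roughly as follows (a sketch; the inventory path is an assumption):
ansible-playbook -i /etc/ansible/hosts /usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/openshift-metrics.yml   # inventory path is an assumption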
How reproducible:
2/2 for this customer
Steps to Reproduce:
1. Install components as described above
2. Find the node that is running the hawkular-cassandra pod:
# oc get pods -n openshift-infra -o wide
NAME                         READY     STATUS    RESTARTS   AGE       IP        NODE
hawkular-cassandra-1-qwert   1/1       Running   0          1d        1.2.3.4   <node>
...
3. Mark the node as unschedulable:
# oc adm manage-node <node> --schedulable=false
4. Drain the node:
# oc adm drain <node> --force --delete-local-data --ignore-daemonsets
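Optionally, the pod status can be watched from a second terminal while the drain runs:
# oc get pods -n openshift-infra -w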
Actual results:
The Cassandra pod is stuck in the Terminating state, seemingly forever:
# oc get pods -n openshift-infra
NAME                         READY     STATUS        RESTARTS   AGE
hawkular-cassandra-1-wj08g   0/1       Terminating   2          1d
hawkular-metrics-wngsq       0/1       Running       357        1d
heapster-gsgmk               0/1       Running       276        1d
The node produces errors such as:
Mar 26 15:58:03 node.example.com dockerd-current[44602]: time="2018-03-26T15:58:03.417790295+02:00" level=error msg="Handler for POST /v1.24/containers/4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251/stop returned error: Container 4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251 is already stopped"
Mar 26 15:58:03 node.example.com dockerd-current[44602]: E0326 15:58:03.418535 44969 remote_runtime.go:109] StopPodSandbox "4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "hawkular-cassandra-1-wj08g_openshift-infra" network: cni config uninitialized
Mar 26 15:58:03 node.example.com atomic-openshift-node[44787]: E0326 15:58:03.418535 44969 remote_runtime.go:109] StopPodSandbox "4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "hawkular-cassandra-1-wj08g_openshift-infra" network: cni config uninitialized
Mar 26 15:58:03 node.example.com atomic-openshift-node[44787]: E0326 15:58:03.418589 44969 kubelet.go:1460] error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
Mar 26 15:58:03 node.example.com atomic-openshift-node[44787]: E0326 15:58:03.418601 44969 pod_workers.go:182] Error syncing pod ecbda575-30f6-11e8-ad10-001a4a160755 ("hawkular-cassandra-1-wj08g_openshift-infra(ecbda575-30f6-11e8-ad10-001a4a160755)"), skipping: error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
Mar 26 15:58:03 node.example.com dockerd-current[44602]: E0326 15:58:03.418589 44969 kubelet.go:1460] error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
Mar 26 15:58:03 node.example.com dockerd-current[44602]: E0326 15:58:03.418601 44969 pod_workers.go:182] Error syncing pod ecbda575-30f6-11e8-ad10-001a4a160755 ("hawkular-cassandra-1-wj08g_openshift-infra(ecbda575-30f6-11e8-ad10-001a4a160755)"), skipping: error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
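The "cni config uninitialized" messages suggest the node's SDN/CNI configuration has been torn down while the pod still needs a network teardown. The node-side state can be checked with something like the following (a hedged sketch; service names assume an enterprise atomic-openshift containerized node):
ls -l /etc/cni/net.d/                               # the openshift-sdn CNI config should be present when the SDN is healthy
systemctl status atomic-openshift-node openvswitch  # on a containerized node both run as containers wrapped by systemd units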
This is addressed in: https://access.redhat.com/solutions/3241891
However, since this is a containerized installation, the sdn-ovs package is not installed by the installer:
# less /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/main.yml
...
- name: Install sdn-ovs package
  package:
    name: "{{ openshift.common.service_type }}-sdn-ovs{{ openshift_pkg_version | oo_image_tag_to_rpm_version(include_dash=True) }}"
    state: present
  when: openshift.common.use_openshift_sdn and not openshift.common.is_containerized | bool
...
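Whether the package is present on a given node can be checked directly (a hedged check; the package name assumes the default atomic-openshift service type, on origin it would be origin-sdn-ovs):
rpm -q atomic-openshift-sdn-ovs   # expected to report "package ... is not installed" on containerized nodes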
Expected results:
The pod should terminate without issues.
Additional info:
We have run into this issue twice and would most probably be able to reproduce it again. Last time we worked around this by restarting the node, which terminated the pod, but that is obviously not a feasible solution in the long run.
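For reference, the workaround boils down to something like the following (a hedged sketch; the original workaround was simply restarting the node, while restarting only the node service or force-deleting the pod object are untested alternatives, and none of them addresses the underlying CNI teardown failure):
systemctl restart atomic-openshift-node                                                 # on the affected node; a full host restart is what was actually done
oc delete pod hawkular-cassandra-1-wj08g -n openshift-infra --grace-period=0 --force    # force-remove the stuck pod from the API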
Comment 20 (Red Hat Bugzilla, 2023-09-15 00:07:10 UTC):
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days