Description of problem:

The customer is running a containerized installation of OCP 3.6 with CNS glusterfs and glusterblock. When trying to scale down the hawkular-cassandra pod, which has glusterblock storage mounted, the pod gets stuck in the "Terminating" state. This was discovered while performing OS upgrades according to the documentation:
https://docs.openshift.com/container-platform/3.6/install_config/upgrading/os_upgrades.html

Version-Release number of selected component (if applicable):

Containerized OCP 3.6, installed as per the documentation:
https://docs.openshift.com/container-platform/3.6/install_config/install/advanced_install.html

CNS 3.6, deployed with cns-deploy as per the documentation:
https://access.redhat.com/documentation/en-us/red_hat_gluster_storage/3.3/html-single/container-native_storage_for_openshift_container_platform/#idm139668599809168

Metrics deployed with glusterblock-backed storage as per the documentation:
https://docs.openshift.com/container-platform/3.6/install_config/cluster_metrics.html
with the following parameters in the inventory file:

openshift_metrics_install_metrics=false
openshift_metrics_hawkular_hostname=hawkular-metrics.apps.example.com
openshift_metrics_hawkular_replicas=1
openshift_metrics_cassandra_replicas=1
openshift_metrics_cassandra_limits_memory=2G
openshift_hosted_metrics_deployer_prefix=docker-registry-default.registry.example.com:443/openshift3/
openshift_hosted_metrics_deployer_version=v3.6.173.0.83
openshift_metrics_cassandra_nodeselector={"region":"infra"}
openshift_metrics_hawkular_nodeselector={"region":"infra"}
openshift_metrics_heapster_nodeselector={"region":"infra"}
openshift_metrics_cassandra_storage_type=dynamic
openshift_metrics_cassandra_pvc_size=50Gi

How reproducible:

2/2 for this customer

Steps to Reproduce:

1. Install the components as described above.

2. Find the node that is running the hawkular-cassandra pod:

# oc get pods -n openshift-infra -o wide
NAME                         READY     STATUS    RESTARTS   AGE       IP        NODE
hawkular-cassandra-1-qwert   1/1       Running   0          1d        1.2.3.4   <node>
...

3. Mark the node as unschedulable:

# oc adm manage-node <node> --schedulable=false

4. Drain the node:

# oc adm drain <node> --force --delete-local-data --ignore-daemonsets
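Note: before draining, something like the following can be used to confirm that the cassandra pod is indeed backed by a dynamically provisioned glusterblock volume (the pod name is the illustrative one from step 2 and will differ per installation):

# oc get pvc -n openshift-infra
# oc describe pod hawkular-cassandra-1-qwert -n openshift-infra | grep -i -A5 volumes
# iscsiadm -m session
(run on <node>; glusterblock PVs are attached to the node over iSCSI, so they should appear as iSCSI sessions)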
Actual results:

The cassandra pod is stuck in the Terminating state, seemingly forever:

# oc get pods -n openshift-infra
NAME                         READY     STATUS        RESTARTS   AGE
hawkular-cassandra-1-wj08g   0/1       Terminating   2          1d
hawkular-metrics-wngsq       0/1       Running       357        1d
heapster-gsgmk               0/1       Running       276        1d

The node produces errors such as:

Mar 26 15:58:03 node.example.com dockerd-current[44602]: time="2018-03-26T15:58:03.417790295+02:00" level=error msg="Handler for POST /v1.24/containers/4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251/stop returned error: Container 4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251 is already stopped"
Mar 26 15:58:03 node.example.com dockerd-current[44602]: E0326 15:58:03.418535 44969 remote_runtime.go:109] StopPodSandbox "4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "hawkular-cassandra-1-wj08g_openshift-infra" network: cni config uninitialized
Mar 26 15:58:03 node.example.com atomic-openshift-node[44787]: E0326 15:58:03.418535 44969 remote_runtime.go:109] StopPodSandbox "4adc27ec07bd6455ef60405f47a75a69b7548f58a108ad31d154e7b646eda251" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "hawkular-cassandra-1-wj08g_openshift-infra" network: cni config uninitialized
Mar 26 15:58:03 node.example.com atomic-openshift-node[44787]: E0326 15:58:03.418589 44969 kubelet.go:1460] error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
Mar 26 15:58:03 node.example.com atomic-openshift-node[44787]: E0326 15:58:03.418601 44969 pod_workers.go:182] Error syncing pod ecbda575-30f6-11e8-ad10-001a4a160755 ("hawkular-cassandra-1-wj08g_openshift-infra(ecbda575-30f6-11e8-ad10-001a4a160755)"), skipping: error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
Mar 26 15:58:03 node.example.com dockerd-current[44602]: E0326 15:58:03.418589 44969 kubelet.go:1460] error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
Mar 26 15:58:03 node.example.com dockerd-current[44602]: E0326 15:58:03.418601 44969 pod_workers.go:182] Error syncing pod ecbda575-30f6-11e8-ad10-001a4a160755 ("hawkular-cassandra-1-wj08g_openshift-infra(ecbda575-30f6-11e8-ad10-001a4a160755)"), skipping: error killing pod: failed to "KillPodSandbox" for "ecbda575-30f6-11e8-ad10-001a4a160755" with KillPodSandboxError: "rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod \"hawkular-cassandra-1-wj08g_openshift-infra\" network: cni config uninitialized"
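Note: the "cni config uninitialized" condition should also be visible directly on the affected node; roughly the following can be used to confirm it (exact file names under /etc/cni/net.d/ depend on the SDN plugin in use):

# journalctl -u atomic-openshift-node --no-pager | grep "cni config uninitialized"
# ls -l /etc/cni/net.d/
(an empty /etc/cni/net.d/ while the node expects an openshift-sdn CNI config would be consistent with the errors above)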
This error is treated in https://access.redhat.com/solutions/3241891, but since this is a containerized installation, the sdn-ovs package is not installed by the installer:

# less /usr/share/ansible/openshift-ansible/roles/openshift_node/tasks/main.yml
...
- name: Install sdn-ovs package
  package:
    name: "{{ openshift.common.service_type }}-sdn-ovs{{ openshift_pkg_version | oo_image_tag_to_rpm_version(include_dash=True) }}"
    state: present
  when: openshift.common.use_openshift_sdn and not openshift.common.is_containerized | bool
...

Expected results:

The pod should terminate without issues.

Additional info:

We have run into this issue twice and would most probably be able to reproduce it again. Last time we worked around it by restarting the node, which terminated the pod, but that is obviously not a feasible solution in the long run.
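For reference, the workaround mentioned above boils down to roughly the following on the affected node (it was not verified whether restarting only the atomic-openshift-node service would also release the pod; rebooting the node is what worked here):

# systemctl reboot
(wait for the node to come back)
# oc get pods -n openshift-infra
(the stuck hawkular-cassandra pod is gone after the reboot)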