Bug 1413850 - Dead containers on the node after upgrading
Summary: Dead containers on the node after upgrading
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 3.3.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 3.4.z
Assignee: Scott Dodson
QA Contact: liujia
Depends On:
Reported: 2017-01-17 06:55 UTC by Jaspreet Kaur
Modified: 2020-07-16 09:07 UTC (History)
4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2017-05-12 13:38:31 UTC
Target Upstream Version:


Description Jaspreet Kaur 2017-01-17 06:55:04 UTC
Description of problem: During the serialized upgrade of the nodes there is too little time between the evacuation of a node and the restart of the kubelet. The pods being evacuated do not have enough time to exit cleanly, and after the upgrade we see dead containers on the nodes. This is not a major problem for stateless pods/containers, but for containers that mount NFS volumes we have seen dead containers with active mounts to the NFS volume, causing NFS stale file handle errors. The tmpfs mounts for, e.g., secrets are also not cleaned up, since the "maintainer" pods (k8s_POD.*) did not exit cleanly and are now in status=dead. The workaround is to restart the Docker daemon on every node after the upgrade to remove every dead container and the mounts related to them.
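The workaround described above can be sketched as a small per-node cleanup script (the function name is ours; the `docker` and `systemctl` usage follows the report):

```shell
# Workaround sketch (not an official tool): run on each node after the upgrade.
cleanup_dead_containers() {
  # Record which containers are dead before cleanup.
  docker ps -a -f status=dead
  # Restarting the Docker daemon removes the dead containers and the
  # leftover mounts (NFS, tmpfs) tied to them.
  systemctl restart docker
}
```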

In the Ansible playbook (/usr/share/ansible/openshift-ansible/playbooks/byo/openshift-cluster/upgrades/v3_3/upgrade.yml) there is not enough time between when a node is evacuated and when "openvswitch" and "atomic-openshift-node" are restarted. The pods being evacuated have a default terminationGracePeriod of 30 seconds, and the upgrade playbook should wait for the pods to exit cleanly before "openvswitch" and "atomic-openshift-node" are restarted. A quick test showed that "openvswitch" is restarted 17 seconds after the node is evacuated.
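A minimal sketch of the missing step, assuming the 3.3-era evacuate command and the service names from this report (this is an illustration of the desired behavior, not the playbook's actual task):

```shell
# Sketch only: evacuate a node, then wait out the pods' termination grace
# period (default terminationGracePeriodSeconds is 30) before restarting
# the node services, instead of restarting them ~17s after evacuation.
evacuate_then_restart() {
  local node="$1" grace="${2:-30}"
  oadm manage-node "$node" --evacuate
  sleep "$grace"                       # give evacuated pods time to exit cleanly
  systemctl restart openvswitch atomic-openshift-node
}
```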

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:

Comment 1 Scott Dodson 2017-05-09 12:53:52 UTC
This is fixed in 3.4 by use of the drain command, which blocks until all pods are stopped before continuing.
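For reference, a hedged sketch of the 3.4-era approach: `oc adm drain` blocks until the node's pods are deleted, so the service restarts only happen afterwards. The flags shown are typical, not necessarily the playbook's exact invocation:

```shell
# Sketch: drain blocks until pods are gone, unlike the old evacuate-then-sleep
# flow, so there is no race against the termination grace period.
drain_then_restart() {
  local node="$1"
  # drain waits for pod termination; the restart only runs once it returns 0.
  oc adm drain "$node" --force --ignore-daemonsets \
    && systemctl restart openvswitch atomic-openshift-node
}
```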

Moving to ON_QA to verify.

Comment 2 liujia 2017-05-10 09:53:35 UTC
Reproduced on atomic-openshift-utils-3.3.78-1.git.0.fc348bc.el7.noarch.
After the upgrade completed successfully, there were dead containers:
# docker ps -a -f status=dead
CONTAINER ID        IMAGE                          COMMAND             CREATED             STATUS              PORTS               NAMES
70dbfc305824        openshift3/ose-pod:v3.2.1.31   "/pod"              15 minutes ago      Dead                                    k8s_POD.d245fd3e_docker-registry-4-k6ypo_default_168f4da3-3563-11e7-9b76-fa163ee094f9_83d9b6fb

Comment 3 liujia 2017-05-11 08:12:56 UTC

1. Install OCP 3.3.
# openshift version
openshift v3.3.1.24
kubernetes v1.3.0+52492b4
etcd 2.3.0+git
2. Create a new project and app; check that there are no dead containers.
3. Run the upgrade playbook.

The upgrade succeeded and no dead containers existed.
# docker ps -a -f status=dead
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES

Changing the bug's status to VERIFIED.

Comment 4 Scott Dodson 2017-05-12 13:38:31 UTC
This was fixed in the OCP 3.4 GA release.
