1591352 – bootstrap-autoapprover pod in multi-master OpenShift cluster remains in ContainerCreating status forever

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	openstack-containers
Sub Component:
Version:	13.0 (Queens)
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	ga
Target Release:	13.0 (Queens)
Assignee:	Michał Dulko
QA Contact:	Jon Uriarte
Docs Contact:	Andrew Burden
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-06-14 14:15 UTC by Jon Uriarte
Modified:	2018-06-28 08:01 UTC
CC List:	9 users
Fixed In Version:	openstack-kuryr-cni-container-13.0-67
Doc Type:	Bug Fix
Doc Text:	A race condition in openshift-node causes CNI script failures that prevent pods from receiving IP addresses. This happens when containers get spawned before pods in OpenShift API get updated with the correct containerId, and as a result the containerId value in the kuryr-cni script is hard-coded as "null". To avoid the issue and to have the IPs assigned correctly, use the Docker API instead of the OpenShift API to fetch containerId when generating the kuryr-cni script.
Clone Of:
Environment:
Last Closed:	2018-06-28 08:00:49 UTC
Target Upstream Version:
Embargoed:

Attachments
kuryr logs and more information (15.35 KB, application/x-gzip) 2018-06-14 14:15 UTC, Jon Uriarte

Links
System	ID	Private	Priority	Status	Summary	Last Updated
OpenStack gerrit	575119	0	None	MERGED	process to gracefully exit when last watcher exits	2020-09-09 07:53:17 UTC
Red Hat Product Errata	RHEA-2018:2085	0	None	None	None	2018-06-28 08:01:13 UTC

Description Jon Uriarte 2018-06-14 14:15:10 UTC

Created attachment 1451419
kuryr logs and more information

Description of problem:

When deploying a multi-master OpenShift cluster with the Kuryr SDN solution on top of OpenStack, the bootstrap-autoapprover pod remains in ContainerCreating status and the nodes never reach Ready status, so no application pods can be deployed.

Version-Release number of selected component (if applicable):
openstack-kuryr-kubernetes-controller-0.4.3-1.el7ost.noarch
openstack-kuryr-kubernetes-cni-0.4.2-0.20180404104924.985c387.el7ost.noarch

How reproducible: almost 100% when deploying a multi-master OpenShift cluster on top of OpenStack

Steps to Reproduce:
1. Get OCP openshift-ansible downstream rpm
2. Configure OSP (all.yml) and OCP (OSEv3.yml) inventory files
   - Set 'openshift_openstack_num_masters: 3' in inventory/group_vars/all.yml
3. Run:
ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory /usr/share/ansible/openshift-ansible/playbooks/openstack/openshift-cluster/prerequisites.yml

ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory /usr/share/ansible/openshift-ansible/playbooks/openstack/openshift-cluster/provision.yml

ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory red-hat-ca.yml

ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory /usr/share/ansible/openshift-ansible/playbooks/openstack/openshift-cluster/repos.yml

ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory /usr/share/ansible/openshift-ansible/playbooks/openstack/openshift-cluster/install.yml

Actual results:
The deployed OpenShift multi-master cluster is not fully working. The non-master nodes are NotReady and the bootstrap-autoapprover pod remains in ContainerCreating status.

[openshift@master-0 ~]$ oc get nodes
NAME                                 STATUS     ROLES           AGE       VERSION
app-node-0.openshift.example.com     NotReady   compute         54m       v1.10.0+b81c8f8
app-node-1.openshift.example.com     NotReady   compute         54m       v1.10.0+b81c8f8
infra-node-0.openshift.example.com   NotReady   compute,infra   54m       v1.10.0+b81c8f8
infra-node-1.openshift.example.com   NotReady   compute,infra   54m       v1.10.0+b81c8f8
master-0.openshift.example.com       Ready      master          59m       v1.10.0+b81c8f8
master-1.openshift.example.com       Ready      master          59m       v1.10.0+b81c8f8
master-2.openshift.example.com       Ready      master          59m       v1.10.0+b81c8f8
 
 
[openshift@master-0 ~]$ oc get pods --all-namespaces -o wide
NAMESPACE         NAME                                                READY     STATUS              RESTARTS   AGE       IP              NODE
default           router-1-sfsqt                                      1/1       Running             0          49m       192.168.99.10   infra-node-1.openshift.example.com
default           router-1-vvh6h                                      1/1       Running             0          49m       192.168.99.19   infra-node-0.openshift.example.com
kube-system       master-api-master-0.openshift.example.com           1/1       Running             1          58m       192.168.99.6    master-0.openshift.example.com
kube-system       master-api-master-1.openshift.example.com           1/1       Running             1          58m       192.168.99.15   master-1.openshift.example.com
kube-system       master-api-master-2.openshift.example.com           1/1       Running             1          58m       192.168.99.8    master-2.openshift.example.com
kube-system       master-controllers-master-0.openshift.example.com   1/1       Running             0          58m       192.168.99.6    master-0.openshift.example.com
kube-system       master-controllers-master-1.openshift.example.com   1/1       Running             0          58m       192.168.99.15   master-1.openshift.example.com
kube-system       master-controllers-master-2.openshift.example.com   1/1       Running             1          58m       192.168.99.8    master-2.openshift.example.com
kube-system       master-etcd-master-0.openshift.example.com          1/1       Running             1          58m       192.168.99.6    master-0.openshift.example.com
kube-system       master-etcd-master-1.openshift.example.com          1/1       Running             1          59m       192.168.99.15   master-1.openshift.example.com
kube-system       master-etcd-master-2.openshift.example.com          1/1       Running             1          59m       192.168.99.8    master-2.openshift.example.com
openshift-infra   bootstrap-autoapprover-0                            0/1       ContainerCreating   0          57m       <none>          master-1.openshift.example.com
openshift-infra   kuryr-cni-ds-67cxc                                  1/1       Running             0          54m       192.168.99.12   app-node-0.openshift.example.com
openshift-infra   kuryr-cni-ds-9dbsk                                  1/1       Running             0          53m       192.168.99.19   infra-node-0.openshift.example.com
openshift-infra   kuryr-cni-ds-bxnkz                                  1/1       Running             0          54m       192.168.99.4    app-node-1.openshift.example.com
openshift-infra   kuryr-cni-ds-cr5sc                                  1/1       Running             0          53m       192.168.99.15   master-1.openshift.example.com
openshift-infra   kuryr-cni-ds-gpnxp                                  1/1       Running             0          53m       192.168.99.10   infra-node-1.openshift.example.com
openshift-infra   kuryr-cni-ds-h2vsm                                  1/1       Running             0          53m       192.168.99.6    master-0.openshift.example.com
openshift-infra   kuryr-cni-ds-ztqms                                  1/1       Running             0          53m       192.168.99.8    master-2.openshift.example.com
openshift-infra   kuryr-controller-65c98f7444-qv7qh                   1/1       Running             0          57m       192.168.99.4    app-node-1.openshift.example.com
openshift-node    sync-ch285                                          1/1       Running             0          55m       192.168.99.10   infra-node-1.openshift.example.com
openshift-node    sync-d6r9f                                          1/1       Running             0          55m       192.168.99.19   infra-node-0.openshift.example.com
openshift-node    sync-h82jm                                          1/1       Running             0          57m       192.168.99.6    master-0.openshift.example.com
openshift-node    sync-jmwjx                                          1/1       Running             0          57m       192.168.99.8    master-2.openshift.example.com
openshift-node    sync-mhmfq                                          1/1       Running             0          54m       192.168.99.4    app-node-1.openshift.example.com
openshift-node    sync-nms7d                                          1/1       Running             0          54m       192.168.99.12   app-node-0.openshift.example.com
openshift-node    sync-qw9t8                                          1/1       Running             0          57m       192.168.99.15   master-1.openshift.example.com


Expected results:
A fully working OpenShift multi-master deployment, with all nodes in Ready status and the bootstrap-autoapprover pod in Running status.

Additional info:
The logs are attached.

Comment 1 Antoni Segura Puimedon 2018-06-14 14:31:36 UTC

The Kuryr CNI executable works by placing a script on the host that does:

docker exec ID_of_the_kuryr_cni_container ...

The ID is retrieved from the Kubernetes API. The CNI container can reach the API before the kubelet has updated the pod status field that contains the container ID. In that case the lookup returns null, and the generated script ends up with "null" in place of the container ID.
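
The race described above can be illustrated with a minimal sketch. This is not Kuryr's actual code: the function names container_id_from_pod and render_cni_script are hypothetical, the default command is a placeholder for the part the comment elides, and the pod dictionary only mimics the status.containerStatuses[].containerID field of a Kubernetes pod object (reported in the form docker://<id>). The point is that the generator should refuse to write a script when the ID is not populated yet, rather than embed a null value.

```python
def container_id_from_pod(pod, container_name):
    """Return the bare container ID from a pod object, or None if the
    kubelet has not reported it in the pod status yet."""
    statuses = pod.get("status", {}).get("containerStatuses") or []
    for status in statuses:
        if status.get("name") == container_name:
            container_id = status.get("containerID")
            if not container_id:
                return None
            # Strip the runtime prefix: "docker://<id>" -> "<id>"
            return container_id.split("://", 1)[-1]
    return None


def render_cni_script(container_id, command="kuryr-cni"):
    """Render the host-side wrapper script.

    `command` is a hypothetical placeholder for whatever is executed
    inside the CNI container. Raises instead of writing "null".
    """
    if not container_id:
        raise ValueError("container ID not available yet; retry later")
    return '#!/bin/sh\nexec docker exec -i {} {} "$@"\n'.format(
        container_id, command)
```

A caller would retry container_id_from_pod until it returns a value, and only then call render_cni_script.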

Comment 2 Antoni Segura Puimedon 2018-06-14 14:35:35 UTC

An alternative would be to stop using the Kubernetes API to find the container to execute in, and instead have the CNI script look the container up itself through the Docker API, as in this example:

[openshift@app-node-1 ~]$ cat findit.sh 
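# Name of the pod whose container ID we want (first argument)
CNI_POD_NAME="$1"
# Python filter; the unquoted heredoc lets the shell expand $CNI_POD_NAME
read -r -d '' finder <<EOF
import json
import sys

containers = json.load(sys.stdin)
for container in containers:
    labels = container.get('Labels') or {}
    # Match the pod by name and skip the pod infrastructure container
    if (labels.get('io.kubernetes.pod.name') == '$CNI_POD_NAME' and
            labels.get('com.redhat.component') != 'openshift-enterprise-pod-container'):
        print(container['Id'])
EOF

# List running containers through the local Docker daemon socket
curl --unix-socket /var/run/docker.sock http://localhost/containers/json 2> /dev/null | python -c "$finder"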
[openshift@app-node-1 ~]$ sudo sh findit.sh kuryr-cni-ds-sflpf
69d8bdec7033fb3d4267ebd8781fc57a9af4e5491ab0352e579c7cff1d3fa31a

The output of findit.sh (or of an equivalent function) could be cached in some directory, and the CNI container could wipe that cache when it starts so that the ID has to be regenerated. That way upgrades work instead of pointing at the terminating CNI container.
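
The caching idea in this comment could look roughly like the sketch below. It is only an illustration under stated assumptions: cached_container_id and wipe_cache are hypothetical names, the cache is a single file whose path the caller chooses, and finder stands for any callable that returns the current container ID (for example, a wrapper around the Docker lookup shown above).

```python
import os
import tempfile


def cached_container_id(cache_path, finder):
    """Return the cached container ID, or call finder() and cache its
    result when no usable cache file exists."""
    try:
        with open(cache_path) as cache_file:
            cached = cache_file.read().strip()
        if cached:
            return cached
    except FileNotFoundError:
        pass

    container_id = finder()
    if container_id:
        # Write to a temporary file and rename it, so a concurrent
        # reader never sees a half-written cache file.
        directory = os.path.dirname(cache_path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as tmp_file:
            tmp_file.write(container_id)
        os.replace(tmp_path, cache_path)
    return container_id


def wipe_cache(cache_path):
    """Remove the cache file; meant to run when the CNI container
    starts, so the next lookup finds the new container."""
    try:
        os.remove(cache_path)
    except FileNotFoundError:
        pass
```

An empty or missing result from finder is deliberately not cached, so a lookup that races with container start-up is simply retried on the next call.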

Comment 9 Jon Schlueter 2018-06-20 13:16:23 UTC

Only the openstack-kuryr-cni container image changes for this fix.

Comment 13 Jon Uriarte 2018-06-20 20:25:03 UTC

Verified in https://access.redhat.com/containers/#/registry.access.redhat.com/rhosp13/openstack-kuryr-cni/images/13.0-67 image.

A multi-master cluster deploys successfully with the new kuryr-cni image: all the pods are running and the OpenShift nodes are ready.

Verification steps:
1. Get OCP openshift-ansible downstream rpm
2. Configure OSP (all.yml) and OCP (OSEv3.yml) inventory files
   - Set 'openshift_openstack_num_masters: 3' in inventory/group_vars/all.yml
3. Run:
ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory /usr/share/ansible/openshift-ansible/playbooks/openstack/openshift-cluster/prerequisites.yml

ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory /usr/share/ansible/openshift-ansible/playbooks/openstack/openshift-cluster/provision.yml

ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory red-hat-ca.yml

ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory /usr/share/ansible/openshift-ansible/playbooks/openstack/openshift-cluster/repos.yml

ansible-playbook --user openshift -i /usr/share/ansible/openshift-ansible/playbooks/openstack/inventory.py -i inventory /usr/share/ansible/openshift-ansible/playbooks/openstack/openshift-cluster/install.yml

4. Check the installer finishes without errors
5. Check the VMs deployed in the overcloud
(overcloud) [cloud-user@ansible-host ~]$ openstack server list
+--------------------------------------+------------------------------------+--------+-------------------------------------------------------------------------+---------+-----------+
| ID                                   | Name                               | Status | Networks                                                                | Image   | Flavor    |
+--------------------------------------+------------------------------------+--------+-------------------------------------------------------------------------+---------+-----------+
| 214765eb-2028-44b4-ac68-97bf56b36586 | infra-node-1.openshift.example.com | ACTIVE | openshift-ansible-openshift.example.com-net=192.168.99.10, 172.20.0.235 | rhel75  | m1.node   |
| c8fc3cea-1371-42e7-88f0-634330a8db13 | infra-node-0.openshift.example.com | ACTIVE | openshift-ansible-openshift.example.com-net=192.168.99.20, 172.20.0.210 | rhel75  | m1.node   |
| 266d023c-537b-4116-8a05-250e5fca1c09 | master-2.openshift.example.com     | ACTIVE | openshift-ansible-openshift.example.com-net=192.168.99.14, 172.20.0.236 | rhel75  | m1.master |
| d4730b0f-5a79-45d6-9089-e1f3b5f92a72 | master-0.openshift.example.com     | ACTIVE | openshift-ansible-openshift.example.com-net=192.168.99.4, 172.20.0.234  | rhel75  | m1.master |
| 3a1e4b89-746d-46e8-9a97-1b9a627ad505 | master-1.openshift.example.com     | ACTIVE | openshift-ansible-openshift.example.com-net=192.168.99.5, 172.20.0.220  | rhel75  | m1.master |
| 6079fc25-bf35-4924-b89a-4c0afe92f7e6 | app-node-1.openshift.example.com   | ACTIVE | openshift-ansible-openshift.example.com-net=192.168.99.13, 172.20.0.223 | rhel75  | m1.node   |
| dd006f99-a61e-47a7-9a19-b9b36a001712 | app-node-0.openshift.example.com   | ACTIVE | openshift-ansible-openshift.example.com-net=192.168.99.6, 172.20.0.233  | rhel75  | m1.node   |
| 19ea6449-8fe3-41b7-bff8-ce973357ccfe | openshift-dns                      | ACTIVE | openshift-dns=192.168.23.3, 172.20.0.218                                | centos7 | m1.small  |
| 490056e5-a0b0-4af8-8cb2-f7b7321dd604 | ansible-host                       | ACTIVE | ansible-host=172.16.0.6, 172.20.0.212                                   | rhel75  | m1.small  |
+--------------------------------------+------------------------------------+--------+-------------------------------------------------------------------------+---------+-----------+

6. Check all the nodes are Ready
[openshift@master-0 ~]$ oc get nodes
NAME                                 STATUS    ROLES     AGE       VERSION
app-node-0.openshift.example.com     Ready     compute   51m       v1.10.0+b81c8f8
app-node-1.openshift.example.com     Ready     compute   51m       v1.10.0+b81c8f8
infra-node-0.openshift.example.com   Ready     infra     51m       v1.10.0+b81c8f8
infra-node-1.openshift.example.com   Ready     infra     51m       v1.10.0+b81c8f8
master-0.openshift.example.com       Ready     master    55m       v1.10.0+b81c8f8
master-1.openshift.example.com       Ready     master    55m       v1.10.0+b81c8f8
master-2.openshift.example.com       Ready     master    55m       v1.10.0+b81c8f8

7. Check all the pods are Running
[openshift@master-0 ~]$ oc get pods --all-namespaces -o wide
NAMESPACE         NAME                                                READY     STATUS    RESTARTS   AGE       IP              NODE
default           router-1-8tjvx                                      1/1       Running   0          46m       192.168.99.20   infra-node-0.openshift.example.com
default           router-1-gtkmg                                      1/1       Running   0          46m       192.168.99.10   infra-node-1.openshift.example.com
kube-system       master-api-master-0.openshift.example.com           1/1       Running   1          54m       192.168.99.4    master-0.openshift.example.com
kube-system       master-api-master-1.openshift.example.com           1/1       Running   1          54m       192.168.99.5    master-1.openshift.example.com
kube-system       master-api-master-2.openshift.example.com           1/1       Running   1          54m       192.168.99.14   master-2.openshift.example.com
kube-system       master-controllers-master-0.openshift.example.com   1/1       Running   1          54m       192.168.99.4    master-0.openshift.example.com
kube-system       master-controllers-master-1.openshift.example.com   1/1       Running   0          54m       192.168.99.5    master-1.openshift.example.com
kube-system       master-controllers-master-2.openshift.example.com   1/1       Running   2          54m       192.168.99.14   master-2.openshift.example.com
kube-system       master-etcd-master-0.openshift.example.com          1/1       Running   1          54m       192.168.99.4    master-0.openshift.example.com
kube-system       master-etcd-master-1.openshift.example.com          1/1       Running   1          54m       192.168.99.5    master-1.openshift.example.com
kube-system       master-etcd-master-2.openshift.example.com          1/1       Running   1          54m       192.168.99.14   master-2.openshift.example.com
openshift-infra   bootstrap-autoapprover-0                            1/1       Running   0          52m       10.11.0.50      master-2.openshift.example.com
openshift-infra   kuryr-cni-ds-5ddlw                                  1/1       Running   0          50m       192.168.99.6    app-node-0.openshift.example.com
openshift-infra   kuryr-cni-ds-8tcxv                                  1/1       Running   0          49m       192.168.99.4    master-0.openshift.example.com
openshift-infra   kuryr-cni-ds-ck8tn                                  1/1       Running   0          50m       192.168.99.13   app-node-1.openshift.example.com
openshift-infra   kuryr-cni-ds-f6mqr                                  1/1       Running   0          49m       192.168.99.5    master-1.openshift.example.com
openshift-infra   kuryr-cni-ds-gxrbp                                  1/1       Running   0          49m       192.168.99.10   infra-node-1.openshift.example.com
openshift-infra   kuryr-cni-ds-n84fs                                  1/1       Running   0          49m       192.168.99.14   master-2.openshift.example.com
openshift-infra   kuryr-cni-ds-tdxvr                                  1/1       Running   0          49m       192.168.99.20   infra-node-0.openshift.example.com
openshift-infra   kuryr-controller-65c98f7444-zpg47                   1/1       Running   1          53m       192.168.99.13   app-node-1.openshift.example.com
openshift-node    sync-2sqqx                                          1/1       Running   0          53m       192.168.99.14   master-2.openshift.example.com
openshift-node    sync-d5mj7                                          1/1       Running   0          53m       192.168.99.4    master-0.openshift.example.com
openshift-node    sync-djjmd                                          1/1       Running   0          50m       192.168.99.13   app-node-1.openshift.example.com
openshift-node    sync-jcxjv                                          1/1       Running   0          50m       192.168.99.20   infra-node-0.openshift.example.com
openshift-node    sync-jxwvf                                          1/1       Running   0          50m       192.168.99.6    app-node-0.openshift.example.com
openshift-node    sync-qhb7h                                          1/1       Running   0          53m       192.168.99.5    master-1.openshift.example.com
openshift-node    sync-x6mn8                                          1/1       Running   0          50m       192.168.99.10   infra-node-1.openshift.example.com

8. Deploy a DeploymentConfig (dc) with 8 replicas
[openshift@master-0 ~]$ oc get pods -o wide
NAME           READY     STATUS    RESTARTS   AGE       IP           NODE
demo-1-29pg2   1/1       Running   0          27s       10.11.0.13   app-node-0.openshift.example.com
demo-1-5kj7s   1/1       Running   0          27s       10.11.0.10   app-node-1.openshift.example.com
demo-1-7x99s   1/1       Running   0          28s       10.11.0.2    app-node-0.openshift.example.com
demo-1-8qxdd   1/1       Running   0          28s       10.11.0.12   app-node-0.openshift.example.com
demo-1-97fmp   1/1       Running   0          27s       10.11.0.6    app-node-1.openshift.example.com
demo-1-cn2l7   1/1       Running   0          28s       10.11.0.14   app-node-1.openshift.example.com
demo-1-fcsnf   1/1       Running   0          5m        10.11.0.9    app-node-1.openshift.example.com
demo-1-hswj4   1/1       Running   0          27s       10.11.0.8    app-node-0.openshift.example.com
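
The node and pod checks in steps 6 and 7 can also be scripted. The following is a minimal sketch, assuming the default whitespace-separated table output of oc get shown above, with no empty cells to the left of the inspected column. The helper name rows_not_in_state is hypothetical.

```python
def rows_not_in_state(table_text, column, expected):
    """Return the NAME of every row whose `column` differs from `expected`.

    `table_text` is the plain table printed by `oc get nodes` or
    `oc get pods --all-namespaces -o wide`.
    """
    lines = [line for line in table_text.splitlines() if line.strip()]
    header = lines[0].split()
    name_index = header.index("NAME")
    column_index = header.index(column)
    failing = []
    for line in lines[1:]:
        fields = line.split()
        if fields[column_index] != expected:
            failing.append(fields[name_index])
    return failing
```

For example, rows_not_in_state(nodes_output, "STATUS", "Ready") returns an empty list when every node is Ready, and rows_not_in_state(pods_output, "STATUS", "Running") lists any pod still stuck in ContainerCreating.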

Comment 15 errata-xmlrpc 2018-06-28 08:00:49 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2085
