Bug 1633892 - OCP 3.11: CRI-O cluster install with openshift-ansible fails on large AWS instances m5.12xlarge and m4.16xlarge due to Master node openshift-sdn pod in CrashLoopBackOff state
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 3.11.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Dan Williams
QA Contact: Walid A.
Whiteboard: aos-scalability-311
Blocks: 1634151
 
Reported: 2018-09-28 02:33 UTC by Walid A.
Modified: 2019-06-06 02:00 UTC
CC List: 17 users

Fixed In Version: openshift-ansible-3.10.54-1.git.0.714584b
Clones: 1634151
Last Closed: 2019-06-06 02:00:28 UTC


Attachments


Links
GitHub openshift/openshift-ansible pull 10269 (closed): "Start only the ovsdb so we can add the config safely" - last updated 2020-11-23 17:24:12 UTC
Red Hat Product Errata RHBA-2019:0794 - last updated 2019-06-06 02:00:41 UTC

Description Walid A. 2018-09-28 02:33:51 UTC
Description of problem:
Install of OCP 3.11.15 and OCP 3.11.16 with the cri-o container runtime fails on large AWS instances such as m5.12xlarge or m4.16xlarge when using the openshift-ansible deploy_cluster.yml playbook.  This is reproducible, and does not happen on smaller instance types such as m5.xlarge.

openshift-web-console pods are stuck in ContainerCreating because the openshift-sdn pod on the master node is in CrashLoopBackOff state and keeps restarting unsuccessfully.  Journal logs on the master show these errors:

Sep 27 17:49:41 ip-172-31-24-164.us-west-2.compute.internal atomic-openshift-node[31727]: E0927 17:49:41.232472   31727 pod_workers.go:186] Error syncing pod e92b4f2e-c25c-11e8-bc31-027fae47c8b8 ("sdn-7h9zs_openshift-sdn(e92b4f2e-c25c-11e8-bc31-027fae47c8b8)"), skipping: failed to "StartContainer" for "sdn" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=sdn pod=sdn-7h9zs_openshift-sdn(e92b4f2e-c25c-11e8-bc31-027fae47c8b8)"
Sep 27 17:49:52 ip-172-31-24-164.us-west-2.compute.internal atomic-openshift-node[31727]: I0927 17:49:52.233750   31727 kuberuntime_manager.go:757] checking backoff for container "sdn" in pod "sdn-7h9zs_openshift-sdn(e92b4f2e-c25c-11e8-bc31-027fae47c8b8)"
Sep 27 17:49:52 ip-172-31-24-164.us-west-2.compute.internal atomic-openshift-node[31727]: I0927 17:49:52.233896   31727 kuberuntime_manager.go:767] Back-off 5m0s restarting failed container=sdn pod=sdn-7h9zs_openshift-sdn(e92b4f2e-c25c-11e8-bc31-027fae47c8b8)

...

27fae47c8b8_0(b8ac1c906eaf4956c69c00dd0646b81be6bc3f94f8fcc2ef5c474fb1d7ab2347): OpenShift SDN network process is not (yet?) available"
Sep 27 14:02:55 ip-172-31-24-164.us-west-2.compute.internal crio[3207]: time="2018-09-27 14:02:55.430034609Z" level=error msg="Error adding network: OpenShift SDN network process is not (yet?) available"
Sep 27 14:02:55 ip-172-31-24-164.us-west-2.compute.internal crio[3207]: time="2018-09-27 14:02:55.430067057Z" level=error msg="Error while adding to cni network: OpenShift SDN network process is not (yet?) available"
Sep 27 14:02:55 ip-172-31-24-164.us-west-2.compute.internal crio[3207]: time="2018-09-27 14:02:55.457593004Z" level=error msg="Error deleting network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: no such file or directory"
Sep 27 14:02:55 ip-172-31-24-164.us-west-2.compute.internal atomic-openshift-node[31727]: E0927 14:02:55.482319   31727 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_webconsole-78fd5bb8cf-tt7hj_openshift-web-console_2d07195a-c242-11e8-bc31-027fae47c8b8_0(52cb35e2de1c3e3c69b86344500ffcc219622d69b05c13b91cd1682b91d164b3): OpenShift SDN network process is not (yet?) available
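
A quick way to confirm the failure state on the master (standard oc and journalctl invocations; the pod name is the one from the logs above):

   # the sdn pod on the master shows CrashLoopBackOff:
   oc get pods -n openshift-sdn -o wide
   # output from the last crashed sdn container:
   oc logs -n openshift-sdn sdn-7h9zs --previous
   # node-side view from cri-o and the node service:
   journalctl -u crio -u atomic-openshift-node --since "10 min ago"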
 
Version-Release number of the following components:

openshift v3.11.16
kubernetes v1.11.0+d4cacc0

openshift-ansible rpm not installed; running from a git-cloned repo on the release-3.11 branch:
commit 5067b5eaacd7a69e5d77a264803f48939a5c144f
Merge: 4ac6f81 03a2257

rpm -q ansible: 
ansible-2.6.4-1.el7ae.noarch

ansible --version:
ansible 2.6.4
  config file = /root/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Sep 12 2018, 05:31:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]



How reproducible:
Reproducible on m5.12xlarge, m5.24xlarge, and m4.16xlarge AWS instance types

Steps to Reproduce:
1. On a jenkins slave, git clone the latest release-3.11 branch of openshift-ansible, commit id:
commit 5067b5eaacd7a69e5d77a264803f48939a5c144f
Merge: 4ac6f81 03a2257
Author: OpenShift Merge Robot <openshift-merge-robot.github.com>
Date:   Wed Sep 26 10:41:28 2018 -0700

    Merge pull request #10236 from openshift-cherrypick-robot/cherry-pick-10231-to-release-3.11

2. Run the openshift-ansible/playbooks/prerequisites.yml
   cd openshift-ansible/playbooks/
   ansible-playbook -i flexy_inventory prerequisites.yml

3. Deploy an OCP 3.11.16 environment with the cri-o container runtime using the openshift-ansible deploy_cluster.yml playbook (see the inventory sketch after this list):
   ansible-playbook -i flexy_inventory deploy_cluster.yml

4. The install fails at task [openshift_web_console : Get console pod logs]. SSH to the master node; oc get pods -n openshift-sdn -o yaml shows the sdn pod on the master node in CrashLoopBackOff state.
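
For reference, a minimal sketch of the cri-o-related inventory settings used in step 3 (hedged: the actual flexy_inventory is not attached, and the variable names below are the standard openshift-ansible ones):

   [OSEv3:vars]
   openshift_deployment_type=openshift-enterprise
   openshift_release=v3.11
   # enable the cri-o container runtime on the nodes
   openshift_use_crio=true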



Actual results:
 
Playbook fails at task: 

TASK [openshift_web_console : Get console pod logs] ****************************
Thursday 27 September 2018  18:54:52 +0800 (0:00:00.107)       0:32:55.187 **** 
fatal: [ec2-18-236-162-83.us-west-2.compute.amazonaws.com]: FAILED! => {"changed": true, "cmd": ["oc", "logs", "deployment/webconsole", "--tail=50", "--config=/etc/origin/master/admin.kubeconfig", "-n", "openshift-web-console"], "delta": "0:00:00.221403", "end": "2018-09-27 10:54:56.920797", "msg": "non-zero return code", "rc": 1, "start": "2018-09-27 10:54:56.699394", "stderr": "Error from server (BadRequest): container \"webconsole\" in pod \"webconsole-78fd5bb8cf-tt7hj\" is waiting to start: ContainerCreating", "stderr_lines": ["Error from server (BadRequest): container \"webconsole\" in pod \"webconsole-78fd5bb8cf-tt7hj\" is waiting to start: ContainerCreating"], "stdout": "", "stdout_lines": []}
...ignoring

TASK [openshift_web_console : debug] *******************************************
Thursday 27 September 2018  18:54:53 +0800 (0:00:01.039)       0:32:56.226 **** 
ok: [ec2-18-236-162-83.us-west-2.compute.amazonaws.com] => {
    "msg": []
}

TASK [openshift_web_console : Report console errors] ***************************
Thursday 27 September 2018  18:54:53 +0800 (0:00:00.102)       0:32:56.329 **** 
fatal: [ec2-18-236-162-83.us-west-2.compute.amazonaws.com]: FAILED! => {"changed": false, "msg": "Console install failed."}
	to retry, use: --limit @/home/slave4/workspace/Launch Environment Flexy/private-openshift-ansible/playbooks/deploy_cluster.retry

PLAY RECAP *********************************************************************
ec2-18-236-162-83.us-west-2.compute.amazonaws.com : ok=641  changed=274  unreachable=0    failed=1   
ec2-34-217-148-80.us-west-2.compute.amazonaws.com : ok=129  changed=61   unreachable=0    failed=0   
ec2-35-164-234-63.us-west-2.compute.amazonaws.com : ok=129  changed=61   unreachable=0    failed=0   
ec2-54-202-53-94.us-west-2.compute.amazonaws.com : ok=129  changed=61   unreachable=0    failed=0   
localhost                  : ok=11   changed=0    unreachable=0    failed=0   


INSTALLER STATUS ***************************************************************
Initialization               : Complete (0:00:49)
Health Check                 : Complete (0:00:03)
Node Bootstrap Preparation   : Complete (0:03:33)
etcd Install                 : Complete (0:01:46)
Master Install               : Complete (0:08:44)
Master Additional Install    : Complete (0:01:19)
Node Join                    : Complete (0:01:32)
Hosted Install               : Complete (0:01:33)
Cluster Monitoring Operator  : Complete (0:01:37)
Web Console Install          : In Progress (0:11:45)
	This phase can be restarted by running: playbooks/openshift-web-console/config.yml
Thursday 27 September 2018  18:54:53 +0800 (0:00:00.072)       0:32:56.401 **** 
=============================================================================== 
openshift_web_console : Verify that the console is running ------------ 680.79s
openshift_control_plane : Wait for control plane pods to appear ------- 110.86s
openshift_manage_node : Wait for sync DS to set annotations on all nodes -- 69.01s
openshift_cluster_monitoring_operator : Wait for the ServiceMonitor CRD to be created -- 67.02s
openshift_control_plane : Wait for all control plane pods to become ready -- 58.87s
openshift_node_group : Wait for the sync daemonset to become ready and available -- 24.48s
openshift_sdn : Copy templates to temp directory ----------------------- 12.63s
openshift_node : Install node, clients, and conntrack packages --------- 12.48s
openshift_node_group : Copy templates to temp directory ---------------- 12.04s
tuned : Ensure files are populated from templates ---------------------- 11.87s
Run variable sanity checks --------------------------------------------- 11.53s
openshift_node : Install NFS storage plugin dependencies --------------- 11.45s
openshift_manageiq : Configure role/user permissions ------------------- 11.02s
openshift_control_plane : Wait for APIs to become available ------------ 10.72s
etcd : Install etcd ---------------------------------------------------- 10.14s
openshift_node : install needed rpm(s) ---------------------------------- 9.54s
openshift_control_plane : Copy static master scripts -------------------- 8.98s
tuned : Ensure files are populated from templates ----------------------- 8.65s
openshift_control_plane : Prepare master static pods -------------------- 8.47s
openshift_hosted : Create OpenShift router ------------------------------ 8.06s



Expected results:
Successful install

Additional info:
Links to logs from ansible-playbook, oc get events, oc logs from failed pods, and the journal are in the next attachment.

Comment 4 Weihua Meng 2018-09-28 09:20:50 UTC
hit the same issue on GCP
Machine type n1-standard-64
OCP v3.11.16

Comment 6 Casey Callendrello 2018-09-28 12:58:22 UTC
Dan (dcbw), can you take a look?

Comment 7 Casey Callendrello 2018-09-28 16:00:42 UTC
From the logs, it looks like openvswitch is hanging / timing out / on a trip to mars.

Adding aconole to CC list. Aaron, if we get a machine in this state, could you help us diagnose this?

Comment 8 Dan Winship 2018-09-28 16:23:34 UTC
Can we get the OVS logs from a failed install?
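
(For whoever collects them, the usual places, assuming the standard 3.11 layout where the ovs daemonset logs end up under the host's /var/log/openvswitch:)

   # on the affected node:
   cat /var/log/openvswitch/ovs-vswitchd.log /var/log/openvswitch/ovsdb-server.log
   # or via the API; note ds/ovs picks an arbitrary pod, so use the pod on the
   # affected node from `oc get pods -n openshift-sdn -o wide`:
   oc logs -n openshift-sdn ds/ovs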

Comment 9 Phil Cameron 2018-09-28 18:18:26 UTC
Looks like ovs-vsctl timed out (maybe after 30 sec)
./pkg/util/ovs/ovs.go +166

func (ovsif *ovsExec) execWithStdin(cmd string, stdinArgs []string, args ...string) (string, error) {
...
        switch cmd {
        case OVS_VSCTL:
                // every ovs-vsctl invocation gets a hard 30-second cap
                args = append([]string{"--timeout=30"}, args...)
...
        output, err := kcmd.CombinedOutput()
        if err != nil {
166:             glog.V(2).Infof("Error executing %s: %s", cmd, string(output))

man ovs-vsctl(8):
       --timeout=secs
              By default, or with a secs of 0, ovs-vsctl waits forever for a
              response from the database. This option limits runtime to
              approximately secs seconds. If the timeout expires, ovs-vsctl
              will exit with a SIGALRM signal. (A timeout would normally
              happen only if the database cannot be contacted, or if the
              system is overloaded.)
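
The other half of the story is ovs-vsctl's default wait behavior: without --no-wait it also waits for ovs-vswitchd to finish reconfiguring itself before exiting, which is exactly what hangs when vswitchd is busy spawning threads. A sketch of the two modes (the values are the ones from the sdn daemonset script, not from this cluster):

   # default: commit to ovsdb AND wait for ovs-vswitchd to apply the change;
   # if vswitchd is unresponsive, the 30s cap fires and the call dies on SIGALRM
   ovs-vsctl --timeout=30 set Open_vSwitch . other_config:n-handler-threads=10

   # --no-wait: commit to ovsdb only and return; safe even when vswitchd
   # has not been started yet
   ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads=10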

Comment 10 Weibin Liang 2018-09-28 19:03:49 UTC
--- ds	2018-09-28 15:00:12.647762202 -0400
+++ ds-new	2018-09-28 15:02:51.039364940 -0400
@@ -67,16 +67,17 @@
               exit 0
           }
           trap quit SIGTERM
-          /usr/share/openvswitch/scripts/ovs-ctl start --system-id=random
+          /usr/share/openvswitch/scripts/ovs-ctl start --no-ovs-vswitchd --system-id=random
 
           # Restrict the number of pthreads ovs-vswitchd creates to reduce the
           # amount of RSS it uses on hosts with many cores
           # https://bugzilla.redhat.com/show_bug.cgi?id=1571379
           # https://bugzilla.redhat.com/show_bug.cgi?id=1572797
           if [[ `nproc` -gt 12 ]]; then
-              ovs-vsctl set Open_vSwitch . other_config:n-revalidator-threads=4
-              ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=10
+              ovs-vsctl --no-wait set Open_vSwitch . other_config:n-revalidator-threads=4
+              ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads=10
           fi
+          /usr/share/openvswitch/scripts/ovs-ctl start --no-ovsdb-server --system-id=random
           while true; do sleep 5; done
         image: registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.11
         imagePullPolicy: IfNotPresent
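
Read together, the patched script runs in this order (a consolidated sketch of what the diff above produces; the ovs-ctl flags are the documented ones):

   # 1. start only ovsdb-server; no datapath threads exist yet
   /usr/share/openvswitch/scripts/ovs-ctl start --no-ovs-vswitchd --system-id=random

   # 2. program the thread limits; --no-wait because there is no vswitchd
   #    to wait for yet
   if [[ `nproc` -gt 12 ]]; then
       ovs-vsctl --no-wait set Open_vSwitch . other_config:n-revalidator-threads=4
       ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads=10
   fi

   # 3. start ovs-vswitchd; it reads the limits from the database at startup,
   #    so the race with thread creation is gone
   /usr/share/openvswitch/scripts/ovs-ctl start --no-ovsdb-server --system-id=random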

Comment 12 Weibin Liang 2018-09-28 19:23:46 UTC
@Walid,

Ben and Aaron (aconole) worked out the problem in my env: it is a race between ovs-vswitchd spawning its worker threads and the script telling OVS not to use too many cores. An ansible tweak will solve the problem (that's where the DaemonSet definition lives).

Comment 13 Ben Bennett 2018-09-28 19:49:04 UTC
Fix for master: https://github.com/openshift/openshift-ansible/pull/10269

Comment 14 Ben Bennett 2018-09-28 19:59:03 UTC
Made https://bugzilla.redhat.com/show_bug.cgi?id=1634151 for the 3.11.z fix.

Comment 15 openshift-github-bot 2018-09-29 16:47:31 UTC
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/5c003b15acf9cae3290dcd17ecda640aefeecdb6
Start only the ovsdb so we can add the config safely

The existing code starts both the ovsdb and the ovs-vswitchd and then
changes the database to limit the threads used.

Unfortunately, on very fast machines that have many cores, there is a
race between when the config change gets in to limit the threads used,
and when the threads get started.  If the config change loses the
race, then the threads aren't limited and the vswitchd becomes
unresponsive.

The fix is to start the db component, program it, then start the
vswitchd component.

Fixes bug 1633892 (https://bugzilla.redhat.com/show_bug.cgi?id=1633892)
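
Once the fix is in, the limits can be spot-checked on a node with more than 12 cores (a hedged check; the expected values are the ones programmed by the script):

   # should print the configured limits once vswitchd is up, e.g.
   # {n-handler-threads="10", n-revalidator-threads="4"}
   ovs-vsctl get Open_vSwitch . other_config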

Comment 16 Manoj Kumar 2018-10-01 16:08:46 UTC
Could I request that the fix for 1633892 be back-ported to 3.10 and made available on ppc64le as well?

(assuming that 1615375 is a duplicate)

https://bugzilla.redhat.com/show_bug.cgi?id=1615375

Comment 17 Dan Williams 2018-10-02 15:04:13 UTC
(In reply to Manoj Kumar from comment #16)
> Could I request that the fix for 1633892 be back-ported to 3.10 and made
> available on ppc64le as well?
> 
> (assuming that 1615375 is a duplicate)
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=1615375

3.10 backport is https://github.com/openshift/openshift-ansible/pull/10280 and was merged.

Comment 23 zhaozhanqi 2019-04-16 09:25:18 UTC
@Walid
Can you help verify this bug on the fixed version?

Comment 25 Walid A. 2019-04-17 13:50:20 UTC
Verified on OCP 3.11.106 on m5.12xlarge instances in AWS: 1 master, 1 infra, and 2 compute nodes.

component versions:

oc v3.11.106
kubernetes v1.11.0+d4cacc0
openshift v3.11.106
Container Runtime Version:         cri-o://1.11.13-1.rhaos3.11.gitfb88a9c.el7

# oc get pods --all-namespaces
NAMESPACE                           NAME                                                            READY     STATUS    RESTARTS   AGE
default                             docker-registry-1-qtddm                                         1/1       Running   0          19m
default                             registry-console-1-qwpgg                                        1/1       Running   0          19m
default                             router-1-wlrsn                                                  1/1       Running   0          19m
kube-service-catalog                apiserver-xr2gg                                                 1/1       Running   0          15m
kube-service-catalog                controller-manager-k6qwr                                        1/1       Running   3          15m
kube-system                         master-api-ip-172-31-61-61.us-west-2.compute.internal           1/1       Running   0          25m
kube-system                         master-controllers-ip-172-31-61-61.us-west-2.compute.internal   1/1       Running   0          24m
kube-system                         master-etcd-ip-172-31-61-61.us-west-2.compute.internal          1/1       Running   0          25m
openshift-ansible-service-broker    asb-1-487js                                                     1/1       Running   0          13m
openshift-console                   console-89747b7d8-x6q59                                         1/1       Running   0          16m
openshift-monitoring                alertmanager-main-0                                             3/3       Running   0          16m
openshift-monitoring                alertmanager-main-1                                             3/3       Running   0          16m
openshift-monitoring                alertmanager-main-2                                             3/3       Running   0          16m
openshift-monitoring                cluster-monitoring-operator-75c6b544dd-cck56                    1/1       Running   0          18m
openshift-monitoring                grafana-c7d5bc87c-xt9kq                                         2/2       Running   0          17m
openshift-monitoring                kube-state-metrics-6c64799586-f5nn5                             3/3       Running   0          15m
openshift-monitoring                node-exporter-74hcs                                             2/2       Running   0          15m
openshift-monitoring                node-exporter-bfvnf                                             2/2       Running   0          15m
openshift-monitoring                node-exporter-bq926                                             2/2       Running   0          15m
openshift-monitoring                node-exporter-x6hdp                                             2/2       Running   0          15m
openshift-monitoring                prometheus-k8s-0                                                4/4       Running   1          17m
openshift-monitoring                prometheus-k8s-1                                                4/4       Running   1          16m
openshift-monitoring                prometheus-operator-5b47ff445b-45cg6                            1/1       Running   0          18m
openshift-node                      sync-9xxfh                                                      1/1       Running   0          21m
openshift-node                      sync-dgm6t                                                      1/1       Running   0          21m
openshift-node                      sync-ppc76                                                      1/1       Running   0          22m
openshift-node                      sync-qtp94                                                      1/1       Running   0          21m
openshift-sdn                       ovs-b5sk9                                                       1/1       Running   0          21m
openshift-sdn                       ovs-csxh6                                                       1/1       Running   0          21m
openshift-sdn                       ovs-fq4zf                                                       1/1       Running   0          21m
openshift-sdn                       ovs-mmgds                                                       1/1       Running   0          21m
openshift-sdn                       sdn-6qwxm                                                       1/1       Running   0          21m
openshift-sdn                       sdn-7b8km                                                       1/1       Running   0          21m
openshift-sdn                       sdn-fhkfn                                                       1/1       Running   0          21m
openshift-sdn                       sdn-whpxs                                                       1/1       Running   0          21m
openshift-template-service-broker   apiserver-j26z6                                                 1/1       Running   0          13m
openshift-web-console               webconsole-787f54c7f8-c5zj5                                     1/1       Running   0          17m

Comment 27 errata-xmlrpc 2019-06-06 02:00:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0794

