Description of problem:

Install of OCP 3.11.15 and OCP 3.11.16 with the cri-o container runtime fails on large AWS instances such as m5.12xlarge or m4.16xlarge when using the openshift-ansible deploy_cluster.yml playbook. This is reproducible and does not happen on smaller instance types such as m5.xlarge.

The openshift-web-console pods are stuck in ContainerCreating because the openshift-sdn pod on the master node is in CrashLoopBackOff state and keeps restarting unsuccessfully.

Journal logs on the master show these errors:

Sep 27 17:49:41 ip-172-31-24-164.us-west-2.compute.internal atomic-openshift-node[31727]: E0927 17:49:41.232472 31727 pod_workers.go:186] Error syncing pod e92b4f2e-c25c-11e8-bc31-027fae47c8b8 ("sdn-7h9zs_openshift-sdn(e92b4f2e-c25c-11e8-bc31-027fae47c8b8)"), skipping: failed to "StartContainer" for "sdn" with CrashLoopBackOff: "Back-off 5m0s restarting failed container=sdn pod=sdn-7h9zs_openshift-sdn(e92b4f2e-c25c-11e8-bc31-027fae47c8b8)"
Sep 27 17:49:52 ip-172-31-24-164.us-west-2.compute.internal atomic-openshift-node[31727]: I0927 17:49:52.233750 31727 kuberuntime_manager.go:757] checking backoff for container "sdn" in pod "sdn-7h9zs_openshift-sdn(e92b4f2e-c25c-11e8-bc31-027fae47c8b8)"
Sep 27 17:49:52 ip-172-31-24-164.us-west-2.compute.internal atomic-openshift-node[31727]: I0927 17:49:52.233896 31727 kuberuntime_manager.go:767] Back-off 5m0s restarting failed container=sdn pod=sdn-7h9zs_openshift-sdn(e92b4f2e-c25c-11e8-bc31-027fae47c8b8)
. . .
27fae47c8b8_0(b8ac1c906eaf4956c69c00dd0646b81be6bc3f94f8fcc2ef5c474fb1d7ab2347): OpenShift SDN network process is not (yet?) available"
Sep 27 14:02:55 ip-172-31-24-164.us-west-2.compute.internal crio[3207]: time="2018-09-27 14:02:55.430034609Z" level=error msg="Error adding network: OpenShift SDN network process is not (yet?) available"
Sep 27 14:02:55 ip-172-31-24-164.us-west-2.compute.internal crio[3207]: time="2018-09-27 14:02:55.430067057Z" level=error msg="Error while adding to cni network: OpenShift SDN network process is not (yet?) available"
Sep 27 14:02:55 ip-172-31-24-164.us-west-2.compute.internal crio[3207]: time="2018-09-27 14:02:55.457593004Z" level=error msg="Error deleting network: failed to send CNI request: Post http://dummy/: dial unix /var/run/openshift-sdn/cni-server.sock: connect: no such file or directory"
Sep 27 14:02:55 ip-172-31-24-164.us-west-2.compute.internal atomic-openshift-node[31727]: E0927 14:02:55.482319 31727 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_webconsole-78fd5bb8cf-tt7hj_openshift-web-console_2d07195a-c242-11e8-bc31-027fae47c8b8_0(52cb35e2de1c3e3c69b86344500ffcc219622d69b05c13b91cd1682b91d164b3): OpenShift SDN network process is not (yet?) available

Version-Release number of the following components:
openshift v3.11.16
kubernetes v1.11.0+d4cacc0
openshift-ansible rpm not installed, running from a git clone of the release-3.11 branch:
commit 5067b5eaacd7a69e5d77a264803f48939a5c144f
Merge: 4ac6f81 03a2257

rpm -q ansible: ansible-2.6.4-1.el7ae.noarch
ansible --version:
ansible 2.6.4
  config file = /root/openshift-ansible/ansible.cfg
  configured module search path = [u'/root/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.5 (default, Sep 12 2018, 05:31:16) [GCC 4.8.5 20150623 (Red Hat 4.8.5-36)]
root@ip-172-31-24-164: ~/openshift-ansible #

How reproducible:
Reproducible on m4.12xlarge, m5.24xlarge, and m4.16xlarge AWS instance types.

Steps to Reproduce:
1. On a Jenkins slave, git clone the latest release-3.11 branch of openshift-ansible, commit id:
   commit 5067b5eaacd7a69e5d77a264803f48939a5c144f
   Merge: 4ac6f81 03a2257
   Author: OpenShift Merge Robot <openshift-merge-robot.github.com>
   Date: Wed Sep 26 10:41:28 2018 -0700
       Merge pull request #10236 from openshift-cherrypick-robot/cherry-pick-10231-to-release-3.11
2. Run the openshift-ansible/playbooks/prerequisites.yml playbook:
   cd openshift-ansible/playbooks/
   ansible-playbook -i flexy_inventory prerequisites.yml
3. Deploy an OCP 3.11.16 environment with the cri-o container runtime using the openshift-ansible deploy_cluster.yml playbook:
   ansible-playbook -i flexy_inventory deploy_cluster.yml
4. The install fails at task [openshift_web_console : Get console pod logs]. ssh to the master node; oc get pods -n openshift-sdn -o yaml shows the sdn pod on the master node in CrashLoopBackOff state.

Actual results:
The playbook fails at task:

TASK [openshift_web_console : Get console pod logs] ****************************
Thursday 27 September 2018  18:54:52 +0800 (0:00:00.107)       0:32:55.187 ****
fatal: [ec2-18-236-162-83.us-west-2.compute.amazonaws.com]: FAILED! => {"changed": true, "cmd": ["oc", "logs", "deployment/webconsole", "--tail=50", "--config=/etc/origin/master/admin.kubeconfig", "-n", "openshift-web-console"], "delta": "0:00:00.221403", "end": "2018-09-27 10:54:56.920797", "msg": "non-zero return code", "rc": 1, "start": "2018-09-27 10:54:56.699394", "stderr": "Error from server (BadRequest): container \"webconsole\" in pod \"webconsole-78fd5bb8cf-tt7hj\" is waiting to start: ContainerCreating", "stderr_lines": ["Error from server (BadRequest): container \"webconsole\" in pod \"webconsole-78fd5bb8cf-tt7hj\" is waiting to start: ContainerCreating"], "stdout": "", "stdout_lines": []}
...ignoring

TASK [openshift_web_console : debug] *******************************************
Thursday 27 September 2018  18:54:53 +0800 (0:00:01.039)       0:32:56.226 ****
ok: [ec2-18-236-162-83.us-west-2.compute.amazonaws.com] => {
    "msg": []
}

TASK [openshift_web_console : Report console errors] ***************************
Thursday 27 September 2018  18:54:53 +0800 (0:00:00.102)       0:32:56.329 ****
fatal: [ec2-18-236-162-83.us-west-2.compute.amazonaws.com]: FAILED! => {"changed": false, "msg": "Console install failed."}
        to retry, use: --limit @/home/slave4/workspace/Launch Environment Flexy/private-openshift-ansible/playbooks/deploy_cluster.retry

PLAY RECAP *********************************************************************
ec2-18-236-162-83.us-west-2.compute.amazonaws.com : ok=641  changed=274  unreachable=0  failed=1
ec2-34-217-148-80.us-west-2.compute.amazonaws.com : ok=129  changed=61   unreachable=0  failed=0
ec2-35-164-234-63.us-west-2.compute.amazonaws.com : ok=129  changed=61   unreachable=0  failed=0
ec2-54-202-53-94.us-west-2.compute.amazonaws.com  : ok=129  changed=61   unreachable=0  failed=0
localhost                                         : ok=11   changed=0    unreachable=0  failed=0

INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:00:49)
Health Check                : Complete (0:00:03)
Node Bootstrap Preparation  : Complete (0:03:33)
etcd Install                : Complete (0:01:46)
Master Install              : Complete (0:08:44)
Master Additional Install   : Complete (0:01:19)
Node Join                   : Complete (0:01:32)
Hosted Install              : Complete (0:01:33)
Cluster Monitoring Operator : Complete (0:01:37)
Web Console Install         : In Progress (0:11:45)
        This phase can be restarted by running: playbooks/openshift-web-console/config.yml
Thursday 27 September 2018  18:54:53 +0800 (0:00:00.072)       0:32:56.401 ****
===============================================================================
openshift_web_console : Verify that the console is running ------------ 680.79s
openshift_control_plane : Wait for control plane pods to appear ------- 110.86s
openshift_manage_node : Wait for sync DS to set annotations on all nodes -- 69.01s
openshift_cluster_monitoring_operator : Wait for the ServiceMonitor CRD to be created -- 67.02s
openshift_control_plane : Wait for all control plane pods to become ready -- 58.87s
openshift_node_group : Wait for the sync daemonset to become ready and available -- 24.48s
openshift_sdn : Copy templates to temp directory ----------------------- 12.63s
openshift_node : Install node, clients, and conntrack packages --------- 12.48s
openshift_node_group : Copy templates to temp directory ---------------- 12.04s
tuned : Ensure files are populated from templates ---------------------- 11.87s
Run variable sanity checks --------------------------------------------- 11.53s
openshift_node : Install NFS storage plugin dependencies --------------- 11.45s
openshift_manageiq : Configure role/user permissions ------------------- 11.02s
openshift_control_plane : Wait for APIs to become available ------------ 10.72s
etcd : Install etcd ---------------------------------------------------- 10.14s
openshift_node : install needed rpm(s) ---------------------------------- 9.54s
openshift_control_plane : Copy static master scripts -------------------- 8.98s
tuned : Ensure files are populated from templates ----------------------- 8.65s
openshift_control_plane : Prepare master static pods -------------------- 8.47s
openshift_hosted : Create OpenShift router ------------------------------ 8.06s

Expected results:
Successful install

Additional info:
Links to logs from ansible-playbook, oc get events, oc logs from failed pods, and journal are in the next attachment.
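To confirm the failure state on the master while the install hangs, something like the following can be used (a rough sketch; the pod names are specific to this run, and the journal units assume the usual 3.11 service names atomic-openshift-node and crio seen in the logs above):

# Web console pod stuck in ContainerCreating
oc get pods -n openshift-web-console

# SDN pod on the master in CrashLoopBackOff, restarting repeatedly
oc get pods -n openshift-sdn -o wide

# Node and cri-o journals show the "OpenShift SDN network process is not (yet?) available" errors
journalctl -u atomic-openshift-node --since "1 hour ago" | grep -i sdn
journalctl -u crio --since "1 hour ago" | grep -i "adding network"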
Hit the same issue on GCP, machine type n1-standard-64, OCP v3.11.16.
Dan (dcbw), can you take a look?
From the logs, it looks like openvswitch is hanging / timing out / on a trip to mars. Adding aconole to CC list. Aaron, if we get a machine in this state, could you help us diagnose this?
Can we get the OVS logs from a failed install?
Looks like ovs-vsctl timed out (maybe after 30 sec).

./pkg/util/ovs/ovs.go +166

func (ovsif *ovsExec) execWithStdin(cmd string, stdinArgs []string, args ...string) (string, error) {
	...
	switch cmd {
	case OVS_VSCTL:
		args = append([]string{"--timeout=30"}, args...)
	...
	output, err := kcmd.CombinedOutput()
	if err != nil {
166:		glog.V(2).Infof("Error executing %s: %s", cmd, string(output))

man ovs-vsctl(8):

--timeout=secs
    By default, or with a secs of 0, ovs-vsctl waits forever for a response from
    the database. This option limits runtime to approximately secs seconds. If
    the timeout expires, ovs-vsctl will exit with a SIGALRM signal. (A timeout
    would normally happen only if the database cannot be contacted, or if the
    system is overloaded.)
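To see that timeout in isolation, a quick sketch (run on a node where the OVS database is unresponsive; bash normally reports 128 + SIGALRM = 142 for a command killed by the alarm, so the exact status is an assumption about the shell):

# Same style of call the SDN pod makes, with the same 30s cap it prepends:
ovs-vsctl --timeout=30 set Open_vSwitch . other_config:n-handler-threads=10
echo $?   # 0 if the database answered in time; 142 (SIGALRM) if it never did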
--- ds	2018-09-28 15:00:12.647762202 -0400
+++ ds-new	2018-09-28 15:02:51.039364940 -0400
@@ -67,16 +67,17 @@
         exit 0
       }
       trap quit SIGTERM
-      /usr/share/openvswitch/scripts/ovs-ctl start --system-id=random
+      /usr/share/openvswitch/scripts/ovs-ctl start --no-ovs-vswitchd --system-id=random
       # Restrict the number of pthreads ovs-vswitchd creates to reduce the
       # amount of RSS it uses on hosts with many cores
       # https://bugzilla.redhat.com/show_bug.cgi?id=1571379
       # https://bugzilla.redhat.com/show_bug.cgi?id=1572797
       if [[ `nproc` -gt 12 ]]; then
-          ovs-vsctl set Open_vSwitch . other_config:n-revalidator-threads=4
-          ovs-vsctl set Open_vSwitch . other_config:n-handler-threads=10
+          ovs-vsctl --no-wait set Open_vSwitch . other_config:n-revalidator-threads=4
+          ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads=10
       fi
+      /usr/share/openvswitch/scripts/ovs-ctl start --no-ovsdb-server --system-id=random
       while true; do sleep 5; done
     image: registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.11
     imagePullPolicy: IfNotPresent
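With that diff applied, the startup portion of the ovs container's script ends up roughly like this (a sketch assembled from the diff above, not the verbatim template):

# 1. Start only ovsdb-server so the database can be programmed first
/usr/share/openvswitch/scripts/ovs-ctl start --no-ovs-vswitchd --system-id=random

# 2. Write the thread limits into the database; --no-wait because vswitchd
#    is not running yet, so there is nothing to wait on
if [[ `nproc` -gt 12 ]]; then
    ovs-vsctl --no-wait set Open_vSwitch . other_config:n-revalidator-threads=4
    ovs-vsctl --no-wait set Open_vSwitch . other_config:n-handler-threads=10
fi

# 3. Only now start ovs-vswitchd, which reads the limits at startup
#    instead of racing against them
/usr/share/openvswitch/scripts/ovs-ctl start --no-ovsdb-server --system-id=random

while true; do sleep 5; done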
@Walid, Ben and Aaron (aconole) worked out the problem in my env: it's a race between ovs-vswitchd starting its worker threads and us telling OVS not to use too many cores. An ansible tweak will solve the problem (that's where the dc lives).
Fix for master: https://github.com/openshift/openshift-ansible/pull/10269
Made https://bugzilla.redhat.com/show_bug.cgi?id=1634151 for the 3.11.z fix.
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/5c003b15acf9cae3290dcd17ecda640aefeecdb6

Start only the ovsdb so we can add the config safely

The existing code starts both the ovsdb and the ovs-vswitchd and then changes
the database to limit the threads used. Unfortunately, on very fast machines
that have many cores, there is a race between when the config change gets in
to limit the threads used, and when the threads get started. If the config
change loses the race, then the threads aren't limited and the vswitchd
becomes unresponsive.

The fix is to start the db component, program it, then start the vswitchd
component.

Fixes bug 1633892 (https://bugzilla.redhat.com/show_bug.cgi?id=1633892)
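For anyone checking an unpatched cluster, a rough way to see whether the limits actually made it in before vswitchd spawned its threads (a sketch; assumes you have the OVS tools and procps ps available where ovs-vswitchd runs):

# The limits should be present in the database...
ovs-vsctl get Open_vSwitch . other_config:n-revalidator-threads
ovs-vsctl get Open_vSwitch . other_config:n-handler-threads

# ...and ovs-vswitchd should not be running roughly one worker thread per
# core; a thread count close to the core count suggests the config change
# lost the race.
ps -o nlwp= -p "$(pidof ovs-vswitchd)"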
Could I request that the fix for 1633892 be back-ported to 3.10 and made available on ppc64le as well? (assuming that 1615375 is a duplicate) https://bugzilla.redhat.com/show_bug.cgi?id=1615375
(In reply to Manoj Kumar from comment #16)
> Could I request that the fix for 1633892 be back-ported to 3.10 and made
> available on ppc64le as well?
>
> (assuming that 1615375 is a duplicate)
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1615375

The 3.10 backport is https://github.com/openshift/openshift-ansible/pull/10280 and was merged.
@Walid, can you try this on the fixed version and verify the bug?
Verified on OCP 3.11.106, m5.12xlarge instances in AWS: 1 master, 1 infra, and 2 compute nodes.

Component versions:
oc v3.11.106
kubernetes v1.11.0+d4cacc0
openshift v3.11.106
Container Runtime Version: cri-o://1.11.13-1.rhaos3.11.gitfb88a9c.el7

# oc get pods --all-namespaces
NAMESPACE                           NAME                                                            READY   STATUS    RESTARTS   AGE
default                             docker-registry-1-qtddm                                         1/1     Running   0          19m
default                             registry-console-1-qwpgg                                        1/1     Running   0          19m
default                             router-1-wlrsn                                                  1/1     Running   0          19m
kube-service-catalog                apiserver-xr2gg                                                 1/1     Running   0          15m
kube-service-catalog                controller-manager-k6qwr                                        1/1     Running   3          15m
kube-system                         master-api-ip-172-31-61-61.us-west-2.compute.internal          1/1     Running   0          25m
kube-system                         master-controllers-ip-172-31-61-61.us-west-2.compute.internal  1/1     Running   0          24m
kube-system                         master-etcd-ip-172-31-61-61.us-west-2.compute.internal         1/1     Running   0          25m
openshift-ansible-service-broker    asb-1-487js                                                     1/1     Running   0          13m
openshift-console                   console-89747b7d8-x6q59                                         1/1     Running   0          16m
openshift-monitoring                alertmanager-main-0                                             3/3     Running   0          16m
openshift-monitoring                alertmanager-main-1                                             3/3     Running   0          16m
openshift-monitoring                alertmanager-main-2                                             3/3     Running   0          16m
openshift-monitoring                cluster-monitoring-operator-75c6b544dd-cck56                    1/1     Running   0          18m
openshift-monitoring                grafana-c7d5bc87c-xt9kq                                         2/2     Running   0          17m
openshift-monitoring                kube-state-metrics-6c64799586-f5nn5                             3/3     Running   0          15m
openshift-monitoring                node-exporter-74hcs                                             2/2     Running   0          15m
openshift-monitoring                node-exporter-bfvnf                                             2/2     Running   0          15m
openshift-monitoring                node-exporter-bq926                                             2/2     Running   0          15m
openshift-monitoring                node-exporter-x6hdp                                             2/2     Running   0          15m
openshift-monitoring                prometheus-k8s-0                                                4/4     Running   1          17m
openshift-monitoring                prometheus-k8s-1                                                4/4     Running   1          16m
openshift-monitoring                prometheus-operator-5b47ff445b-45cg6                            1/1     Running   0          18m
openshift-node                      sync-9xxfh                                                      1/1     Running   0          21m
openshift-node                      sync-dgm6t                                                      1/1     Running   0          21m
openshift-node                      sync-ppc76                                                      1/1     Running   0          22m
openshift-node                      sync-qtp94                                                      1/1     Running   0          21m
openshift-sdn                       ovs-b5sk9                                                       1/1     Running   0          21m
openshift-sdn                       ovs-csxh6                                                       1/1     Running   0          21m
openshift-sdn                       ovs-fq4zf                                                       1/1     Running   0          21m
openshift-sdn                       ovs-mmgds                                                       1/1     Running   0          21m
openshift-sdn                       sdn-6qwxm                                                       1/1     Running   0          21m
openshift-sdn                       sdn-7b8km                                                       1/1     Running   0          21m
openshift-sdn                       sdn-fhkfn                                                       1/1     Running   0          21m
openshift-sdn                       sdn-whpxs                                                       1/1     Running   0          21m
openshift-template-service-broker   apiserver-j26z6                                                 1/1     Running   0          13m
openshift-web-console               webconsole-787f54c7f8-c5zj5                                     1/1     Running   0          17m
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2019:0794