Bug 1726450 - Installation fails - control plane pods don't come up
Summary: Installation fails - control plane pods don't come up
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Russell Teague
QA Contact: liujia
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-07-02 21:21 UTC by Benjamin Milne
Modified: 2019-11-22 02:00 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The timeout for waiting for cluster services to start up could be too short for some environments. Consequence: The cluster deploy may fail prematurely, even though cluster services are found to be normal upon investigation. Fix: Increase the timeout from 5 minutes to 6 minutes. Result: Cluster service availability checks pass normally.
Clone Of:
Environment:
Last Closed: 2019-11-18 14:52:08 UTC
Target Upstream Version:
Embargoed:
bmilne: needinfo-




Links
- Github: openshift/openshift-ansible pull 11963 (closed), "Bug 1726450: Increase timeout for control plane pod checks", last updated 2020-11-06 04:38:36 UTC
- Red Hat Product Errata: RHBA-2019:3817, last updated 2019-11-18 14:52:18 UTC

Description Benjamin Milne 2019-07-02 21:21:01 UTC
During an install of the latest 3.11, the task 'openshift_control_plane : Report control plane errors' fails because the control plane pods did not come up[1]. The openshift-sdn namespace does not exist, and we see the node service fail to come up due to CNI issues[2]. Looking in the api/controller logs (sosreport-ip-10-31-217-33-02416046-2019-06-28-jbriqyw/sos_commands/origin/*), we do not see much indicating an unhealthy controller or API.
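As a quick way to confirm those symptoms on a failing master, an ad-hoc playbook along the following lines could be used. This is not part of openshift-ansible; the kubeconfig path and the atomic-openshift-node unit name assume the 3.11 enterprise defaults.

~~~~
# Hypothetical ad-hoc diagnostic playbook, not part of the installer.
# Checks whether the openshift-sdn namespace exists and dumps recent node logs.
- hosts: masters
  gather_facts: false
  tasks:
  - name: Check for the openshift-sdn namespace
    command: >
      oc get namespace openshift-sdn
      --config=/etc/origin/master/admin.kubeconfig
    register: sdn_ns
    ignore_errors: true
  - debug:
      msg: "{{ sdn_ns.stdout_lines + sdn_ns.stderr_lines }}"
  - name: Collect recent node service logs (the CNI errors show up here)
    command: journalctl --no-pager -n 50 -u atomic-openshift-node
    register: node_logs
    ignore_errors: true
  - debug:
      msg: "{{ node_logs.stdout_lines }}"
~~~~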

Version:
atomic-openshift-3.11.98-1

Steps to reproduce:
1. Run the installer as described in https://docs.openshift.com/container-platform/3.11/install/running_install.html#running-the-advanced-installation-rpm
2. The playbook fails in the task 'openshift_control_plane : Report control plane errors'.

Expected Results:
Cluster is installed with a healthy control plane.
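For reference, the advanced RPM install is driven by an Ansible inventory; the hosts placed in the [etcd] group are what ultimately feed the groups['oo_etcd_to_config'] reference in the wait task shown below. A minimal, hypothetical 3.11 inventory (hostnames and values are placeholders, not taken from this environment) looks roughly like this:

~~~~
# Hypothetical minimal inventory for an openshift-ansible 3.11 RPM install.
# Hostnames and variable values are examples only.
[OSEv3:children]
masters
nodes
etcd

[OSEv3:vars]
ansible_user=root
openshift_deployment_type=openshift-enterprise
openshift_release="3.11"

[masters]
master1.example.com

[etcd]
master1.example.com

[nodes]
master1.example.com openshift_node_group_name='node-config-master'
infra1.example.com  openshift_node_group_name='node-config-infra'
node1.example.com   openshift_node_group_name='node-config-compute'
~~~~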

https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_control_plane/tasks/main.yml#L268
~~~~
- name: Wait for control plane pods to appear
  oc_obj:
    state: list
    kind: pod
    name: "master-{{ item }}-{{ l_kubelet_node_name | lower }}"
    namespace: kube-system
  register: control_plane_pods
  until:
  - control_plane_pods.module_results is defined
  - control_plane_pods.module_results.results is defined
  - control_plane_pods.module_results.results | length > 0
  retries: 60
  delay: 5
  with_items:
  - "{{ 'etcd' if inventory_hostname in groups['oo_etcd_to_config'] else omit }}"
  - api
  - controllers
  ignore_errors: true

- when: control_plane_pods is failed
  block:
  - name: Check status in the kube-system namespace
    command: >
      {{ openshift_client_binary }} status --config={{ openshift.common.config_base }}/master/admin.kubeconfig -n kube-system
    register: control_plane_status
    ignore_errors: true
  - debug:
      msg: "{{ control_plane_status.stdout_lines }}"
  - name: Get pods in the kube-system namespace
    command: >
      {{ openshift_client_binary }} get pods --config={{ openshift.common.config_base }}/master/admin.kubeconfig -n kube-system -o wide
    register: control_plane_pods_list
    ignore_errors: true
  - debug:
      msg: "{{ control_plane_pods_list.stdout_lines }}"
  - name: Get events in the kube-system namespace
    command: >
      {{ openshift_client_binary }} get events --config={{ openshift.common.config_base }}/master/admin.kubeconfig -n kube-system
    register: control_plane_events
    ignore_errors: true
  - debug:
      msg: "{{ control_plane_events.stdout_lines }}"
  - name: Get node logs
    command: journalctl --no-pager -n 300 -u {{ openshift_service_type }}-node
    register: logs_node
    ignore_errors: true
  - debug:
      msg: "{{ logs_node.stdout_lines }}"
  - name: Report control plane errors
    fail:
      msg: Control plane pods didn't come up
~~~~
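The Doc Text above describes the eventual fix as raising this wait from 5 minutes to 6 minutes. With the task as written (delay: 5 seconds), 60 retries works out to roughly 5 minutes, so a 6-minute wait corresponds to something like 72 retries at the same delay. The sketch below only illustrates that arithmetic; it is not the literal diff from PR 11963, which may adjust the numbers differently.

~~~~
# Illustrative arithmetic only: how the until/retries/delay loop maps to a
# total timeout. total wait ~= retries * delay
#   60 retries * 5 s = 300 s (~5 minutes, the behavior reported here)
#   72 retries * 5 s = 360 s (~6 minutes, one way to reach the documented fix)
  retries: 72   # assumed value, not necessarily what PR 11963 merged
  delay: 5
~~~~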

===============================================================
[1]
~~~~
TASK [openshift_control_plane : Report control plane errors] *******************
task path: /usr/share/ansible/openshift-ansible/roles/openshift_control_plane/tasks/main.yml:256
fatal: [ip-10-31-217-33.ec2.internal]: FAILED! => {
    "changed": false, 
    "msg": "Control plane pods didn't come up"
}
fatal: [ip-10-31-217-81.ec2.internal]: FAILED! => {
    "changed": false, 
    "msg": "Control plane pods didn't come up"
}
fatal: [ip-10-31-217-145.ec2.internal]: FAILED! => {
    "changed": false, 
    "msg": "Control plane pods didn't come up"
}
..8<
PLAY RECAP *********************************************************************
ip-10-31-217-100.ec2.internal : ok=116  changed=63   unreachable=0    failed=0   
ip-10-31-217-145.ec2.internal : ok=284  changed=148  unreachable=0    failed=1   
ip-10-31-217-185.ec2.internal : ok=116  changed=63   unreachable=0    failed=0   
ip-10-31-217-32.ec2.internal : ok=116  changed=63   unreachable=0    failed=0   
ip-10-31-217-33.ec2.internal : ok=343  changed=165  unreachable=0    failed=1   
ip-10-31-217-56.ec2.internal : ok=116  changed=63   unreachable=0    failed=0   
ip-10-31-217-81.ec2.internal : ok=284  changed=148  unreachable=0    failed=1   
ip-10-31-217-96.ec2.internal : ok=116  changed=63   unreachable=0    failed=0   
localhost                  : ok=11   changed=0    unreachable=0    failed=0   


INSTALLER STATUS ***************************************************************
Initialization              : Complete (0:05:54)
Health Check                : Complete (0:01:18)
Node Bootstrap Preparation  : Complete (0:23:05)
etcd Install                : Complete (0:04:37)
Master Install              : In Progress (0:14:36)
        This phase can be restarted by running: playbooks/openshift-master/config.yml


Failure summary:


  1. Hosts:    ip-10-31-217-145.ec2.internal, ip-10-31-217-33.ec2.internal, ip-10-31-217-81.ec2.internal
     Play:     Configure masters
     Task:     Report control plane errors
     Message:  Control plane pods didn't come up
~~~~

[2]
Jun 28 17:08:11 ip-10-31-217-33.ec2.internal atomic-openshift-node[20765]: E0628 17:08:11.624084   20765 kubelet.go:2101] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized

Comment 11 liujia 2019-11-04 04:21:42 UTC
Cannot reproduce on openshift-ansible-3.11.98-1.git.0.3cfa7c3.el7.noarch.rpm.

The installation succeeded.
...
PLAY [Disable excluders and gather facts] **************************************
PLAY [Create OpenShift certificates for master hosts] **************************
PLAY [Generate or retrieve existing session secrets] ***************************
PLAY [Configure masters] *******************************************************
PLAY [Deploy the central bootstrap configuration] ******************************
PLAY [Ensure inventory labels are assigned to masters] *************************
...
And the task [openshift_control_plane : Report control plane errors] was skipped without errors.

Comment 13 liujia 2019-11-06 08:14:20 UTC
Verified on openshift-ansible-3.11.154-1.git.0.7a11cbe.el7.noarch.rpm

Installation succeeded, with the task [openshift_control_plane : Report control plane errors] skipped without errors.

Comment 15 errata-xmlrpc 2019-11-18 14:52:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3817

