Bug 1579676 - Install failed at TASK [openshift_node_group : Ensure project exists] sometimes
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 3.10.0
Assignee: Vadim Rutkovsky
QA Contact: Weihua Meng
URL:
Whiteboard:
Duplicates: 1573604
Depends On:
Blocks:
 
Reported: 2018-05-18 06:36 UTC by Weihua Meng
Modified: 2018-09-14 08:09 UTC
CC: 11 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:16:09 UTC
Target Upstream Version:


Attachments
problem-node (121.85 KB, application/x-gzip), 2018-05-24 16:20 UTC, Michael Gugino
problem-node-second-run (173.77 KB, application/x-gzip), 2018-05-24 17:18 UTC, Michael Gugino
mirror-pods.log (5.55 KB, text/plain), 2018-05-25 20:02 UTC, Seth Jennings
no-istag-journal.log (184.61 KB, text/x-vhdl), 2018-05-25 22:09 UTC, Vadim Rutkovsky


Links
Red Hat Product Errata RHBA-2018:1816 (last updated 2018-07-30 19:16:29 UTC)

Description Weihua Meng 2018-05-18 06:36:52 UTC
Description of problem:
In my opinion, this task should not fail whether or not the project already exists.

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_node_group/tasks/sync.yml
- name: Ensure project exists
  oc_project:
    name: openshift-node
    state: present
    node_selector:
      - ""


Version-Release number of the following components:
openshift-ansible-3.10.0-0.47.0

How reproducible:
sometimes (20%)

Steps to Reproduce:
1. Install OCP by openshift-ansible

Actual results:
TASK [calico_master : Calico Master | Delete temp directory] *******************
Thursday 17 May 2018  21:09:49 -0400 (0:00:00.032)       0:09:58.355 ********** 
skipping: [ec2-107-20-100-122.compute-1.amazonaws.com] => {"changed": false, "skip_reason": "Conditional result was False", "skipped": true}

TASK [openshift_node_group : Ensure project exists] ****************************
Thursday 17 May 2018  21:09:49 -0400 (0:00:00.034)       0:09:58.389 ********** 

fatal: [ec2-107-20-100-122.compute-1.amazonaws.com]: FAILED! => {"changed": false, "failed": true, "msg": {"cmd": "/usr/bin/oc adm new-project openshift-node --admin-role=admin --node-selector=", "results": {}, "returncode": 1, "stderr": "error: project openshift-node already exists\n", "stdout": ""}}

NO MORE HOSTS LEFT *************************************************************
	to retry, use: --limit @/home/slave3/workspace/Launch Environment Flexy/private-openshift-ansible/playbooks/deploy_cluster.retry


Expected results:
Should not fail and exit

Additional info:
For comparison, the same task from a successful installation with identical parameters:

TASK [openshift_node_group : Ensure project exists] ****************************
Thursday 17 May 2018  22:20:10 -0400 (0:00:00.026)       0:10:06.507 ********** 

changed: [ec2-75-101-238-53.compute-1.amazonaws.com] => {"changed": true, "failed": false, "results": {"cmd": "/usr/bin/oc get namespace openshift-node -o json", "results": {"apiVersion": "v1", "kind": "Namespace", "metadata": {"annotations": {"openshift.io/node-selector": "", "openshift.io/sa.scc.mcs": "s0:c7,c4", "openshift.io/sa.scc.supplemental-groups": "1000050000/10000", "openshift.io/sa.scc.uid-range": "1000050000/10000"}, "creationTimestamp": "2018-05-18T02:20:34Z", "name": "openshift-node", "resourceVersion": "900", "selfLink": "/api/v1/namespaces/openshift-node", "uid": "0bed028f-5a42-11e8-b643-0eaa25410cfa"}, "spec": {"finalizers": ["kubernetes"]}, "status": {"phase": "Active"}}, "returncode": 0}, "state": "present"}

Comment 2 Johnny Liu 2018-05-21 03:12:41 UTC
I also hit a similar issue:
TASK [openshift_sdn : Ensure project exists] ***********************************
Sunday 20 May 2018  20:25:58 -0400 (0:00:14.700)       0:14:25.866 ************ 
fatal: [ec2-35-168-12-131.compute-1.amazonaws.com]: FAILED! => {"changed": false, "failed": true, "msg": {"cmd": "/usr/bin/oc adm new-project openshift-sdn --admin-role=admin --node-selector=", "results": {}, "returncode": 1, "stderr": "Error from server: timeout\nError from server: etcdserver: request timed out\n", "stdout": "Created project openshift-sdn\nCould not add service accounts to the system:image-puller role: timeout\nCould not add service accounts to the system:image-builder role: etcdserver: request timed out\n"}}

After that failure, I logged into the master and ran the same command:
# /usr/bin/oc adm new-project openshift-sdn --admin-role=admin --node-selector=
error: project openshift-sdn already exists

Seems like the project was actually created successfully.

Comment 3 Scott Dodson 2018-05-24 13:24:54 UTC
Can we please get `journalctl --no-pager -u atomic-openshift-node` from all master hosts? Can we also get `master-logs etcd` from all master hosts, or, if etcd is still running as a systemd service, `journalctl --no-pager -u etcd`?

Our suspicion is that etcd is either performing poorly or being restarted unexpectedly for some reason.

Comment 4 Michael Gugino 2018-05-24 16:09:49 UTC
Hit this same problem on another step:

TASK [openshift_manageiq : Add Management Infrastructure project] ********************************************************************************
task path: /home/mgugino/git/openshift-ansible/roles/openshift_manageiq/tasks/main.yaml:2
Thursday 24 May 2018  11:52:41 -0400 (0:00:00.441)       0:11:25.343 ********** 
Using module file /home/mgugino/git/openshift-ansible/roles/lib_openshift/library/oc_project.py
<ec2-35-153-157-166.compute-1.amazonaws.com> ESTABLISH SSH CONNECTION FOR USER: ec2-user
<ec2-35-153-157-166.compute-1.amazonaws.com> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=ec2-user -o ConnectTimeout=30 -o ControlPath=/home/mgugino/.ansible/cp/%h-%r ec2-35-153-157-166.compute-1.amazonaws.com '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-fsjtelkxmmxrzjzamcksflrkodldukzc; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<ec2-35-153-157-166.compute-1.amazonaws.com> (1, b'\n{"msg": {"returncode": 1, "cmd": "/bin/oc adm new-project management-infra --admin-role=admin --description=Management Infrastructure", "results": {}, "stderr": "Error from server: etcdserver: request timed out\\n", "stdout": ""}, "failed": true, "exception": "  File \\"/tmp/ansible_ie_tR8/ansible_module_oc_project.py\\", line 47, in <module>\\n    import ruamel.yaml as yaml\\n", "invocation": {"module_args": {"admin_role": "admin", "display_name": null, "name": "management-infra", "admin": null, "kubeconfig": "/etc/origin/master/admin.kubeconfig", "state": "present", "node_selector": null, "debug": false, "description": "Management Infrastructure"}}}\n', b'')
The full traceback is:
  File "/tmp/ansible_ie_tR8/ansible_module_oc_project.py", line 47, in <module>
    import ruamel.yaml as yaml

fatal: [ec2-35-153-157-166.compute-1.amazonaws.com]: FAILED! => {
    "changed": false,
    "failed": true,
    "invocation": {
        "module_args": {
            "admin": null,
            "admin_role": "admin",
            "debug": false,
            "description": "Management Infrastructure",
            "display_name": null,
            "kubeconfig": "/etc/origin/master/admin.kubeconfig",
            "name": "management-infra",
            "node_selector": null,
            "state": "present"
        }
    },
    "msg": {
        "cmd": "/bin/oc adm new-project management-infra --admin-role=admin --description=Management Infrastructure",
        "results": {},
        "returncode": 1,
        "stderr": "Error from server: etcdserver: request timed out\n",
        "stdout": ""
    }
}

Comment 5 Michael Gugino 2018-05-24 16:20:30 UTC
Created attachment 1441135 [details]
problem-node

All in one master

Comment 6 Michael Gugino 2018-05-24 16:22:20 UTC
This appears to be due to the master API and possibly other pods being killed and restarted for some reason.

Comment 7 Michael Gugino 2018-05-24 17:17:02 UTC
After some time (almost an hour), I hit the same error creating a new project, this time for openshift_node_group, on the same cluster after re-running the installer.

Comment 8 Michael Gugino 2018-05-24 17:18:15 UTC
Created attachment 1441171 [details]
problem-node-second-run

Same node as before; rerunning the installer results in the same failure.

Comment 9 Scott Dodson 2018-05-25 19:27:15 UTC
Looking through the logs from comment 5, we see the sync pod kick in at 11:52:19:

May 24 11:52:19 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: I0524 11:52:19.798337   25720 kuberuntime_manager.go:513] Container {Name:sync Image:registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10 Command:[/bin/bash -c #!/bin/bash

OVS kernel messages appear and NetworkManager detects some changes to networking; maybe this disrupts etcd and apiserver communication?

May 24 11:52:25 ip-172-18-6-223.ec2.internal kernel: device ovs-system entered promiscuous mode
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info>  [1527177145.6665] manager: (ovs-system): new Generic device (/org/freedesktop/NetworkManager/Devices/4)
May 24 11:52:25 ip-172-18-6-223.ec2.internal kernel: device br0 entered promiscuous mode
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info>  [1527177145.7064] manager: (br0): new Generic device (/org/freedesktop/NetworkManager/Devices/5)
May 24 11:52:25 ip-172-18-6-223.ec2.internal kernel: device vxlan_sys_4789 entered promiscuous mode
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info>  [1527177145.8139] manager: (vxlan_sys_4789): new Vxlan device (/org/freedesktop/NetworkManager/Devices/6)
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info>  [1527177145.8151] device (vxlan_sys_4789): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info>  [1527177145.8162] device (vxlan_sys_4789): enslaved to non-master-type device ovs-system; ignoring
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info>  [1527177145.8169] device (vxlan_sys_4789): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')
May 24 11:52:25 ip-172-18-6-223.ec2.internal kernel: device tun0 entered promiscuous mode
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info>  [1527177145.8389] manager: (tun0): new Generic device (/org/freedesktop/NetworkManager/Devices/7)
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info>  [1527177145.8923] device (tun0): carrier: link connected

The Ansible task that failed is logged here, though as the logs are ordered it sits between two other entries whose timestamps are a minute ahead, so I'm not sure what to make of that.

May 24 11:52:42 ip-172-18-6-223.ec2.internal ansible-oc_project[26702]: Invoked with admin_role=admin display_name=None name=management-infra admin=None kubeconfig=/etc/origin/master/admin.kubeconfig state=present node_selector=None debug=False description=Management Infrastructure

The kubelet shows failing API health checks:

May 24 11:52:54 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: W0524 11:52:42.074112   25720 prober.go:103] No ref for container "docker://3fe72aa4ab015a42a12eef3cc2543178ae2f33539e08327d26d4c3c40284a030" (master-api-ip-172-18-6-223.ec2.internal_kube-system(f8ddfa563d27a7f9ca95644fe2f8cdd0):api)
May 24 11:52:54 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: I0524 11:52:42.074130   25720 prober.go:111] Liveness probe for "master-api-ip-172-18-6-223.ec2.internal_kube-system(f8ddfa563d27a7f9ca95644fe2f8cdd0):api" failed (failure): Get https://172.18.6.223:8443/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
May 24 11:52:54 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: W0524 11:52:46.453801   25720 prober.go:103] No ref for container "docker://3fe72aa4ab015a42a12eef3cc2543178ae2f33539e08327d26d4c3c40284a030" (master-api-ip-172-18-6-223.ec2.internal_kube-system(f8ddfa563d27a7f9ca95644fe2f8cdd0):api)
May 24 11:52:54 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: I0524 11:52:46.453822   25720 prober.go:111] Readiness probe for "master-api-ip-172-18-6-223.ec2.internal_kube-system(f8ddfa563d27a7f9ca95644fe2f8cdd0):api" failed (failure): Get https://172.18.6.223:8443/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I think we need logs from the api and etcd pods. It's unfortunate that those don't show up in the journal.

Seth, any ideas?

Comment 10 Seth Jennings 2018-05-25 20:02:59 UTC
Created attachment 1441693 [details]
mirror-pods.log

Looking at the logs from when the kubelet restarts, it tries to create mirror pods for each of the static pods, and that fails because they already exist, indicating to me that the API server is working at this point.

I do see this for the etcd pod:

I0524 11:52:18.798216   25720 status_manager.go:468] Pod "master-etcd-ip-172-18-6-223.ec2.internal_kube-system(548ee528-5f6a-11e8-971f-0e5bf8ed4b2a)" was deleted and then recreated, skipping status update; old UID "e41c2a57b9d15b28cb8a240c7780597f", new UID "548ee528-5f6a-11e8-971f-0e5bf8ed4b2a"

And I do not see it for the master-api or controllers pods.

It does look like the api and controllers containers died and were then restarted, based on PLEG events.

I have no conclusion here.  Just observations.

Comment 11 Vadim Rutkovsky 2018-05-25 22:09:19 UTC
Created attachment 1441724 [details]
no-istag-journal.log

I've hit this on the 'openshift_node_group : Remove the image stream tag' task: when 'oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-node istag node:v3.10 --ignore-not-found' is called, the result is 'error: the server doesn't have a resource type "istag"'.

The same command passes later on, though.

Comment 12 Vadim Rutkovsky 2018-05-29 13:05:31 UTC
It seems the etcd server container restarts between API server liveness checks, as the etcd cluster healthcheck didn't pass:

2018-05-29 10:42:10.676749 N | pkg/osutil: received terminated signal, shutting down...
...
2018-05-29 10:42:13.038489 W | etcdserver: apply entries took too long [2.811652927s for 2 entries]
2018-05-29 10:42:13.038507 W | etcdserver: avoid queries with large range/delete range!
2018-05-29 10:42:13.573738 W | wal: sync duration of 3.347277491s, expected less than 1s
2018-05-29 10:42:20.677342 I | etcdserver: skipped leadership transfer for single member cluster
2018-05-29 10:42:20.685481 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2018-05-29 10:42:20.685885 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")


It seems the API server doesn't notice anything between etcd restarts and may throw 'etcdserver: request timed out' errors while the etcd container restarts.

The etcd container prepull in https://github.com/openshift/openshift-ansible/pull/8539 should help, along with tweaking livenessProbe and terminationGracePeriodSeconds for the etcd container.
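
For illustration, a hedged sketch of that static-pod tuning, using standard Kubernetes pod-spec fields (the probe command, image, and values here are assumptions, not the shipped manifest):

# Excerpt of a hypothetical etcd static pod; only the fields relevant
# to this comment are shown.
apiVersion: v1
kind: Pod
metadata:
  name: master-etcd
  namespace: kube-system
spec:
  # Give etcd time to finish its shutdown (WAL sync, leadership
  # transfer) instead of being killed mid-write.
  terminationGracePeriodSeconds: 30
  containers:
  - name: etcd
    image: registry.example.com/rhel7/etcd   # placeholder image
    livenessProbe:
      exec:
        command: ["etcdctl", "cluster-health"]
      # Tolerate slow fsyncs: don't restart etcd on a single slow check.
      initialDelaySeconds: 45
      timeoutSeconds: 10
      failureThreshold: 3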

Comment 13 Scott Dodson 2018-05-30 13:51:39 UTC
*** Bug 1573604 has been marked as a duplicate of this bug. ***

Comment 14 Vadim Rutkovsky 2018-05-30 14:32:14 UTC
*** Bug 1579261 has been marked as a duplicate of this bug. ***

Comment 15 Vadim Rutkovsky 2018-05-30 14:33:36 UTC
https://github.com/openshift/openshift-ansible/pull/8563 should fix it.

We should wait for all control plane pods to be healthy, wait for additional APIs to register, and clean the oc cache on the target to resolve the "server doesn't have 'istag / dc' resource" errors.
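
In Ansible terms the idea looks roughly like the sketch below (the health endpoint, cache path, and retry values are assumptions for illustration; the actual implementation is in the PR above):

# Hypothetical illustration, not the code from the PR.
- name: Wait for the API server to report ready
  command: >
    oc --config=/etc/origin/master/admin.kubeconfig
    get --raw /healthz/ready
  register: api_health
  until: api_health.rc == 0
  retries: 60
  delay: 5
  changed_when: false

- name: Clean the client discovery cache so newly registered resource types (istag, dc) are found
  file:
    path: /root/.kube/cache
    state: absent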

Comment 18 Weihua Meng 2018-06-05 08:30:07 UTC
Fixed.
openshift-ansible-3.10.0-0.58.0.git.0.d8f6377.el7.noarch

  Operating System: Red Hat Enterprise Linux Atomic Host 7.5.1
            Kernel: Linux 3.10.0-862.2.3.el7.x86_64

Comment 20 errata-xmlrpc 2018-07-30 19:16:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

