Bug 1579676
| Summary: | Install failed at TASK [openshift_node_group : Ensure project exists] sometimes | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Weihua Meng <wmeng> |
| Component: | Installer | Assignee: | Vadim Rutkovsky <vrutkovs> |
| Status: | CLOSED ERRATA | QA Contact: | Weihua Meng <wmeng> |
| Severity: | medium | Priority: | medium |
| Version: | 3.10.0 | Target Release: | 3.10.0 |
| Hardware: | Unspecified | OS: | Unspecified |
| Type: | Bug | Last Closed: | 2018-07-30 19:16:09 UTC |
| CC: | aos-bugs, jeder, jmencak, jokerman, mark.vinkx, mgugino, mmccomas, sjenning, vrutkovs, wmeng, zitang | | |
Description (Weihua Meng, 2018-05-18 06:36:52 UTC)
I also hit a similar issue:
TASK [openshift_sdn : Ensure project exists] ***********************************
Sunday 20 May 2018 20:25:58 -0400 (0:00:14.700) 0:14:25.866 ************
fatal: [ec2-35-168-12-131.compute-1.amazonaws.com]: FAILED! => {"changed": false, "failed": true, "msg": {"cmd": "/usr/bin/oc adm new-project openshift-sdn --admin-role=admin --node-selector=", "results": {}, "returncode": 1, "stderr": "Error from server: timeout\nError from server: etcdserver: request timed out\n", "stdout": "Created project openshift-sdn\nCould not add service accounts to the system:image-puller role: timeout\nCould not add service accounts to the system:image-builder role: etcdserver: request timed out\n"}}
After that failure, I logged into the master and ran the same command:
# /usr/bin/oc adm new-project openshift-sdn --admin-role=admin --node-selector=
error: project openshift-sdn already exists
It seems the project was actually created successfully; per the stdout above, only the follow-up role additions timed out against etcd.
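Given that the project itself gets created and only the follow-up calls time out, here is a minimal mitigation sketch, assuming the installer's `oc_project` module (whose invocation appears later in this report) treats `state: present` idempotently on a re-run; the retry values are arbitrary:

```yaml
# Sketch only: retry the project task across transient etcd timeouts.
# Parameters mirror the oc_project invocation logged later in this bug;
# retries/delay are arbitrary, and this assumes the module's
# state=present check makes a re-run idempotent.
- name: Ensure project exists
  oc_project:
    state: present
    name: openshift-sdn
    admin_role: admin
    node_selector: ''
    kubeconfig: /etc/origin/master/admin.kubeconfig
  register: project_out
  until: project_out is succeeded
  retries: 5
  delay: 10
```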
Can we please get `journalctl --no-pager -u atomic-openshift-node` from all master hosts? Can we also get `master-logs etcd` from all master hosts, or, if etcd is still running as a system service, `journalctl --no-pager -u etcd`? Our suspicion is that etcd is either performing poorly or being restarted unexpectedly for some reason.

Hit the same problem on another step:
TASK [openshift_manageiq : Add Management Infrastructure project] ********************************************************************************
task path: /home/mgugino/git/openshift-ansible/roles/openshift_manageiq/tasks/main.yaml:2
Thursday 24 May 2018 11:52:41 -0400 (0:00:00.441) 0:11:25.343 **********
Using module file /home/mgugino/git/openshift-ansible/roles/lib_openshift/library/oc_project.py
<ec2-35-153-157-166.compute-1.amazonaws.com> ESTABLISH SSH CONNECTION FOR USER: ec2-user
<ec2-35-153-157-166.compute-1.amazonaws.com> SSH: EXEC ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User=ec2-user -o ConnectTimeout=30 -o ControlPath=/home/mgugino/.ansible/cp/%h-%r ec2-35-153-157-166.compute-1.amazonaws.com '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-fsjtelkxmmxrzjzamcksflrkodldukzc; /usr/bin/python'"'"'"'"'"'"'"'"' && sleep 0'"'"''
<ec2-35-153-157-166.compute-1.amazonaws.com> (1, b'\n{"msg": {"returncode": 1, "cmd": "/bin/oc adm new-project management-infra --admin-role=admin --description=Management Infrastructure", "results": {}, "stderr": "Error from server: etcdserver: request timed out\\n", "stdout": ""}, "failed": true, "exception": " File \\"/tmp/ansible_ie_tR8/ansible_module_oc_project.py\\", line 47, in <module>\\n import ruamel.yaml as yaml\\n", "invocation": {"module_args": {"admin_role": "admin", "display_name": null, "name": "management-infra", "admin": null, "kubeconfig": "/etc/origin/master/admin.kubeconfig", "state": "present", "node_selector": null, "debug": false, "description": "Management Infrastructure"}}}\n', b'')
The full traceback is:
File "/tmp/ansible_ie_tR8/ansible_module_oc_project.py", line 47, in <module>
import ruamel.yaml as yaml
fatal: [ec2-35-153-157-166.compute-1.amazonaws.com]: FAILED! => {
"changed": false,
"failed": true,
"invocation": {
"module_args": {
"admin": null,
"admin_role": "admin",
"debug": false,
"description": "Management Infrastructure",
"display_name": null,
"kubeconfig": "/etc/origin/master/admin.kubeconfig",
"name": "management-infra",
"node_selector": null,
"state": "present"
}
},
"msg": {
"cmd": "/bin/oc adm new-project management-infra --admin-role=admin --description=Management Infrastructure",
"results": {},
"returncode": 1,
"stderr": "Error from server: etcdserver: request timed out\n",
"stdout": ""
}
}
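For reference, a sketch of collecting the logs requested above from every master; the `masters` inventory group name is an assumption about this environment, and the `master-logs` arguments may differ by release:

```yaml
# A sketch of collecting the logs requested above from every master;
# the "masters" inventory group name is an assumption about this setup,
# and master-logs arguments may differ by release. Run with -v (or
# register the results) to capture the output.
- hosts: masters
  become: yes
  tasks:
    - name: Collect node service logs
      command: journalctl --no-pager -u atomic-openshift-node
      changed_when: false

    - name: Collect etcd logs from the static pod
      command: master-logs etcd
      changed_when: false
      ignore_errors: yes   # fails if etcd still runs as a system service

    - name: Fall back to the etcd system service journal
      command: journalctl --no-pager -u etcd
      changed_when: false
      ignore_errors: yes
```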
Created attachment 1441135 [details]
problem-node (all-in-one master)
This appears to be due to the master api pod, and possibly other pods, being killed and restarted for some reason.

After some time (almost an hour), re-running the installer hit the same error creating a new project on the same cluster, this time for openshift_node_group.

Created attachment 1441171 [details]
problem-node-second-run
Same node as before; re-running the installer results in the same failure.
Looking through the logs from comment 5, we see the sync pod kick in at 11:52:19:

May 24 11:52:19 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: I0524 11:52:19.798337 25720 kuberuntime_manager.go:513] Container {Name:sync Image:registry.reg-aws.openshift.com:443/openshift3/ose-node:v3.10 Command:[/bin/bash -c #!/bin/bash

Then there are ovs kernel messages, and NetworkManager detects some changes to networking; maybe this disrupts etcd and apiserver communication?

May 24 11:52:25 ip-172-18-6-223.ec2.internal kernel: device ovs-system entered promiscuous mode
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info> [1527177145.6665] manager: (ovs-system): new Generic device (/org/freedesktop/NetworkManager/Devices/4)
May 24 11:52:25 ip-172-18-6-223.ec2.internal kernel: device br0 entered promiscuous mode
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info> [1527177145.7064] manager: (br0): new Generic device (/org/freedesktop/NetworkManager/Devices/5)
May 24 11:52:25 ip-172-18-6-223.ec2.internal kernel: device vxlan_sys_4789 entered promiscuous mode
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info> [1527177145.8139] manager: (vxlan_sys_4789): new Vxlan device (/org/freedesktop/NetworkManager/Devices/6)
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info> [1527177145.8151] device (vxlan_sys_4789): state change: unmanaged -> unavailable (reason 'connection-assumed', sys-iface-state: 'external')
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info> [1527177145.8162] device (vxlan_sys_4789): enslaved to non-master-type device ovs-system; ignoring
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info> [1527177145.8169] device (vxlan_sys_4789): state change: unavailable -> disconnected (reason 'none', sys-iface-state: 'external')
May 24 11:52:25 ip-172-18-6-223.ec2.internal kernel: device tun0 entered promiscuous mode
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info> [1527177145.8389] manager: (tun0): new Generic device (/org/freedesktop/NetworkManager/Devices/7)
May 24 11:52:25 ip-172-18-6-223.ec2.internal NetworkManager[13793]: <info> [1527177145.8923] device (tun0): carrier: link connected

The Ansible task that failed is here, though as the logs are ordered it's in between two other log entries that are 1 minute ahead, so I'm not sure what to make of that:
May 24 11:52:42 ip-172-18-6-223.ec2.internal ansible-oc_project[26702]: Invoked with admin_role=admin display_name=None name=management-infra admin=None kubeconfig=/etc/origin/master/admin.kubeconfig state=present node_selector=None debug=False description=Management Infrastructure

The kubelet shows failing api health checks:

May 24 11:52:54 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: W0524 11:52:42.074112 25720 prober.go:103] No ref for container "docker://3fe72aa4ab015a42a12eef3cc2543178ae2f33539e08327d26d4c3c40284a030" (master-api-ip-172-18-6-223.ec2.internal_kube-system(f8ddfa563d27a7f9ca95644fe2f8cdd0):api)
May 24 11:52:54 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: I0524 11:52:42.074130 25720 prober.go:111] Liveness probe for "master-api-ip-172-18-6-223.ec2.internal_kube-system(f8ddfa563d27a7f9ca95644fe2f8cdd0):api" failed (failure): Get https://172.18.6.223:8443/healthz: net/http: request canceled (Client.Timeout exceeded while awaiting headers)
May 24 11:52:54 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: W0524 11:52:46.453801 25720 prober.go:103] No ref for container "docker://3fe72aa4ab015a42a12eef3cc2543178ae2f33539e08327d26d4c3c40284a030" (master-api-ip-172-18-6-223.ec2.internal_kube-system(f8ddfa563d27a7f9ca95644fe2f8cdd0):api)
May 24 11:52:54 ip-172-18-6-223.ec2.internal atomic-openshift-node[25720]: I0524 11:52:46.453822 25720 prober.go:111] Readiness probe for "master-api-ip-172-18-6-223.ec2.internal_kube-system(f8ddfa563d27a7f9ca95644fe2f8cdd0):api" failed (failure): Get https://172.18.6.223:8443/healthz/ready: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

I think we need logs from the api and etcd pods. It's unfortunate that those don't show up in the journal. Seth, any ideas?

Created attachment 1441693 [details]
mirror-pods.log
Looking at the logs, when the kubelet restarts it tries to create mirror pods for each of the static pods, and that fails because they already exist, indicating to me that the API server is working at this point.
I do see this for the etcd pod:
I0524 11:52:18.798216 25720 status_manager.go:468] Pod "master-etcd-ip-172-18-6-223.ec2.internal_kube-system(548ee528-5f6a-11e8-971f-0e5bf8ed4b2a)" was deleted and then recreated, skipping status update; old UID "e41c2a57b9d15b28cb8a240c7780597f", new UID "548ee528-5f6a-11e8-971f-0e5bf8ed4b2a"
And do not see it for the master-api|controllers pods.
It does look like the api and controllers containers died and were then restarted, based on PLEG events.
I have no conclusion here. Just observations.
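As a diagnostic aid, here is a hypothetical task to confirm whether the control-plane static pods were actually restarted, matching the PLEG observation above; the task shape is illustrative, not part of the installer:

```yaml
# Illustrative check only: list the control-plane static pods with their
# restart counts on the affected master, to confirm (or rule out) the
# restarts the PLEG events suggest.
- name: Check control-plane pod restart counts
  command: >
    oc --config=/etc/origin/master/admin.kubeconfig
    get pods -n kube-system -o wide
  register: cp_pods
  changed_when: false

- debug:
    var: cp_pods.stdout_lines
```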
Created attachment 1441724 [details]
no-istag-journal.log
I've hit this on the 'openshift_node_group : Remove the image stream tag' task: when `oc --config=/etc/origin/master/admin.kubeconfig delete -n openshift-node istag node:v3.10 --ignore-not-found` is called, the result is `error: the server doesn't have a resource type "istag"`.
The same command passes when run later, though.
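A hedged workaround sketch, polling until the apiserver actually serves the `istag` resource type before deleting the tag; the retry values are arbitrary:

```yaml
# Sketch only: wait until the apiserver serves the istag resource type,
# then delete the tag; retries/delay values are arbitrary assumptions.
- name: Wait for the imagestreamtag API to be served
  command: >
    oc --config=/etc/origin/master/admin.kubeconfig
    get istag -n openshift-node
  register: istag_check
  until: istag_check.rc == 0
  retries: 30
  delay: 10
  changed_when: false

- name: Remove the image stream tag
  command: >
    oc --config=/etc/origin/master/admin.kubeconfig
    delete -n openshift-node istag node:v3.10 --ignore-not-found
```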
It seems the etcd server container restarts between API server liveness checks, as the etcd cluster health check didn't pass:
2018-05-29 10:42:10.676749 N | pkg/osutil: received terminated signal, shutting down...
...
2018-05-29 10:42:13.038489 W | etcdserver: apply entries took too long [2.811652927s for 2 entries]
2018-05-29 10:42:13.038507 W | etcdserver: avoid queries with large range/delete range!
2018-05-29 10:42:13.573738 W | wal: sync duration of 3.347277491s, expected less than 1s
2018-05-29 10:42:20.677342 I | etcdserver: skipped leadership transfer for single member cluster
2018-05-29 10:42:20.685481 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
2018-05-29 10:42:20.685885 W | etcdserver/api/v3rpc: failed to receive watch request from gRPC stream ("rpc error: code = Unavailable desc = body closed by handler")
It seems the API server doesn't notice anything between etcd restarts and may throw 'etcdserver: request timed out' errors while the etcd container restarts.
Pre-pulling the etcd container image, as in https://github.com/openshift/openshift-ansible/pull/8539, should help, along with tweaking livenessProbe and terminationGracePeriodSeconds for the etcd container.
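To illustrate the tuning mentioned above, here is a minimal static-pod fragment; the probe command and all the numbers are assumptions, not the values openshift-ansible ships:

```yaml
# Illustrative static-pod fragment only; the probe command and numbers
# are assumptions, not the shipped openshift-ansible defaults.
apiVersion: v1
kind: Pod
metadata:
  name: master-etcd
  namespace: kube-system
spec:
  # Give etcd time to finish slow WAL syncs (>3s in the logs above)
  # before it is killed on shutdown.
  terminationGracePeriodSeconds: 30
  containers:
    - name: etcd
      livenessProbe:
        exec:
          command: ["etcdctl", "cluster-health"]
        initialDelaySeconds: 45
        timeoutSeconds: 10
        failureThreshold: 3
```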
*** Bug 1573604 has been marked as a duplicate of this bug. ***

*** Bug 1579261 has been marked as a duplicate of this bug. ***

https://github.com/openshift/openshift-ansible/pull/8563 should fix it. We should wait for all control plane pods to be healthy, wait for additional APIs to register, and clean the oc cache on the target to resolve the "server doesn't have 'istag / dc' resource" errors.

Fixed.
openshift-ansible-3.10.0-0.58.0.git.0.d8f6377.el7.noarch
Operating System: Red Hat Enterprise Linux Atomic Host 7.5.1
Kernel: Linux 3.10.0-862.2.3.el7.x86_64
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816