Description of problem: Was doing a RHV(1H + 1E) + OCP(1W + 1M) deployment in a nested virtualization environment. Once it got to the OCP section it seemed to take a very long time. Eventually at 70% it errored out, with the following error on the task: ansible-playbook returned a non-zero exit code during installation. Please refer to the log for more information regarding the failure. I looked at the ansible log in the deployment log directory on fuser and found the following error: ".fatal: [ inaugurate-ocp-node1.b.b]: FAILED! => { \"failed\": true, \"msg\": \"Timeout (12s) waiting for privilege escalation prompt: \" }", "", "NO MORE HOSTS LEFT *************************************************************", "", "inaugurate-ocp-master1.b.b : ok=2 changed=0 unreachable=0 failed=0 ", "inaugurate-ocp-node1.b.b : ok=1 changed=0 unreachable=0 failed=1 ", "localhost : ok=9 changed=8 unreachable=0 failed=0 ", "", "", "Installation Complete: Note: Play count is an estimate and some were skipped because your install does not require them", "", "", "An error was detected. After resolving the problem please relaunch the", "installation process." ], As I looked further I noticed that both the OCP nodes were running at near 100% CPU utilization. I looked at the cpu utilization of the RHV host, and it was pegged at like a load of 3 (so 300% cpu utilization). One of the developers, dymurray, had me run the command: atomic-openshift-installer -u -c /tmp/atomic-openshift-installer.answers.cfg.yml install --force It errored out with: There was a problem fetching the required information. Please see /tmp/ansible.log for details. In the /tmp/ansible.log this error existed: 2017-01-18 17:30:56,882 p=18798 u=root | fatal: [inaugurate-ocp-master1.b.b]: FAILED! => {"attempts": 1, "changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"} Looking at systemctl and journalctl -xe revealed there might be a storage issue of some kind. If that is the case though, the default storage size was used for the containers, so maybe the default needs to be bigger. I have listed the above two commands output below in Version-Release number of selected component (if applicable): QCI-1.1-RHEL-7-20170118.t.0 How reproducible: First time Steps to Reproduce: 1. Do a RHV(1E + 1H) + OCP(2 nodes) deployment in a nested virtualization environment Actual results: Results above in description. Expected results: No errors, deployment complete. Additional info: journalctl -xe output ------------------------- Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: I0119 11:48:49.951775 32066 trace.go:61] Trace "List *batch.JobList" (started 2017-01-19 11:48:47.577803439 -0500 EST): Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [7.794µs] [7.794µs] About to list etcd node Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.282996264s] [2.28298847s] Etcd node listed Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.364967335s] [81.971071ms] Node list decoded Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.364990396s] [23.061µs] END Jan 19 11:48:51 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: I0119 11:48:51.385698 32066 nodecontroller.go:609] NodeController is entering network segmentation mode. Jan 19 11:48:53 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: I0119 11:48:53.648774 32066 trace.go:61] Trace "List /apis/batch/v2alpha1/jobs" (started 2017-01-19 11:48:47.577785434 -0500 EST): Jan 19 11:48:53 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [8.322µs] [8.322µs] About to List from storage Jan 19 11:48:53 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.374032788s] [2.374024466s] Listing from storage done Jan 19 11:48:54 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.524015894s] [149.983106ms] Self-linking done Jan 19 11:48:54 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [3.960856555s] [1.436840661s] Writing http response done (0 items) Jan 19 11:48:54 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [3.960860837s] [4.282µs] END Jan 19 11:48:54 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 1.248066266s, expected less than 1s Jan 19 11:48:55 inaugurate-ocp-master1.b.b etcd[27387]: apply entries took too long [156.995123ms for 2 entries] Jan 19 11:48:55 inaugurate-ocp-master1.b.b etcd[27387]: avoid queries with large range/delete range! Jan 19 11:49:02 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 5.337931061s, expected less than 1s Jan 19 11:49:05 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 2.887410895s, expected less than 1s Jan 19 11:49:05 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: I0119 11:49:05.753897 32066 trace.go:61] Trace "List *extensions.ThirdPartyResourceList" (started 2017-01-19 11:48:49.803762304 -0500 EST): Jan 19 11:49:05 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [13.426µs] [13.426µs] About to list etcd node Jan 19 11:49:06 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [4.585989961s] [4.585976535s] Etcd node listed Jan 19 11:49:06 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [4.586004951s] [14.99µs] Node list decoded Jan 19 11:49:06 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [15.948245834s] [11.362240883s] END Jan 19 11:49:10 inaugurate-ocp-master1.b.b etcd[27387]: apply entries took too long [89.997255ms for 1 entries] Jan 19 11:49:10 inaugurate-ocp-master1.b.b etcd[27387]: avoid queries with large range/delete range! Jan 19 11:49:10 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 4.100239947s, expected less than 1s Jan 19 11:49:16 inaugurate-ocp-master1.b.b etcd[27387]: apply entries took too long [27.157113ms for 1 entries] Jan 19 11:49:16 inaugurate-ocp-master1.b.b etcd[27387]: avoid queries with large range/delete range! Jan 19 11:49:18 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 1.071792101s, expected less than 1s Jan 19 11:49:30 inaugurate-ocp-master1.b.b etcd[27387]: apply entries took too long [1.238851064s for 1 entries] systemctl status atomic-openshift-node -l output ------------------------------------------------- [root@inaugurate-ocp-master1 tmp]# systemctl status atomic-openshift-node -l ● atomic-openshift-node.service - Atomic OpenShift Node Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled) Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d └─openshift-sdn-ovs.conf Active: active (running) since Thu 2017-01-19 08:30:14 EST; 3h 17min ago Docs: https://github.com/openshift/origin Main PID: 87405 (openshift) Memory: 33.2M CGroup: /system.slice/atomic-openshift-node.service ├─87405 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2 └─87525 journalctl -k -f Jan 19 11:39:50 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:39:50.435645 87405 kubelet.go:3011] Recording NodeReady event message for node inaugurate-ocp-master1.b.b Jan 19 11:40:29 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: E0119 11:40:29.569936 87405 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"inaugurate-ocp-master1.b.b.149b2f667b50c988", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"5604", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"inaugurate-ocp-master1.b.b", UID:"inaugurate-ocp-master1.b.b", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientDisk", Message:"Node inaugurate-ocp-master1.b.b status is now: NodeHasSufficientDisk", Source:api.EventSource{Component:"kubelet", Host:"inaugurate-ocp-master1.b.b"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63620429419, nsec:0, loc:(*time.Location)(0x8398200)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63620440790, nsec:434969158, loc:(*time.Location)(0x8398200)}}, Count:27, Type:"Normal"}': 'client: etcd cluster is unavailable or misconfigured' (will not retry!) Jan 19 11:41:06 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: E0119 11:41:06.567832 87405 kubelet.go:3000] Error updating node status, will retry: Operation cannot be fulfilled on nodes "inaugurate-ocp-master1.b.b": the object has been modified; please apply your changes to the latest version and try again Jan 19 11:43:27 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:43:27.680004 87405 kubelet.go:3011] Recording NodeHasSufficientDisk event message for node inaugurate-ocp-master1.b.b Jan 19 11:43:27 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:43:27.682901 87405 kubelet.go:3011] Recording NodeReady event message for node inaugurate-ocp-master1.b.b Jan 19 11:44:20 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:44:20.283959 87405 container_manager_linux.go:284] Discovered runtime cgroups name: /system.slice/docker.service Jan 19 11:45:12 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: E0119 11:45:12.650682 87405 kubelet.go:3000] Error updating node status, will retry: Operation cannot be fulfilled on nodes "inaugurate-ocp-master1.b.b": the object has been modified; please apply your changes to the latest version and try again Jan 19 11:46:10 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:46:10.120912 87405 kubelet.go:3011] Recording NodeHasSufficientDisk event message for node inaugurate-ocp-master1.b.b Jan 19 11:46:11 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:46:10.306611 87405 kubelet.go:3011] Recording NodeReady event message for node inaugurate-ocp-master1.b.b Jan 19 11:47:00 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: E0119 11:46:56.167949 87405 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"inaugurate-ocp-master1.b.b.149b3018053cd208", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"5651", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"inaugurate-ocp-master1.b.b", UID:"inaugurate-ocp-master1.b.b", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeReady", Message:"Node inaugurate-ocp-master1.b.b status is now: NodeReady", Source:api.EventSource{Component:"kubelet", Host:"inaugurate-ocp-master1.b.b"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63620430181, nsec:0, loc:(*time.Location)(0x8398200)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63620441170, nsec:306668360, loc:(*time.Location)(0x8398200)}}, Count:27, Type:"Normal"}': 'client: etcd cluster is unavailable or misconfigured' (will not retry!)
I did this same deployment again, and it did succeed, but if you looked at the openstack VM's they were both running at 100% or near their. This was with compose: QCI-1.1-RHEL-7-20170118.t.1
I was not able to reproduce this bug. The following are the details of my deployment setup: - RHV & OCP (1 engine & 1 Hypervisor) - Engine : 4GB RAM, 30GB disk, 16 vCPU - Hypervisor : 16GB Ram, 70GB disk, 16vCPU - Satellite VM : 16GB Ram, 250GB disk, 16vCPU - OCP Master Node : 8GB RAM, 30GB disk, 2vCPU (30GB for docker) - OCP worker Node : 8GB RAM, 15GB disk, 1vCPU (30GB for docker) NOTE: The host machine only has 16 cores, so the actual vCPU during the time of deployment may be varied depending on the VM's work load, and the libvirt scheduling. Also, upon quick review of James Oden's setup, his VM's were set to a much higher RAM setting (50GB RAM each for Engine and Hypervisor).
Seen this again, this time the failure was a different command: 2017-01-24 20:50:42,305 p=21157 u=foreman | fatal: [dirty-ocp-master1.b.b]: FAILED! => { "changed": true, "cmd": ["systemctl","restart", "atomic-openshift-master.service"], "delta": "0:01:31.783867", "end": "2017-01-24 20:50:39.756826", "failed": true, "rc": 1, "start": "2017-01-24 20:49:07.972959", "stderr": "Job for atomic-openshift-master.service failed because a timeout was exceeded. See \"systemctl status atomic-openshift-master.service\" and \"journalctl -xe\" for details.", "stdout": "", "stdout_lines": [], "warnings": [] } The OCP deploy subtask was now at 90%. However the CPU utilization of the nodes is going between 80%-100%. When I run top on one of the OCP nodes I'm seeing the openshift and etcd processes taking up most of the CPU. Open shift is going from like 25% to sometimes 70% utilization. I don't know what this looks like on a normal system though.
We believe this issue is due to low available hardware resources.