Bug 1414908

Summary: OCP fails to install due to timeout running command
Product: Red Hat Quickstart Cloud Installer Reporter: James Olin Oden <joden>
Component: Installation - OpenShiftAssignee: jkim
Status: CLOSED NOTABUG QA Contact: Sudhir Mallamprabhakara <smallamp>
Severity: high Docs Contact: Derek <dcadzow>
Priority: unspecified    
Version: 1.1CC: arubin, bthurber, jkim, jmatthew, qci-bugzillas
Target Milestone: ---Keywords: Triaged
Target Release: 1.1   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-01-26 20:02:36 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description James Olin Oden 2017-01-19 17:19:24 UTC
Description of problem:
Was doing a RHV(1H + 1E) + OCP(1W + 1M) deployment in a nested virtualization environment.   Once it got to the OCP section it seemed to take a very long time.  Eventually at 70% it errored out, with the following error on the task:

   ansible-playbook returned a non-zero exit code during installation. Please refer to the log for more information regarding the failure.

I looked at the ansible log in the deployment log directory on fuser and found the following error:

   ".fatal: [
      inaugurate-ocp-node1.b.b]: FAILED! => {
          \"failed\": true,
          \"msg\": \"Timeout (12s) waiting for privilege escalation prompt: \"
      }",
      "",
      "NO MORE HOSTS LEFT *************************************************************",
      "",
      "inaugurate-ocp-master1.b.b : ok=2    changed=0    unreachable=0    failed=0   ",
      "inaugurate-ocp-node1.b.b   : ok=1    changed=0    unreachable=0    failed=1   ",
      "localhost                  : ok=9    changed=8    unreachable=0    failed=0   ",
      "",
      "",
      "Installation Complete: Note: Play count is an estimate and some were skipped because your install does not require them",
      "",
      "",
      "An error was detected. After resolving the problem please relaunch the",
      "installation process."
   ],

As I looked further I noticed that both the OCP nodes were running at near
100% CPU utilization.   I looked at the cpu utilization of the RHV host, and it was pegged at like a load of 3 (so 300% cpu utilization).   

One of the developers, dymurray, had me run the command:

   atomic-openshift-installer -u -c /tmp/atomic-openshift-installer.answers.cfg.yml install --force

It errored out with:

   There was a problem fetching the required information. Please see /tmp/ansible.log for details.

In the /tmp/ansible.log this error existed:

   2017-01-18 17:30:56,882 p=18798 u=root |  fatal: [inaugurate-ocp-master1.b.b]: FAILED! => {"attempts": 1, "changed": false, "failed": true, "msg": "Unable to start service atomic-openshift-node: Job for atomic-openshift-node.service failed because the control process exited with error code. See \"systemctl status atomic-openshift-node.service\" and \"journalctl -xe\" for details.\n"}

Looking at systemctl and journalctl -xe revealed there might be a storage issue of some kind.   If that is the case though, the default storage size was used
for the containers, so maybe the default needs to be bigger.   I have listed the above two commands output below in 

Version-Release number of selected component (if applicable):
QCI-1.1-RHEL-7-20170118.t.0

How reproducible:
First time

Steps to Reproduce:
1.   Do a RHV(1E + 1H) + OCP(2 nodes) deployment in a nested virtualization
     environment 

Actual results:
Results above in description.

Expected results:
No errors, deployment complete.

Additional info:

journalctl -xe output
-------------------------
    Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: I0119 11:48:49.951775   32066 trace.go:61] Trace "List *batch.JobList" (started 2017-01-19 11:48:47.577803439 -0500 EST):
    Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [7.794µs] [7.794µs] About to list etcd node
    Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.282996264s] [2.28298847s] Etcd node listed
    Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.364967335s] [81.971071ms] Node list decoded
    Jan 19 11:48:50 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.364990396s] [23.061µs] END
    Jan 19 11:48:51 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: I0119 11:48:51.385698   32066 nodecontroller.go:609] NodeController is entering network segmentation mode.
    Jan 19 11:48:53 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: I0119 11:48:53.648774   32066 trace.go:61] Trace "List /apis/batch/v2alpha1/jobs" (started 2017-01-19 11:48:47.577785434 -0500 EST):
    Jan 19 11:48:53 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [8.322µs] [8.322µs] About to List from storage
    Jan 19 11:48:53 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.374032788s] [2.374024466s] Listing from storage done
    Jan 19 11:48:54 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [2.524015894s] [149.983106ms] Self-linking done
    Jan 19 11:48:54 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [3.960856555s] [1.436840661s] Writing http response done (0 items)
    Jan 19 11:48:54 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [3.960860837s] [4.282µs] END
    Jan 19 11:48:54 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 1.248066266s, expected less than 1s
    Jan 19 11:48:55 inaugurate-ocp-master1.b.b etcd[27387]: apply entries took too long [156.995123ms for 2 entries]
    Jan 19 11:48:55 inaugurate-ocp-master1.b.b etcd[27387]: avoid queries with large range/delete range!
    Jan 19 11:49:02 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 5.337931061s, expected less than 1s
    Jan 19 11:49:05 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 2.887410895s, expected less than 1s
    Jan 19 11:49:05 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: I0119 11:49:05.753897   32066 trace.go:61] Trace "List *extensions.ThirdPartyResourceList" (started 2017-01-19 11:48:49.803762304 -0500 EST):
    Jan 19 11:49:05 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [13.426µs] [13.426µs] About to list etcd node
    Jan 19 11:49:06 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [4.585989961s] [4.585976535s] Etcd node listed
    Jan 19 11:49:06 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [4.586004951s] [14.99µs] Node list decoded
    Jan 19 11:49:06 inaugurate-ocp-master1.b.b atomic-openshift-master[32066]: [15.948245834s] [11.362240883s] END
    Jan 19 11:49:10 inaugurate-ocp-master1.b.b etcd[27387]: apply entries took too long [89.997255ms for 1 entries]
    Jan 19 11:49:10 inaugurate-ocp-master1.b.b etcd[27387]: avoid queries with large range/delete range!
    Jan 19 11:49:10 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 4.100239947s, expected less than 1s
    Jan 19 11:49:16 inaugurate-ocp-master1.b.b etcd[27387]: apply entries took too long [27.157113ms for 1 entries]
    Jan 19 11:49:16 inaugurate-ocp-master1.b.b etcd[27387]: avoid queries with large range/delete range!
    Jan 19 11:49:18 inaugurate-ocp-master1.b.b etcd[27387]: sync duration of 1.071792101s, expected less than 1s
    Jan 19 11:49:30 inaugurate-ocp-master1.b.b etcd[27387]: apply entries took too long [1.238851064s for 1 entries]

systemctl status atomic-openshift-node -l output
-------------------------------------------------
    [root@inaugurate-ocp-master1 tmp]# systemctl status atomic-openshift-node -l
    ● atomic-openshift-node.service - Atomic OpenShift Node
       Loaded: loaded (/usr/lib/systemd/system/atomic-openshift-node.service; enabled; vendor preset: disabled)
      Drop-In: /usr/lib/systemd/system/atomic-openshift-node.service.d
               └─openshift-sdn-ovs.conf
       Active: active (running) since Thu 2017-01-19 08:30:14 EST; 3h 17min ago
         Docs: https://github.com/openshift/origin
     Main PID: 87405 (openshift)
       Memory: 33.2M
       CGroup: /system.slice/atomic-openshift-node.service
               ├─87405 /usr/bin/openshift start node --config=/etc/origin/node/node-config.yaml --loglevel=2
               └─87525 journalctl -k -f
     
    Jan 19 11:39:50 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:39:50.435645   87405 kubelet.go:3011] Recording NodeReady event message for node inaugurate-ocp-master1.b.b
    Jan 19 11:40:29 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: E0119 11:40:29.569936   87405 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"inaugurate-ocp-master1.b.b.149b2f667b50c988", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"5604", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"inaugurate-ocp-master1.b.b", UID:"inaugurate-ocp-master1.b.b", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeHasSufficientDisk", Message:"Node inaugurate-ocp-master1.b.b status is now: NodeHasSufficientDisk", Source:api.EventSource{Component:"kubelet", Host:"inaugurate-ocp-master1.b.b"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63620429419, nsec:0, loc:(*time.Location)(0x8398200)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63620440790, nsec:434969158, loc:(*time.Location)(0x8398200)}}, Count:27, Type:"Normal"}': 'client: etcd cluster is unavailable or misconfigured' (will not retry!)
    Jan 19 11:41:06 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: E0119 11:41:06.567832   87405 kubelet.go:3000] Error updating node status, will retry: Operation cannot be fulfilled on nodes "inaugurate-ocp-master1.b.b": the object has been modified; please apply your changes to the latest version and try again
    Jan 19 11:43:27 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:43:27.680004   87405 kubelet.go:3011] Recording NodeHasSufficientDisk event message for node inaugurate-ocp-master1.b.b
    Jan 19 11:43:27 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:43:27.682901   87405 kubelet.go:3011] Recording NodeReady event message for node inaugurate-ocp-master1.b.b
    Jan 19 11:44:20 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:44:20.283959   87405 container_manager_linux.go:284] Discovered runtime cgroups name: /system.slice/docker.service
    Jan 19 11:45:12 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: E0119 11:45:12.650682   87405 kubelet.go:3000] Error updating node status, will retry: Operation cannot be fulfilled on nodes "inaugurate-ocp-master1.b.b": the object has been modified; please apply your changes to the latest version and try again
    Jan 19 11:46:10 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:46:10.120912   87405 kubelet.go:3011] Recording NodeHasSufficientDisk event message for node inaugurate-ocp-master1.b.b
    Jan 19 11:46:11 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: I0119 11:46:10.306611   87405 kubelet.go:3011] Recording NodeReady event message for node inaugurate-ocp-master1.b.b
    Jan 19 11:47:00 inaugurate-ocp-master1.b.b atomic-openshift-node[87405]: E0119 11:46:56.167949   87405 event.go:198] Server rejected event '&api.Event{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:api.ObjectMeta{Name:"inaugurate-ocp-master1.b.b.149b3018053cd208", GenerateName:"", Namespace:"default", SelfLink:"", UID:"", ResourceVersion:"5651", Generation:0, CreationTimestamp:unversioned.Time{Time:time.Time{sec:0, nsec:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*unversioned.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]api.OwnerReference(nil), Finalizers:[]string(nil)}, InvolvedObject:api.ObjectReference{Kind:"Node", Namespace:"", Name:"inaugurate-ocp-master1.b.b", UID:"inaugurate-ocp-master1.b.b", APIVersion:"", ResourceVersion:"", FieldPath:""}, Reason:"NodeReady", Message:"Node inaugurate-ocp-master1.b.b status is now: NodeReady", Source:api.EventSource{Component:"kubelet", Host:"inaugurate-ocp-master1.b.b"}, FirstTimestamp:unversioned.Time{Time:time.Time{sec:63620430181, nsec:0, loc:(*time.Location)(0x8398200)}}, LastTimestamp:unversioned.Time{Time:time.Time{sec:63620441170, nsec:306668360, loc:(*time.Location)(0x8398200)}}, Count:27, Type:"Normal"}': 'client: etcd cluster is unavailable or misconfigured' (will not retry!)

Comment 3 James Olin Oden 2017-01-20 19:11:03 UTC
I did this same deployment again, and it did succeed, but if you looked at the openstack VM's they were both running at 100% or near their.   This was with compose: QCI-1.1-RHEL-7-20170118.t.1

Comment 4 jkim 2017-01-24 16:23:24 UTC
I was not able to reproduce this bug.  The following are the details of my deployment setup:

- RHV & OCP (1 engine & 1 Hypervisor)
- Engine : 4GB RAM, 30GB disk, 16 vCPU
- Hypervisor : 16GB Ram, 70GB disk, 16vCPU
- Satellite VM : 16GB Ram, 250GB disk, 16vCPU
- OCP Master Node : 8GB RAM, 30GB disk, 2vCPU (30GB for docker)
- OCP worker Node : 8GB RAM, 15GB disk, 1vCPU (30GB for docker)

NOTE: The host machine only has 16 cores, so the actual vCPU during the time of deployment may be varied depending on the VM's work load, and the libvirt scheduling.

Also, upon quick review of James Oden's setup, his VM's were set to a much higher RAM setting (50GB RAM each for Engine and Hypervisor).

Comment 5 James Olin Oden 2017-01-25 15:44:27 UTC
Seen this again, this time the failure was a different command:

    2017-01-24 20:50:42,305 p=21157 u=foreman |  
    fatal: [dirty-ocp-master1.b.b]: FAILED! => {
        "changed": true, 
        "cmd": ["systemctl","restart", "atomic-openshift-master.service"], 
        "delta": "0:01:31.783867", 
        "end": "2017-01-24 20:50:39.756826", 
        "failed": true, 
        "rc": 1, 
        "start": "2017-01-24 20:49:07.972959", 
        "stderr": "Job for atomic-openshift-master.service failed because a timeout was exceeded. See \"systemctl status atomic-openshift-master.service\" and \"journalctl -xe\" for details.", 
        "stdout": "", 
        "stdout_lines": [], 
        "warnings": []
   }

The OCP deploy subtask was now at 90%.   However the CPU utilization of the nodes is going between 80%-100%.

When I run top on one of the OCP nodes I'm seeing the openshift and etcd processes taking up most of the CPU.  Open shift is going from like 25% to sometimes 70% utilization.   I don't know what this looks like on a normal system though.

Comment 7 John Matthews 2017-01-26 20:02:36 UTC
We believe this issue is due to low available hardware resources.