Bug 1752557 - [vSphere][UPI] Build tests fail with insufficient resources
Summary: [vSphere][UPI] Build tests fail with insufficient resources
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Build
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.2.0
Assignee: Ben Parees
QA Contact: wewang
URL:
Whiteboard:
Duplicates: 1744410 1753221
Depends On:
Blocks:
 
Reported: 2019-09-16 15:50 UTC by sheng.lao
Modified: 2019-10-16 06:41 UTC
CC List: 6 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-10-16 06:41:11 UTC
Target Upstream Version:
Embargoed:




Links:
  GitHub openshift/origin pull 23825 (closed): bug 1752557: bump s2i quota test memory limit to avoid oomkills (last updated 2020-02-11 13:49:12 UTC)
  Red Hat Product Errata RHBA-2019:2922 (last updated 2019-10-16 06:41:23 UTC)

Description sheng.lao 2019-09-16 15:50:05 UTC
Description of problem:
The following conformance test fails:

[Feature:Builds][Conformance] s2i build with a quota Building from a template should create an s2i build with a quota and run it [Suite:openshift/conformance/parallel/minimal]


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Ben Parees 2019-09-16 20:22:40 UTC
If this represents a CI job failure, please link to the failing job in the BZ report; otherwise there's not enough information here to investigate.

Comment 3 Adam Kaplan 2019-09-17 13:29:41 UTC
Tests are failing with OOMKilled. On AWS these tests pass because there are no default resource limits on pods. It looks like these vSphere tests do set a fairly low default limit for pods (60m cpu, 200Mi memory).

Either the clusters need to be configured with higher default resource limits, or the build tests need to set reasonable resource limits.
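
If a cluster were applying a default like that, it would typically come from a LimitRange in the test namespace. A minimal sketch of one that would produce the limits quoted above (the object name and namespace here are hypothetical; the values are the ones reported in this comment):

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: default-limits        # hypothetical name
      namespace: e2e-test-builds  # hypothetical namespace
    spec:
      limits:
      - type: Container
        default:                  # limits applied to containers that declare none
          cpu: 60m
          memory: 200Mi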

Comment 4 Steve Kuznetsov 2019-09-17 14:29:31 UTC
This is a failure in how the installer is setting limits on Pods in the cluster under test, not part of the test infrastructure.

Comment 5 Abhinav Dahiya 2019-09-17 21:15:51 UTC
It looks like the test pod running in the test cluster is failing due to insufficient memory.

1) The test-case owners should probably set the correct resource requirements.

2) Also, Adam pointed out that the resource limits were being set on the pod for vSphere but not AWS, which is strange.

In both cases this doesn't seem like something the installer is responsible for controlling.

Comment 6 Ben Parees 2019-09-17 21:25:45 UTC
> 1) The test-case owners should probably set the correct resource requirements.

Ideally yes; however, the test sets no limit, so it has a (reasonable) expectation that no limit will be applied. That is what happens on other platforms, so it is confusing that on vSphere the pod is getting a default limit applied to it.

I don't know what would be responsible for setting up a default limit in the cluster; my theories would be:

1) installer (but you say it's not, so we can cross that off the list)
2) the CI job itself as part of installing the cluster
3) ????


So yeah, devex can fix the test(s) to set explicit limits on the buildconfig (which in turn will force a limit onto the pod and presumably avoid the default), but it would be nice to understand why our vSphere clusters behave differently from our other clusters.

> 2) Also, Adam pointed out that the resource limits were being set on the pod for vSphere but not AWS, which is strange.

Right, that's what I think we should strive to understand before paving over the problem by having the test set explicit limits.

But if this is not something the installer is doing explicitly for vSphere installs, we should certainly move the bug off the installer component.

Comment 7 Ben Parees 2019-09-17 21:33:05 UTC
Adam, the test in question does set limits on the build:

https://github.com/openshift/origin/blob/master/test/extended/testdata/builds/test-s2i-build-quota.json#L11-L16

So that's where the limit came from.
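
For readers without the repo open, the stanza in that file is the standard build resources block. A sketch of its shape, written as YAML for brevity (the actual testdata file is JSON; the values here are the 60m/200Mi limits quoted earlier in this bug, see the linked JSON for the real ones):

    spec:
      resources:
        limits:
          cpu: 60m        # per comments 3 and 9
          memory: 200Mi

As noted in comment 6, limits set on the buildconfig are forced onto the build pod, which explains the limit seen on the pod.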

As for the OOM killing, presumably that was a node-level OOM kill, and hopefully the linked PR, which increases the node memory resources, will avoid this problem in the future.

Comment 8 Abhinav Dahiya 2019-09-18 00:15:43 UTC
We have the same test failing on metal UPI jobs too:

[Feature:Builds][Conformance] s2i build with a quota Building from a template should create an s2i build with a quota and run it [Suite:openshift/conformance/parallel/minimal]

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-metal-4.2/126
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-metal-4.2/125

So it probably is not a vSphere-specific problem.

And the metal compute nodes have a lot of memory (`32140300Ki`): https://storage.googleapis.com/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-metal-4.2/125/artifacts/e2e-metal/nodes.json

Comment 9 Ben Parees 2019-09-18 01:51:19 UTC
Not the same... the metal failure shows that the pod ended in an "error", not OOMKilled.

Metal "error" build pod (see "reason" at the end):
    containerStatuses:
    - containerID: cri-o://d7104ee666f588a71ed937f0dc12f673a585c8d637b0127dd4df6e243d2305b9
      image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b8e96ac446f78ffcbe7852627bd13c108ea690d1bde0768c776e63061d88a25
      imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b8e96ac446f78ffcbe7852627bd13c108ea690d1bde0768c776e63061d88a25
      lastState: {}
      name: sti-build
      ready: false
      restartCount: 0
      state:
        terminated:
          containerID: cri-o://d7104ee666f588a71ed937f0dc12f673a585c8d637b0127dd4df6e243d2305b9
          exitCode: 2
          finishedAt: "2019-09-17T15:43:48Z"
          message: |
            AfterStart:[]io.Closer(nil), closeAfterWait:[]io.Closer(nil), goroutine:[]func() error(nil), errch:(chan error)(nil), waitDone:(chan struct {})(nil)} in &unshare.Cmd{Cmd:(*exec.Cmd)(0xc0002cef20), UnshareFlags:67239936, UseNewuidmap:false, UidMappings:[]specs.LinuxIDMapping(nil), UseNewgidmap:false, GidMappings:[]specs.LinuxIDMapping(nil), GidMappingsEnableSetgroups:false, Setsid:false, Setpgrp:false, Ctty:(*os.File)(nil), OOMScoreAdj:(*int)(nil), Hook:(func(int) error)(0x13aff00)}"
            time="2019-09-17T15:43:47Z" level=debug msg="chrooted into \"/tmp/buildah815075268/mnt/rootfs\""
            time="2019-09-17T15:43:47Z" level=debug msg="setting apparmor profile"
            time="2019-09-17T15:43:47Z" level=debug msg="setting resource limits"
            time="2019-09-17T15:43:47Z" level=debug msg="changed working directory to \"/\""
            time="2019-09-17T15:43:47Z" level=debug msg="setting supplemental groups"
            time="2019-09-17T15:43:47Z" level=debug msg="setting gid"
            time="2019-09-17T15:43:47Z" level=debug msg="setting capabilities"
            time="2019-09-17T15:43:47Z" level=debug msg="setting uid"
            time="2019-09-17T15:43:47Z" level=debug msg="Running &exec.Cmd{Path:\"/bin/sh\", Args:[]string{\"/bin/sh\", \"-c\", \"chown -R 1001:0 /tmp/scripts /tmp/src\"}, Env:[]string{\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\", \"BUILD_LOGLEVEL=6\", \"OPENSHIFT_BUILD_NAMESPACE=e2e-test-s2i-build-quota-hnx6x\", \"OPENSHIFT_BUILD_NAME=s2i-build-quota-1\", \"HOSTNAME=0758d4b7ecd2\", \"HOME=/root\"}, Dir:\"/\", Stdin:(*os.File)(0xc00000e010), Stdout:(*os.File)(0xc00000e018), Stderr:(*os.File)(0xc00000e020), ExtraFiles:[]*os.File(nil), SysProcAttr:(*syscall.SysProcAttr)(nil), Process:(*os.Process)(nil), ProcessState:(*os.ProcessState)(nil), ctx:context.Context(nil), lookPathErr:error(nil), finished:false, childFiles:[]*os.File(nil), closeAfterStart:[]io.Closer(nil), closeAfterWait:[]io.Closer(nil), goroutine:[]func() error(nil), errch:(chan error)(nil), waitDone:(chan struct {})(nil)} (PATH = \"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\")"
          reason: Error




As compared with the OOMKilled vsphere pod:

    containerStatuses:
    - containerID: cri-o://17567dbc5a00a5075ed912064903f22ca005962b762240e5917c5a51f129851b
      image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b8e96ac446f78ffcbe7852627bd13c108ea690d1bde0768c776e63061d88a25
      imageID: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b8e96ac446f78ffcbe7852627bd13c108ea690d1bde0768c776e63061d88a25
      lastState: {}
      name: sti-build
      ready: false
      restartCount: 0
      state:
        terminated:
          containerID: cri-o://17567dbc5a00a5075ed912064903f22ca005962b762240e5917c5a51f129851b
          exitCode: 2
          finishedAt: "2019-09-16T19:05:52Z"
          message: |
            os.File(nil), closeAfterStart:[]io.Closer(nil), closeAfterWait:[]io.Closer(nil), goroutine:[]func() error(nil), errch:(chan error)(nil), waitDone:(chan struct {})(nil)} in &unshare.Cmd{Cmd:(*exec.Cmd)(0xc000b86580), UnshareFlags:67239936, UseNewuidmap:false, UidMappings:[]specs.LinuxIDMapping(nil), UseNewgidmap:false, GidMappings:[]specs.LinuxIDMapping(nil), GidMappingsEnableSetgroups:false, Setsid:false, Setpgrp:false, Ctty:(*os.File)(nil), OOMScoreAdj:(*int)(nil), Hook:(func(int) error)(0x13aff00)}"
            time="2019-09-16T19:03:41Z" level=debug msg="chrooted into \"/tmp/buildah827837120/mnt/rootfs\""
            time="2019-09-16T19:03:41Z" level=debug msg="setting apparmor profile"
            time="2019-09-16T19:03:41Z" level=debug msg="setting resource limits"
            time="2019-09-16T19:03:41Z" level=debug msg="changed working directory to \"/\""
            time="2019-09-16T19:03:41Z" level=debug msg="clearing supplemental groups"
            time="2019-09-16T19:03:41Z" level=debug msg="setting gid"
            time="2019-09-16T19:03:41Z" level=debug msg="setting capabilities"
            time="2019-09-16T19:03:41Z" level=debug msg="setting uid"
            time="2019-09-16T19:03:41Z" level=debug msg="Running &exec.Cmd{Path:\"/bin/sh\", Args:[]string{\"/bin/sh\", \"-c\", \"/tmp/scripts/assemble\"}, Env:[]string{\"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\", \"BUILD_LOGLEVEL=6\", \"OPENSHIFT_BUILD_NAMESPACE=e2e-test-s2i-build-quota-9kd2d\", \"OPENSHIFT_BUILD_NAME=s2i-build-quota-1\", \"HOSTNAME=57ef2ea43836\", \"HOME=/\"}, Dir:\"/\", Stdin:(*os.File)(0xc00000e010), Stdout:(*os.File)(0xc00000e018), Stderr:(*os.File)(0xc00000e020), ExtraFiles:[]*os.File(nil), SysProcAttr:(*syscall.SysProcAttr)(nil), Process:(*os.Process)(nil), ProcessState:(*os.ProcessState)(nil), ctx:context.Context(nil), lookPathErr:error(nil), finished:false, childFiles:[]*os.File(nil), closeAfterStart:[]io.Closer(nil), closeAfterWait:[]io.Closer(nil), goroutine:[]func() error(nil), errch:(chan error)(nil), waitDone:(chan struct {})(nil)} (PATH = \"/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin\")"
          reason: OOMKilled



Now, why did the metal build pod end with an "error"? Not clear to me: the build log ends somewhat abruptly, so either the container was terminated by the kubelet for some other reason, or perhaps the build container hit the 200Mi container memory limit and died.

We can certainly try raising the 200Mi limit being set in this test, as I believe its purpose is simply to ensure we can set limits, not to set them specifically low. And now that we are running buildah inside the container, it's possible we are closer to the limit than we used to be (though that has yet to be seen as an issue anywhere else). The events for the two pods do look the same:

OOMKilled:
  Normal	Created		3m42s	kubelet, compute-2	Created container sti-build
  Normal	Started		3m42s	kubelet, compute-2	Started container sti-build
  Normal	Killing		8s	kubelet, compute-2	Stopping container sti-build
  Normal	BuildFailed	7s	build-controller	Build e2e-test-s2i-build-quota-9kd2d/s2i-build-quota-1 failed



Error:
  Normal	Created		2m6s	kubelet, worker-2.ci-op-430d20kq-18ac7.origin-ci-int-aws.dev.rhcloud.com	Created container sti-build
  Normal	Started		2m6s	kubelet, worker-2.ci-op-430d20kq-18ac7.origin-ci-int-aws.dev.rhcloud.com	Started container sti-build
  Normal	Killing		8s	kubelet, worker-2.ci-op-430d20kq-18ac7.origin-ci-int-aws.dev.rhcloud.com	Stopping container sti-build
  Normal	BuildFailed	2s	build-controller								Build e2e-test-s2i-build-quota-hnx6x/s2i-build-quota-1 failed


Regardless, I think the metal failure should be tracked as a separate bug.

Comment 10 Abhinav Dahiya 2019-09-18 05:42:54 UTC
Found OOMKilled on GCP for this test.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-gcp-4.2/319

Even on GCP this test has limits and requests defined.

Comment 11 Abhinav Dahiya 2019-09-18 05:45:01 UTC
Hmm, seeing the OOMKilled on AWS too, with requests and limits set:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-4.2/546

Comment 12 Ben Parees 2019-09-18 11:48:30 UTC
> Hmm, seeing the OOMKilled on AWS too, with requests and limits set.

Yeah, per my comment 7, the limits being set are expected for this test.

As for the OOM killing, I can't recall exactly how "a process within the container hit the container's memory limit" is reported, but I think it may manifest as an OOM kill. That would reinforce my suspicion that this test simply needs its memory limit raised: it needs more memory than it used to now that buildah runs inside the container, unlike how builds worked in 3.x.
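
The fix would just raise the memory value in that test's resources block; a sketch of the shape of the change (the new number below is purely illustrative, the actual value is whatever the eventual PR settles on):

    spec:
      resources:
        limits:
          cpu: 60m
          memory: 1Gi   # illustrative bump from 200Mi; see the PR linked in the next comment for the real value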

I'll submit a PR.

Comment 13 Ben Parees 2019-09-18 11:56:26 UTC
https://github.com/openshift/origin/pull/23825

Comment 14 Adam Kaplan 2019-09-18 12:47:39 UTC
Moving to Builds, as I think we can now rule this out as an installer issue.

@Ben regarding `Error` vs `OOMKilled`: the status reporting on this front is known to be flaky. We have a separate `OOMKilled` test case for builds that does not pass 100% of the time on AWS (I don't know the stats, but it's already been marked flaky).

Comment 15 Adam Kaplan 2019-09-18 12:48:36 UTC
*** Bug 1753221 has been marked as a duplicate of this bug. ***

Comment 16 Adam Kaplan 2019-09-19 01:36:34 UTC
*** Bug 1744410 has been marked as a duplicate of this bug. ***

Comment 17 Adam Kaplan 2019-09-19 01:37:56 UTC
Note for QE - we need to verify this bug on all cloud provider platforms, as this is blocking CI on some:

1. AWS
2. Azure
3. GCP
4. vSphere
5. Baremetal
6. OpenStack

Comment 18 Ben Parees 2019-09-19 01:39:58 UTC
QE doesn't really have a way to verify e2e test fixes (or, if nothing else, they can't do it any better than we can). I'd expect us to verify that ourselves.

Comment 20 David Eads 2019-09-19 12:34:04 UTC
Grrr... late OCP strikes again. The OCP canaries were running 10-hour-old code.

Comment 23 wewang 2019-09-23 08:40:59 UTC
Because the latest vSphere env setup failed (https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-vsphere-upi-4.2/175), I checked the older payload 4.2.0-0.nightly-2019-09-20-202209 instead; it passed in the vSphere env: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/canary-openshift-ocp-installer-e2e-vsphere-upi-4.2/160

Comment 24 errata-xmlrpc 2019-10-16 06:41:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2922

