Bug 1989205 - Pipelines operator shows unexpected behaviour when used with LimitRange
Summary: Pipelines operator shows unexpected behaviour when used with LimitRange
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenShift Pipelines
Classification: Red Hat
Component: pipelines
Version: 1.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 1.6
Assignee: Vincent Demeester
QA Contact: Ruchir Garg
Docs Contact: Robert Krátký
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-08-02 16:26 UTC by David Wilson
Modified: 2024-10-01 19:07 UTC
CC: 13 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-05-06 06:40:50 UTC
Target Upstream Version:
Embargoed:
dawilson: needinfo-


Attachments (Terms of Use)
Failure scenarios (48.12 KB, application/pdf)
2021-08-02 16:26 UTC, David Wilson

Description David Wilson 2021-08-02 16:26:14 UTC
Created attachment 1810219 [details]
Failure scenarios

Description of problem:

The Pipelines operator does not behave as expected when used with a LimitRange. The customer has performed extensive testing and opened a ticket documenting all the failure scenarios.

How reproducible:

I did not reproduce all the scenarios myself, but the results can be taken from the customer's comments.


Steps to Reproduce:

Please refer to the attached document for the failure scenarios.


Actual results:

The customer has to modify their existing LimitRange configuration to accommodate how the operator works, but then runs into issues with previously working applications in the same namespace, making it impossible for pipeline applications and normal applications to co-exist.


Expected results:

1. The operator should work without issues even if no Min value is set in the LimitRange.
2. Tasks should not fail when their consumption hits the Default set by the LimitRange.


Additional info: It would be helpful to know whether any workarounds exist for these issues, as the customer is impacted by this behaviour.

Comment 2 Vincent Demeester 2021-08-04 05:17:47 UTC
This is a duplicate of 1983045 (https://bugzilla.redhat.com/show_bug.cgi?id=1983045)

Comment 4 Vincent Demeester 2021-08-06 12:18:18 UTC
@dawilson There is no easy workaround for this except setting the request/limit on each Task by hand for now. I am actively working on this upstream (tektoncd/pipeline) so that we have much better support for LimitRanges in OpenShift Pipelines 1.6.
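For illustration, this is roughly what setting them by hand amounts to, expressed with the Kubernetes API types (the values below are made up, not recommendations):

package stepresources

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// explicitStepResources returns explicit requests/limits for one step of a
// Task. With values set on every step, Tekton no longer has to derive them,
// so the namespace LimitRange is satisfied deterministically.
func explicitStepResources() corev1.ResourceRequirements {
	return corev1.ResourceRequirements{
		Requests: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("100m"),
			corev1.ResourceMemory: resource.MustParse("128Mi"),
		},
		Limits: corev1.ResourceList{
			corev1.ResourceCPU:    resource.MustParse("500m"),
			corev1.ResourceMemory: resource.MustParse("512Mi"),
		},
	}
}

The downside, of course, is that these values have to be maintained on every step of every Task.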

Comment 5 Vincent Demeester 2021-08-06 13:30:30 UTC
Another workaround could be for the customer to "isolate" the namespace where the pipelines run from the namespace where their workloads run. That would allow the customer to tailor the LimitRange in the pipelines namespace without affecting the rest of their workload. It is probably a bit more achievable than having to set requests/limits for every task in the cluster; a sketch of such a LimitRange follows.
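As a sketch (the namespace name and all values are purely illustrative), the dedicated pipelines namespace would get its own LimitRange, e.g.:

package isolation

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// pipelinesLimitRange returns a LimitRange tailored for a namespace that
// only runs pipelines, so its values can be tuned for TaskRun pods without
// touching the LimitRange of the namespaces running regular workloads.
func pipelinesLimitRange() *corev1.LimitRange {
	return &corev1.LimitRange{
		ObjectMeta: metav1.ObjectMeta{Name: "pipelines-limits", Namespace: "ci-pipelines"},
		Spec: corev1.LimitRangeSpec{
			Limits: []corev1.LimitRangeItem{{
				Type: corev1.LimitTypeContainer,
				Min: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("10m"),
					corev1.ResourceMemory: resource.MustParse("32Mi"),
				},
				// Default limit applied to containers that set none.
				Default: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("500m"),
					corev1.ResourceMemory: resource.MustParse("512Mi"),
				},
				// Default request applied to containers that set none.
				DefaultRequest: corev1.ResourceList{
					corev1.ResourceCPU:    resource.MustParse("100m"),
					corev1.ResourceMemory: resource.MustParse("128Mi"),
				},
			}},
		},
	}
}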

For a bit more detail, I'll repeat some of what is written in https://bugzilla.redhat.com/show_bug.cgi?id=1983045 (which is very similar and will be fixed at the same time).

According to the documentation at https://tekton.dev/docs/pipelines/tasks/#defining-steps:
> The CPU, memory, and ephemeral storage resource requests will be set to zero (also known as BestEffort), or, if specified, the minimums set through LimitRanges in that Namespace, if the container image does not have the largest resource request out of all container images in the Task. This ensures that the Pod that executes the Task only requests enough resources to run a single container image in the Task rather than hoard resources for all container images in the Task at once.

 https://tekton.dev/docs/pipelines/taskruns/#specifying-limitrange-values:
> In order to only consume the bare minimum amount of resources needed to execute one Step at a time from the invoked Task, Tekton only requests the maximum values for CPU, memory, and ephemeral storage from within each Step. This is sufficient as Steps only execute one at a time in the Pod. Requests other than the maximum values are set to zero.
> 
> When a LimitRange parameter is present in the namespace in which TaskRuns are executing and minimum values are specified for container resource requests, Tekton searches through all LimitRange values present in the namespace and uses the minimums instead of 0.
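
In other words, a rough sketch of that documented behaviour (my own illustration, not the actual tektoncd/pipeline code): for each resource, the step with the largest request keeps it, and every other step gets the LimitRange minimum, or zero when no minimum is set.

package requestsketch

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// adjustStepRequests mimics the documented behaviour: steps run one at a
// time, so the pod only needs the largest request per resource. That step
// keeps its request; all others are reset to the LimitRange minimum (or
// zero when the LimitRange defines no minimum).
func adjustStepRequests(steps []corev1.Container, limitRangeMin corev1.ResourceList) {
	for _, res := range []corev1.ResourceName{corev1.ResourceCPU, corev1.ResourceMemory, corev1.ResourceEphemeralStorage} {
		// Find the step with the largest request for this resource.
		maxIdx := -1
		var largest resource.Quantity
		for i := range steps {
			if q, ok := steps[i].Resources.Requests[res]; ok && (maxIdx == -1 || q.Cmp(largest) > 0) {
				maxIdx, largest = i, q
			}
		}
		// Reset every other step to the LimitRange minimum (or zero).
		floor := resource.MustParse("0")
		if m, ok := limitRangeMin[res]; ok {
			floor = m
		}
		for i := range steps {
			if i == maxIdx {
				continue
			}
			if steps[i].Resources.Requests == nil {
				steps[i].Resources.Requests = corev1.ResourceList{}
			}
			steps[i].Resources.Requests[res] = floor
		}
	}
}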

It is working as documented, but as you (and the customer) noted, the behaviour is confusing, and Tekton Pipelines doesn't fully (or correctly) support LimitRange.

The way limits and requests work in Kubernetes assumes that all containers run in parallel (which they do, except in Tekton with some hack), while init containers run beforehand, one after another.

That assumption of running in parallel is not really true in Tekton. The containers do all start together (because there is no way around this) *but* the /entrypoint hack/ makes sure they actually run in sequence, so there is always only one container actually consuming resources at any given time.
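
A stripped-down sketch of the idea behind the /entrypoint hack/ (the real implementation is tektoncd/pipeline's entrypoint binary; the paths and arguments here are made up for illustration): every step container starts immediately, but its real command is wrapped by a small binary that blocks until the previous step writes a marker file on a shared volume.

package main

import (
	"fmt"
	"os"
	"os/exec"
	"time"
)

// A toy version of the entrypoint hack: all step containers start at once,
// but each one waits for the previous step's marker file before running its
// real command, so only one step consumes resources at a time.
func main() {
	if len(os.Args) < 4 {
		fmt.Fprintln(os.Stderr, "usage: entrypoint <wait-file> <done-file> <cmd> [args...]")
		os.Exit(2)
	}
	waitFile := os.Args[1] // e.g. a marker written by the previous step
	doneFile := os.Args[2] // marker this step writes when it finishes
	cmd, args := os.Args[3], os.Args[4:]

	// Block until the previous step signals completion.
	for {
		if _, err := os.Stat(waitFile); err == nil {
			break
		}
		time.Sleep(100 * time.Millisecond)
	}

	// Run the step's real command.
	c := exec.Command(cmd, args...)
	c.Stdout, c.Stderr = os.Stdout, os.Stderr
	if err := c.Run(); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}

	// Signal the next step.
	if err := os.WriteFile(doneFile, []byte("done"), 0o644); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}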

This means we need to handle limits, requests and LimitRanges in a /non-standard/ way. Let's try to define that. Tekton needs to take into account all the aspects of the LimitRange: the min/max as well as the default. If there is no default but there is a min/max, Tekton then needs to *set* a default value that is between the min and max. If we set the value too low, the Pod won't be able to be created; the same goes if we set the value too high. *But* those values are set on *containers*, so we *have to* do our own computation to know what request to put on each container. To add to the complexity, we also need to support =MaxLimitRequestRatio=, which just adds complexity on top of something already complex.
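
To make that concrete, here is a back-of-the-envelope sketch (mine, not the upstream patch) of picking a request for a single resource that satisfies min, max and =MaxLimitRequestRatio=:

package limitrange

import (
	"k8s.io/apimachinery/pkg/api/resource"
)

// defaultRequest picks a request value for one resource that the LimitRange
// will accept: at least the min, no more than the max, and, when
// MaxLimitRequestRatio is set, large enough that limit/request <= ratio.
// Any argument may be nil when the LimitRange does not set that field.
func defaultRequest(min, max, limit, ratio *resource.Quantity) resource.Quantity {
	req := *resource.NewMilliQuantity(0, resource.DecimalSI)
	if min != nil {
		req = *min // never go below the LimitRange minimum
	}
	if ratio != nil && limit != nil && ratio.Value() > 0 {
		// limit/request <= ratio  =>  request >= limit/ratio (rounded up)
		floor := (limit.MilliValue() + ratio.Value() - 1) / ratio.Value()
		if req.MilliValue() < floor {
			req = *resource.NewMilliQuantity(floor, resource.DecimalSI)
		}
	}
	if max != nil && req.Cmp(*max) > 0 {
		req = *max // clamp; if min > max the LimitRange itself is unsatisfiable
	}
	return req
}

And even that is only part of it: the computed values then have to be distributed across all the containers of the pod, which is where most of the real complexity lives.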

This has to be fixed upstream, and I am hoping I can get this in for tektoncd/pipeline 0.28 (which is going to be shipped as part of OpenShift Pipelines 1.6). Backporting this to 1.5 and 1.4 is going to be relatively hard and costly (because the upstream codebase keeps changing and we would have to maintain/adapt a patchset), and we need to see if the cost is worth it (especially given the support policy of OpenShift Pipelines: when 1.6 is out, 1.4 is not supported anymore, because OCP 4.7 isn't).

In any case, our priority right now is to get this fixed (better LimitRange support) in time for 1.6. Once we do that, we can think about, discuss, and evaluate whether it makes sense to backport.

Comment 6 David Wilson 2021-08-06 14:11:33 UTC
@Vincent: Thanks for the explanation. The customer is convinced as well.

Comment 7 Vincent Demeester 2021-08-30 12:22:53 UTC
For what it's worth, this should be handled by the following work-in-progress PR upstream: https://github.com/tektoncd/pipeline/pull/4176

Comment 9 Vincent Demeester 2022-05-06 06:40:50 UTC
Recent versions have the above pull request and change in them; I think we can close this bug.

