Bug 1380455 - Eliminate Scheduler hot-loop logging during failure/panic/roll-over conditions.
Summary: Eliminate Scheduler hot-loop logging during failure/panic/roll-over conditions.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Andy Goldstein
QA Contact: Mike Fiedler
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2016-09-29 16:08 UTC by Mike Fiedler
Modified: 2017-07-24 14:11 UTC (History)
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-04-12 19:06:54 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
sample yaml files (20.00 KB, application/x-tar)
2016-09-29 16:08 UTC, Mike Fiedler


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:0884 0 normal SHIPPED_LIVE Red Hat OpenShift Container Platform 3.5 RPM Release Advisory 2017-04-12 22:50:07 UTC

Description Mike Fiedler 2016-09-29 16:08:15 UTC
Created attachment 1205981 [details]
sample yaml files

Description of problem:

Master vertical scale-up of 500 nodes to 125 pause pods/node.   The scaleup script is aggressive: it creates 10 namespaces at a time, and each namespace has 4 DCs, 4 RCs, and 6 running pods.

After running for 2 hours, tons of the following messages started popping up in the master controllers log (messages_master1). The messages start around Sep 28 17:56 and really pick up around Sep 28 18:00:23:

Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237148   74228 priorities.go:39] Combined requested resources 3200 from existing pods exceeds capacity 1000 on node 192.1.3.97
Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237155   74228 priorities.go:39] Combined requested resources 6710886400 from existing pods exceeds capacity 3975618560 on node 192.1.3.97
Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237170   74228 priorities.go:39] Combined requested resources 1700 from existing pods exceeds capacity 1000 on node 192.1.1.247
Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237180   74228 priorities.go:39] Combined requested resources 2300 from existing pods exceeds capacity 1000 on node 192.1.3.46

The odd thing is that the dc/rc/pod have no resource requests defined - see the attached yaml.

At 18:02 the master controller fails over to master3.   Unfortunately rsyslog throttled the messages on master1 directly before failover.   I'm opening a bz to get that throttling window widened.   But it looks like this in the master1 log:

Sep 28 18:01:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:01:23.450298   74228 priorities.go:39] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node 192.1.2.45
Sep 28 18:01:23 svt-m-1 rsyslogd-2177: imjournal: begin to drop messages due to rate-limiting
Sep 28 18:08:52 svt-m-1 rsyslogd-2177: imjournal: 17351 messages lost due to rate-limiting

At this point master3 picks up as controller and you can see the activity in messages_master3 at Sep 28 18:02:47.   At 18:03:17 master3 starts getting the "exceeds capacity" messages.

From here it is off to the races: the master controller role continuously fails over between the cluster masters, never lasting very long on any one of them.   This went on until the morning of Sep 29, when things were shut down.


Configuration
- OpenShift 3.3.0.32 on OSP9
- 3 etcd (c4.8xlarge: 32 vCPU, 60GB RAM)
- 3 masters (c4.8xlarge as above)
- 1 master lb
- 2 router/registry
- 500 application nodes (1 vCPU, 4GB RAM)

Logs and yaml files for dc/rc/pod attached

Version-Release number of selected component (if applicable): 3.3.0.32


How reproducible: Unknown - 1 attempt so far.

Comment 4 Jan Chaloupka 2016-10-03 13:18:31 UTC
Mike,

does the scaleup script change the nodes' capacity? It seems to me it does. The scheduler caches each node's allocated resources (memory, CPU, GPU), calculated from the resources of all its pods. The allocated amount (approximate data in the cache) is compared with the capacity stored in the node status reported by each node (real data). In this case the recently reported capacities are:

node 192.1.3.97
capacity: 1cpu, 3975618560B of memory

node 192.1.1.247
capacity: 1cpu

node 192.1.3.46
capacity: 1cpu

My theory:
1) each node is given some capacity
2) pods are scheduled and the allocated resources consume more than 50% of capacity
3) capacities of nodes are decreased to half of their amount
4) the scheduler receives new nodes statuses with decreased capacities
5) the scheduler (priority function) starts complaining about exceeded capacity

Can you check the initial capacity of each node to see if the capacities changed? Can you check whether some nodes are under memory pressure (kubectl describe node)?
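
For illustration, here is a minimal sketch of the kind of check that produces the "exceeds capacity" lines. This is not the actual priorities.go code, and the default requests substituted for pods that define none are assumed values; if such defaults apply, even request-less pause pods would accumulate a nonzero requested total against each node:

package main

import "log"

const (
    // Assumed defaults used when a pod declares no requests (illustrative values).
    defaultMilliCPU = 100       // 100 millicores per request-less pod
    defaultMemory   = 200 << 20 // 200 MiB per request-less pod, in bytes
)

type podRequest struct {
    milliCPU int64
    memory   int64
}

// nonzeroRequest substitutes defaults so BestEffort pods still count against a node.
func nonzeroRequest(r podRequest) podRequest {
    if r.milliCPU == 0 {
        r.milliCPU = defaultMilliCPU
    }
    if r.memory == 0 {
        r.memory = defaultMemory
    }
    return r
}

// scoreResource logs whenever the summed requests exceed the node's capacity.
// Run once per resource per node inside a hot scheduling loop, this is what
// floods the journal.
func scoreResource(nodeName string, capacity, requested int64) int64 {
    if requested > capacity {
        log.Printf("Combined requested resources %d from existing pods exceeds capacity %d on node %s",
            requested, capacity, nodeName)
        return 0
    }
    return (capacity - requested) * 10 / capacity
}

func main() {
    pods := make([]podRequest, 32) // 32 pods with no requests on one node
    var cpu, mem int64
    for _, p := range pods {
        r := nonzeroRequest(p)
        cpu += r.milliCPU
        mem += r.memory
    }
    scoreResource("192.1.3.97", 1000, cpu)       // 3200 > 1000
    scoreResource("192.1.3.97", 3975618560, mem) // 6710886400 > 3975618560
}

With 32 such pods the totals come out to 3200 millicores and 6710886400 bytes, the same figures as in the log above - assuming those defaults hold.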

Comment 5 Jan Chaloupka 2016-10-03 14:02:57 UTC
To be more precise, the scheduler cache is populated with all pods that are scheduled or assumed to be scheduled. Once any scheduled pod is terminated, it is eventually removed from the cache and the node's total allocated amount is decreased.

So if my theory is correct, the total allocated amount on each node drops back below capacity once enough pods are evicted.

In general, the scheduler cache does not magically add up additional resources, so there must be some other hidden pods that consume resources.
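
A minimal sketch of the cache bookkeeping I am describing (illustrative only - the type and function names are not the real scheduler cache API):

package main

import "fmt"

type podRequest struct {
    milliCPU int64
    memory   int64
}

type nodeInfo struct {
    requestedMilliCPU int64
    requestedMemory   int64
    pods              map[string]podRequest
}

type schedulerCache struct {
    nodes map[string]*nodeInfo
}

// addPod accounts for a pod against its node: both pods the scheduler assumes
// it has placed and pods it later observes as bound (e.g. via the Apiserver).
func (c *schedulerCache) addPod(node, pod string, req podRequest) {
    ni := c.nodes[node]
    if ni == nil {
        ni = &nodeInfo{pods: map[string]podRequest{}}
        c.nodes[node] = ni
    }
    if _, ok := ni.pods[pod]; ok {
        return // already accounted for
    }
    ni.pods[pod] = req
    ni.requestedMilliCPU += req.milliCPU
    ni.requestedMemory += req.memory
}

// removePod is called when a scheduled pod terminates, so the node's total
// requested amount eventually drops back down.
func (c *schedulerCache) removePod(node, pod string) {
    ni := c.nodes[node]
    if ni == nil {
        return
    }
    if req, ok := ni.pods[pod]; ok {
        delete(ni.pods, pod)
        ni.requestedMilliCPU -= req.milliCPU
        ni.requestedMemory -= req.memory
    }
}

func main() {
    c := &schedulerCache{nodes: map[string]*nodeInfo{}}
    c.addPod("192.1.3.97", "pausepods1", podRequest{milliCPU: 100, memory: 200 << 20})
    c.removePod("192.1.3.97", "pausepods1")
    fmt.Println(c.nodes["192.1.3.97"].requestedMilliCPU) // back to 0 after removal
}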

Comment 6 Mike Fiedler 2016-10-03 14:27:55 UTC
The script does not change node capacity.   The script's only job is to continuously create new projects based on a template until the desired number of projects is reached.   The template is here:   https://github.com/openshift/svt/blob/master/openshift_scalability/config/pyconfigMasterVirtScalePause.yaml

There is no modification of capacity.

Your description of 192.1.3.97 in comment 4 is accurate.   Each node has 1 CPU and 4GB RAM

Comment 9 Jan Chaloupka 2016-10-03 15:17:57 UTC
Another way to smuggle pods into the scheduler cache and increase the total amount of CPU/memory consumed is syncing with the Apiserver. Each time a new (and not terminated) pod is assigned to a node (which can happen outside of the scheduler) and the assignment is reflected in the Apiserver, the pod is added to the scheduler cache.

Is the issue always reproducible? Can you provide a list of all pods (kubectl describe pod POD) that are currently not terminated and running on some affected node?

Comment 12 Mike Fiedler 2016-10-04 15:06:10 UTC
Jan,  I ran a test - not sure if it answers comment 9 the way you expected - if not, let me know what to modify.

Unlike the original issue, this is on 3.4, not 3.3.   I can re-run on 3.3 if requested.

- Cluster with 2 small application nodes (1 core, 3.7 GB RAM)
- ran the cluster loader script requesting 20 pods/node without specifying any resource requests (as in original issue)
- Immediately, messages like this started popping up in the master log:

Oct  4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.569915   95355 priorities.go:51] Combined requested resources 4194304000 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-185.us-west-2.compute.internal
Oct  4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.569948   95355 priorities.go:51] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node ip-172-31-17-186.us-west-2.compute.internal
Oct  4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.569952   95355 priorities.go:51] Combined requested resources 4194304000 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-186.us-west-2.compute.internal
Oct  4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817014   95355 priorities.go:51] Combined requested resources 2100 from existing pods exceeds capacity 1000 on node ip-172-31-17-186.us-west-2.compute.internal
Oct  4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817033   95355 priorities.go:51] Combined requested resources 4404019200 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-186.us-west-2.compute.internal
Oct  4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817038   95355 priorities.go:51] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node ip-172-31-17-185.us-west-2.compute.internal
Oct  4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817042   95355 priorities.go:51] Combined requested resources 4194304000 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-185.us-west-2.compute.internal

- All 40 pods (20/node) started successfully and entered Running state
- oc describe pod of one of the pods after it starts:

root@ip-172-31-26-26: ~/svt/openshift_scalability # oc describe pod -n clusterproject0 pausepods39
Name:                   pausepods39
Namespace:              clusterproject0
Security Policy:        anyuid
Node:                   ip-172-31-17-185.us-west-2.compute.internal/172.31.17.185
Start Time:             Tue, 04 Oct 2016 10:57:20 -0400
Labels:                 name=pause-amd64
Status:                 Running
IP:                     172.20.3.21
Controllers:            <none>
Containers:
  pause-amd64:
    Container ID:       docker://f94e3c32d57c835f7be588219bb3dc696a2f0bb3e8e8a3144c6a4df107d6aa3a
    Image:              gcr.io/google_containers/pause-amd64:3.0
    Image ID:           docker://sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2
    Port:               8080/TCP
    State:              Running
      Started:          Tue, 04 Oct 2016 10:57:59 -0400
    Ready:              True
    Restart Count:      0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ubv9y (ro)
    Environment Variables:      <none>
Conditions:
  Type          Status
  Initialized   True 
  Ready         True 
  PodScheduled  True 
Volumes:
  default-token-ubv9y:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-ubv9y
QoS Class:      BestEffort
Tolerations:    <none>
Events:
  FirstSeen     LastSeen        Count   From                                                    SubobjectPath                   Type         Reason           Message
  ---------     --------        -----   ----                                                    -------------                   --------     ------           -------
  2m            2m              1       {default-scheduler }                                                                    Normal       Scheduled        Successfully assigned pausepods39 to ip-172-31-17-185.us-west-2.compute.internal
  1m            1m              1       {kubelet ip-172-31-17-185.us-west-2.compute.internal}   spec.containers{pause-amd64}    Normal       Pulled           Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
  1m            1m              1       {kubelet ip-172-31-17-185.us-west-2.compute.internal}   spec.containers{pause-amd64}    Normal       Created          Created container with docker id f94e3c32d57c; Security:[seccomp=unconfined]
  1m            1m              1       {kubelet ip-172-31-17-185.us-west-2.compute.internal}   spec.containers{pause-amd64}    Normal       Started          Started container with docker id f94e3c32d57c

Comment 13 Mike Fiedler 2016-10-04 15:07:13 UTC
kubelet config on these nodes is max-pods = 250 and pods-per-node = 0

Comment 20 Timothy St. Clair 2017-01-18 15:53:38 UTC
We've made a number of efforts to turn down the noise during failure scenarios.

In 1.5/3.5 we've analyzed all the hot-loop logging paths and turned them down.

In 1.6 we've aggregated summary information into a single event and updated the pod message with the same overall summary.

The primary objective of this original issue was to avoid spamming the logs to the point of log-clipping when failure conditions occur; those modifications have been made upstream.

We should probably stress the capacity dimensions again and see if we can recreate the failover conditions.
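
A rough illustration of the aggregation approach (not the actual upstream patch): collect per-node failure reasons during one scheduling attempt and emit a single summary event/pod message at the end, instead of one log line per node inside the hot loop.

package main

import (
    "fmt"
    "sort"
    "strings"
)

// failureSummary accumulates per-node failure reasons for one scheduling attempt.
type failureSummary struct {
    reasons map[string]int // reason -> number of nodes that failed with it
}

func newFailureSummary() *failureSummary {
    return &failureSummary{reasons: map[string]int{}}
}

func (s *failureSummary) record(reason string) {
    s.reasons[reason]++
}

// String renders one line such as:
// "pod failed to fit on 500 nodes: Insufficient cpu (498), Insufficient memory (2)"
func (s *failureSummary) String() string {
    total := 0
    parts := make([]string, 0, len(s.reasons))
    for reason, count := range s.reasons {
        total += count
        parts = append(parts, fmt.Sprintf("%s (%d)", reason, count))
    }
    sort.Strings(parts)
    return fmt.Sprintf("pod failed to fit on %d nodes: %s", total, strings.Join(parts, ", "))
}

func main() {
    s := newFailureSummary()
    for i := 0; i < 498; i++ {
        s.record("Insufficient cpu")
    }
    s.record("Insufficient memory")
    s.record("Insufficient memory")
    // One summary event/pod message instead of hundreds of per-node log lines.
    fmt.Println(s)
}

One summary per scheduling attempt keeps the journal readable even when hundreds of nodes reject a pod.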

Comment 21 Troy Dawson 2017-01-27 17:50:04 UTC
This should be the way you want it in OCP v3.5.0.10 or newer.

Comment 22 Mike Fiedler 2017-02-07 16:21:25 UTC
Preparing cluster horizontal run on 3.5.0.17 - will watch for this issue.

Comment 24 Mike Fiedler 2017-03-14 12:24:24 UTC
Verified on 3.5.0.50.   Forced several master-api failovers during a cluster horizontal scale-up and the error logging was informative and useful.   No "rolling" (multiple sequential) failovers were seen.   Namespace and deployment creation continued successfully.

Comment 26 errata-xmlrpc 2017-04-12 19:06:54 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:0884

