Created attachment 1205981 [details]
sample yaml files

Description of problem:

Master vertical scale-up of 500 nodes to 125 pause pods/node. The scale-up script is aggressive - it creates 10 namespaces at a time; each namespace has 4 DCs, 4 RCs and 6 running pods.

After running for 2 hours, a flood of the following messages started appearing in the master controllers log (messages_master1 log - messages start around Sep 28 17:56 and really pick up around Sep 28 18:00:23):

Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237148 74228 priorities.go:39] Combined requested resources 3200 from existing pods exceeds capacity 1000 on node 192.1.3.97
Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237155 74228 priorities.go:39] Combined requested resources 6710886400 from existing pods exceeds capacity 3975618560 on node 192.1.3.97
Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237170 74228 priorities.go:39] Combined requested resources 1700 from existing pods exceeds capacity 1000 on node 192.1.1.247
Sep 28 18:00:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:00:23.237180 74228 priorities.go:39] Combined requested resources 2300 from existing pods exceeds capacity 1000 on node 192.1.3.46

The odd thing is, the dc/rc/pod have no resource requests defined - see attached yaml.

At 18:02 the master controller fails over to master3. Unfortunately rsyslog throttled the messages on master1 directly before the failover; I'm opening a bz to get that throttling window widened. It looks like this in the master1 log:

Sep 28 18:01:23 svt-m-1 atomic-openshift-master-controllers: I0928 18:01:23.450298 74228 priorities.go:39] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node 192.1.2.45
Sep 28 18:01:23 svt-m-1 rsyslogd-2177: imjournal: begin to drop messages due to rate-limiting
Sep 28 18:08:52 svt-m-1 rsyslogd-2177: imjournal: 17351 messages lost due to rate-limiting

At this point master3 picks up as controller and you can see the activity in messages_master3 at Sep 28 18:02:47. At 18:03:17 master3 starts getting the "exceeds capacity" messages. From here it is off to the races: the master controller continuously fails over between cluster masters, never lasting very long. This went on until the AM of Sep 29, when things were shut down.

Configuration - OpenShift 3.3.0.32 on OSP9:
- 3 etcd (c4.8xlarge: 32 vCPU, 60GB RAM)
- 3 masters (c4.8xlarge as above)
- 1 master lb
- 2 router/registry
- 500 application nodes (1 vCPU, 4GB RAM)

Logs and yaml files for dc/rc/pod attached.

Version-Release number of selected component (if applicable):
3.3.0.32

How reproducible:
Unknown - 1 attempt so far.
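(For reference, the rsyslog change being requested in that other bz is just a wider imjournal rate-limiting window. A minimal sketch, assuming the stock RHEL 7 rsyslog config with the imjournal module; the values below are placeholders for illustration, not the ones actually filed:)

    # /etc/rsyslog.conf - illustrative values only
    $ModLoad imjournal
    $IMJournalStateFile imjournal.state
    $imjournalRatelimitInterval 300
    $imjournalRatelimitBurst 50000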
Mike, does the scaleup script change the nodes' capacity? It seems to me it does. The scheduler caches each node's allocated resources (mem, cpu, gpu), calculated from the resources of all pods. That allocated amount (approximate data in the cache) is compared with the capacity stored in the node status reported by each node (real data). In this case the recently reported capacities are:

node 192.1.3.97 capacity: 1 cpu, 3975618560B of memory
node 192.1.1.247 capacity: 1 cpu
node 192.1.3.46 capacity: 1 cpu

My theory:
1) each node is given some capacity
2) pods are scheduled and the allocated resources consume more than 50% of that capacity
3) the capacities of the nodes are decreased to half their original amount
4) the scheduler receives new node statuses with the decreased capacities
5) the scheduler (priority function) starts complaining about exceeded capacity

Can you check the initial capacity of each node to see if it changed? Can you check if some nodes are under memory pressure (kubectl describe node)?
To be more precise, the scheduler cache is populated with all pods that are scheduled or assumed to be scheduled. Once a scheduled pod is terminated, it is eventually removed from the cache and the total allocated amount is decreased. So if my theory is correct, the total allocated amount on each node should drop back below capacity once enough pods are evicted. In general, the scheduler cache does not magically add extra resources, so there must be some other hidden pods consuming resources.
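For context, a simplified sketch of where that message comes from - a paraphrase of the upstream least-requested priority, not the exact 3.3 source. The scheduler sums the requests of the pods it has cached for the node and compares that sum with the capacity from the most recent node status:

    package main

    import "fmt"

    // leastRequestedScore is a simplified paraphrase of the upstream
    // least-requested priority: requested is the sum of resource requests of
    // the pods the scheduler cache has assigned to the node; capacity comes
    // from the node status reported by the kubelet.
    func leastRequestedScore(requested, capacity int64) int64 {
        if capacity == 0 {
            return 0
        }
        if requested > capacity {
            // This is the condition behind the logged message:
            // "Combined requested resources <requested> from existing pods
            //  exceeds capacity <capacity> on node <name>"
            return 0
        }
        return ((capacity - requested) * 10) / capacity
    }

    func main() {
        // Values taken from the log lines above for node 192.1.3.97.
        fmt.Println(leastRequestedScore(3200, 1000))             // cpu (millicores) -> score 0
        fmt.Println(leastRequestedScore(6710886400, 3975618560)) // memory (bytes)   -> score 0
    }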
The script does not change node capacity. The script's only job is to continuously create new projects based on a template until the desired number of projects is reached. The template is here: https://github.com/openshift/svt/blob/master/openshift_scalability/config/pyconfigMasterVirtScalePause.yaml There is no modification of capacity. Your description of 192.1.3.97 in comment 4 is accurate. Each node has 1 CPU and 4GB RAM
Another way pods can get into the scheduler cache and increase the total amount of CPU/memory accounted for is the sync with the apiserver. Each time a new (and not terminated) pod is assigned to a node (which can happen outside of the scheduler) and the assignment is reflected in the apiserver, the pod is added to the scheduler cache. Is the issue always reproducible? Can you provide a list of all pods (kubectl describe pod POD) that are currently not terminated and running on an affected node?
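For reference, a minimal client-go sketch of the listing I'm asking for - every non-terminated pod the apiserver has bound to a given node. This is written against a current client-go, not the 3.3-era API, and the node name is a placeholder:

    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Placeholder: one of the affected nodes from the report above.
        nodeName := "192.1.3.97"

        config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client := kubernetes.NewForConfigOrDie(config)

        // Pods bound to the node that are not Succeeded or Failed - roughly
        // the set the scheduler cache tracks for the node.
        pods, err := client.CoreV1().Pods(metav1.NamespaceAll).List(context.TODO(), metav1.ListOptions{
            FieldSelector: "spec.nodeName=" + nodeName +
                ",status.phase!=Succeeded,status.phase!=Failed",
        })
        if err != nil {
            panic(err)
        }
        for _, p := range pods.Items {
            fmt.Printf("%s/%s %s\n", p.Namespace, p.Name, p.Status.Phase)
        }
    }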
Jan, I ran a test - not sure if it answers comment 9 the way you expected - if not, let me know what to modify. Unlike the original issue, this is on 3.4, not 3.3. I can re-run on 3.3 if requested.

- Cluster with 2 small application nodes (1 core, 3.7 GB RAM)
- Ran the cluster loader script requesting 20 pods/node without specifying any resource requests (as in the original issue)
- Immediately, messages like this started popping in the master log:

Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.569915 95355 priorities.go:51] Combined requested resources 4194304000 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-185.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.569948 95355 priorities.go:51] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node ip-172-31-17-186.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.569952 95355 priorities.go:51] Combined requested resources 4194304000 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-186.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817014 95355 priorities.go:51] Combined requested resources 2100 from existing pods exceeds capacity 1000 on node ip-172-31-17-186.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817033 95355 priorities.go:51] Combined requested resources 4404019200 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-186.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817038 95355 priorities.go:51] Combined requested resources 2000 from existing pods exceeds capacity 1000 on node ip-172-31-17-185.us-west-2.compute.internal
Oct 4 10:56:53 ip-172-31-26-26 atomic-openshift-master: I1004 10:56:53.817042 95355 priorities.go:51] Combined requested resources 4194304000 from existing pods exceeds capacity 3777728512 on node ip-172-31-17-185.us-west-2.compute.internal

- All 40 pods (20/node) started successfully and entered Running state
- oc describe pod of one of the pods after it starts:

root@ip-172-31-26-26: ~/svt/openshift_scalability # oc describe pod -n clusterproject0 pausepods39
Name:           pausepods39
Namespace:      clusterproject0
Security Policy: anyuid
Node:           ip-172-31-17-185.us-west-2.compute.internal/172.31.17.185
Start Time:     Tue, 04 Oct 2016 10:57:20 -0400
Labels:         name=pause-amd64
Status:         Running
IP:             172.20.3.21
Controllers:    <none>
Containers:
  pause-amd64:
    Container ID:   docker://f94e3c32d57c835f7be588219bb3dc696a2f0bb3e8e8a3144c6a4df107d6aa3a
    Image:          gcr.io/google_containers/pause-amd64:3.0
    Image ID:       docker://sha256:99e59f495ffaa222bfeb67580213e8c28c1e885f1d245ab2bbe3b1b1ec3bd0b2
    Port:           8080/TCP
    State:          Running
      Started:      Tue, 04 Oct 2016 10:57:59 -0400
    Ready:          True
    Restart Count:  0
    Volume Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ubv9y (ro)
    Environment Variables: <none>
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  default-token-ubv9y:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-ubv9y
QoS Class:      BestEffort
Tolerations:    <none>
Events:
  FirstSeen  LastSeen  Count  From                                                    SubobjectPath                 Type    Reason     Message
  ---------  --------  -----  ----                                                    -------------                 ------  ------     -------
  2m         2m        1      {default-scheduler }                                                                  Normal  Scheduled  Successfully assigned pausepods39 to ip-172-31-17-185.us-west-2.compute.internal
  1m         1m        1      {kubelet ip-172-31-17-185.us-west-2.compute.internal}   spec.containers{pause-amd64}  Normal  Pulled     Container image "gcr.io/google_containers/pause-amd64:3.0" already present on machine
  1m         1m        1      {kubelet ip-172-31-17-185.us-west-2.compute.internal}   spec.containers{pause-amd64}  Normal  Created    Created container with docker id f94e3c32d57c; Security:[seccomp=unconfined]
  1m         1m        1      {kubelet ip-172-31-17-185.us-west-2.compute.internal}   spec.containers{pause-amd64}  Normal  Started    Started container with docker id f94e3c32d57c
The kubelet config on these nodes is max-pods=250 and pods-per-node=0.
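One reading of the numbers above (an assumption on my part, not verified against this exact build): the priority code charges a small non-zero default request for pods that declare none - the upstream defaults I'm aware of are 100 millicores and 200MiB per pod. With 20 BestEffort pods per node that works out exactly to the logged values:

    package main

    import "fmt"

    func main() {
        const (
            assumedMilliCPUPerPod = 100               // assumed default CPU charge per request-less pod
            assumedMemoryPerPod   = 200 * 1024 * 1024 // assumed default memory charge in bytes
        )
        pods := int64(20) // pods per node in the test above

        fmt.Println(pods*assumedMilliCPUPerPod, "millicores vs capacity 1000") // 2000
        fmt.Println(pods*assumedMemoryPerPod, "bytes vs capacity 3777728512")  // 4194304000
    }

If that is what is happening, the "exceeds capacity" messages are expected even with no requests defined in the dc/rc/pod specs, and the harmful part is the sheer volume of logging rather than the accounting itself.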
We've made a number of efforts to turn down the noise during failure scenarios. In 1.5/3.5 we turned down and analyzed all the hot-loop logging paths. In 1.6 we aggregated the summary information into a single event and updated the pod message with the same overall summary. The primary objective of this original issue was to avoid spamming the logs to the point of log clipping when failure conditions occur; those modifications have been made upstream. We should probably stress the capacity dimensions again and see if we can re-create the failover conditions.
This should behave the way you want in OCP v3.5.0.10 or newer.
Preparing cluster horizontal run on 3.5.0.17 - will watch for this issue.
Verified on 3.5.0.50. Forced several master-api failovers during a cluster horizontal scale-up and the error logging was informative and useful. No "rolling" (multiple sequential) failovers were seen. Namespace and deployment creation continued successfully.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:0884