Description of problem:
In the last 12 hours we have detected several pods (running Cassandra) that have consumed all available CPU on their node, even though their quotas/limits were much lower. Hawkular appears to misinterpret the CPU usage: it reported "-11320 available of 800 millicores" when the actual usage was 12120 millicores! See the attached screenshots for details.

Version-Release number of selected component (if applicable):
3.2

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
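For reference, the "-11320 available of 800 millicores" figure is exactly what you get if "available" is computed as limit minus usage with no floor at zero. A minimal sketch of that arithmetic, using the numbers from this report (the subtraction itself is an assumption about how the console derives the value):

    # Minimal sketch, assuming the console computes available = limit - usage
    # with no floor at zero (the formula is an assumption; the numbers are
    # the ones reported above).
    limit_m = 800     # pod CPU limit, in millicores
    usage_m = 12120   # observed CPU usage, in millicores

    available_m = limit_m - usage_m
    print(f"{available_m} available of {limit_m} millicores")
    # prints: -11320 available of 800 millicores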
Created attachment 1192900 [details] QuoraError2
Created attachment 1192901 [details] QuotaError
Last check done: the web console displays "CPU -344 Available of 800 millicores", whereas a direct check through the API shows:

[root@i89540 ~]# curl http://localhost:8001/api/v1/namespaces/openshift-infra/services/https:heapster:/proxy/api/v1/model/namespaces/redko-dev/pods/cassandra-15-btpca/metrics/cpu-usage
{
  "metrics": [
    { "timestamp": "2016-08-22T02:42:00-04:00", "value": 1065 },
    { "timestamp": "2016-08-22T02:42:10-04:00", "value": 1064 },
    { "timestamp": "2016-08-22T02:42:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:43:00-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:43:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:43:40-04:00", "value": 1053 },
    { "timestamp": "2016-08-22T02:44:00-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:44:10-04:00", "value": 1063 },
    { "timestamp": "2016-08-22T02:44:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:45:00-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:45:10-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:45:40-04:00", "value": 1063 },
    { "timestamp": "2016-08-22T02:45:50-04:00", "value": 1064 },
    { "timestamp": "2016-08-22T02:46:00-04:00", "value": 1058 },
    { "timestamp": "2016-08-22T02:46:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:46:40-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:46:50-04:00", "value": 1102 },
    { "timestamp": "2016-08-22T02:47:20-04:00", "value": 1065 },
    { "timestamp": "2016-08-22T02:47:30-04:00", "value": 1061 },
    { "timestamp": "2016-08-22T02:48:00-04:00", "value": 1060 },
    { "timestamp": "2016-08-22T02:48:10-04:00", "value": 1063 },
    { "timestamp": "2016-08-22T02:48:40-04:00", "value": 1082 },
    { "timestamp": "2016-08-22T02:49:00-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:49:10-04:00", "value": 1065 },
    { "timestamp": "2016-08-22T02:49:40-04:00", "value": 1058 },
    { "timestamp": "2016-08-22T02:49:50-04:00", "value": 1063 },
    { "timestamp": "2016-08-22T02:50:00-04:00", "value": 1062 },
    { "timestamp": "2016-08-22T02:50:10-04:00", "value": 1061 },
    { "timestamp": "2016-08-22T02:50:20-04:00", "value": 1048 },
    { "timestamp": "2016-08-22T02:50:40-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:51:10-04:00", "value": 1062 },
    { "timestamp": "2016-08-22T02:52:00-04:00", "value": 1053 },
    { "timestamp": "2016-08-22T02:52:10-04:00", "value": 1071 },
    { "timestamp": "2016-08-22T02:52:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:52:40-04:00", "value": 1065 },
    { "timestamp": "2016-08-22T02:53:10-04:00", "value": 1062 },
    { "timestamp": "2016-08-22T02:53:40-04:00", "value": 1053 },
    { "timestamp": "2016-08-22T02:54:10-04:00", "value": 1068 },
    { "timestamp": "2016-08-22T02:54:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:54:40-04:00", "value": 1059 },
    { "timestamp": "2016-08-22T02:55:10-04:00", "value": 1068 },
    { "timestamp": "2016-08-22T02:55:40-04:00", "value": 1072 },
    { "timestamp": "2016-08-22T02:55:50-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:56:30-04:00", "value": 1062 },
    { "timestamp": "2016-08-22T02:56:40-04:00", "value": 0 }
  ],
  "latestTimestamp": "2016-08-22T02:56:40-04:00"
}
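For cross-checking, a minimal sketch that pulls the same cpu-usage series through the Heapster proxy and derives the "available" figure from the latest sample. The URL is the one used in the curl above and the 800 millicore limit comes from the console message; skipping the interleaved zero samples is an assumption (they look like sampling gaps rather than real idle periods):

    # Minimal sketch (not the console's actual code): fetch the cpu-usage
    # series and compute "available" = limit - usage from the latest sample.
    import json
    import urllib.request

    URL = ("http://localhost:8001/api/v1/namespaces/openshift-infra/services/"
           "https:heapster:/proxy/api/v1/model/namespaces/redko-dev/pods/"
           "cassandra-15-btpca/metrics/cpu-usage")
    LIMIT_M = 800  # millicores, from the console message above

    with urllib.request.urlopen(URL) as resp:
        data = json.load(resp)

    # Assumption: the interleaved zeros are sampling gaps, so use the latest
    # non-zero reading as the current usage.
    nonzero = [m["value"] for m in data["metrics"] if m["value"] > 0]
    usage = nonzero[-1]
    print(f"usage {usage}m -> {LIMIT_M - usage} available of {LIMIT_M} millicores")

With usage hovering around 1060-1100 millicores against an 800 millicore limit, any such subtraction comes out negative, which matches what the console shows.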
We have spawned a separate bug, https://bugzilla.redhat.com/show_bug.cgi?id=1369160, to stop showing the negative values in the console. Transferring this bug to the cluster infra team in case the question of why the pod is going over its limit needs investigation.
I want to determine whether the CPU cgroup limits were properly set for the pod. To verify this, can we see the following output for the pod that demonstrated this behavior, where <pod-name> is that pod? (A small wrapper that runs the cgroup checks in one go is sketched after the list.)

1. Pod YAML

$ oc get pod <pod-name> -o yaml

2. Pod cgroup settings on the node

$ oc exec <pod-name> -- cat /proc/self/cgroup
$ oc exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.shares
$ oc exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
$ oc exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
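A convenience sketch (a hypothetical helper, not part of any existing tooling) that runs the checks from item 2 against a pod via oc exec:

    # Hypothetical helper: run the cgroup checks above in one pass.
    import subprocess
    import sys

    pod = sys.argv[1]  # e.g. the pod name showing the problem
    paths = [
        "/proc/self/cgroup",
        "/sys/fs/cgroup/cpu/cpu.shares",
        "/sys/fs/cgroup/cpu/cpu.cfs_quota_us",
        "/sys/fs/cgroup/cpu/cpu.cfs_period_us",
    ]

    for path in paths:
        result = subprocess.run(["oc", "exec", pod, "--", "cat", path],
                                capture_output=True, text=True)
        print(f"== {path}")
        print(result.stdout.strip())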
Here's the cgroup info. Waiting on the pod YAML.

[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /proc/self/cgroup
10:cpuset:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
9:devices:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
8:blkio:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
7:net_cls:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
6:perf_event:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
5:freezer:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
4:memory:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
3:cpuacct,cpu:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
2:hugetlb:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
1:name=systemd:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope

[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /sys/fs/cgroup/cpu/cpu.shares
1433
[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1
[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000
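To interpret the values above: cpu.shares only sets a relative weight, while the CFS quota/period pair is what enforces a hard cap, and a cfs_quota_us of -1 means no cap at all. A minimal sketch of that interpretation, using the values read from this pod:

    # Values read from the pod's cgroup above.
    quota_us = -1        # cpu.cfs_quota_us: -1 means "no quota set"
    period_us = 100000   # cpu.cfs_period_us

    if quota_us < 0:
        # With no CFS quota, the kernel never throttles the container; only
        # the relative weight (cpu.shares) applies, so usage can exceed the
        # pod's stated limit whenever idle CPU is available on the node.
        print("no hard CPU cap is enforced")
    else:
        print(f"hard cap: {1000 * quota_us // period_us} millicores")

That is consistent with the pod consuming far more CPU than its limit.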
Marking UpcomingRelease as we're dependent on a kernel z-stream fix that doesn't have an ETA yet.
I am marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1336863, as it requires no additional code change beyond that bz being released.

*** This bug has been marked as a duplicate of bug 1336863 ***