Bug 1369022 - Negative CPU requests for a pod
Summary: Negative CPU requests for a pod
Keywords:
Status: CLOSED DUPLICATE of bug 1336863
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Derek Carr
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On: 1336863
Blocks: 1369160
 
Reported: 2016-08-22 10:51 UTC by Alexander Koksharov
Modified: 2019-12-16 06:26 UTC
CC List: 8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1369160 (view as bug list)
Environment:
Last Closed: 2016-10-25 18:34:42 UTC
Target Upstream Version:
Embargoed:


Attachments
QuoraError2 (49.64 KB, image/png), 2016-08-22 10:52 UTC, Alexander Koksharov
QuotaError (49.64 KB, image/png), 2016-08-22 10:53 UTC, Alexander Koksharov

Description Alexander Koksharov 2016-08-22 10:51:00 UTC
Description of problem:
In the last 12 hours we have detected several pods (running Cassandra) that have consumed all the available CPU resources on their node, even though they had much lower quotas/limits.
It appears that Hawkular detects/interprets the CPU usage incorrectly: it reported "-11320 available of 800 millicores" when the usage was 12120 millicores. See the attached screenshots for details.
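
For reference, the negative figure is consistent with "available" being computed as limit minus measured usage (800 - 12120 = -11320). A minimal sketch of that arithmetic, assuming that is how the console derives the number (not code taken from the console):

package main

import "fmt"

func main() {
	// Assumption: the console reports available = limit - usage, in millicores.
	limit := int64(800)   // CPU figure shown by the console ("of 800 millicores")
	usage := int64(12120) // CPU usage reported by Hawkular/Heapster
	fmt.Printf("%d available of %d millicores\n", limit-usage, limit)
	// prints: -11320 available of 800 millicores
}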

Version-Release number of selected component (if applicable):
3.2

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Alexander Koksharov 2016-08-22 10:52:29 UTC
Created attachment 1192900 [details]
QuoraError2

Comment 2 Alexander Koksharov 2016-08-22 10:53:08 UTC
Created attachment 1192901 [details]
QuotaError

Comment 3 Alexander Koksharov 2016-08-22 10:56:45 UTC
Last check done:

The web console displays: "CPU -344 Available of 800 millicores"

whereas a direct check through the API shows:
[root@i89540 ~]# curl http://localhost:8001/api/v1/namespaces/openshift-infra/services/https:heapster:/proxy/api/v1/model/namespaces/redko-dev/pods/cassandra-15-btpca/metrics/cpu-usage
{
  "metrics": [
   {
    "timestamp": "2016-08-22T02:42:00-04:00",
    "value": 1065
   },
   {
    "timestamp": "2016-08-22T02:42:10-04:00",
    "value": 1064
   },
   {
    "timestamp": "2016-08-22T02:42:30-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:43:00-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:43:30-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:43:40-04:00",
    "value": 1053
   },
   {
    "timestamp": "2016-08-22T02:44:00-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:44:10-04:00",
    "value": 1063
   },
   {
    "timestamp": "2016-08-22T02:44:30-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:45:00-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:45:10-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:45:40-04:00",
    "value": 1063
   },
   {
    "timestamp": "2016-08-22T02:45:50-04:00",
    "value": 1064
   },
   {
    "timestamp": "2016-08-22T02:46:00-04:00",
    "value": 1058
   },
   {
    "timestamp": "2016-08-22T02:46:30-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:46:40-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:46:50-04:00",
    "value": 1102
   },
   {
    "timestamp": "2016-08-22T02:47:20-04:00",
    "value": 1065
   },
   {
    "timestamp": "2016-08-22T02:47:30-04:00",
    "value": 1061
   },
   {
    "timestamp": "2016-08-22T02:48:00-04:00",
    "value": 1060
   },
   {
    "timestamp": "2016-08-22T02:48:10-04:00",
    "value": 1063
   },
   {
    "timestamp": "2016-08-22T02:48:40-04:00",
    "value": 1082
   },
   {
    "timestamp": "2016-08-22T02:49:00-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:49:10-04:00",
    "value": 1065
   },
   {
    "timestamp": "2016-08-22T02:49:40-04:00",
    "value": 1058
   },
   {
    "timestamp": "2016-08-22T02:49:50-04:00",
    "value": 1063
   },
   {
    "timestamp": "2016-08-22T02:50:00-04:00",
    "value": 1062
   },
   {
    "timestamp": "2016-08-22T02:50:10-04:00",
    "value": 1061
   },
   {
    "timestamp": "2016-08-22T02:50:20-04:00",
    "value": 1048
   },
   {
    "timestamp": "2016-08-22T02:50:40-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:51:10-04:00",
    "value": 1062
   },
   {
    "timestamp": "2016-08-22T02:52:00-04:00",
    "value": 1053
   },
   {
    "timestamp": "2016-08-22T02:52:10-04:00",
    "value": 1071
   },
   {
    "timestamp": "2016-08-22T02:52:30-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:52:40-04:00",
    "value": 1065
   },
   {
    "timestamp": "2016-08-22T02:53:10-04:00",
    "value": 1062
   },
   {
    "timestamp": "2016-08-22T02:53:40-04:00",
    "value": 1053
   },
   {
    "timestamp": "2016-08-22T02:54:10-04:00",
    "value": 1068
   },
   {
    "timestamp": "2016-08-22T02:54:30-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:54:40-04:00",
    "value": 1059
   },
   {
    "timestamp": "2016-08-22T02:55:10-04:00",
    "value": 1068
   },
   {
    "timestamp": "2016-08-22T02:55:40-04:00",
    "value": 1072
   },
   {
    "timestamp": "2016-08-22T02:55:50-04:00",
    "value": 0
   },
   {
    "timestamp": "2016-08-22T02:56:30-04:00",
    "value": 1062
   },
   {
    "timestamp": "2016-08-22T02:56:40-04:00",
    "value": 0
   }
  ],
  "latestTimestamp": "2016-08-22T02:56:40-04:00"
}
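
For reference, the non-zero readings above hover around 1050-1100 millicores while the interleaved zeroes look like missed samples. A small sketch summarizing the first few samples (values copied from the response above; illustration only, not part of the original check):

package main

import "fmt"

func main() {
	// First fourteen cpu-usage samples from the Heapster response above (millicores).
	samples := []int64{1065, 1064, 0, 0, 0, 1053, 0, 1063, 0, 0, 0, 1063, 1064, 1058}
	var peak, sum, n int64
	for _, v := range samples {
		if v > peak {
			peak = v
		}
		if v > 0 {
			sum += v
			n++
		}
	}
	fmt.Printf("peak=%dm mean(non-zero)=%dm\n", peak, sum/n)
}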

Comment 4 Jessica Forrester 2016-08-22 14:43:43 UTC
We have spawned a separate bug to stop showing the negative values in the console: https://bugzilla.redhat.com/show_bug.cgi?id=1369160

Transferring this bug to the cluster infra team in case there needs to be an investigation into why it's going over the limit.

Comment 5 Derek Carr 2016-08-22 14:54:34 UTC
I want to determine whether the CPU cgroup limits were properly set for the pod.

To verify this, can we see the following output for the pod that demonstrated this behavior, where <pod-name> is that pod?

1. Pod YAML

$ oc get pod <pod-name> -o yaml

2. Pod cgroup settings on the node

$ oc exec <pod-name> -- cat /proc/self/cgroup
$ oc exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.shares
$ oc exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
$ oc exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
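
For comparison, if an 800 millicore CPU limit were being enforced through CFS quota, the cgroup values should look roughly as follows (a sketch of the standard Kubernetes millicores-to-quota conversion; the helper name is illustrative, not the kubelet's actual function):

package main

import "fmt"

// millicoresToCFSQuota mirrors the usual Kubernetes conversion of a CPU limit
// into cpu.cfs_quota_us: quota = limitMillicores * periodUs / 1000.
func millicoresToCFSQuota(limitMillicores, periodUs int64) int64 {
	return limitMillicores * periodUs / 1000
}

func main() {
	// An enforced 800m limit with the default 100000us period should show
	// cpu.cfs_quota_us = 80000; a value of -1 means no CFS quota is applied.
	fmt.Println(millicoresToCFSQuota(800, 100000)) // 80000
}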

Comment 6 Andy Goldstein 2016-08-24 12:57:45 UTC
Here's the cgroup info. Waiting on pod yaml.


[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /proc/self/cgroup
10:cpuset:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
9:devices:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
8:blkio:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
7:net_cls:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
6:perf_event:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
5:freezer:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
4:memory:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
3:cpuacct,cpu:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
2:hugetlb:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
1:name=systemd:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope

[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /sys/fs/cgroup/cpu/cpu.shares
1433

[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1

[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000
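
Note that cpu.shares only sets the pod's relative weight under CPU contention, and cpu.cfs_quota_us = -1 means no hard CFS cap is applied, so nothing prevents the pod from using idle CPU beyond its limit. A rough sketch of the usual shares-to-millicores mapping (the helper name is illustrative):

package main

import "fmt"

// sharesToMillicores inverts the usual Kubernetes mapping
// shares = requestMillicores * 1024 / 1000.
func sharesToMillicores(shares int64) int64 {
	return shares * 1000 / 1024
}

func main() {
	// The observed cpu.shares of 1433 corresponds to roughly a 1400m CPU request;
	// shares only set relative weight under contention, and with
	// cpu.cfs_quota_us = -1 there is no hard ceiling on the pod's CPU usage.
	fmt.Println(sharesToMillicores(1433)) // 1399 (~1400m)
}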

Comment 9 Andy Goldstein 2016-08-24 15:52:57 UTC
Marking UpcomingRelease as we're dependent on a kernel z-stream fix that doesn't have an ETA yet.

Comment 10 Derek Carr 2016-10-25 18:34:42 UTC
I am marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1336863 as it requires no additional code change beyond that bz being released.

*** This bug has been marked as a duplicate of bug 1336863 ***

