Description of problem:
In the last 12 hours we have detected several pods (running Cassandra) that have consumed all available CPU on their node, even though their quotas/limits were much lower. Hawkular appears to misinterpret the CPU usage: it reported "-11320 available of 800 millicores" when the actual usage was 12120 millicores! See the attached screenshots for details.

Version-Release number of selected component (if applicable):
3.2

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
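For reference, the "-11320 available of 800 millicores" figure is exactly what you get if "available" is computed as limit minus usage with no floor at zero. A minimal sketch of that arithmetic, using the numbers from this report (the subtraction itself is an assumption about how the console derives the value):

    # Minimal sketch, assuming the console computes available = limit - usage
    # with no floor at zero (the formula is an assumption; the numbers are
    # the ones reported above).
    limit_m = 800     # pod CPU limit, in millicores
    usage_m = 12120   # observed CPU usage, in millicores

    available_m = limit_m - usage_m
    print(f"{available_m} available of {limit_m} millicores")
    # prints: -11320 available of 800 millicores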
Created attachment 1192900 [details] QuoraError2
Created attachment 1192901 [details] QuotaError
Last check done: the web console displays "CPU -344 Available of 800 millicores", whereas a direct check through the API shows:

[root@i89540 ~]# curl http://localhost:8001/api/v1/namespaces/openshift-infra/services/https:heapster:/proxy/api/v1/model/namespaces/redko-dev/pods/cassandra-15-btpca/metrics/cpu-usage
{
  "metrics": [
    { "timestamp": "2016-08-22T02:42:00-04:00", "value": 1065 },
    { "timestamp": "2016-08-22T02:42:10-04:00", "value": 1064 },
    { "timestamp": "2016-08-22T02:42:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:43:00-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:43:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:43:40-04:00", "value": 1053 },
    { "timestamp": "2016-08-22T02:44:00-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:44:10-04:00", "value": 1063 },
    { "timestamp": "2016-08-22T02:44:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:45:00-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:45:10-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:45:40-04:00", "value": 1063 },
    { "timestamp": "2016-08-22T02:45:50-04:00", "value": 1064 },
    { "timestamp": "2016-08-22T02:46:00-04:00", "value": 1058 },
    { "timestamp": "2016-08-22T02:46:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:46:40-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:46:50-04:00", "value": 1102 },
    { "timestamp": "2016-08-22T02:47:20-04:00", "value": 1065 },
    { "timestamp": "2016-08-22T02:47:30-04:00", "value": 1061 },
    { "timestamp": "2016-08-22T02:48:00-04:00", "value": 1060 },
    { "timestamp": "2016-08-22T02:48:10-04:00", "value": 1063 },
    { "timestamp": "2016-08-22T02:48:40-04:00", "value": 1082 },
    { "timestamp": "2016-08-22T02:49:00-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:49:10-04:00", "value": 1065 },
    { "timestamp": "2016-08-22T02:49:40-04:00", "value": 1058 },
    { "timestamp": "2016-08-22T02:49:50-04:00", "value": 1063 },
    { "timestamp": "2016-08-22T02:50:00-04:00", "value": 1062 },
    { "timestamp": "2016-08-22T02:50:10-04:00", "value": 1061 },
    { "timestamp": "2016-08-22T02:50:20-04:00", "value": 1048 },
    { "timestamp": "2016-08-22T02:50:40-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:51:10-04:00", "value": 1062 },
    { "timestamp": "2016-08-22T02:52:00-04:00", "value": 1053 },
    { "timestamp": "2016-08-22T02:52:10-04:00", "value": 1071 },
    { "timestamp": "2016-08-22T02:52:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:52:40-04:00", "value": 1065 },
    { "timestamp": "2016-08-22T02:53:10-04:00", "value": 1062 },
    { "timestamp": "2016-08-22T02:53:40-04:00", "value": 1053 },
    { "timestamp": "2016-08-22T02:54:10-04:00", "value": 1068 },
    { "timestamp": "2016-08-22T02:54:30-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:54:40-04:00", "value": 1059 },
    { "timestamp": "2016-08-22T02:55:10-04:00", "value": 1068 },
    { "timestamp": "2016-08-22T02:55:40-04:00", "value": 1072 },
    { "timestamp": "2016-08-22T02:55:50-04:00", "value": 0 },
    { "timestamp": "2016-08-22T02:56:30-04:00", "value": 1062 },
    { "timestamp": "2016-08-22T02:56:40-04:00", "value": 0 }
  ],
  "latestTimestamp": "2016-08-22T02:56:40-04:00"
}
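For cross-checking, a minimal sketch that pulls the same cpu-usage series through the Heapster proxy and derives the "available" figure from the latest sample. The URL is the one used in the curl above and the 800 millicore limit comes from the console message; skipping the interleaved zero samples is an assumption (they look like sampling gaps rather than real idle periods):

    # Minimal sketch (not the console's actual code): fetch the cpu-usage
    # series and compute "available" = limit - usage from the latest sample.
    import json
    import urllib.request

    URL = ("http://localhost:8001/api/v1/namespaces/openshift-infra/services/"
           "https:heapster:/proxy/api/v1/model/namespaces/redko-dev/pods/"
           "cassandra-15-btpca/metrics/cpu-usage")
    LIMIT_M = 800  # millicores, from the console message above

    with urllib.request.urlopen(URL) as resp:
        data = json.load(resp)

    # Assumption: the interleaved zeros are sampling gaps, so use the latest
    # non-zero reading as the current usage.
    nonzero = [m["value"] for m in data["metrics"] if m["value"] > 0]
    usage = nonzero[-1]
    print(f"usage {usage}m -> {LIMIT_M - usage} available of {LIMIT_M} millicores")

With usage hovering around 1060-1100 millicores against an 800 millicore limit, any such subtraction comes out negative, which matches what the console shows.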
We have spawned a separate bug, https://bugzilla.redhat.com/show_bug.cgi?id=1369160, to stop showing the negative values in the console. Transferring this bug to the cluster infra team in case the question of why the pod is going over its limit needs investigation.
I want to determine whether the CPU cgroup limits were properly set for the pod. To verify this, can we see the following output for the pod that demonstrated this behavior, where <pod-name> is that pod? (A small wrapper that runs the cgroup checks in one go is sketched after the list.)

1. Pod YAML

$ oc get pod <pod-name> -o yaml

2. Pod cgroup settings on the node

$ oc exec <pod-name> -- cat /proc/self/cgroup
$ oc exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.shares
$ oc exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
$ oc exec <pod-name> -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
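A convenience sketch (a hypothetical helper, not part of any existing tooling) that runs the checks from item 2 against a pod via oc exec:

    # Hypothetical helper: run the cgroup checks above in one pass.
    import subprocess
    import sys

    pod = sys.argv[1]  # e.g. the pod name showing the problem
    paths = [
        "/proc/self/cgroup",
        "/sys/fs/cgroup/cpu/cpu.shares",
        "/sys/fs/cgroup/cpu/cpu.cfs_quota_us",
        "/sys/fs/cgroup/cpu/cpu.cfs_period_us",
    ]

    for path in paths:
        result = subprocess.run(["oc", "exec", pod, "--", "cat", path],
                                capture_output=True, text=True)
        print(f"== {path}")
        print(result.stdout.strip())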
Here's the cgroup info. Waiting on the pod YAML.

[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /proc/self/cgroup
10:cpuset:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
9:devices:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
8:blkio:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
7:net_cls:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
6:perf_event:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
5:freezer:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
4:memory:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
3:cpuacct,cpu:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
2:hugetlb:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope
1:name=systemd:/system.slice/docker-4404981688d3ca233cb690c5fcb7366dfb822f84b381c6646512a6d7afeb6139.scope

[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /sys/fs/cgroup/cpu/cpu.shares
1433
[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /sys/fs/cgroup/cpu/cpu.cfs_quota_us
-1
[root@i89540 ~]# oc exec cassandra-5-5783w -- cat /sys/fs/cgroup/cpu/cpu.cfs_period_us
100000
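To interpret the values above: cpu.shares only sets a relative weight, while the CFS quota/period pair is what enforces a hard cap, and a cfs_quota_us of -1 means no cap at all. A minimal sketch of that interpretation, using the values read from this pod:

    # Values read from the pod's cgroup above.
    quota_us = -1        # cpu.cfs_quota_us: -1 means "no quota set"
    period_us = 100000   # cpu.cfs_period_us

    if quota_us < 0:
        # With no CFS quota, the kernel never throttles the container; only
        # the relative weight (cpu.shares) applies, so usage can exceed the
        # pod's stated limit whenever idle CPU is available on the node.
        print("no hard CPU cap is enforced")
    else:
        print(f"hard cap: {1000 * quota_us // period_us} millicores")

That is consistent with the pod consuming far more CPU than its limit.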
Marking UpcomingRelease as we're dependent on a kernel z-stream fix that doesn't have an ETA yet.
I am marking this as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=1336863, as it requires no additional code change beyond that bz being released.

*** This bug has been marked as a duplicate of bug 1336863 ***