
Bug 1855748

Summary: crictl stats shows values of the whole pod and not of the container
Product: OpenShift Container Platform
Reporter: Roman Mohr <rmohr>
Component: Node
Assignee: Peter Hunt <pehunt>
Status: CLOSED ERRATA
QA Contact: Weinan Liu <weinliu>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.4
CC: aos-bugs, jokerman, pehunt, rphillips
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-27 16:13:50 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Roman Mohr 2020-07-10 12:02:34 UTC
Description of problem:

I am currently diagnosing an OOM problem on a Pod with a container with limits.
When I run 

```
$ docker stats 1349ed010bee
CONTAINER ID        NAME                                                                                                     CPU %               MEM USAGE / LIMIT     MEM %               NET I/O             BLOCK I/O           PIDS
1349ed010bee        k8s_volumecontainerdisk_virt-launcher-vmi-nocloud-zm5ds_default_88e6e011-33c7-4173-a84c-a0fe83012e02_0   0.39%               1.711MiB / 38.14MiB   4.49%               0B / 0B             1.73MB / 0B         1

```

One can see that this container has a memory limit of 40MB applied and how much memory is actually consumed.
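
For reference, the limit and usage that docker reports can also be cross-checked against the container's memory cgroup directly. A minimal sketch, assuming a cgroup v1 host and using the container ID from above (the exact kubepods path depends on the environment):

```
# Resolve the container's main PID, find its memory cgroup, and read the
# limit/usage straight from the cgroup filesystem (cgroup v1 layout assumed).
PID=$(docker inspect --format '{{.State.Pid}}' 1349ed010bee)
CG=$(awk -F: '$2 == "memory" {print $3}' /proc/$PID/cgroup)
cat /sys/fs/cgroup/memory${CG}/memory.limit_in_bytes   # should show ~40MB
cat /sys/fs/cgroup/memory${CG}/memory.usage_in_bytes
```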

When I start the same pod on OpenShift and run


```
# crictl stats fb8662e85892e
CONTAINER           CPU %               MEM                 DISK                INODES
fb8662e85892e       1.62                210.2MB             12.72MB             20

```

I get the memory consumption of the whole pod. I confirmed that `fb8662e85892e` indeed points to the right PID, `351019`.


```
# crictl inspect fb8662e85892e
{
  "pid": 351019,
  "sandboxId": "052f3e0fedaf1b797f500da998b71b9709c6934d9985916312d7d6f8ada3e0e1"
}

```

However, this sandbox actually belongs to /usr/bin/pod (349553):

```
# runc state 052f3e0fedaf1b797f500da998b71b9709c6934d9985916312d7d6f8ada3e0e1
{
  "ociVersion": "1.0.1-dev",
  "id": "052f3e0fedaf1b797f500da998b71b9709c6934d9985916312d7d6f8ada3e0e1",
  "pid": 349553,

```

Maybe that is the source of the confusion, and stats takes the wrong entry point (e.g. the parent sandbox).
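
One way to test that hypothesis on the node would be to read the memory usage from the container's own cgroup and from its parent (pod-level) cgroup and see which one matches crictl's 210.2MB. A rough sketch, assuming cgroup v1, the PIDs captured above, and a container cgroup nested directly under the pod cgroup:

```
# Memory cgroup of the container process vs. the pause/sandbox process.
for PID in 351019 349553; do
  CG=$(awk -F: '$2 == "memory" {print $3}' /proc/$PID/cgroup)
  echo "$PID -> $CG"
  cat /sys/fs/cgroup/memory${CG}/memory.usage_in_bytes
done

# The parent directory is the pod-level cgroup, which aggregates every
# container in the pod; if crictl's number matches this one rather than the
# container's own cgroup, the stats are being read one level too high.
CG=$(awk -F: '$2 == "memory" {print $3}' /proc/351019/cgroup)
cat /sys/fs/cgroup/memory$(dirname ${CG})/memory.usage_in_bytes
```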


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


I would expect to see the memory consumption of the container, not of the pod, to make diagnosing issues easier.

Additional info:

Comment 1 Peter Hunt 2020-07-10 13:32:24 UTC
Fixed upstream by the attached PR.

Comment 2 Peter Hunt 2020-07-10 13:34:07 UTC
This will be resolved next sprint

Comment 4 Peter Hunt 2020-07-31 21:30:21 UTC
Still waiting on the fix to go in upstream.

Comment 5 Peter Hunt 2020-08-03 19:41:56 UTC
The upstream PR has been merged.

Comment 10 Peter Hunt 2020-08-07 14:10:02 UTC
Weinan,
I believe you've verified based on a slight misunderstanding of the source of the problem.

For clarification on the original report:

> One can see that this container has a memory limit of 40MB applied and how much memory is actually consumed.

Yes, this is a difference between CRI stats and docker stats.

> When I start the same pod on OpenShift and run


> ```
> # crictl stats fb8662e85892e
> CONTAINER           CPU %               MEM                 DISK                INODES
> fb8662e85892e       1.62                210.2MB             12.72MB             20
>
> ```

This is the real bug. If you run a pod with two different containers, they should have different stats reports. Before the fix, they had the same stats report.

> I get the memory consumption of the whole pod. I confirmed that `fb8662e85892e` indeed points to the right pid `351019`.

> ```
> # crictl inspect fb8662e85892e
> {
>  "pid": 351019,
>  "sandboxId": "052f3e0fedaf1b797f500da998b71b9709c6934d9985916312d7d6f8ada3e0e1"
> }

> ```


> However, this sandbox actually belongs to /usr/bin/pod (349553):

> ```
> # runc state 052f3e0fedaf1b797f500da998b71b9709c6934d9985916312d7d6f8ada3e0e1
> {
>  "ociVersion": "1.0.1-dev",
>  "id": "052f3e0fedaf1b797f500da998b71b9709c6934d9985916312d7d6f8ada3e0e1",
>  "pid": 349553,
>
>```

This is as designed. The "sandbox" really is a collection of containers. We represent it with the pause container (which runs the process /usr/bin/pod). The pause container is a running container, so it has its own PID; "actually belongs to /usr/bin/pod" is therefore expected.
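
That relationship can be seen on the node itself; a small sketch using the IDs from this report (assuming the crictl version on the node supports these subcommands/flags):

```
# The sandbox PID from `runc state` is the pause process, i.e. /usr/bin/pod.
ps -p 349553 -o pid,comm,args

# crictl can inspect the pod sandbox separately from its containers, and list
# the containers that belong to it.
crictl inspectp 052f3e0fedaf1b797f500da998b71b9709c6934d9985916312d7d6f8ada3e0e1
crictl ps --pod 052f3e0fedaf1b797f500da998b71b9709c6934d9985916312d7d6f8ada3e0e1
```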

The real bug here is that, when reporting through CRI, we account for the pause container's memory in the container's stats.
It's much easier to verify with two containers in the pod (plus the pause container), as crictl stats will output the information for both, giving a clear picture of whether they're different, which should now be the case.
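
A minimal verification sketch along those lines (the pod name, image, and commands below are only assumptions used to produce two obviously different workloads; the point is just that the two containers should now report different values):

```
# Create a pod with one idle and one CPU-busy container.
oc apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: stats-check
spec:
  containers:
  - name: idle
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sleep", "infinity"]
  - name: busy
    image: registry.access.redhat.com/ubi8/ubi-minimal
    command: ["sh", "-c", "while true; do :; done"]
EOF

# On the node running the pod: each container should show its own CPU/MEM,
# instead of both repeating the pod-wide numbers as they did before the fix.
crictl ps --name idle -q | xargs crictl stats
crictl ps --name busy -q | xargs crictl stats
```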

Comment 12 Weinan Liu 2020-08-10 11:35:41 UTC
As per comment 10, the issue is fixed on:
$ oc version
Client Version: 4.5.2
Server Version: 4.6.0-0.nightly-2020-08-06-131904
Kubernetes Version: v4.6.0-202008061010.p0-dirty

Comment 14 errata-xmlrpc 2020-10-27 16:13:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196