Bug 1892706

Summary: z systems not reporting correct core count in recording rule
Product: OpenShift Container Platform
Reporter: Simon Pasquier <spasquie>
Component: Monitoring
Assignee: Simon Pasquier <spasquie>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: high
Docs Contact:
Priority: high
Version: 4.6
CC: alegrand, anpicker, brueckner, ccoleman, danili, dgilmore, erooth, juzhao, kakkoyun, lcosic, mloibl, pkrupa, psundara, spasquie, surbania
Target Milestone: ---
Keywords: UpcomingSprint
Target Release: 4.6.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1882730
Environment:
Last Closed: 2020-12-14 13:50:25 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1882730
Bug Blocks:

Description Simon Pasquier 2020-10-29 13:49:20 UTC
+++ This bug was initially created as a clone of Bug #1882730 +++

node_role_os_version_machine:cpu_capacity_cores:sum{label_kubernetes_io_arch="s390x"} should record the number of cores for all machines with the given node roles, but it appears to report 1 for each machine. No clusters were reporting any number other than 1 per cluster, so it's likely that the metric from node_exporter reports an incorrect value on z across all versions (rather than the less likely outcome that every z user runs only a single core per machine).

This metric is used to associate subscription data with clusters, so this is a high-severity bug (we cannot correctly communicate to customers the amount they owe).

If this is a limitation of the z virtualization environment we may need to return an estimated value (if possible) since 1 is definitely not correct.

When looking at cluster:capacity_cpu_cores:sum for a z cluster, I'm seeing node role labels appear, so it's possible that we significantly regressed the calculation of that metric as part of a recent change. Bumping this to urgent (we can't regress reporting these metrics) until that is resolved. If reverting does not fix the z core count, we can drop this back to high.

--- Additional comment from Clayton Coleman on 2020-09-25 13:11:53 UTC ---

The capacity change appears to be correct (sum of capacity by _id works ok on 4.5 and 4.6) so dropping severity back to high.  The z core reporting bug is likely independent.

--- Additional comment from Clayton Coleman on 2020-09-25 13:14:43 UTC ---

Data for both metrics from the cluster:

cluster:capacity_cpu_cores:sum{_id="537c2edd-2285-4695-bbad-30b4d52f4a38",label_kubernetes_io_arch="s390x",label_node_openshift_io_os_id="rhcos",label_node_role_kubernetes_io="master",prometheus="openshift-monitoring/k8s",receive="true",tenant_id="FB870BF3-9F3A-44FF-9BF7-D7A047A52F43"}	12
cluster:capacity_cpu_cores:sum{_id="537c2edd-2285-4695-bbad-30b4d52f4a38",label_kubernetes_io_arch="s390x",label_node_openshift_io_os_id="rhcos",prometheus="openshift-monitoring/k8s",receive="true",tenant_id="FB870BF3-9F3A-44FF-9BF7-D7A047A52F43"}	8

node_role_os_version_machine:cpu_capacity_cores:sum{_id="537c2edd-2285-4695-bbad-30b4d52f4a38",label_kubernetes_io_arch="s390x",label_node_hyperthread_enabled="true",label_node_openshift_io_os_id="rhcos",label_node_role_kubernetes_io_master="true",prometheus="openshift-monitoring/k8s",receive="true",tenant_id="FB870BF3-9F3A-44FF-9BF7-D7A047A52F43"}	3
node_role_os_version_machine:cpu_capacity_cores:sum{_id="537c2edd-2285-4695-bbad-30b4d52f4a38",label_kubernetes_io_arch="s390x",label_node_hyperthread_enabled="true",label_node_openshift_io_os_id="rhcos",prometheus="openshift-monitoring/k8s",receive="true",tenant_id="FB870BF3-9F3A-44FF-9BF7-D7A047A52F43"}	3

Total capacity is correct, node role capacity is wrong.

--- Additional comment from Simon Pasquier on 2020-09-25 15:05:05 UTC ---

Looking at the procfs and node_exporter code base, it looks like the node_cpu_info metric is missing the "core" and "package" labels on s390x (while the "cpu" label is set).

Here is the node_exporter's code collecting the node_cpu_info metric:
https://github.com/openshift/node_exporter/blob/b13ea6076dae6819ad0a0f423308ce872a306463/collector/cpu_linux.go#L106-L120

The "core" label is mapped from cpu.CoreID as returned by the procfs library, but this field is never set for s390x (the same goes for the "package" label, while "cpu" is parsed):

https://github.com/prometheus/procfs/blob/9dece15c53cd5e9fbfbd72d5108adcf526a3f486/cpuinfo.go#L254-L320

procfs has some fixtures to validate its parsing, but I'm not sure what the correct fix would be here:

https://github.com/prometheus/procfs/blob/9dece15c53cd5e9fbfbd72d5108adcf526a3f486/cpuinfo_test.go#L99-L132

Since the "core" label ends up empty, it probably breaks the "cluster:cpu_core_hyperthreading" recording rule, which is used to compute "cluster:cpu_core_node_labels" and "node_role_os_version_machine:cpu_capacity_cores:sum":

https://github.com/openshift/cluster-monitoring-operator/blob/eb64b73c64b97b1312a2d6a7101a0eecb33300fc/assets/prometheus-k8s/rules.yaml#L863-L871

Related to https://bugzilla.redhat.com/show_bug.cgi?id=1854441
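To make the failure mode concrete: the sketch below is a minimal, hypothetical re-implementation (not the actual procfs code; the type and function names are invented for illustration) of how an s390x-style /proc/cpuinfo parser that only recognizes "cpu number" lines necessarily leaves the core ID empty, which is then what the "core" label is built from:

```go
package main

import (
	"fmt"
	"strings"
)

// cpuInfo mirrors, in simplified form, the per-CPU fields procfs exposes.
// On s390x, /proc/cpuinfo has "cpu number" lines but no "core id" line,
// so CoreID is never populated.
type cpuInfo struct {
	CPUNumber string
	CoreID    string
}

// parseS390x is a hypothetical minimal sketch of the s390x parsing path:
// it only recognizes "cpu number", mimicking why the "core" label ends up empty.
func parseS390x(info string) []cpuInfo {
	var cpus []cpuInfo
	for _, line := range strings.Split(info, "\n") {
		parts := strings.SplitN(line, ":", 2)
		if len(parts) != 2 {
			continue
		}
		key := strings.TrimSpace(parts[0])
		val := strings.TrimSpace(parts[1])
		if key == "cpu number" {
			// CoreID is intentionally never set: there is no source for it.
			cpus = append(cpus, cpuInfo{CPUNumber: val})
		}
	}
	return cpus
}

func main() {
	sample := "cpu number      : 0\ncpu MHz dynamic : 4504\n\ncpu number      : 1\ncpu MHz dynamic : 4504\n"
	for _, c := range parseS390x(sample) {
		fmt.Printf("cpu=%q core=%q\n", c.CPUNumber, c.CoreID)
	}
}
```

Every series produced from such data carries core="", which is why any aggregation grouping by the "core" label collapses.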

--- Additional comment from Simon Pasquier on 2020-09-29 08:45:32 UTC ---

Reassigning to the Multi-arch team to determine which labels make sense for the node_cpu_info metric.

--- Additional comment from Dan Li on 2020-09-29 12:07:01 UTC ---

Changing the status back to New since the bug has been re-assigned. Also adding UpcomingSprint, as this is a relatively new bug and it is unlikely to be resolved before the end of this sprint.

--- Additional comment from Prashanth Sundararaman on 2020-09-29 12:12:19 UTC ---

This is the cpuinfo from an s390x zVM; it has no "core" as such. I am not sure what the equivalent is - I am adding the IBM folks here so they can chime in:

[root@rock-zvm-3-1 ~]# cat /proc/cpuinfo
vendor_id       : IBM/S390
# processors    : 2
bogomips per cpu: 1048.00
max thread id   : 0
features	: esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx vxd vxe gs sie 
facilities      : 0 1 2 3 4 6 7 8 9 10 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 38 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 57 58 59 60 73 74 75 76 77 80 81 82 128 129 130 131 133 134 135 146 147 156 168
cache0          : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1          : level=1 type=Instruction scope=Private size=128K line_size=256 associativity=8
cache2          : level=2 type=Data scope=Private size=4096K line_size=256 associativity=8
cache3          : level=2 type=Instruction scope=Private size=2048K line_size=256 associativity=8
cache4          : level=3 type=Unified scope=Shared size=131072K line_size=256 associativity=32
cache5          : level=4 type=Unified scope=Shared size=688128K line_size=256 associativity=42
processor 0: version = FF,  identification = 01F2E8,  machine = 3907
processor 1: version = FF,  identification = 01F2E8,  machine = 3907

cpu number      : 0
cpu MHz dynamic : 4504
cpu MHz static  : 4504

cpu number      : 1
cpu MHz dynamic : 4504
cpu MHz static  : 4504

--- Additional comment from Simon Pasquier on 2020-09-29 14:49:36 UTC ---

For the record, this is what the node_cpu_info metric looks like for an x86_64 machine with 4 vCPUs (2 cores).

--- Additional comment from Dan Li on 2020-09-30 13:19:53 UTC ---

Re-assigning this to Prashanth as he is in the process of acquiring more information.

--- Additional comment from Dan Li on 2020-09-30 18:29:36 UTC ---

Hi Clayton, currently this bug is marked as a release blocker for OCP 4.6 as its Target Release is 4.6. We have seen several CPU-related bugs regarding Z, but none of them are release blocking at the moment. Could you let me know why this is release blocking, or can we de-escalate this bug to "Low" or a 4.7.0 Target Release? Thank you.

--- Additional comment from Dan Li on 2020-09-30 19:37:53 UTC ---

Changing the Target Release to 4.6.z after chatting with Clayton

--- Additional comment from Prashanth Sundararaman on 2020-09-30 19:38:10 UTC ---

So, a few things here - most of the Z installations are on zVM guests or zKVMs. Both of these are VMs, and they have vCPUs. Running lscpu on one of the zVM nodes shows:

[core@master-2 ~]$ lscpu
Architecture:        s390x
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Big Endian
CPU(s):              2
On-line CPU(s) list: 0-3
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s) per book:  1
Book(s) per drawer:  1
Drawer(s):           4
NUMA node(s):        1
--snip--

The cores per socket here is one, and as I pasted above, there is no core information in /proc/cpuinfo.


Contrast this to an lscpu and /proc/cpuinfo from an LPAR:

[sharkcz@rock-kvmlp-fedora ~]$ lscpu
Architecture:                    s390x
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Big Endian
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              10
Socket(s) per book:              3
--snip--

There are 10 cores. Here, even the cpuinfo has all the information you would see on any other architecture, including a "core id", unlike the display in a VM, which lacks most of this info.

--snip--
cpu number      : 0
physical id     : 2
core id         : 0
book id         : 2
drawer id       : 1
dedicated       : 0
address         : 0
siblings        : 8
cpu cores       : 4
version         : 00
identification  : 06F2E8
machine         : 3907
cpu MHz dynamic : 4504
cpu MHz static  : 4504
--snip--

I would say that for a zVM guest/zKVM installation the values displayed are correct. If we ever support LPAR installations this would need to change, but that is a conversation for later, I guess.

--- Additional comment from Dan Li on 2020-09-30 19:45:09 UTC ---

Changing the Target Release back to 4.6 as agreement is needed from stakeholders to move this bug off the blocker list

@Dennis, @Clayton, @Hendrick, please see Prashanth's Comment 11. If I am understanding the problem correctly: if we support LPAR installation, then this bug would be high severity; if we only support VMs, then it would not be a blocker.

--- Additional comment from Simon Pasquier on 2020-10-01 07:43:32 UTC ---

@Prashanth could you attach the full /proc/cpuinfo from an LPAR to the ticket? It would help fix node_exporter & procfs if we eventually need to.

> I would say that for a zVM guest/ zKVM installation the values displayed are correct.

Can we also get the node_cpu_info metric from a zKVM cluster to confirm that the "core" and "package" labels are missing? If that is the case, we need to fix either the recording rules or node_exporter.

--- Additional comment from Prashanth Sundararaman on 2020-10-01 12:17:33 UTC ---



--- Additional comment from Prashanth Sundararaman on 2020-10-01 18:26:59 UTC ---



--- Additional comment from Prashanth Sundararaman on 2020-10-01 18:28:10 UTC ---

Added the cpu info for zKVM and yes - it is the same there too.

--- Additional comment from Dan Li on 2020-10-01 21:45:57 UTC ---

Based on conversations with multiple stakeholders, I am de-escalating this bug by setting the Target Release to 4.7 with likelihood that it will be changed to 4.6.z if the system allows. 

The reasons for de-escalation are threefold:

1. If this bug only occurs in Z environments (not in other environments such as x86_64), it is not a regression and therefore not a blocker issue
2. LPAR installation is currently not supported by multi-arch OCP 4.6; zVM is the only supported path
3. This bug can be fixed in upcoming 4.6.z releases

Also adding "UpcomingSprint" post de-escalation.

--- Additional comment from Eric Paris on 2020-10-01 22:00:23 UTC ---

This bug sets Target Release equal to a z-stream but has no bug in the 'Depends On' field. As such this is not a valid bug state and the target release is being unset.

Any bug targeting 4.1.z must have a bug targeting 4.2 in 'Depends On.'
Similarly, any bug targeting 4.2.z must have a bug with Target Release of 4.3 in 'Depends On.'

--- Additional comment from Simon Pasquier on 2020-10-02 09:32:32 UTC ---

Can you attach also a copy of /sys/devices/system/cpu/cpu*/topology for a zVM guest/ zKVM installation?

--- Additional comment from Prashanth Sundararaman on 2020-10-02 13:10:19 UTC ---

I have attached the topology info provided by lscpu and lscpu -e which should be what is present in /sys/devices/system/cpu/*/topology/.

On further discussion with the IBM team , it looks like for an LPAR environment each logical processor in linux is one core. With SMT one core becomes 2 processors. But for zKVM/zVM looks like each processor is one core. I am wondering whether for zVM/zKVM we should report it such that cores=cpus

--- Additional comment from Prashanth Sundararaman on 2020-10-06 15:29:46 UTC ---

@Simon - Anything else you need from me? How do we proceed on this?

--- Additional comment from Simon Pasquier on 2020-10-07 08:47:13 UTC ---

After more testing, I confirm that the "cluster:cpu_core_hyperthreading" recording rule is broken because node_exporter/procfs reports no "core" and "package" label values for zVM guests and zKVMs.

I can see 2 options:

a) Update the "cluster:cpu_core_hyperthreading" recording rule to account for this corner case: if the "core" label is empty, we fall back to the "cpu" label value. The updated rule would look like this:

            clamp_max(
              (
                label_replace( ( ( sum ( label_replace(label_join(node_cpu_info{core="",package=""}, "core", "", "cpu"), "package", "0", "package", "") ) by (instance, package, core) )  > 1 ), "label_node_hyperthread_enabled", "true", "instance", "(.*)" )
                or on (instance, package)
                label_replace( ( ( sum ( label_replace(label_join(node_cpu_info{core="",package=""}, "core", "", "cpu"), "package", "0", "package", "") ) by (instance, package, core) ) <= 1 ), "label_node_hyperthread_enabled", "false", "instance", "(.*)" )
              ), 1
            )


b) "Fix" procfs upstream for the s390x arch to set the "core" label to the "cpu" value if it's missing from /proc/cpuinfo.

Personally I'd go for option a), because I don't think it makes sense to hack procfs if the information isn't present in /proc/cpuinfo (and upstream maintainers would probably feel the same, since this isn't a pattern already followed in procfs). But happy to get more input from SMEs.
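The fallback semantics of option a) can be sanity-checked outside Prometheus. The sketch below (Go; the struct and function names are hypothetical, invented for illustration) mimics what the proposed rule does: when a CPU's "core" label is empty it is replaced by the "cpu" label (and an empty "package" becomes "0"), and hyperthreading is flagged when more than one logical CPU maps to the same (package, core) pair:

```go
package main

import "fmt"

// cpuLabels mirrors the relevant labels on a node_cpu_info series
// (hypothetical struct, for illustration only).
type cpuLabels struct {
	CPU, Core, Package string
}

// effectiveCore applies the option a) fallback: an empty "core" label
// falls back to the "cpu" label, and an empty "package" becomes "0".
func effectiveCore(l cpuLabels) (pkg, core string) {
	pkg, core = l.Package, l.Core
	if core == "" {
		core = l.CPU
	}
	if pkg == "" {
		pkg = "0"
	}
	return pkg, core
}

// hyperthreadEnabled reports true when any (package, core) pair is shared
// by more than one logical CPU, matching the "> 1" comparison in the rule.
func hyperthreadEnabled(cpus []cpuLabels) bool {
	counts := map[[2]string]int{}
	for _, l := range cpus {
		pkg, core := effectiveCore(l)
		counts[[2]string{pkg, core}]++
	}
	for _, n := range counts {
		if n > 1 {
			return true
		}
	}
	return false
}

func main() {
	// s390x zVM guest: no core/package info, so each CPU becomes its own core.
	zvm := []cpuLabels{{CPU: "0"}, {CPU: "1"}}
	// x86_64 with SMT: two logical CPUs share core 0 of package 0.
	x86 := []cpuLabels{
		{CPU: "0", Core: "0", Package: "0"},
		{CPU: "2", Core: "0", Package: "0"},
	}
	fmt.Println(hyperthreadEnabled(zvm)) // false
	fmt.Println(hyperthreadEnabled(x86)) // true
}
```

With this fallback, a zVM/zKVM guest counts one core per vCPU (consistent with comment 20's cores=cpus suggestion) instead of collapsing every CPU into a single empty-labeled core.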

--- Additional comment from Simon Pasquier on 2020-10-08 14:44:37 UTC ---

@Clayton what would be your recommendation regarding the options listed in comment 22?

--- Additional comment from Dan Li on 2020-10-19 19:17:26 UTC ---

Hi @Simon, I will leave the "UpcomingSprint" decision to you for this sprint as the bug is currently in the Monitoring team's queue (I tagged it last time in Multi-Arch queue).

--- Additional comment from Simon Pasquier on 2020-10-22 13:39:46 UTC ---

Fix submitted. It needs approval.

Comment 7 errata-xmlrpc 2020-12-14 13:50:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.8 security and bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5259