1789260 – update from version 4.2.12-s390x to 4.2.13-s390x fails because of broken node-exporter

Bug 1789260 - update from version 4.2.12-s390x to 4.2.13-s390x fails because of broken node-exporter

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Multi-Arch
Sub Component:
Version:	4.2.z
Hardware:	s390x
OS:	Unspecified
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	4.2.z
Assignee:	David Benoit
QA Contact:	Barry Donahue
Docs Contact:
URL:
Whiteboard:
Duplicates (2):	1795508 1829332
Depends On:
Blocks:	1785594 1791413

Reported:	2020-01-09 08:24 UTC by Alexander Klein
Modified:	2023-09-07 21:24 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1791413
Environment:
Last Closed:	2020-02-24 16:52:45 UTC
Target Upstream Version:
Embargoed:
Flags:	alklein: needinfo-

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2020:0460	0	None	None	None	2020-02-24 16:52:59 UTC

Internal Links: 1838932

Description Alexander Klein 2020-01-09 08:24:26 UTC

Description of problem:
Upgrading a cluster from version 4.2.12-s390x to 4.2.13-s390x through the web console fails to update the monitoring operator. All other updates seem to apply correctly.


How reproducible:


Steps to Reproduce:
1. Update from version 4.2.12-s390x to 4.2.13-s390x.
2. Wait for all updates to finish.


Actual results:
Monitoring is degraded.

node-exporter is in state CrashLoopBackOff; the log is included below.

Expected results:
Update successful.

Additional info:

node-exporter log:

time="2020-01-09T08:15:55Z" level=info msg="Starting node_exporter (version=0.18.1, branch=rhaos-4.2-rhel-7, revision=324a1d4caae524181d0c4ccedb070763ea7bd403)" source="node_exporter.go:156"
time="2020-01-09T08:15:55Z" level=info msg="Build context (go=go1.12.12, user=root@a4be65dd8c74, date=20191223-11:56:48)" source="node_exporter.go:157"
time="2020-01-09T08:15:55Z" level=info msg="Enabled collectors:" source="node_exporter.go:97"
time="2020-01-09T08:15:55Z" level=info msg=" - arp" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - bcache" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - bonding" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - conntrack" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - cpu" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - cpufreq" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - diskstats" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - edac" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - entropy" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - filefd" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - filesystem" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - hwmon" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - infiniband" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - ipvs" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - loadavg" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - mdadm" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - meminfo" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - netclass" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - netdev" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - netstat" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - nfs" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - nfsd" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - pressure" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - sockstat" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - stat" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - textfile" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - time" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - timex" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - uname" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - vmstat" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - xfs" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - zfs" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg="Listening on 127.0.0.1:9100" source="node_exporter.go:170"
panic: runtime error: index out of range

goroutine 39 [running]:
github.com/prometheus/procfs.parseCPUInfo(0xc000203000, 0x4c3, 0xe00, 0x4c3, 0xe00, 0x0, 0x0, 0x1)
	/go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/procfs/cpuinfo.go:85 +0x1d32
github.com/prometheus/procfs.FS.CPUInfo(0x3ffcbd7f550, 0xa, 0x24000000000f4684, 0x1d394, 0xc0001eb6c0, 0x70, 0x70)
	/go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/procfs/cpuinfo.go:61 +0x180
github.com/prometheus/node_exporter/collector.(*cpuCollector).updateInfo(0xc00005d180, 0xc0000de180, 0x32ea661f00000000, 0x325e45f98388)
	/go/src/github.com/prometheus/node_exporter/collector/cpu_linux.go:96 +0x3c
github.com/prometheus/node_exporter/collector.(*cpuCollector).Update(0xc00005d180, 0xc0000de180, 0xc311a0, 0x0)
	/go/src/github.com/prometheus/node_exporter/collector/cpu_linux.go:81 +0xdc
github.com/prometheus/node_exporter/collector.execute(0x6e60b2, 0x3, 0x7cb140, 0xc00005d180, 0xc0000de180)
	/go/src/github.com/prometheus/node_exporter/collector/collector.go:127 +0x68
github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1(0xc0000de180, 0xc0000d82f0, 0x6e60b2, 0x3, 0x7cb140, 0xc00005d180)
	/go/src/github.com/prometheus/node_exporter/collector/collector.go:118 +0x48
created by github.com/prometheus/node_exporter/collector.NodeCollector.Collect
	/go/src/github.com/prometheus/node_exporter/collector/collector.go:117 +0xe2
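
The trace ends in parseCPUInfo with an index out of range, but it does not show which index was used. Below is a minimal sketch of one way a line-oriented parser can fail like this, assuming (this is not confirmed by the report) that it keeps a counter that only advances on `processor` lines and uses that counter to index a slice. `naiveParse` and `cpu` are hypothetical names, not the procfs code; the s390x-style input mirrors the first lines of the sample in Comment 7.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// cpu holds the one field this sketch cares about.
type cpu struct {
	VendorID string
}

// naiveParse is a hypothetical parser, not the procfs implementation.
// It assumes every record starts with a "processor" line and indexes the
// slice with a counter that only advances when it sees one.
func naiveParse(info string) (cpus []cpu, err error) {
	// Convert the runtime panic into an error so the sketch can report it.
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("parse panicked: %v", r)
		}
	}()
	i := -1
	scanner := bufio.NewScanner(strings.NewReader(info))
	for scanner.Scan() {
		field := strings.SplitN(scanner.Text(), ":", 2)
		if len(field) != 2 {
			continue
		}
		switch strings.TrimSpace(field[0]) {
		case "processor":
			cpus = append(cpus, cpu{})
			i++
		case "vendor_id":
			// On input that starts with vendor_id, i is still -1 here.
			cpus[i].VendorID = strings.TrimSpace(field[1])
		}
	}
	return cpus, scanner.Err()
}

func main() {
	x86 := "processor\t: 0\nvendor_id\t: GenuineIntel\n"
	s390x := "vendor_id       : IBM/S390\n# processors    : 4\n"

	_, err := naiveParse(x86)
	fmt.Println("x86-style input:", err)
	_, err = naiveParse(s390x)
	fmt.Println("s390x-style input:", err)
}
```

In a real collector there is no recover around the parse, so a panic like this terminates the process, which is consistent with the CrashLoopBackOff state reported above.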

Comment 2 W. Trevor King 2020-01-10 23:55:12 UTC

> I think it is a race condition where the monitoring operator degrades after the install finishes.

What is the Degraded Reason/Message?  I'd like to search for them in Telemetry data to double-check that this is just an s390x issue and just a 4.2.13 issue.

Comment 3 W. Trevor King 2020-01-11 00:01:55 UTC

Also, we've pulled the update edges leading into 4.2.13 out of the recommended-update graph while we look into this [1].

[1]: https://github.com/openshift/cincinnati-graph-data/pull/17

Comment 4 David Benoit 2020-01-11 00:17:58 UTC

The last condition in `oc describe co/monitoring` is:

    Message:  Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 5, updated: 5, ready: 0, unavailable: 5)
    Reason:                UpdatingnodeExporterFailed
    Status:                True
    Type:                  Degraded

Comment 5 David Benoit 2020-01-11 01:16:20 UTC

It looks like (#572) enabled the routine from PR (#46), which may need s390x support added.

References:
(#572) https://github.com/openshift/cluster-monitoring-operator/pull/572/files#diff-ea42f2a19bba3e07005d6b437ed1a902R26
(#46 a) https://github.com/openshift/node_exporter/pull/46/files#diff-0e15beb18c136e9ae4e6a0eec3700896R39
(#46 b) https://github.com/openshift/node_exporter/pull/46/files#diff-0e15beb18c136e9ae4e6a0eec3700896R80-R115

Comment 6 Paul Gier 2020-01-15 18:09:33 UTC

The format of /proc/cpuinfo differs between s390x and x86 platforms, and the procfs library does not handle these differences well.
As a workaround, we can skip gathering these metrics on s390x for now, as suggested in the related PR (https://github.com/openshift/node_exporter/pull/52).
For a longer-term upstream fix, can someone upload an example of /proc/cpuinfo from an s390x system?
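
A rough sketch of what "skip gathering these metrics on s390x" could look like in Go. This is not the change in the linked PR; `cpuInfoSupported` and `updateInfo` are hypothetical names, and the single-entry deny-list is an assumption made for illustration. The architecture is passed in as a string so the behaviour can be exercised on any machine.

```go
package main

import (
	"fmt"
	"runtime"
)

// cpuInfoSupported reports whether this sketch should try to read
// /proc/cpuinfo for the given GOARCH value. Listing only s390x is an
// assumption based on the workaround described in the comment.
func cpuInfoSupported(goarch string) bool {
	return goarch != "s390x"
}

// updateInfo is a hypothetical stand-in for a collector step. It runs
// collect only on architectures where parsing is expected to work and
// otherwise returns nil, so the metric is skipped instead of crashing.
func updateInfo(goarch string, collect func() error) error {
	if !cpuInfoSupported(goarch) {
		return nil
	}
	return collect()
}

func main() {
	err := updateInfo(runtime.GOARCH, func() error {
		fmt.Println("collecting cpu info on", runtime.GOARCH)
		return nil
	})
	fmt.Println("err:", err)
}
```

The trade-off is that s390x nodes report no CPU info metric at all until the parser itself understands the format.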

Comment 7 Prashanth Sundararaman 2020-01-15 18:46:40 UTC

Example of /proc/cpuinfo for s390x: 

vendor_id       : IBM/S390
# processors    : 4
bogomips per cpu: 3033.00
max thread id   : 0
features	: esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx sie 
facilities      : 0 1 2 3 4 6 7 8 9 10 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 40 41 42 43 44 45 46 47 48 49 50 51 52 53 55 57 73 74 75 76 77 80 81 82 128 129 131
cache0          : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1          : level=1 type=Instruction scope=Private size=96K line_size=256 associativity=6
cache2          : level=2 type=Data scope=Private size=2048K line_size=256 associativity=8
cache3          : level=2 type=Instruction scope=Private size=2048K line_size=256 associativity=8
cache4          : level=3 type=Unified scope=Shared size=65536K line_size=256 associativity=16
cache5          : level=4 type=Unified scope=Shared size=491520K line_size=256 associativity=30
processor 0: version = FF,  identification = 2733E8,  machine = 2964
processor 1: version = FF,  identification = 2733E8,  machine = 2964
processor 2: version = FF,  identification = 2733E8,  machine = 2964
processor 3: version = FF,  identification = 2733E8,  machine = 2964

cpu number      : 0
cpu MHz dynamic : 5000
cpu MHz static  : 5000

cpu number      : 1
cpu MHz dynamic : 5000
cpu MHz static  : 5000

cpu number      : 2
cpu MHz dynamic : 5000
cpu MHz static  : 5000

cpu number      : 3
cpu MHz dynamic : 5000
cpu MHz static  : 5000
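
For illustration, a minimal Go sketch that reads a few fields from a file shaped like the sample above without assuming x86-style `processor : N` records. It is not the upstream procfs implementation; `s390xInfo` and `parseS390X` are hypothetical names, and the choice of fields (vendor, processor count, dynamic MHz per CPU) is an assumption. Lines with keys the sketch does not know are ignored.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// s390xInfo is a hypothetical result type for this sketch.
type s390xInfo struct {
	VendorID   string
	Processors int             // from the "# processors" line
	DynamicMHz map[int]float64 // keyed by the preceding "cpu number"
}

// parseS390X extracts a few fields from s390x-style /proc/cpuinfo text.
func parseS390X(info string) (s390xInfo, error) {
	out := s390xInfo{DynamicMHz: map[int]float64{}}
	current := -1 // no "cpu number" line seen yet
	scanner := bufio.NewScanner(strings.NewReader(info))
	for scanner.Scan() {
		field := strings.SplitN(scanner.Text(), ":", 2)
		if len(field) != 2 {
			continue // blank line or line without a key/value separator
		}
		key, value := strings.TrimSpace(field[0]), strings.TrimSpace(field[1])
		switch key {
		case "vendor_id":
			out.VendorID = value
		case "# processors":
			n, err := strconv.Atoi(value)
			if err != nil {
				return out, fmt.Errorf("bad processor count %q: %w", value, err)
			}
			out.Processors = n
		case "cpu number":
			n, err := strconv.Atoi(value)
			if err != nil {
				return out, fmt.Errorf("bad cpu number %q: %w", value, err)
			}
			current = n
		case "cpu MHz dynamic":
			// Return an error instead of indexing with an unset counter.
			if current < 0 {
				return out, fmt.Errorf("cpu MHz line before any cpu number line")
			}
			mhz, err := strconv.ParseFloat(value, 64)
			if err != nil {
				return out, fmt.Errorf("bad MHz value %q: %w", value, err)
			}
			out.DynamicMHz[current] = mhz
		}
	}
	return out, scanner.Err()
}

func main() {
	sample := `vendor_id       : IBM/S390
# processors    : 4
processor 0: version = FF,  identification = 2733E8,  machine = 2964

cpu number      : 0
cpu MHz dynamic : 5000
cpu MHz static  : 5000
`
	info, err := parseS390X(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s: %d processors, cpu 0 at %.0f MHz\n",
		info.VendorID, info.Processors, info.DynamicMHz[0])
}
```

Malformed or unexpected input produces an error value rather than a panic, so a caller can log it and keep the rest of the exporter running.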

Comment 9 W. Trevor King 2020-01-15 19:46:02 UTC

Comment 7 has the requested information.

Comment 10 Paul Gier 2020-01-20 23:08:01 UTC

Created an upstream PR to add support for reading arm, ppc, and s390x cpuinfo files:
https://github.com/prometheus/procfs/pull/257

Comment 11 Alexander Klein 2020-01-27 11:26:51 UTC

This is still broken on 4.2.14, which is the version the 'latest' download points to.

Comment 12 Prashanth Sundararaman 2020-01-27 13:48:30 UTC

Although 'latest' does not point to it, 4.2.16 is the latest version: https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp/4.2.16. Could you try this? It worked for me.

Comment 13 Alexander Klein 2020-01-28 07:27:32 UTC

I upgraded the cluster with `oc adm upgrade`; this is fixed in 4.2.16z.

Comment 15 W. Trevor King 2020-02-11 22:59:00 UTC

Need to set Target Release, or the Errata sweeper won't move this to ON_QA.

Comment 18 errata-xmlrpc 2020-02-24 16:52:45 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0460

Comment 19 W. Trevor King 2020-03-14 04:28:16 UTC

*** Bug 1795508 has been marked as a duplicate of this bug. ***

Comment 20 Sergiusz Urbaniak 2020-04-30 12:32:58 UTC

*** Bug 1829332 has been marked as a duplicate of this bug. ***
