Description of problem:
Upgrading a cluster from version 4.2.12-s390x to 4.2.13-s390x over the web console fails to update the monitoring operator. All other updates seem to apply correctly.

How reproducible:

Steps to Reproduce:
1. Update from version 4.2.12-s390x to 4.2.13-s390x
2. Wait for all updates to finish

Actual results:
Monitoring is degraded; node-exporter is in state CrashLoopBackOff, log attached.

Expected results:
Update successful

Additional info:
node-exporter log:

time="2020-01-09T08:15:55Z" level=info msg="Starting node_exporter (version=0.18.1, branch=rhaos-4.2-rhel-7, revision=324a1d4caae524181d0c4ccedb070763ea7bd403)" source="node_exporter.go:156"
time="2020-01-09T08:15:55Z" level=info msg="Build context (go=go1.12.12, user=root@a4be65dd8c74, date=20191223-11:56:48)" source="node_exporter.go:157"
time="2020-01-09T08:15:55Z" level=info msg="Enabled collectors:" source="node_exporter.go:97"
time="2020-01-09T08:15:55Z" level=info msg=" - arp" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - bcache" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - bonding" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - conntrack" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - cpu" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - cpufreq" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - diskstats" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - edac" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - entropy" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - filefd" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - filesystem" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - hwmon" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - infiniband" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - ipvs" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - loadavg" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - mdadm" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - meminfo" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - netclass" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - netdev" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - netstat" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - nfs" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - nfsd" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - pressure" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - sockstat" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - stat" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - textfile" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - time" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - timex" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - uname" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - vmstat" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - xfs" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg=" - zfs" source="node_exporter.go:104"
time="2020-01-09T08:15:55Z" level=info msg="Listening on 127.0.0.1:9100" source="node_exporter.go:170"
panic: runtime error: index out of range

goroutine 39 [running]:
github.com/prometheus/procfs.parseCPUInfo(0xc000203000, 0x4c3, 0xe00, 0x4c3, 0xe00, 0x0, 0x0, 0x1)
	/go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/procfs/cpuinfo.go:85 +0x1d32
github.com/prometheus/procfs.FS.CPUInfo(0x3ffcbd7f550, 0xa, 0x24000000000f4684, 0x1d394, 0xc0001eb6c0, 0x70, 0x70)
	/go/src/github.com/prometheus/node_exporter/vendor/github.com/prometheus/procfs/cpuinfo.go:61 +0x180
github.com/prometheus/node_exporter/collector.(*cpuCollector).updateInfo(0xc00005d180, 0xc0000de180, 0x32ea661f00000000, 0x325e45f98388)
	/go/src/github.com/prometheus/node_exporter/collector/cpu_linux.go:96 +0x3c
github.com/prometheus/node_exporter/collector.(*cpuCollector).Update(0xc00005d180, 0xc0000de180, 0xc311a0, 0x0)
	/go/src/github.com/prometheus/node_exporter/collector/cpu_linux.go:81 +0xdc
github.com/prometheus/node_exporter/collector.execute(0x6e60b2, 0x3, 0x7cb140, 0xc00005d180, 0xc0000de180)
	/go/src/github.com/prometheus/node_exporter/collector/collector.go:127 +0x68
github.com/prometheus/node_exporter/collector.NodeCollector.Collect.func1(0xc0000de180, 0xc0000d82f0, 0x6e60b2, 0x3, 0x7cb140, 0xc00005d180)
	/go/src/github.com/prometheus/node_exporter/collector/collector.go:118 +0x48
created by github.com/prometheus/node_exporter/collector.NodeCollector.Collect
	/go/src/github.com/prometheus/node_exporter/collector/collector.go:117 +0xe2
> I think it is a race condition where the monitoring operator degrades after the install finishes.

What is the Degraded Reason/Message? I'd like to search for them in Telemetry data to double-check that this is just an s390x issue and just a 4.2.13 issue.
Also, we've pulled the update edges leading into 4.2.13 out of the recommended-update graph while we look into this [1].

[1]: https://github.com/openshift/cincinnati-graph-data/pull/17
The last condition in `oc describe co/monitoring` is:

    Message:  Failed to rollout the stack. Error: running task Updating node-exporter failed: reconciling node-exporter DaemonSet failed: updating DaemonSet object failed: waiting for DaemonSetRollout of node-exporter: daemonset node-exporter is not ready. status: (desired: 5, updated: 5, ready: 0, unavailable: 5)
    Reason:   UpdatingnodeExporterFailed
    Status:   True
    Type:     Degraded
It looks like (#572) enabled the routine from PR (#46), which may need support added for s390x.

References:
(#572) https://github.com/openshift/cluster-monitoring-operator/pull/572/files#diff-ea42f2a19bba3e07005d6b437ed1a902R26
(#46 a) https://github.com/openshift/node_exporter/pull/46/files#diff-0e15beb18c136e9ae4e6a0eec3700896R39
(#46 b) https://github.com/openshift/node_exporter/pull/46/files#diff-0e15beb18c136e9ae4e6a0eec3700896R80-R115
The problem is that the format of /proc/cpuinfo differs between s390x and x86 platforms, and the procfs library doesn't handle those differences well. As a workaround we can skip gathering these metrics on s390x for now, as suggested in the related PR (https://github.com/openshift/node_exporter/pull/52). For a longer-term upstream fix, can someone upload an example of /proc/cpuinfo from an s390x system?
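As a rough illustration of that workaround approach (the names here are hypothetical; the actual change is in the linked PR), a gate like this could skip the cpuinfo-derived metrics on architectures the parser does not yet understand, so one collector cannot crash the whole exporter:

```go
package main

import (
	"fmt"
	"runtime"
)

// Architectures whose /proc/cpuinfo layout the parser is known to handle
// (illustrative list only).
var cpuInfoSupported = map[string]bool{
	"amd64": true,
	"386":   true,
}

func shouldCollectCPUInfo() bool {
	return cpuInfoSupported[runtime.GOARCH]
}

func main() {
	if !shouldCollectCPUInfo() {
		fmt.Printf("skipping cpuinfo metrics on %s\n", runtime.GOARCH)
		return
	}
	fmt.Println("collecting cpuinfo metrics")
}
```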
Example of /proc/cpuinfo for s390x:

vendor_id : IBM/S390
# processors : 4
bogomips per cpu: 3033.00
max thread id : 0
features : esan3 zarch stfle msa ldisp eimm dfp edat etf3eh highgprs te vx sie
facilities : 0 1 2 3 4 6 7 8 9 10 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 30 31 32 33 34 35 36 37 40 41 42 43 44 45 46 47 48 49 50 51 52 53 55 57 73 74 75 76 77 80 81 82 128 129 131
cache0 : level=1 type=Data scope=Private size=128K line_size=256 associativity=8
cache1 : level=1 type=Instruction scope=Private size=96K line_size=256 associativity=6
cache2 : level=2 type=Data scope=Private size=2048K line_size=256 associativity=8
cache3 : level=2 type=Instruction scope=Private size=2048K line_size=256 associativity=8
cache4 : level=3 type=Unified scope=Shared size=65536K line_size=256 associativity=16
cache5 : level=4 type=Unified scope=Shared size=491520K line_size=256 associativity=30
processor 0: version = FF, identification = 2733E8, machine = 2964
processor 1: version = FF, identification = 2733E8, machine = 2964
processor 2: version = FF, identification = 2733E8, machine = 2964
processor 3: version = FF, identification = 2733E8, machine = 2964

cpu number : 0
cpu MHz dynamic : 5000
cpu MHz static : 5000

cpu number : 1
cpu MHz dynamic : 5000
cpu MHz static : 5000

cpu number : 2
cpu MHz dynamic : 5000
cpu MHz static : 5000

cpu number : 3
cpu MHz dynamic : 5000
cpu MHz static : 5000
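To make the format difference concrete, here is a simplified sketch (not the upstream prometheus/procfs implementation) of how the layout above could be parsed: per-CPU blocks are keyed by "cpu number" rather than "processor", and the vendor comes from a single file-level "vendor_id" line.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type s390xCPU struct {
	Number     uint
	MHzDynamic float64
	MHzStatic  float64
}

// parseS390x walks the s390x-style cpuinfo shown in comment 7, starting a new
// CPU entry at each "cpu number" line instead of each "processor" line.
func parseS390x(contents string) (string, []s390xCPU, error) {
	var vendor string
	var cpus []s390xCPU
	for _, line := range strings.Split(contents, "\n") {
		fields := strings.SplitN(line, ":", 2)
		if len(fields) < 2 {
			continue
		}
		key := strings.TrimSpace(fields[0])
		val := strings.TrimSpace(fields[1])
		switch key {
		case "vendor_id":
			vendor = val
		case "cpu number":
			n, perr := strconv.ParseUint(val, 10, 32)
			if perr != nil {
				return "", nil, perr
			}
			cpus = append(cpus, s390xCPU{Number: uint(n)})
		case "cpu MHz dynamic":
			if len(cpus) > 0 {
				cpus[len(cpus)-1].MHzDynamic, _ = strconv.ParseFloat(val, 64)
			}
		case "cpu MHz static":
			if len(cpus) > 0 {
				cpus[len(cpus)-1].MHzStatic, _ = strconv.ParseFloat(val, 64)
			}
		}
	}
	return vendor, cpus, nil
}

func main() {
	sample := "vendor_id : IBM/S390\n# processors : 4\ncpu number : 0\ncpu MHz dynamic : 5000\ncpu MHz static : 5000\n"
	vendor, cpus, _ := parseS390x(sample)
	fmt.Println(vendor, cpus)
}
```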
Comment 7 has the requested information.
Created upstream PR to add support for reading arm, ppc, and s390x cpuinfo files. https://github.com/prometheus/procfs/pull/257
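For a sense of the general shape such a fix could take (a hypothetical sketch only; the real implementation is in the upstream PR), cpuinfo parsing could be dispatched per architecture instead of assuming the x86 layout everywhere:

```go
package main

import (
	"fmt"
	"runtime"
)

type parseFunc func(contents string) error

// Placeholder parsers standing in for real per-architecture implementations.
func parseCPUInfoX86(string) error   { return nil }
func parseCPUInfoS390x(string) error { return nil }
func parseCPUInfoARM(string) error   { return nil }
func parseCPUInfoPPC(string) error   { return nil }

// parserFor selects the cpuinfo parser matching the running architecture.
func parserFor(arch string) (parseFunc, bool) {
	switch arch {
	case "amd64", "386":
		return parseCPUInfoX86, true
	case "s390x":
		return parseCPUInfoS390x, true
	case "arm", "arm64":
		return parseCPUInfoARM, true
	case "ppc64", "ppc64le":
		return parseCPUInfoPPC, true
	}
	return nil, false
}

func main() {
	if p, ok := parserFor(runtime.GOARCH); ok {
		_ = p("...")
		fmt.Println("parsed with arch-specific parser")
	} else {
		fmt.Println("unsupported architecture, skipping cpuinfo metrics")
	}
}
```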
This is still broken on 4.2.14, which is the version the 'latest' download is pointing to.
Although 'latest' is not pointing to it, 4.2.16 is the latest version: https://mirror.openshift.com/pub/openshift-v4/s390x/clients/ocp/4.2.16 - could you try this? It worked for me.
I upgraded the cluster with `oc adm upgrade`; this is fixed in 4.2.16.
Need to set Target Release, or the Errata sweeper won't move this to ON_QA.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0460
*** Bug 1795508 has been marked as a duplicate of this bug. ***
*** Bug 1829332 has been marked as a duplicate of this bug. ***