Description of problem:
Upon an upgrade from 4.4.16 to 4.5.5, some node-exporter pods began to show the following log message:

time="2020-08-17T19:37:33Z" level=error msg="ERROR: mountstats collector failed after 0.000539s: failed to parse mountstats: invalid NFS per-operations stats: [NULL: 1 1 0 44 24 2 0 3 0]" source="collector.go:132"

Version-Release number of selected component (if applicable):
4.5.5

How reproducible:
I tried on my 4.5 cluster and it was able to successfully listen, so I'm not sure if it depends on the upgrade from 4.4 and/or some custom resource:

time="2020-08-10T17:14:06Z" level=info msg="Build context (go=go1.13.4, user=root@19c35050d8da, date=20200724-06:51:03)" source="node_exporter.go:157"
time="2020-08-10T17:14:06Z" level=info msg="Enabled collectors:" source="node_exporter.go:97"
time="2020-08-10T17:14:06Z" level=info msg=" - arp" source="node_exporter.go:104"
time="2020-08-10T17:14:06Z" level=info msg=" - bcache" source="node_exporter.go:104"
time="2020-08-10T17:14:06Z" level=info msg=" - bonding" ...
time="2020-08-10T17:14:06Z" level=info msg=" - zfs" source="node_exporter.go:104"
time="2020-08-10T17:14:06Z" level=info msg="Listening on 127.0.0.1:9100" source="node_exporter.go:170"

Steps to Reproduce:
1. Upgrade 4.4.16 to 4.5.5
2. Observe errors in node-exporter

Actual results:
$ oc logs node-exporter-2b2bj -c node-exporter | tail
time="2020-08-17T19:37:33Z" level=error msg="ERROR: mountstats collector failed after 0.000539s: failed to parse mountstats: invalid NFS per-operations stats: [NULL: 1 1 0 44 24 2 0 3 0]" source="collector.go:132"

Expected results:
$ oc logs node-exporter-2b2bj -c node-exporter | tail
time="2020-08-10T17:14:06Z" level=info msg="Listening on 127.0.0.1:9100" source="node_exporter.go:170"

Additional info:
The following two issues were submitted upstream:
[0] https://github.com/prometheus/node_exporter/issues/1583
[1] https://github.com/prometheus/procfs/issues/275
This has been fixed upstream in https://github.com/prometheus/procfs/pull/276 and vendored into node_exporter v1.0.1, which will be available in OpenShift 4.6.
A single-patch update is not something that can be done easily: backporting the fix to 4.5 would mean updating node_exporter to a newer version and thus adding new features. Since this would add new features, please open an RFE.
node_exporter is a crucial component of the monitoring stack; it provides data for alerts, dashboards, and telemetry. Unfortunately, version 1.0 changes many metrics, so an in-place upgrade from v0.18 has a high potential of breaking other parts of the stack. Taking this into account, we recommend upgrading to OpenShift 4.6, as this is where everything has already been tested.
Due to how our internal backporting process works, this bug now needs verification that everything works correctly in 4.6. The backport process for 4.5 is tracked in https://bugzilla.redhat.com/show_bug.cgi?id=1890466.

For QE: With the latest node_exporter included in 4.6, this bug shouldn't happen. In 4.5 it can be reproduced by mounting any NFS volume on an OpenShift host and checking the node_exporter logs for that host. A kernel version of 5.3 or higher is also necessary for reproduction.
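
For illustration only: the failing line in the logs carries nine counters after the operation name ("NULL: 1 1 0 44 24 2 0 3 0"). Assuming the parser in node_exporter 0.18.1 accepted exactly eight counters per operation and that kernels 5.3 and later append a ninth, a tolerant parser could look roughly like the sketch below. The names perOpStats, parsePerOpLine, Counters and Extra are hypothetical and are not the procfs API; this is not the code from the upstream pull request.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// baseCounters is the number of counters this sketch assumes every kernel reports.
const baseCounters = 8

// perOpStats is a hypothetical type for this sketch, not the procfs struct.
type perOpStats struct {
	Name     string
	Counters []uint64 // the first eight counters, in file order
	Extra    uint64   // assumed ninth counter on kernels >= 5.3
	HasExtra bool     // true only when the ninth counter was present
}

// parsePerOpLine accepts a per-operation line with eight or nine counters
// after the operation name and rejects anything else.
func parsePerOpLine(line string) (perOpStats, error) {
	fields := strings.Fields(line)
	n := len(fields) - 1 // counters after the "NAME:" field
	if n != baseCounters && n != baseCounters+1 {
		return perOpStats{}, fmt.Errorf("invalid NFS per-operations stats: %v", fields)
	}
	st := perOpStats{Name: strings.TrimSuffix(fields[0], ":")}
	for i, f := range fields[1:] {
		v, err := strconv.ParseUint(f, 10, 64)
		if err != nil {
			return perOpStats{}, fmt.Errorf("invalid counter %q: %w", f, err)
		}
		if i < baseCounters {
			st.Counters = append(st.Counters, v)
		} else {
			st.Extra, st.HasExtra = v, true
		}
	}
	return st, nil
}

func main() {
	// The line quoted in the error message of this bug.
	st, err := parsePerOpLine("NULL: 1 1 0 44 24 2 0 3 0")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Printf("%s: %d base counters, extra present: %v\n", st.Name, len(st.Counters), st.HasExtra)
}
```

Keeping the ninth value in a separate optional field means lines from older kernels still parse, which is the behaviour one would want from a backport-safe change.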
tested with 4.5.0-0.nightly-2020-10-20-022340 and attached NFS PVs, issue is reproduced

# oc -n openshift-monitoring logs -c node-exporter node-exporter-2zbvp | grep "failed to parse mountstats: invalid NFS per-operations stats" | tail -n 4
time="2020-10-23T06:00:28Z" level=error msg="ERROR: mountstats collector failed after 0.001108s: failed to parse mountstats: invalid NFS per-operations stats: [NULL: 1 1 0 44 24 1 0 2 0]" source="collector.go:132"
time="2020-10-23T06:00:31Z" level=error msg="ERROR: mountstats collector failed after 0.000795s: failed to parse mountstats: invalid NFS per-operations stats: [NULL: 1 1 0 44 24 1 0 2 0]" source="collector.go:132"
time="2020-10-23T06:00:43Z" level=error msg="ERROR: mountstats collector failed after 0.000780s: failed to parse mountstats: invalid NFS per-operations stats: [NULL: 1 1 0 44 24 1 0 2 0]" source="collector.go:132"
time="2020-10-23T06:00:46Z" level=error msg="ERROR: mountstats collector failed after 0.000836s: failed to parse mountstats: invalid NFS per-operations stats: [NULL: 1 1 0 44 24 1 0 2 0]" source="collector.go:132"
(In reply to Junqi Zhao from comment #20)
> tested with 4.5.0-0.nightly-2020-10-20-022340 and attached NFS PVs, issue is
> reproduced

node_exporter (version=0.18.1)
Bound NFS PVs on 4.5.0-0.nightly-2020-10-20-022340 and upgraded to 4.6.0-0.nightly-2020-10-22-034051 (node_exporter version=1.0.1); there are no "failed to parse mountstats: invalid NFS per-operations stats" errors. Example:

# oc -n openshift-monitoring logs -c node-exporter node-exporter-6qpr9 | grep "invalid NFS per-operations stats"
no result
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196