Bug 1973491 - Node exporter veth optimizations do not work if the network type is OVN
Summary: Node exporter veth optimizations do not work if the network type is OVN
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.8
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Prashant Balachandran
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On:
Blocks: 1984753
 
Reported: 2021-06-18 01:05 UTC by browsell
Modified: 2021-10-18 17:35 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Clones: 1984753
Environment:
Last Closed: 2021-10-18 17:35:26 UTC
Target Upstream Version:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift cluster-monitoring-operator pull 1302 | 0 | None | open | Bug 1973491: jsonnet: update deps | 2021-07-29 11:13:34 UTC
Github | prometheus-operator kube-prometheus pull 1224 | 0 | None | closed | jsonnet: kube-prometheus adapt to changes to veth interfaces names | 2021-07-21 08:42:52 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:35:31 UTC

Description browsell 2021-06-18 01:05:33 UTC
Description of problem:

The following node-exporter args do not work on bare metal because the veth interfaces are renamed (see the kernel log below), so their names no longer match the veth.* pattern.

- --collector.netclass.ignored-devices=^(veth.*)$
- --collector.netdev.device-exclude=^(veth.*)$

[  852.832844] IPv6: ADDRCONF(NETDEV_UP): veth280c24a2: link is not ready
[  852.846749] IPv6: ADDRCONF(NETDEV_CHANGE): veth280c24a2: link becomes ready
[  852.855149] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  852.969991] ca38b71a654d329: renamed from veth280c24a2



Version-Release number of selected component (if applicable):
4.8.0-0.nightly-2021-06-13-101614

How reproducible:
100% 


Steps to Reproduce:
1. Install system
2.
3.

Actual results:



Expected results:


Additional info:

Comment 9 Simon Pasquier 2021-07-22 07:22:31 UTC
The fix on 4.9 was included in https://github.com/openshift/cluster-monitoring-operator/pull/1269

Comment 11 Junqi Zhao 2021-07-26 12:11:48 UTC
Tested with 4.9.0-0.nightly-2021-07-25-125326 on a bare-metal cluster. The fix is in the payload, but the rename messages are still visible, for example: "4ccbb8f4ae0200b: renamed from vethc2f58213"
# oc get infrastructures/cluster -o jsonpath="{..status.platform}"
None

# oc get network/cluster -oyaml
...
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
...

# oc -n openshift-monitoring get ds node-exporter -oyaml
...
    spec:
      containers:
      - args:
        - --web.listen-address=127.0.0.1:9100
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*|[a-z0-9]+@if\d+)$
        - --collector.netdev.device-exclude=^(veth.*|[a-z0-9]+@if\d+)$
        - --collector.cpu.info
        - --collector.textfile.directory=/var/node_exporter/textfile
        - --no-collector.cpufreq
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d491ea32e33744c82eafad06ff33f0a19ebbee1e7fcbc3d6a93cff39aa91f3d
        imagePullPolicy: IfNotPresent
...

# oc debug node/juzhao-bm-26kf5-compute-2
sh-4.2# chroot /host
sh-4.4# journalctl
*******
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc2f58213: link becomes ready
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: 4ccbb8f4ae0200b: renamed from vethc2f58213
*******
sh-4.4# dmesg | grep 'renamed from veth'
[  238.091266] 4ccbb8f4ae0200b: renamed from vethc2f58213
[  238.106811] 0c53b480a2abc85: renamed from veth92ba2719
[  268.760350] 19ae2ab86a31c7f: renamed from vethf5faf749
[  290.433717] 5411c57536e8b6c: renamed from veth6044fdf6
[  680.972413] 60c12c622d58048: renamed from vethe757e5be
[  754.794168] 1d3083bf2efdee4: renamed from vetha38913a0
[  810.309462] 7d07b208ae80c72: renamed from veth6292f9e4
[  867.838939] e1d5f9e8769fe5a: renamed from veth19f339b5
[ 1159.311463] f8177fcdb0e22a9: renamed from veth82626846
[ 1165.526130] 2175efb937302c8: renamed from veth8184a07e
[ 1552.881951] b9b43bd0c830b86: renamed from veth76983486
[ 1570.658756] 758d1858ecc1633: renamed from veth549c2c10
[ 1582.499308] d4c18093c72dc02: renamed from veth2554af92
[ 2084.626352] c5b3d7515916d25: renamed from vethf3726c4f
[ 2084.742157] c8117f37b21a409: renamed from vethf917d152
[ 2219.525385] a1d766cbc7840c2: renamed from veth8486a99c
[ 2220.243239] bda5b44e5d88594: renamed from vethb45f554f
[ 2862.810886] b3a1b1cdfd30f11: renamed from veth9ec63641
[ 2977.608013] e1b8ca86061a4bb: renamed from veth7f9716b2
[ 2999.148016] aac627351c03eed: renamed from vethf3044dc0
[ 3517.637889] 0f19a90ddde7623: renamed from veth0d0dc433
....
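
A minimal sketch of checking such dmesg output against the exclude pattern in this payload. The helper name renamed_devices and the two-line sample are illustrative assumptions, not part of node_exporter or the cluster tooling. node_exporter is written in Go, but for these simple patterns Python's re module is assumed to behave the same way.

```python
import re

# Matches dmesg lines such as "[  238.091266] 4ccbb8f4ae0200b: renamed from vethc2f58213".
RENAME_LINE = re.compile(r"^\[\s*\d+\.\d+\]\s+(\S+): renamed from (veth\S+)$")

# Exclude pattern shipped in the payload tested in this comment.
EXCLUDE = re.compile(r"^(veth.*|[a-z0-9]+@if\d+)$")


def renamed_devices(dmesg_text):
    """Return (new_name, old_name) pairs found in dmesg-style text (hypothetical helper)."""
    pairs = []
    for line in dmesg_text.splitlines():
        match = RENAME_LINE.match(line.strip())
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


# Two lines copied from the dmesg output above.
sample = """\
[  238.091266] 4ccbb8f4ae0200b: renamed from vethc2f58213
[  238.106811] 0c53b480a2abc85: renamed from veth92ba2719
"""

for new_name, old_name in renamed_devices(sample):
    status = "excluded" if EXCLUDE.search(new_name) else "still collected"
    print(f"{old_name} -> {new_name}: {status}")
```

Neither renamed device matches the pattern, because the name carries no "@if<N>" suffix in the kernel log.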

Comment 12 Junqi Zhao 2021-07-26 12:30:14 UTC
Example: for the log line
4ccbb8f4ae0200b: renamed from vethc2f58213
querying the node_network_info metric through the API shows that the device now named 4ccbb8f4ae0200b is still being collected:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-
          1627302317.131,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "node_network_info",
          "address": "1a:62:a1:eb:b1:b8",
          "broadcast": "ff:ff:ff:ff:ff:ff",
          "container": "kube-rbac-proxy",
          "device": "4ccbb8f4ae0200b",
          "duplex": "full",
          "endpoint": "https",
          "instance": "juzhao-bm-26kf5-compute-2",
          "job": "node-exporter",
          "namespace": "openshift-monitoring",
          "operstate": "up",
          "pod": "node-exporter-7fm4d",
          "service": "node-exporter"
        },
        "value": [

Comment 13 Simon Pasquier 2021-07-26 14:00:41 UTC
Looking at https://github.com/openshift/ovn-kubernetes/blob/157c0a7215a610ac852224f50f6d7a96f4d7384d/go-controller/pkg/cni/helper_linux.go#L148, CNI renames the host end of the veth pair to the last 15 characters of the container ID. While the "ip addr" command returns interface names like "9f2886d201ca42a@if3", node_exporter reads information from /sys/class/net, which contains only the ID string:

$ oc exec node-exporter-2bpxg -c node-exporter -t -- ls /sys/class/net | head
0ae02dac897204e
0bb1d13818c455a
0d38482247a03d9
22f75973f5bb373
30238c61402c884
35d7ba8ebd6a817
3e4f2f3a5e1fa95
3f8d77b17598892
40a4a4c43a5d28f
5049a28d2add36e

So instead of the regexp being "^(veth.*|[a-z0-9]+@if\d+)$", it needs to be "^(veth.*|[a-f0-9]{15})$".
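
The difference between the three patterns mentioned in this bug can be sketched as follows. This is an illustration in Python rather than node_exporter's own Go code. The helper excluded is a hypothetical name, and the device names are taken from the logs and listings in this report.

```python
import re

ORIGINAL = re.compile(r"^(veth.*)$")                    # args from the description
INTERIM = re.compile(r"^(veth.*|[a-z0-9]+@if\d+)$")     # first attempted fix
FIXED = re.compile(r"^(veth.*|[a-f0-9]{15})$")          # pattern proposed above


def excluded(pattern, name):
    """True if the device name would be ignored by the given pattern (hypothetical helper)."""
    return pattern.search(name) is not None


names = [
    "veth280c24a2",         # name before the rename
    "4ccbb8f4ae0200b",      # name as it appears under /sys/class/net
    "9f2886d201ca42a@if3",  # name as printed by "ip addr"
    "eth0",                 # a regular interface that must stay collected
]

for name in names:
    print(
        f"{name:22} original={excluded(ORIGINAL, name)} "
        f"interim={excluded(INTERIM, name)} fixed={excluded(FIXED, name)}"
    )
```

Only the last pattern matches the bare 15-character ID. One consequence of this design is that any other interface whose name happens to be exactly 15 lowercase hexadecimal characters would be excluded as well.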

Comment 15 Junqi Zhao 2021-08-02 06:22:39 UTC
Tested with 4.9.0-0.nightly-2021-08-01-132055 on a bare-metal OVN cluster:
# oc -n openshift-monitoring get ds node-exporter -oyaml
...
    spec:
      containers:
      - args:
        - --web.listen-address=127.0.0.1:9100
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15})$
        - --collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15})$
        - --collector.cpu.info
        - --collector.textfile.directory=/var/node_exporter/textfile
        - --no-collector.cpufreq

# oc debug node/juzhao-49-6v6v9-compute-0 
sh-4.4# chroot /host
sh-4.4# dmesg | grep 'renamed from veth'
[  231.196703] a601f8c41cde05c: renamed from veth528ef52f
[  232.270686] 2a27328546d123b: renamed from veth60feafc9
[  238.338056] 9a1652e2c688302: renamed from veth583356f4
[  238.798171] baceffec99a204f: renamed from vethea526057
[  279.903613] d3c837b7dcf0c2a: renamed from vethd325fc06
[  290.895184] 37b9fa89d3c55f6: renamed from veth5a621dfa
[  529.256565] b432bbcd31abc05: renamed from veth03a4f764
[  529.275102] 6f24e139af20d26: renamed from veth1b4be232
[  546.409958] 4712be32ab6118d: renamed from veth131d215f
[  572.835334] 24dd3b8f6a16b3b: renamed from veth0950dc4c

Checked via the API: there is no node_network_info series for any device that was renamed from 'veth*'.
Example:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=node_network_info' | jq | grep a601f8c41cde05c
no result
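
Instead of grepping for one device at a time, the whole query response could be scanned for leftover veth host-end devices. The sketch below assumes the standard Prometheus HTTP API response layout (data.result[].metric). The function name leaked_devices and the sample payload are illustrative, and the "ens3" entry is a made-up device added for contrast.

```python
import json
import re

# Same pattern as the node-exporter args verified above.
VETH_HOST_END = re.compile(r"^(veth.*|[a-f0-9]{15})$")


def leaked_devices(response_text):
    """Return sorted device labels in a query response that look like veth host ends."""
    body = json.loads(response_text)
    return sorted({
        series["metric"]["device"]
        for series in body["data"]["result"]
        if VETH_HOST_END.search(series["metric"].get("device", ""))
    })


# Hypothetical response: one series as seen before the fix (Comment 12)
# and one invented regular interface.
sample = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "node_network_info", "device": "4ccbb8f4ae0200b"},
             "value": [1627302317.131, "1"]},
            {"metric": {"__name__": "node_network_info", "device": "ens3"},
             "value": [1627302317.131, "1"]},
        ],
    },
})

print(leaked_devices(sample))
```

An empty list would correspond to the "no result" outcome above for every renamed device at once.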

Comment 22 errata-xmlrpc 2021-10-18 17:35:26 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

