Bug 1973491
| Summary: | Node exporter veth optimizations do not work if the network type is OVN | |||
|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | browsell | |
| Component: | Monitoring | Assignee: | Prashant Balachandran <pnair> | |
| Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> | |
| Severity: | medium | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | 4.8 | CC: | anpicker, aos-bugs, erooth, janantha, keyoung, spasquie | |
| Target Milestone: | --- | |||
| Target Release: | 4.9.0 | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | Doc Type: | No Doc Update | ||
| Doc Text: | Story Points: | --- | ||
| Clone Of: | ||||
| : | 1984753 (view as bug list) | Environment: | ||
| Last Closed: | 2021-10-18 17:35:26 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1984753 | |||
Description
browsell
2021-06-18 01:05:33 UTC
The fix for 4.9 was included in https://github.com/openshift/cluster-monitoring-operator/pull/1269. Tested with a 4.9.0-0.nightly-2021-07-25-125326 baremetal cluster: the fix is in the payload, but renamed interfaces still show up, for example "4ccbb8f4ae0200b: renamed from vethc2f58213".
# oc get infrastructures/cluster -o jsonpath="{..status.platform}"
None
# oc get network/cluster -oyaml
...
spec:
clusterNetwork:
- cidr: 10.128.0.0/14
hostPrefix: 23
externalIP:
policy: {}
networkType: OVNKubernetes
serviceNetwork:
- 172.30.0.0/16
...
# oc -n openshift-monitoring get ds node-exporter -oyaml
...
spec:
containers:
- args:
- --web.listen-address=127.0.0.1:9100
- --path.sysfs=/host/sys
- --path.rootfs=/host/root
- --no-collector.wifi
- --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
- --collector.netclass.ignored-devices=^(veth.*|[a-z0-9]+@if\d+)$
- --collector.netdev.device-exclude=^(veth.*|[a-z0-9]+@if\d+)$
- --collector.cpu.info
- --collector.textfile.directory=/var/node_exporter/textfile
- --no-collector.cpufreq
image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d491ea32e33744c82eafad06ff33f0a19ebbee1e7fcbc3d6a93cff39aa91f3d
imagePullPolicy: IfNotPresent
...
# oc debug node/juzhao-bm-26kf5-compute-2
sh-4.2# chroot /host
sh-4.4# journalctl
*******
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc2f58213: link becomes ready
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: 4ccbb8f4ae0200b: renamed from vethc2f58213
*******
sh-4.4# dmesg | grep 'renamed from veth'
[ 238.091266] 4ccbb8f4ae0200b: renamed from vethc2f58213
[ 238.106811] 0c53b480a2abc85: renamed from veth92ba2719
[ 268.760350] 19ae2ab86a31c7f: renamed from vethf5faf749
[ 290.433717] 5411c57536e8b6c: renamed from veth6044fdf6
[ 680.972413] 60c12c622d58048: renamed from vethe757e5be
[ 754.794168] 1d3083bf2efdee4: renamed from vetha38913a0
[ 810.309462] 7d07b208ae80c72: renamed from veth6292f9e4
[ 867.838939] e1d5f9e8769fe5a: renamed from veth19f339b5
[ 1159.311463] f8177fcdb0e22a9: renamed from veth82626846
[ 1165.526130] 2175efb937302c8: renamed from veth8184a07e
[ 1552.881951] b9b43bd0c830b86: renamed from veth76983486
[ 1570.658756] 758d1858ecc1633: renamed from veth549c2c10
[ 1582.499308] d4c18093c72dc02: renamed from veth2554af92
[ 2084.626352] c5b3d7515916d25: renamed from vethf3726c4f
[ 2084.742157] c8117f37b21a409: renamed from vethf917d152
[ 2219.525385] a1d766cbc7840c2: renamed from veth8486a99c
[ 2220.243239] bda5b44e5d88594: renamed from vethb45f554f
[ 2862.810886] b3a1b1cdfd30f11: renamed from veth9ec63641
[ 2977.608013] e1b8ca86061a4bb: renamed from veth7f9716b2
[ 2999.148016] aac627351c03eed: renamed from vethf3044dc0
[ 3517.637889] 0f19a90ddde7623: renamed from veth0d0dc433
....
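The rename lines above can be turned into an old-name to new-name mapping, which makes it easier to look up the renamed devices in metrics afterwards. The following is only a minimal sketch: the function name `parse_renames` and the `SAMPLE` text are illustrative assumptions, and the sample lines are copied from the output in this report.

```python
import re

# Matches kernel messages such as
# "[ 238.091266] 4ccbb8f4ae0200b: renamed from vethc2f58213".
RENAME_RE = re.compile(r"(?P<new>\S+): renamed from (?P<old>veth\S+)")


def parse_renames(log_text):
    """Return {old_veth_name: new_name} for every rename line in log_text."""
    renames = {}
    for line in log_text.splitlines():
        match = RENAME_RE.search(line)
        if match:
            renames[match.group("old")] = match.group("new")
    return renames


# Two dmesg lines and one unrelated journalctl line taken from this report.
SAMPLE = """\
[ 238.091266] 4ccbb8f4ae0200b: renamed from vethc2f58213
[ 238.106811] 0c53b480a2abc85: renamed from veth92ba2719
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
"""

if __name__ == "__main__":
    for old, new in parse_renames(SAMPLE).items():
        print(f"{old} -> {new}")
```

In practice the input would be the saved output of `dmesg` or `journalctl` from the node rather than the inline sample.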
For example, for the log line "4ccbb8f4ae0200b: renamed from vethc2f58213", querying the node_network_info metric through the API still returns the device that is now named 4ccbb8f4ae0200b:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-
...
1627302317.131,
"1"
]
},
{
"metric": {
"__name__": "node_network_info",
"address": "1a:62:a1:eb:b1:b8",
"broadcast": "ff:ff:ff:ff:ff:ff",
"container": "kube-rbac-proxy",
"device": "4ccbb8f4ae0200b",
"duplex": "full",
"endpoint": "https",
"instance": "juzhao-bm-26kf5-compute-2",
"job": "node-exporter",
"namespace": "openshift-monitoring",
"operstate": "up",
"pod": "node-exporter-7fm4d",
"service": "node-exporter"
},
"value": [
...

Looking at https://github.com/openshift/ovn-kubernetes/blob/157c0a7215a610ac852224f50f6d7a96f4d7384d/go-controller/pkg/cni/helper_linux.go#L148, CNI renames the host end of the veth pair with the last 15 characters of the container ID. While the "ip addr" command returns interface names like "9f2886d201ca42a@if3", node_exporter reads its information from /sys/class/net, which only contains the ID string:
$ oc exec node-exporter-2bpxg -c node-exporter -t -- ls /sys/class/net | head
0ae02dac897204e
0bb1d13818c455a
0d38482247a03d9
22f75973f5bb373
30238c61402c884
35d7ba8ebd6a817
3e4f2f3a5e1fa95
3f8d77b17598892
40a4a4c43a5d28f
5049a28d2add36e
So instead of the regexp being "^(veth.*|[a-z0-9]+@if\d+)$", it needs to be "^(veth.*|[a-f0-9]{15})$".

Tested with a 4.9.0-0.nightly-2021-08-01-132055 baremetal OVN cluster:
# oc -n openshift-monitoring get ds node-exporter -oyaml
...
spec:
containers:
- args:
- --web.listen-address=127.0.0.1:9100
- --path.sysfs=/host/sys
- --path.rootfs=/host/root
- --no-collector.wifi
- --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
- --collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15})$
- --collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15})$
- --collector.cpu.info
- --collector.textfile.directory=/var/node_exporter/textfile
- --no-collector.cpufreq
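
The difference between the old and the new device patterns can be checked offline against the interface names quoted in this report. This is a sketch under one stated assumption: node_exporter is written in Go and compiles these patterns with Go's regexp package, whereas the snippet uses Python's `re`; for these simple patterns the two are expected to behave the same. The helper name `excluded` is illustrative.

```python
import re

# Patterns copied from the node-exporter arguments quoted in this report.
OLD_PATTERN = re.compile(r"^(veth.*|[a-z0-9]+@if\d+)$")
NEW_PATTERN = re.compile(r"^(veth.*|[a-f0-9]{15})$")

# Names as they appear in the report: two /sys/class/net style entries,
# an original veth name, the "ip addr" form, and eth0 as a control.
NAMES = [
    "4ccbb8f4ae0200b",
    "0ae02dac897204e",
    "vethc2f58213",
    "9f2886d201ca42a@if3",
    "eth0",
]


def excluded(pattern, name):
    """Return True if the exporter would ignore a device with this name."""
    return pattern.match(name) is not None


if __name__ == "__main__":
    for name in NAMES:
        old = excluded(OLD_PATTERN, name)
        new = excluded(NEW_PATTERN, name)
        print(f"{name:22} old={old} new={new}")
```

The old pattern only excludes the `@if3` form, which never appears under /sys/class/net, while the new pattern excludes the bare 15-character hexadecimal names and still leaves `eth0` alone.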
# oc debug node/juzhao-49-6v6v9-compute-0
sh-4.4# chroot /host
sh-4.4# dmesg | grep 'renamed from veth'
[ 231.196703] a601f8c41cde05c: renamed from veth528ef52f
[ 232.270686] 2a27328546d123b: renamed from veth60feafc9
[ 238.338056] 9a1652e2c688302: renamed from veth583356f4
[ 238.798171] baceffec99a204f: renamed from vethea526057
[ 279.903613] d3c837b7dcf0c2a: renamed from vethd325fc06
[ 290.895184] 37b9fa89d3c55f6: renamed from veth5a621dfa
[ 529.256565] b432bbcd31abc05: renamed from veth03a4f764
[ 529.275102] 6f24e139af20d26: renamed from veth1b4be232
[ 546.409958] 4712be32ab6118d: renamed from veth131d215f
[ 572.835334] 24dd3b8f6a16b3b: renamed from veth0950dc4c
Checked via the API: there is no node_network_info series for devices that were renamed from 'veth**'.
Example:
# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=node_network_info' | jq | grep a601f8c41cde05c
no result
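
The `jq | grep` check above can also be done on the parsed JSON, which avoids accidental matches in other labels. The sketch below assumes the standard Prometheus HTTP API instant-query response shape (`data.result[].metric`); the function name `devices_in_response` is illustrative, and `SAMPLE_RESPONSE` is a hypothetical, trimmed response, not output captured from the cluster in this report.

```python
import json


def devices_in_response(response_text):
    """Return the set of 'device' label values in an instant-query response."""
    payload = json.loads(response_text)
    return {
        series["metric"]["device"]
        for series in payload.get("data", {}).get("result", [])
        if "device" in series.get("metric", {})
    }


# Hypothetical response body: device names and values are made up.
SAMPLE_RESPONSE = json.dumps({
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "node_network_info", "device": "br-ex"},
             "value": [1627302317.131, "1"]},
            {"metric": {"__name__": "node_network_info", "device": "ens3"},
             "value": [1627302317.131, "1"]},
        ],
    },
})

if __name__ == "__main__":
    devices = devices_in_response(SAMPLE_RESPONSE)
    # A renamed veth device should be absent once the exclusion works.
    print("a601f8c41cde05c" in devices)
```

To use it against a real cluster, the body returned by the `curl` command would be passed to `devices_in_response` in place of the sample.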
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2021:3759