Bug 1973491
Summary: | Node exporter veth optimizations do not work if the network type is OVN | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | browsell | |
Component: | Monitoring | Assignee: | Prashant Balachandran <pnair> | |
Status: | CLOSED ERRATA | QA Contact: | Junqi Zhao <juzhao> | |
Severity: | medium | Docs Contact: | ||
Priority: | unspecified | |||
Version: | 4.8 | CC: | anpicker, aos-bugs, erooth, janantha, keyoung, spasquie | |
Target Milestone: | --- | |||
Target Release: | 4.9.0 | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | No Doc Update | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1984753 | Environment: | |
Last Closed: | 2021-10-18 17:35:26 UTC | Type: | Bug | |
Bug Blocks: | 1984753 |
Description
browsell
2021-06-18 01:05:33 UTC
The fix on 4.9 was included in https://github.com/openshift/cluster-monitoring-operator/pull/1269

Tested with a 4.9.0-0.nightly-2021-07-25-125326 baremetal cluster. The fix is in the payload, but the renamed interfaces are still reported, for example: "4ccbb8f4ae0200b : renamed from vethc2f58213"

# oc get infrastructures/cluster -o jsonpath="{..status.platform}"
None

# oc get network/cluster -oyaml
...
spec:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  externalIP:
    policy: {}
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
...

# oc -n openshift-monitoring get ds node-exporter -oyaml
...
    spec:
      containers:
      - args:
        - --web.listen-address=127.0.0.1:9100
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*|[a-z0-9]+@if\d+)$
        - --collector.netdev.device-exclude=^(veth.*|[a-z0-9]+@if\d+)$
        - --collector.cpu.info
        - --collector.textfile.directory=/var/node_exporter/textfile
        - --no-collector.cpufreq
        image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d491ea32e33744c82eafad06ff33f0a19ebbee1e7fcbc3d6a93cff39aa91f3d
        imagePullPolicy: IfNotPresent
...
# oc debug node/juzhao-bm-26kf5-compute-2
sh-4.2# chroot /host
sh-4.4# journalctl
*******
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): vethc2f58213: link becomes ready
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
Jul 26 08:34:42 juzhao-bm-26kf5-compute-2 kernel: 4ccbb8f4ae0200b: renamed from vethc2f58213
*******
sh-4.4# dmesg | grep 'renamed from veth'
[  238.091266] 4ccbb8f4ae0200b: renamed from vethc2f58213
[  238.106811] 0c53b480a2abc85: renamed from veth92ba2719
[  268.760350] 19ae2ab86a31c7f: renamed from vethf5faf749
[  290.433717] 5411c57536e8b6c: renamed from veth6044fdf6
[  680.972413] 60c12c622d58048: renamed from vethe757e5be
[  754.794168] 1d3083bf2efdee4: renamed from vetha38913a0
[  810.309462] 7d07b208ae80c72: renamed from veth6292f9e4
[  867.838939] e1d5f9e8769fe5a: renamed from veth19f339b5
[ 1159.311463] f8177fcdb0e22a9: renamed from veth82626846
[ 1165.526130] 2175efb937302c8: renamed from veth8184a07e
[ 1552.881951] b9b43bd0c830b86: renamed from veth76983486
[ 1570.658756] 758d1858ecc1633: renamed from veth549c2c10
[ 1582.499308] d4c18093c72dc02: renamed from veth2554af92
[ 2084.626352] c5b3d7515916d25: renamed from vethf3726c4f
[ 2084.742157] c8117f37b21a409: renamed from vethf917d152
[ 2219.525385] a1d766cbc7840c2: renamed from veth8486a99c
[ 2220.243239] bda5b44e5d88594: renamed from vethb45f554f
[ 2862.810886] b3a1b1cdfd30f11: renamed from veth9ec63641
[ 2977.608013] e1b8ca86061a4bb: renamed from veth7f9716b2
[ 2999.148016] aac627351c03eed: renamed from vethf3044dc0
[ 3517.637889] 0f19a90ddde7623: renamed from veth0d0dc433
....
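The following is a minimal sketch of why the renamed interfaces are still collected. It applies the exclusion pattern from the node-exporter args to a few before/after names taken from the dmesg output. It assumes Python's `re` module behaves like node_exporter's Go regexp engine for this particular pattern; the helper name `is_excluded` is hypothetical and only for illustration.

```python
import re

# Exclusion pattern copied from the node-exporter args in this report
# (--collector.netclass.ignored-devices / --collector.netdev.device-exclude).
# Assumption: Python's re matches this pattern the same way Go's regexp does.
OLD_PATTERN = re.compile(r"^(veth.*|[a-z0-9]+@if\d+)$")

# Original veth name -> name after the rename, taken from the dmesg output.
RENAMES = {
    "vethc2f58213": "4ccbb8f4ae0200b",
    "veth92ba2719": "0c53b480a2abc85",
    "vethf5faf749": "19ae2ab86a31c7f",
}


def is_excluded(device: str) -> bool:
    """Return True if the collector would skip this device name (hypothetical helper)."""
    return OLD_PATTERN.match(device) is not None


if __name__ == "__main__":
    for before, after in RENAMES.items():
        print(f"{before}: excluded={is_excluded(before)}  "
              f"{after}: excluded={is_excluded(after)}")
```

Under that assumption, the original veth names are excluded, but the renamed 15-character names are not, because they neither start with "veth" nor carry an "@if<N>" suffix.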
For example, for the log line "4ccbb8f4ae0200b: renamed from vethc2f58213", query the API for the node_network_info metric. The device is now reported under its new name, 4ccbb8f4ae0200b:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-
...
          1627302317.131,
          "1"
        ]
      },
      {
        "metric": {
          "__name__": "node_network_info",
          "address": "1a:62:a1:eb:b1:b8",
          "broadcast": "ff:ff:ff:ff:ff:ff",
          "container": "kube-rbac-proxy",
          "device": "4ccbb8f4ae0200b",
          "duplex": "full",
          "endpoint": "https",
          "instance": "juzhao-bm-26kf5-compute-2",
          "job": "node-exporter",
          "namespace": "openshift-monitoring",
          "operstate": "up",
          "pod": "node-exporter-7fm4d",
          "service": "node-exporter"
        },
        "value": [
...

Looking at https://github.com/openshift/ovn-kubernetes/blob/157c0a7215a610ac852224f50f6d7a96f4d7384d/go-controller/pkg/cni/helper_linux.go#L148, CNI renames the host end of the veth pair with the last 15 characters of the container ID. While the "ip addr" command returns interface names like "9f2886d201ca42a@if3", node_exporter reads its information from /sys/class/net, which only contains the ID string:

$ oc exec node-exporter-2bpxg -c node-exporter -t -- ls /sys/class/net | head
0ae02dac897204e
0bb1d13818c455a
0d38482247a03d9
22f75973f5bb373
30238c61402c884
35d7ba8ebd6a817
3e4f2f3a5e1fa95
3f8d77b17598892
40a4a4c43a5d28f
5049a28d2add36e

So instead of the regexp being "^(veth.*|[a-z0-9]+@if\d+)$", it needs to be "^(veth.*|[a-f0-9]{15})$".

Tested with a 4.9.0-0.nightly-2021-08-01-132055 baremetal OVN cluster:

# oc -n openshift-monitoring get ds node-exporter -oyaml
...
    spec:
      containers:
      - args:
        - --web.listen-address=127.0.0.1:9100
        - --path.sysfs=/host/sys
        - --path.rootfs=/host/root
        - --no-collector.wifi
        - --collector.filesystem.ignored-mount-points=^/(dev|proc|sys|var/lib/docker/.+|var/lib/kubelet/pods/.+)($|/)
        - --collector.netclass.ignored-devices=^(veth.*|[a-f0-9]{15})$
        - --collector.netdev.device-exclude=^(veth.*|[a-f0-9]{15})$
        - --collector.cpu.info
        - --collector.textfile.directory=/var/node_exporter/textfile
        - --no-collector.cpufreq

# oc debug node/juzhao-49-6v6v9-compute-0
sh-4.4# chroot /host
sh-4.4# dmesg | grep 'renamed from veth'
[  231.196703] a601f8c41cde05c: renamed from veth528ef52f
[  232.270686] 2a27328546d123b: renamed from veth60feafc9
[  238.338056] 9a1652e2c688302: renamed from veth583356f4
[  238.798171] baceffec99a204f: renamed from vethea526057
[  279.903613] d3c837b7dcf0c2a: renamed from vethd325fc06
[  290.895184] 37b9fa89d3c55f6: renamed from veth5a621dfa
[  529.256565] b432bbcd31abc05: renamed from veth03a4f764
[  529.275102] 6f24e139af20d26: renamed from veth1b4be232
[  546.409958] 4712be32ab6118d: renamed from veth131d215f
[  572.835334] 24dd3b8f6a16b3b: renamed from veth0950dc4c

Checked from the API: there is no node_network_info series for any device that was renamed from 'veth**'. Example:

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=node_network_info' | jq | grep a601f8c41cde05c
no result

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759
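The verification in this report greps the query output for a single device name. As a hedged sketch, the same check can be generalized by scanning every `device` label in a saved node_network_info query response for names that the updated pattern "^(veth.*|[a-f0-9]{15})$" should have excluded. The function name `leaked_devices` and the sample response are hypothetical; the sample only mimics the shape of the Prometheus query API response, and Python's `re` is assumed to match this pattern the same way Go's regexp does.

```python
import re
from typing import List

# Updated exclusion pattern from the node-exporter args in this report.
# Assumption: Python's re matches this pattern the same way Go's regexp does.
NEW_PATTERN = re.compile(r"^(veth.*|[a-f0-9]{15})$")


def leaked_devices(response: dict) -> List[str]:
    """Return device labels that match the exclusion pattern but were still scraped.

    `response` is the parsed JSON body of a query for node_network_info.
    (Hypothetical helper, not part of any OpenShift tooling.)
    """
    results = response.get("data", {}).get("result", [])
    devices = (item.get("metric", {}).get("device", "") for item in results)
    return sorted({d for d in devices if NEW_PATTERN.match(d)})


# Hypothetical sample response, for illustration only. It contains one
# renamed veth interface to show what a failing check would look like.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"__name__": "node_network_info", "device": "eth0"},
             "value": [0, "1"]},
            {"metric": {"__name__": "node_network_info", "device": "a601f8c41cde05c"},
             "value": [0, "1"]},
        ],
    },
}

if __name__ == "__main__":
    print(leaked_devices(sample))
```

An empty list would correspond to the "no result" outcome above. A non-empty list names the interfaces that are still being exported despite the exclusion pattern.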