Bug 2007495 - Large label value for the metric kubelet_started_pods_errors_total with label message when there is an error
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.10.0
Assignee: Sai Ramesh Vanka
QA Contact: Sunil Choudhary
Depends On:
Reported: 2021-09-24 05:05 UTC by Jayapriya Pai
Modified: 2022-03-12 04:38 UTC

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2022-03-12 04:38:27 UTC
Target Upstream Version:


System | ID | Private | Priority | Status | Summary | Last Updated
Github | kubernetes kubernetes issues 105163 | 0 | None | open | Large label value for the metric `kubelet_started_pods_errors_total` with label `message` when there is a error | 2021-09-24 05:05:04 UTC
Github | openshift kubernetes pull 988 | 0 | None | open | Bug 2007495: UPSTREAM: 105213: remove StartedPodsErrorsTotal metrice message | 2021-10-01 10:05:18 UTC
Red Hat Product Errata | RHSA-2022:0056 | 0 | None | None | None | 2022-03-12 04:38:46 UTC

Description Jayapriya Pai 2021-09-24 05:05:04 UTC
Description of problem:

While reviewing labels in Prometheus, I noticed that the metric kubelet_started_pods_errors_total has a very large label value (700+ characters) because the label contains an error log message.


kubelet_started_pods_errors_total{endpoint="https-metrics", instance="", job="kubelet", message="rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-image-registry-operator-6c786c84-svlhd_openshift-image-registry_eeb66cce-0b91-4c7d-870c-bf02234342cd_0(8381f96f82dd16971b009a09a3470a19766e316236e7337f194d5dcd5dc5e540): error adding pod openshift-image-registry_cluster-image-registry-operator-6c786c84-svlhd to CNI network "multus-cni-network": Multus: [openshift-image-registry/cluster-image-registry-operator-6c786c84-svlhd]: error getting pod: Get "https://[api-int.ci-ln-121bc45-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-image-registry/pods/cluster-image-registry-operator-6c786c84-svlhd?timeout=1m0s": dial tcp connect: connection refused", metrics_path="/metrics", namespace="kube-system", node="ci-ln-121bc45-f76d1-qg7xx-master-2", service="kubelet"}

The metric is from https://github.com/kubernetes/kubernetes/blob/v1.22.0-rc.0/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L813
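
To illustrate why a free-form message is a problem as a label value, here is a minimal sketch. It is not kubelet code: the `LabeledCounter` class and the sample error strings are hypothetical stand-ins. It only shows that every distinct `message` value produces a separate time series, while a counter without that label stays at a single series.

```python
from collections import Counter


class LabeledCounter:
    """Toy stand-in for a labeled counter; hypothetical, not a Prometheus client API."""

    def __init__(self, label_names):
        self.label_names = tuple(label_names)
        self.series = Counter()  # one entry per distinct label-value combination

    def inc(self, **labels):
        # The tuple of label values identifies the time series being incremented.
        key = tuple(labels[name] for name in self.label_names)
        self.series[key] += 1


def make_error(pod_suffix):
    # Hypothetical error text: the pod name is embedded, so each pod yields a new string.
    return (
        "rpc error: code = Unknown desc = failed to create pod network sandbox "
        f"for pod example-{pod_suffix}: connection refused"
    )


with_message = LabeledCounter(["message"])
without_message = LabeledCounter([])

for i in range(100):
    with_message.inc(message=make_error(i))
    without_message.inc()

print(len(with_message.series))     # series count when the message is a label
print(len(without_message.series))  # series count when the label is dropped
```

In this sketch the error text still belongs in logs or events; only the count is kept in the metric.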

Version-Release number of selected component (if applicable):

How reproducible:

Query the metric kubelet_started_pods_errors_total in the Prometheus UI or through the API, on a Prometheus instance where the kubelet is one of the targets. If a pod start error has occurred, the message label value will be very large, as in the example above.
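
The API check can be sketched as follows. This is a rough illustration, not a supported tool: the host `prometheus.example:9090`, the 200-character threshold, and the sample response are assumptions made up for the example. The `/api/v1/query` path and the response shape follow the Prometheus HTTP API for instant queries.

```python
from urllib.parse import urlencode


def build_query_url(base_url, metric):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': metric})}"


def find_long_labels(response, max_len=200):
    """Return (label name, length) pairs for label values longer than max_len."""
    hits = []
    for sample in response["data"]["result"]:
        for name, value in sample["metric"].items():
            if len(value) > max_len:
                hits.append((name, len(value)))
    return hits


# Hypothetical response body, shaped like an instant-query result.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "__name__": "kubelet_started_pods_errors_total",
                    "job": "kubelet",
                    "message": "rpc error: " + "x" * 700,
                },
                "value": [1632459904, "1"],
            }
        ],
    },
}

url = build_query_url("http://prometheus.example:9090", "kubelet_started_pods_errors_total")
print(url)
print(find_long_labels(sample_response))
```

Against a real cluster, the JSON body returned by the URL would be passed to `find_long_labels` in place of `sample_response`.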

Steps to Reproduce:

Actual results:

A large label value, like the example pasted above.

Expected results:

It doesn't look like we should have an error or log message as a metric label value.

Additional info:

An upstream issue has been created for this: https://github.com/kubernetes/kubernetes/issues/105163

Creating this bug so that we don't lose track of the issue.

Comment 5 Sunil Choudhary 2021-12-09 10:52:30 UTC
Verified on 4.10.0-0.nightly-2021-12-06-201335.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-06-201335   True        False         3h52m   Cluster version is 4.10.0-0.nightly-2021-12-06-201335

Comment 8 errata-xmlrpc 2022-03-12 04:38:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

