Bug 2007495

Summary: Large value for the message label of the metric kubelet_started_pods_errors_total when an error occurs
Product: OpenShift Container Platform
Reporter: Jayapriya Pai <janantha>
Component: Node
Assignee: Sai Ramesh Vanka <svanka>
Node sub component: Kubelet
QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED ERRATA
Docs Contact:
Severity: medium
Priority: unspecified
CC: aos-bugs, ehashman, harpatil, spasquie
Version: 4.10
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-12 04:38:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Jayapriya Pai 2021-09-24 05:05:04 UTC
Description of problem:

While reviewing labels in Prometheus, I noticed that the metric kubelet_started_pods_errors_total carries a very large label value (700+ characters), because the message label holds an entire error log message.

Example:

kubelet_started_pods_errors_total{endpoint="https-metrics", instance="10.0.0.3:10250", job="kubelet", message="rpc error: code = Unknown desc = failed to create pod network sandbox k8s_cluster-image-registry-operator-6c786c84-svlhd_openshift-image-registry_eeb66cce-0b91-4c7d-870c-bf02234342cd_0(8381f96f82dd16971b009a09a3470a19766e316236e7337f194d5dcd5dc5e540): error adding pod openshift-image-registry_cluster-image-registry-operator-6c786c84-svlhd to CNI network "multus-cni-network": Multus: [openshift-image-registry/cluster-image-registry-operator-6c786c84-svlhd]: error getting pod: Get "https://[api-int.ci-ln-121bc45-f76d1.origin-ci-int-gce.dev.openshift.com]:6443/api/v1/namespaces/openshift-image-registry/pods/cluster-image-registry-operator-6c786c84-svlhd?timeout=1m0s": dial tcp 10.0.0.2:6443: connect: connection refused", metrics_path="/metrics", namespace="kube-system", node="ci-ln-121bc45-f76d1-qg7xx-master-2", service="kubelet"}

The metric is from https://github.com/kubernetes/kubernetes/blob/v1.22.0-rc.0/pkg/kubelet/kuberuntime/kuberuntime_manager.go#L813

Version-Release number of selected component (if applicable):


How reproducible:

Query the metric kubelet_started_pods_errors_total in the Prometheus UI, or through the API, on a cluster where the kubelet is one of the Prometheus targets. If an error has occurred, the message label value will be very large, as in the example above.
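To make the impact concrete, here is a minimal sketch (illustrative only, not kubelet code) of why a free-form error message in a label is harmful: Prometheus stores one time series per distinct label combination, and pod-start error messages are effectively unique (they embed pod UIDs, sandbox IDs, and addresses), so series count grows without bound.

```python
# One time series per distinct label set, modeled as a dict keyed by labels.
started_pods_errors = {}

def inc(message: str) -> None:
    # The message label value becomes part of the series identity.
    key = (("message", message),)
    started_pods_errors[key] = started_pods_errors.get(key, 0) + 1

# Each failing pod tends to produce a unique message (hypothetical text
# modeled on the example in this report), so every error is a new series:
for uid in range(1000):
    inc(f"rpc error: failed to create pod network sandbox k8s_pod_{uid}: "
        "connect: connection refused")

print(len(started_pods_errors))  # 1000 series for a single counter metric
```

A bounded label set (e.g. no label at all, or a small fixed error-class enum) would keep this to a handful of series regardless of how many errors occur.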

Steps to Reproduce:
1.
2.
3.

Actual results:

A large label value, like the example pasted above.

Expected results:

Error/log messages should not be used as metric label values.
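The expected behavior can be sketched as follows (a simplified assumption about the eventual fix, tracked in the upstream issue linked below, not the actual kubelet change): count the error in a single bounded series and leave the full message to the logs.

```python
# One bounded time series: the counter has no per-message label.
started_pods_errors_total = 0
error_log = []  # stand-in for the kubelet's log output

def record_start_error(message: str) -> None:
    global started_pods_errors_total
    started_pods_errors_total += 1   # metrics: just the count
    error_log.append(message)        # logs: where unbounded text belongs

for uid in range(1000):
    record_start_error(f"failed to create sandbox for pod {uid}")

print(started_pods_errors_total)  # 1000 errors, still one series
```

The count stays queryable and alertable, while the diagnostic detail lives in the logs, where high-cardinality text is expected.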

Additional info:

An upstream issue has been created for this: https://github.com/kubernetes/kubernetes/issues/105163

Creating this bug so that we don't lose track of this issue.

Comment 5 Sunil Choudhary 2021-12-09 10:52:30 UTC
Verified on 4.10.0-0.nightly-2021-12-06-201335.

$ oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-12-06-201335   True        False         3h52m   Cluster version is 4.10.0-0.nightly-2021-12-06-201335

Comment 8 errata-xmlrpc 2022-03-12 04:38:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056