Hi, I talked to the OVN team and we agreed that this is not necessarily related to the aforementioned scale BZ. "failed to get pod annotation: timed out waiting for annotations" can have several different causes, and this particular cluster only has 6 nodes, whereas the scale BZ was about 300+ nodes.

Example event from the openshift-image-registry namespace (from the attached must-gather):

~~~
apiVersion: v1
items:
- apiVersion: v1
  count: 5137
  eventTime: null
  firstTimestamp: "2021-12-18T00:07:03Z"
  involvedObject:
    apiVersion: v1
    kind: Pod
    name: image-pruner-27329760-6m6bn
    namespace: openshift-image-registry
    resourceVersion: "10264877"
    uid: c7f02a03-aa39-428b-ad98-262f607818dd
  kind: Event
  lastTimestamp: "2021-12-20T14:30:31Z"
  message: |-
    (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to create pod network sandbox k8s_image-pruner-27329760-6m6bn_openshift-image-registry_c7f02a03-aa39-428b-ad98-262f607818dd_0(4df65342b14239568ba6c396dd909e7257690541a7b76a9d522d585f20a7cc9a): error adding pod openshift-image-registry_image-pruner-27329760-6m6bn to CNI network "multus-cni-network": [openshift-image-registry/image-pruner-27329760-6m6bn:ovn-kubernetes]: error adding container to network "ovn-kubernetes": CNI request failed with status 400: '[openshift-image-registry/image-pruner-27329760-6m6bn 4df65342b14239568ba6c396dd909e7257690541a7b76a9d522d585f20a7cc9a] [openshift-image-registry/image-pruner-27329760-6m6bn 4df65342b14239568ba6c396dd909e7257690541a7b76a9d522d585f20a7cc9a] failed to get pod annotation: timed out waiting for annotations
~~~

The error message is ambiguous, because all it really reveals is that:

* the CNI plugin on the node dialed into the K8s API and looked for the annotation k8s.ovn.org/pod-networks on the pod in question
* the lookup gave up after 30 seconds, meaning the pod had not been annotated within that window.

The error is raised in cmdAdd:

https://github.com/openshift/ovn-kubernetes/blob/6c81aa64359cf47eea0211c4c4a7113a8f8325f6/go-controller/pkg/cni/cni.go#L118

~~~
func (pr *PodRequest) cmdAdd(kubeAuth *KubeAPIAuth, podLister corev1listers.PodLister, useOVSExternalIDs bool, kclient kubernetes.Interface) ([]byte, error) {
	namespace := pr.PodNamespace
	podName := pr.PodName
	if namespace == "" || podName == "" {
		return nil, fmt.Errorf("required CNI variable missing")
	}

	kubecli := &kube.Kube{KClient: kclient}
	annotCondFn := isOvnReady

	if pr.IsSmartNIC {
		// Add Smart-NIC connection-details annotation so ovnkube-node running on smart-NIC
		// performs the needed network plumbing.
		if err := pr.addSmartNICConnectionDetailsAnnot(kubecli); err != nil {
			return nil, err
		}
		annotCondFn = isSmartNICReady
	}

	// Get the IP address and MAC address of the pod
	// for Smart-Nic, ensure connection-details is present
	podUID, annotations, err := GetPodAnnotations(pr.ctx, podLister, kclient, namespace, podName, annotCondFn)
	if err != nil {
		return nil, fmt.Errorf("failed to get pod annotation: %v", err)
	}
	if err := pr.checkOrUpdatePodUID(podUID); err != nil {
		return nil, err
	}
~~~

...
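If this is hit again, one way to narrow it down is to watch the pod for the annotation directly, outside of the CNI. Below is a minimal standalone sketch of the same wait the CNI performs (a hypothetical diagnostic of my own, not part of ovn-kubernetes; the flag names and defaults are assumptions):

~~~
// annotwait: poll a pod for the k8s.ovn.org/pod-networks annotation the same
// way the CNI shim does, to confirm whether ovnkube-master ever wrote it.
package main

import (
	"context"
	"flag"
	"fmt"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

const ovnPodAnnotationName = "k8s.ovn.org/pod-networks"

func main() {
	kubeconfig := flag.String("kubeconfig", os.Getenv("KUBECONFIG"), "path to kubeconfig")
	namespace := flag.String("namespace", "openshift-image-registry", "pod namespace")
	name := flag.String("name", "", "pod name")
	flag.Parse()

	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		fmt.Fprintf(os.Stderr, "failed to load kubeconfig: %v\n", err)
		os.Exit(1)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		fmt.Fprintf(os.Stderr, "failed to build client: %v\n", err)
		os.Exit(1)
	}

	// Same shape as GetPodAnnotations below: 30s overall timeout,
	// 200ms retry interval between GETs.
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel()
	for {
		pod, err := client.CoreV1().Pods(*namespace).Get(ctx, *name, metav1.GetOptions{})
		if err == nil {
			if val, ok := pod.Annotations[ovnPodAnnotationName]; ok {
				fmt.Printf("annotation present: %s\n", val)
				return
			}
			fmt.Println("pod exists but has no pod-networks annotation yet")
		} else {
			fmt.Fprintf(os.Stderr, "get pod: %v\n", err)
		}
		select {
		case <-ctx.Done():
			// This is the condition the CNI reports as the timeout error.
			fmt.Fprintln(os.Stderr, "timed out waiting for annotation")
			os.Exit(1)
		case <-time.After(200 * time.Millisecond):
		}
	}
}
~~~

The ad-hoc equivalent is oc get pod <pod> -n <namespace> -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/pod-networks}'.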
cmdAdd calls GetPodAnnotations, which implements the actual wait and the 30 second timeout:

https://github.com/openshift/ovn-kubernetes/blob/6c81aa64359cf47eea0211c4c4a7113a8f8325f6/go-controller/pkg/cni/utils.go#L78

~~~
// GetPodAnnotations obtains the pod UID and annotation from the cache or apiserver
func GetPodAnnotations(ctx context.Context, podLister corev1listers.PodLister, kclient kubernetes.Interface, namespace, name string, annotCond podAnnotWaitCond) (string, map[string]string, error) {
	var notFoundCount uint

	timeout := time.After(30 * time.Second)
	for {
		select {
		case <-ctx.Done():
			return "", nil, fmt.Errorf("canceled waiting for annotations")
		case <-timeout:
			return "", nil, fmt.Errorf("timed out waiting for annotations")
		default:
			pod, err := getPod(podLister, kclient, namespace, name)
			if err != nil {
				if !apierrors.IsNotFound(err) {
					return "", nil, fmt.Errorf("failed to get pod for annotations: %v", err)
				}
				// Allow up to 1 second for pod to be found
				notFoundCount++
				if notFoundCount >= 5 {
					return "", nil, fmt.Errorf("timed out waiting for pod after 1s: %v", err)
				}
				// drop through to try again
			} else if pod != nil {
				if annotCond(pod.Annotations) {
					return string(pod.UID), pod.Annotations, nil
				}
			}

			// try again later
			time.Sleep(200 * time.Millisecond)
		}
	}
}
~~~

The pod is read either from the informer cache or directly from the API:

https://github.com/openshift/ovn-kubernetes/blob/6c81aa64359cf47eea0211c4c4a7113a8f8325f6/go-controller/pkg/cni/utils.go#L48

~~~
// getPod tries to read a Pod object from the informer cache, or if the pod
// doesn't exist there, the apiserver. If neither a list or a kube client is
// given, returns no pod and no error
func getPod(podLister corev1listers.PodLister, kclient kubernetes.Interface, namespace, name string) (*kapi.Pod, error) {
	var pod *kapi.Pod
	var err error

	if podLister != nil {
		pod, err = podLister.Pods(namespace).Get(name)
		if err != nil && !apierrors.IsNotFound(err) {
			return nil, err
		}
		// drop through
	}
	if kclient != nil {
		// If the pod wasn't in our local cache, ask for it directly
		pod, err = kclient.CoreV1().Pods(namespace).Get(context.TODO(), name, metav1.GetOptions{})
	}

	return pod, err
}
~~~

After that, the passed annotCond function checks whether the expected annotation is present:

https://github.com/openshift/ovn-kubernetes/blob/6c81aa64359cf47eea0211c4c4a7113a8f8325f6/go-controller/pkg/cni/utils.go#L92
https://github.com/openshift/ovn-kubernetes/blob/6c81aa64359cf47eea0211c4c4a7113a8f8325f6/go-controller/pkg/cni/utils.go#L22

~~~
// isOvnReady is a wait condition for OVN master to set pod-networks annotation
func isOvnReady(podAnnotation map[string]string) bool {
	if _, ok := podAnnotation[util.OvnPodAnnotationName]; ok {
		return true
	}
	return false
}
~~~

With OvnPodAnnotationName = "k8s.ovn.org/pod-networks" defined in go-controller/pkg/util/pod_annotation.go.
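For context, ovnkube-master is the component that writes that annotation once it has plumbed the pod's network; on a healthy pod it looks roughly like the following (a sketch from memory of the annotation's shape; the addresses and MAC are illustrative):

~~~
metadata:
  annotations:
    k8s.ovn.org/pod-networks: '{"default":{"ip_addresses":["10.128.2.5/23"],"mac_address":"0a:58:0a:80:02:05","gateway_ips":["10.128.2.1"],"ip_address":"10.128.2.5/23","gateway_ip":"10.128.2.1"}}'
~~~

So if pods in the must-gather never receive this annotation at all, the problem is on the ovnkube-master side (the annotation writer), not in the CNI shim that is merely waiting for it.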
Just a side note in case this turns out to be related to the scale issue after all. The BZ also mentions the following:

https://bugzilla.redhat.com/show_bug.cgi?id=2014332#c5

>> "We are only backporting to 4.8 and 4.7 what is necessary to fix the problem in the <redacted customer name> scenario, which is only 20-25 pods per node instead of 250. I'm not sure if you want to try reducing the number of pods per node (...)"

There's also the next comment:

https://bugzilla.redhat.com/show_bug.cgi?id=2014332#c7

>> "Verified on 4.8 cluster-bot cluster build from openshift ovn-kubernetes pull 798 using the workload from comment 6 on a 120 node AWS cluster. No annotation timeouts were seen. Also regression tested on a small bare metal cluster"

which refers to comment 6:

>> "but we are really only looking to back port to 4.7 the fixes that will get us to <redacted customer name> level scale targets. Right now as per our understanding there won't be more than 50 pods per node (including OCP/infra pods) on the 120 node cluster."