Bug 2103486

Summary: [Assisted-Installer] CVO installation hangs on pending service-ca; as a result, cluster installation fails
Product: OpenShift Container Platform Reporter: Lital Alon <lalon>
Component: Node Assignee: Harshal Patil <harpatil>
Node sub component: CRI-O QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED INSUFFICIENT_DATA Docs Contact:
Severity: unspecified    
Priority: unspecified CC: jchaloup, mfojtik, qiwan, rfreiman, rravaiol, surbania, xxia
Version: 4.10   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-09-13 08:53:20 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
pods (no flags)

Description Lital Alon 2022-07-03 18:06:13 UTC
Created attachment 1894351 [details]
pods

Description of problem:
While installing SNO OCP 4.10.18, the installation failed to complete.
The CVO was hung.
From what we saw in the logs:

Warning  FailedMount  2m47s (x19 over 25m)  kubelet            MountVolume.SetUp failed for volume "serving-cert" : secret "cluster-version-operator-serving-cert" not found

A manual restart of the kubelet was enough to overcome this issue, but for an Assisted Installer cluster it causes the user's installation to fail.
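
For reference, a rough sketch of the checks and the workaround, assuming access to the SNO node (master-0-0 is the node name from this cluster; adjust as needed):

# does the serving-cert secret exist yet?
oc -n openshift-cluster-version get secret cluster-version-operator-serving-cert
# has the service-ca pod been scheduled at all?
oc -n openshift-service-ca get pods -o wide
# workaround applied on the node:
ssh core@master-0-0 sudo systemctl restart kubelet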

Version-Release number of selected component (if applicable):
AI 2.5, Staging, OCP 4.10.18, using libvirt nodes

How reproducible:
20%-30% of attempts, very random

Steps to Reproduce:
1. Install SNO cluster


Actual results:
CVO installation hangs

Expected results:
Installation completes successfully


Additional info:
QE noticed similar failures with multi-node clusters as well

Comment 1 Lital Alon 2022-07-03 18:06:49 UTC
[root@master-0-0 core]# oc describe pod service-ca-7567987bc8-hdnjx -n openshift-service-ca
Name:                 service-ca-7567987bc8-hdnjx
Namespace:            openshift-service-ca
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               app=service-ca
                      pod-template-hash=7567987bc8
                      service-ca=true
Annotations:          openshift.io/scc: restricted
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/service-ca-7567987bc8
Containers:
  service-ca-controller:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b0e194c1cb6babeb2d8326091deb2bba63c430938abe056eb89398c78733eb6
    Port:       8443/TCP
    Host Port:  0/TCP
    Command:
      service-ca-operator
      controller
    Args:
      -v=2
    Requests:
      cpu:        10m
      memory:     120Mi
    Environment:  <none>
    Mounts:
      /var/run/configmaps/signing-cabundle from signing-cabundle (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9wkvc (ro)
      /var/run/secrets/signing-key from signing-key (rw)
Volumes:
  signing-key:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  signing-key
    Optional:    false
  signing-cabundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      signing-cabundle
    Optional:  false
  kube-api-access-9wkvc:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              node-role.kubernetes.io/master=
Tolerations:                 node-role.kubernetes.io/master:NoSchedule op=Exists
                             node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
  Type     Reason                  Age                From          Message
  ----     ------                  ----               ----          -------
  Warning  ErrorAddingLogicalPort  24m (x2 over 25m)  controlplane  failed to ensurePod openshift-service-ca/service-ca-7567987bc8-hdnjx since it is not yet scheduled

Comment 4 Standa Laznicka 2022-07-08 09:56:15 UTC
The resources required by service-ca controllers haven't changed for 3 years - https://github.com/openshift/service-ca-operator/commit/6bcc7a3fb1a45adfa226bd7d3c86df82743b09ca#diff-c0ea315722dd89f0eae8a277ef034d8a6397f9470017f0dd0a371efe62a3fe9a. If your system is running out of memory/CPU, there's nothing we can do about it.

Moving to the scheduler component and adding a needinfo to request proof that there is enough CPU and memory on the node to run the pods of the above-mentioned deployment.
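
For the needinfo, something along these lines should be enough to show allocatable versus requested resources on the node (the node name is illustrative; `oc adm top node` assumes cluster metrics are available):

oc describe node master-0-0 | grep -A 10 "Allocated resources"
oc adm top node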

Comment 5 ravig 2022-07-08 13:34:39 UTC
The warning `ErrorAddingLogicalPort  24m (x2 over 25m)  controlplane  failed to` makes me wonder whether the underlying network is borked. Was the must-gather collected after restarting the kubelet? I don't see any references to service-ca-7567987bc8-hdnjx in the attached kubelet.log. Assigning it to the networking team. If you think that is not the case, feel free to assign it back.
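
If this reproduces again, it would help to capture the data before restarting the kubelet; roughly (a sketch, exact invocations may vary by release):

oc adm must-gather
# kubelet journal from the node, prior to any restart
oc adm node-logs master-0-0 -u kubelet > kubelet.log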

Comment 6 Lital Alon 2022-07-13 00:31:14 UTC
We didn't take the logs after the restart, and I don't have such a cluster reproduced at the moment.

Comment 7 Riccardo Ravaioli 2022-07-13 16:17:35 UTC
Hi,

From the networking point of view, the error reported above shows that ovn-k couldn't possibly create a port for this pod on any node, since the pod itself hadn't been scheduled onto a node for 24 minutes:

Events:
  Type     Reason                  Age                From          Message
  ----     ------                  ----               ----          -------
  Warning  ErrorAddingLogicalPort  24m (x2 over 25m)  controlplane  failed to ensurePod openshift-service-ca/service-ca-7567987bc8-hdnjx since it is not yet scheduled

So the real question is: why wasn't the pod scheduled for such a long time?
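
A few checks that could answer that while a cluster is still in this state (the pod name is taken from above; the scheduler pod name is a placeholder):

# events recorded against the pending pod
oc -n openshift-service-ca get events --field-selector involvedObject.name=service-ca-7567987bc8-hdnjx
# is the scheduler itself healthy on the SNO node?
oc -n openshift-kube-scheduler get pods
oc -n openshift-kube-scheduler logs <scheduler-pod> --tail=200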

Setting the component to "kube-apiserver" for further investigation (there are four components that include "api"; this one seemed the most appropriate at first glance). Please feel free to reassign back to networking if you think there's more evidence pointing at the cluster network.