Bug 2103486
| Summary: | [Assisted-Installer] CVO installation hang on pending service-ca, as a result cluster installation fails | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Lital Alon <lalon> |
| Component: | Node | Assignee: | Harshal Patil <harpatil> |
| Node sub component: | CRI-O | QA Contact: | Sunil Choudhary <schoudha> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | jchaloup, mfojtik, qiwan, rfreiman, rravaiol, surbania, xxia |
| Version: | 4.10 | | |
| Target Milestone: | --- | Target Release: | --- |
| Hardware: | Unspecified | OS: | Unspecified |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-09-13 08:53:20 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | | | |
Description (Lital Alon, 2022-07-03 18:06:13 UTC)
[root@master-0-0 core]# oc describe pod service-ca-7567987bc8-hdnjx -n openshift-service-ca
Name: service-ca-7567987bc8-hdnjx
Namespace: openshift-service-ca
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: <none>
Labels: app=service-ca
pod-template-hash=7567987bc8
service-ca=true
Annotations: openshift.io/scc: restricted
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/service-ca-7567987bc8
Containers:
service-ca-controller:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8b0e194c1cb6babeb2d8326091deb2bba63c430938abe056eb89398c78733eb6
Port: 8443/TCP
Host Port: 0/TCP
Command:
service-ca-operator
controller
Args:
-v=2
Requests:
cpu: 10m
memory: 120Mi
Environment: <none>
Mounts:
/var/run/configmaps/signing-cabundle from signing-cabundle (rw)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-9wkvc (ro)
/var/run/secrets/signing-key from signing-key (rw)
Volumes:
signing-key:
Type: Secret (a volume populated by a Secret)
SecretName: signing-key
Optional: false
signing-cabundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: signing-cabundle
Optional: false
kube-api-access-9wkvc:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: node-role.kubernetes.io/master=
Tolerations: node-role.kubernetes.io/master:NoSchedule op=Exists
node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 120s
node.kubernetes.io/unreachable:NoExecute op=Exists for 120s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning ErrorAddingLogicalPort 24m (x2 over 25m) controlplane failed to ensurePod openshift-service-ca/service-ca-7567987bc8-hdnjx since it is not yet scheduled
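For reference (not part of the original report), one way to inspect the scheduling history of the stuck pod is to list the events recorded for it. The commands below are a minimal sketch assuming a working kubeconfig on the host; the pod name is taken from the describe output above:

```shell
# Not from the original report: list all events for the pending pod,
# looking for FailedScheduling or other scheduler messages.
oc get events -n openshift-service-ca \
  --field-selector involvedObject.name=service-ca-7567987bc8-hdnjx \
  --sort-by=.lastTimestamp

# Confirm the pod is still unscheduled (the NODE column should show <none>).
oc get pod service-ca-7567987bc8-hdnjx -n openshift-service-ca -o wide
```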
The resources required by the service-ca controllers haven't changed in 3 years - https://github.com/openshift/service-ca-operator/commit/6bcc7a3fb1a45adfa226bd7d3c86df82743b09ca#diff-c0ea315722dd89f0eae8a277ef034d8a6397f9470017f0dd0a371efe62a3fe9a. If your system is running out of memory/CPU, there's nothing we can do about it. Moving to the scheduler component and adding a needinfo asking for proof that there is enough CPU and memory on the node to run the pods of the above-mentioned deployment.

The warning `ErrorAddingLogicalPort 24m (x2 over 25m) controlplane failed to` makes me wonder whether the underlying network is borked. Was the must-gather collected after restarting the kubelet? I don't see any references to service-ca-7567987bc8-hdnjx in the attached kubelet.log. Assigning it to the networking team. If you think that is not the case, feel free to assign it back.

We haven't taken the logs after the restart, and I don't have such a cluster reproduced at the moment.

Hi, from the networking point of view, the error reported above shows that ovn-k couldn't possibly create a port for this pod on any given node, since the pod itself hadn't been scheduled on a node for 24 minutes:

Events:
Type     Reason                  Age                From          Message
----     ------                  ----               ----          -------
Warning  ErrorAddingLogicalPort  24m (x2 over 25m)  controlplane  failed to ensurePod openshift-service-ca/service-ca-7567987bc8-hdnjx since it is not yet scheduled

So the real question is: why wasn't the pod scheduled for such a long time? Setting the component to "kube-apiserver" for further investigation (there are four components that include "api"; this seemed the most appropriate at first glance). Please feel free to reassign back to networking if you think there's more evidence pointing at the cluster network.
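A minimal sketch of checks that could answer the two open questions above: whether the master nodes have enough allocatable CPU/memory for the pod's 10m/120Mi request, and whether the kube-scheduler itself was healthy. These commands are not from the original report and assume the cluster API server is reachable:

```shell
# Not from the original report: hypothetical follow-up checks.

# Is the kube-scheduler cluster operator reporting healthy, and are its pods running?
oc get clusteroperator kube-scheduler
oc get pods -n openshift-kube-scheduler

# Do the master nodes have allocatable CPU/memory left for a 10m/120Mi request?
# The "Allocated resources" section compares requests/limits against allocatable.
oc describe nodes -l node-role.kubernetes.io/master= | grep -A 10 "Allocated resources"
```

If the "Allocated resources" section showed the masters close to exhausted, that would support the resource theory raised above; if the scheduler pods were missing or crash-looping, that would explain why the pod was never bound to a node.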