Bug 2105973
| Summary: | Day-1 Networking - Static IPs on ipv6 Deployment Failure | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Adina Wolff <awolff> |
| Component: | Bare Metal Hardware Provisioning | Assignee: | Iury Gregory Melo Ferreira <imelofer> |
| Bare Metal Hardware Provisioning sub component: | ironic | QA Contact: | Amit Ugol <augol> |
| Status: | CLOSED DEFERRED | Docs Contact: | |
| Severity: | unspecified | | |
| Priority: | unspecified | CC: | rpittau, shardy, yporagpa |
| Version: | 4.11 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2023-03-09 01:24:07 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
I just retested with dhcp: false in the networkConfig section. In this case the deployment failed earlier, with the issue described in bz2050296.

Another update: this occurs with "dhcp: false" in install-config when there is no DHCP server running in the environment.

Another deployment that was just run succeeded in creating one of the two workers.

must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather.local.5391045284423863836.tar.gz

[kni@provisionhost-0-0 ~]$ oc logs machine-approver-c468d47c8-8sfkx -n openshift-cluster-machine-approver -c machine-approver-controller
......
.......
E0717 11:45:43.643158       1 csr_check.go:257] csr-hnglk: failed to find machine for node worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com, cannot approve
I0717 11:45:43.643201       1 controller.go:233] csr-hnglk: CSR not authorized

[kni@provisionhost-0-0 ~]$ oc get machines -A
NAMESPACE               NAME                                      PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-cluster-0-wnkz6-master-0         Running                              3h48m
openshift-machine-api   ocp-edge-cluster-0-wnkz6-master-1         Running                              3h48m
openshift-machine-api   ocp-edge-cluster-0-wnkz6-master-2         Running                              3h48m
openshift-machine-api   ocp-edge-cluster-0-wnkz6-worker-0-8hnst   Provisioned                          3h30m
openshift-machine-api   ocp-edge-cluster-0-wnkz6-worker-0-blqqw   Running                              3h30m

[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATE                    CONSUMER                                  ONLINE   ERROR   AGE
openshift-machine-api   openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-wnkz6-master-0         true             3h48m
openshift-machine-api   openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-wnkz6-master-1         true             3h48m
openshift-machine-api   openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-wnkz6-master-2         true             3h48m
openshift-machine-api   openshift-worker-0-0   provisioned              ocp-edge-cluster-0-wnkz6-worker-0-8hnst   true             3h48m
openshift-machine-api   openshift-worker-0-1   provisioned              ocp-edge-cluster-0-wnkz6-worker-0-blqqw   true             3h49m
[kni@provisionhost-0-0 ~]$

I believe this was fixed in a recent release. @awolff can you try to reproduce with a recent release?

@rpittau Happy to. What version should have the fix?

@awolff please try with 4.11.0-0.nightly-2022-08-04-081314, it's the latest accepted nightly, thanks!

I was testing this instead of Adina, who has been moved to a different project.
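For reference, a possible interim workaround is to list the refused/pending CSRs and approve them by hand; this does not fix the underlying name mismatch, and the CSR name below is simply the one from the log above:

oc get csr
oc adm certificate approve csr-hnglk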
I added the following networkConfig section to all of the nodes:
networkConfig:
  routes:
    config:
    - destination: ::/0
      next-hop-address: fd2e:6f44:5dd8::1
      next-hop-interface: enp0s4
  dns-resolver:
    config:
      server:
      - fd2e:6f44:5dd8::1
  interfaces:
  - name: enp0s4
    type: ethernet
    state: up
    ipv6:
      address:
      - ip: fd2e:6f44:5dd8::3[2-6]
        prefix-length: 64
      enabled: true
      dhcp: false
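A quick way to confirm whether the static address and default route were actually applied on a booted node (a sketch, assuming SSH access as the core user and this cluster's host naming):

ssh core@worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com "ip -6 addr show dev enp0s4; ip -6 route show default"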
[kni@provisionhost-0-0 ~]$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.11.0-0.nightly-2022-08-04-081314
built from commit 37684309bcb598757c99d3ea9fbc0758343d64a5
release image registry.ci.openshift.org/ocp/release@sha256:f8193229643849346f6f90107dcd415d7f74969c51aaa31953d998693cfaae88
release architecture amd64
TL;DR: the installation failed, with no apparent change in behavior.
…..
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, monitoring
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-08-04-081314: 800 of 802 done (99% complete)
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available
INFO Cluster operator baremetal Disabled is False with :
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-685f7b67f7-pxd54" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.)
INFO Cluster operator insights SCAAvailable is False with NonHTTPError: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: cluster authorization token is not configured
INFO Cluster operator insights ClusterTransferAvailable is False with Disconnected: failed to pull cluster transfer: cluster authorization token is not configured
INFO Cluster operator insights Disabled is True with Disabled: Health reporting is disabled
INFO Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack.
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: Failed to rollout the stack. Error: updating prometheus operator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas
INFO Cluster operator network ManagementStateDegraded is False with :
ERROR Cluster initialization failed because one or more operators are not functioning properly.
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below,
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation
ERROR failed to initialize the cluster: Cluster operator monitoring is not available
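As the installer output suggests, the wait can be resumed later with the wait-for subcommand once the blocked operators have been investigated; a sketch (the install directory path is an assumption for this environment):

./openshift-baremetal-install wait-for install-complete --dir ~/clusterconfigs --log-level debug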
Further details:
[kni@provisionhost-0-0 ~]$ oc get co
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
…..
ingress 4.11.0-0.nightly-2022-08-04-081314 True False True 59m The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-685f7b67f7-pxd54" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.)
……
monitoring False True True 61m Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
…….
[kni@provisionhost-0-0 ~]$ oc get pods -A | grep -vE "Run|Complete|market"
NAMESPACE NAME READY STATUS RESTARTS AGE
openshift-ingress router-default-685f7b67f7-pxd54 0/1 Pending 0 66m
openshift-monitoring prometheus-operator-admission-webhook-6db58c58f7-lb7x9 0/1 Pending 0 67m
[kni@provisionhost-0-0 ~]$ oc describe pods prometheus-operator-admission-webhook-6db58c58f7-lb7x9 -n openshift-monitoring
Name: prometheus-operator-admission-webhook-6db58c58f7-lb7x9
Namespace: openshift-monitoring
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: <none>
Labels: app.kubernetes.io/managed-by=cluster-monitoring-operator
app.kubernetes.io/name=prometheus-operator-admission-webhook
app.kubernetes.io/part-of=openshift-monitoring
app.kubernetes.io/version=0.57.0
pod-template-hash=6db58c58f7
Annotations: kubectl.kubernetes.io/default-container: prometheus-operator-admission-webhook
openshift.io/scc: restricted-v2
seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/prometheus-operator-admission-webhook-6db58c58f7
Containers:
prometheus-operator-admission-webhook:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a67755cf6419877dce688f5bcd92da4929dcd86603219eb08628b44a29c4257
Port: 8443/TCP
Host Port: 0/TCP
Args:
--web.enable-tls=true
--web.tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
--web.tls-min-version=VersionTLS12
--web.cert-file=/etc/tls/private/tls.crt
--web.key-file=/etc/tls/private/tls.key
Requests:
cpu: 5m
memory: 30Mi
Liveness: http-get https://:https/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get https://:https/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Environment: <none>
Mounts:
/etc/tls/private from tls-certificates (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
tls-certificates:
Type: Secret (a volume populated by a Secret)
SecretName: prometheus-operator-admission-webhook-tls
Optional: false
QoS Class: Burstable
Node-Selectors: <none>
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 66m default-scheduler 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Warning FailedScheduling 69m default-scheduler 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Warning FailedScheduling 55m (x2 over 60m) default-scheduler 0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Warning FailedScheduling 108s (x26 over 53m) default-scheduler 0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
[kni@provisionhost-0-0 ~]$ oc describe pods router-default-685f7b67f7-pxd54 -n openshift-ingress
Name: router-default-685f7b67f7-pxd54
Namespace: openshift-ingress
Priority: 2000000000
Priority Class Name: system-cluster-critical
Node: <none>
Labels: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
ingresscontroller.operator.openshift.io/hash=7b64bc7c5d
pod-template-hash=685f7b67f7
Annotations: openshift.io/scc: hostnetwork
unsupported.do-not-use.openshift.io/override-liveness-grace-period-seconds: 10
Status: Pending
IP:
IPs: <none>
Controlled By: ReplicaSet/router-default-685f7b67f7
Containers:
router:
Image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:84512d9eed4c6b018b34205655932f0b347bd4562500be8b01b73f46af36f574
Ports: 80/TCP, 443/TCP, 1936/TCP
Host Ports: 80/TCP, 443/TCP, 1936/TCP
Requests:
cpu: 100m
memory: 256Mi
Liveness: http-get http://localhost:1936/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
Readiness: http-get http://localhost:1936/healthz/ready delay=0s timeout=1s period=10s #success=1 #failure=3
Startup: http-get http://localhost:1936/healthz/ready delay=0s timeout=1s period=1s #success=1 #failure=120
Environment:
DEFAULT_CERTIFICATE_DIR: /etc/pki/tls/private
DEFAULT_DESTINATION_CA_PATH: /var/run/configmaps/service-ca/service-ca.crt
RELOAD_INTERVAL: 5s
ROUTER_ALLOW_WILDCARD_ROUTES: false
ROUTER_CANONICAL_HOSTNAME: router-default.apps.ocp-edge-cluster-0.qe.lab.redhat.com
ROUTER_CIPHERS: ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
ROUTER_CIPHERSUITES: TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
ROUTER_DISABLE_HTTP2: true
ROUTER_DISABLE_NAMESPACE_OWNERSHIP_CHECK: false
ROUTER_DOMAIN: apps.ocp-edge-cluster-0.qe.lab.redhat.com
ROUTER_IP_V4_V6_MODE: v6
ROUTER_LOAD_BALANCE_ALGORITHM: random
ROUTER_METRICS_TLS_CERT_FILE: /etc/pki/tls/metrics-certs/tls.crt
ROUTER_METRICS_TLS_KEY_FILE: /etc/pki/tls/metrics-certs/tls.key
ROUTER_METRICS_TYPE: haproxy
ROUTER_SERVICE_HTTPS_PORT: 443
ROUTER_SERVICE_HTTP_PORT: 80
ROUTER_SERVICE_NAME: default
ROUTER_SERVICE_NAMESPACE: openshift-ingress
ROUTER_SET_FORWARDED_HEADERS: append
ROUTER_TCP_BALANCE_SCHEME: source
ROUTER_THREADS: 4
SSL_MIN_VERSION: TLSv1.2
STATS_PASSWORD_FILE: /var/lib/haproxy/conf/metrics-auth/statsPassword
STATS_PORT: 1936
STATS_USERNAME_FILE: /var/lib/haproxy/conf/metrics-auth/statsUsername
Mounts:
/etc/pki/tls/metrics-certs from metrics-certs (ro)
/etc/pki/tls/private from default-certificate (ro)
/var/lib/haproxy/conf/metrics-auth from stats-auth (ro)
/var/run/configmaps/service-ca from service-ca-bundle (ro)
/var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v56sr (ro)
Conditions:
Type Status
PodScheduled False
Volumes:
default-certificate:
Type: Secret (a volume populated by a Secret)
SecretName: router-certs-default
Optional: false
service-ca-bundle:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: service-ca-bundle
Optional: false
stats-auth:
Type: Secret (a volume populated by a Secret)
SecretName: router-stats-default
Optional: false
metrics-certs:
Type: Secret (a volume populated by a Secret)
SecretName: router-metrics-certs-default
Optional: false
kube-api-access-v56sr:
Type: Projected (a volume that contains injected data from multiple sources)
TokenExpirationSeconds: 3607
ConfigMapName: kube-root-ca.crt
ConfigMapOptional: <nil>
DownwardAPI: true
ConfigMapName: openshift-service-ca.crt
ConfigMapOptional: <nil>
QoS Class: Burstable
Node-Selectors: kubernetes.io/os=linux
node-role.kubernetes.io/worker=
Tolerations: node.kubernetes.io/memory-pressure:NoSchedule op=Exists
node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 67m default-scheduler 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Warning FailedScheduling 69m default-scheduler 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Warning FailedScheduling 56m (x2 over 61m) default-scheduler 0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
Warning FailedScheduling 2m32s (x25 over 53m) default-scheduler 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 Preemption is not helpful for scheduling.
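Both pods stay Pending because no worker node ever becomes schedulable. Illustrative commands to confirm this from the cluster (output not captured here):

oc get nodes -o wide
oc get machines -n openshift-machine-api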
Must gather available in:
http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ2105973_ipv6_installation_mustgather_100822.tar.gz
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9381
Description of problem:
When deploying OCP using an IPv6 static IP configuration on the baremetal network, worker nodes fail to create. The cluster-machine-approver doesn't approve the CSR because it expects the name openshift-worker-0-0, whereas the actual node name is worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com (this name is picked up by a reverse DNS lookup; a reverse-lookup sketch follows at the end of this description). A similar configuration succeeds when run on IPv4.

Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-07-08-231743

How reproducible:
2/2

Steps to Reproduce:
Deploy OCP with a networkConfig section for each of the nodes, similar to the following:

networkConfig:
  routes:
    config:
    - destination: ::/0
      next-hop-address: fd2e:6f44:5dd8::1
      next-hop-interface: enp0s4
  dns-resolver:
    config:
      server:
      - fd2e:6f44:5dd8::1
  interfaces:
  - name: enp0s4
    type: ethernet
    state: up
    ipv6:
      address:
      - ip: fd2e:6f44:5dd8::56
        prefix-length: 64
      enabled: true

Actual results:
Deployment fails because worker nodes aren't created.

Expected results:
Deployment succeeds.

Additional info:
Must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather.local.240973992141636803.tar.gz
install-config.yaml: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/install-config.static.ipv6.yaml
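To illustrate the naming mismatch: with static addressing the node derives its hostname from a reverse DNS lookup of its IP address. A sketch of that lookup for the address used in the example above (the exact address-to-host mapping in the lab DNS is an assumption):

dig -x fd2e:6f44:5dd8::56 +short

Per the description, the lookup returns worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com rather than the BareMetalHost name openshift-worker-0-0, which is why cluster-machine-approver cannot match the CSR to a Machine.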