Bug 2105973

Summary: Day-1 Networking - Static IPs on ipv6 Deployment Failure
Product: OpenShift Container Platform
Component: Bare Metal Hardware Provisioning
Sub component: ironic
Reporter: Adina Wolff <awolff>
Assignee: Iury Gregory Melo Ferreira <imelofer>
QA Contact: Amit Ugol <augol>
Status: CLOSED DEFERRED
Severity: unspecified
Priority: unspecified
CC: rpittau, shardy, yporagpa
Version: 4.11
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Doc Type: If docs needed, set a value
Type: Bug
Regression: ---
Last Closed: 2023-03-09 01:24:07 UTC

Description Adina Wolff 2022-07-11 11:35:45 UTC
Description of problem:
When deploying OCP using an IPv6 static IP configuration on the baremetal network, the worker nodes fail to be created.
The cluster-machine-approver does not approve the CSR because it expects the node name openshift-worker-0-0, whereas the actual node name is worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com (picked up by reverse DNS lookup).

A similar configuration succeeds when run with IPv4.
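
A quick way to inspect this mismatch is to compare the node name requested in the pending CSR against the Machine and BareMetalHost names (standard oc commands; <csr-name> is a placeholder):

# List pending CSRs, then inspect one to see the node name it requests (Subject CN)
oc get csr
oc describe csr <csr-name>
# Compare against the Machine and BareMetalHost objects the approver is trying to match
oc get machines -n openshift-machine-api
oc get bmh -n openshift-machine-api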


Version-Release number of selected component (if applicable):
4.11.0-0.nightly-2022-07-08-231743

How reproducible: 2/2


Steps to Reproduce:
Deploy OCP with a networkConfig section for each of the nodes, similar to the following:

        networkConfig:
          routes:
            config:
            - destination: ::/0
              next-hop-address: fd2e:6f44:5dd8::1
              next-hop-interface: enp0s4
          dns-resolver:
            config:
              server:
              - fd2e:6f44:5dd8::1
          interfaces:
          - name: enp0s4
            type: ethernet
            state: up
            ipv6:
              address:
              - ip: fd2e:6f44:5dd8::56
                prefix-length: 64
              enabled: true
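
For context, this networkConfig block is nested under each host entry in install-config.yaml (platform.baremetal.hosts[].networkConfig in the baremetal IPI schema). A quick sanity check that every host entry carries it, assuming yq v4 is available:

# Print the networkConfig block of each baremetal host entry (yq v4 syntax assumed)
yq '.platform.baremetal.hosts[].networkConfig' install-config.yaml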

Actual results:
The deployment fails because the worker nodes are not created.

Expected results:
Deployment succeeds


Additional info:
Must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather.local.240973992141636803.tar.gz

install-config.yaml: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/install-config.static.ipv6.yaml

Comment 1 Adina Wolff 2022-07-11 12:38:27 UTC
I just retested with dhcp: false in the networkConfig section. 
In this case, the deployment failed earlier, with the issue described in bz2050296.

Comment 2 Adina Wolff 2022-07-14 16:25:31 UTC
Another update: this also occurs with "dhcp: false" in the install-config when there is no DHCP server running in the environment.

Comment 3 Adina Wolff 2022-07-17 12:09:24 UTC
Another deployment that was just run succeeded in creating one of the two workers.
must-gather: http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/must-gather.local.5391045284423863836.tar.gz

[kni@provisionhost-0-0 ~]$ oc logs machine-approver-c468d47c8-8sfkx -n openshift-cluster-machine-approver -c machine-approver-controller
......
.......
E0717 11:45:43.643158       1 csr_check.go:257] csr-hnglk: failed to find machine for node worker-0-0.ocp-edge-cluster-0.qe.lab.redhat.com, cannot approve
I0717 11:45:43.643201       1 controller.go:233] csr-hnglk: CSR not authorized
[kni@provisionhost-0-0 ~]$ oc get machines -A
NAMESPACE               NAME                                      PHASE         TYPE   REGION   ZONE   AGE
openshift-machine-api   ocp-edge-cluster-0-wnkz6-master-0         Running                              3h48m
openshift-machine-api   ocp-edge-cluster-0-wnkz6-master-1         Running                              3h48m
openshift-machine-api   ocp-edge-cluster-0-wnkz6-master-2         Running                              3h48m
openshift-machine-api   ocp-edge-cluster-0-wnkz6-worker-0-8hnst   Provisioned                          3h30m
openshift-machine-api   ocp-edge-cluster-0-wnkz6-worker-0-blqqw   Running                              3h30m
[kni@provisionhost-0-0 ~]$ oc get bmh -A
NAMESPACE               NAME                   STATE                    CONSUMER                                  ONLINE   ERROR   AGE
openshift-machine-api   openshift-master-0-0   externally provisioned   ocp-edge-cluster-0-wnkz6-master-0         true             3h48m
openshift-machine-api   openshift-master-0-1   externally provisioned   ocp-edge-cluster-0-wnkz6-master-1         true             3h48m
openshift-machine-api   openshift-master-0-2   externally provisioned   ocp-edge-cluster-0-wnkz6-master-2         true             3h48m
openshift-machine-api   openshift-worker-0-0   provisioned              ocp-edge-cluster-0-wnkz6-worker-0-8hnst   true             3h48m
openshift-machine-api   openshift-worker-0-1   provisioned              ocp-edge-cluster-0-wnkz6-worker-0-blqqw   true             3h49m
[kni@provisionhost-0-0 ~]$
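
A possible manual workaround is to approve the pending CSRs by hand (standard oc usage; this only forces the approval and does not address the underlying name mismatch):

# Approve every CSR currently in Pending, then watch whether the Provisioned machine moves to Running
oc get csr | grep Pending | awk '{print $1}' | xargs oc adm certificate approve
oc get machines -n openshift-machine-api -w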

Comment 4 Riccardo Pittau 2022-08-09 11:38:56 UTC
I believe this was fixed in a recent release.
@awolff can you try to reproduce with a recent release?

Comment 5 Adina Wolff 2022-08-09 16:30:04 UTC
@rpittau Happy to. What version should have the fix?

Comment 6 Riccardo Pittau 2022-08-10 07:59:40 UTC
@awolff please try with 4.11.0-0.nightly-2022-08-04-081314
it's the latest accepted nightly
thanks!

Comment 7 Yoav Porag 2022-08-10 11:20:25 UTC
I tested this in place of Adina, who has moved to a different project.

I added the following networkConfig section to all nodes:

        networkConfig: 
          routes:
            config:
            - destination: ::/0
              next-hop-address: fd2e:6f44:5dd8::1
              next-hop-interface: enp0s4
          dns-resolver:
            config:
              server:
              - fd2e:6f44:5dd8::1
          interfaces:
          - name: enp0s4
            type: ethernet
            state: up
            ipv6:
              address:
              - ip: fd2e:6f44:5dd8::3[2-6] 
                prefix-length: 64
              enabled: true
              dhcp: false
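
One way to confirm the static address was actually applied on a node that did boot (node name is hypothetical; standard oc debug and nmcli usage):

# Show the IPv6 addresses configured on enp0s4 from the host's NetworkManager
oc debug node/<node-name> -- chroot /host nmcli -g IP6.ADDRESS device show enp0s4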

[kni@provisionhost-0-0 ~]$ ./openshift-baremetal-install version
./openshift-baremetal-install 4.11.0-0.nightly-2022-08-04-081314
built from commit 37684309bcb598757c99d3ea9fbc0758343d64a5
release image registry.ci.openshift.org/ocp/release@sha256:f8193229643849346f6f90107dcd415d7f74969c51aaa31953d998693cfaae88
release architecture amd64


TL;DR - installation failed; no apparent change in behavior.



…..
DEBUG Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, monitoring 
DEBUG Still waiting for the cluster to initialize: Working towards 4.11.0-0.nightly-2022-08-04-081314: 800 of 802 done (99% complete) 
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available 
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available 
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available 
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available 
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available 
DEBUG Still waiting for the cluster to initialize: Cluster operator monitoring is not available 
INFO Cluster operator baremetal Disabled is False with :  
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerAvailable is True with AsExpected: Trusted CA Bundle Controller works as expected 
INFO Cluster operator cloud-controller-manager TrustedCABundleControllerControllerDegraded is False with AsExpected: Trusted CA Bundle Controller works as expected 
INFO Cluster operator cloud-controller-manager CloudConfigControllerAvailable is True with AsExpected: Cloud Config Controller works as expected 
INFO Cluster operator cloud-controller-manager CloudConfigControllerDegraded is False with AsExpected: Cloud Config Controller works as expected 
INFO Cluster operator etcd RecentBackup is Unknown with ControllerStarted: The etcd backup controller is starting, and will decide if recent backups are available or if a backup is required 
ERROR Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-685f7b67f7-pxd54" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.) 
INFO Cluster operator insights SCAAvailable is False with NonHTTPError: Failed to pull SCA certs from https://api.openshift.com/api/accounts_mgmt/v1/certificates: cluster authorization token is not configured 
INFO Cluster operator insights ClusterTransferAvailable is False with Disconnected: failed to pull cluster transfer: cluster authorization token is not configured 
INFO Cluster operator insights Disabled is True with Disabled: Health reporting is disabled 
INFO Cluster operator monitoring Available is False with UpdatingPrometheusOperatorFailed: Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error. 
INFO Cluster operator monitoring Progressing is True with RollOutInProgress: Rolling out the stack. 
ERROR Cluster operator monitoring Degraded is True with UpdatingPrometheusOperatorFailed: Failed to rollout the stack. Error: updating prometheus operator: reconciling Prometheus Operator Admission Webhook Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/prometheus-operator-admission-webhook: got 1 unavailable replicas 
INFO Cluster operator network ManagementStateDegraded is False with :  
ERROR Cluster initialization failed because one or more operators are not functioning properly. 
ERROR The cluster should be accessible for troubleshooting as detailed in the documentation linked below, 
ERROR https://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html 
ERROR The 'wait-for install-complete' subcommand can then be used to continue the installation 
ERROR failed to initialize the cluster: Cluster operator monitoring is not available 
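
For reference, the follow-up command mentioned in the errors above would be (with <install-dir> as a placeholder for the installation directory):

./openshift-baremetal-install wait-for install-complete --dir <install-dir> --log-level debug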


Further details:

[kni@provisionhost-0-0 ~]$ oc get co
NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
…..
ingress                                    4.11.0-0.nightly-2022-08-04-081314   True        False         True       59m     The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-685f7b67f7-pxd54" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 Preemption is not helpful for scheduling. Make sure you have sufficient worker nodes.)
……
monitoring                                                                      False       True          True       61m     Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
…….



[kni@provisionhost-0-0 ~]$ oc get pods -A | grep -vE "Run|Complete|market"
NAMESPACE                                          NAME                                                                             READY   STATUS             RESTARTS      AGE
openshift-ingress                                  router-default-685f7b67f7-pxd54                                                  0/1     Pending            0             66m
openshift-monitoring                               prometheus-operator-admission-webhook-6db58c58f7-lb7x9                           0/1     Pending            0             67m


[kni@provisionhost-0-0 ~]$ oc describe pods prometheus-operator-admission-webhook-6db58c58f7-lb7x9 -n openshift-monitoring
Name:                 prometheus-operator-admission-webhook-6db58c58f7-lb7x9
Namespace:            openshift-monitoring
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               app.kubernetes.io/managed-by=cluster-monitoring-operator
                      app.kubernetes.io/name=prometheus-operator-admission-webhook
                      app.kubernetes.io/part-of=openshift-monitoring
                      app.kubernetes.io/version=0.57.0
                      pod-template-hash=6db58c58f7
Annotations:          kubectl.kubernetes.io/default-container: prometheus-operator-admission-webhook
                      openshift.io/scc: restricted-v2
                      seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/prometheus-operator-admission-webhook-6db58c58f7
Containers:
  prometheus-operator-admission-webhook:
    Image:      quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9a67755cf6419877dce688f5bcd92da4929dcd86603219eb08628b44a29c4257
    Port:       8443/TCP
    Host Port:  0/TCP
    Args:
      --web.enable-tls=true
      --web.tls-cipher-suites=TLS_ECDHE_ECDSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256,TLS_ECDHE_ECDSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384,TLS_ECDHE_ECDSA_WITH_CHACHA20_POLY1305_SHA256,TLS_ECDHE_RSA_WITH_CHACHA20_POLY1305_SHA256
      --web.tls-min-version=VersionTLS12
      --web.cert-file=/etc/tls/private/tls.crt
      --web.key-file=/etc/tls/private/tls.key
    Requests:
      cpu:        5m
      memory:     30Mi
    Liveness:     http-get https://:https/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:    http-get https://:https/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /etc/tls/private from tls-certificates (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  tls-certificates:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  prometheus-operator-admission-webhook-tls
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  66m                  default-scheduler  0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  69m                  default-scheduler  0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  55m (x2 over 60m)    default-scheduler  0/3 nodes are available: 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  108s (x26 over 53m)  default-scheduler  0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 1 node(s) didn't match pod anti-affinity rules, 3 Preemption is not helpful for scheduling.
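
The scheduling failures above are consistent with no schedulable worker nodes having joined the cluster; a quick way to confirm (standard commands):

# Lists nodes carrying the worker role label; an empty result means no workers registered
oc get nodes -l node-role.kubernetes.io/worker
# Any CSRs stuck in Pending point back to the approval issue described in the earlier comments
oc get csr | grep Pending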




[kni@provisionhost-0-0 ~]$ oc describe pods router-default-685f7b67f7-pxd54 -n openshift-ingress
Name:                 router-default-685f7b67f7-pxd54
Namespace:            openshift-ingress
Priority:             2000000000
Priority Class Name:  system-cluster-critical
Node:                 <none>
Labels:               ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
                      ingresscontroller.operator.openshift.io/hash=7b64bc7c5d
                      pod-template-hash=685f7b67f7
Annotations:          openshift.io/scc: hostnetwork
                      unsupported.do-not-use.openshift.io/override-liveness-grace-period-seconds: 10
Status:               Pending
IP:                   
IPs:                  <none>
Controlled By:        ReplicaSet/router-default-685f7b67f7
Containers:
  router:
    Image:       quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:84512d9eed4c6b018b34205655932f0b347bd4562500be8b01b73f46af36f574
    Ports:       80/TCP, 443/TCP, 1936/TCP
    Host Ports:  80/TCP, 443/TCP, 1936/TCP
    Requests:
      cpu:      100m
      memory:   256Mi
    Liveness:   http-get http://localhost:1936/healthz delay=0s timeout=1s period=10s #success=1 #failure=3
    Readiness:  http-get http://localhost:1936/healthz/ready delay=0s timeout=1s period=10s #success=1 #failure=3
    Startup:    http-get http://localhost:1936/healthz/ready delay=0s timeout=1s period=1s #success=1 #failure=120
    Environment:
      DEFAULT_CERTIFICATE_DIR:                   /etc/pki/tls/private
      DEFAULT_DESTINATION_CA_PATH:               /var/run/configmaps/service-ca/service-ca.crt
      RELOAD_INTERVAL:                           5s
      ROUTER_ALLOW_WILDCARD_ROUTES:              false
      ROUTER_CANONICAL_HOSTNAME:                 router-default.apps.ocp-edge-cluster-0.qe.lab.redhat.com
      ROUTER_CIPHERS:                            ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384:ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305:DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384
      ROUTER_CIPHERSUITES:                       TLS_AES_128_GCM_SHA256:TLS_AES_256_GCM_SHA384:TLS_CHACHA20_POLY1305_SHA256
      ROUTER_DISABLE_HTTP2:                      true
      ROUTER_DISABLE_NAMESPACE_OWNERSHIP_CHECK:  false
      ROUTER_DOMAIN:                             apps.ocp-edge-cluster-0.qe.lab.redhat.com
      ROUTER_IP_V4_V6_MODE:                      v6
      ROUTER_LOAD_BALANCE_ALGORITHM:             random
      ROUTER_METRICS_TLS_CERT_FILE:              /etc/pki/tls/metrics-certs/tls.crt
      ROUTER_METRICS_TLS_KEY_FILE:               /etc/pki/tls/metrics-certs/tls.key
      ROUTER_METRICS_TYPE:                       haproxy
      ROUTER_SERVICE_HTTPS_PORT:                 443
      ROUTER_SERVICE_HTTP_PORT:                  80
      ROUTER_SERVICE_NAME:                       default
      ROUTER_SERVICE_NAMESPACE:                  openshift-ingress
      ROUTER_SET_FORWARDED_HEADERS:              append
      ROUTER_TCP_BALANCE_SCHEME:                 source
      ROUTER_THREADS:                            4
      SSL_MIN_VERSION:                           TLSv1.2
      STATS_PASSWORD_FILE:                       /var/lib/haproxy/conf/metrics-auth/statsPassword
      STATS_PORT:                                1936
      STATS_USERNAME_FILE:                       /var/lib/haproxy/conf/metrics-auth/statsUsername
    Mounts:
      /etc/pki/tls/metrics-certs from metrics-certs (ro)
      /etc/pki/tls/private from default-certificate (ro)
      /var/lib/haproxy/conf/metrics-auth from stats-auth (ro)
      /var/run/configmaps/service-ca from service-ca-bundle (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-v56sr (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  default-certificate:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  router-certs-default
    Optional:    false
  service-ca-bundle:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      service-ca-bundle
    Optional:  false
  stats-auth:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  router-stats-default
    Optional:    false
  metrics-certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  router-metrics-certs-default
    Optional:    false
  kube-api-access-v56sr:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              kubernetes.io/os=linux
                             node-role.kubernetes.io/worker=
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason            Age                   From               Message
  ----     ------            ----                  ----               -------
  Warning  FailedScheduling  67m                   default-scheduler  0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  69m                   default-scheduler  0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  56m (x2 over 61m)     default-scheduler  0/3 nodes are available: 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/3 nodes are available: 3 Preemption is not helpful for scheduling.
  Warning  FailedScheduling  2m32s (x25 over 53m)  default-scheduler  0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 node(s) didn't match Pod's node affinity/selector, 3 node(s) had untolerated taint {node-role.kubernetes.io/master: }. preemption: 0/4 nodes are available: 1 node(s) didn't have free ports for the requested pod ports, 3 Preemption is not helpful for scheduling.


Must-gather available at:
http://rhos-compute-node-10.lab.eng.rdu2.redhat.com/logs/BZ2105973_ipv6_installation_mustgather_100822.tar.gz

Comment 9 Shiftzilla 2023-03-09 01:24:07 UTC
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira.

https://issues.redhat.com/browse/OCPBUGS-9381