Description of problem:

Trying out IPI on RHOSP with a normal 3 masters + 3 workers installation works fine. Setting the worker replicas to 0 in install-config.yaml leads to a failure during the last step, at around 97%.

Log file:
>time="2021-04-27T20:17:51Z" level=debug msg="Still waiting for the cluster to initialize: Working towards 4.7.7: 650 of 668 done (97% complete)"
>time="2021-04-27T20:20:21Z" level=debug msg="Still waiting for the cluster to initialize: Some cluster operators are still updating: authentication, console"
>time="2021-04-27T20:23:38Z" level=error msg="Cluster operator authentication Degraded is True with OAuthRouteCheckEndpointAccessibleController_SyncError: OAuthRouteCheckEndpointAccessibleControllerDegraded: Get \"https://oauth-openshift.apps.ocp4.ros.space.corp/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
>time="2021-04-27T20:23:38Z" level=info msg="Cluster operator authentication Progressing is True with OAuthVersionRoute_WaitingForRoute: OAuthVersionRouteProgressing: Request to \"https://oauth-openshift.apps.ocp4.ros.space.corp/healthz\" not successful yet"
>time="2021-04-27T20:23:38Z" level=info msg="Cluster operator authentication Available is False with OAuthRouteCheckEndpointAccessibleController_EndpointUnavailable::OAuthVersionRoute_RequestFailed: OAuthVersionRouteAvailable: HTTP request to \"https://oauth-openshift.apps.ocp4.ros.space.corp/healthz\" failed: dial tcp 10.0.0.7:443: i/o timeout\nOAuthRouteCheckEndpointAccessibleControllerAvailable: Get \"https://oauth-openshift.apps.ocp4.ros.space.corp/healthz\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
>time="2021-04-27T20:23:38Z" level=info msg="Cluster operator baremetal Disabled is True with UnsupportedPlatform: Nothing to do on this Platform"
>time="2021-04-27T20:23:38Z" level=error msg="Cluster operator console Degraded is True with RouteHealth_FailedGet: RouteHealthDegraded: failed to GET route (https://console-openshift-console.apps.ocp4.ros.space.corp/health): Get \"https://console-openshift-console.apps.ocp4.ros.space.corp/health\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"
>time="2021-04-27T20:23:38Z" level=info msg="Cluster operator console Progressing is True with SyncLoopRefresh_InProgress: SyncLoopRefreshProgressing: Working toward version 4.7.7"
>time="2021-04-27T20:23:38Z" level=info msg="Cluster operator console Available is False with Deployment_InsufficientReplicas: DeploymentAvailable: 0 pods available for console deployment"
>time="2021-04-27T20:23:38Z" level=info msg="Cluster operator insights Disabled is False with AsExpected: "
>time="2021-04-27T20:23:38Z" level=info msg="Cluster operator network ManagementStateDegraded is False with : "
>time="2021-04-27T20:23:38Z" level=error msg="Cluster initialization failed because one or more operators are not functioning properly.\nThe cluster should be accessible for troubleshooting as detailed in the documentation linked below,\nhttps://docs.openshift.com/container-platform/latest/support/troubleshooting/troubleshooting-installations.html\nThe 'wait-for install-complete' subcommand can then be used to continue the installation"
>time="2021-04-27T20:23:38Z" level=fatal msg="failed to initialize the cluster: Some cluster operators are still updating: authentication, console"

After diffing the 3+3 and the 3+0 installations it became clear that they are identical and do not show any real differences. The openshift-ingress-operator looks good, so I rsh'd into the pod with

  oc --kubeconfig ocp4-config/auth/kubeconfig -n openshift-ingress-operator rsh pod/ingress-operator-54d886557b-f2gg8

and from there issued

  curl -kv https://canary-openshift-ingress-canary.apps.ocp4.ros.space.corp

which succeeded, while the same curl from the outside using the FIP stalls.
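The inside-vs-outside comparison above can be sketched as a small script. It only prints the two commands so they can be reviewed first; the pod hash and apps domain are the ones from this report and would need adjusting for another cluster.

```shell
# Sketch of the connectivity check described above (commands are echoed,
# not executed). Pod name and domain come from this report; adjust them.
KCFG=ocp4-config/auth/kubeconfig
POD=ingress-operator-54d886557b-f2gg8
URL=https://canary-openshift-ingress-canary.apps.ocp4.ros.space.corp

# From inside the ingress-operator pod (succeeds):
echo "oc --kubeconfig $KCFG -n openshift-ingress-operator rsh pod/$POD curl -kv $URL"
# From the outside via the FIP (stalls until the worker SG is on the masters):
echo "curl -kv --connect-timeout 10 $URL"
```

Running the second command with `--connect-timeout` makes the stall fail fast instead of hanging indefinitely.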
That led to looking into the security groups, which revealed that the masters have only the master SG attached, while in a compact master-only setup they need the worker SG as well. Adding it to the 3 masters during deployment solves the problem instantly and the installation succeeds.

Version-Release number of selected component (if applicable):
OCP 4.6 and 4.7, with or without Kuryr, on RHOSP 16.1

How reproducible:
always

Steps to Reproduce:
1. Create an install-config on RHOSP
2. Set the worker replicas to 0
3. Install the OCP cluster on RHOSP
4. See it fail at around 97% with oauth and console

Actual results:
Installation fails.

Expected results:
Installation succeeds.

Additional info:
The quick fix is to manually add the worker SG to the masters during installation:

BASE=ocp4-xvkg5
SG=$BASE-worker
for server in master-{0..2}; do
  openstack server add security group $BASE-$server $SG
done
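A slightly more cautious variant of the workaround above is sketched below. It only generates the `openstack server add security group` commands so they can be reviewed before being applied; the infrastructure ID `ocp4-xvkg5` is the one from this report and must match your cluster.

```shell
# Sketch: emit the workaround commands for review instead of running them
# blindly. "ocp4-xvkg5" is the cluster infrastructure ID from this report.
sg_fix_commands() {
  base="$1"
  sg="$base-worker"
  for i in 0 1 2; do
    echo "openstack server add security group $base-master-$i $sg"
  done
}

sg_fix_commands ocp4-xvkg5
# sg_fix_commands ocp4-xvkg5 | sh   # uncomment to actually apply
```

Piping the output to `sh` applies the fix exactly as in the loop above.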
I could reproduce the issue fine. I tried setting the master nodes as schedulable, to let OCP deploy worker pods onto them, but it didn't seem to work for me. I'll keep looking. Since there is an easy workaround (adding the security group), I'll set the bug to LOW. Please let me know if you think otherwise.
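For reference, the scheduler setting mentioned here lives on the cluster-scoped Scheduler config object and can be flipped with `oc patch`. This is a sketch that prints the command rather than executing it, and assumes an admin kubeconfig is in use:

```shell
# Sketch: make control-plane nodes schedulable by setting
# spec.mastersSchedulable on the cluster Scheduler config object.
# The command is echoed for review; drop the 'echo' to apply it.
patch='{"spec":{"mastersSchedulable":true}}'
echo "oc patch schedulers.config.openshift.io cluster --type merge -p '$patch'"
```

On newer installers this is set automatically when worker replicas is 0 (see the "Making control-plane schedulable" warning in the verification log below).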
This would be incredibly useful for Service Telemetry Framework testing, which uses a 3-node reference architecture for OCP. I'd like to request this be bumped in priority, maybe to medium. Our intent would be to use this for testing the Shift on Stack scenario, providing OpenStack monitoring without the need for an external cluster. CC: @pkilambi @swilber
Verified in OCP 4.10.0-0.nightly-2021-10-16-173656 on top of OSP RHOS-16.1-RHEL-8-20210916.n.0.

Verification steps:

1) Installation of OCP with 3 masters and with 0 workers finished successfully:
>$ openshift-install create cluster --dir ostest/
>INFO Credentials loaded from file "/home/stack/clouds.yaml"
>WARNING Making control-plane schedulable by setting MastersSchedulable to true for Scheduler cluster settings
>INFO Consuming Install Config from target directory
>INFO Obtaining RHCOS image file from 'https://releases-art-rhcos.svc.ci.openshift.org/art/storage/releases/rhcos-4.9/49.84.202107010027-0/x86_64/rhcos-49.84.202107010027-0-openstack.x86_64.qcow2.gz?sha256=00cb56c8711686255744646394e22a8ca5f27e059016f6758f14388e5a0a14cb'
>INFO The file was found in cache: /home/stack/.cache/openshift-installer/image_cache/rhcos-49.84.202107010027-0-openstack.x86_64.qcow2. Reusing...
>WARNING Following quotas Subnet, SecurityGroup, Port, Network, SecurityGroupRule are available but will be completely used pretty soon.
>INFO Creating infrastructure resources...
>INFO Waiting up to 20m0s for the Kubernetes API at https://api.ostest.shiftstack.com:6443...
>INFO API v1.22.1+9312243 up
>INFO Waiting up to 30m0s for bootstrapping to complete...
>INFO Destroying the bootstrap resources...
>INFO Waiting up to 40m0s for the cluster at https://api.ostest.shiftstack.com:6443 to initialize...
>INFO Waiting up to 10m0s for the openshift-console route to be created...
>INFO Install complete!
>INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/stack/ostest/auth/kubeconfig'
>INFO Access the OpenShift web-console here: https://console-openshift-console.apps.ostest.shiftstack.com
>INFO Login to the console with user: "kubeadmin", and password: "5Ph7w-6NQQC-kNeu8-AZNny"
>INFO Time elapsed: 23m56s

2) Make sure the OCP cluster is operational:
>$ oc get machineset -A
>NAMESPACE               NAME                    DESIRED   CURRENT   READY   AVAILABLE   AGE
>openshift-machine-api   ostest-sccdp-worker-0   0         0                             120m
>$ oc get machines -A
>NAMESPACE               NAME                    PHASE     TYPE        REGION      ZONE   AGE
>openshift-machine-api   ostest-sccdp-master-0   Running   m4.xlarge   regionOne   nova   120m
>openshift-machine-api   ostest-sccdp-master-1   Running   m4.xlarge   regionOne   nova   120m
>openshift-machine-api   ostest-sccdp-master-2   Running   m4.xlarge   regionOne   nova   120m
>$ oc get nodes
>NAME                    STATUS   ROLES           AGE    VERSION
>ostest-sccdp-master-0   Ready    master,worker   119m   v1.22.1+9312243
>ostest-sccdp-master-1   Ready    master,worker   119m   v1.22.1+9312243
>ostest-sccdp-master-2   Ready    master,worker   119m   v1.22.1+9312243
>$ openstack server list
>+--------------------------------------+-----------------------+--------+-------------------------------------+--------------------+--------+
>| ID                                   | Name                  | Status | Networks                            | Image              | Flavor |
>+--------------------------------------+-----------------------+--------+-------------------------------------+--------------------+--------+
>| fa32d58d-4d3d-45f5-aadc-3fc5774860a6 | ostest-sccdp-master-2 | ACTIVE | ostest-sccdp-openshift=10.196.0.111 | ostest-sccdp-rhcos |        |
>| c2194eb3-7d23-4655-b899-991b887b8eea | ostest-sccdp-master-1 | ACTIVE | ostest-sccdp-openshift=10.196.1.213 | ostest-sccdp-rhcos |        |
>| 9835f5c0-2441-4238-a8e1-41c1f6877723 | ostest-sccdp-master-0 | ACTIVE | ostest-sccdp-openshift=10.196.2.171 | ostest-sccdp-rhcos |        |
>+--------------------------------------+-----------------------+--------+-------------------------------------+--------------------+--------+
>$ oc get clusteroperators
>NAME                                       VERSION                              AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
>authentication                             4.10.0-0.nightly-2021-10-16-173656   True        False         False      103m
>baremetal                                  4.10.0-0.nightly-2021-10-16-173656   True        False         False      110m
>cloud-controller-manager                   4.10.0-0.nightly-2021-10-16-173656   True        False         False      119m
>cloud-credential                           4.10.0-0.nightly-2021-10-16-173656   True        False         False      120m
>cluster-autoscaler                         4.10.0-0.nightly-2021-10-16-173656   True        False         False      113m
>config-operator                            4.10.0-0.nightly-2021-10-16-173656   True        False         False      117m
>console                                    4.10.0-0.nightly-2021-10-16-173656   True        False         False      107m
>csi-snapshot-controller                    4.10.0-0.nightly-2021-10-16-173656   True        False         False      116m
>dns                                        4.10.0-0.nightly-2021-10-16-173656   True        False         False      113m
>etcd                                       4.10.0-0.nightly-2021-10-16-173656   True        False         False      115m
>image-registry                             4.10.0-0.nightly-2021-10-16-173656   True        False         False      111m
>ingress                                    4.10.0-0.nightly-2021-10-16-173656   True        False         False      110m
>insights                                   4.10.0-0.nightly-2021-10-16-173656   True        False         False      110m
>kube-apiserver                             4.10.0-0.nightly-2021-10-16-173656   True        False         False      113m
>kube-controller-manager                    4.10.0-0.nightly-2021-10-16-173656   True        False         False      114m
>kube-scheduler                             4.10.0-0.nightly-2021-10-16-173656   True        False         False      114m
>kube-storage-version-migrator              4.10.0-0.nightly-2021-10-16-173656   True        False         False      117m
>machine-api                                4.10.0-0.nightly-2021-10-16-173656   True        False         False      110m
>machine-approver                           4.10.0-0.nightly-2021-10-16-173656   True        False         False      116m
>machine-config                             4.10.0-0.nightly-2021-10-16-173656   True        False         False      115m
>marketplace                                4.10.0-0.nightly-2021-10-16-173656   True        False         False      116m
>monitoring                                 4.10.0-0.nightly-2021-10-16-173656   True        False         False      108m
>network                                    4.10.0-0.nightly-2021-10-16-173656   True        False         False      118m
>node-tuning                                4.10.0-0.nightly-2021-10-16-173656   True        False         False      116m
>openshift-apiserver                        4.10.0-0.nightly-2021-10-16-173656   True        False         False      112m
>openshift-controller-manager               4.10.0-0.nightly-2021-10-16-173656   True        False         False      109m
>openshift-samples                          4.10.0-0.nightly-2021-10-16-173656   True        False         False      110m
>operator-lifecycle-manager                 4.10.0-0.nightly-2021-10-16-173656   True        False         False      116m
>operator-lifecycle-manager-catalog         4.10.0-0.nightly-2021-10-16-173656   True        False         False      116m
>operator-lifecycle-manager-packageserver   4.10.0-0.nightly-2021-10-16-173656   True        False         False      111m
>service-ca                                 4.10.0-0.nightly-2021-10-16-173656   True        False         False      117m
>storage                                    4.10.0-0.nightly-2021-10-16-173656   True        False         False      114m

3) Create a new project with three pods. The pods are running on the master nodes:
>$ oc get pods -n demo -o wide
>NAME                    READY   STATUS    RESTARTS   AGE   IP               NODE                    NOMINATED NODE   READINESS GATES
>demo-7897db69cc-9zhh4   1/1     Running   0          18h   10.128.130.201   ostest-sccdp-master-1   <none>           <none>
>demo-7897db69cc-fmrdc   1/1     Running   0          18h   10.128.131.216   ostest-sccdp-master-2   <none>           <none>
>demo-7897db69cc-mk22w   1/1     Running   0          18h   10.128.130.46    ostest-sccdp-master-0   <none>           <none>

4) Creating two workers. Changed the replica value from 0 to 2. The two instances and the clusteroperators are up and running.
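Step 4 above (going from 0 to 2 workers) can be done by scaling the existing MachineSet. A sketch using the MachineSet name from this verification run; the command is echoed for review rather than executed:

```shell
# Sketch: scale the worker MachineSet from 0 to 2 replicas.
# "ostest-sccdp-worker-0" is the MachineSet name from this verification run.
MS=ostest-sccdp-worker-0
echo "oc scale machineset $MS -n openshift-machine-api --replicas=2"
# Drop the 'echo' to apply; the new Machines then show up in 'oc get machines -A'.
```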
(In reply to Itay Matza from comment #10)
> Verified in OCP 4.10.0-0.nightly-2021-10-16-173656 on top of OSP RHOS-16.1-RHEL-8-20210916.n.0.
> [...]

^ Verified with Kuryr network type.
Removing the Triaged keyword because:
* the QE automation assessment (flag qe_test_coverage) is missing
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days