Bug 2054914 - OpenShift 4.6 provision failures on GCP with ocp/stable-4.6
Summary: OpenShift 4.6 provision failures on GCP with ocp/stable-4.6
Status: CLOSED DUPLICATE of bug 2022840
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.6
Hardware: x86_64
OS: Linux
Target Milestone: ---
Assignee: aos-install
QA Contact: Gaoyun Pei
Duplicates: 2054916 (view as bug list)
Depends On:
Reported: 2022-02-16 01:39 UTC by Shane Bostick
Modified: 2022-02-16 18:44 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2022-02-16 18:44:45 UTC
Target Upstream Version:

Attachments
archive of must-gather and openshift-install.log output (15.67 MB, application/gzip)
2022-02-16 01:39 UTC, Shane Bostick
.openshift_install.log (175.66 KB, text/plain)
2022-02-16 01:40 UTC, Shane Bostick
output from oc adm must-gather (1.13 MB, text/plain)
2022-02-16 01:43 UTC, Shane Bostick

Description Shane Bostick 2022-02-16 01:39:43 UTC
Created attachment 1861361 [details]
archive of must-gather and openshift-install.log output


$ openshift-install version

openshift-install 4.6.48
built from commit 1cfb1b32f5aaf0dfe0fb2ea9da41c710da9b2c76
release image quay.io/openshift-release-dev/ocp-release@sha256:6f03d6ced979d6f6fd10b6a54529c186e3f83c0ecf3e2b910d01505d2f59037a

(from https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable-4.6/openshift-install-linux.tar.gz)



Please specify:

apiVersion: v1
baseDomain: ${BASE_DOMAIN}
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    gcp:
      type: ${WORKER_NODE_TYPE}
  replicas: ${WORKER_NODE_COUNT}
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    gcp:
      type: ${MASTER_NODE_TYPE}
  replicas: ${MASTER_NODE_COUNT}
metadata:
  creationTimestamp: null
  name: ${CLUSTER_NAME}
networking:
  clusterNetwork:
  - cidr:
    hostPrefix: 23
  machineNetwork:
  - cidr:
  networkType: OVNKubernetes
platform:
  gcp:
    projectID: ${PROJECT}
    region: ${REGION}
publish: External
pullSecret: |
sshKey: |

What happened?

Bootstrap and master nodes come up, but no worker nodes appear.
The ingress component never reaches ready.
The installer binary has not changed, and neither has the way we invoke it.
We suspect a possible change to the GCP API.


What did you expect to happen?

Cluster provisioning to complete on GCP.
This was working but recently started failing.

How to reproduce it (as minimally and precisely as possible)?

This is part of ACS testing...
Those are probably private but this is essentially what we do:
create() {
    if [ -n "${USER_PULL_SECRET-}" ]; then
        echo "The pull secret was overridden with a user supplied value."
        PULL_SECRET="${USER_PULL_SECRET}"
        export PULL_SECRET
    fi

    echo ">>> Generating an SSH key pair."
    yes | ssh-keygen -t rsa -f /data/id_rsa -C '' -N ''
    chmod 0600 /data/id_rsa /data/id_rsa.pub
    read -r SSH_KEY < /data/id_rsa.pub
    export SSH_KEY

    echo ">>> Creating cluster install config."
    envsubst < /cluster-create/install-config.yaml > /data/install-config.yaml

    echo ">>> Creating the cluster."
    cd /data
    if ! openshift-install create cluster --log-level=debug > /dev/null 2>&1; then
        echo ">>> ERROR: The create failed."
        echo "/data/.openshift_install.log:"
        # Redact the kubeadmin password before echoing the install log.
        sed '/Login to the console with user/s/password.*/password from file/' < /data/.openshift_install.log
        exit 1
    fi
    echo ">>> Cluster created."
    export KUBECONFIG=/data/auth/kubeconfig

    # Pull the console credentials out of the install log.
    OPENSHIFT_CONSOLE_LOGIN_STR=$(grep -Eo 'Login to the console with user.*"' .openshift_install.log \
                | tail -1 | sed -e 's/\\"/"/g' | sed -e 's/"$//')
    OPENSHIFT_CONSOLE_USERNAME=$(perl -lne '/user: "(\w+)"/ and print $1' <<<"$OPENSHIFT_CONSOLE_LOGIN_STR")
    OPENSHIFT_CONSOLE_PASSWORD=$(perl -lne '/password: "([\w-]+)"/ and print $1' <<<"$OPENSHIFT_CONSOLE_LOGIN_STR")

    echo "$OPENSHIFT_CONSOLE_URL" > /data/url
    cat > /data/dotenv <<EOF
EOF
    # (dotenv contents elided)

    echo ">>> Test cluster & kubeconfig"
    oc get nodes -o wide

    echo ">>> Deploy a bastion pod for SSH access"
    curl https://raw.githubusercontent.com/eparis/ssh-bastion/master/deploy/deploy.sh | bash

    echo ">>> Give the user some SSH help"
    cluster_name_prefix=$(cut -b1-21 <<<"$CLUSTER_NAME")
    instances_table=$(gcloud compute instances list --project="$PROJECT" --filter="$gcp_instances_filter" | sort)
    ssh_commands=$(gcloud compute instances list --project="$PROJECT" --filter="$gcp_instances_filter" \
        --format json | jq -r '.[].name' | awk '{ printf "./data/ssh.sh %s\n", $1 }' | sort -k2)
    export instances_table ssh_commands PROJECT gcp_instances_filter
    envsubst < /cluster-create/SSH_ACCESS.md > /data/SSH_ACCESS.md
    cp /usr/bin/ssh-via-bastion.sh /data/ssh.sh
}
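
The password-redaction sed used in create() can be exercised on its own; a minimal sketch, where the sample log line below is invented for illustration (the real line lives in .openshift_install.log):

```shell
# Sample line shaped like the installer's console-login log message (invented).
line='level=info msg="Login to the console with user: \"kubeadmin\", and password: \"abcde-12345\""'
# Same redaction expression as in create(): drop everything from "password" onward.
redacted=$(printf '%s\n' "$line" | sed '/Login to the console with user/s/password.*/password from file/')
echo "$redacted"
```

Lines that do not mention the console login pass through untouched, so the rest of the install log is preserved verbatim.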

Anything else we need to know?

Tracking initial investigation through resolution on the ACS side here:

This is an important platform for our existing ACS customers.
(i.e. openshift ocp/stable-4.6 running in GCP)

Comment 1 Shane Bostick 2022-02-16 01:40:42 UTC
Created attachment 1861362 [details]
.openshift_install.log

Comment 2 Shane Bostick 2022-02-16 01:43:48 UTC
Created attachment 1861363 [details]
output from oc adm must-gather

I think this is the relevant part:
When opening a support case, bugzilla, or issue please include the following summary data along with any other requested information.
ClusterID: 46dc02c8-d18a-41a3-8628-f8883e61562f
ClusterVersion: Installing "4.6.48" for 58 minutes: Unable to apply 4.6.48: some cluster operators have not yet rolled out
	clusteroperator/authentication is not available (ReadyIngressNodesAvailable: Authentication requires functional ingress which requires at least one schedulable and ready node. Got 0 worker nodes, 3 master nodes, 0 custom target nodes (none are schedulable or ready for ingress pods).
OAuthServiceEndpointsCheckEndpointAccessibleControllerAvailable: Failed to get oauth-openshift enpoints
OAuthServiceCheckEndpointAccessibleControllerAvailable: Get "": dial tcp connect: connection refused
WellKnownAvailable: The well-known endpoint is not yet available: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)) because OAuthServiceEndpointsCheckEndpointAccessibleControllerDegraded: oauth service endpoints are not ready
OAuthServiceCheckEndpointAccessibleControllerDegraded: Get "": dial tcp connect: connection refused
IngressStateEndpointsDegraded: No subsets found for the endpoints of oauth-server
OAuthRouteCheckEndpointAccessibleControllerDegraded: route status does not have host address
RouteDegraded: Route is not available at canonical host oauth-openshift.apps.rox-9228-010.openshift.infra.rox.systems: route status ingress is empty
WellKnownReadyControllerDegraded: failed to get oauth metadata from openshift-config-managed/oauth-openshift ConfigMap: configmap "oauth-openshift" not found (check authentication operator, it is supposed to create this)
OAuthVersionDeploymentDegraded: Unable to get OAuth server deployment: deployment.apps "oauth-openshift" not found
OAuthServerDeploymentDegraded: deployments.apps "oauth-openshift" not found
OAuthServerRouteDegraded: Route is not available at canonical host oauth-openshift.apps.rox-9228-010.openshift.infra.rox.systems: route status ingress is empty
	clusteroperator/cloud-credential is degraded because 1 of 3 credentials requests are failing to sync.
	clusteroperator/console is not available () because 
	clusteroperator/image-registry is not available (Available: The deployment does not have available replicas
ImagePrunerAvailable: Pruner CronJob has been created) because 
	clusteroperator/ingress is not available (Not all ingress controllers are available.) because Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-75b77f56bf-kbht5" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match node selector. Pod "router-default-75b77f56bf-r2dsk" cannot be scheduled: 0/3 nodes are available: 3 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentAvailable=False (DeploymentUnavailable: The deployment has Available status condition set to False (reason: MinimumReplicasUnavailable) with message: Deployment does not have minimum availability.), DeploymentReplicasMinAvailable=False (DeploymentMinimumReplicasNotMet: 0/2 of replicas are available, max unavailable is 1)
	clusteroperator/kube-storage-version-migrator is not available (Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available) because 
	clusteroperator/monitoring is not available () because Failed to rollout the stack. Error: running task Updating Alertmanager failed: waiting for Alertmanager Route to become ready failed: waiting for route openshift-monitoring/alertmanager-main: no status available


Comment 3 Scott Dodson 2022-02-16 03:30:35 UTC
*** Bug 2054916 has been marked as a duplicate of this bug. ***

Comment 4 Scott Dodson 2022-02-16 03:31:48 UTC
This is likely https://bugzilla.redhat.com/show_bug.cgi?id=2022840, which was fixed in 4.6.51. Can you please test with that or a newer 4.6.z and, if the problem is resolved, mark this as a dupe of that bug?

Comment 5 Marcin Owsiany 2022-02-16 06:03:37 UTC
Scott, I'm not familiar with the internals of the installer, but I think there might be two issues:
1. workers are not being spun up - might indeed be https://bugzilla.redhat.com/show_bug.cgi?id=2021731 (which seems to be the parent bug for 2022840); we'll give .51 a try
2. the installer not halting at the bootstrap phase, but instead moving on and deleting the bootstrap node, which makes it hard or impossible for the end user to investigate why the worker nodes never appeared

Comment 6 Scott Dodson 2022-02-16 13:56:30 UTC
Bootstrapping is complete as soon as there's a long-lived API server and the required manifests have been created in that cluster. This cluster had successfully completed bootstrapping, so it was appropriate to tear down the bootstrap host. This problem is fully debuggable via the captured must-gather.

You can see from the machine controller logs that the machine controller is complaining about missing cloud credentials when attempting to create workers.

2022-02-15T21:59:47.069337767Z E0215 21:59:47.069276       1 controller.go:104] controllers/MachineSet "msg"="Failed to reconcile MachineSet" "error"="error getting credentials secret \"gcp-cloud-credentials\" in namespace \"openshift-machine-api\": Secret \"gcp-cloud-credentials\" not found" "machineset"="rox-9228-010-gvts2-worker-d" "namespace"="openshift-machine-api" 

See must-gather.local.8006661806462056237/quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-f62746c191ba69678f2961cd30607475915eb14e16648ec95d3fead5c6620632/namespaces/openshift-machine-api/pods/machine-api-controllers-774fcbc4c-fmfjf/machine-controller/machine-controller/logs/current.log
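
The missing credential named in that error can be confirmed on a live cluster with `oc -n openshift-machine-api get secret gcp-cloud-credentials`. As a sketch, the secret and namespace can also be pulled mechanically out of a log line shaped like the one above (escapes simplified for illustration):

```shell
# Log line shaped like the machine-controller error quoted above.
err='error getting credentials secret "gcp-cloud-credentials" in namespace "openshift-machine-api": Secret "gcp-cloud-credentials" not found'
# Extract the first quoted name after "credentials secret" and after "in namespace".
secret=$(printf '%s\n' "$err" | sed -n 's/.*credentials secret "\([^"]*\)".*/\1/p')
ns=$(printf '%s\n' "$err" | sed -n 's/.*in namespace "\([^"]*\)".*/\1/p')
echo "missing secret: $secret (namespace: $ns)"
```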

Comment 7 Shane Bostick 2022-02-16 17:24:50 UTC
Verified "ocp/4.6.54" works. Testing once more.
I think the "ocp/stable-4.6" channel needs an update to include the fixes in the range "4.6.48..4.6.54".

Comment 8 Scott Dodson 2022-02-16 18:44:45 UTC
The stable channel will not be updated to include this version, as it does not receive updates that occur after the EUS transition. You'll want to switch to the eus-4.6 channel via whatever means you're using to pick installation versions. Reach out to me on Slack if you need any additional help with that.
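
For setups like the reporter's, which fetch the installer tarball from the mirror, the channel switch can be as small as changing one path segment; a sketch, assuming the eus-4.6 directory mirrors the stable-4.6 layout shown in the description:

```shell
# stable-4.6 URL from the bug description; eus-4.6 path assumed to follow the same layout.
stable_url="https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable-4.6/openshift-install-linux.tar.gz"
eus_url=$(printf '%s\n' "$stable_url" | sed 's|stable-4\.6|eus-4.6|')
echo "$eus_url"
```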

*** This bug has been marked as a duplicate of bug 2022840 ***
