Bug 2054916

Summary: OpenShift 4.6 provision failures on GCP with ocp/stable-4.6
Product: OpenShift Container Platform Reporter: Shane Bostick <sbostick>
Component: Installer Assignee: aos-install
Installer sub component: openshift-installer QA Contact: Gaoyun Pei <gpei>
Status: CLOSED DUPLICATE Docs Contact:
Severity: high    
Priority: unspecified    
Version: 4.6   
Target Milestone: ---   
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2022-02-16 03:30:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description                                   Flags
.openshift_install.log                        none
Full artifacts dump from oc adm must-gather   none

Description Shane Bostick 2022-02-16 01:48:32 UTC
Version:

$ openshift-install version

openshift-install 4.6.48
built from commit 1cfb1b32f5aaf0dfe0fb2ea9da41c710da9b2c76
release image quay.io/openshift-release-dev/ocp-release@sha256:6f03d6ced979d6f6fd10b6a54529c186e3f83c0ecf3e2b910d01505d2f59037a

(from https://mirror.openshift.com/pub/openshift-v4/clients/ocp/stable-4.6/openshift-install-linux.tar.gz)

Platform:

GCP

Please specify:
* IPI

install-config.yaml:
```
apiVersion: v1
baseDomain: ${BASE_DOMAIN}
compute:
- architecture: amd64
  hyperthreading: Enabled
  name: worker
  platform:
    gcp:
      type: ${WORKER_NODE_TYPE}
  replicas: ${WORKER_NODE_COUNT}
controlPlane:
  architecture: amd64
  hyperthreading: Enabled
  name: master
  platform:
    gcp:
      type: ${MASTER_NODE_TYPE}
  replicas: ${MASTER_NODE_COUNT}
metadata:
  creationTimestamp: null
  name: ${CLUSTER_NAME}
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  machineNetwork:
  - cidr: 10.0.0.0/16
  networkType: OVNKubernetes
  serviceNetwork:
  - 172.30.0.0/16
platform:
  gcp:
    projectID: ${PROJECT}
    region: ${REGION}
publish: External
pullSecret: |
  ${PULL_SECRET}
sshKey: |
  ${SSH_KEY}
```

What happened?

Bootstrap and master nodes come up but no worker nodes.
Ingress component fails to reach ready.
The installer binary has not changed.
Neither has the way we are invoking it.
We suspect a possible change on the GCP API side.
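
For anyone triaging the missing workers, here is a minimal sketch of checks that could narrow it down. It assumes `KUBECONFIG` points at the partially installed cluster and that the control plane API is reachable. The function name `check_workers` is made up for illustration and is not part of our tooling.
```shell
# Sketch only: inspect why no worker nodes joined. Assumes oc is logged in
# via KUBECONFIG=/data/auth/kubeconfig (or equivalent).
check_workers() {
    # Machine objects show whether worker instances were ever requested and their phase.
    oc get machines -n openshift-machine-api -o wide
    # MachineSets show desired vs. current worker replicas.
    oc get machinesets -n openshift-machine-api
    # Status of the ingress cluster operator that fails to become ready.
    oc get clusteroperator ingress
    # Recent machine controller logs, where cloud provider errors would likely surface.
    oc logs -n openshift-machine-api deployment/machine-api-controllers \
        -c machine-controller --tail=50
}
```
Running `check_workers` right after a failed install, before `destroy` tears the cluster down, keeps the Machine objects available for inspection.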

# Always at least include the `.openshift_install.log`

What did you expect to happen?

Cluster provisioning to complete on GCP.
This was working but recently started failing.

How to reproduce it (as minimally and precisely as possible)?

This is part of ACS testing...
https://github.com/stackrox/automation-flavors/blob/master/openshift-4/entrypoint.sh
https://github.com/stackrox/automation-flavors/blob/master/openshift-4/install-config.yaml
Those are probably private but this is essentially what we do:
```
create() {
    if [ -n "${USER_PULL_SECRET-}" ]; then
        echo "The pull secret was overridden with a user supplied value."
        PULL_SECRET="${USER_PULL_SECRET}"
        export PULL_SECRET
    fi
    MASTER_NODE_COUNT="${MASTER_NODE_COUNT:-3}"
    MASTER_NODE_TYPE="${MASTER_NODE_TYPE:-3}"
    WORKER_NODE_COUNT="${WORKER_NODE_COUNT:-3}"
    WORKER_NODE_TYPE="${WORKER_NODE_TYPE:-3}"
    REGION="${REGION:-us-east1}"

    echo ">>> Generating an SSH key pair."
    yes | ssh-keygen -t rsa -f /data/id_rsa -C '' -N ''
    chmod 0600 /data/id_rsa /data/id_rsa.pub
    read -r SSH_KEY < /data/id_rsa.pub
    export SSH_KEY

    echo ">>> Creating cluster install config."
    envsubst < /cluster-create/install-config.yaml > /data/install-config.yaml

    echo ">>> Creating the cluster."
    cd /data
    if ! openshift-install create cluster --log-level=debug > /dev/null 2>&1; then
        destroy
        echo ">>> ERROR: The create failed."
        echo "/data/.openshift_install.log:"
        sed '/Login to the console with user/s/password.*/password from file/' < /data/.openshift_install.log
        exit 1
    fi
    echo ">>> Cluster created."
    export KUBECONFIG=/data/auth/kubeconfig

    local OPENSHIFT_CONSOLE_URL OPENSHIFT_CONSOLE_LOGIN_STR OPENSHIFT_CONSOLE_USERNAME OPENSHIFT_CONSOLE_PASSWORD
    OPENSHIFT_CONSOLE_URL="https://console-openshift-console.apps.${CLUSTER_NAME}.${BASE_DOMAIN}"
    OPENSHIFT_CONSOLE_LOGIN_STR=$(grep -Eo 'Login to the console with user.*"' .openshift_install.log \
                | tail -1 | sed -e 's/\\"/"/g' | sed -e 's/"$//')
    OPENSHIFT_CONSOLE_USERNAME=$(perl -lne '/user: "(\w+)"/ and print $1' <<<"$OPENSHIFT_CONSOLE_LOGIN_STR")
    OPENSHIFT_CONSOLE_PASSWORD=$(perl -lne '/password: "([\w-]+)"/ and print $1' <<<"$OPENSHIFT_CONSOLE_LOGIN_STR")

    echo "$OPENSHIFT_CONSOLE_URL" > /data/url
    cat > /data/dotenv <<EOF
CLUSTER_NAME="$CLUSTER_NAME"
REGION="$REGION"
OPENSHIFT_VERSION="$OPENSHIFT_VERSION"
OPENSHIFT_CONSOLE_URL="$OPENSHIFT_CONSOLE_URL"
OPENSHIFT_CONSOLE_USERNAME="$OPENSHIFT_CONSOLE_USERNAME"
OPENSHIFT_CONSOLE_PASSWORD="$OPENSHIFT_CONSOLE_PASSWORD"
EOF

    echo ">>> Test cluster & kubeconfig"
    oc get nodes -o wide

    echo ">>> Deploy a bastion pod for SSH access"
    curl https://raw.githubusercontent.com/eparis/ssh-bastion/master/deploy/deploy.sh | bash

    echo ">>> Give the user some SSH help"
    cluster_name_prefix=$(cut -b1-21 <<<"$CLUSTER_NAME")
    gcp_instances_filter="name~${cluster_name_prefix}-.*"
    instances_table=$(gcloud compute instances list --project="$PROJECT" --filter="$gcp_instances_filter" | sort)
    ssh_commands=$(gcloud compute instances list --project="$PROJECT" --filter="$gcp_instances_filter" \
        --format json | jq -r '.[].name' | awk '{ printf "./data/ssh.sh %s\n", $1 }' | sort -k2)
    export instances_table ssh_commands PROJECT gcp_instances_filter
    envsubst < /cluster-create/SSH_ACCESS.md > /data/SSH_ACCESS.md
    cp /usr/bin/ssh-via-bastion.sh /data/ssh.sh
}
```
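
To make the log handling in the script above easier to follow, here is a small self-contained sketch of the same redaction and credential extraction, run against a made-up log line. The sample line, its timestamp, and the password are placeholders, and the helper names `extract_login` and `redact_login` are hypothetical. The extraction uses `sed` in place of the `perl` one-liners, so it is an approximation of what the script does, not a copy.
```shell
# Hypothetical sample line in the installer's log format; the password is a placeholder.
sample='time="2022-02-16T01:00:00Z" level=info msg="Login to the console with user: \"kubeadmin\", and password: \"abcde-fghij-klmno-pqrst\""'

extract_login() {
    # Reads log text on stdin and prints "<user> <password>" from the last login line.
    grep -Eo 'Login to the console with user.*"' \
        | tail -1 \
        | sed -e 's/\\"/"/g' -e 's/"$//' \
        | sed -nE 's/.*user: "([A-Za-z0-9_]+)".*password: "([A-Za-z0-9_-]+)".*/\1 \2/p'
}

redact_login() {
    # Same redaction as the failure path: replace everything from "password" onward.
    sed '/Login to the console with user/s/password.*/password from file/'
}

printf '%s\n' "$sample" | extract_login
printf '%s\n' "$sample" | redact_login
```
The first call prints the user and password separated by a space. The second prints the log line with the password replaced by `password from file`, which is how the log is dumped when the create fails.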

Anything else we need to know?

Tracking initial investigation through resolution on the ACS side here:
https://issues.redhat.com/browse/ROX-9228

This is an important platform for our existing ACS customers.
(i.e., OpenShift ocp/stable-4.6 running on GCP)

Comment 1 Shane Bostick 2022-02-16 01:49:50 UTC
Created attachment 1861364 [details]
.openshift_install.log

Comment 2 Shane Bostick 2022-02-16 02:08:05 UTC
Created attachment 1861368 [details]
Full artifacts dump from oc adm must-gather

Comment 3 Scott Dodson 2022-02-16 03:30:35 UTC

*** This bug has been marked as a duplicate of bug 2054914 ***