Description of problem: e2e-gcp tests are failing in CI with the following error, and this is causing some important PRs to fail to merge.

Failure summary:

```
1. Hosts:   35.243.221.117
   Play:    Initialize basic host facts
   Task:    Ensure openshift_master_cluster_hostname is set when deploying multiple masters
   Message: The conditional check 'groups.oo_masters_to_config | length > 1' failed.
            The error was: error while evaluating conditional (groups.oo_masters_to_config | length > 1):
            'dict object' has no attribute 'oo_masters_to_config'

            The error appears to be in
            '/usr/share/ansible/openshift-ansible/roles/openshift_sanitize_inventory/tasks/main.yml':
            line 131, column 3, but may be elsewhere in the file depending on the exact syntax problem.

            The offending line appears to be:

            - name: Ensure openshift_master_cluster_hostname is set when deploying multiple masters
              ^ here
```
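The failing `when:` clause assumes the `oo_masters_to_config` inventory group always exists in `groups`. As a minimal sketch of how such a task could be made defensive (the task name is taken from the error above; the `default([])` guard and the `fail` body are my own illustration, not the merged fix):

```yaml
# Hypothetical defensive version of the failing task in
# roles/openshift_sanitize_inventory/tasks/main.yml.
# With `default([])`, an inventory that never populated
# oo_masters_to_config evaluates the condition to false instead of
# raising "'dict object' has no attribute 'oo_masters_to_config'".
- name: Ensure openshift_master_cluster_hostname is set when deploying multiple masters
  fail:
    msg: openshift_master_cluster_hostname must be set when deploying multiple masters
  when:
    - (groups['oo_masters_to_config'] | default([])) | length > 1
    - openshift_master_cluster_hostname is not defined
```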
Could be related to the non-standard way clusters are provisioned on GCP.
*** Bug 1829361 has been marked as a duplicate of this bug. ***
All the recent failures are test case failures, so installation is now completing successfully. There are, however, 74 failing test cases; people should file bugs against those failures so that teams can clean them up.
I failed to post the link to recent jobs earlier: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-origin-release-3.11-e2e-gcp

BTW, I've asked @build-watcher to review the 3.11 failures and file bugs. https://coreos.slack.com/archives/CEKNRGF25/p1592483396289400
Hi Scott and all, I regret to say this is not fixed. Please read my remarks below, which I already exposed in https://bugzilla.redhat.com/show_bug.cgi?id=1828484#c10

As you have seen, Scott, jobs are still failing, and the tests are failing because of timeouts. These timeouts are caused by the setup cluster NOT having ANY compute or infra nodes. Hence the docker registry, the router, and any user deployment config cannot be instantiated. The cluster only has master nodes.

Here is a reproducer:

- Create any PR against any project using the openshift-3.11 branch.
- Log in with `oc login api.ci.openshift.org`.
- `oc project` into the project matching the PR (read the name in the project description).
- Wait for CI to reach the e2e-gcp step; you should see something like this:

```
NAME                      READY     STATUS        RESTARTS   AGE
2-centos-build            0/1       Completed     0          65m
e2e-gcp                   5/5       Running       0          46s
maven-agent-build         0/1       Terminating   0          35m
nodejs-agent-build        0/1       Completed     0          35m
nodejs10-agent-build      0/1       Completed     0          35m
nodejs12-agent-build      0/1       Completed     0          35m
slave-base-centos-build   0/1       Completed     0          65m
src-build                 0/1       Completed     0          66m
tests-build               0/1       Completed     0          65m
```

- Then rsh into the `setup` container of the e2e-gcp pod: `oc rsh -c setup e2e-gcp`
- Run `ps axwww` to determine the GCP bastion/master host; you should see something like this:

```
8443 ?  S  0:00 ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o IdentityFile="/opt/app-root/src/.ssh/google_compute_engine" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="cloud-user" -o ConnectTimeout=30 -o ControlPath=/opt/app-root/src/.ansible/cp/%h-%r 34.74.79.66 /bin/sh -c 'sudo -H -S -n -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-ayeeplgchbzxjctsbomaefopkjpoyhmi ; /usr/bin/python'"'"' && sleep 0'
```

- Log in over ssh to the newly created GCP master/bastion, using the IP address from the previous command (34.74.79.66 in my case) and the private key available in the container:

```
ssh -i /opt/app-root/src/.ssh/google_compute_engine 34.74.79.66
```

- Once on the GCP cluster master you can debug it. First problem: run `oc get nodes`:

```
[cloud-user@ci-op-vqihvtms-3f197-ig-m-p2wz ~]$ oc get nodes
NAME                             STATUS    ROLES     AGE       VERSION
ci-op-vqihvtms-3f197-ig-m-p2wz   Ready     master    5m        v1.11.0+d4cacc0
ci-op-vqihvtms-3f197-ig-n-0ljw   Ready     master    1m        v1.11.0+d4cacc0
ci-op-vqihvtms-3f197-ig-n-cqhb   Ready     master    1m        v1.11.0+d4cacc0
ci-op-vqihvtms-3f197-ig-n-q150   Ready     master    1m        v1.11.0+d4cacc0
```

As you can see, there are only master nodes here: no infra, no compute (a label check confirming this is sketched below).
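To confirm this from the labels themselves rather than just the ROLES column, the nodes can be inspected directly. This check is my own suggestion, not part of the original report; both are standard `oc` invocations:

```
# Every node should show only node-role.kubernetes.io/master; selecting
# on the infra role returns no resources on the broken cluster.
oc get nodes --show-labels
oc get nodes -l node-role.kubernetes.io/infra=true
```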
- See what is going on with the docker-registry pod:

```
[cloud-user@ci-op-vqihvtms-3f197-ig-m-p2wz ~]$ oc describe pod docker-registry-1-deploy
Name:               docker-registry-1-deploy
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             openshift.io/deployer-pod-for.name=docker-registry-1
Annotations:        openshift.io/deployment-config.name=docker-registry
                    openshift.io/deployment.name=docker-registry-1
                    openshift.io/scc=restricted
Status:             Pending
IP:
Containers:
  deployment:
    Image:      registry.svc.ci.openshift.org/ci-op-vqihvtms/stable:deployer
    Port:       <none>
    Host Port:  <none>
    Environment:
      OPENSHIFT_DEPLOYMENT_NAME:       docker-registry-1
      OPENSHIFT_DEPLOYMENT_NAMESPACE:  default
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from deployer-token-tvgrz (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  deployer-token-tvgrz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  deployer-token-tvgrz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/infra=true
Tolerations:     <none>
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  3s (x22 over 42s)  default-scheduler  0/4 nodes are available: 4 node(s) didn't match node selector.
```

The docker-registry cannot be scheduled, and neither can the routers. This is because they need infra nodes and there aren't any.

- And in the default project:

```
[cloud-user@ci-op-vqihvtms-3f197-ig-m-p2wz ~]$ oc get pods
NAME                        READY     STATUS             RESTARTS   AGE
docker-registry-1-deploy    0/1       Pending            0          22s
registry-console-1-d2bwp    0/1       InvalidImageName   0          12s
registry-console-1-deploy   1/1       Running            0          16s
router-1-deploy             0/1       Pending            0          40s
```

- The last time this occurred, I was able to edit the node-config config map and add the labels, after which the router deployed and one of my deployment configs was scheduled (a stopgap along those lines is sketched below).
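Before the fix lands, a manual stopgap in the spirit of that last step is to give one of the existing nodes the infra role label so the registry and router node selectors can match. This is only a hedged sketch: the node name comes from the reproducer above and is purely an example, and hand-labeling a master is no substitute for the installer creating real infra/compute nodes:

```
# Label an existing node as infra so pods with the
# node-role.kubernetes.io/infra=true node selector can schedule.
oc label node ci-op-vqihvtms-3f197-ig-n-0ljw node-role.kubernetes.io/infra=true

# Watch the pending deployer pods in the default project get scheduled.
oc get pods -n default -w
```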
All nodes are being assigned an openshift_node_group_name of node-config-master in the setup_scale_group_facts.yml [1] tasks in the openshift_gcp role. This is causing all nodes to be labeled as masters instead of having master/infra/compute nodes. Working on a fix. [1] https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_gcp/tasks/setup_scale_group_facts.yml
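To illustrate the shape of the problem: the node group name has to be derived from the instance group a host belongs to, rather than hard-coded to node-config-master for every host. The following is a hypothetical sketch, not the actual contents of setup_scale_group_facts.yml at [1], and the tag/group names are assumptions for illustration:

```yaml
# Hypothetical sketch: assign openshift_node_group_name per GCP
# instance group instead of giving every host node-config-master.
# Tag names (tag_ocp-master/infra/node) are assumptions.
- name: Add master instances to the masters group
  add_host:
    name: "{{ hostvars[item].gce_name }}"
    groups: masters, nodes
    openshift_node_group_name: node-config-master
  with_items: "{{ groups['tag_ocp-master'] | default([]) }}"

- name: Add infra instances to the nodes group
  add_host:
    name: "{{ hostvars[item].gce_name }}"
    groups: nodes
    openshift_node_group_name: node-config-infra
  with_items: "{{ groups['tag_ocp-infra'] | default([]) }}"

- name: Add compute instances to the nodes group
  add_host:
    name: "{{ hostvars[item].gce_name }}"
    groups: nodes
    openshift_node_group_name: node-config-compute
  with_items: "{{ groups['tag_ocp-node'] | default([]) }}"
```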
PR 12203 has merged, which should address the second issue (tracked in bug 1848723) identified in this bug.
This no longer happens after the changes on July 8th.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2990