Bug 1828484 - e2e-gcp fails in ci due to groups.oo_masters_to_config conditional check
Summary: e2e-gcp fails in ci due to groups.oo_masters_to_config conditional check
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 3.11.z
Assignee: Russell Teague
QA Contact: Scott Dodson
URL:
Whiteboard:
Duplicates: 1829361
Depends On: 1830021 1848723
Blocks:
 
Reported: 2020-04-27 19:46 UTC by Vikas Laad
Modified: 2020-07-27 13:49 UTC
CC: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The conditional set on the task checking the openshift_master_cluster_hostname variable expected the [masters] group to be set in the inventory. Consequence: If the [masters] group was not defined, the task would fail. Fix: Added a conditional to check whether [masters] is defined. Result: The task is skipped when [masters] is not defined, instead of failing on an undefined variable.
Clone Of:
Environment:
Last Closed: 2020-07-27 13:49:10 UTC
Target Upstream Version:
Embargoed:




Links
- Github openshift/openshift-ansible pull 12152 (closed): Bug 1828484: Add conditional to fix openshift_master_cluster_hostname check (last updated 2020-08-10 18:10:11 UTC)
- Github openshift/openshift-ansible pull 12203 (closed): Bug 1848723: Correct GCP group mapping for infra and compute (last updated 2020-08-10 18:10:11 UTC)
- Red Hat Product Errata RHBA-2020:2990 (last updated 2020-07-27 13:49:23 UTC)

Description Vikas Laad 2020-04-27 19:46:35 UTC
Description of problem:
e2e-gcp tests are failing in CI with the following error, which is blocking some important PRs from merging.

Failure summary:
  1. Hosts:    35.243.221.117
     Play:     Initialize basic host facts
     Task:     Ensure openshift_master_cluster_hostname is set when deploying multiple masters
     Message:  The conditional check 'groups.oo_masters_to_config | length > 1' failed. The error was: error while evaluating conditional (groups.oo_masters_to_config | length > 1): 'dict object' has no attribute 'oo_masters_to_config'
               
               The error appears to be in '/usr/share/ansible/openshift-ansible/roles/openshift_sanitize_inventory/tasks/main.yml': line 131, column 3, but may
               be elsewhere in the file depending on the exact syntax problem.
               
               The offending line appears to be:
               
               
               - name: Ensure openshift_master_cluster_hostname is set when deploying multiple masters
                 ^ here
---
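
For context, the failing conditional in roles/openshift_sanitize_inventory/tasks/main.yml evaluates groups.oo_masters_to_config without first checking that the group exists. A minimal sketch of the kind of guard the fix (PR 12152) describes follows; the task body and message text here are illustrative, not the exact contents of the role:

```
# Hypothetical sketch: guard the group lookup so the task is skipped,
# rather than erroring, when the [masters] group is absent.
- name: Ensure openshift_master_cluster_hostname is set when deploying multiple masters
  fail:
    msg: >-
      openshift_master_cluster_hostname must be set when deploying
      multiple masters.
  when:
    - groups.oo_masters_to_config is defined          # skip when [masters] is not in the inventory
    - groups.oo_masters_to_config | length > 1
    - openshift_master_cluster_hostname is not defined
```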

Comment 3 Russell Teague 2020-04-27 19:51:37 UTC
Could be related to the non-standard way clusters are provisioned on GCP.

Comment 6 Russell Teague 2020-04-29 13:02:38 UTC
*** Bug 1829361 has been marked as a duplicate of this bug. ***

Comment 18 Scott Dodson 2020-06-18 13:02:27 UTC
All the recent failures are test case failures, so installation is now completing successfully. There are, however, 74 failing test cases, and people should file bugs against those failures so that teams can clean them up.

Comment 19 Scott Dodson 2020-06-18 13:03:54 UTC
Failed to post the link to recent jobs:

https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/job-history/gs/origin-ci-test/pr-logs/directory/pull-ci-openshift-origin-release-3.11-e2e-gcp

BTW, I've asked @build-watcher to review the 3.11 failures and file bugs.

https://coreos.slack.com/archives/CEKNRGF25/p1592483396289400

Comment 20 Akram Ben Aissi 2020-06-19 00:33:34 UTC
Hi Scott and all,

I'm afraid this is not fixed. Please read my remarks below, which I already raised in https://bugzilla.redhat.com/show_bug.cgi?id=1828484#c10


As you have seen, Scott, jobs are still failing, and the tests are failing because of timeouts.
These timeouts are caused by the provisioned cluster not having any compute or infra nodes.
Hence the docker registry, the router, and any user deployment config cannot be instantiated. The cluster only has master nodes.

Here is a reproducer:
- Create a PR in any project that uses the openshift-3.11 branch
- Log in with `oc login api.ci.openshift.org`
- Switch to the project matching the PR with `oc project` (the name is in the project description)
- Wait for CI to reach the e2e-gcp step; you should see something like this:
```
NAME                      READY   STATUS        RESTARTS   AGE
2-centos-build            0/1     Completed     0          65m
e2e-gcp                   5/5     Running       0          46s
maven-agent-build         0/1     Terminating   0          35m
nodejs-agent-build        0/1     Completed     0          35m
nodejs10-agent-build      0/1     Completed     0          35m
nodejs12-agent-build      0/1     Completed     0          35m
slave-base-centos-build   0/1     Completed     0          65m
src-build                 0/1     Completed     0          66m
tests-build               0/1     Completed     0          65m
```

- then rsh into the `setup` container of the e2e-gcp pod: `oc rsh -c setup e2e-gcp`
- run `ps axwww` to determine the GCP bastion/master host; you should see something like this:

```
8443 ?        S      0:00 ssh -o ControlMaster=auto -o ControlPersist=600s -o StrictHostKeyChecking=no -o IdentityFile="/opt/app-root/src/.ssh/google_compute_engine" -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o User="cloud-user" -o ConnectTimeout=30 -o ControlPath=/opt/app-root/src/.ansible/cp/%h-%r 34.74.79.66 /bin/sh -c 'sudo -H -S -n  -u root /bin/sh -c '"'"'echo BECOME-SUCCESS-ayeeplgchbzxjctsbomaefopkjpoyhmi ; /usr/bin/python'"'"' && sleep 0'
```

- log in over SSH to the newly created GCP master/bastion, using the IP address from the previous command (in my case 34.74.79.66) and the private key available in the container:
```
ssh -i /opt/app-root/src/.ssh/google_compute_engine 34.74.79.66
```

- once on the GCP cluster master you can debug it. First problem: run `oc get nodes`:

```
[cloud-user@ci-op-vqihvtms-3f197-ig-m-p2wz ~]$ oc get nodes
NAME                             STATUS    ROLES     AGE       VERSION
ci-op-vqihvtms-3f197-ig-m-p2wz   Ready     master    5m        v1.11.0+d4cacc0
ci-op-vqihvtms-3f197-ig-n-0ljw   Ready     master    1m        v1.11.0+d4cacc0
ci-op-vqihvtms-3f197-ig-n-cqhb   Ready     master    1m        v1.11.0+d4cacc0
ci-op-vqihvtms-3f197-ig-n-q150   Ready     master    1m        v1.11.0+d4cacc0

```

As you can see, there are only master nodes here: no infra, no compute.

- see what is going on with the docker-registry pod:
```
[cloud-user@ci-op-vqihvtms-3f197-ig-m-p2wz ~]$ oc describe pod docker-registry-1-deploy
Name:               docker-registry-1-deploy
Namespace:          default
Priority:           0
PriorityClassName:  <none>
Node:               <none>
Labels:             openshift.io/deployer-pod-for.name=docker-registry-1
Annotations:        openshift.io/deployment-config.name=docker-registry
                    openshift.io/deployment.name=docker-registry-1
                    openshift.io/scc=restricted
Status:             Pending
IP:
Containers:
  deployment:
    Image:      registry.svc.ci.openshift.org/ci-op-vqihvtms/stable:deployer
    Port:       <none>
    Host Port:  <none>
    Environment:
      OPENSHIFT_DEPLOYMENT_NAME:       docker-registry-1
      OPENSHIFT_DEPLOYMENT_NAMESPACE:  default
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from deployer-token-tvgrz (ro)
Conditions:
  Type           Status
  PodScheduled   False
Volumes:
  deployer-token-tvgrz:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  deployer-token-tvgrz
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  node-role.kubernetes.io/infra=true
Tolerations:     <none>
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  3s (x22 over 42s)  default-scheduler  0/4 nodes are available: 4 node(s) didn't match node selector.
```

the docker-registry cannot be scheduled, and neither can the routers. This is because they need infra nodes and there aren't any.

- and in the default project:
```
[cloud-user@ci-op-vqihvtms-3f197-ig-m-p2wz ~]$ oc get pods
NAME                        READY     STATUS             RESTARTS   AGE
docker-registry-1-deploy    0/1       Pending            0          22s
registry-console-1-d2bwp    0/1       InvalidImageName   0          12s
registry-console-1-deploy   1/1       Running            0          16s
router-1-deploy             0/1       Pending            0          40s
```


- last time this occurred, I was able to edit the node-config ConfigMap and add the labels, after which the router deployed and one of my deployment configs was scheduled.
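
For reference, that manual workaround amounts to adding the role label to the node group configuration. A hedged sketch of the relevant fragment of node-config.yaml as stored in a 3.11 node group ConfigMap (the ConfigMap name, node-config-compute in the openshift-node namespace, and surrounding keys are assumptions from the 3.11 defaults and may differ in this CI setup):

```
# Illustrative fragment only: the node-labels kubelet argument is what
# gives a node its node-role.kubernetes.io/* label in OpenShift 3.11.
kubeletArguments:
  node-labels:
    - "node-role.kubernetes.io/compute=true"
```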

Comment 21 Russell Teague 2020-07-09 13:50:52 UTC
All nodes are being assigned an openshift_node_group_name of node-config-master in the setup_scale_group_facts.yml [1] tasks in the openshift_gcp role.  This is causing all nodes to be labeled as masters instead of having master/infra/compute nodes.  Working on a fix.

[1] https://github.com/openshift/openshift-ansible/blob/release-3.11/roles/openshift_gcp/tasks/setup_scale_group_facts.yml
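
A rough sketch of the intended per-group mapping is below; the group names (tag_ocp-*) and task layout are illustrative assumptions, not the exact contents of setup_scale_group_facts.yml or the fix PR:

```
# Illustrative only: assign openshift_node_group_name per scale group
# instead of hard-coding node-config-master for every host.
- name: Add master instances to host groups
  add_host:
    name: "{{ item }}"
    groups: masters, etcd
    openshift_node_group_name: node-config-master
  with_items: "{{ groups['tag_ocp-master'] | default([]) }}"

- name: Add infra instances to host group
  add_host:
    name: "{{ item }}"
    groups: nodes
    openshift_node_group_name: node-config-infra
  with_items: "{{ groups['tag_ocp-infra'] | default([]) }}"

- name: Add compute instances to host group
  add_host:
    name: "{{ item }}"
    groups: nodes
    openshift_node_group_name: node-config-compute
  with_items: "{{ groups['tag_ocp-compute'] | default([]) }}"
```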

Comment 22 Russell Teague 2020-07-10 18:17:03 UTC
PR 12203 has merged, which should address the second issue (tracked in bug 1848723) identified in this bug.

Comment 26 Scott Dodson 2020-07-24 14:59:15 UTC
This no longer happens after the changes on July 8th.

Comment 28 errata-xmlrpc 2020-07-27 13:49:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2990

