Bug 1558689 - cluster_facts is broken when deploying from containerized installer and ansible_connection=local
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 3.9.z
Assignee: Russell Teague
QA Contact: Wenjing Zheng
URL:
Whiteboard:
Depends On:
Blocks: 1593439
 
Reported: 2018-03-20 19:39 UTC by Samuel Padgett
Modified: 2018-08-01 14:52 UTC
CC List: 9 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Adding iproute and hostname packages to the openshift-ansible image: Ansible uses common user-space utilities for determining default facts. The ansible_default_ipv4.address fact is populated using utilities from the iproute package, and this fact is used for populating the openshift IP in roles/openshift_facts/library/openshift_facts.py.
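In practice this means two user-space commands must be available inside the installer image, both provided by the packages named above (a minimal illustration, assuming a RHEL-based image; the probe address is only an example):

ip route get 8.8.8.8   # from iproute; Ansible shells out to ip when populating ansible_default_ipv4
hostname -f            # from hostname; openshift_facts uses it to determine the node's FQDN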
Clone Of:
Clones: 1593439
Environment:
Last Closed: 2018-08-01 14:52:02 UTC
Target Upstream Version:
Embargoed:



Description Samuel Padgett 2018-03-20 19:39:07 UTC
While trying to use oc cluster up with --metrics or --logging, the deployment fails with:

TASK [Gather Cluster facts] ***************************************************************************************************************************************************************************************
task path: /usr/share/ansible/openshift-ansible/playbooks/init/cluster_facts.yml:9
Friday 02 February 2018 21:02:07 +0000 (0:00:00.076) 0:00:05.943 *******
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'ansible_default_ipv4'
fatal: [127.0.0.1]: FAILED! => {"changed": false, "module_stderr": "Traceback (most recent call last):\n File "/tmp/ansible_jrgkpu/ansible_module_openshift_facts.py", line 1687, in <module>\n main()\n File "/tmp/ansible_jrgkpu/ansible_module_openshift_facts.py", line 1674, in main\n additive_facts_to_overwrite)\n File "/tmp/ansible_jrgkpu/ansible_module_openshift_facts.py", line 1339, in __init__\n additive_facts_to_overwrite)\n File "/tmp/ansible_jrgkpu/ansible_module_openshift_facts.py", line 1368, in generate_facts\n defaults = self.get_defaults(roles, deployment_type, deployment_subtype)\n File "/tmp/ansible_jrgkpu/ansible_module_openshift_facts.py", line 1400, in get_defaults\n ip_addr = self.system_facts['ansible_default_ipv4']['address']\nKeyError: 'ansible_default_ipv4'\n", "module_stdout": "", "msg": "MODULE FAILURE", "rc": 1}

Additional details in https://github.com/openshift/openshift-ansible/issues/7006

Comment 1 Russell Teague 2018-04-03 20:04:18 UTC
Proposed: https://github.com/openshift/openshift-ansible/pull/7760

Comment 2 Russell Teague 2018-04-05 12:38:39 UTC
release-3.9: https://github.com/openshift/openshift-ansible/pull/7805

Comment 3 Davi Garcia 2018-04-10 13:38:59 UTC
I'm facing a similar issue when running 'oc cluster up --logging=true --metrics=true --version=v3.9 --loglevel=2':

$ oc get pods --all-namespaces 
NAMESPACE               NAME                                  READY     STATUS      RESTARTS   AGE
default                 docker-registry-1-ghbt9               1/1       Running     0          7m
default                 persistent-volume-setup-wplxp         0/1       Completed   0          7m
default                 router-1-4j4tx                        1/1       Running     0          7m
logging                 openshift-ansible-logging-job-264wh   0/1       Error       0          4m
logging                 openshift-ansible-logging-job-c8g8c   0/1       Error       0          7m
logging                 openshift-ansible-logging-job-g7g8v   0/1       Error       0          5m
logging                 openshift-ansible-logging-job-lnfk9   0/1       Error       0          2m
logging                 openshift-ansible-logging-job-nv5mv   0/1       Error       0          5m
logging                 openshift-ansible-logging-job-w8pts   0/1       Error       0          4m
logging                 openshift-ansible-logging-job-whckj   0/1       Error       0          5m
openshift-infra         openshift-ansible-metrics-job-2p4kf   0/1       Error       0          3m
openshift-infra         openshift-ansible-metrics-job-77t4b   0/1       Error       0          4m
openshift-infra         openshift-ansible-metrics-job-gw7sv   0/1       Error       0          5m
openshift-infra         openshift-ansible-metrics-job-h8xqm   0/1       Error       0          5m
openshift-infra         openshift-ansible-metrics-job-md5p9   0/1       Error       0          7m
openshift-infra         openshift-ansible-metrics-job-v5q4n   0/1       Error       0          5m
openshift-web-console   webconsole-548fd9b7c4-kzmh6           1/1       Running     0          7m

Both Metrics and Logging playbook jobs fail due to the same problem:

TASK [Gather Cluster facts] ****************************************************
Sunday 01 April 2018  13:56:00 +0000 (0:00:00.074)       0:00:07.064 ********** 
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'ansible_default_ipv4'
fatal: [127.0.0.1]: FAILED! => {"changed": false, "module_stderr": "Traceback (most recent call last):\n  File \"/tmp/ansible_tQr_lb/ansible_module_openshift_facts.py\", line 1688, in <module>\n    main()\n  File \"/tmp/ansible_tQr_lb/ansible_module_openshift_facts.py\", line 1675, in main\n    additive_facts_to_overwrite)\n  File \"/tmp/ansible_tQr_lb/ansible_module_openshift_facts.py\", line 1340, in __init__\n    additive_facts_to_overwrite)\n  File \"/tmp/ansible_tQr_lb/ansible_module_openshift_facts.py\", line 1369, in generate_facts\n    defaults = self.get_defaults(roles, deployment_type, deployment_subtype)\n  File \"/tmp/ansible_tQr_lb/ansible_module_openshift_facts.py\", line 1401, in get_defaults\n    ip_addr = self.system_facts['ansible_default_ipv4']['address']\nKeyError: 'ansible_default_ipv4'\n", "module_stdout": "", "msg": "MODULE FAILURE", "rc": 1}

My environment:

$ oc version
oc v3.9.0+191fece
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://127.0.0.1:8443
openshift v3.9.0+191fece
kubernetes v1.9.1+a0ce1bc657

$ docker version
Client:
 Version:	18.03.0-ce
 API version:	1.37
 Go version:	go1.9.4
 Git commit:	0520e24
 Built:	Wed Mar 21 23:11:15 2018
 OS/Arch:	linux/amd64
 Experimental:	false
 Orchestrator:	swarm

Server:
 Engine:
  Version:	18.03.0-ce
  API version:	1.37 (minimum version 1.12)
  Go version:	go1.9.4
  Git commit:	0520e24
  Built:	Wed Mar 21 23:15:01 2018
  OS/Arch:	linux/amd64
  Experimental:	false

$ cat /etc/fedora-release 
Fedora release 27 (Twenty Seven)

Comment 4 Russell Teague 2018-04-10 14:20:29 UTC
Please pull the latest image for the v3.9 tag and try again.

Comment 5 Davi Garcia 2018-04-10 17:22:54 UTC
I'm still facing the same issue when running 'oc cluster up --metrics=true --version=v3.9 --loglevel=2' after cleaning all images from my Docker local storage.

It looks like the image tagged with v3.9 (or latest) at registry.access.redhat.com is still from 2 weeks ago:

$ docker images
REPOSITORY                                                  TAG                 IMAGE ID            CREATED             SIZE
registry.access.redhat.com/openshift3/ose-ansible           latest              19f345ac236b        2 weeks ago         846MB
registry.access.redhat.com/openshift3/ose-ansible           v3.9                19f345ac236b        2 weeks ago         846MB
registry.access.redhat.com/openshift3/ose-haproxy-router    latest              4eb76bae54ef        2 weeks ago         1.28GB
registry.access.redhat.com/openshift3/ose-haproxy-router    v3.9                4eb76bae54ef        2 weeks ago         1.28GB
registry.access.redhat.com/openshift3/ose-deployer          latest              ba9779c50c5b        2 weeks ago         1.26GB
registry.access.redhat.com/openshift3/ose-deployer          v3.9                ba9779c50c5b        2 weeks ago         1.26GB
registry.access.redhat.com/openshift3/ose                   latest              078f595369ae        2 weeks ago         1.26GB
registry.access.redhat.com/openshift3/ose                   v3.9                078f595369ae        2 weeks ago         1.26GB
registry.access.redhat.com/openshift3/ose-docker-registry   latest              11923de49247        2 weeks ago         459MB
registry.access.redhat.com/openshift3/ose-docker-registry   v3.9                11923de49247        2 weeks ago         459MB
registry.access.redhat.com/openshift3/ose-web-console       latest              a0f5a2e23591        2 weeks ago         489MB
registry.access.redhat.com/openshift3/ose-web-console       v3.9                a0f5a2e23591        2 weeks ago         489MB
registry.access.redhat.com/openshift3/ose-pod               latest              e598d93f5abe        2 weeks ago         209MB
registry.access.redhat.com/openshift3/ose-pod               v3.9                e598d93f5abe        2 weeks ago         209MB

Some 'oc' outputs:

$ oc get pods --all-namespaces
NAMESPACE               NAME                                  READY     STATUS      RESTARTS   AGE
default                 docker-registry-1-q4sfp               1/1       Running     0          24m
default                 persistent-volume-setup-x74b7         0/1       Completed   0          24m
default                 router-1-59rw2                        1/1       Running     0          24m
openshift-infra         openshift-ansible-metrics-job-4dx6q   0/1       Error       0          23m
openshift-infra         openshift-ansible-metrics-job-5h76z   0/1       Error       0          20m
openshift-infra         openshift-ansible-metrics-job-6cwgz   0/1       Error       0          23m
openshift-infra         openshift-ansible-metrics-job-6qlg6   0/1       Error       0          22m
openshift-infra         openshift-ansible-metrics-job-fggrm   0/1       Error       0          21m
openshift-infra         openshift-ansible-metrics-job-fxgd6   0/1       Error       0          22m
openshift-infra         openshift-ansible-metrics-job-vdmfn   0/1       Error       0          24m
openshift-web-console   webconsole-744d5fcf55-ck6vh           1/1       Running     0          24m

And:

$ oc logs openshift-ansible-metrics-job-5h76z -n openshift-infra

(...)

TASK [Gather Cluster facts] ****************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'ansible_default_ipv4'
fatal: [127.0.0.1]: FAILED! => {"changed": false, "module_stderr": "Traceback (most recent call last):\n  File \"/tmp/ansible_S2aJ4E/ansible_module_openshift_facts.py\", line 1704, in <module>\n    main()\n  File \"/tmp/ansible_S2aJ4E/ansible_module_openshift_facts.py\", line 1691, in main\n    additive_facts_to_overwrite)\n  File \"/tmp/ansible_S2aJ4E/ansible_module_openshift_facts.py\", line 1355, in __init__\n    additive_facts_to_overwrite)\n  File \"/tmp/ansible_S2aJ4E/ansible_module_openshift_facts.py\", line 1384, in generate_facts\n    defaults = self.get_defaults(roles, deployment_type, deployment_subtype)\n  File \"/tmp/ansible_S2aJ4E/ansible_module_openshift_facts.py\", line 1416, in get_defaults\n    ip_addr = self.system_facts['ansible_default_ipv4']['address']\nKeyError: 'ansible_default_ipv4'\n", "module_stdout": "", "msg": "MODULE FAILURE", "rc": 0}


PLAY RECAP *********************************************************************
127.0.0.1                  : ok=30   changed=0    unreachable=0    failed=1   


INSTALLER STATUS ***************************************************************
Initialization             : In Progress (0:00:12)

Comment 6 Russell Teague 2018-04-11 12:23:22 UTC
Commit is in build openshift-ansible-3.9.20-1.git.0.f99fb43.el7

Comment 7 Davi Garcia 2018-04-11 12:54:55 UTC
(In reply to Russell Teague from comment #6)
> Commit is in build openshift-ansible-3.9.20-1.git.0.f99fb43.el7

I couldn't find any container image tagged with v3.9.20 in the registries, and it looks like the 'v3.9' and 'latest' tags are still pointing to the affected version. Maybe the new build has not been pushed to the public registries yet.

I'll wait and try again in a few days.

Comment 8 Scott Dodson 2018-04-11 13:02:57 UTC
Davi,

The v3.9.20 tag is specific to OCP images. However, the public CI infrastructure pushes docker.io/openshift/origin-ansible:v3.9 periodically, and it looks like that has the change if you'd like to test with it. For supported OCP installs, however, you'll need to wait until this bug is attached to an errata and ships.
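For anyone who wants to try the CI-built image mentioned above before the errata ships, pulling it explicitly (a suggestion, not part of the original comment) would look like:

docker pull docker.io/openshift/origin-ansible:v3.9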

Comment 9 Davi Garcia 2018-04-11 13:22:54 UTC
(In reply to Scott Dodson from comment #8)
> Davi,
> 
> The v3.9.20 tag is specific to OCP images. However, the public CI
> infrastructure will push docker.io/openshift/origin-ansible:v3.9
> periodically and it looks like that has the change if you'd like to test
> with it. For supported OCP installs however you'll need to wait until this
> bug is attached to an errata and ships.

Thanks for your feedback!

I'm testing with 'oc cluster up --metrics=true --version=v3.9 --loglevel=2' (origin's oc version) and I'm not sure how I should make it pull the artefacts from a different registry/repository to get the fix. Any advice?

Comment 10 XiuJuan Wang 2018-04-28 08:29:20 UTC
I used the latest v3.9 brew image, which points to version v3.9.27, and hit the same error as in comment 5.
#skopeo inspect docker://brew-***/openshift3/ose-ansible:v3.9  --tls-verify=false  | grep version 
        "version": "v3.9.27"


#oc cluster up --image='brew-****/openshift3/ose' --logging=true --metrics=true --version=v3.9 --loglevel=8
# oc get pods --all-namespaces
NAMESPACE               NAME                                  READY     STATUS      RESTARTS   AGE
default                 docker-registry-1-vzr9z               1/1       Running     0          6m
default                 persistent-volume-setup-m5qzq         0/1       Evicted     0          8m
default                 persistent-volume-setup-rdw49         0/1       Completed   0          6m
default                 router-1-9f6c2                        1/1       Running     0          6m
logging                 openshift-ansible-logging-job-sfw4l   0/1       Error       0          8m
openshift-infra         openshift-ansible-metrics-job-64mwd   0/1       Error       0          8m
openshift-web-console   webconsole-5849496d9d-fcsb7           1/1       Running     0          8m

TASK [Gather Cluster facts] ****************************************************
An exception occurred during task execution. To see the full traceback, use -vvv. The error was: KeyError: 'ansible_default_ipv4'
fatal: [127.0.0.1]: FAILED! => {"changed": false, "failed": true, "module_stderr": "Traceback (most recent call last):\n  File \"/tmp/ansible_gs2xvV/ansible_module_openshift_facts.py\", line 1688, in <module>\n    main()\n  File \"/tmp/ansible_gs2xvV/ansible_module_openshift_facts.py\", line 1675, in main\n    additive_facts_to_overwrite)\n  File \"/tmp/ansible_gs2xvV/ansible_module_openshift_facts.py\", line 1340, in __init__\n    additive_facts_to_overwrite)\n  File \"/tmp/ansible_gs2xvV/ansible_module_openshift_facts.py\", line 1369, in generate_facts\n    defaults = self.get_defaults(roles, deployment_type, deployment_subtype)\n  File \"/tmp/ansible_gs2xvV/ansible_module_openshift_facts.py\", line 1401, in get_defaults\n    ip_addr = self.system_facts['ansible_default_ipv4']['address']\nKeyError: 'ansible_default_ipv4'\n", "module_stdout": "", "msg": "MODULE FAILURE", "rc": 0}


Another issue:
There are no --logging|--metrics options in 'oc cluster up' v3.10, and 'oc cluster add logging' can't be used to add this component.
So how do we add logging/metrics with v3.10?

Comment 11 Davi Garcia 2018-04-28 13:51:02 UTC
After updating the image, it looks like the playbook now fails at a different point:

$ oc logs -f openshift-ansible-metrics-job-gl8vz -n openshift-infra
(...)
RUNNING HANDLER [openshift_metrics : restart master api] ***********************
Saturday 28 April 2018  13:40:37 +0000 (0:00:00.179)       0:02:32.201 ******** 
fatal: [127.0.0.1]: FAILED! => {"changed": false, "cmd": "/usr/bin/systemctl", "msg": "Failed to get D-Bus connection: Operation not permitted", "rc": 1, "stderr": "Failed to get D-Bus connection: Operation not permitted\n", "stderr_lines": ["Failed to get D-Bus connection: Operation not permitted"], "stdout": "", "stdout_lines": []}

The command used is still the same: 

$ oc cluster up --metrics=true --version=v3.9 --loglevel=2

The images I have locally are:

$ docker images
REPOSITORY                                  TAG                 IMAGE ID            CREATED             SIZE
openshift/origin-ansible                    v3.9                d2967b4f1b8a        16 hours ago        1.22GB
openshift/origin-web-console                v3.9                2bff42918944        9 days ago          489MB
openshift/origin-docker-registry            v3.9                c13803d064c8        9 days ago          458MB
openshift/origin-haproxy-router             v3.9                0a48e702efe7        9 days ago          1.28GB
openshift/origin-deployer                   v3.9                78a4725976a1        9 days ago          1.25GB
openshift/origin                            v3.9                94e55dec4dc3        9 days ago          1.25GB
openshift/origin-pod                        v3.9                f62f913f6617        9 days ago          220MB
openshift/origin-metrics-cassandra          v3.9                d77a710bd9f0        5 months ago        780MB
openshift/origin-metrics-hawkular-metrics   v3.9                67c1503b2ae2        5 months ago        914MB
openshift/origin-metrics-heapster           v3.9                93f72c7c2f46        5 months ago        820MB

Comment 12 Ed Seymour 2018-06-12 15:21:01 UTC
I've hit the same (originally reported) error running the oc v3.9.30 client on a Fedora 28 machine.

In the openshift-infra project you get a number of failed attempts to run the metrics Ansible installation. The failure reports "KeyError: 'ansible_default_ipv4'".

It appears the ansible playbooks are attempting to reference the ansible_default_ipv4 fact and from this derive an ipv4 address. 

At the command line, I am able to check the facts for my Fedora 28 machine using the following: 

ansible all -i localhost, -m setup -c local

If I grep this output for ansible_default_ipv4, the fact is present and correct.

Using a failed metrics install pod, I can bring it up with the debug option:

oc debug pod/<pod name>

Then, running the above ansible command in the pod, I can see that the fact is not available. I've also determined that /sbin/ip is not present in the container; if Ansible uses it to determine this fact, that could be why the fact is not available.
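Putting those steps together, a diagnostic sketch (pod name is a placeholder; the namespace is taken from the earlier output) looks like:

# On the host: the fact is populated
ansible all -i localhost, -m setup -c local | grep -A3 ansible_default_ipv4

# Inside a failed installer pod
oc debug pod/<pod name> -n openshift-infra
ansible all -i localhost, -m setup -c local | grep ansible_default_ipv4   # no output
ls /sbin/ip                                                               # No such file or directory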

Comment 13 Ed Seymour 2018-06-13 06:32:18 UTC
Looks like iproute was added to fix this upstream: https://github.com/openshift/openshift-ansible/pull/7760, but the fix has not made it into the registry.access.redhat.com/openshift3/ose-ansible:v3.9.30 image (last updated 11 days ago).

Comment 14 Russell Teague 2018-06-20 15:04:02 UTC
Update for Dockerfile.rhel7 used to build ose-ansible
https://github.com/openshift/openshift-ansible/pull/8870

Comment 15 openshift-github-bot 2018-06-20 16:45:18 UTC
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/ee09d0942722b242a2397d4b8c3fe9862f3c575b
Bug 1558689 - Add iproute to Dockerfile.rhel7

iproute is required by Ansible to gather some networking facts

Fixes https://bugzilla.redhat.com/show_bug.cgi?id=1558689

https://github.com/openshift/openshift-ansible/commit/c3195db0363833c8a273b1a9bc91c4e6021477ee
Merge pull request #8870 from mtnbikenc/fix-1558689

Bug 1558689 - Add iproute to Dockerfile.rhel7
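For context, the change these commits describe boils down to adding iproute to the packages installed in the openshift-ansible image; a minimal sketch of such a Dockerfile.rhel7 line (not the exact PR diff) would be:

RUN yum install -y iproute && yum clean all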

Comment 16 Russell Teague 2018-06-20 17:09:36 UTC
release-3.9: https://github.com/openshift/openshift-ansible/pull/8874

Comment 17 Russell Teague 2018-07-09 17:16:50 UTC
openshift-ansible-3.9.33-1

Comment 18 XiuJuan Wang 2018-07-11 03:01:24 UTC
Logging and metrics pods still can't run with the 3.9.33-1 client and 3.9.33-1 images.

#oc cluster up --image='brew-***/openshift3/ose' --logging=true --metrics=true --version=v3.9.33-1 --loglevel=8    --public-hostname=10.8.241.46

#oc get pods  --all-namespaces 
NAMESPACE               NAME                                  READY     STATUS      RESTARTS   AGE
default                 docker-registry-1-hbx5q               1/1       Running     0          1m
default                 persistent-volume-setup-b95qk         0/1       Completed   0          1m
default                 router-1-q79f2                        1/1       Running     0          1m
logging                 openshift-ansible-logging-job-jt58s   0/1       Error       0          20s
logging                 openshift-ansible-logging-job-q592c   0/1       Error       0          1m
openshift-infra         openshift-ansible-metrics-job-fr74d   0/1       Error       0          1m
openshift-infra         openshift-ansible-metrics-job-v9nnl   0/1       Error       0          16s
openshift-web-console   webconsole-746648b7d4-fn8x8           1/1       Running     0          1m

#oc  logs -f openshift-ansible-logging-job-q592c -n logging 
TASK [Ensure various deps for running system containers are installed] *********
skipping: [10.8.241.46] => (item=atomic)  => {"changed": false, "item": "atomic", "skip_reason": "Conditional result was False", "skipped": true}
skipping: [10.8.241.46] => (item=ostree)  => {"changed": false, "item": "ostree", "skip_reason": "Conditional result was False", "skipped": true}
skipping: [10.8.241.46] => (item=runc)  => {"changed": false, "item": "runc", "skip_reason": "Conditional result was False", "skipped": true}
PLAY [Initialize cluster facts] ************************************************
TASK [Gathering Facts] *********************************************************
ok: [10.8.241.46]
TASK [Gather Cluster facts] ****************************************************
fatal: [10.8.241.46]: FAILED! => {"changed": false, "cmd": "hostname -f", "failed": true, "msg": "[Errno 2] No such file or directory", "rc": 2}
PLAY RECAP *********************************************************************
10.8.241.46                : ok=19   changed=0    unreachable=0    failed=1   
localhost                  : ok=11   changed=0    unreachable=0    failed=0
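The "hostname -f" failure suggests the hostname utility is also missing from the image (consistent with the Doc Text above, which mentions adding both iproute and hostname). A quick check from inside one of the failing job pods (a diagnostic sketch) would be:

command -v ip || echo "ip (iproute) missing from the image"
command -v hostname || echo "hostname utility missing from the image"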

Comment 21 XiuJuan Wang 2018-08-01 04:35:46 UTC
#oc cluster up --image='brew-pulp-***:8888/openshift3/ose' --logging=true --metrics=true --version=v3.9.40

The error in comment #18 has been fixed, but a new error comes up because there is no useful yum config file:

TASK [openshift_version : fail] ************************************************
fatal: [127.0.0.1]: FAILED! => {"changed": false, "failed": true, "msg": "Package 'atomic-openshift' not found"}


oc debug openshift-ansible-logging-job-wx2h2 -n logging 
Defaulting container name to openshift-ansible-logging-job.
Use 'oc describe pod/openshift-ansible-logging-job-wx2h2-debug -n logging' to see all of the containers in this pod.
Debugging with pod/openshift-ansible-logging-job-wx2h2-debug, original command: <image entrypoint>
Waiting for pod to start ...
Pod IP: 172.16.120.90
If you don't see a command prompt, try pressing enter.
sh-4.2# yum search atomic-openshift
Loaded plugins: ovl, product-id, search-disabled-repos, subscription-manager
This system is not receiving updates. You can use subscription-manager on the host to register and assign subscriptions.
=============================================================================================== N/S matched: atomic-openshift ===============================================================================================
atomic-openshift-clients.x86_64 : Origin Client binaries for Linux
atomic-openshift-utils.noarch : Atomic OpenShift Utilities

  Name and summary matches only, use "search all" for everything.
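To confirm the missing-yum-config theory, listing the repositories visible inside the debug pod (a diagnostic sketch) would show whether the repos that ship atomic-openshift are actually configured:

yum repolist
ls /etc/yum.repos.d/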

Comment 22 Scott Dodson 2018-08-01 14:52:02 UTC
Support for installing metrics and logging via `oc cluster up` has always been experimental and is removed in 3.10 and newer. We won't be able to fix this one.

