Bug 1676915

Summary: Skydive agent's deployment fails because it uses the same tag as the analyzer
Product: Red Hat OpenStack
Reporter: David Vallee Delisle <dvd>
Component: skydive
Assignee: safchain
Status: CLOSED ERRATA
QA Contact: safchain
Severity: high
Priority: high
Version: 14.0 (Rocky)
Target Release: 14.0 (Rocky)
CC: fbaudin, marjones, mburns, mkaliyam, nplanel, psahoo, rheslop, rsafrono, safchain, sbaubeau
Keywords: Reopened, Triaged, ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: skydive-0.20.4-1.el7ost.x86_64.rpm
Type: Bug
Last Closed: 2019-06-19 12:41:34 UTC

Description David Vallee Delisle 2019-02-13 15:21:33 UTC
Description of problem:
When deploying Skydive, the agent image fails to pull because the templates use the same tag for the agent image as for the analyzer image.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060908.el7ost.noarch

How reproducible:
All the time

Steps to Reproduce:
1. Deploy skydive with the default templates, using the latest tag:
~~~
  DockerSkydiveAgentImage: satellite:5000/lab-osp14_containers-skydive-agent:latest
  DockerSkydiveAnalyzerImage: satellite:5000/lab-osp14_containers-skydive-analyzer:latest
~~~


Actual results:
~~~
fatal: [lab-l-rh-cmp-0]: FAILED! => {"changed": true, "cmd": "docker pull satellite:5000/lab-osp14_containers-skydive-agent:14.0-46", "delta": "0:00:00.132607", "end": "2019-02-07 22:40:26.701776", "msg": "non-zero return code", "rc": 1, "start": "2019-02-07 22:40:26.569169", "stderr": "error parsing HTTP 404 response body: invalid character '<' looking for beginning of value: \"<!DOCTYPE html>\\n<html>\\n<head>\\n  <title>The page you were looking for doesn't exist (404)</title>\\n
~~~

Expected results:
The correct agent image should be downloaded.

Additional info:
[1] Apparently this is due to the skydive template deriving the agent image tag from the analyzer image.
[2] The two repositories do not have the same tags available.

[1]
~~~
/usr/share/openstack-tripleo-heat-templates/extraconfig/services/skydive-agent.yaml
[...]
skydive_docker_image_tag: {{skydive_analyzer_docker_image | regex_replace(".*:")}}
[...]
~~~

[2]
~~~
$ skopeo inspect docker://satellite:5000/lab-osp14_containers-skydive-agent:latest | jq .RepoTags[]
"14.0-47"
"14.0-48"
"14.0"
"latest"
$ skopeo inspect docker://satellite:5000/lab-osp14_containers-skydive-analyzer:latest | jq .RepoTags[]
"14.0-45"
"14.0-46"
"14.0"
"latest"
~~~
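The effect of the Jinja filter in [1] can be reproduced in a few lines of Python (a sketch only, with the image reference and agent tag list taken from the skopeo output in [2]): the greedy `.*:` strips everything up to the *last* colon of the analyzer image reference, and the resulting analyzer tag is then reused for the agent image, where it may not exist.

~~~
import re

# Same substitution as the Jinja regex_replace(".*:") in skydive-agent.yaml.
# Greedy ".*:" consumes up to the LAST colon (past the registry port),
# leaving only the tag.
analyzer_image = "satellite:5000/lab-osp14_containers-skydive-analyzer:14.0-46"
tag = re.sub(r".*:", "", analyzer_image)
print(tag)  # -> 14.0-46

# The agent repository is tagged independently (see the skopeo output above),
# so the analyzer-derived tag is missing there and docker pull gets a 404.
agent_tags = ["14.0-47", "14.0-48", "14.0", "latest"]
print(tag in agent_tags)  # -> False
~~~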

Comment 1 safchain 2019-02-13 18:27:21 UTC
It looks like the main problem is a configuration issue:

~~~
parameter_defaults:
  SkydiveVars:
    globals:
      skydive_listen_ip: 192.168.4.6
~~~

This IP (192.168.4.6) does not seem to be reachable by the agents. There is a check in the Skydive Ansible playbooks that verifies whether the analyzer API is available, and according to the Skydive playbook logs that check is failing.

As a test, I would try not specifying any IP, or using 0.0.0.0.

As for the docker image tag, I do not see why, with 'satellite:5000/lab-osp14_containers-skydive-agent:latest' specified, the log shows 'satellite:5000/lab-osp14_containers-skydive-agent:14.0-46'.

Was the installation re-triggered with another tag specified?

Since the analyzer and the agents appear to have started, docker pull must have succeeded at least once, per the log and the processes reported, so I don't think the main issue is the docker tag.

Comment 3 safchain 2019-02-21 15:28:21 UTC
Adding these parameters could help:

~~~
parameter_defaults:
  SkydiveVars:
    analyzers:
      skydive_analyzer_docker_extra_env: "--net=host"
  ControllerExtraConfig:
    tripleo::firewall::firewall_rules:
      '600 allow skydive etcd':
        dport:
          - 12379
          - 12380
~~~

Comment 4 David Vallee Delisle 2019-03-07 17:59:38 UTC
The customer has retried with the recommended change, but it still fails.

I believe this is because the tenant creation and other Keystone operations are run on all 3 controllers, and since they fail on 2 of the 3 hosts (you can't create the same tenant multiple times), those 2 hosts are skipped for the rest of the play.

[1] the tasks that fails
[2] The logs from the playbook

I believe the Keystone operations shouldn't be executed on all 3 controllers. We should probably use "delegate_to: localhost".

[1]
~~~
- name: Create a Skydive tenant
  environment:
    OS_AUTH_TOKEN: ""
    OS_AUTH_URL: "{{ os_auth_url }}"
    OS_USERNAME: "{{ os_username }}"
    OS_PASSWORD: "{{ os_password }}"
    OS_PROJECT_NAME: "{{ os_tenant_name }}"
    OS_USER_DOMAIN_NAME: "{{ os_user_domain_name }}"
    OS_PROJECT_DOMAIN_NAME: "{{ os_project_domain_name }}"
    OS_IDENTITY_API_VERSION: "{{ os_identity_api_version }}"
  os_project:
    name: "{{ skydive_auth_os_tenant_name }}"
    description: "Skydive admin users"
    domain_id: "{{ skydive_auth_os_domain_id }}"
    enabled: True
    state: present
~~~

[2]
~~~
TASK [skydive_analyzer : Create a Skydive tenant] ******************************
Tuesday 05 March 2019  15:46:59 -0800 (0:00:01.025)       0:03:00.335 ********* 
fatal: [oc-l-rh-ocld-0 -> localhost]: FAILED! => {"changed": false, "extra_data": null, "msg": "ConflictException: 409"}
changed: [oc-l-rh-ocld-1 -> localhost]
fatal: [oc-l-rh-ocld-2 -> localhost]: FAILED! => {"changed": false, "extra_data": null, "msg": "ConflictException: 409"}

TASK [skydive_analyzer : Create a Skydive keystone API user] *******************
Tuesday 05 March 2019  15:47:04 -0800 (0:00:04.998)       0:03:05.334 ********* 
changed: [oc-l-rh-ocld-1 -> localhost]

TASK [skydive_analyzer : Set skydive Keystone API user role] *******************
Tuesday 05 March 2019  15:47:09 -0800 (0:00:05.325)       0:03:10.659 ********* 
changed: [oc-l-rh-ocld-1 -> localhost]

TASK [skydive_analyzer : Create a Skydive keystone service user] ***************
Tuesday 05 March 2019  15:47:15 -0800 (0:00:05.771)       0:03:16.431 ********* 
changed: [oc-l-rh-ocld-1 -> localhost]

TASK [skydive_analyzer : Set skydive Keystone service user role] ***************
Tuesday 05 March 2019  15:47:20 -0800 (0:00:05.056)       0:03:21.487 ********* 
changed: [oc-l-rh-ocld-1 -> localhost]

TASK [skydive_analyzer : Make the docker image available] **********************
Tuesday 05 March 2019  15:47:26 -0800 (0:00:05.241)       0:03:26.729 ********* 

TASK [skydive_common : Install Docker] *****************************************
Tuesday 05 March 2019  15:47:26 -0800 (0:00:00.512)       0:03:27.242 ********* 
ok: [oc-l-rh-ocld-1]

TASK [skydive_common : Enable Docker service] **********************************
Tuesday 05 March 2019  15:47:29 -0800 (0:00:03.390)       0:03:30.632 ********* 
ok: [oc-l-rh-ocld-1]

TASK [skydive_common : Pull skydive image] *************************************
Tuesday 05 March 2019  15:47:30 -0800 (0:00:00.566)       0:03:31.198 ********* 
changed: [oc-l-rh-ocld-1]
~~~
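As a hypothetical sketch (not the actual upstream fix): adding run_once alongside delegate_to would make the tenant creation from [1] execute on a single host per play, so the other two controllers never hit the 409 conflict.

~~~
# Hypothetical variant of the "Create a Skydive tenant" task from [1]
# (environment: block omitted for brevity).
- name: Create a Skydive tenant
  os_project:
    name: "{{ skydive_auth_os_tenant_name }}"
    description: "Skydive admin users"
    domain_id: "{{ skydive_auth_os_domain_id }}"
    enabled: true
    state: present
  delegate_to: localhost
  run_once: true   # execute once per play, not once per controller
~~~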

Comment 5 safchain 2019-03-11 10:26:12 UTC
There is already a "delegate_to" thing here:

https://github.com/skydive-project/skydive/blob/master/contrib/ansible/roles/skydive_analyzer/tasks/main.yml#L30

I'll check one more time...

Comment 6 David Vallee Delisle 2019-03-11 17:41:16 UTC
This is interesting: it's clearly running on all controllers instead of the undercloud, though.

It looks like since Ansible 2.5 [1] we need to use import (rather than include) if we want keyword inheritance. This was reported upstream here [2].

[1] https://docs.ansible.com/ansible/devel/porting_guides/porting_guide_2.5.html#dynamic-includes-and-attribute-inheritance
[2] https://github.com/ansible/ansible/issues/37995
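To illustrate the porting-guide change with a hypothetical role layout: since Ansible 2.5, keywords such as delegate_to set on a dynamic include_tasks apply only to the include statement itself, while a static import_tasks propagates them to every imported task.

~~~
# Hypothetical example, not the actual skydive role.

# Dynamic include: delegate_to applies to the include statement only;
# the tasks inside keystone.yml still run on each targeted controller.
- include_tasks: keystone.yml
  delegate_to: localhost

# Static import: delegate_to is inherited by every task in keystone.yml,
# so they all run on localhost as intended.
- import_tasks: keystone.yml
  delegate_to: localhost
~~~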

Comment 7 safchain 2019-03-13 08:34:11 UTC
Thanks David, indeed something changed. I'm about to submit a fix upstream for that, and we will then backport it.

Comment 9 safchain 2019-04-12 06:44:15 UTC
*** Bug 1677607 has been marked as a duplicate of this bug. ***

Comment 10 safchain 2019-04-12 06:47:14 UTC
*** Bug 1679851 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2019-04-30 17:47:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0944

Comment 15 safchain 2019-06-13 07:41:59 UTC
Hi Mark,

The keystone changes have been backported and should be part of the next release. The firewall rules will be added by default in OSP15.

Thanks,
Sylvain

Comment 16 safchain 2019-06-19 12:41:34 UTC
Addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1722053