Bug 1676915 - Skydive agent's deployment fails because it uses the same tag as the analyzer
Summary: Skydive agent's deployment fails because it uses the same tag as the analyzer
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: skydive
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 14.0 (Rocky)
Assignee: safchain
QA Contact: safchain
URL:
Whiteboard:
Duplicates: 1677607 1679851
Depends On:
Blocks:
Reported: 2019-02-13 15:21 UTC by David Vallee Delisle
Modified: 2023-09-07 19:46 UTC (History)
CC List: 10 users

Fixed In Version: skydive-0.20.4-1.el7ost.x86_64.rpm
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-06-19 12:41:34 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | skydive-project skydive pull 1700 | 0 | None | None | None | 2019-03-13 17:35:05 UTC
OpenStack gerrit | 638606 | 0 | None | None | None | 2019-03-13 17:35:49 UTC
Red Hat Issue Tracker | OSP-28210 | 0 | None | None | None | 2023-09-07 19:46:14 UTC
Red Hat Knowledge Base (Solution) | 3956281 | 0 | Troubleshoot | None | [RHOSP14] SkyDive integration has failed while pulling skydive-analyzer & skydive-agent | 2019-04-12 06:47:13 UTC
Red Hat Product Errata | RHBA-2019:0944 | 0 | None | None | None | 2019-04-30 17:47:49 UTC

Description David Vallee Delisle 2019-02-13 15:21:33 UTC
Description of problem:
When deploying Skydive, the agent image cannot be pulled by docker because the deployment uses the same tag for the agent as for the analyzer.

Version-Release number of selected component (if applicable):
openstack-tripleo-heat-templates-9.0.1-0.20181013060908.el7ost.noarch

How reproducible:
All the time

Steps to Reproduce:
1. Deploy skydive with the default templates, using the latest tag:
~~~
  DockerSkydiveAgentImage: satellite:5000/lab-osp14_containers-skydive-agent:latest
  DockerSkydiveAnalyzerImage: satellite:5000/lab-osp14_containers-skydive-analyzer:latest
~~~


Actual results:
"fatal: [lab-l-rh-cmp-0]: FAILED! => {\"changed\": true, \"cmd\": \"docker pull satellite:5000/lab-osp14_containers-skydive-agent:14.0-46\", \"delta\": \"0:00:00.132607\", \"end\": \"2019-02-07 22:40:26.701776\", \"msg\": \"non-zero return code\", \"rc\": 1, \"start\": \"2019-02-07 22:40:26.569169\", \"stderr\": \"error parsing HTTP 404 response body: invalid character '<' looking for beginning of value: \\\"<!DOCTYPE html>\\\\n<html>\\\\n<head>\\\\n  <title>The page you were looking for doesn't exist (404)</title>\\\\n

Expected results:
It should download the right image.

Additional info:
[1] Apparently due to the skydive template using the same tag for the agent image as for the analyzer image
[2] The two images do not have the same tags available

[1]
~~~
/usr/share/openstack-tripleo-heat-templates/extraconfig/services/skydive-agent.yaml
[...]
skydive_docker_image_tag: {{skydive_analyzer_docker_image | regex_replace(".*:")}}
[...]
~~~

[2]
~~~
$ skopeo inspect docker://satellite:5000/lab-osp14_containers-skydive-agent:latest | jq .RepoTags[]
"14.0-47"
"14.0-48"
"14.0"
"latest"
$ skopeo inspect docker://satellite:5000/lab-osp14_containers-skydive-analyzer:latest | jq .RepoTags[]
"14.0-45"
"14.0-46"
"14.0"
"latest"
~~~
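
The derivation in [1] can be reproduced in isolation. The playbook below is a minimal sketch, not part of the templates: it assumes the analyzer image had resolved to the 14.0-46 tag at deploy time (an assumption inferred from the failed pull above) and applies the same regex_replace filter to it.

~~~
# Sketch only: reproduces the tag derivation from skydive-agent.yaml.
# The analyzer image value is an assumption inferred from the failed pull.
- hosts: localhost
  gather_facts: false
  vars:
    skydive_analyzer_docker_image: "satellite:5000/lab-osp14_containers-skydive-analyzer:14.0-46"
  tasks:
    - name: Show the tag the agent would be pulled with
      debug:
        # The greedy ".*:" strips everything up to the last colon,
        # including the registry port, and leaves only the tag.
        msg: "{{ skydive_analyzer_docker_image | regex_replace('.*:') }}"
~~~

The resulting tag, 14.0-46, exists for the analyzer in [2] but not for the agent, which matches the 404 returned by docker pull.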

Comment 1 safchain 2019-02-13 18:27:21 UTC
It looks like the main problem is a configuration issue:

~~~
parameter_defaults:
  SkydiveVars:
    globals:
      skydive_listen_ip: 192.168.4.6
~~~

This IP (192.168.4.6) does not seem to be reachable by the agents. The Skydive ansible playbooks check whether the analyzer/API is available, which does not seem to be the case according to the Skydive playbook logs.

I would try to not specify any IP or to use 0.0.0.0 to test.
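
For that test, the override could look like the following sketch. It reuses the structure of the snippet above and only changes the value; whether 0.0.0.0 is appropriate beyond a test is not established here.

~~~
parameter_defaults:
  SkydiveVars:
    globals:
      # Listen on all interfaces for the test instead of 192.168.4.6
      skydive_listen_ip: 0.0.0.0
~~~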

As for the docker image tag, I do not see why, when specifying 'satellite:5000/lab-osp14_containers-skydive-agent:latest', we get 'satellite:5000/lab-osp14_containers-skydive-agent:14.0-46' in the log.

Has the installation been re-triggered with another tag specified?

Since the analyzer and the agents seem to have been started, and docker pull therefore succeeded at least once according to the log and the processes reported, I don't think the main issue is due to the docker tag.

Comment 3 safchain 2019-02-21 15:28:21 UTC
Adding these parameters could help

~~~
parameter_defaults:
  SkydiveVars:
    analyzers:
      skydive_analyzer_docker_extra_env: "--net=host"
  ControllerExtraConfig:
    tripleo::firewall::firewall_rules:
      '600 allow skydive etcd':
        dport:
          - 12379
          - 12380

~~~

Comment 4 David Vallee Delisle 2019-03-07 17:59:38 UTC
The customer has retried with the recommended change, but it still fails.

I believe this is because the tenant creation and the other keystone operations are run on all 3 controllers. Since the tenant creation fails on 2 out of 3 hosts (you can't create a tenant multiple times), those 2 hosts are ignored for the rest of the play.

[1] The task that fails
[2] The logs from the playbook

I believe the keystone operations shouldn't be executed on all 3 controllers. We should probably use "delegate_to: localhost".

[1]
~~~
- name: Create a Skydive tenant
  environment:
    OS_AUTH_TOKEN: ""
    OS_AUTH_URL: "{{ os_auth_url }}"
    OS_USERNAME: "{{ os_username }}"
    OS_PASSWORD: "{{ os_password }}"
    OS_PROJECT_NAME: "{{ os_tenant_name }}"
    OS_USER_DOMAIN_NAME: "{{ os_user_domain_name }}"
    OS_PROJECT_DOMAIN_NAME: "{{ os_project_domain_name }}"
    OS_IDENTITY_API_VERSION: "{{ os_identity_api_version }}"
  os_project:
    name: "{{ skydive_auth_os_tenant_name }}"
    description: "Skydive admin users"
    domain_id: "{{ skydive_auth_os_domain_id }}"
    enabled: True
    state: present
~~~

[2]
~~~
TASK [skydive_analyzer : Create a Skydive tenant] ******************************
Tuesday 05 March 2019  15:46:59 -0800 (0:00:01.025)       0:03:00.335 ********* 
fatal: [oc-l-rh-ocld-0 -> localhost]: FAILED! => {"changed": false, "extra_data": null, "msg": "ConflictException: 409"}
changed: [oc-l-rh-ocld-1 -> localhost]
fatal: [oc-l-rh-ocld-2 -> localhost]: FAILED! => {"changed": false, "extra_data": null, "msg": "ConflictException: 409"}

TASK [skydive_analyzer : Create a Skydive keystone API user] *******************
Tuesday 05 March 2019  15:47:04 -0800 (0:00:04.998)       0:03:05.334 ********* 
changed: [oc-l-rh-ocld-1 -> localhost]

TASK [skydive_analyzer : Set skydive Keystone API user role] *******************
Tuesday 05 March 2019  15:47:09 -0800 (0:00:05.325)       0:03:10.659 ********* 
changed: [oc-l-rh-ocld-1 -> localhost]

TASK [skydive_analyzer : Create a Skydive keystone service user] ***************
Tuesday 05 March 2019  15:47:15 -0800 (0:00:05.771)       0:03:16.431 ********* 
changed: [oc-l-rh-ocld-1 -> localhost]

TASK [skydive_analyzer : Set skydive Keystone service user role] ***************
Tuesday 05 March 2019  15:47:20 -0800 (0:00:05.056)       0:03:21.487 ********* 
changed: [oc-l-rh-ocld-1 -> localhost]

TASK [skydive_analyzer : Make the docker image available] **********************
Tuesday 05 March 2019  15:47:26 -0800 (0:00:05.241)       0:03:26.729 ********* 

TASK [skydive_common : Install Docker] *****************************************
Tuesday 05 March 2019  15:47:26 -0800 (0:00:00.512)       0:03:27.242 ********* 
ok: [oc-l-rh-ocld-1]

TASK [skydive_common : Enable Docker service] **********************************
Tuesday 05 March 2019  15:47:29 -0800 (0:00:03.390)       0:03:30.632 ********* 
ok: [oc-l-rh-ocld-1]

TASK [skydive_common : Pull skydive image] *************************************
Tuesday 05 March 2019  15:47:30 -0800 (0:00:00.566)       0:03:31.198 ********* 
changed: [oc-l-rh-ocld-1]
~~~
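
One way to keep a keystone operation from running once per controller is run_once. The task below is a sketch of that idea applied to task [1], not the change that was merged; the environment block from [1] is assumed unchanged and is left out for brevity.

~~~
- name: Create a Skydive tenant
  # Sketch: the OS_* environment block from [1] would stay as it is.
  os_project:
    name: "{{ skydive_auth_os_tenant_name }}"
    description: "Skydive admin users"
    domain_id: "{{ skydive_auth_os_domain_id }}"
    enabled: True
    state: present
  delegate_to: localhost
  # Execute for the first host in the batch only, so the tenant is created once.
  run_once: true
~~~

With run_once, the result of the single execution is applied to all hosts in the batch, so the other controllers should no longer be dropped from the play because of a 409.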

Comment 5 safchain 2019-03-11 10:26:12 UTC
There is already a "delegate_to" in place here:

https://github.com/skydive-project/skydive/blob/master/contrib/ansible/roles/skydive_analyzer/tasks/main.yml#L30

I'll check one more time...

Comment 6 David Vallee Delisle 2019-03-11 17:41:16 UTC
This is interesting. It's clearly running on all controllers instead of the undercloud, though.

It looks like since Ansible 2.5 [1] we need to use an import instead of an include if we want attribute inheritance. This was reported upstream here [2].

[1] https://docs.ansible.com/ansible/devel/porting_guides/porting_guide_2.5.html#dynamic-includes-and-attribute-inheritance
[2] https://github.com/ansible/ansible/issues/37995
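
The difference described in [1] can be sketched as follows. The file name keystone.yml is hypothetical, and this illustrates the porting guide's rule rather than the actual layout of the Skydive role.

~~~
# Dynamic include: since Ansible 2.5, keywords set on the include
# statement are not inherited by the tasks inside keystone.yml.
#
# - include_tasks: keystone.yml
#   run_once: true

# Static import: keywords set here are inherited by every imported task.
- import_tasks: keystone.yml   # hypothetical file name
  delegate_to: localhost
  run_once: true
~~~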

Comment 7 safchain 2019-03-13 08:34:11 UTC
Thanks David, indeed something changed. I'm about to submit a fix upstream for that, and we will then backport it.

Comment 9 safchain 2019-04-12 06:44:15 UTC
*** Bug 1677607 has been marked as a duplicate of this bug. ***

Comment 10 safchain 2019-04-12 06:47:14 UTC
*** Bug 1679851 has been marked as a duplicate of this bug. ***

Comment 12 errata-xmlrpc 2019-04-30 17:47:41 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0944

Comment 15 safchain 2019-06-13 07:41:59 UTC
Hi Mark,

The keystone changes have been backported and should be part of the next release. The firewall rules will be added by default in OSP15.

Thanks,
Sylvain

Comment 16 safchain 2019-06-19 12:41:34 UTC
Addressed in https://bugzilla.redhat.com/show_bug.cgi?id=1722053

