Bug 1468731 - Logging deployment wound up with curator and ES DCs with two replicas
Summary: Logging deployment wound up with curator and ES DCs with two replicas
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 3.6.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: ewolinet
QA Contact: Johnny Liu
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2017-07-07 18:48 UTC by Luke Meyer
Modified: 2017-08-16 19:51 UTC
CC List: 5 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: The Elasticsearch, Kibana, and Curator roles set the replica count in their DC specs from a variable name that is also used elsewhere in the Ansible playbooks, so when that shared variable was set to '2' elsewhere, it overrode the value for the logging components.
Consequence: Whenever that variable held a value other than 1, the logging components were also deployed with multiple replicas.
Fix: The logging variables were renamed to be unique to the logging roles, preventing the variable scope creep.
Result: The logging components are no longer affected by the unrelated variable and deploy with the requested number of replicas.
Clone Of:
Environment:
Last Closed: 2017-08-10 05:29:50 UTC
Target Upstream Version:
Embargoed:


Attachments
non-mangled output from oc (3.75 KB, text/plain)
2017-07-07 18:48 UTC, Luke Meyer
hosts inventory file (2.86 KB, text/plain)
2017-07-26 09:03 UTC, Erez Freiberger
deployment configurations after ansible run (25.54 KB, text/plain)
2017-07-26 09:04 UTC, Erez Freiberger


Links
System ID: Red Hat Product Errata RHEA-2017:1716 | Private: no | Priority: normal | Status: SHIPPED_LIVE | Summary: Red Hat OpenShift Container Platform 3.6 RPM Release Advisory | Last Updated: 2017-08-10 09:02:50 UTC

Description Luke Meyer 2017-07-07 18:48:55 UTC
Created attachment 1295379 [details]
non-mangled output from oc

Description of problem:
I ran a two-node install that included logging and metrics. It broke several times apparently due to ssh connectivity issues, so it's entirely likely that logging deployment ran more than once. In the end I wound up with DCs for ES and Curator that were running two pods each.


Version-Release number of the following components:
I ran the installation with the latest prod ose-ansible build v3.6.137-1

rpm -q openshift-ansible: openshift-ansible-3.6.137-1.git.0.12654fb.el7.noarch
rpm -q ansible: ansible-2.2.3.0-1.el7.noarch
FWIW I installed against CentOS 7, containerized with the Origin 3.6.0-alpha.1 images specified.


How reproducible:
Don't know; it happened once after multiple failed runs.


Steps to Reproduce:
1. Create inventory with master and node hosts and logging deployment
2. Run the byo/config.yml playbook a few times?


Expected results:
A normal install with one pod per ES and Curator DC


Actual results:

[root@ip-172-18-3-19 ~]# oc -n logging get pods -o wide
NAME                                      READY     STATUS        RESTARTS   AGE       IP            NODE
logging-curator-1-njhk7                   1/1       Running       49         4h        10.129.0.19   ip-172-18-8-74.ec2.internal
logging-curator-1-sm5lp                   1/1       Running       50         4h        10.128.0.14   ip-172-18-3-19.ec2.internal
logging-es-data-master-o23apde3-1-dw2pd   1/1       Running       1          4h        10.128.0.13   ip-172-18-3-19.ec2.internal
logging-es-data-master-o23apde3-1-sq61w   1/1       Running       1          4h        10.129.0.15   ip-172-18-8-74.ec2.internal
logging-fluentd-rczmh                     0/1       OutOfmemory   0          4h        <none>        ip-172-18-3-19.ec2.internal
logging-fluentd-wtj7b                     0/1       OutOfmemory   0          4h        <none>        ip-172-18-8-74.ec2.internal
logging-kibana-1-deploy                   0/1       Error         0          4h        10.128.0.7    ip-172-18-3-19.ec2.internal
[root@ip-172-18-3-19 ~]# oc -n logging get dc
NAME                              REVISION   DESIRED   CURRENT   TRIGGERED BY
logging-curator                   1          2         2         config
logging-es-data-master-o23apde3   1          2         2         config
logging-kibana                    1          2         0         config
[root@ip-172-18-3-19 ~]# oc -n logging describe dc/logging-es-data-master-o23apde3
Name:		logging-es-data-master-o23apde3
Namespace:	logging
Created:	4 hours ago
Labels:		component=es
		deployment=logging-es-data-master-o23apde3
		logging-infra=elasticsearch
		provider=openshift
Annotations:	<none>
Latest Version:	1
Selector:	component=es,deployment=logging-es-data-master-o23apde3,logging-infra=elasticsearch,provider=openshift
Replicas:	2
Triggers:	Config
Strategy:	Recreate
Template:
  Labels:		component=es
			deployment=logging-es-data-master-o23apde3
			logging-infra=elasticsearch
			provider=openshift
  Service Account:	aggregated-logging-elasticsearch
  Containers:
   elasticsearch:
    Image:	docker.io/openshift/origin-logging-elasticsearch:v3.6.0-alpha.1
    Ports:	9200/TCP, 9300/TCP
    Limits:
      cpu:	1
      memory:	8Gi
    Requests:
      memory:	512Mi
    Volume Mounts:
      /elasticsearch/persistent from elasticsearch-storage (rw)
      /etc/elasticsearch/secret from elasticsearch (ro)
      /usr/share/java/elasticsearch/config from elasticsearch-config (ro)
    Environment Variables:
      DC_NAME:			logging-es-data-master-o23apde3
      NAMESPACE:		 (v1:metadata.namespace)
      KUBERNETES_TRUST_CERT:	true
      SERVICE_DNS:		logging-es-cluster
      CLUSTER_NAME:		logging-es
      INSTANCE_RAM:		8Gi
      HEAP_DUMP_LOCATION:	/elasticsearch/persistent/heapdump.hprof
      NODE_QUORUM:		1
      RECOVER_EXPECTED_NODES:	1
      RECOVER_AFTER_TIME:	5m
      IS_MASTER:		true
      HAS_DATA:			true
  Volumes:
   elasticsearch:
    Type:	Secret (a volume populated by a Secret)
    SecretName:	logging-elasticsearch
   elasticsearch-config:
    Type:	ConfigMap (a volume populated by a ConfigMap)
    Name:	logging-elasticsearch
   elasticsearch-storage:
    Type:	EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:	

Deployment #1 (latest):
	Name:		logging-es-data-master-o23apde3-1
	Created:	4 hours ago
	Status:		Complete
	Replicas:	2 current / 2 desired
	Selector:	component=es,deployment=logging-es-data-master-o23apde3-1,deploymentconfig=logging-es-data-master-o23apde3,logging-infra=elasticsearch,provider=openshift
	Labels:		component=es,deployment=logging-es-data-master-o23apde3,logging-infra=elasticsearch,openshift.io/deployment-config.name=logging-es-data-master-o23apde3,provider=openshift
	Pods Status:	2 Running / 0 Waiting / 0 Succeeded / 0 Failed

No events.




Additional info:
Attaching the output above and the logs from ansible-playbook run with the -vvv flag. I did not capture output from the first two broken runs, so it's possible something untoward happened there.

Comment 6 Rich Megginson 2017-07-07 19:44:55 UTC
I don't see anywhere in the first or second run that it reached the logging roles.

In the third run, when ansible goes to get the logging facts, it gets this:

    "ansible_facts": {
        "openshift_logging_facts": {
            "agl_namespace": "logging", 
            "curator": {
...
                        "replicas": 2,

So somehow, the curator dc, and I guess the other dcs, were scaled to 2.  I have no idea how that could have happened.  Can we look at the master logs?

Comment 9 Rich Megginson 2017-07-07 20:41:52 UTC
first occurrence of curator in the master log:

-- Logs begin at Fri 2017-07-07 09:04:59 EDT, end at Fri 2017-07-07 15:52:55 EDT. --
Jul 07 11:10:50 ip-172-18-3-19.ec2.internal docker[22571]: E0707 15:10:49.732016       1 errors.go:63] apiserver received an error that is not an unversioned.Status: http: Handler timeout
...
Jul 07 11:10:50 ip-172-18-3-19.ec2.internal docker[22571]: I0707 15:10:49.744326       1 replication_controller.go:322] Observed updated replication controller logging-curator-1. Desired pod count change: 2->2

That is, the replica count is already 2 - something must have changed it before then.  I suppose it is possible that the master logs have already been truncated - it seems there should be something before 11:10:50.

These times are only approximations since the ansible logs do not have a timestamp prefix for each log.

The first ansible run started somewhere around 
    "start": "2017-07-07 10:11:58.266812",
and ended around
    "end": "2017-07-07 11:01:17.985972", 

The second run
    "start": "2017-07-07 11:22:51.788250", 

So somehow something increased the replica count from 1 to 2 between the first and second ansible runs.

I'm not sure where to look - I was hoping there would be a clue in the master log, since that's where replication controller changes like this are logged, but it would appear that information was either lost or never logged . . .

Where else to look?
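
For reference, one way to narrow this down, assuming the master journal still covers the window between the two runs (on a containerized install the entries may come from the docker unit rather than a dedicated master unit):

journalctl --since "2017-07-07 11:01" --until "2017-07-07 11:23" \
  | grep -E 'logging-(curator|es).*Desired pod count'

That should surface any replication controller scale changes for the logging DCs in the gap between the first run finishing and the second run starting.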

Comment 10 Luke Meyer 2017-07-10 13:10:11 UTC
I don't know where else would be helpful to look. Probably the missing ansible run logs. Probably not much point in keeping this open until someone can reproduce with full logs, so will close for now.

Comment 11 Luke Meyer 2017-07-11 15:30:51 UTC
Saw this again last night, will attach logs.

Comment 17 Rich Megginson 2017-07-11 23:04:33 UTC
I have no idea what's going on.  The only clue is that the router code sets a global variable (fact) called "replicas", and in this case it is 2.  But unless there is an ansible bug, or I misunderstand how variable scoping works, this code should not be affected:

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_kibana/tasks/main.yaml#L236

# create Kibana DC
- name: Generate Kibana DC template
  template:
    src: kibana.j2
    dest: "{{ tempdir }}/templates/kibana-dc.yaml"
  vars:
    component: "{{ kibana_component }}"
...
    replicas: "{{ openshift_logging_kibana_replicas | default (1) }}"
    kibana_node_selector: "{{ openshift_logging_kibana_nodeselector | default({}) }}"

Nothing sets openshift_logging_kibana_replicas so it should default to 1.
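
For illustration only (this is not the actual playbook code), the pitfall with a generic name is that a fact registered earlier in the play survives for the rest of the run on that host, and any later reference that is not shadowed by task-level vars picks it up:

# earlier in the play, e.g. while deploying two router instances
- set_fact:
    replicas: 2

# later, in a logging role: a template or task that references
# "{{ replicas | default(1) }}" without defining its own task-level
# "replicas" will render 2, because default(1) only applies when the
# variable is undefined, and the fact above already defines it.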

Comment 18 Luke Meyer 2017-07-12 13:22:37 UTC
Happened again this morning with just a single run. I guess it would be useful to have ansible keep all the temp files to look at later. The DCs seem to be created initially with replicas:2 in the spec. But if no one else is experiencing this then it's likely something specific I'm doing, e.g. origin containerized install with alpha.1 images. I'll reopen if I can recreate on a more official platform.

Comment 19 Erez Freiberger 2017-07-26 09:03:57 UTC
Created attachment 1304651 [details]
hosts inventory file

Comment 20 Erez Freiberger 2017-07-26 09:04:49 UTC
Created attachment 1304652 [details]
deployment configurations after ansible run

Comment 21 Erez Freiberger 2017-07-26 09:06:52 UTC
I have added the Ansible hosts file and the deployment configurations that resulted from it (oc get dc -n logging -o yaml).
I used this ansible hosts file to reproduce it multiple times.

Comment 22 Luke Meyer 2017-07-27 18:29:18 UTC
It looks like this is due to Ansible's lack of variable scoping. It seems this will happen under exactly these conditions:

* Deploying with byo/config.yml
* With logging enabled
* With two infra nodes so that two router instances are deployed

The last point sets the "replicas" fact to 2, and this somehow bleeds into the logging deployment templates (even though the scopes don't look like they should overlap at all).

You wouldn't see this when deploying just logging by itself (not the router) or with an all-in-one node. To reproduce you could probably just run openshift-logging.yml with -e replicas=2 though.
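
Something along these lines should reproduce it (the inventory path and playbook location are assumptions; adjust for your checkout):

ansible-playbook -i hosts \
  playbooks/byo/openshift-cluster/openshift-logging.yml \
  -e replicas=2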

Eric is addressing with https://github.com/openshift/openshift-ansible/pull/4911
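
The gist of that fix (illustrative names only; see the PR for the actual changes) is to stop reusing the generic "replicas" name inside the logging roles and pass a role-prefixed variable through instead, e.g.:

# before: generic name that another role may also set as a fact
replicas: "{{ openshift_logging_kibana_replicas | default(1) }}"

# after: a name unique to the role, so nothing outside the logging
# roles can clobber it
kibana_replicas: "{{ openshift_logging_kibana_replicas | default(1) }}"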

Comment 26 errata-xmlrpc 2017-08-10 05:29:50 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716

