Created attachment 1295379 [details]
non-mangled output from oc

Description of problem:

I ran a two-node install that included logging and metrics. It broke several times, apparently due to ssh connectivity issues, so it's entirely likely that the logging deployment ran more than once. In the end I wound up with DCs for ES and Curator that were running two pods each.

Version-Release number of the following components:

I ran the installation with the latest prod ose-ansible build, v3.6.137-1.

rpm -q openshift-ansible: openshift-ansible-3.6.137-1.git.0.12654fb.el7.noarch
rpm -q ansible: ansible-2.2.3.0-1.el7.noarch

FWIW I installed against CentOS 7, containerized, with the Origin 3.6.0-alpha.1 images specified.

How reproducible:

Don't know; it happened once after multiple failed runs.

Steps to Reproduce:
1. Create an inventory with master and node hosts and a logging deployment
2. Run the byo/config.yml playbook a few times?

Expected results:

A normal install with one pod per ES and Curator DC

Actual results:

[root@ip-172-18-3-19 ~]# oc -n logging get pods -o wide
NAME                                      READY     STATUS        RESTARTS   AGE       IP            NODE
logging-curator-1-njhk7                   1/1       Running       49         4h        10.129.0.19   ip-172-18-8-74.ec2.internal
logging-curator-1-sm5lp                   1/1       Running       50         4h        10.128.0.14   ip-172-18-3-19.ec2.internal
logging-es-data-master-o23apde3-1-dw2pd   1/1       Running       1          4h        10.128.0.13   ip-172-18-3-19.ec2.internal
logging-es-data-master-o23apde3-1-sq61w   1/1       Running       1          4h        10.129.0.15   ip-172-18-8-74.ec2.internal
logging-fluentd-rczmh                     0/1       OutOfmemory   0          4h        <none>        ip-172-18-3-19.ec2.internal
logging-fluentd-wtj7b                     0/1       OutOfmemory   0          4h        <none>        ip-172-18-8-74.ec2.internal
logging-kibana-1-deploy                   0/1       Error         0          4h        10.128.0.7    ip-172-18-3-19.ec2.internal

[root@ip-172-18-3-19 ~]# oc -n logging get dc
NAME                              REVISION   DESIRED   CURRENT   TRIGGERED BY
logging-curator                   1          2         2         config
logging-es-data-master-o23apde3   1          2         2         config
logging-kibana                    1          2         0         config

[root@ip-172-18-3-19 ~]# oc -n logging describe dc/logging-es-data-master-o23apde3
Name:           logging-es-data-master-o23apde3
Namespace:      logging
Created:        4 hours ago
Labels:         component=es
                deployment=logging-es-data-master-o23apde3
                logging-infra=elasticsearch
                provider=openshift
Annotations:    <none>
Latest Version: 1
Selector:       component=es,deployment=logging-es-data-master-o23apde3,logging-infra=elasticsearch,provider=openshift
Replicas:       2
Triggers:       Config
Strategy:       Recreate
Template:
  Labels:       component=es
                deployment=logging-es-data-master-o23apde3
                logging-infra=elasticsearch
                provider=openshift
  Service Account:      aggregated-logging-elasticsearch
  Containers:
   elasticsearch:
    Image:      docker.io/openshift/origin-logging-elasticsearch:v3.6.0-alpha.1
    Ports:      9200/TCP, 9300/TCP
    Limits:
      cpu:      1
      memory:   8Gi
    Requests:
      memory:   512Mi
    Volume Mounts:
      /elasticsearch/persistent from elasticsearch-storage (rw)
      /etc/elasticsearch/secret from elasticsearch (ro)
      /usr/share/java/elasticsearch/config from elasticsearch-config (ro)
    Environment Variables:
      DC_NAME:                  logging-es-data-master-o23apde3
      NAMESPACE:                (v1:metadata.namespace)
      KUBERNETES_TRUST_CERT:    true
      SERVICE_DNS:              logging-es-cluster
      CLUSTER_NAME:             logging-es
      INSTANCE_RAM:             8Gi
      HEAP_DUMP_LOCATION:       /elasticsearch/persistent/heapdump.hprof
      NODE_QUORUM:              1
      RECOVER_EXPECTED_NODES:   1
      RECOVER_AFTER_TIME:       5m
      IS_MASTER:                true
      HAS_DATA:                 true
  Volumes:
   elasticsearch:
    Type:       Secret (a volume populated by a Secret)
    SecretName: logging-elasticsearch
   elasticsearch-config:
    Type:       ConfigMap (a volume populated by a ConfigMap)
    Name:       logging-elasticsearch
   elasticsearch-storage:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:

Deployment #1 (latest):
        Name:           logging-es-data-master-o23apde3-1
        Created:        4 hours ago
        Status:         Complete
        Replicas:       2 current / 2 desired
        Selector:       component=es,deployment=logging-es-data-master-o23apde3-1,deploymentconfig=logging-es-data-master-o23apde3,logging-infra=elasticsearch,provider=openshift
        Labels:         component=es,deployment=logging-es-data-master-o23apde3,logging-infra=elasticsearch,openshift.io/deployment-config.name=logging-es-data-master-o23apde3,provider=openshift
        Pods Status:    2 Running / 0 Waiting / 0 Succeeded / 0 Failed

No events.

Additional info:

Attaching the output above and logs from ansible-playbook with the -vvv flag. I did not capture output from the first two broken runs, so it's possible something untoward happened there.
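(As an aside: scaling the doubled DCs back to one replica by hand should be possible with plain oc - a sketch, untested here, and no idea whether it would survive another ansible run:

    # manual cleanup sketch: put the affected DCs back to one replica each
    oc -n logging scale dc/logging-curator --replicas=1
    oc -n logging scale dc/logging-es-data-master-o23apde3 --replicas=1
)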
I don't see anywhere in the first or second run that it reached the logging roles. In the third run, when ansible goes to get the logging facts, it gets this:

    "ansible_facts": {
        "openshift_logging_facts": {
            "agl_namespace": "logging",
            "curator": {
    ...
                "replicas": 2,

So somehow the curator DC, and I guess the other DCs, were scaled to 2. I have no idea how that could have happened. Can we look at the master logs?
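(For anyone wanting to confirm the live value directly on the cluster, a quick check like this should work - a sketch using oc's standard jsonpath output, not taken from this report:

    # read the desired replica count straight off the curator DC
    oc -n logging get dc/logging-curator -o jsonpath='{.spec.replicas}'
)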
First occurrence of curator in the master log:

-- Logs begin at Fri 2017-07-07 09:04:59 EDT, end at Fri 2017-07-07 15:52:55 EDT. --
Jul 07 11:10:50 ip-172-18-3-19.ec2.internal docker[22571]: E0707 15:10:49.732016       1 errors.go:63] apiserver received an error that is not an unversioned.Status: http: Handler timeout
...
Jul 07 11:10:50 ip-172-18-3-19.ec2.internal docker[22571]: I0707 15:10:49.744326       1 replication_controller.go:322] Observed updated replication controller logging-curator-1. Desired pod count change: 2->2

That is, the replica count is already 2 - something must have changed it before then. I suppose it is possible that the master logs have already been truncated - it seems there should be something before 11:10:50.

These times are only approximate, since the ansible logs do not carry a timestamp prefix on each line. The first ansible run started somewhere around

    "start": "2017-07-07 10:11:58.266812",

and ended around

    "end": "2017-07-07 11:01:17.985972",

The second run:

    "start": "2017-07-07 11:22:51.788250",

So somehow something increased the replica count from 1 to 2 between the first and second ansible runs. I'm not sure where to look - I was hoping there would be a clue in the master log, since that's where replication controller changes like this are logged, but it would appear that information was either lost or never logged . . . Where else to look?
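(If the journal still has that window, something like this would narrow it down - assuming the containerized master logs to a systemd unit named origin-master, which is an assumption on my part:

    # hypothetical: pull the master log for the gap between the first and second runs
    journalctl -u origin-master --since "2017-07-07 10:11:58" --until "2017-07-07 11:22:52"
)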
I don't know where else it would be helpful to look - probably the missing ansible run logs. There's not much point in keeping this open until someone can reproduce it with full logs, so I'll close it for now.
Saw this again last night, will attach logs.
I have no idea what's going on. The only clue is that the router code sets a global variable (fact) called "replicas", and in this case it is 2. But unless there is an ansible bug, or I misunderstand how variable scoping works, this code should not be affected:

https://github.com/openshift/openshift-ansible/blob/master/roles/openshift_logging_kibana/tasks/main.yaml#L236

    # create Kibana DC
    - name: Generate Kibana DC template
      template:
        src: kibana.j2
        dest: "{{ tempdir }}/templates/kibana-dc.yaml"
      vars:
        component: "{{ kibana_component }}"
        ...
        replicas: "{{ openshift_logging_kibana_replicas | default (1) }}"
        kibana_node_selector: "{{ openshift_logging_kibana_nodeselector | default({}) }}"

Nothing sets openshift_logging_kibana_replicas, so it should default to 1.
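(To illustrate what I mean by a global fact - a minimal standalone sketch, not taken from the playbooks, showing that a fact set anywhere in a play persists for every later task on that host:

    # demo.yml - hypothetical reproducer of the suspected fact leakage
    - hosts: localhost
      gather_facts: no
      tasks:
        - name: simulate the router code setting a fact named "replicas"
          set_fact:
            replicas: 2

        - name: a later, unrelated task still sees it
          debug:
            msg: "replicas = {{ replicas | default(1) }}"   # prints 2, not the default 1
)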
Happened again this morning with just a single run. I guess it would be useful to have ansible keep all the temp files to look at later. The DCs seem to be created initially with replicas: 2 in the spec. But if no one else is experiencing this, then it's likely something specific to what I'm doing, e.g. an origin containerized install with the alpha.1 images. I'll reopen if I can recreate this on a more official platform.
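(For the temp files, ansible has a standard switch for keeping its remote module files around - though whether that captures the generated template files we'd actually want to inspect is an open question:

    # keep ansible's remote temp files instead of deleting them after each task
    ANSIBLE_KEEP_REMOTE_FILES=1 ansible-playbook -i hosts playbooks/byo/config.yml -vvv
)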
Created attachment 1304651 [details] hosts inventory file
Created attachment 1304652 [details] deployment configurations after ansible run
I have attached the ansible hosts file and the deployment configurations that result from it (oc get dc -n logging -o yaml). I used this ansible hosts file to reproduce the problem multiple times.
It looks like this is due to the lack of ansible variable scoping. It seems this will happen under exactly these conditions:

* Deploying with byo/config.yml
* With logging enabled
* With two infra nodes, so that two router instances are deployed

The last point sets the "replicas" fact to 2, and this somehow ends up being used in the logging deployment templates (even though the scopes don't seem like they should conflate at all). You wouldn't see this when deploying just logging by itself (not the router) or on an all-in-one node. To reproduce, you could probably just run openshift-logging.yml with -e replicas=2, though (see the sketch below).

Eric is addressing this with https://github.com/openshift/openshift-ansible/pull/4911
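(A minimal attempt along those lines - the logging playbook path is assumed, and -e sets an extra var, which outranks everything else in ansible's precedence:

    # hypothetical reproducer: deploy only logging while forcing a global "replicas" value
    ansible-playbook -i hosts playbooks/byo/openshift-cluster/openshift-logging.yml -e replicas=2
)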
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:1716