1416629 – [IntService_public_324]Deploy logging with ansible, failed to create es pod for invalid INSTANCE_RAM env value


Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	3.5.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	3.5.z
Assignee:	ewolinet
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-01-26 06:12 UTC by Junqi Zhao
Modified:	2017-10-25 13:00 UTC
CC List:	2 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:	undefined
Clone Of:
Environment:
Last Closed:	2017-10-25 13:00:48 UTC
Target Upstream Version:
Embargoed:

Attachments
INSTANCE_RAM env var is invalid (115.62 KB, image/png) 2017-01-26 06:12 UTC, Junqi Zhao
ansible log (1.56 MB, text/plain) 2017-01-26 06:13 UTC, Junqi Zhao

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2017:3049	0	normal	SHIPPED_LIVE	OpenShift Container Platform 3.6, 3.5, and 3.4 bug fix and enhancement update	2017-10-25 15:57:15 UTC

Description Junqi Zhao 2017-01-26 06:12:02 UTC

Created attachment 1244599 [details]
INSTANCE_RAM env var is invalid

Description of problem:
Deploying logging with ansible fails to create the ES pod because the INSTANCE_RAM env value is invalid.

Pod scheduling was enabled on the OCP master to work around bug #1415056:
# oc get node
NAME                            STATUS                     AGE
$master                         Ready                      3h
$node                           Ready,SchedulingDisabled   3h


# oc get po -o wide -n juzhao
NAME                          READY     STATUS             RESTARTS   AGE       IP            NODE
logging-es-127rf9yo-1-m07kk   0/1       CrashLoopBackOff   35         2h        10.130.0.6    $master
logging-fluentd-7hkl1         1/1       Running            0          2h        10.130.0.5    $master
logging-fluentd-lf59n         1/1       Running            0          2h        10.129.0.10   $node
logging-fluentd-w4w4m         1/1       Running            0          2h        10.128.0.9    $node
logging-kibana-1-z5k1z        2/2       Running            0          2h        10.130.0.7    $master

# oc logs logging-es-127rf9yo-1-m07kk
INSTANCE_RAM env var is invalid: 1024Mi

Log in to the OCP Web UI (see the attached INSTANCE_RAM_invalid.png): INSTANCE_RAM is set to 1024Mi.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.5.0.9+e84be2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

How reproducible:
Always

Steps to Reproduce:
1. Prepare the inventory file:

[oo_first_master]
$master-public-dns ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="~/cfile/libra.pem" openshift_public_hostname=$master-public-dns

[oo_first_master:vars]
deployment_type=openshift-enterprise
openshift_release=v3.5.0
openshift_logging_install_logging=true

openshift_logging_kibana_hostname=kibana.$sub-domain
public_master_url=https://$master-public-dns:8443

openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=3.5.0

openshift_logging_namespace=juzhao

2. Run the playbook from a control machine (my laptop) that is not oo_master:
git clone https://github.com/openshift/openshift-ansible
ansible-playbook -vvv -i ~/inventory   playbooks/common/openshift-cluster/openshift_logging.yml

Actual results:
The ES pod fails to start because the INSTANCE_RAM env value is invalid.

Expected results:
Should complete successfully

Additional info:
The full ansible run log is attached.

I have 3 additional questions:
Q1: Why does the jks-cert-gen pod not exist? I can't find any information about it in the attached ansible log.

Q2: Under /etc/origin/logging/, ca.db.attr and ca.db.attr.old are the same. Should ca.db.attr.old be deleted?
# ls -al /etc/origin/logging/
total 140
drwxr-xr-x. 2 root root 4096 Jan 25 21:47 .
drwx------. 7 root root 4096 Jan 25 21:42 ..
-rw-r--r--. 1 root root 1196 Jan 25 21:44 02.pem
-rw-r--r--. 1 root root 1196 Jan 25 21:45 03.pem
-rw-r--r--. 1 root root 1196 Jan 25 21:45 04.pem
-rw-r--r--. 1 root root 1184 Jan 25 21:45 05.pem
-rw-r--r--. 1 root root 1050 Jan 25 21:43 ca.crt
-rw-r--r--. 1 root root    0 Jan 25 21:44 ca.crt.srl
-rw-r--r--. 1 root root  301 Jan 25 21:45 ca.db
-rw-r--r--. 1 root root   20 Jan 25 21:45 ca.db.attr
-rw-r--r--. 1 root root   20 Jan 25 21:45 ca.db.attr.old
-rw-r--r--. 1 root root  233 Jan 25 21:45 ca.db.old
-rw-------. 1 root root 1679 Jan 25 21:43 ca.key
-rw-r--r--. 1 root root    3 Jan 25 21:45 ca.serial.txt
-rw-r--r--. 1 root root    3 Jan 25 21:45 ca.serial.txt.old
-rw-r--r--. 1 root root 3768 Jan 25 21:46 elasticsearch.jks
-rw-r--r--. 1 root root 2242 Jan 25 21:43 kibana-internal.crt
-rw-------. 1 root root 1679 Jan 25 21:43 kibana-internal.key
-rw-r--r--. 1 root root 3979 Jan 25 21:47 logging-es.jks
-rw-r--r--. 1 root root  321 Jan 25 21:43 server-tls.json
-rw-r--r--. 1 root root 4263 Jan 25 21:43 signing.conf
-rw-r--r--. 1 root root 1184 Jan 25 21:45 system.admin.crt
-rw-r--r--. 1 root root  948 Jan 25 21:45 system.admin.csr
-rw-r--r--. 1 root root 3701 Jan 25 21:47 system.admin.jks
-rw-r--r--. 1 root root 1704 Jan 25 21:45 system.admin.key
-rw-r--r--. 1 root root 1196 Jan 25 21:45 system.logging.curator.crt
-rw-r--r--. 1 root root  960 Jan 25 21:45 system.logging.curator.csr
-rw-r--r--. 1 root root 1708 Jan 25 21:45 system.logging.curator.key
-rw-r--r--. 1 root root 1196 Jan 25 21:44 system.logging.fluentd.crt
-rw-r--r--. 1 root root  960 Jan 25 21:44 system.logging.fluentd.csr
-rw-r--r--. 1 root root 1704 Jan 25 21:44 system.logging.fluentd.key
-rw-r--r--. 1 root root 1196 Jan 25 21:45 system.logging.kibana.crt
-rw-r--r--. 1 root root  960 Jan 25 21:45 system.logging.kibana.csr
-rw-r--r--. 1 root root 1704 Jan 25 21:45 system.logging.kibana.key
-rw-r--r--. 1 root root  797 Jan 25 21:47 truststore.jks

Q3: The error message in the ansible log shows that it can't get deploymentconfig/logging-kibana, but deploymentconfig/logging-kibana actually exists.
The same happens for sa, configmap, and rolebinding.

fatal: [ec2-52-202-145-248.compute-1.amazonaws.com]: FAILED! => {
    "attempts": 30, 
    "changed": false, 
    "cmd": [
        "oc", 
        "--config=/tmp/openshift-logging-ansible-QVGhWk/admin.kubeconfig", 
        "get", 
        "deploymentconfig/logging-kibana", 
        "-n", 
        "juzhao", 
        "-o", 
        "jsonpath={.status.replicas}"
    ], 
    "delta": "0:00:01.231301", 
    "end": "2017-01-25 22:11:21.072966", 
    "failed": true, 
    "invocation": {
        "module_args": {
            "_raw_params": "oc --config=/tmp/openshift-logging-ansible-QVGhWk/admin.kubeconfig get deploymentconfig/logging-kibana -n juzhao -o jsonpath='{.status.replicas}'", 
            "_uses_shell": false, 
            "chdir": null, 
            "creates": null, 
            "executable": null, 
            "removes": null, 
            "warn": true
        }, 
        "module_name": "command"
    }, 
    "rc": 0, 
    "start": "2017-01-25 22:11:19.841665", 
    "stderr": "", 
    "stdout": "1", 
    "stdout_lines": [
        "1"
    ], 
    "warnings": []
}
	to retry, use: --limit @/home/fedora/openshift-ansible/playbooks/common/openshift-cluster/openshift_logging.retry

# oc get dc
NAME                  REVISION   DESIRED   CURRENT   TRIGGERED BY
logging-curator       1          0         0         config
logging-es-127rf9yo   1          1         1         config
logging-kibana        1          1         1         config

# oc get configmap
NAME                    DATA      AGE
logging-curator         1         3h
logging-elasticsearch   2         3h
logging-fluentd         3         3h

# oc get sa
NAME                               SECRETS   AGE
aggregated-logging-curator         2         3h
aggregated-logging-elasticsearch   2         3h
aggregated-logging-fluentd         2         2h
aggregated-logging-kibana          2         2h
builder                            2         3h
default                            2         3h
deployer                           2         3h

# oc get rolebinding
NAME                              ROLE                    USERS          GROUPS                          SERVICE ACCOUNTS                   SUBJECTS
logging-elasticsearch-view-role   /view                                                                  aggregated-logging-elasticsearch   
system:deployers                  /system:deployer                                                       deployer                           
system:image-builders             /system:image-builder                                                  builder                            
system:image-pullers              /system:image-puller                   system:serviceaccounts:juzhao                                      
admin                             /admin                  system:admin

Comment 1 Junqi Zhao 2017-01-26 06:13:13 UTC

Created attachment 1244600 [details]
ansible log

Comment 2 Junqi Zhao 2017-01-26 07:21:38 UTC

In https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_logging

openshift_logging_es_memory_limit: The amount of RAM that should be assigned to ES. Defaults to '1024Mi'.

This error may be related to that default.
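
A minimal sketch of how a value such as '1024Mi' could be rejected, assuming (this is not confirmed anywhere in this report) that the ES image's startup check only accepts digits followed by a single M or G suffix. The function names `parse_instance_ram` and `normalize_quantity` are hypothetical and exist only for illustration; they are not taken from the image or from openshift-ansible.

```python
import re

# Hypothetical pattern: digits followed by a single M or G suffix (either case).
_SIMPLE_RAM = re.compile(r"^(\d+)([GgMm])$")
# Kubernetes-style binary suffixes that the simple pattern above does not match.
_BINARY_RAM = re.compile(r"^(\d+)(Mi|Gi)$")


def parse_instance_ram(value: str) -> int:
    """Return the amount in megabytes, or raise ValueError for unsupported input."""
    match = _SIMPLE_RAM.match(value)
    if match is None:
        # Same wording as the message seen in the ES pod log.
        raise ValueError(f"INSTANCE_RAM env var is invalid: {value}")
    amount, unit = int(match.group(1)), match.group(2).upper()
    return amount * 1024 if unit == "G" else amount


def normalize_quantity(value: str) -> str:
    """Rewrite '1024Mi' or '2Gi' as '1024M' or '2G'; leave other values unchanged."""
    match = _BINARY_RAM.match(value)
    return f"{match.group(1)}{match.group(2)[0]}" if match else value
```

Under that assumption, `parse_instance_ram("1024Mi")` raises the error, while `parse_instance_ram(normalize_quantity("1024Mi"))` succeeds. Either the default could be written in the simpler form or the check could accept both forms.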

Comment 3 ewolinet 2017-01-26 19:12:56 UTC

Can you please attach the logs for the ES pod that is failing? It should be able to correctly use the default of '1024Mi'.

Comment 4 ewolinet 2017-01-26 22:29:40 UTC

To answer your above questions:

1) We no longer use a jks generation pod due to issues with it needing to be scheduled on a specific node. A script is now executed on the control host.

2) Possibly. However, we are letting the signing tools handle that; it shouldn't affect whether this works.

3) Is that the output from while it is retrying until it sees that Kibana has successfully started up? I'll check to see if its until statement is incorrect... it looks like everything started up eventually (With the exception of ES)...
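
For illustration only, here is one hypothetical way a retry loop can use up all of its attempts even though the command returns rc 0 and prints "1". The real cause is not stated in this report. The helper `retry_until` and the stand-in `fake_oc_get_replicas` are invented names, and the string-versus-integer comparison is an assumed bug, not the playbook's actual condition.

```python
from typing import Callable


def retry_until(fetch: Callable[[], str], done: Callable[[str], bool], attempts: int) -> bool:
    """Call fetch() up to `attempts` times; return True as soon as done(output) holds."""
    for _ in range(attempts):
        if done(fetch()):
            return True
    return False


def fake_oc_get_replicas() -> str:
    # Stand-in for `oc get deploymentconfig/... -o jsonpath={.status.replicas}`.
    # Command output is always text, so this returns the string "1".
    return "1"


# Assumed buggy condition: compares the text "1" with the integer 1, which is never true.
buggy = retry_until(fake_oc_get_replicas, lambda out: out == 1, attempts=30)
# Condition that converts the output to an integer before comparing.
fixed = retry_until(fake_oc_get_replicas, lambda out: int(out) == 1, attempts=30)
```

With the first condition the loop reports failure after 30 attempts although every call succeeded, which is the same shape as the "attempts": 30, "rc": 0, "stdout": "1" result quoted in the description.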

Comment 6 Xia Zhao 2017-02-03 06:20:08 UTC

Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1418911

Comment 7 Junqi Zhao 2017-02-07 01:49:28 UTC

(In reply to ewolinet from comment #3)
> Can you please attach the logs for the ES pod that is failing? It should be
> able to correctly use the default of '1024Mi'.

Sorry for forgetting to attach the ES pod log when I submitted this defect:
# oc logs logging-es-127rf9yo-1-m07kk
INSTANCE_RAM env var is invalid: 1024Mi

Comment 8 Junqi Zhao 2017-02-07 07:20:40 UTC

Tested with the latest ES 3.5.0 image on the ops registry; it now hits the same error as https://bugzilla.redhat.com/show_bug.cgi?id=1419244

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-85sb4       1/1       Running            2          7m
logging-es-eft6uu2i-1-hqk3r   0/1       CrashLoopBackOff   6          7m
logging-fluentd-mvpb2         1/1       Running            0          8m
logging-fluentd-tprgq         1/1       Running            0          8m
logging-fluentd-vvvrh         1/1       Running            0          8m
logging-kibana-1-bt7tr        2/2       Running            0          7m

# oc logs logging-es-eft6uu2i-1-hqk3r
Comparing the specificed RAM to the maximum recommended for ElasticSearch...
Inspecting the maximum RAM available...
ES_JAVA_OPTS: '-Dmapper.allow_dots_in_name=true -Xms128M -Xmx512m'
/opt/app-root/src/run.sh: line 141: /usr/share/elasticsearch/bin/elasticsearch: No such file or directory

Images tested with:
openshift3/logging-elasticsearch   3.5.0               eed2ca51f2ba        9 hours ago         399.2 MB

# openshift version
openshift v3.5.0.17+c55cf2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

Comment 9 Junqi Zhao 2017-02-07 10:37:17 UTC

The error "INSTANCE_RAM env var is invalid: 1024Mi" no longer occurs, although the same error as in https://bugzilla.redhat.com/show_bug.cgi?id=1419244 now happens.

Setting this defect to VERIFIED and closing it.

Comment 11 errata-xmlrpc 2017-10-25 13:00:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049
