Created attachment 1244599 [details]
INSTANCE_RAM env var is invalid

Description of problem:
Deployed logging with ansible; creating the ES pod failed due to an invalid INSTANCE_RAM env value.

Enabled pod scheduling on the OCP master to work around bug #1415056:

# oc get node
NAME      STATUS                     AGE
$master   Ready                      3h
$node     Ready,SchedulingDisabled   3h

# oc get po -o wide -n juzhao
NAME                          READY     STATUS             RESTARTS   AGE       IP            NODE
logging-es-127rf9yo-1-m07kk   0/1       CrashLoopBackOff   35         2h        10.130.0.6    $master
logging-fluentd-7hkl1         1/1       Running            0          2h        10.130.0.5    $master
logging-fluentd-lf59n         1/1       Running            0          2h        10.129.0.10   $node
logging-fluentd-w4w4m         1/1       Running            0          2h        10.128.0.9    $node
logging-kibana-1-z5k1z        2/2       Running            0          2h        10.130.0.7    $master

# oc logs logging-es-127rf9yo-1-m07kk
INSTANCE_RAM env var is invalid: 1024Mi

Logged in to the OCP Web UI; see the attached INSTANCE_RAM_invalid.png: INSTANCE_RAM is 1024Mi.

Version-Release number of selected component (if applicable):
# openshift version
openshift v3.5.0.9+e84be2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0

How reproducible:
Always

Steps to Reproduce:
1. Prepare the inventory file:

[oo_first_master]
$master-public-dns ansible_user=root ansible_ssh_user=root ansible_ssh_private_key_file="~/cfile/libra.pem" openshift_public_hostname=$master-public-dns

[oo_first_master:vars]
deployment_type=openshift-enterprise
openshift_release=v3.5.0
openshift_logging_install_logging=true
openshift_logging_kibana_hostname=kibana.$sub-domain
public_master_url=https://$master-public-dns:8443
openshift_logging_image_prefix=registry.ops.openshift.com/openshift3/
openshift_logging_image_version=3.5.0
openshift_logging_namespace=juzhao

2. Run the playbook from a control machine (my laptop) which is not oo_master:

git clone https://github.com/openshift/openshift-ansible
ansible-playbook -vvv -i ~/inventory playbooks/common/openshift-cluster/openshift_logging.yml

Actual results:
Creating the ES pod failed due to an invalid INSTANCE_RAM env value.

Expected results:
The deployment should complete successfully.

Additional info:
Attached the full ansible run log.

There are 3 additional questions I want to ask:

Q1: Why does the jks-cert-gen pod not exist? I can't find any info about it in the attached ansible log.

Q2: Under /etc/origin/logging/, ca.db.attr and ca.db.attr.old are the same; should ca.db.attr.old be deleted?

# ls -al /etc/origin/logging/
total 140
drwxr-xr-x. 2 root root 4096 Jan 25 21:47 .
drwx------. 7 root root 4096 Jan 25 21:42 ..
-rw-r--r--. 1 root root 1196 Jan 25 21:44 02.pem
-rw-r--r--. 1 root root 1196 Jan 25 21:45 03.pem
-rw-r--r--. 1 root root 1196 Jan 25 21:45 04.pem
-rw-r--r--. 1 root root 1184 Jan 25 21:45 05.pem
-rw-r--r--. 1 root root 1050 Jan 25 21:43 ca.crt
-rw-r--r--. 1 root root    0 Jan 25 21:44 ca.crt.srl
-rw-r--r--. 1 root root  301 Jan 25 21:45 ca.db
-rw-r--r--. 1 root root   20 Jan 25 21:45 ca.db.attr
-rw-r--r--. 1 root root   20 Jan 25 21:45 ca.db.attr.old
-rw-r--r--. 1 root root  233 Jan 25 21:45 ca.db.old
-rw-------. 1 root root 1679 Jan 25 21:43 ca.key
-rw-r--r--. 1 root root    3 Jan 25 21:45 ca.serial.txt
-rw-r--r--. 1 root root    3 Jan 25 21:45 ca.serial.txt.old
-rw-r--r--. 1 root root 3768 Jan 25 21:46 elasticsearch.jks
-rw-r--r--. 1 root root 2242 Jan 25 21:43 kibana-internal.crt
-rw-------. 1 root root 1679 Jan 25 21:43 kibana-internal.key
-rw-r--r--. 1 root root 3979 Jan 25 21:47 logging-es.jks
-rw-r--r--. 1 root root  321 Jan 25 21:43 server-tls.json
-rw-r--r--. 1 root root 4263 Jan 25 21:43 signing.conf
-rw-r--r--. 1 root root 1184 Jan 25 21:45 system.admin.crt
-rw-r--r--. 1 root root  948 Jan 25 21:45 system.admin.csr
-rw-r--r--. 1 root root 3701 Jan 25 21:47 system.admin.jks
-rw-r--r--. 1 root root 1704 Jan 25 21:45 system.admin.key
-rw-r--r--. 1 root root 1196 Jan 25 21:45 system.logging.curator.crt
-rw-r--r--. 1 root root  960 Jan 25 21:45 system.logging.curator.csr
-rw-r--r--. 1 root root 1708 Jan 25 21:45 system.logging.curator.key
-rw-r--r--. 1 root root 1196 Jan 25 21:44 system.logging.fluentd.crt
-rw-r--r--. 1 root root  960 Jan 25 21:44 system.logging.fluentd.csr
-rw-r--r--. 1 root root 1704 Jan 25 21:44 system.logging.fluentd.key
-rw-r--r--. 1 root root 1196 Jan 25 21:45 system.logging.kibana.crt
-rw-r--r--. 1 root root  960 Jan 25 21:45 system.logging.kibana.csr
-rw-r--r--. 1 root root 1704 Jan 25 21:45 system.logging.kibana.key
-rw-r--r--. 1 root root  797 Jan 25 21:47 truststore.jks

Q3: Although the error message in the ansible log says it can't get deploymentconfig/logging-kibana, the deploymentconfig/logging-kibana actually exists; the same phenomenon occurs for sa, configmap, and rolebinding.

fatal: [ec2-52-202-145-248.compute-1.amazonaws.com]: FAILED! => {
    "attempts": 30,
    "changed": false,
    "cmd": [
        "oc",
        "--config=/tmp/openshift-logging-ansible-QVGhWk/admin.kubeconfig",
        "get",
        "deploymentconfig/logging-kibana",
        "-n",
        "juzhao",
        "-o",
        "jsonpath={.status.replicas}"
    ],
    "delta": "0:00:01.231301",
    "end": "2017-01-25 22:11:21.072966",
    "failed": true,
    "invocation": {
        "module_args": {
            "_raw_params": "oc --config=/tmp/openshift-logging-ansible-QVGhWk/admin.kubeconfig get deploymentconfig/logging-kibana -n juzhao -o jsonpath='{.status.replicas}'",
            "_uses_shell": false,
            "chdir": null,
            "creates": null,
            "executable": null,
            "removes": null,
            "warn": true
        },
        "module_name": "command"
    },
    "rc": 0,
    "start": "2017-01-25 22:11:19.841665",
    "stderr": "",
    "stdout": "1",
    "stdout_lines": [
        "1"
    ],
    "warnings": []
}
        to retry, use: --limit @/home/fedora/openshift-ansible/playbooks/common/openshift-cluster/openshift_logging.retry

# oc get dc
NAME                  REVISION   DESIRED   CURRENT   TRIGGERED BY
logging-curator       1          0         0         config
logging-es-127rf9yo   1          1         1         config
logging-kibana        1          1         1         config

# oc get configmap
NAME                    DATA      AGE
logging-curator         1         3h
logging-elasticsearch   2         3h
logging-fluentd         3         3h

# oc get sa
NAME                               SECRETS   AGE
aggregated-logging-curator         2         3h
aggregated-logging-elasticsearch   2         3h
aggregated-logging-fluentd         2         2h
aggregated-logging-kibana          2         2h
builder                            2         3h
default                            2         3h
deployer                           2         3h

# oc get rolebinding
NAME                              ROLE                    USERS          GROUPS                          SERVICE ACCOUNTS                   SUBJECTS
logging-elasticsearch-view-role   /view                                                                  aggregated-logging-elasticsearch
system:deployers                  /system:deployer                                                       deployer
system:image-builders             /system:image-builder                                                  builder
system:image-pullers              /system:image-puller                   system:serviceaccounts:juzhao
admin                             /admin                  system:admin
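For illustration only, here is a minimal sketch of the kind of validation in the image's entrypoint that would reject this value. It assumes the script only accepts plain 'G'/'M' suffixes, which is an assumption and not confirmed against the shipped run.sh; under that assumption, the Kubernetes-style binary suffix in '1024Mi' fails the match:

# Hypothetical validation sketch -- not the actual run.sh from the image.
# A check like this accepts "1024M" or "1G" but rejects the "1024Mi" that
# the ansible role passes through from the memory limit.
regex='^([0-9]+)([GgMm])$'
if [[ "${INSTANCE_RAM}" =~ $regex ]]; then
    num=${BASH_REMATCH[1]}
    unit=${BASH_REMATCH[2]}
else
    echo "INSTANCE_RAM env var is invalid: ${INSTANCE_RAM}"
    exit 1
fi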
Created attachment 1244600 [details] ansible log
In https://github.com/openshift/openshift-ansible/tree/master/roles/openshift_logging:

openshift_logging_es_memory_limit: The amount of RAM that should be assigned to ES. Defaults to '1024Mi'.

Maybe this error is related to that default.
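If the image really only accepts values without the 'i' suffix (an assumption, see the sketch in comment 0), one possible workaround until the image or role is fixed would be to pass a plain-unit value. This is untested here; the DC name is the one from this report:

# Untested workaround sketch: override the value the failing pod receives.
oc set env dc/logging-es-127rf9yo INSTANCE_RAM=1024M -n juzhao
# Alternatively, re-run the playbook with openshift_logging_es_memory_limit=1024M
# set under [oo_first_master:vars] in the inventory.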
Can you please attach the logs for the ES pod that is failing? It should be able to correctly use the default of '1024Mi'.
To answer your above questions:

1) We no longer use a jks generation pod due to issues with it needing to be scheduled on a specific node. A script is now executed on the control host instead.

2) Possibly; however, we are letting that be handled by the signing tools. It shouldn't affect whether this works or not, though.

3) Is that the output from while it is retrying until it sees that Kibana has successfully started up? I'll check to see if its until statement is incorrect (the command it retries is repeated below for reference)... it looks like everything eventually started up, with the exception of ES.
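For reference, this is the check the playbook keeps retrying, taken verbatim from the fatal output above (the kubeconfig path is the temporary one from that particular run); a stdout of "1" means the DC status reports one Kibana replica:

# oc --config=/tmp/openshift-logging-ansible-QVGhWk/admin.kubeconfig \
#    get deploymentconfig/logging-kibana -n juzhao -o jsonpath='{.status.replicas}'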
Blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1418911
(In reply to ewolinet from comment #3)
> Can you please attach the logs for the ES pod that is failing? It should be
> able to correctly use the default of '1024Mi'.

Sorry for forgetting to attach the ES pod log when I submitted this defect.

# oc logs logging-es-127rf9yo-1-m07kk
INSTANCE_RAM env var is invalid: 1024Mi
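As an additional diagnostic suggestion (not from the original report), the environment injected by the DC can be listed to confirm exactly what value reaches the pod; it is expected to show INSTANCE_RAM=1024Mi here:

# oc set env dc/logging-es-127rf9yo --list -n juzhao | grep INSTANCE_RAM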
Tested with the latest ES 3.5.0 image on the ops registry; same error as https://bugzilla.redhat.com/show_bug.cgi?id=1419244:

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-1-85sb4       1/1       Running            2          7m
logging-es-eft6uu2i-1-hqk3r   0/1       CrashLoopBackOff   6          7m
logging-fluentd-mvpb2         1/1       Running            0          8m
logging-fluentd-tprgq         1/1       Running            0          8m
logging-fluentd-vvvrh         1/1       Running            0          8m
logging-kibana-1-bt7tr        2/2       Running            0          7m

# oc logs logging-es-eft6uu2i-1-hqk3r
Comparing the specificed RAM to the maximum recommended for ElasticSearch...
Inspecting the maximum RAM available...
ES_JAVA_OPTS: '-Dmapper.allow_dots_in_name=true -Xms128M -Xmx512m'
/opt/app-root/src/run.sh: line 141: /usr/share/elasticsearch/bin/elasticsearch: No such file or directory

Images tested with:
openshift3/logging-elasticsearch   3.5.0   eed2ca51f2ba   9 hours ago   399.2 MB

# openshift version
openshift v3.5.0.17+c55cf2b
kubernetes v1.5.2+43a9be4
etcd 3.1.0
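One way to confirm whether the launcher is really missing from the new image (a suggestion, assuming oc debug is available in this client version; the DC name is the one from the output above):

# Check whether the ES launcher exists in the image used by the failing DC.
oc debug dc/logging-es-eft6uu2i -n juzhao -- ls -l /usr/share/elasticsearch/bin/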
error "INSTANCE_RAM env var is invalid: 1024Mi" does not exist now, although same error with https://bugzilla.redhat.com/show_bug.cgi?id=1419244 happens now. Set this defect to VERIFIED and close it.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049