Created attachment 1259427 [details]
ansible inventory file

Description of problem:
This issue was found while verifying https://bugzilla.redhat.com/show_bug.cgi?id=1426511. Logging 3.3.1 was installed first, with nodeselectors specified in the configmap (see 'Steps to Reproduce'). After upgrading logging from 3.3.1 to 3.5.0, the ES pod cannot start up; it is unable to read /etc/elasticsearch/secret/searchguard.truststore.

# oc get po
NAME                          READY     STATUS             RESTARTS   AGE
logging-curator-2-zn6lk       1/1       Running            5          24m
logging-deployer-1c4s8        0/1       Completed          0          50m
logging-es-63pgj4rj-2-0qn01   0/1       CrashLoopBackOff   9          24m
logging-fluentd-6sj96         1/1       Running            0          24m
logging-kibana-2-fxblb        2/2       Running            0          24m

# oc logs logging-es-63pgj4rj-2-0qn01
Comparing the specificed RAM to the maximum recommended for ElasticSearch...
Inspecting the maximum RAM available...
ES_JAVA_OPTS: '-Dmapper.allow_dots_in_name=true -Xms128M -Xmx4096m'
Checking if Elasticsearch is ready on https://localhost:9200
..[2017-03-03 07:27:06,021][INFO ][node                     ] [Iron Fist] version[2.4.4], pid[1], build[b3c4811/2017-01-18T03:01:12Z]
[2017-03-03 07:27:06,022][INFO ][node                     ] [Iron Fist] initializing ...
.[2017-03-03 07:27:07,065][INFO ][plugins                  ] [Iron Fist] modules [reindex, lang-expression, lang-groovy], plugins [search-guard-ssl, openshift-elasticsearch, cloud-kubernetes, search-guard-2], sites []
[2017-03-03 07:27:07,103][INFO ][env                      ] [Iron Fist] using [1] data paths, mounts [[/elasticsearch/persistent (/dev/xvda2)]], net usable_space [17.8gb], net total_space [24.9gb], spins? [possibly], types [xfs]
[2017-03-03 07:27:07,103][INFO ][env                      ] [Iron Fist] heap size [3.9gb], compressed ordinary object pointers [true]
Exception in thread "main" ElasticsearchException[Unable to read /etc/elasticsearch/secret/searchguard.truststore (/etc/elasticsearch/secret/searchguard.truststore) Please make sure this files exists and is readable regarding to permissions]
	at com.floragunn.searchguard.ssl.DefaultSearchGuardKeyStore.checkStorePath(DefaultSearchGuardKeyStore.java:551)
	at com.floragunn.searchguard.ssl.DefaultSearchGuardKeyStore.initSSLConfig(DefaultSearchGuardKeyStore.java:199)
	at com.floragunn.searchguard.ssl.DefaultSearchGuardKeyStore.<init>(DefaultSearchGuardKeyStore.java:139)
	at com.floragunn.searchguard.ssl.SearchGuardSSLModule.<init>(SearchGuardSSLModule.java:40)
	at com.floragunn.searchguard.ssl.SearchGuardSSLPlugin.nodeModules(SearchGuardSSLPlugin.java:126)
	at org.elasticsearch.plugins.PluginsService.nodeModules(PluginsService.java:263)
	at org.elasticsearch.node.Node.<init>(Node.java:179)
	at org.elasticsearch.node.Node.<init>(Node.java:140)
	at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
	at org.elasticsearch.bootstrap.Bootstrap.setup(Bootstrap.java:194)
	at org.elasticsearch.bootstrap.Bootstrap.init(Bootstrap.java:286)
	at org.elasticsearch.bootstrap.Elasticsearch.main(Elasticsearch.java:45)
Refer to the log for complete error details.
Version-Release number of selected component (if applicable):
openshift-ansible-3.5.20-1.git.0.5a5fcd5.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Deploy the logging 3.3.1 stack (on OCP 3.5.0) with the journald log driver enabled and node selectors defined in the configmap; the curator, es and kibana nodeselectors differ from the fluentd nodeselector:
   "use-journal": "true"
   "curator-nodeselector": "logging-infra-east=true"
   "es-nodeselector": "logging-infra-east=true"
   "kibana-nodeselector": "logging-infra-east=true"
2. Upgrade to the logging 3.5.0 stack using ansible, specifying these parameters in the inventory file (as in the attachment); the curator, es and kibana nodeselectors differ from the fluentd nodeselector:
   openshift_logging_fluentd_use_journal=true
   openshift_logging_es_nodeselector={'logging-infra-east':'true'}
   openshift_logging_kibana_nodeselector={'logging-infra-east':'true'}
   openshift_logging_curator_nodeselector={'logging-infra-east':'true'}
   openshift_logging_fluentd_nodeselector={'logging-infra-fluentd':'true'}
3. Check the upgrade result.

Actual results:
The upgrade failed; the ES pod failed to start up.

Expected results:
The upgrade should be successful.

Additional info:
Ansible upgrade log attached.
Inventory file for the upgrade attached.
ES dc info attached.
Created attachment 1259430 [details]
ansible running log
Created attachment 1259431 [details]
es dc log
In ES 3.3 the truststore is named /etc/elasticsearch/secret/truststore. This commit changed

-    truststore_filepath: /etc/elasticsearch/secret/truststore

to

+    truststore.path: /etc/elasticsearch/secret/searchguard.truststore

commit b7f526dc6dfabf1a98db284984fbc7333080f067
Author: ewolinetz <ewolinet>
Date:   Tue Jul 12 10:05:20 2016 -0500

    bumping up versions to work with es 2.3.5 and kibana 4.5.4

I'm assuming this needs to be handled by the upgrade in ansible?
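To make the rename concrete: a 3.3-era secret only contains a file named truststore, while search-guard after this commit looks for searchguard.truststore. A minimal shell sketch of the mismatch and the carry-over an upgrade would need (illustration only, not the actual ansible fix; a temp directory stands in for the mounted secret at /etc/elasticsearch/secret, whose contents are really managed via `oc`):

```shell
# A temp directory stands in for the mounted secret volume.
SECRET_DIR=$(mktemp -d)
printf 'jks-bytes' > "$SECRET_DIR/truststore"   # pre-3.5 file name

# Carry the old truststore over to the name that search-guard-ssl now
# reads (truststore.path: .../searchguard.truststore), if it is missing.
if [ -f "$SECRET_DIR/truststore" ] && [ ! -f "$SECRET_DIR/searchguard.truststore" ]; then
    cp "$SECRET_DIR/truststore" "$SECRET_DIR/searchguard.truststore"
fi

ls "$SECRET_DIR"
```

Without a step like this, the mounted secret never contains searchguard.truststore and ES fails at startup exactly as in the description.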
How did you do the upgrade from 3.3 to 3.5? I'm following the official documentation for 3.4:
https://docs.openshift.com/container-platform/3.4/install_config/upgrading/automated_upgrades.html#preparing-for-an-automated-upgrade

except that I'm using

ansible-playbook -vvv -i /root/ansible-inventory playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.yml

And I get this error message:

MSG:

openshift_release is 3.3 which is not a valid release for a 3.5 upgrade
(In reply to Rich Megginson from comment #5)
> How did you do the upgrade from 3.3 to 3.5? I'm following the official
> documentation for 3.4:
> https://docs.openshift.com/container-platform/3.4/install_config/upgrading/
> automated_upgrades.html#preparing-for-an-automated-upgrade
>
> except that I'm using
>
> ansible-playbook -vvv -i /root/ansible-inventory
> playbooks/byo/openshift-cluster/upgrades/v3_5/upgrade.yml
>
> And I get this error message:
>
> MSG:
>
> openshift_release is 3.3 which is not a valid release for a 3.5 upgrade

Please see https://bugzilla.redhat.com/show_bug.cgi?id=1426511#c17
(In reply to Junqi Zhao from comment #6)
> (In reply to Rich Megginson from comment #5)
> > How did you do the upgrade from 3.3 to 3.5?
>
> please see https://bugzilla.redhat.com/show_bug.cgi?id=1426511#c17

excerpt:
> We specified the following ansible parameters to upgrade from 3.3.1 to 3.5.0

Specified where? How did you run ansible? What version of openshift-ansible did you use? Did you do a yum update (or git checkout) to go from openshift-ansible 3.3.1 to 3.5.0? Did you start with openshift-ansible 3.5.0 and somehow install logging 3.3.1?

> openshift_logging_install_logging=false
> openshift_logging_upgrade_logging=true

What I'm looking for is the exact, step-by-step instructions you used, because I am unable to reproduce based on the information given so far.
Created attachment 1260668 [details]
Deploy logging 3.3.1 shell script
@rmeggins,

1. Please use the attached 'Deploy logging 3.3.1 shell script' to deploy logging 3.3.1, changing the parameters according to your environment before deployment.

In my scenario there is one Master and one Node. The nodeSelector for fluentd is 'logging-infra-fluentd=true', and the nodeSelector for curator, es and kibana is 'logging-infra-east=true'. Since I have only one Node, you will see that both labels "logging-infra-fluentd=true" and "logging-infra-east=true" are applied to the Node.

I suggest you use JSON-FILE as the logging driver, since it is slow to show log entries in the Kibana UI. Please make sure log entries can be found in Kibana before your upgrade.

2. My openshift-ansible is installed by yum:

# rpm -qa | grep openshift-ansible
openshift-ansible-3.5.20-1.git.0.5a5fcd5.el7.noarch
openshift-ansible-docs-3.5.20-1.git.0.5a5fcd5.el7.noarch

You can install it from our puddle server 'rcm-guest/puddles/RHAOS/AtomicOpenShift/3.5/'. The playbooks are cloned from https://github.com/openshift/openshift-ansible/.

Use the following commands to upgrade to 3.5.0 with ansible:

# git clone https://github.com/openshift/openshift-ansible/
# cd openshift-ansible
# ansible-playbook -vvv -i $INVENTORY_FILE playbooks/common/openshift-cluster/openshift_logging.yml

$INVENTORY_FILE is the ansible inventory file used to do the upgrade work. Please use the attached 'ansible inventory file' to upgrade to logging 3.5, changing the parameters according to your environment as well. In the sample file, "ec2-52-202-98-194.compute-1.amazonaws.com" is the master and ansible_ssh_private_key_file is your private key file; please also change openshift_logging_kibana_hostname, openshift_logging_kibana_ops_hostname, public_master_url, openshift_logging_fluentd_hosts and other parameters. The nodeSelector part does not need to be changed; it is all the same as in logging 3.3.1.
This defect is not related to nodeSelector: even without setting a nodeSelector for curator, es and kibana, the ES pod still cannot start up after upgrading from 3.3.1 to 3.5.0 via ansible, with the same error as in this defect.

PS: Upgrading from 3.4.1 to 3.5.0 does not have this issue.
(In reply to Junqi Zhao from comment #9)
> @rmeggins,

0. I deployed an OSE 3.3 single host install using openshift-ansible-3.3:

yum install openshift-ansible openshift-ansible-docs openshift-ansible-callback-plugins openshift-ansible-filter-plugins openshift-ansible-lookup-plugins openshift-ansible-playbooks openshift-ansible-roles

then

cd /usr/share/ansible/openshift-ansible
ANSIBLE_LOG_PATH=/var/log/ansible.log ansible-playbook -vvv -i $INVENTORY playbooks/byo/config.yml

> 1. Please use the attached 'Deploy logging 3.3.1 shell script' to deploy
> logging 3.3.1, change the parameters according to your environment before
> deployment.

done

> In my scenario, there is one Master and one Node, nodeSelector for fluentd
> is 'logging-infra-fluentd=true', nodeSelector for curator, es and kibana is
> 'logging-infra-east=true', since I have only one Node, so you will see the
> nodeSelector "logging-infra-fluentd=true" and "logging-infra-east=true" are
> both labeled for Node.

done

> I suggest you use JSON-FILE as Logging driver, since it's slow to show log
> entry in Kibana UI.

yes

> Please make sure logging entries can be found in Kibana before your upgrade.

# date -u
Thu Mar  9 21:28:19 UTC 2017

# oc exec logging-es-mgvnymku-1-0v5ny -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/.operations.*/_search?size=1\&sort=time:desc | python -mjson.tool
...
"_index": ".operations.2017.03.09",
"time": "2017-03-09T16:28:22-05:00",

and

# oc exec logging-es-mgvnymku-1-0v5ny -- curl -s -k --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/logging.*/_search?size=1\&sort=time:desc | python -mjson.tool
...
"_index": "logging.bcd3dfc9-04f0-11e7-aed4-fa163ed71416.2017.03.10",
"time": "2017-03-09T21:24:09.738471446Z",

so Elasticsearch is up-to-date.

> 2. My openshift-ansible is installed by yum
> # rpm -qa | grep openshift-ansible
> openshift-ansible-3.5.20-1.git.0.5a5fcd5.el7.noarch
> openshift-ansible-docs-3.5.20-1.git.0.5a5fcd5.el7.noarch

I edited my /etc/yum.repos.d/rhaos.repo (puddle) file to look like this:

baseurl = http://download.eng.bos.redhat.com/rcm-guest/puddles/RHAOS/AtomicOpenShift/3.5/latest/x86_64/os/

Then I did `yum update openshift-ansible openshift-ansible-docs`:

# rpm -q openshift-ansible openshift-ansible-docs
openshift-ansible-3.5.28-1.git.0.103513e.el7.noarch
openshift-ansible-docs-3.5.28-1.git.0.103513e.el7.noarch

> you can install it from our puddle server
> 'rcm-guest/puddles/RHAOS/AtomicOpenShift/3.5/'
>
> playbooks are git cloned from git clone
> https://github.com/openshift/openshift-ansible/

I'm not sure why you are using openshift-ansible from rpm packaging, but using git for the playbooks, when they are available from

yum install openshift-ansible-callback-plugins openshift-ansible-filter-plugins openshift-ansible-lookup-plugins openshift-ansible-playbooks openshift-ansible-roles

but, ok.

> Use the following command to upgrade to 3.5.0 by ansible.
> # git clone https://github.com/openshift/openshift-ansible/
> # cd openshift-ansible
> # ansible-playbook -vvv -i $INVENTORY_FILE
> playbooks/common/openshift-cluster/openshift_logging.yml
>
> $INVENTORY_FILE is your ansible inventory file used to do upgrade work.
>
> Please use the attached 'ansible inventory file' to upgrade to logging 3.5,
> change the parameter according to your environment too. In the sample file,
> "ec2-52-202-98-194.compute-1.amazonaws.com" is master,
> ansible_ssh_private_key_file is your private key file, also please change
> openshift_logging_kibana_hostname, openshift_logging_kibana_ops_hostname,
> public_master_url, openshift_logging_fluentd_hosts and other parameters. The
> nodeSelector part does not need to be changed, they all the same with
> logging 3.3.1.

ok. I am now able to reproduce the problem.
Note that this _does not upgrade to ocp 3.5_ - this runs logging 3.5.0 containers _on top of ose 3.3_. Do we even support that?
submitted PR: https://github.com/openshift/openshift-ansible/pull/3616
@rmeggins,

We use OCP 3.5.0 now, so for this issue Logging 3.3.1 was installed on OCP 3.5.0 and then upgraded to Logging 3.5.0.

Verified with your fix: the ES pod is running now, but there are exceptions in the ES log (see the attached file), and curator's status changed from Running -> Error -> CrashLoopBackOff -> Running, and finally to CrashLoopBackOff; there is no log for the curator pod. This issue did not happen before your fix.

# oc get po
NAME                          READY     STATUS      RESTARTS   AGE
logging-curator-2-fsc4p       1/1       Running     5          11m
logging-deployer-pvvxt        0/1       Completed   0          1h
logging-es-s6smjn2c-2-5wz6d   1/1       Running     0          56m
logging-fluentd-4kf0b         1/1       Running     0          55m
logging-kibana-2-5v6pq        2/2       Running     0          55m

openshift-ansible and the playbooks are yum installed. Versions:
openshift-ansible-3.5.25-1.git.0.a40beae.el7.noarch
openshift-ansible-playbooks-3.5.25-1.git.0.a40beae.el7.noarch

# ansible --version
ansible 2.2.1.0
Created attachment 1261796 [details]
es pod log, SSL Problem Received fatal alert: unknown_ca
Well, it looks like https://github.com/openshift/openshift-ansible/pull/3616 doesn't help in this case, but it should be fixed anyway.
Commit pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/bc3042fbb66f6a231056d665f2f82cdc6f6d4a3b
Bug 1428711 - [IntService_public_324] ES pod is unable to read searchguard.truststore after upgarde logging from 3.3.1 to 3.5.0

https://bugzilla.redhat.com/show_bug.cgi?id=1428711

The list of secrets for elasticsearch was missing searchguard.truststore
ewolinetz, jcantrill:

The problem is that there are two different CA certs: one for the Elasticsearch certs, which is created by the 3.5 ansible playbooks, and one for the other components, created by the 3.3 deployer. Elasticsearch doesn't trust the 3.3 CA cert, and the other components do not trust the ES 3.5 CA.

I think the problem is the way ansible handles the upgrade. In 3.3 the certs are generated with an "ephemeral" CA; the key only exists inside the deployer pod:

+ openshift admin ca create-signer-cert --key=/etc/deploy/scratch/ca.key --cert=/etc/deploy/scratch/ca.crt --serial=/etc/deploy/scratch/ca.serial.txt --name=logging-signer-20170315001154

However, this CA key/serial is not saved anywhere, so it cannot be used again.

When 3.5 is installed for the first time, ansible will create a new CA (logging-signer-test) and create potential new certs/keys/truststores/keystores in case some are missing (generate-certs.yaml). However, it doesn't actually install them unless they are missing from the secrets in the openshift_logging_facts (generate_secrets.yaml). This is the case for elasticsearch: the task "Generating secrets for elasticsearch" will replace the existing secrets because the new list ["admin-cert", "searchguard.key", "admin-ca", "key", "truststore", "admin-key", "searchguard.truststore"] does not match the old list ["admin-ca", "admin-cert", "admin-key", "key", "searchguard.key", "truststore"]. The new certs use the new CA, and the old services don't trust ES (and vice versa).

In this case, ansible doesn't need to install new secrets for all of the above, perhaps just the searchguard.key, which has quite a different format post 3.3. And the contents of searchguard.truststore are identical to truststore.

One workaround is to add the new CA cert to the CA cert file of the other services, and add the old CA cert to the Elasticsearch CA and truststores. I guess this can be done by editing the secrets; I'll have to find out.
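The two-CA situation can be reproduced locally with openssl (throwaway self-signed CAs; the names logging-signer-old/-test here are placeholders standing in for the real deployer/ansible signers, not the actual secret contents): a cert signed by one CA fails verification against the other, and trusting both CA certs at once is what makes the append-the-CA workaround viable.

```shell
# Two independent signers, as in the 3.3-deployer / 3.5-ansible split.
workdir=$(mktemp -d) && cd "$workdir"
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout old-ca.key -out old-ca.crt -subj "/CN=logging-signer-old" 2>/dev/null
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout new-ca.key -out new-ca.crt -subj "/CN=logging-signer-test" 2>/dev/null

# A component cert signed by the old CA (standing in for e.g. kibana's cert).
openssl req -newkey rsa:2048 -nodes -keyout kibana.key -out kibana.csr \
    -subj "/CN=logging-kibana" 2>/dev/null
openssl x509 -req -in kibana.csr -CA old-ca.crt -CAkey old-ca.key \
    -CAcreateserial -days 1 -out kibana.crt 2>/dev/null

# Verifies against the CA that signed it, fails against the other one...
openssl verify -CAfile old-ca.crt kibana.crt
openssl verify -CAfile new-ca.crt kibana.crt || echo "rejected by the new CA"

# ...but a bundle containing both CA certs accepts it, which is why
# appending the other side's CA cert to each trust store works.
cat old-ca.crt new-ca.crt > both-ca.crt
openssl verify -CAfile both-ca.crt kibana.crt
```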
Finally, from what I've been able to find out, upgrade directly from 3.3 to 3.5 is not supported. The supported upgrade path is 3.3 to 3.4 to 3.5. But there still may be a problem if there is no way to handle the CA certs correctly. Eric, do you know if the old upgrade in 3.4 will correctly handle the certs as described above? If so, then this bug is likely CLOSED INVALID.
@rmeggins,

Upgrading from 3.4.1 to 3.5.0 via ansible was successful. Checking the upgrade log, there were these outputs:

No matching indices found - skipping update_for_uuid
No matching indexes found - skipping update_for_common_data_model

According to the defect about upgrading from 3.2 to 3.4, https://bugzilla.redhat.com/show_bug.cgi?id=1395170#c3, step 4) says: "Observe in upgrade pod that this isn't seen 'No matching indexes found - skipping update_for_common_data_model'".

Is it the same for the upgrade from 3.4.1 to 3.5.0 via ansible? Should "No matching indexes found - skipping update_for_common_data_model" not appear in the upgrade log?

I remember that the file roles/openshift_logging/files/fluent.conf has been changed to use

@include configs.d/openshift/filter-viaq-data-model.conf

so I think we can ignore this output; please correct me if I am wrong.
Created attachment 1263202 [details]
upgrade log from 3.4.1 to 3.5.0 via ansible
(In reply to Junqi Zhao from comment #19)
> @rmeggins,
>
> Upgraded from 3.4.1 to 3.5.0 via ansible was successful, checked the upgrade
> log, there were outputs:
> No matching indices found - skipping update_for_uuid
> No matching indexes found - skipping update_for_common_data_model
>
> according to defect which about upgrade from 3.2 to 3.4:
> https://bugzilla.redhat.com/show_bug.cgi?id=1395170#c3,
> 4) "Observe in upgrade pod that this isn't seen "No matching indexes found -
> skipping update_for_common_data_model".
>
> Is it the same with upgrade from 3.4.1 to 3.5.0 via ansible, should not
> there have "No matching indexes found - skipping
> update_for_common_data_model" in upgrade log?

It is not the same as the upgrade from 3.4 to 3.5. When upgrading from 3.4 to 3.5 I would expect to see "No matching indexes found - skipping update_for_common_data_model", because the 3.4 indices are already using the common data model.

> I remember the file roles/openshift_logging/files/fluent.conf, we have been
> changed to
> @include configs.d/openshift/filter-viaq-data-model.conf
>
> I think we can ignore this output, please correct me if I am wrong.

You are correct.

So there still may be a bug when upgrading from 3.3 to 3.4. Was the 3.3 to 3.4 upgrade tested for the OCP 3.4 release? If so, I think we can close this bug.

At any rate, if you run into this situation, the workaround is this:

* Dump all of your secrets that contain CA information, e.g.:

  $ oc get secret logging-kibana \
      --template='{{index .data "ca"}}' | base64 -d > kibana.ca
  $ oc get secret logging-elasticsearch \
      --template='{{index .data "truststore"}}' | base64 -d > es.truststore
  $ oc get secret logging-elasticsearch \
      --template='{{index .data "key"}}' | base64 -d > es.key
  $ oc get secret logging-elasticsearch \
      --template='{{index .data "admin-ca"}}' | base64 -d > es.ca

* For the PEM based CA files (kibana.ca, etc.), just append the es.ca:

  $ cat es.ca >> kibana.ca

* For the jks based files, import the kibana.ca:

  $ keytool -import -file kibana.ca -keystore es.truststore -storepass tspass -noprompt -alias old-ca
  $ keytool -import -file kibana.ca -keystore es.key -storepass kspass -noprompt -alias old-ca

* base64-encode all of these:

  $ for file in kibana.ca es.truststore .... ; do cat $file | base64 -w 0 > $file.b64 ; done

* Edit the secrets, e.g. `oc edit secret logging-kibana`, `oc edit secret logging-elasticsearch`, etc., and replace the CA value with the contents of the corresponding .b64 file. For example, in `oc edit secret logging-kibana`, replace the value of the "ca:" key with the contents of kibana.ca.b64.

* Redeploy and restart all logging pods.
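For reference, the base64 step above can be sanity-checked locally before pasting anything into `oc edit secret` (the file name kibana.ca matches the workaround; its contents here are a stand-in, not a real cert):

```shell
# Verify the encode/decode round trip GNU base64 performs on a secret value.
workdir=$(mktemp -d) && cd "$workdir"
printf -- '-----BEGIN CERTIFICATE-----\nc3RhbmQtaW4=\n-----END CERTIFICATE-----\n' > kibana.ca

# -w 0 disables line wrapping: a value inside a secret must be one
# unbroken base64 string, not the default 76-column wrapped form.
base64 -w 0 < kibana.ca > kibana.ca.b64

# Decoding must reproduce the original bytes exactly.
base64 -d < kibana.ca.b64 > kibana.ca.check
cmp kibana.ca kibana.ca.check && echo "round trip OK"
```

If `cmp` reports a difference, the value was mangled (usually by stray newlines from a wrapped encoding) and should not be pasted into the secret.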
@rmeggins,

We did 3.3 to 3.4 upgrade testing on the OCP 3.4 release, and we also did 3.2 to 3.4 upgrade testing; the upgrade from logging 3.2 to 3.4 was successful.

Since we are not going to fix this issue, I think this defect should be closed as WONTFIX instead of NOTABUG.