Bug 1827032

Summary: | ES pod couldn't start after upgrade from 4.4 to 4.5 -- upgrade CLO first
---|---
Product: | OpenShift Container Platform
Component: | Logging
Version: | 4.5
Status: | CLOSED ERRATA
Severity: | urgent
Priority: | urgent
Reporter: | Qiaoling Tang <qitang>
Assignee: | Igor Karpukhin <ikarpukh>
QA Contact: | Qiaoling Tang <qitang>
CC: | aos-bugs, jcantril, periklis
Target Milestone: | ---
Target Release: | 4.5.0
Hardware: | Unspecified
OS: | Unspecified
Type: | Bug
Doc Type: | Bug Fix
Last Closed: | 2020-07-13 17:30:26 UTC
Bug Blocks: | 1809511, 1882495
Attachments: | CLO, EO and ES pod logs (attachment 1681014)

Doc Text:
Cause: Not enough resources in the cluster
Consequence: EO couldn't start
Fix: Set `resources.requests.memory` to 1.5Gi
Result: EO starts normally
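As a rough illustration of the fix noted in the Doc Text (exactly where this stanza lives in the EO manifests is not shown in this bug, so treat it as a sketch), a 1.5Gi memory request on a container looks like:

resources:
  requests:
    memory: 1536Mi  # 1.5Gi, the value mentioned in the Doc Text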
Anping, when you test this upgrade, please add the following section under `spec` in the CLO's CR:

resources:
  limits:
    memory: 2Gi
  requests:
    cpu: 100m
    memory: 1Gi

So, 2Gi instead of 4Gi. I managed to finish the upgrade by editing the CLO's CR resources and removing the ES deployment. Could you please try that?

@IgorKarpukhin,
when I deployed logging 4.4, my clusterlogging CR was:
apiVersion: "logging.openshift.io/v1"
kind: "ClusterLogging"
metadata:
  name: "instance"
  namespace: "openshift-logging"
spec:
  managementState: "Managed"
  logStore:
    type: "elasticsearch"
    elasticsearch:
      nodeCount: 1
      redundancyPolicy: "ZeroRedundancy"
      resources:
        requests:
          cpu: 1
          memory: "4Gi"
      storage:
        storageClassName: "gp2"
        size: "20Gi"
  visualization:
    type: "kibana"
    kibana:
      replicas: 1
  curation:
    type: "curator"
    curator:
      schedule: "*/30 * * * *"
  collection:
    logs:
      type: "fluentd"
      fluentd: {}
> I managed to finish the upgrade by editing the CLO's CR resources and removing the ES deployment. Could you please try to do that?
What should I change in the CLO's CR instance? The elasticsearch resource configuration? I thought the resource configuration shouldn't affect the upgrade unless there aren't enough resources for the pod to start. Please correct me if I'm wrong.
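(For reference, applying the resources Igor suggests above could be done with a patch along these lines; the values are from his comment, and the assumption is that they belong under `spec.logStore.elasticsearch.resources` as in the CR shown earlier:)

oc -n openshift-logging patch clusterlogging instance --type merge \
  -p '{"spec":{"logStore":{"elasticsearch":{"resources":{"limits":{"memory":"2Gi"},"requests":{"cpu":"100m","memory":"1Gi"}}}}}}'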
I tried deleting the ES deployment and waited for the EO to create a new one, and then the ES pod could start.
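(A sketch of that delete-and-recreate step; the deployment name below is illustrative and would normally be read from `oc get deployment`:)

# Delete the ES deployment and let the EO recreate it
oc -n openshift-logging delete deployment elasticsearch-cdm-3pt2ggfm-1
# Watch the namespace until the new ES pod is Running
oc -n openshift-logging get pods -w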
After I deleted the ES deployment, the kibana pod went into CrashLoopBackOff. I found some error logs in the kibana container:

$ oc logs -c kibana kibana-6f56fc95dc-g78nl
#The following values dynamically added from environment variable overrides:
Using NODE_OPTIONS: '--max_old_space_size=368' Memory setting is in MB
{"type":"log","@timestamp":"2020-04-26T03:02:58Z","tags":["fatal","root"],"pid":120,"message":"Error: Index .kibana belongs to a version of Kibana that cannot be automatically migrated. Reset it or use the X-Pack upgrade assistant.\n at assertIsSupportedIndex (/opt/app-root/src/src/server/saved_objects/migrations/core/elastic_index.js:246:15)\n at Object.fetchInfo (/opt/app-root/src/src/server/saved_objects/migrations/core/elastic_index.js:52:12)"}

 FATAL  Error: Index .kibana belongs to a version of Kibana that cannot be automatically migrated. Reset it or use the X-Pack upgrade assistant.

$ oc exec elasticsearch-cdm-3pt2ggfm-1-5c85f4d596-tkzlz -- indices
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-3pt2ggfm-1-5c85f4d596-tkzlz -n openshift-logging' to see all of the containers in this pod.
Sun Apr 26 03:03:24 UTC 2020
health status index                                                            uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   project.qitang.2fa1b0f3-bf11-4419-808b-0fe652635819.2020.04.26   3-c6nKUOTIC5-0kfOTC2eg   1   0        335            0          0              0
green  open   .operations.2020.04.26                                           rku-dZVMQ_O6xgCCD6mUiQ   1   0      30194            0         28             28
green  open   infra-000002                                                     LcIpMVvGTqClb6ewjNGTQg   1   0      34661            0         24             24
green  open   .security                                                        Tv6zJrlbSPmj7Qb0d9A7Vg   1   0          5            0          0              0
green  open   .searchguard                                                     X_FZTPSSQYyQ3JjHhWpo-g   1   0          5            0          0              0
green  open   app-000001                                                       CGbLQOX5Q1mHWaZntYuYeg   1   0        459            0          0              0
green  open   infra-000001                                                     0QyyxL6QSIiMNrJfMt81hw   1   0     151361            0        129            129
green  open   .kibana                                                          dCsGB9d3RkWawJ8WLuJsRA   1   0          1            0          0              0
green  open   audit-000001                                                     XvbY5oxOTFWwX8D8oVUw_g   1   0          0            0          0              0

@Anping, on the cluster that you gave me last time I had to modify the CLO's CR, because the EO pod couldn't start due to lack of resources:

ES pod:
--------------
Events:
  Type     Reason            Age        From               Message
  ----     ------            ----       ----               -------
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict, 2 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
  Warning  FailedScheduling  <unknown>  default-scheduler  0/6 nodes are available: 1 Insufficient cpu, 1 node(s) had volume node affinity conflict, 2 Insufficient memory, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate.
--------------

The ES pods can now upgrade to 4.5 successfully, but there are still other issues; I will file new bugs to track them.
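(Regarding the `.kibana` migration error above: the message itself says to reset the index or use the upgrade assistant. A hedged sketch of a reset from inside the ES container, assuming the `es_util` helper shipped in the ES image, might look like the following; note that deleting `.kibana` discards saved Kibana objects:)

oc -n openshift-logging exec -c elasticsearch elasticsearch-cdm-3pt2ggfm-1-5c85f4d596-tkzlz -- \
  es_util --query=.kibana -X DELETE   # es_util and its flags are an assumption here, not taken from this bug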
$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-d857bb6db-2fjrr        1/1     Running     0          9m56s
curator-1589333400-7vl4w                        0/1     Completed   0          35m
curator-1589335200-rg5z2                        0/1     Error       0          5m41s
elasticsearch-cdm-m2j2lxw9-1-596649ffc8-nrz5k   2/2     Running     0          9m4s
elasticsearch-cdm-m2j2lxw9-2-78787b64cf-vvwvk   2/2     Running     0          6m10s
elasticsearch-cdm-m2j2lxw9-3-74d7bd8db4-n9fbf   2/2     Running     0          4m16s
elasticsearch-delete-app-1589335200-2rjlz       0/1     Error       0          5m41s
elasticsearch-delete-audit-1589335200-xwl2h     0/1     Error       0          5m41s
elasticsearch-delete-infra-1589335200-nk5ls     0/1     Error       0          5m41s
elasticsearch-rollover-app-1589335200-sd5q5     0/1     Error       0          5m41s
elasticsearch-rollover-audit-1589335200-flpbd   0/1     Error       0          5m41s
elasticsearch-rollover-infra-1589335200-lxks9   0/1     Error       0          5m41s
fluentd-8dvgn                                   1/1     Running     0          7m8s
fluentd-q89wn                                   1/1     Running     0          8m34s
fluentd-r57cg                                   1/1     Running     0          6m28s
fluentd-sfvhg                                   1/1     Running     0          9m23s
fluentd-t2bdg                                   1/1     Running     0          7m51s
fluentd-xf9mm                                   1/1     Running     0          5m54s
kibana-6ff5c8d8f-s5tgq                          2/2     Running     0          45m

$ oc get csv
NAME                            DISPLAY                  VERSION   REPLACES                                    PHASE
clusterlogging.v4.5.0           Cluster Logging          4.5.0     clusterlogging.4.4.0-202005120551           Succeeded
elasticsearch-operator.v4.5.0   Elasticsearch Operator   4.5.0     elasticsearch-operator.4.4.0-202005120551   Succeeded

$ oc exec elasticsearch-cdm-m2j2lxw9-1-596649ffc8-nrz5k -- indices
Defaulting container name to elasticsearch.
Use 'oc describe pod/elasticsearch-cdm-m2j2lxw9-1-596649ffc8-nrz5k -n openshift-logging' to see all of the containers in this pod.
Wed May 13 02:07:22 UTC 2020
health status index                                                             uuid                   pri rep docs.count docs.deleted store.size pri.store.size
green  open   infra-000001                                                      bUZeHMtHShuRwemjAk067Q   3   1     322228            0        818            397
green  open   audit-000001                                                      u87iPiYlSPWfRrNnoi0CwA   3   1          0            0          0              0
green  open   project.qitang.5682515f-a751-4bb1-a98c-a1ca0b381376.2020.05.13    LmyeqjBQTuOVO6-HEr03YA   3   1       2741            0          3              1
green  open   .operations.2020.05.13                                            Ob9V0iMiTQ6uTnuu5qOSGg   3   1    2124104            0       4594           2288
green  open   .kibana.a5f01f00ae88a880fd91ed1dbace3dff08f5c0b2                  nJxuXyS4RiOFmzWmSvzI6Q   1   1          0            0          0              0
green  open   .security                                                         cWRUcXT-TcmIzaPFOiowsw   1   1          5            0          0              0
green  open   .kibana.647a750f1787408bf50088234ec0edd5a6a9b2ac                  CvSIX5rNRVSCMFEIH9hyeQ   1   1          2            0          0              0
green  open   .searchguard                                                      UC4ukmZvTc61M5HNPLiqyg   1   1          5           52          0              0
green  open   app-000001                                                        iJwFUbNATx-R73r2YFITTg   3   1        233            0          1              0
green  open   .kibana                                                           1LLeP9ZRROKU0WyFaGfQfw   1   1          1            0          0              0

It seems the ES images are set in the EO now; there is no ES image in the CLO. I saw the ES pods restart only once.

Given the comments in c#5, please mark this BZ verified or identify what is left specifically for this issue.

The ES pod can start now, so moving to VERIFIED.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
Created attachment 1681014 [details]
CLO, EO and ES pod logs

Description of problem:
The ES pod is stuck in CrashLoopBackOff status after upgrading from 4.4 to 4.5 when the CLO is upgraded first.

$ oc get pod
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-7cbb957c67-5ztpd       1/1     Running            0          66m
curator-1587616200-jkz4q                        1/1     Running            0          58m
curator-1587618000-5rfxf                        1/1     Running            0          28m
elasticsearch-cdm-b1w2fana-1-57bc77596f-958hz   1/2     CrashLoopBackOff   17         64m
elasticsearch-delete-app-1587618900-qsv8m       0/1     Error              0          12m
elasticsearch-delete-audit-1587618900-zs6qj     0/1     Error              0          12m
elasticsearch-delete-infra-1587618900-b8rcv     0/1     Error              0          12m
elasticsearch-rollover-app-1587618900-bxbhv     0/1     Error              0          12m
elasticsearch-rollover-audit-1587618900-rpnfv   0/1     Error              0          12m
elasticsearch-rollover-infra-1587618900-lx492   0/1     Error              0          12m
fluentd-5htfq                                   1/1     Running            0          64m
fluentd-bd9lg                                   1/1     Running            0          65m
fluentd-ghwxz                                   1/1     Running            0          65m
fluentd-jzxxb                                   1/1     Running            0          64m
fluentd-kngdj                                   1/1     Running            0          65m
fluentd-vz4nl                                   1/1     Running            0          64m
kibana-56677488fd-w4qql                         2/2     Running            1          65m

$ oc get csv
NAME                            DISPLAY                  VERSION   REPLACES                                    PHASE
clusterlogging.v4.5.0           Cluster Logging          4.5.0     clusterlogging.4.4.0-202004211517           Succeeded
elasticsearch-operator.v4.5.0   Elasticsearch Operator   4.5.0     elasticsearch-operator.4.4.0-202004211517   Succeeded

Version-Release number of selected component (if applicable):
Logging images are from payload 4.5.0-0.ci-2020-04-22-212726; the manifests are copied from the GitHub repo with the latest code.

How reproducible:
Always

Steps to Reproduce:
1. Deploy logging 4.4
2. Upgrade to 4.5

Actual results:

Expected results:

Additional info:
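(For anyone reproducing this: the upgrade in step 2 is typically triggered by switching the operator Subscriptions to the 4.5 channel. The subscription names, namespaces, and channel string below are assumptions, not values taken from this bug:)

# Upgrade the CLO first (the scenario in this bug), then the EO
oc -n openshift-logging patch subscription cluster-logging \
  --type merge -p '{"spec":{"channel":"4.5"}}'
oc -n openshift-operators-redhat patch subscription elasticsearch-operator \
  --type merge -p '{"spec":{"channel":"4.5"}}'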