Bug 1828907
| Summary: | The elasticsearch deployments weren't updated | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | ewolinet |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.z | CC: | anli, aos-bugs, ewolinet, jcantril, periklis, scuppett, smilner, vlaad |
| Target Milestone: | --- | Keywords: | UpcomingSprint, Upgrades |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | 1827690 | | |
| : | 1861529 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 15:09:31 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1861529 | | |

Doc Text:

Cause: The mechanism used by the Elasticsearch Operator (EO) to detect that changes had been rolled out was reporting false positives.

Consequence: The Operator would update the deployment but would not leave it unpaused long enough for the scheduler to create a new ReplicaSet and a new pod.

Fix: The change-detection mechanism was updated to explicitly verify that the pod spec contains the updates made to the deployment.

Result: The EO now waits until the changes have been properly rolled out before pausing the deployment again.
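The Doc Text above describes the fix in terms of the deployment's pod template: the operator should only re-pause a node deployment after the template really carries the requested change and a new pod has rolled out. A minimal way to spot-check that behavior from the CLI, assuming the resource names seen later in this report (the node deployment name and the `elasticsearch` container name are examples; adjust them for your cluster):

```
# Names below are examples taken from this report; adjust for your cluster.
NS=openshift-logging
DEPLOY=elasticsearch-cdm-e96wmaeq-1

# Image requested by the Elasticsearch CR:
oc -n "$NS" get elasticsearch elasticsearch -o jsonpath='{.spec.nodeSpec.image}{"\n"}'

# Image in the deployment's pod template, plus whether the deployment is paused:
oc -n "$NS" get deployment "$DEPLOY" \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="elasticsearch")].image}{"\n"}{.spec.paused}{"\n"}'

# With the fix, a pod running the new image should appear before the
# deployment reports paused=true again:
oc -n "$NS" get pods -o wide | grep "$DEPLOY"
```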
Comment 1
Jeff Cantrill
2020-05-11 19:26:25 UTC
@ewolinetz Are we missing a link to the backport PR here?

https://github.com/openshift/elasticsearch-operator/pull/333 -- it was added recently by the bot :-)

Once I fixed the image pullspec error in app-registry, the CSV bundles were upgraded, but the cluster-logging-operator and elasticsearch-operator weren't restarted, and the clusterlogging CR and elasticsearch CR weren't updated. The ES cluster is still in ImagePullBackOff:

curator-1591019400-ncnjg                        1/1   Running            0   8m54s
elasticsearch-cdm-3kh4p4xz-1-6bf57cd5f9-8tsms   1/2   ImagePullBackOff   0   28m
elasticsearch-cdm-3kh4p4xz-2-6cb9b897d-5p7p8    1/2   ImagePullBackOff   0   28m
elasticsearch-cdm-3kh4p4xz-3-7486df7f4c-f5klt   1/2   ImagePullBackOff   0   28m

@anli Can you please post the images the elastic deployments are attempting to pull? Maybe also post (see the command sketch below):
* the elasticsearch operator deployment yaml
* the cluster logging operator yaml
* the elasticsearch elasticsearch yaml

I presume you mean a 4.3.x errata?

Moving to MODIFIED per #c11.

Tested on elasticsearch-operator.4.3.27-202006211650.p0 (openshift/ose-elasticsearch-operator:v4.3.27-202006211650.p0).
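For the images and yamls requested above, commands along these lines would collect them; the namespace of the elasticsearch-operator deployment is not shown in this report, so it is looked up rather than assumed:

```
# Locate the operator deployments, then dump the requested objects.
oc get deployments --all-namespaces | grep -E 'elasticsearch-operator|cluster-logging-operator'

oc -n <elasticsearch-operator-namespace> get deployment elasticsearch-operator -o yaml
oc -n openshift-logging get deployment cluster-logging-operator -o yaml
oc -n openshift-logging get elasticsearch elasticsearch -o yaml

# Images the elasticsearch deployments are attempting to pull:
oc -n openshift-logging get deployments \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'
```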
The PR is merged. After the upgrade, the ES deployments are updated, but the pods weren't redeployed because the deployments have '.spec.paused: true'.
$ oc logs elasticsearch-operator-58c76ff44d-l7bzx
time="2020-06-24T06:44:50Z" level=warning msg="Unable to parse loglevel \"\""
{"level":"info","ts":1592981090.672264,"logger":"cmd","msg":"Go Version: go1.12.12"}
{"level":"info","ts":1592981090.672299,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1592981090.6723056,"logger":"cmd","msg":"Version of operator-sdk: v0.8.2"}
{"level":"info","ts":1592981090.67269,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1592981090.9011495,"logger":"leader","msg":"Found existing lock","LockOwner":"elasticsearch-operator-689576fff-mz2xx"}
{"level":"info","ts":1592981090.91178,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1592981092.0484293,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1592981094.4317753,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1592981094.6038837,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1592981094.604168,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"elasticsearch-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1592981094.839042,"logger":"cmd","msg":"failed to create or get service for metrics: services \"elasticsearch-operator\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
{"level":"info","ts":1592981094.8390784,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1592981094.9393675,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"elasticsearch-controller"}
{"level":"info","ts":1592981095.03962,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"elasticsearch-controller","worker count":1}
time="2020-06-24T06:48:02Z" level=info msg="Requested to update node 'elasticsearch-cdm-e7k1nfj4-1', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-06-24T06:48:03Z" level=info msg="Requested to update node 'elasticsearch-cdm-e7k1nfj4-2', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-06-24T06:48:04Z" level=info msg="Requested to update node 'elasticsearch-cdm-e7k1nfj4-3', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
It is expected that the pod spec has `paused: true`; this is to keep updates from happening immediately. The logic we use unpauses nodes as we are ready for them to restart [1]. What does your elasticsearch CR look like?

[1] https://github.com/openshift/elasticsearch-operator/blob/release-4.3/pkg/k8shandler/deployment.go#L583

The Elasticsearch CR was updated. The Elasticsearch deployment wasn't updated.
Before the upgrade, nodeSpec.image was quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch55@xxxx.
After the upgrade, nodeSpec.image is quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch5@xxxx.
{
"apiVersion": "logging.openshift.io/v1",
"kind": "Elasticsearch",
"metadata": {
"generation": 7,
"name": "elasticsearch",
"namespace": "openshift-logging",
},
"spec": {
"managementState": "Managed",
"nodeSpec": {
"image": "quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch5@sha256:90d85d18846f851b5845b181411a122081c6bd5a62d28d61f4ed12b3288ce6b6",
},
"nodes": [
{
"genUUID": "e96wmaeq",
"nodeCount": 1,
"resources": {},
"roles": [
"client",
"data",
"master"
],
"storage": {}
}
],
"redundancyPolicy": "ZeroRedundancy"
},
"status": {
"cluster": {
"activePrimaryShards": 0,
"activeShards": 0,
"initializingShards": 0,
"numDataNodes": 0,
"numNodes": 0,
"pendingTasks": 0,
"relocatingShards": 0,
"status": "cluster health unknown",
"unassignedShards": 0
},
"clusterHealth": "",
"conditions": [],
"nodes": [
{
"conditions": [
{
"lastTransitionTime": "2020-07-02T06:30:46Z",
"message": "Back-off pulling image \"quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch55@sha256:90d85d18846f851b5845b181411a122081c6bd5a62d28d61f4ed12b3288ce6b6\"",
"reason": "ImagePullBackOff",
"status": "True",
"type": "ElasticsearchContainerWaiting"
}
],
"deploymentName": "elasticsearch-cdm-e96wmaeq-1",
"upgradeStatus": {}
}
],
"shardAllocationEnabled": "shard allocation unknown"
}
}
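The CR above already carries the corrected ose-logging-elasticsearch5 image, while the node status still reports back-off on the old ose-logging-elasticsearch55 image. Given the cause described in the Doc Text (the deployment re-paused before a new ReplicaSet/pod appeared), listing the ReplicaSets behind the node deployment shows whether the new pod template was ever instantiated. A small check, using the names from this report:

```
# The IMAGES column of `-o wide` shows whether a ReplicaSet for the new image
# was ever generated, or whether only the one for the old image exists.
oc -n openshift-logging get rs -o wide | grep elasticsearch-cdm-e96wmaeq-1
```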
Must we wait for the green status before the upgrade?
$ oc logs elasticsearch-operator-555884d8b4-wbfck |tee eo-pod.logs
time="2020-07-02T06:26:56Z" level=warning msg="Unable to parse loglevel \"\""
{"level":"info","ts":1593671216.8078368,"logger":"cmd","msg":"Go Version: go1.12.12"}
{"level":"info","ts":1593671216.8078604,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1593671216.807865,"logger":"cmd","msg":"Version of operator-sdk: v0.8.2"}
{"level":"info","ts":1593671216.808159,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1593671216.9419112,"logger":"leader","msg":"Found existing lock","LockOwner":"elasticsearch-operator-df8cc886-9vvjn"}
{"level":"info","ts":1593671216.9501584,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1593671218.138512,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1593671220.522263,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1593671225.0637383,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1593671233.7697678,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1593671233.8965669,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1593671233.8968284,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"elasticsearch-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1593671234.0148737,"logger":"cmd","msg":"failed to create or get service for metrics: services \"elasticsearch-operator\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
{"level":"info","ts":1593671234.014903,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1593671234.115109,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"elasticsearch-controller"}
{"level":"info","ts":1593671234.2152753,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"elasticsearch-controller","worker count":1}
time="2020-07-02T06:30:46Z" level=info msg="Requested to update node 'elasticsearch-cdm-e96wmaeq-1', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-07-02T06:31:17Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-e96wmaeq-1: / green"
time="2020-07-02T06:31:17Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-e96wmaeq-1: Cluster not in green state before beginning upgrade: "
@Anping -- I am able to recreate this locally with the latest EO code, and it seems to be due to the deployment being paused before we see the new pod roll out.

@ewolinetz The scenario:
1. We released CSV bundles using a non-existing Elasticsearch pull spec (registry.example.com/openshift/ose-logging-elasticsearch55@xxxx).
2. The ES pods are in ImagePullBackOff status because the image 'registry.example.com/openshift/ose-logging-elasticsearch55@xxxx' can't be found. No running pods, no ES status.
3. We provide new CSV bundles with a correct pull spec.

Expected result: The CSV is updated automatically. The ES deployment is upgraded automatically. The ES pods are running after the upgrade.
Actual result: The CSV was updated. The ES deployment wasn't upgraded.

@Anping,
the deployment wasn't updated? then I'm not sure I am able to recreate this yet...
In the case where there is an ImagePullBackOff, the operator will bypass trying to do a normal upgrade and instead just update the deployments to fix a broken state:
time="2020-07-02T06:30:46Z" level=info msg="Requested to update node 'elasticsearch-cdm-e96wmaeq-1', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
though for some reason we then see it is trying to do a normal update afterwards:
time="2020-07-02T06:31:17Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-e96wmaeq-1: / green"
especially because that node is not scheduled for a normal upgrade (indicated to me by upgradeStatus):
{
"conditions": [
{
"lastTransitionTime": "2020-07-02T06:30:46Z",
"message": "Back-off pulling image \"quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch55@sha256:90d85d18846f851b5845b181411a122081c6bd5a62d28d61f4ed12b3288ce6b6\"",
"reason": "ImagePullBackOff",
"status": "True",
"type": "ElasticsearchContainerWaiting"
}
],
"deploymentName": "elasticsearch-cdm-e96wmaeq-1",
"upgradeStatus": {}
}
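The symptom reported throughout this bug is a node deployment that carries the new pod template but never rolls it out because it stays paused. A quick way to confirm that state and, purely as a manual workaround sketch rather than the operator's own flow, to force the pending template out (names are examples from this report; the operator may pause the deployment again on its next reconcile):

```
NS=openshift-logging
DEPLOY=elasticsearch-cdm-e96wmaeq-1   # example node deployment

# Is the deployment paused, and what does its Progressing condition say?
oc -n "$NS" get deployment "$DEPLOY" \
  -o jsonpath='{.spec.paused}{"\n"}{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'

# Manual workaround sketch: resume the rollout and watch it complete.
oc -n "$NS" rollout resume deployment/"$DEPLOY"
oc -n "$NS" rollout status deployment/"$DEPLOY" --timeout=300s
```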
The ES pods weren't restarted. The elasticsearch deployments are updated, but the deployments are in PAUSED state. The elasticsearch-operator prints the same messages as in Comment 23.

$ oc logs elasticsearch-operator-8599687444-tzsvv
{"level":"info","ts":1595428596.2246425,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1595428596.3249218,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"elasticsearch-controller"}
{"level":"info","ts":1595428596.4251525,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"elasticsearch-controller","worker count":1}
time="2020-07-22T14:36:37Z" level=info msg="Requested to update node 'elasticsearch-cdm-fg45zjrh-1', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-07-22T14:36:39Z" level=info msg="Requested to update node 'elasticsearch-cdm-fg45zjrh-2', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-07-22T14:36:40Z" level=info msg="Requested to update node 'elasticsearch-cdm-fg45zjrh-3', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"

$ oc get deployment elasticsearch-cdm-2k1giybq-1 -o json | jq '.status'
{
  "conditions": [
    {
      "lastTransitionTime": "2020-07-22T14:28:21Z",
      "lastUpdateTime": "2020-07-22T14:28:21Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2020-07-22T14:36:39Z",
      "lastUpdateTime": "2020-07-22T14:36:39Z",
      "message": "Deployment is paused",
      "reason": "DeploymentPaused",
      "status": "Unknown",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 5,
  "replicas": 1,
  "unavailableReplicas": 1
}

Failed in 4.5. The elasticsearch pod is stuck in ImagePullBackOff. After the CSV was updated, the elasticsearch deployments weren't updated.

$ oc get pods -n openshift-logging
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-7c57bb66d-7x76c        1/1     Running            0          11m
curator-1596465600-9gl4k                        1/1     Running            0          2m1s
elasticsearch-cdm-s08q4j20-1-5dc9fd4499-547sp   1/2     ImagePullBackOff   0          10m
elasticsearch-cdm-s08q4j20-2-68d5946564-6fc9s   0/2     Pending            0          9m12s
elasticsearch-cdm-s08q4j20-3-bb49fd6d9-pb578    0/2     Pending            0          8m11s

EO logs:
/util/wait/wait.go:88"}
time="2020-08-03T14:32:21Z" level=info msg="Updating status of Kibana"
time="2020-08-03T14:32:21Z" level=info msg="Kibana status successfully updated"
time="2020-08-03T14:32:50Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:32:51Z" level=info msg="Updating status of Kibana"
time="2020-08-03T14:32:51Z" level=info msg="Kibana status successfully updated"
time="2020-08-03T14:33:21Z" level=info msg="Kibana status successfully updated"
time="2020-08-03T14:33:52Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:34:53Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:35:23Z" level=warning msg="Unable to enable shard allocation: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: dial tcp 172.30.233.100:9200: i/o timeout"
time="2020-08-03T14:38:53Z" level=warning msg="Unable to list existing templates in order to reconcile stale ones: Get https://elasticsearch.openshift-logging.svc:9200/_template: dial tcp 172.30.233.100:9200: i/o timeout"
time="2020-08-03T14:39:23Z" level=error msg="Error creating index template for mapping app: Put https://elasticsearch.openshift-logging.svc:9200/_template/ocp-gen-app: dial tcp 172.30.233.100:9200: i/o timeout"
{"level":"error","ts":1596465563.4152765,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile IndexMangement for Elasticsearch cluster: Put https://elasticsearch.openshift-logging.svc:9200/_template/ocp-gen-app: dial tcp 172.30.233.100:9200: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
time="2020-08-03T14:39:55Z" level=warning msg="Unable to clear transient shard allocation for \"openshift-logging\" \"openshift-logging\": Response: , Error: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: dial tcp 172.30.233.100:9200: i/o timeout"
time="2020-08-03T14:40:55Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:41:55Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:42:55Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"

@Anping, can you please verify if this has been fixed in 4.6? If so, this PR should fix it for 4.5: https://github.com/openshift/elasticsearch-operator/pull/436

Passed when using elasticsearch-operator:v4.6.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198