Bug 1828907
| Summary: | The elasticsearch deployments weren't updated | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | ewolinet |
| Component: | Logging | Assignee: | ewolinet |
| Status: | CLOSED ERRATA | QA Contact: | Anping Li <anli> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.3.z | CC: | anli, aos-bugs, ewolinet, jcantril, periklis, scuppett, smilner, vlaad |
| Target Milestone: | --- | Keywords: | UpcomingSprint, Upgrades |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | 1827690 | | |
| : | 1861529 (view as bug list) | Environment: | |
| Last Closed: | 2020-10-27 15:09:31 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1861529 | | |

Doc Text:

Cause: The mechanism used by the Elasticsearch Operator (EO) to detect that changes had been rolled out was reporting false positives.

Consequence: The Operator would update the deployment but would not leave it unpaused long enough for the scheduler to create a new ReplicaSet and a new pod.

Fix: The change-detection mechanism was updated to explicitly verify that the pod spec contains the updates made to the deployment.

Result: The EO now waits until the changes have been properly rolled out before pausing the deployment again.
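The Doc Text above describes the fix in terms of the deployment's pod template: the operator should only re-pause a node deployment after the template really carries the requested change and a new pod has rolled out. A minimal way to spot-check that behavior from the CLI, assuming the resource names seen later in this report (the node deployment name and the `elasticsearch` container name are examples; adjust them for your cluster):

```
# Names below are examples taken from this report; adjust for your cluster.
NS=openshift-logging
DEPLOY=elasticsearch-cdm-e96wmaeq-1

# Image requested by the Elasticsearch CR:
oc -n "$NS" get elasticsearch elasticsearch -o jsonpath='{.spec.nodeSpec.image}{"\n"}'

# Image in the deployment's pod template, plus whether the deployment is paused:
oc -n "$NS" get deployment "$DEPLOY" \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="elasticsearch")].image}{"\n"}{.spec.paused}{"\n"}'

# With the fix, a pod running the new image should appear before the
# deployment reports paused=true again:
oc -n "$NS" get pods -o wide | grep "$DEPLOY"
```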
Comment 1
Jeff Cantrill
2020-05-11 19:26:25 UTC
@ewolinetz Are we missing a link to the backport PR here?

https://github.com/openshift/elasticsearch-operator/pull/333 -- it was added recently by the bot :-)

Once I fixed the image pullspec error in app-registry, the CSV bundles were upgraded, but the cluster-logging-operator and elasticsearch-operator weren't restarted, and the clusterlogging CR and elasticsearch CR weren't updated. The ES cluster is still in ImagePullBackOff:

curator-1591019400-ncnjg                        1/1   Running            0   8m54s
elasticsearch-cdm-3kh4p4xz-1-6bf57cd5f9-8tsms   1/2   ImagePullBackOff   0   28m
elasticsearch-cdm-3kh4p4xz-2-6cb9b897d-5p7p8    1/2   ImagePullBackOff   0   28m
elasticsearch-cdm-3kh4p4xz-3-7486df7f4c-f5klt   1/2   ImagePullBackOff   0   28m

@anli Can you please post the images the elastic deployments are attempting to pull? Maybe also post (see the command sketch below):
* the elasticsearch operator deployment yaml
* the cluster logging operator yaml
* the elasticsearch elasticsearch yaml

I presume you mean a 4.3.x errata?

Moving to MODIFIED per #c11.

Tested on elasticsearch-operator.4.3.27-202006211650.p0 (openshift/ose-elasticsearch-operator:v4.3.27-202006211650.p0).
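For the images and yamls requested above, commands along these lines would collect them; the namespace of the elasticsearch-operator deployment is not shown in this report, so it is looked up rather than assumed:

```
# Locate the operator deployments, then dump the requested objects.
oc get deployments --all-namespaces | grep -E 'elasticsearch-operator|cluster-logging-operator'

oc -n <elasticsearch-operator-namespace> get deployment elasticsearch-operator -o yaml
oc -n openshift-logging get deployment cluster-logging-operator -o yaml
oc -n openshift-logging get elasticsearch elasticsearch -o yaml

# Images the elasticsearch deployments are attempting to pull:
oc -n openshift-logging get deployments \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'
```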
The PR is merged. After the upgrade, the ES deployments are updated, but the pods weren't redeployed because the deployments have '.spec.paused: true'.
$ oc logs elasticsearch-operator-58c76ff44d-l7bzx
time="2020-06-24T06:44:50Z" level=warning msg="Unable to parse loglevel \"\""
{"level":"info","ts":1592981090.672264,"logger":"cmd","msg":"Go Version: go1.12.12"}
{"level":"info","ts":1592981090.672299,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1592981090.6723056,"logger":"cmd","msg":"Version of operator-sdk: v0.8.2"}
{"level":"info","ts":1592981090.67269,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1592981090.9011495,"logger":"leader","msg":"Found existing lock","LockOwner":"elasticsearch-operator-689576fff-mz2xx"}
{"level":"info","ts":1592981090.91178,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1592981092.0484293,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1592981094.4317753,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1592981094.6038837,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1592981094.604168,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"elasticsearch-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1592981094.839042,"logger":"cmd","msg":"failed to create or get service for metrics: services \"elasticsearch-operator\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
{"level":"info","ts":1592981094.8390784,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1592981094.9393675,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"elasticsearch-controller"}
{"level":"info","ts":1592981095.03962,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"elasticsearch-controller","worker count":1}
time="2020-06-24T06:48:02Z" level=info msg="Requested to update node 'elasticsearch-cdm-e7k1nfj4-1', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-06-24T06:48:03Z" level=info msg="Requested to update node 'elasticsearch-cdm-e7k1nfj4-2', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-06-24T06:48:04Z" level=info msg="Requested to update node 'elasticsearch-cdm-e7k1nfj4-3', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
It is expected that the pod spec has `paused: true`; this is to keep updates from happening immediately. The logic we use unpauses nodes as we are ready for them to restart [1]. What does your elasticsearch CR look like?

[1] https://github.com/openshift/elasticsearch-operator/blob/release-4.3/pkg/k8shandler/deployment.go#L583

The Elasticsearch CR was updated. The Elasticsearch deployment wasn't updated.
Before the upgrade, nodeSpec.image was quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch55@xxxx.
After the upgrade, nodeSpec.image is quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch5@xxxx.
{
"apiVersion": "logging.openshift.io/v1",
"kind": "Elasticsearch",
"metadata": {
"generation": 7,
"name": "elasticsearch",
"namespace": "openshift-logging",
},
"spec": {
"managementState": "Managed",
"nodeSpec": {
"image": "quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch5@sha256:90d85d18846f851b5845b181411a122081c6bd5a62d28d61f4ed12b3288ce6b6",
},
"nodes": [
{
"genUUID": "e96wmaeq",
"nodeCount": 1,
"resources": {},
"roles": [
"client",
"data",
"master"
],
"storage": {}
}
],
"redundancyPolicy": "ZeroRedundancy"
},
"status": {
"cluster": {
"activePrimaryShards": 0,
"activeShards": 0,
"initializingShards": 0,
"numDataNodes": 0,
"numNodes": 0,
"pendingTasks": 0,
"relocatingShards": 0,
"status": "cluster health unknown",
"unassignedShards": 0
},
"clusterHealth": "",
"conditions": [],
"nodes": [
{
"conditions": [
{
"lastTransitionTime": "2020-07-02T06:30:46Z",
"message": "Back-off pulling image \"quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch55@sha256:90d85d18846f851b5845b181411a122081c6bd5a62d28d61f4ed12b3288ce6b6\"",
"reason": "ImagePullBackOff",
"status": "True",
"type": "ElasticsearchContainerWaiting"
}
],
"deploymentName": "elasticsearch-cdm-e96wmaeq-1",
"upgradeStatus": {}
}
],
"shardAllocationEnabled": "shard allocation unknown"
}
}
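The CR above already carries the corrected ose-logging-elasticsearch5 image, while the node status still reports back-off on the old ose-logging-elasticsearch55 image. Given the cause described in the Doc Text (the deployment re-paused before a new ReplicaSet/pod appeared), listing the ReplicaSets behind the node deployment shows whether the new pod template was ever instantiated. A small check, using the names from this report:

```
# The IMAGES column of `-o wide` shows whether a ReplicaSet for the new image
# was ever generated, or whether only the one for the old image exists.
oc -n openshift-logging get rs -o wide | grep elasticsearch-cdm-e96wmaeq-1
```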
Must we wait for the green status before the upgrade?
$ oc logs elasticsearch-operator-555884d8b4-wbfck |tee eo-pod.logs
time="2020-07-02T06:26:56Z" level=warning msg="Unable to parse loglevel \"\""
{"level":"info","ts":1593671216.8078368,"logger":"cmd","msg":"Go Version: go1.12.12"}
{"level":"info","ts":1593671216.8078604,"logger":"cmd","msg":"Go OS/Arch: linux/amd64"}
{"level":"info","ts":1593671216.807865,"logger":"cmd","msg":"Version of operator-sdk: v0.8.2"}
{"level":"info","ts":1593671216.808159,"logger":"leader","msg":"Trying to become the leader."}
{"level":"info","ts":1593671216.9419112,"logger":"leader","msg":"Found existing lock","LockOwner":"elasticsearch-operator-df8cc886-9vvjn"}
{"level":"info","ts":1593671216.9501584,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1593671218.138512,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1593671220.522263,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1593671225.0637383,"logger":"leader","msg":"Not the leader. Waiting."}
{"level":"info","ts":1593671233.7697678,"logger":"leader","msg":"Became the leader."}
{"level":"info","ts":1593671233.8965669,"logger":"cmd","msg":"Registering Components."}
{"level":"info","ts":1593671233.8968284,"logger":"kubebuilder.controller","msg":"Starting EventSource","controller":"elasticsearch-controller","source":"kind source: /, Kind="}
{"level":"info","ts":1593671234.0148737,"logger":"cmd","msg":"failed to create or get service for metrics: services \"elasticsearch-operator\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}
{"level":"info","ts":1593671234.014903,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1593671234.115109,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"elasticsearch-controller"}
{"level":"info","ts":1593671234.2152753,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"elasticsearch-controller","worker count":1}
time="2020-07-02T06:30:46Z" level=info msg="Requested to update node 'elasticsearch-cdm-e96wmaeq-1', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-07-02T06:31:17Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-e96wmaeq-1: / green"
time="2020-07-02T06:31:17Z" level=warning msg="Error occurred while updating node elasticsearch-cdm-e96wmaeq-1: Cluster not in green state before beginning upgrade: "
@Anping -- I am able to recreate this locally with the latest EO code, and it seems to be due to the deployment being paused before we see the new pod roll out.

@ewolinetz The scenario:
1. We released CSV bundles using a non-existing Elasticsearch pull spec (registry.example.com/openshift/ose-logging-elasticsearch55@xxxx).
2. The ES pods are in ImagePullBackOff status because the image 'registry.example.com/openshift/ose-logging-elasticsearch55@xxxx' can't be found. No running pods, no ES status.
3. We provide new CSV bundles with a correct pull spec.

Expected result: The CSV is updated automatically. The ES deployment is upgraded automatically. The ES pods are running after the upgrade.
Actual result: The CSV was updated. The ES deployment wasn't upgraded.

@Anping,
the deployment wasn't updated? then I'm not sure I am able to recreate this yet...
In the case where there is an ImagePullBackOff, the operator will bypass trying to do a normal upgrade and instead just update the deployments to fix a broken state:
time="2020-07-02T06:30:46Z" level=info msg="Requested to update node 'elasticsearch-cdm-e96wmaeq-1', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
though for some reason we then see it is trying to do a normal update afterwards:
time="2020-07-02T06:31:17Z" level=info msg="Waiting for cluster to be fully recovered before upgrading elasticsearch-cdm-e96wmaeq-1: / green"
especially because that node is not scheduled for a normal upgrade (indicated to me by upgradeStatus):
{
"conditions": [
{
"lastTransitionTime": "2020-07-02T06:30:46Z",
"message": "Back-off pulling image \"quay.io/openshift-qe-optional-operators/ose-logging-elasticsearch55@sha256:90d85d18846f851b5845b181411a122081c6bd5a62d28d61f4ed12b3288ce6b6\"",
"reason": "ImagePullBackOff",
"status": "True",
"type": "ElasticsearchContainerWaiting"
}
],
"deploymentName": "elasticsearch-cdm-e96wmaeq-1",
"upgradeStatus": {}
}
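The symptom reported throughout this bug is a node deployment that carries the new pod template but never rolls it out because it stays paused. A quick way to confirm that state and, purely as a manual workaround sketch rather than the operator's own flow, to force the pending template out (names are examples from this report; the operator may pause the deployment again on its next reconcile):

```
NS=openshift-logging
DEPLOY=elasticsearch-cdm-e96wmaeq-1   # example node deployment

# Is the deployment paused, and what does its Progressing condition say?
oc -n "$NS" get deployment "$DEPLOY" \
  -o jsonpath='{.spec.paused}{"\n"}{.status.conditions[?(@.type=="Progressing")].message}{"\n"}'

# Manual workaround sketch: resume the rollout and watch it complete.
oc -n "$NS" rollout resume deployment/"$DEPLOY"
oc -n "$NS" rollout status deployment/"$DEPLOY" --timeout=300s
```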
The ES pods weren't restarted. The elasticsearch deployments are updated, but the deployments are in PAUSED state. The elasticsearch-operator prints the same messages as in Comment 23.

$ oc logs elasticsearch-operator-8599687444-tzsvv
{"level":"info","ts":1595428596.2246425,"logger":"cmd","msg":"Starting the Cmd."}
{"level":"info","ts":1595428596.3249218,"logger":"kubebuilder.controller","msg":"Starting Controller","controller":"elasticsearch-controller"}
{"level":"info","ts":1595428596.4251525,"logger":"kubebuilder.controller","msg":"Starting workers","controller":"elasticsearch-controller","worker count":1}
time="2020-07-22T14:36:37Z" level=info msg="Requested to update node 'elasticsearch-cdm-fg45zjrh-1', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-07-22T14:36:39Z" level=info msg="Requested to update node 'elasticsearch-cdm-fg45zjrh-2', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"
time="2020-07-22T14:36:40Z" level=info msg="Requested to update node 'elasticsearch-cdm-fg45zjrh-3', which is unschedulable. Skipping rolling restart scenario and performing redeploy now"

$ oc get deployment elasticsearch-cdm-2k1giybq-1 -o json | jq '.status'
{
  "conditions": [
    {
      "lastTransitionTime": "2020-07-22T14:28:21Z",
      "lastUpdateTime": "2020-07-22T14:28:21Z",
      "message": "Deployment does not have minimum availability.",
      "reason": "MinimumReplicasUnavailable",
      "status": "False",
      "type": "Available"
    },
    {
      "lastTransitionTime": "2020-07-22T14:36:39Z",
      "lastUpdateTime": "2020-07-22T14:36:39Z",
      "message": "Deployment is paused",
      "reason": "DeploymentPaused",
      "status": "Unknown",
      "type": "Progressing"
    }
  ],
  "observedGeneration": 5,
  "replicas": 1,
  "unavailableReplicas": 1
}

Failed in 4.5. The elasticsearch pod is stuck in ImagePullBackOff. After the CSV was updated, the elasticsearch deployments weren't updated.

$ oc get pods -n openshift-logging
NAME                                            READY   STATUS             RESTARTS   AGE
cluster-logging-operator-7c57bb66d-7x76c        1/1     Running            0          11m
curator-1596465600-9gl4k                        1/1     Running            0          2m1s
elasticsearch-cdm-s08q4j20-1-5dc9fd4499-547sp   1/2     ImagePullBackOff   0          10m
elasticsearch-cdm-s08q4j20-2-68d5946564-6fc9s   0/2     Pending            0          9m12s
elasticsearch-cdm-s08q4j20-3-bb49fd6d9-pb578    0/2     Pending            0          8m11s

EO logs:
/util/wait/wait.go:88"}
time="2020-08-03T14:32:21Z" level=info msg="Updating status of Kibana"
time="2020-08-03T14:32:21Z" level=info msg="Kibana status successfully updated"
time="2020-08-03T14:32:50Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:32:51Z" level=info msg="Updating status of Kibana"
time="2020-08-03T14:32:51Z" level=info msg="Kibana status successfully updated"
time="2020-08-03T14:33:21Z" level=info msg="Kibana status successfully updated"
time="2020-08-03T14:33:52Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:34:53Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:35:23Z" level=warning msg="Unable to enable shard allocation: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: dial tcp 172.30.233.100:9200: i/o timeout"
time="2020-08-03T14:38:53Z" level=warning msg="Unable to list existing templates in order to reconcile stale ones: Get https://elasticsearch.openshift-logging.svc:9200/_template: dial tcp 172.30.233.100:9200: i/o timeout"
time="2020-08-03T14:39:23Z" level=error msg="Error creating index template for mapping app: Put https://elasticsearch.openshift-logging.svc:9200/_template/ocp-gen-app: dial tcp 172.30.233.100:9200: i/o timeout"
{"level":"error","ts":1596465563.4152765,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"elasticsearch-controller","request":"openshift-logging/elasticsearch","error":"Failed to reconcile IndexMangement for Elasticsearch cluster: Put https://elasticsearch.openshift-logging.svc:9200/_template/ocp-gen-app: dial tcp 172.30.233.100:9200: i/o timeout","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/github.com/go-logr/zapr/zapr.go:128\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/openshift/elasticsearch-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
time="2020-08-03T14:39:55Z" level=warning msg="Unable to clear transient shard allocation for \"openshift-logging\" \"openshift-logging\": Response: , Error: Put https://elasticsearch.openshift-logging.svc:9200/_cluster/settings: dial tcp 172.30.233.100:9200: i/o timeout"
time="2020-08-03T14:40:55Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:41:55Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"
time="2020-08-03T14:42:55Z" level=warning msg="unable to get cluster node count. E: Get https://elasticsearch.openshift-logging.svc:9200/_cluster/health: dial tcp 172.30.233.100:9200: i/o timeout\r\n"

@Anping, can you please verify if this has been fixed in 4.6? If so, this PR should fix it for 4.5: https://github.com/openshift/elasticsearch-operator/pull/436

Passed when using elasticsearch-operator:v4.6.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198