Bug 1420256 - [intservice_public_324] ES is in pending status if deploy logging with dynamic PV enabled
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.5.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.5.z
Assignee: Jeff Cantrill
QA Contact: Xia Zhao
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2017-02-08 09:56 UTC by Xia Zhao
Modified: 2017-10-25 13:00 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-10-25 13:00:48 UTC
Target Upstream Version:


Attachments
full ansible execution logs with dynamic PV enabled (889.98 KB, text/plain)
2017-02-08 10:06 UTC, Xia Zhao
pvc_foo-ops--0 (1.08 KB, text/plain)
2017-02-13 08:51 UTC, Xia Zhao
pvc_foo--0 (1.06 KB, text/plain)
2017-02-13 08:51 UTC, Xia Zhao
pod_es_ops (6.98 KB, text/plain)
2017-02-13 08:51 UTC, Xia Zhao
pod_es (6.94 KB, text/plain)
2017-02-13 08:52 UTC, Xia Zhao
ansible log - 20170220 (1.29 MB, text/plain)
2017-02-20 09:12 UTC, Junqi Zhao
inventory_March_1st_2017 (1.10 KB, text/plain)
2017-03-01 09:25 UTC, Xia Zhao
Ansible log when used inventory_March_1st_2017 on a cloudprovider enabled env (1.72 MB, text/plain)
2017-03-01 09:36 UTC, Xia Zhao
The es-ops pvc (JSON output) (1.08 KB, text/plain)
2017-03-03 09:43 UTC, Xia Zhao
The es pvc (JSON output) which worked fine (1.58 KB, text/plain)
2017-03-03 09:44 UTC, Xia Zhao
es ops pod log, shows NoVolumeZoneConflict error (2.17 KB, text/plain)
2017-03-06 07:06 UTC, Junqi Zhao


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:3049 0 normal SHIPPED_LIVE OpenShift Container Platform 3.6, 3.5, and 3.4 bug fix and enhancement update 2017-10-25 15:57:15 UTC

Description Xia Zhao 2017-02-08 09:56:45 UTC
Description of problem:
Deploying logging with the ansible scripts, with dynamic PV enabled for ES and ES-ops storage, finishes successfully, but the ES pods stay in Pending status. The PVCs are created, but no PVs are actually created:
# oc get po
logging-es-ko2qgp0w-1-gnvwv       0/1       Pending   0          8m
logging-es-ops-hdt1m5c8-1-cd2l3   0/1       Pending   0          8m

# oc get pvc
NAME         STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
foo--0       Pending                                      3m
foo-ops--0   Pending                                      3m

# oc get pv
NAME           CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                 REASON    AGE
regpv-volume   17G        RWX           Retain          Bound     default/regpv-claim             6h

Version-Release number of selected component (if applicable):
https://github.com/openshift/openshift-ansible

How reproducible:
Always

Steps to Reproduce:
1. Uninstall logging 3.5.0 stacks by executing ansible scripts, enable dynamic PV for ES storage in the inventory file:
openshift_logging_es_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_prefix=foo-
openshift_logging_es_pvc_size=1G
openshift_logging_use_ops=true
openshift_logging_es_ops_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_ops_pvc_prefix=foo-ops-
openshift_logging_es_ops_pvc_size=1G
3. After the ansible scripts finish running, check the PV, PVC, and ES pod status

Actual results:
ES is in pending status

Expected results:
ES should be in running status

Additional info:
The issue did not reproduce when using the logging deployer.
Full ansible execution log attached.

Comment 1 Xia Zhao 2017-02-08 10:06:09 UTC
Created attachment 1248574 [details]
full ansible execution logs with dynamic PV enabled

Comment 2 Jeff Cantrill 2017-02-09 13:53:03 UTC
Is your cluster set up for dynamic provisioning? Here are the 3.2 docs, but the same applies to 3.5 clusters as far as I know: https://docs.openshift.com/enterprise/3.2/install_config/persistent_storage/dynamically_provisioning_pvs.html#enabling-provisioner-plugins

Comment 3 Xia Zhao 2017-02-10 02:11:10 UTC
(In reply to Jeff Cantrill from comment #2)
> Is your cluster set-up for dynamic provisioning? Here are the 3.2 docs, but
> same applies to the 3.5 clusters as far as I know:
> https://docs.openshift.com/enterprise/3.2/install_config/persistent_storage/
> dynamically_provisioning_pvs.html#enabling-provisioner-plugins

Yes. I enabled cloud-provider in master-config when launching the OCP env.

Comment 4 Jeff Cantrill 2017-02-10 17:47:35 UTC
Please attach the JSON output for the various resources (pod, pvc, pv). Assuming we are producing the same objects as a comparable 3.4 deployment, this would not be a logging bug.

Comment 5 Xia Zhao 2017-02-13 08:50:33 UTC
Attached the JSON output of es, es-ops pods, and the pvcs. No pv was actually created.

# oc get po
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-lm9x9           1/1       Running   1          4m
logging-curator-ops-1-5427s       1/1       Running   1          4m
logging-es-b6ozqvp5-1-kcxjl       0/1       Pending   0          4m
logging-es-ops-g7s5xgaz-1-k1wdj   0/1       Pending   0          4m
logging-fluentd-c159n             1/1       Running   0          4m
logging-fluentd-k4z36             1/1       Running   0          4m
logging-kibana-1-b8pxh            2/2       Running   0          4m
logging-kibana-ops-1-522z7        2/2       Running   0          4m

# oc get pvc
NAME         STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
foo--0       Pending                                      5m
foo-ops--0   Pending                                      5m

# oc get pv
NAME           CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                 REASON    AGE
regpv-volume   17G        RWX           Retain          Bound     default/regpv-claim             35m
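
A minimal sketch of how the listing above could be checked automatically, assuming the plain-text column layout that `oc get pvc` printed in this comment. The helper name `pending_claims` is hypothetical and is not part of `oc` or openshift-ansible.

```python
from typing import List


def pending_claims(table: str) -> List[str]:
    """Return the names of PVCs whose STATUS column reads Pending."""
    pending = []
    for line in table.splitlines():
        fields = line.split()
        # Skip blank lines and the header row.
        if len(fields) < 2 or fields[0] == "NAME":
            continue
        name, status = fields[0], fields[1]
        if status == "Pending":
            pending.append(name)
    return pending


# Sample copied from the `oc get pvc` output in this comment.
sample = """\
NAME         STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
foo--0       Pending                                      5m
foo-ops--0   Pending                                      5m
"""
print(pending_claims(sample))
```

Splitting on whitespace is enough here because only the first two columns are read; the empty VOLUME and CAPACITY columns of a Pending claim do not affect them.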

Comment 6 Xia Zhao 2017-02-13 08:51:02 UTC
Created attachment 1249779 [details]
pvc_foo-ops--0

Comment 7 Xia Zhao 2017-02-13 08:51:26 UTC
Created attachment 1249780 [details]
pvc_foo--0

Comment 8 Xia Zhao 2017-02-13 08:51:59 UTC
Created attachment 1249781 [details]
pod_es_ops

Comment 9 Xia Zhao 2017-02-13 08:52:22 UTC
Created attachment 1249782 [details]
pod_es

Comment 10 Xia Zhao 2017-02-13 08:55:53 UTC
From the ES pods' JSON output, the es-ops pod is bound to the wrong claimName, foo--0; it should be foo-ops--0:

# oc get po logging-es-ops-g7s5xgaz-1-k1wdj -o json | grep foo
                    "claimName": "foo--0"

# oc get po logging-es-b6ozqvp5-1-kcxjl -o json | grep foo
                    "claimName": "foo--0"

openshift_logging_es_pvc_prefix=foo-
openshift_logging_es_ops_pvc_prefix=foo-ops-
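
The `grep` check above can also be sketched in Python, reading the claim name from the pod JSON rather than matching text. The helper names `claim_names` and `expected_claim` are hypothetical, and the naming rule (prefix, then `-`, then an index) is only inferred from the claims seen in this report (`foo-` gives `foo--0`, `foo-ops-` gives `foo-ops--0`); it is not taken from the openshift-ansible source.

```python
import json
from typing import List


def claim_names(pod_json: str) -> List[str]:
    """Return the PVC claim names referenced by a pod's volumes."""
    pod = json.loads(pod_json)
    volumes = pod.get("spec", {}).get("volumes", [])
    return [
        v["persistentVolumeClaim"]["claimName"]
        for v in volumes
        if "persistentVolumeClaim" in v
    ]


def expected_claim(prefix: str, index: int = 0) -> str:
    """Assumed naming rule, inferred from this report: <prefix>-<index>."""
    return f"{prefix}-{index}"


# Trimmed stand-in for `oc get po <es-ops pod> -o json`; only the
# volume section that matters for this check is reproduced.
es_ops_pod = json.dumps({
    "spec": {
        "volumes": [
            {
                "name": "elasticsearch-storage",
                "persistentVolumeClaim": {"claimName": "foo--0"},
            }
        ]
    }
})

# The es-ops pod should reference foo-ops--0, but it references foo--0.
print(claim_names(es_ops_pod))
print(expected_claim("foo-ops-") in claim_names(es_ops_pod))
```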

Comment 11 Jeff Cantrill 2017-02-16 14:20:48 UTC
Possibly related to: https://bugzilla.redhat.com/show_bug.cgi?id=1399523

Comment 12 Jeff Cantrill 2017-02-16 16:39:22 UTC
I am unable to recreate, but we have made some changes recently in openshift-ansible in this area that may have resolved this. I am unable to point to specific merges or commits.  I recently tested with:

openshift-ansible: HEAD - bdbb8d2ec6e81ec0eb8b5b5c512583392af2004d

Partial Inventory:

openshift_logging_es_ops_pvc_prefix=foo-ops-
openshift_logging_es_ops_pvc_size=1G

Comment 13 Junqi Zhao 2017-02-20 09:10:17 UTC
Tested on GCE with the latest openshift-ansible; the same error that xizhao reported before still occurs.

# oc get pv
No resources found.
# oc get pvc
NAME         STATUS    VOLUME    CAPACITY   ACCESSMODES   AGE
foo--0       Pending                                      13m
foo-ops--0   Pending                                      13m
# oc get po
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-kzbmp           1/1       Running   3          12m
logging-curator-ops-1-1l720       1/1       Running   3          12m
logging-es-1bmm4loa-1-3g2f7       0/1       Pending   0          12m
logging-es-ops-xgdflp96-1-4mtsj   0/1       Pending   0          12m
logging-fluentd-x9b33             1/1       Running   0          13m
logging-kibana-1-tdrrk            2/2       Running   0          12m
logging-kibana-ops-1-fhtv8        2/2       Running   0          12m

# oc get po logging-es-1bmm4loa-1-3g2f7 -o json| grep foo -A 2 -B 2
                "name": "elasticsearch-storage",
                "persistentVolumeClaim": {
                    "claimName": "foo--0"
                }
            },

# oc get po logging-es-ops-xgdflp96-1-4mtsj -o json| grep foo -A 2 -B 2
                "name": "elasticsearch-storage",
                "persistentVolumeClaim": {
                    "claimName": "foo--0"
                }
            },

Reopening this defect; the full ansible run log is attached.

Comment 14 Junqi Zhao 2017-02-20 09:12:07 UTC
Created attachment 1255624 [details]
ansible log - 20170220

Comment 15 Jeff Cantrill 2017-02-20 20:55:48 UTC
fixed in https://github.com/openshift/openshift-ansible/pull/3431

Comment 16 Xia Zhao 2017-03-01 09:24:51 UTC
Retested on AWS with openshift-ansible-3.5.15-1.git.0.8d2a456.el7.noarch; the dynamic PV still was not created. Attached are the inventory file I used and the ansible execution log.

I also consider it a regression that no PVC is created even though these parameters are specified:

openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_prefix=foo-

openshift_logging_es_ops_pvc_dynamic=true
openshift_logging_es_ops_pvc_prefix=foo-ops-

[root@ip-172-18-0-24 ~]# oc get pv
No resources found.
[root@ip-172-18-0-24 ~]# oc get pvc -n logging
No resources found.

Comment 17 Xia Zhao 2017-03-01 09:25:54 UTC
Created attachment 1258615 [details]
inventory_March_1st_2017

Comment 18 Xia Zhao 2017-03-01 09:36:34 UTC
Created attachment 1258619 [details]
Ansible log when used inventory_March_1st_2017 on a cloudprovider enabled env

Comment 19 Jeff Cantrill 2017-03-02 16:48:34 UTC
Can you also provide the following:

<bc> get a dump of their PVC and of the storageclass
<bc> they need a storage class defined that they can provision against

I think we already have the PVC but need understanding of the storageclass.  To my knowledge, we have parity with PVC generation between 3.5 ansible and 3.4 deployer so I would expect logging to still deploy using dynamic PVC allocation.

Comment 20 Jeff Cantrill 2017-03-02 21:07:23 UTC
I noticed the PVCs you have do not have the proper annotation. But during investigation I found an issue with creating PVCs in general that is fixed by:

https://github.com/openshift/openshift-ansible/pull/3548

This resolves the issue assuming the logging namespace does not have PVCs already

Comment 21 openshift-github-bot 2017-03-02 21:40:58 UTC
Commits pushed to master at https://github.com/openshift/openshift-ansible

https://github.com/openshift/openshift-ansible/commit/bd7f9386fa2cbe201509486b9ba9ac74b23e8f8a
bug 1420256. Initialize openshift_logging pvc_facts to empty

https://github.com/openshift/openshift-ansible/commit/4630305622b6f8d6957f93da72b15fe4bda1fd02
Merge pull request #3548 from jcantrill/bz_1420256_again_reset_pvc_facts

bug 1420256. Initialize openshift_logging pvc_facts to empty

Comment 22 Xia Zhao 2017-03-03 09:40:59 UTC
(In reply to Jeff Cantrill from comment #19)
> Can you also provide the following:
> 
> <bc> get a dump of their PVC and of the storageclass
> <bc> they need a storage class defined that they can provision against
> 
> I think we already have the PVC but need understanding of the storageclass. 
> To my knowledge, we have parity with PVC generation between 3.5 ansible and
> 3.4 deployer so I would expect logging to still deploy using dynamic PVC
> allocation.

Hi Jeff,

Tested with today's latest code from the openshift-ansible repo, master branch, HEAD revision 4630305622b6f8d6957f93da72b15fe4bda1fd02. The es pod now starts up with a dynamic PV provisioned, but the es-ops pod still fails to start. Could you please take a further look? Thanks!

1. Deployed with these parameters in the inventory file:

openshift_logging_es_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_prefix=foo-
openshift_logging_es_pvc_size=1G
openshift_logging_use_ops=true
openshift_logging_es_ops_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_ops_pvc_prefix=foo-ops-
openshift_logging_es_ops_pvc_size=1G

2. The es pod is bound to a dynamic PV, but the es-ops pod is Pending:

# oc get po
logging-es-lecuuah9-1-1cq38       1/1       Running   0          9m
logging-es-ops-j9b8tf1m-1-vmllt   0/1       Pending   0          9m


3. 
# oc get pvc
NAME         STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
foo--0       Bound     pvc-589d4995-fff0-11e6-8651-0e370a4933aa   1Gi        RWO           9m
foo-ops--0   Pending                                                                       9m

# oc describe po logging-es-lecuuah9-1-1cq38
...
  elasticsearch-storage:
    Type:	PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:	foo--0
    ReadOnly:	false
...

# oc get pv
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM            REASON    AGE
pvc-589d4995-fff0-11e6-8651-0e370a4933aa   1Gi        RWO           Delete          Bound     logging/foo--0             9m

4. Please find the JSON output of pvc foo-ops--0 in the attachment

Comment 23 Xia Zhao 2017-03-03 09:43:45 UTC
Created attachment 1259461 [details]
The es-ops pvc (JSON output)

Comment 24 Xia Zhao 2017-03-03 09:44:18 UTC
Created attachment 1259462 [details]
The es pvc (JSON output) which worked fine

Comment 25 Jeff Cantrill 2017-03-03 18:46:18 UTC
You don't have the ops dynamic variable set correctly:

openshift_logging_es_ops_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_ops_pvc_prefix=foo-ops-
openshift_logging_es_ops_pvc_size=1G


It should be:

openshift_logging_es_ops_pvc_dynamic=true
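
A mistake like this is easy to miss by eye, so here is a minimal lint sketch for the inventory variables. It rests on an assumption that goes beyond this report: that every `openshift_logging_es_*` variable set for the main cluster should have an `openshift_logging_es_ops_*` counterpart when `openshift_logging_use_ops=true`. Only the variable names quoted in this thread are known; the helper names are hypothetical, and the parser handles one `key=value` per line as in the snippets above.

```python
from typing import Dict, List

ES_PREFIX = "openshift_logging_es_"
OPS_PREFIX = "openshift_logging_es_ops_"


def parse_inventory_vars(text: str) -> Dict[str, str]:
    """Parse simple key=value lines; a later duplicate overrides an earlier one."""
    settings: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blanks, comments, section headers, and lines without '='.
        if not line or line.startswith(("#", "[")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def missing_ops_overrides(text: str) -> List[str]:
    """List the es_ops_* variables that have an es_* twin but are not set."""
    settings = parse_inventory_vars(text)
    if settings.get("openshift_logging_use_ops", "false").lower() != "true":
        return []
    missing = []
    for key in sorted(settings):
        if key.startswith(ES_PREFIX) and not key.startswith(OPS_PREFIX):
            ops_key = OPS_PREFIX + key[len(ES_PREFIX):]
            if ops_key not in settings:
                missing.append(ops_key)
    return missing


# Inventory fragment as posted in comment 22.
inventory = """\
openshift_logging_es_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_pvc_prefix=foo-
openshift_logging_es_pvc_size=1G
openshift_logging_use_ops=true
openshift_logging_es_ops_cluster_size=1
openshift_logging_es_pvc_dynamic=true
openshift_logging_es_ops_pvc_prefix=foo-ops-
openshift_logging_es_ops_pvc_size=1G
"""
print(missing_ops_overrides(inventory))
```

The duplicated `openshift_logging_es_pvc_dynamic` line parses without complaint, which is why the check compares the two variable families instead of looking for syntax errors.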

Comment 26 Junqi Zhao 2017-03-06 07:05:25 UTC
@Jeff

Used the correct parameter and tested on GCE (dynamicProvisioningEnabled is true):

# oc get pv
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                REASON    AGE
pvc-c4238016-0234-11e7-91c0-42010af00019   1Gi        RWO           Delete          Bound     logging/foo--0                 39m
pvc-c775476e-0234-11e7-91c0-42010af00019   1Gi        RWO           Delete          Bound     logging/foo-ops--0             39m

# oc get pvc
NAME         STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
foo--0       Bound     pvc-c4238016-0234-11e7-91c0-42010af00019   1Gi        RWO           39m
foo-ops--0   Bound     pvc-c775476e-0234-11e7-91c0-42010af00019   1Gi        RWO           39m


# oc get po
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-8zsbk           1/1       Running   0          37m
logging-curator-ops-1-ml9jg       1/1       Running   7          37m
logging-es-0x646pi9-1-nx5p3       1/1       Running   0          37m
logging-es-ops-xrrpzwmo-1-c4vpt   0/1       Pending   0          37m
logging-fluentd-8h5z9             1/1       Running   0          39m
logging-kibana-1-m448l            2/2       Running   0          37m
logging-kibana-ops-1-2m2md        2/2       Running   0          37m


# oc get po logging-es-0x646pi9-1-nx5p3 -o json| grep foo -A 2 -B 2
                "name": "elasticsearch-storage",
                "persistentVolumeClaim": {
                    "claimName": "foo--0"
                }
            },

# oc get po logging-es-ops-xrrpzwmo-1-c4vpt -o json| grep foo -A 2 -B 2
                "name": "elasticsearch-storage",
                "persistentVolumeClaim": {
                    "claimName": "foo-ops--0"
                }
            },

The ES OPS pod is in Pending status because of a NoVolumeZoneConflict error. This is a known issue for GCE: https://bugzilla.redhat.com/show_bug.cgi?id=1397672

Comment 27 Junqi Zhao 2017-03-06 07:06:13 UTC
Created attachment 1260285 [details]
es ops pod log, shows NoVolumeZoneConflict error

Comment 28 Junqi Zhao 2017-03-06 07:11:19 UTC
(In reply to Junqi Zhao from comment #27)
> Created attachment 1260285 [details]
> es ops pod log, shows NoVolumeZoneConflict error

The ES and ES OPS pods are bound to the correct PVCs now.

Comment 29 Junqi Zhao 2017-03-06 08:03:57 UTC
Tested on EC2 with dynamic PV enabled: the ES and ES OPS pods are running well and are bound to the correct PVCs now. Since we specified the wrong parameter, closing this defect as 'Not A Bug'.
# oc get po
NAME                              READY     STATUS    RESTARTS   AGE
logging-curator-1-x135f           1/1       Running   0          11m
logging-curator-ops-1-8ntj1       1/1       Running   0          11m
logging-es-h1zzrt4p-1-v1h85       1/1       Running   0          11m
logging-es-ops-17t409u3-1-t7t8q   1/1       Running   0          11m
logging-fluentd-vhtvk             1/1       Running   0          13m
logging-kibana-1-rsgvv            2/2       Running   0          11m
logging-kibana-ops-1-bvzt6        2/2       Running   0          11m
# oc get po logging-es-h1zzrt4p-1-v1h85 -o json| grep foo -A 2 -B 2
                "name": "elasticsearch-storage",
                "persistentVolumeClaim": {
                    "claimName": "foo--0"
                }
            },
# oc get po logging-es-ops-17t409u3-1-t7t8q -o json| grep foo -A 2 -B 2
                "name": "elasticsearch-storage",
                "persistentVolumeClaim": {
                    "claimName": "foo-ops--0"
                }
            },
# oc get pv
NAME                                       CAPACITY   ACCESSMODES   RECLAIMPOLICY   STATUS    CLAIM                REASON    AGE
pvc-9e3bec68-0240-11e7-9cf8-0e4d82dcdc76   1Gi        RWO           Delete          Bound     logging/foo--0                 15m
pvc-a185e5d5-0240-11e7-9cf8-0e4d82dcdc76   1Gi        RWO           Delete          Bound     logging/foo-ops--0             15m
# oc get pvc
NAME         STATUS    VOLUME                                     CAPACITY   ACCESSMODES   AGE
foo--0       Bound     pvc-9e3bec68-0240-11e7-9cf8-0e4d82dcdc76   1Gi        RWO           15m
foo-ops--0   Bound     pvc-a185e5d5-0240-11e7-9cf8-0e4d82dcdc76   1Gi        RWO           15m

Comment 30 Xia Zhao 2017-03-06 13:41:07 UTC
@juzhao The status should be set to "Verified" since the original issue had been fixed.

Comment 31 Xia Zhao 2017-03-06 13:41:40 UTC
Set to verified according to comment #29

Comment 32 Junqi Zhao 2017-03-07 00:43:43 UTC
(In reply to Xia Zhao from comment #30)
> @juzhao The status should be set to "Verified" since the original issue had
> been fixed.

It should be 'Not a Bug': there is no code change. We wrongly used openshift_logging_es_pvc_dynamic for es_ops and should have used openshift_logging_es_ops_pvc_dynamic. See comment 25.

Comment 33 Xia Zhao 2017-03-07 02:40:40 UTC
(In reply to Junqi Zhao from comment #32)
> (In reply to Xia Zhao from comment #30)
> > @juzhao The status should be set to "Verified" since the original issue had
> > been fixed.
> 
> It should be 'Not a Bug', there is no code change,  we wrongly used
> openshift_logging_es_pvc_dynamic for es_ops, should use 
> openshift_logging_es_ops_pvc_dynamic. see Comment 25

If you read the full history here, you'll find that the ORIGINAL issue was addressed and fixed as mentioned in comment #20, by the PR: https://github.com/openshift/openshift-ansible/pull/3548

Please be careful next time when you resolve a bug as Not A BUG.

Comment 35 errata-xmlrpc 2017-10-25 13:00:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3049

