Bug 1372829

Summary: osp-director-10: PCS cluster down before the Controller upgraded from OSP9 to OSP10 (Error: "cannot start with some cluster nodes being offline").
Product: Red Hat OpenStack
Reporter: Omri Hochman <ohochman>
Component: rhosp-director
Assignee: Angus Thomas <athomas>
Status: CLOSED NOTABUG
QA Contact: Omri Hochman <ohochman>
Severity: medium
Docs Contact:
Priority: medium
Version: 10.0 (Newton)
CC: augol, dbecker, jcoufal, mandreou, mburns, morazi, ohochman, rhel-osp-director-maint, sathlang
Target Milestone: ---
Keywords: Triaged
Target Release: 10.0 (Newton)
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-10-13 10:36:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Omri Hochman 2016-09-02 19:43:49 UTC
osp-director-10: PCS cluster down before the Controller upgraded from OSP9 to OSP10 (Error: "cannot start with some cluster nodes being offline").

Environment:
------------
instack-undercloud-5.0.0-0.20160818065636.41ef775.el7ost.noarch
instack-5.0.0-0.20160802165724.5aabf5c.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160823082523.1106458.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-33.el7ost.noarch
openstack-heat-templates-0.0.1-0.20160822094546.1ac2823.el7ost.noarch
python-heat-tests-7.0.0-0.20160823082523.1106458.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160823082523.1106458.el7ost.noarch
puppet-heat-9.1.0-0.20160815142726.d364553.el7ost.noarch
python-heatclient-1.3.0-0.20160802194627.44dfe53.el7ost.noarch
openstack-heat-common-7.0.0-0.20160823082523.1106458.el7ost.noarch
openstack-heat-api-7.0.0-0.20160823082523.1106458.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20160823140311.072404b.el7ost.noarch



Steps 
------
(1) Follow the guide to upgrade from OSP9 to OSP10: https://gitlab.cee.redhat.com/sathlang/ospd-9-to-10-upgrade#controller-and-block-storage-upgrade

(2) After a successful run of the init command, run:
openstack overcloud deploy --templates --control-scale 3 --compute-scale 1    --neutron-network-type vxlan --neutron-tunnel-types vxlan  --ntp-server 10.5.26.10 --timeout 90 -e /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e network-environment.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml

Results:
---------
The controller upgrade fails because the PCS cluster is down.


Upgrade View:
--------------
2016-08-24 14:02:44 [UpgradeInitConfig]: DELETE_COMPLETE state changed
2016-08-24 14:02:44 [0]: UPDATE_IN_PROGRESS state changed
2016-08-24 14:02:45 [1]: UPDATE_IN_PROGRESS state changed
2016-08-24 14:02:45 [BlockStorageAllNodesDeployment]: UPDATE_COMPLETE state changed
2016-08-24 14:02:45 [CephStorageAllNodesDeployment]: UPDATE_COMPLETE state changed
2016-08-24 14:02:45 [ObjectStorageAllNodesDeployment]: UPDATE_COMPLETE state changed
2016-08-24 14:02:46 [ObjectStorageAllNodesValidationDeployment]: UPDATE_IN_PROGRESS state changed
2016-08-24 14:02:47 [BlockStorageAllNodesValidationDeployment]: UPDATE_IN_PROGRESS state changed
2016-08-24 14:02:48 [CephStorageAllNodesValidationDeployment]: UPDATE_IN_PROGRESS state changed
2016-08-24 14:02:51 [BlockStorageAllNodesValidationDeployment]: UPDATE_COMPLETE state changed
2016-08-24 14:02:51 [ObjectStorageAllNodesValidationDeployment]: UPDATE_COMPLETE state changed
2016-08-24 14:02:52 [CephStorageAllNodesValidationDeployment]: UPDATE_COMPLETE state changed
2016-08-24 14:03:17 [0]: SIGNAL_IN_PROGRESS Signal: deployment babdeb4c-b008-497d-a8aa-537061741749 succeeded
2016-08-24 14:03:17 [0]: UPDATE_COMPLETE state changed
2016-08-24 14:03:17 [overcloud-ComputeAllNodesDeployment-ula6xjjbqnuc]: UPDATE_COMPLETE Stack UPDATE completed successfully
2016-08-24 14:03:18 [ComputeAllNodesDeployment]: UPDATE_COMPLETE state changed
2016-08-24 14:03:19 [ComputeAllNodesValidationDeployment]: UPDATE_IN_PROGRESS state changed
2016-08-24 14:03:19 [overcloud-ComputeAllNodesValidationDeployment-mdmfmjmag7gg]: UPDATE_IN_PROGRESS Stack UPDATE started
2016-08-24 14:03:19 [overcloud-ComputeAllNodesValidationDeployment-mdmfmjmag7gg]: UPDATE_COMPLETE Stack UPDATE completed successfully
2016-08-24 14:03:20 [ComputeAllNodesValidationDeployment]: UPDATE_COMPLETE state changed
2016-08-24 14:03:43 [0]: SIGNAL_IN_PROGRESS Signal: deployment c7145966-5644-4fe5-9727-d0a5685d0597 failed (1)
2016-08-24 14:03:43 [0]: CREATE_FAILED Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-08-24 14:03:46 [2]: SIGNAL_IN_PROGRESS Signal: deployment b374b565-d6cb-409a-8f28-d3d060ea2c31 failed (1)
2016-08-24 14:03:47 [2]: CREATE_FAILED Error: resources[2]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-08-24 14:03:51 [2]: SIGNAL_IN_PROGRESS Signal: deployment 6b404c48-1450-4625-a9c5-a90a8e879ebe succeeded
2016-08-24 14:03:51 [2]: UPDATE_COMPLETE state changed
2016-08-24 14:04:04 [1]: SIGNAL_IN_PROGRESS Signal: deployment bd80f208-f0d4-4f34-8d65-57f4210f516d failed (1)
2016-08-24 14:04:05 [1]: CREATE_FAILED Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-08-24 14:04:05 [overcloud-UpdateWorkflow-57rzzvytb7mc-ControllerPacemakerUpgradeDeployment_Step1-gugu2kls6k5m]: CREATE_FAILED Resource CREATE failed: Error: resources[0]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 1
2016-08-24 14:04:05 [ControllerPacemakerUpgradeDeployment_Step1]: CREATE_FAILED Error: resources.ControllerPacemakerUpgradeDeployment_Step1.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-08-24 14:04:05 [overcloud-UpdateWorkflow-57rzzvytb7mc]: UPDATE_FAILED Error: resources.ControllerPacemakerUpgradeDeployment_Step1.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-08-24 14:04:06 [UpdateWorkflow]: UPDATE_FAILED resources.UpdateWorkflow: Error: resources.ControllerPacemakerUpgradeDeployment_Step1.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-08-24 14:04:06 [ControllerAllNodesDeployment]: UPDATE_FAILED UPDATE aborted
2016-08-24 14:04:06 [overcloud]: UPDATE_FAILED resources.UpdateWorkflow: Error: resources.ControllerPacemakerUpgradeDeployment_Step1.resources[0]: Deployment to server failed: deploy_status_code: Deployment exited with non-zero status code: 1
2016-08-24 14:04:07 [0]: UPDATE_FAILED UPDATE aborted
2016-08-24 14:04:07 [1]: UPDATE_FAILED UPDATE aborted
2016-08-24 14:04:08 [overcloud-ControllerAllNodesDeployment-lqutwxlcrcoi]: UPDATE_FAILED Operation cancelled
2016-08-24 14:04:11 [1]: SIGNAL_FAILED Signal: deployment c737033f-20f4-490e-b679-1ce9b2837bf7 succeeded
Stack overcloud UPDATE_FAILED
Heat Stack update failed.
[stack@undercloud72 ~]$ heat stack-list
WARNING (shell) "heat stack-list" is deprecated, please use "openstack stack list" instead
+--------------------------------------+------------+---------------+---------------------+---------------------+
| id                                   | stack_name | stack_status  | creation_time       | updated_time        |
+--------------------------------------+------------+---------------+---------------------+---------------------+
| 59ba3729-b247-4600-83b7-df119ce96542 | overcloud  | UPDATE_FAILED | 2016-08-23T17:34:16 | 2016-08-24T13:58:36 |
+--------------------------------------+------------+---------------+---------------------+---------------------+
[stack@undercloud72 ~]$ heat resource-list overcloud -n5 | grep -v COMPLETE
WARNING (shell) "heat resource-list" is deprecated, please use "openstack stack resource list" instead
+--------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------+-----------------+---------------------+------------------------------------------------------------------------------------------------------------------------+
| resource_name                              | physical_resource_id                          | resource_type                                                                                 | resource_status | updated_time        | stack_name                                                                                                             |
+--------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------+-----------------+---------------------+------------------------------------------------------------------------------------------------------------------------+
| UpdateWorkflow                             | 6c447028-cb03-4044-acfd-b7b6f3ccc6f4          | OS::TripleO::Tasks::UpdateWorkflow                                                            | UPDATE_FAILED   | 2016-08-24T14:02:24 | overcloud                                                                                                              |
| 0                                          | c7145966-5644-4fe5-9727-d0a5685d0597          | OS::Heat::SoftwareDeployment                                                                  | CREATE_FAILED   | 2016-08-24T14:02:35 | overcloud-UpdateWorkflow-57rzzvytb7mc-ControllerPacemakerUpgradeDeployment_Step1-gugu2kls6k5m                          |
| 1                                          | bd80f208-f0d4-4f34-8d65-57f4210f516d          | OS::Heat::SoftwareDeployment                                                                  | CREATE_FAILED   | 2016-08-24T14:02:35 | overcloud-UpdateWorkflow-57rzzvytb7mc-ControllerPacemakerUpgradeDeployment_Step1-gugu2kls6k5m                          |
| ControllerPacemakerUpgradeDeployment_Step1 | 5ee12238-fb71-4616-9219-ca7e5271171e          | OS::Heat::SoftwareDeploymentGroup                                                             | CREATE_FAILED   | 2016-08-24T14:02:35 | overcloud-UpdateWorkflow-57rzzvytb7mc                                                                                  |
| 2                                          | b374b565-d6cb-409a-8f28-d3d060ea2c31          | OS::Heat::SoftwareDeployment                                                                  | CREATE_FAILED   | 2016-08-24T14:02:36 | overcloud-UpdateWorkflow-57rzzvytb7mc-ControllerPacemakerUpgradeDeployment_Step1-gugu2kls6k5m                          |
| ControllerAllNodesDeployment               | 7ec5ffd8-906d-43fe-adce-451a8490327e          | OS::Heat::StructuredDeployments                                                               | UPDATE_FAILED   | 2016-08-24T14:02:39 | overcloud                                                                                                              |
| 0                                          | 799c1464-b61b-4737-85ef-0803ac07fb39          | OS::Heat::StructuredDeployment                                                                | UPDATE_FAILED   | 2016-08-24T14:02:43 | overcloud-ControllerAllNodesDeployment-lqutwxlcrcoi                                                                    |
| 1                                          | c737033f-20f4-490e-b679-1ce9b2837bf7          | OS::Heat::StructuredDeployment                                                                | UPDATE_FAILED   | 2016-08-24T14:02:44 | overcloud-ControllerAllNodesDeployment-lqutwxlcrcoi                                                                    |
+--------------------------------------------+-----------------------------------------------+-----------------------------------------------------------------------------------------------+-----------------+---------------------+------------------------------------------------------------------------------------------------------------------------+

[stack@undercloud72 ~]$ heat deployment-show b374b565-d6cb-409a-8f28-d3d060ea2c31
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED", 
  "server_id": "7996e49b-f08f-4ef6-8969-39354cbb5c40", 
  "config_id": "68a54dc4-a401-424c-aae5-aaf0e2268edc", 
  "output_values": {
    "deploy_stdout": "Error: cluster is not currently running on this node\nERROR: upgrade cannot start with some cluster nodes being offline\n", 
    "deploy_stderr": "+ cluster_sync_timeout=1800\n+ check_cluster\n+ pcs status\n+ grep -E '(cluster is not currently running)|(OFFLINE:)'\n+ echo_error 'ERROR: upgrade cannot start with some cluster nodes being offline'\n+ echo 'ERROR: upgrade cannot start with some cluster nodes being offline'\n+ tee /dev/fd2\n+ exit 1\n", 
    "deploy_status_code": 1
  }, 
  "creation_time": "2016-08-24T14:02:44", 
  "updated_time": "2016-08-24T14:03:46", 
  "input_values": {
    "update_identifier": "", 
    "deploy_identifier": "1472047107"
  }, 
  "action": "CREATE", 
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 1", 
  "id": "b374b565-d6cb-409a-8f28-d3d060ea2c31"
}
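
The xtrace in deploy_stderr above suggests that the upgrade script runs a pre-check roughly like the one below. This is a minimal sketch reconstructed from that trace only, not the script shipped in the templates: the function bodies and the 2>&1 redirection are assumptions.

```shell
#!/bin/bash
# Sketch reconstructed from the deploy_stderr xtrace; the shipped script may differ.

# Print a message on stdout and repeat it on stderr.
echo_error() {
    echo "$@"
    echo "$@" >&2
}

# Return 1 if pcs reports that the cluster is not running on this node or
# that any node is OFFLINE; return 0 otherwise. Matching pcs lines are
# passed through to stdout.
check_cluster() {
    # 2>&1 is an assumption, added in case pcs writes its error to stderr.
    if pcs status 2>&1 | grep -E '(cluster is not currently running)|(OFFLINE:)'; then
        echo_error "ERROR: upgrade cannot start with some cluster nodes being offline"
        return 1
    fi
    return 0
}
```

Running check_cluster as root on each controller before starting the upgrade step should show the same error without going through a full stack update.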
[stack@undercloud72 ~]$ nova list
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| ID                                   | Name                   | Status | Task State | Power State | Networks              |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
| fe7570d7-91ad-431a-bfcb-8786ae7ead4e | overcloud-compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.0.7  |
| 1c1f6c46-1836-4e31-bf35-871e3589f6f0 | overcloud-controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.0.9  |
| 1e958f1d-7697-4433-945e-82c1f4cc18e2 | overcloud-controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.0.8  |
| 7996e49b-f08f-4ef6-8969-39354cbb5c40 | overcloud-controller-2 | ACTIVE | -          | Running     | ctlplane=192.168.0.10 |
+--------------------------------------+------------------------+--------+------------+-------------+-----------------------+
[stack@undercloud72 ~]$ ssh heat-admin@192.168.0.9
The authenticity of host '192.168.0.9 (192.168.0.9)' can't be established.
ECDSA key fingerprint is 57:64:0f:07:33:c4:d9:ae:3d:7c:1b:45:5b:68:39:55.
Are you sure you want to continue connecting (yes/no)? yes


[heat-admin@overcloud-controller-0 ~]$ sudo su -
Last login: Fri Sep  2 14:06:52 UTC 2016 on pts/0

[root@overcloud-controller-0 ~]# pcs status
Error: cluster is not currently running on this node
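
The triage in the session above (list the resources that are not COMPLETE, then show a failed deployment) can be scripted. The sketch below assumes the pipe-delimited table layout shown in the resource list output; failed_deployment_ids is a hypothetical helper name, not part of any client.

```shell
#!/bin/bash
# Hypothetical helper: read a resource-list table on stdin and print the
# physical_resource_id of every OS::Heat::SoftwareDeployment row whose
# status ends in _FAILED. Assumes the column order shown in this report:
# resource_name | physical_resource_id | resource_type | resource_status | ...
failed_deployment_ids() {
    awk -F'|' '
        {
            rid = $3; rtype = $4; rstatus = $5
            gsub(/^[ \t]+|[ \t]+$/, "", rid)
            gsub(/^[ \t]+|[ \t]+$/, "", rtype)
            gsub(/^[ \t]+|[ \t]+$/, "", rstatus)
            if (rtype == "OS::Heat::SoftwareDeployment" && rstatus ~ /_FAILED$/)
                print rid
        }'
}

# Example, using the replacement commands named in the deprecation warnings:
#   openstack stack resource list -n 5 overcloud | failed_deployment_ids |
#       while read -r id; do openstack software deployment show "$id" --long; done
```

Filtering on the resource type keeps the StructuredDeployment rows out. In this report those rows were only aborted or cancelled as a consequence of the Step1 failure.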

Comment 2 Omri Hochman 2016-09-02 19:45:51 UTC
After running "pcs cluster start" on all controllers, the stack update completed:


2016-08-24 15:30:55 [overcloud]: UPDATE_COMPLETE Stack UPDATE completed successfully
Stack overcloud UPDATE_COMPLETE
Overcloud Endpoint: http://10.19.184.210:5000/v2.0
Overcloud Deployed
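
For anyone repeating this workaround, it may help to wait until every node is back online before re-running the deploy command. The sketch below is one way to do that. wait_for_cluster and its default timeout and interval are hypothetical, and the grep pattern is the one visible in the deploy_stderr trace from the description.

```shell
#!/bin/bash
# Hypothetical helper: poll `pcs status` until it no longer reports a stopped
# cluster or an OFFLINE node.
# Usage: wait_for_cluster [timeout_seconds] [poll_interval_seconds]
wait_for_cluster() {
    local timeout=${1:-300} interval=${2:-5} waited=0
    while pcs status 2>&1 | grep -q -E '(cluster is not currently running)|(OFFLINE:)'; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "ERROR: cluster still not fully online after ${timeout}s" >&2
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example (as root on one controller); --all starts the cluster on every node:
#   pcs cluster start --all && wait_for_cluster 600
```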

Comment 3 Sofer Athlan-Guyot 2016-09-06 16:29:16 UTC
I went past this stage today. Are you sure that the problem was not local to your deployment?

Comment 4 Omri Hochman 2016-09-09 00:16:46 UTC
(In reply to Sofer Athlan-Guyot from comment #3)
> I went past this stage today. Are you sure that the problem was not local
> to your deployment?

Hi Sofer, it reproduces consistently on my bare-metal (BM) setup. I've opened a follow-up BZ for failed resources of the gnocchi service, which occur right after I start the pcs cluster manually:
https://bugzilla.redhat.com/show_bug.cgi?id=1374531

Comment 5 Omri Hochman 2016-09-11 02:55:46 UTC
This could have happened because I included the file /usr/share/openstack-tripleo-heat-templates/environments/storage-environment.yaml in my deployment command while my setup didn't include Ceph nodes.

When I used Ceph nodes in my deployment, the issue didn't occur. The same applies to https://bugzilla.redhat.com/show_bug.cgi?id=1374531
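
If the environment files really are the cause, one way to avoid the mismatch is to build the -e list from the actual node layout. The sketch below is only an illustration: upgrade_env_args and its Ceph node count argument are hypothetical, and the file names are the ones from the deploy command in the description.

```shell
#!/bin/bash
THT=/usr/share/openstack-tripleo-heat-templates

# Print the "-e <file>" arguments for the controller upgrade run, one word
# per line. $1 is the number of Ceph storage nodes in the overcloud (a
# hypothetical input for this sketch); storage-environment.yaml is included
# only when that number is greater than zero.
upgrade_env_args() {
    local ceph_nodes=${1:-0}
    if [ "$ceph_nodes" -gt 0 ]; then
        printf '%s\n' -e "$THT/environments/storage-environment.yaml"
    fi
    printf '%s\n' \
        -e "$THT/environments/network-isolation.yaml" \
        -e network-environment.yaml \
        -e "$THT/environments/major-upgrade-pacemaker.yaml"
}

# Example for a setup without Ceph nodes:
#   mapfile -t env_args < <(upgrade_env_args 0)
#   openstack overcloud deploy --templates <other options> "${env_args[@]}"
```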

Comment 6 Marios Andreou 2016-10-13 10:36:09 UTC
Ok given comment #4 I think we can close this as not a bug for now (environmental, incorrect environment files specified)? Please reopen if you disagree.

Comment 7 Marios Andreou 2016-10-13 10:38:08 UTC
(In reply to marios from comment #6)
> Ok given comment #4 I think we can close this as not a bug for now
> (environmental, incorrect environment files specified)? Please reopen if you
> disagree.

Sorry, I meant comment #5, where Omri explains inclusion of the storage-environment file

Comment 9 Amit Ugol 2018-05-02 10:54:11 UTC
closed, no need for needinfo.