Bug 1382863

Summary: osp-director-10: Major Upgrade OSP9 -> 10 with Ceph-node, fails on : "Ceph cluster status to go HEALTH_OK\nWARNING"
Product: Red Hat OpenStack
Reporter: Omri Hochman <ohochman>
Component: documentation
Assignee: Dan Macpherson <dmacpher>
Status: CLOSED NOTABUG
QA Contact: RHOS Documentation Team <rhos-docs>
Severity: high
Docs Contact:
Priority: high
Version: 10.0 (Newton)
CC: dbecker, dmacpher, flucifre, jcoufal, jomurphy, josorior, lbopf, mburns, morazi, rhel-osp-director-maint, srevivo
Target Milestone: ---
Keywords: Documentation, Reopened
Target Release: 10.0 (Newton)
Hardware: x86_64
OS: All
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-02-15 16:23:04 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:

Description Omri Hochman 2016-10-07 22:31:54 UTC
osp-director-10: Major Upgrade OSP9 -> 10 with Ceph node fails on: "Could not find class ::tripleo::trusted_cas for overcloud-controller"

Environment: 
-------------
instack-undercloud-5.0.0-0.20160930175750.9d2a655.el7ost.noarch
instack-5.0.0-1.el7ost.noarch
python-heat-agent-0.0.1-0.20160920204709.f123aa1.el7ost.noarch
openstack-heat-templates-0.0.1-0.20160920204709.f123aa1.el7ost.noarch
openstack-heat-common-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
python-heat-tests-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
openstack-tripleo-heat-templates-compat-2.0.0-34.3.el7ost.noarch
puppet-heat-9.4.0-1.1.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20161003064637.d636e3a.1.1.el7ost.noarch
python-heatclient-1.5.0-0.20161001073130.3c3f8ee.el7ost.noarch
openstack-heat-api-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch


Steps : 
---------
(1) Deploy OSP9  (environment with Ceph node) 
(2) Attempt to Upgrade according to https://gitlab.cee.redhat.com/sathlang/ospd-9-to-10-upgrade/


Results: 
---------
Upgrade failed during the "Upgrade Controller and Block-storage" step.
 
-> Stack overcloud UPDATE_FAILED 


heat deployment-show
----------------------
more /var/lib/heat-config/deployed/4edf31ec-53e5-407a-a6f1-0f2f4b9bf98d.notify.json
{
  "deploy_stdout": "", 
  "deploy_stderr": "\u001b[1;31mError: Could not find class ::tripleo::trusted_cas for overcloud-controller-0.localdomain on node overcloud-controller-0.localdomain\u001b[0m\n\u001b[1;31mErr
or: Could not find class ::tripleo::trusted_cas for overcloud-controller-0.localdomain on node overcloud-controller-0.localdomain\u001b[0m\n", 
  "deploy_status_code": 1
}



parameter_defaults:
  controllerExtraConfig:
  # In releases before Mitaka, HeatWorkers doesn't modify
  # num_engine_workers, so handle via heat::config 
    heat::config::heat_config:
      DEFAULT/num_engine_workers:
        value: 1
    heat::api_cloudwatch::enabled: false
    heat::api_cfn::enabled: false
  HeatWorkers: 1
  CeilometerWorkers: 1
  CinderWorkers: 1
  GlanceWorkers: 1
  KeystoneWorkers: 1
  NeutronWorkers: 1
  NovaWorkers: 1
  SwiftWorkers: 1
  IgnoreCephUpgradeWarnings: true
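
Side note on the heat::config override above: it is rendered as a plain [DEFAULT] option in heat.conf on the controllers, so it can be verified directly. A quick check (a sketch, assuming the default /etc/heat/heat.conf path used by the heat packages):

grep num_engine_workers /etc/heat/heat.conf   # should show the value 1 set above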


Upgrade view:
--------------
14:38:36 2016-10-07 14:32:49Z [CephStorageAllNodesValidationDeployment]: UPDATE_IN_PROGRESS  state changed
14:38:36 2016-10-07 14:32:50Z [overcloud-CephStorageAllNodesValidationDeployment-wflwiinwzehd]: UPDATE_IN_PROGRESS  Stack UPDATE started
14:38:36 2016-10-07 14:32:51Z [overcloud-CephStorageAllNodesValidationDeployment-wflwiinwzehd]: UPDATE_COMPLETE  Stack UPDATE completed successfully
14:38:36 2016-10-07 14:32:51Z [CephStorageAllNodesValidationDeployment]: UPDATE_COMPLETE  state changed
14:38:36 2016-10-07 14:33:34Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.2]: SIGNAL_IN_PROGRESS  Signal: deployment dc02cee8-061a-43f8-abb6-81932e28e9ac succeeded
14:38:36 2016-10-07 14:33:35Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.0]: SIGNAL_IN_PROGRESS  Signal: deployment 937105fa-5ae5-4c0c-a487-adce7ffa6f28 succeeded
14:38:36 2016-10-07 14:33:36Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.2]: UPDATE_COMPLETE  state changed
14:38:36 2016-10-07 14:33:36Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.0]: UPDATE_COMPLETE  state changed
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp.CephMonUpgradeDeployment.1]: SIGNAL_IN_PROGRESS  Signal: deployment d3215c31-333f-4923-ba7f-aacf30d1524e failed (124)
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp.CephMonUpgradeDeployment.1]: CREATE_FAILED  Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp.CephMonUpgradeDeployment]: UPDATE_FAILED  Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp.CephMonUpgradeDeployment]: CREATE_FAILED  resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp]: UPDATE_FAILED  resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:34Z [UpdateWorkflow]: UPDATE_FAILED  resources.UpdateWorkflow: resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:34Z [ControllerAllNodesDeployment]: UPDATE_FAILED  UPDATE aborted
14:38:36 2016-10-07 14:38:34Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.1]: UPDATE_FAILED  UPDATE aborted
14:38:36 2016-10-07 14:38:34Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba]: UPDATE_FAILED  Operation cancelled
14:38:36 2016-10-07 14:38:34Z [overcloud]: UPDATE_FAILED  resources.UpdateWorkflow: resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 
14:38:36  Stack overcloud UPDATE_FAILED 
14:38:36 
14:38:36 ### UPGRADE CONTROLLER AND BLOCKSTORAGE FINISHED ###



heat-engine.log 
-----------------
2016-10-07 10:38:33.646 9685 INFO heat.engine.resource [req-3142213f-6c83-4f27-a171-d9d05af3fe13 - - - - -] CREATE: SoftwareDeploymentGroup "CephMonUpgradeDeployment" [e0234aba-398b-424e-9706-d34f42181549] Stack "overcloud-UpdateWorkflow-dxrn3a2du6lp" [3c8d107d-e90d-4009-a5f0-0b11300bd32c]
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource Traceback (most recent call last):
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 753, in _action_recorder
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     yield
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 855, in _do_action
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     yield self.action_handler_task(action, args=handler_args)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 353, in wrapper
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     step = next(subtask)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 806, in action_handler_task
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     done = check(handler_data)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 375, in check_create_complete
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     if not checker.step():
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 219, in step
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     poll_period = next(self._runner)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 384, in _run_to_completion
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     self).check_update_complete(updater):
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 498, in check_update_complete
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     cookie=cookie)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 404, in _check_status_complete
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     action=action)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource ResourceFailure: resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124

Comment 1 Omri Hochman 2016-10-10 20:44:35 UTC
It might be that the Ceph warnings are the reason for this failure. In that case we would need to add an ignore flag, because deployments with a single Ceph node will always have those warnings.

[stack@undercloud-0 ~]$ heat deployment-show 12d566de-88e4-47f8-9675-ffad9d01f76a
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED", 
  "server_id": "9bd3ba9f-2eaa-4b69-99f0-478bda88d090", 
  "config_id": "a1789b3c-1dbe-4b9c-bff3-80f7dfa1c404", 
  "output_values": {
    "deploy_stdout": "INFO: starting a1789b3c-1dbe-4b9c-bff3-80f7dfa1c404\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\
ARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting 
r Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster s
tus to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\n", 
    "deploy_stderr": "", 
    "deploy_status_code": 124
  }, 
  "creation_time": "2016-10-10T20:04:08Z", 
  "updated_time": "2016-10-10T20:10:36Z", 
  "input_values": {
    "update_identifier": "", 
    "deploy_identifier": "1476129604"
  }, 
  "action": "CREATE", 
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 124", 
  "id": "12d566de-88e4-47f8-9675-ffad9d01f76a"



On Controller:
--------------- 
[root@controller-0 ~]# ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck degraded; 192 pgs stuck unclean; 192 pgs stuck undersized; 192 pgs undersized
[root@controller-0 ~]# ceph health status
status not valid:  status not in detail
Invalid command:  unused arguments: ['status']
health {detail} :  show cluster health
Error EINVAL: invalid command
[root@controller-0 ~]# ceph status
    cluster 1a387610-8ce4-11e6-89aa-525400cc88d3
     health HEALTH_WARN
            192 pgs degraded
            192 pgs stuck degraded
            192 pgs stuck unclean
            192 pgs stuck undersized
            192 pgs undersized
     monmap e1: 3 mons at {controller-0=172.17.3.13:6789/0,controller-1=172.17.3.11:6789/0,controller-2=172.17.3.15:6789/0}
            election epoch 6, quorum 0,1,2 controller-1,controller-0,controller-2
     osdmap e9: 1 osds: 1 up, 1 in
      pgmap v18: 192 pgs, 5 pools, 0 bytes data, 0 objects
            34980 kB used, 39881 MB / 39915 MB avail
                 192 active+undersized+degraded
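
The warning is structural rather than transient here: the pool replication size is larger than the number of OSDs, so the placement groups can never become fully replicated and health never returns to HEALTH_OK on its own. A quick way to confirm that on a controller (a sketch; pool names and size defaults vary per deployment):

ceph osd stat                           # reports "1 osds: 1 up, 1 in" on this environment
ceph osd dump | grep 'replicated size'  # pools typically show "replicated size 3";
                                        # with size > OSD count, PGs stay
                                        # active+undersized+degraded indefinitely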

Comment 2 Omri Hochman 2016-10-11 18:35:07 UTC
Trying to work around this by creating an extra environment file:


cat > /home/stack/ignore-ceph.yaml <<EOF
parameter_defaults:
  IgnoreCephUpgradeWarnings: true
EOF


And adding the following option to the upgrade command (illustrated below):   -e /home/stack/ignore-ceph.yaml
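
For illustration only (the actual template and environment file list comes from the upgrade procedure linked in the Description; the file shown here is only one of the upgrade environments): the point is that the ignore file is appended as the last -e argument so its parameter_defaults take precedence.

openstack overcloud deploy --templates \
  -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml \
  -e /home/stack/ignore-ceph.yaml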

Comment 3 Juan Antonio Osorio 2016-10-12 12:28:19 UTC
I don't think the puppet error of not finding the ::tripleo::trusted_cas class is related to Ceph. That class was not present in mitaka and was introduced in OSP10. Seems that the manifests are old and need to be updated.

Comment 4 Omri Hochman 2016-10-12 12:50:13 UTC
(In reply to Juan Antonio Osorio from comment #3)
> I don't think the puppet error of not finding the ::tripleo::trusted_cas
> class is related to Ceph. That class was not present in mitaka and was
> introduced in OSP10. Seems that the manifests are old and need to be updated.

Please ignore the ::tripleo::trusted_cas error. The issue is that when there are Ceph warnings (which we *always* have with single-Ceph-node deployments):
 
[root@controller-0 ~]# ceph status
    cluster 1a387610-8ce4-11e6-89aa-525400cc88d3
     health HEALTH_WARN
            192 pgs degraded

the upgrade will fail at the "Upgrade Controller and Block-storage" step.

Notes: 
(1) This issue didn't happen during OSP8 -> OSP9 upgrades, since we didn't upgrade Ceph.
(2) Theoretically, with a 3-node Ceph cluster we should not have those warnings and the upgrade should pass (see the sketch at the end of this comment).
(3) I found the workaround from comment #2 valid (adding IgnoreCephUpgradeWarnings: true).


We would need a PM decision on whether to document this or to find another solution.

As I understand it, the 'IgnoreCephUpgradeWarnings' variable is intended for development only, but we might use this exception for single-Ceph-node environment upgrades.
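
A sketch of the alternative from note (2): deploy enough Ceph storage nodes (and therefore OSDs) for the cluster to reach HEALTH_OK on its own. CephStorageCount is the node-count parameter defined by tripleo-heat-templates (verify against your templates); the file name and value here are only illustrative:

cat > /home/stack/ceph-scale.yaml <<EOF
parameter_defaults:
  # Illustrative only: three Ceph storage nodes provide enough OSDs for the
  # default replication size, so the HEALTH_WARN seen in this bug should not appear.
  CephStorageCount: 3
EOF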

Comment 5 Jaromir Coufal 2016-10-12 12:54:53 UTC
This is a Ceph-related topic; the decision needs to come from their DFG. Moving it there, raising urgency, targeting 10, and raising the question of whether this is a blocker.

Comment 7 Federico Lucifredi 2016-10-12 21:11:06 UTC
Ceph is Operating as designed. Ceph requires 3 OSDs, and it is warning that an unhealthy cluster configuration is present — which it is, with one node.

There is a workaround in #2, I do not think we need anything further.

Comment 8 Omri Hochman 2016-10-25 22:34:09 UTC
(In reply to Federico Lucifredi from comment #7)
> Ceph is Operating as designed. Ceph requires 3 OSDs, and it is warning that
> an unhealthy cluster configuration is present — which it is, with one node.
> 
> There is a workaround in #2, I do not think we need anything further.

Re-opening the bug to make sure it's going to be documented. Adding requires_doc_text?

Comment 9 Lucy Bopf 2016-11-07 06:16:18 UTC
Moving to 'NEW' to be triaged as the schedule allows.