osp-director-10: Major Upgrade OSP9 -> 10 with Ceph node fails on: "Could not find class ::tripleo::trusted_cas for overcloud-controller"

Environment:
-------------
instack-undercloud-5.0.0-0.20160930175750.9d2a655.el7ost.noarch
instack-5.0.0-1.el7ost.noarch
python-heat-agent-0.0.1-0.20160920204709.f123aa1.el7ost.noarch
openstack-heat-templates-0.0.1-0.20160920204709.f123aa1.el7ost.noarch
openstack-heat-common-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
python-heat-tests-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
openstack-tripleo-heat-templates-compat-2.0.0-34.3.el7ost.noarch
puppet-heat-9.4.0-1.1.el7ost.noarch
openstack-tripleo-heat-templates-5.0.0-0.20161003064637.d636e3a.1.1.el7ost.noarch
python-heatclient-1.5.0-0.20161001073130.3c3f8ee.el7ost.noarch
openstack-heat-api-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
openstack-heat-engine-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
openstack-heat-api-cfn-7.0.0-0.20160926200847.dd707bc.el7ost.noarch
heat-cfntools-1.3.0-2.el7ost.noarch

Steps:
---------
(1) Deploy OSP9 (environment with a Ceph node)
(2) Attempt to upgrade according to https://gitlab.cee.redhat.com/sathlang/ospd-9-to-10-upgrade/

Results:
---------
Upgrade failed during "Upgrade Controller and Block-storage" -> Stack overcloud UPDATE_FAILED

heat deployment-show
----------------------
more /var/lib/heat-config/deployed/4edf31ec-53e5-407a-a6f1-0f2f4b9bf98d.notify.json
{
  "deploy_stdout": "",
  "deploy_stderr": "\u001b[1;31mError: Could not find class ::tripleo::trusted_cas for overcloud-controller-0.localdomain on node overcloud-controller-0.localdomain\u001b[0m\n\u001b[1;31mError: Could not find class ::tripleo::trusted_cas for overcloud-controller-0.localdomain on node overcloud-controller-0.localdomain\u001b[0m\n",
  "deploy_status_code": 1
}

parameter_defaults:
  controllerExtraConfig:
    # In releases before Mitaka, HeatWorkers doesn't modify
    # num_engine_workers, so handle via heat::config
    heat::config::heat_config:
      DEFAULT/num_engine_workers:
        value: 1
    heat::api_cloudwatch::enabled: false
    heat::api_cfn::enabled: false
  HeatWorkers: 1
  CeilometerWorkers: 1
  CinderWorkers: 1
  GlanceWorkers: 1
  KeystoneWorkers: 1
  NeutronWorkers: 1
  NovaWorkers: 1
  SwiftWorkers: 1
  IgnoreCephUpgradeWarnings: true

Upgrade view:
--------------
14:38:36 2016-10-07 14:32:49Z [CephStorageAllNodesValidationDeployment]: UPDATE_IN_PROGRESS state changed
14:38:36 2016-10-07 14:32:50Z [overcloud-CephStorageAllNodesValidationDeployment-wflwiinwzehd]: UPDATE_IN_PROGRESS Stack UPDATE started
14:38:36 2016-10-07 14:32:51Z [overcloud-CephStorageAllNodesValidationDeployment-wflwiinwzehd]: UPDATE_COMPLETE Stack UPDATE completed successfully
14:38:36 2016-10-07 14:32:51Z [CephStorageAllNodesValidationDeployment]: UPDATE_COMPLETE state changed
14:38:36 2016-10-07 14:33:34Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.2]: SIGNAL_IN_PROGRESS Signal: deployment dc02cee8-061a-43f8-abb6-81932e28e9ac succeeded
14:38:36 2016-10-07 14:33:35Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.0]: SIGNAL_IN_PROGRESS Signal: deployment 937105fa-5ae5-4c0c-a487-adce7ffa6f28 succeeded
14:38:36 2016-10-07 14:33:36Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.2]: UPDATE_COMPLETE state changed
14:38:36 2016-10-07 14:33:36Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.0]: UPDATE_COMPLETE state changed
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp.CephMonUpgradeDeployment.1]: SIGNAL_IN_PROGRESS Signal: deployment d3215c31-333f-4923-ba7f-aacf30d1524e failed (124)
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp.CephMonUpgradeDeployment.1]: CREATE_FAILED Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp.CephMonUpgradeDeployment]: UPDATE_FAILED Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp.CephMonUpgradeDeployment]: CREATE_FAILED resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:33Z [overcloud-UpdateWorkflow-dxrn3a2du6lp]: UPDATE_FAILED resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:34Z [UpdateWorkflow]: UPDATE_FAILED resources.UpdateWorkflow: resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36 2016-10-07 14:38:34Z [ControllerAllNodesDeployment]: UPDATE_FAILED UPDATE aborted
14:38:36 2016-10-07 14:38:34Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba.1]: UPDATE_FAILED UPDATE aborted
14:38:36 2016-10-07 14:38:34Z [overcloud-ControllerAllNodesDeployment-kv6c7pxyhsba]: UPDATE_FAILED Operation cancelled
14:38:36 2016-10-07 14:38:34Z [overcloud]: UPDATE_FAILED resources.UpdateWorkflow: resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
14:38:36
14:38:36 Stack overcloud UPDATE_FAILED
14:38:36
14:38:36 ### UPGRADE CONTROLLER AND BLOCKSTORAGE FINISHED ###

heat-engine.log
-----------------
2016-10-07 10:38:33.646 9685 INFO heat.engine.resource [req-3142213f-6c83-4f27-a171-d9d05af3fe13 - - - - -] CREATE: SoftwareDeploymentGroup "CephMonUpgradeDeployment" [e0234aba-398b-424e-9706-d34f42181549] Stack "overcloud-UpdateWorkflow-dxrn3a2du6lp" [3c8d107d-e90d-4009-a5f0-0b11300bd32c]
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource Traceback (most recent call last):
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 753, in _action_recorder
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     yield
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 855, in _do_action
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     yield self.action_handler_task(action, args=handler_args)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 353, in wrapper
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     step = next(subtask)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resource.py", line 806, in action_handler_task
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     done = check(handler_data)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 375, in check_create_complete
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     if not checker.step():
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/scheduler.py", line 219, in step
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     poll_period = next(self._runner)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/openstack/heat/resource_group.py", line 384, in _run_to_completion
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     self).check_update_complete(updater):
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 498, in check_update_complete
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     cookie=cookie)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource   File "/usr/lib/python2.7/site-packages/heat/engine/resources/stack_resource.py", line 404, in _check_status_complete
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource     action=action)
2016-10-07 10:38:33.646 9685 ERROR heat.engine.resource ResourceFailure: resources.CephMonUpgradeDeployment: Error: resources[1]: Deployment to server failed: deploy_status_code : Deployment exited with non-zero status code: 124
It might be that the Ceph warnings are the reason for this failure. In that case we would need to ignore them, because deployments with a single Ceph node will always have those warnings.

[stack@undercloud-0 ~]$ heat deployment-show 12d566de-88e4-47f8-9675-ffad9d01f76a
WARNING (shell) "heat deployment-show" is deprecated, please use "openstack software deployment show" instead
{
  "status": "FAILED",
  "server_id": "9bd3ba9f-2eaa-4b69-99f0-478bda88d090",
  "config_id": "a1789b3c-1dbe-4b9c-bff3-80f7dfa1c404",
  "output_values": {
    "deploy_stdout": "INFO: starting a1789b3c-1dbe-4b9c-bff3-80f7dfa1c404\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\nWARNING: Waiting for Ceph cluster status to go HEALTH_OK\n",
    "deploy_stderr": "",
    "deploy_status_code": 124
  },
  "creation_time": "2016-10-10T20:04:08Z",
  "updated_time": "2016-10-10T20:10:36Z",
  "input_values": {
    "update_identifier": "",
    "deploy_identifier": "1476129604"
  },
  "action": "CREATE",
  "status_reason": "deploy_status_code : Deployment exited with non-zero status code: 124",
  "id": "12d566de-88e4-47f8-9675-ffad9d01f76a"
}

On Controller:
---------------
[root@controller-0 ~]# ceph health
HEALTH_WARN 192 pgs degraded; 192 pgs stuck degraded; 192 pgs stuck unclean; 192 pgs stuck undersized; 192 pgs undersized

[root@controller-0 ~]# ceph health status
status not valid: status not in detail
Invalid command: unused arguments: ['status']
health {detail} : show cluster health
Error EINVAL: invalid command

[root@controller-0 ~]# ceph status
    cluster 1a387610-8ce4-11e6-89aa-525400cc88d3
     health HEALTH_WARN
            192 pgs degraded
            192 pgs stuck degraded
            192 pgs stuck unclean
            192 pgs stuck undersized
            192 pgs undersized
     monmap e1: 3 mons at {controller-0=172.17.3.13:6789/0,controller-1=172.17.3.11:6789/0,controller-2=172.17.3.15:6789/0}
            election epoch 6, quorum 0,1,2 controller-1,controller-0,controller-2
     osdmap e9: 1 osds: 1 up, 1 in
      pgmap v18: 192 pgs, 5 pools, 0 bytes data, 0 objects
            34980 kB used, 39881 MB / 39915 MB avail
                 192 active+undersized+degraded
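Note that exit status 124 is what the timeout(1) utility returns when the command it wraps runs too long, which suggests the upgrade step wraps its wait-for-HEALTH_OK loop in a timeout. A minimal illustrative sketch of that pattern (the 600-second window, the 30-second poll interval and the messages are assumptions, not the actual upgrade script):

#!/bin/bash
# Illustrative sketch only -- not the real CephMonUpgradeDeployment script.
# The 600s window and 30s poll interval are assumptions.
timeout 600 bash -c '
  until ceph health | grep -q "^HEALTH_OK"; do
    echo "WARNING: Waiting for Ceph cluster status to go HEALTH_OK"
    sleep 30
  done
'
rc=$?
# With a single OSD the cluster never leaves HEALTH_WARN, so timeout
# eventually kills the loop and returns 124 -- the deploy_status_code
# reported by Heat above.
echo "exit status: ${rc}"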
Trying to work around it by:

cat > /home/stack/ignore-ceph.yaml <<EOF
parameter_defaults:
  IgnoreCephUpgradeWarnings: true
EOF

and adding the following to the upgrade steps:

-e /home/stack/ignore-ceph.yaml
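For reference, a sketch of how the extra environment file would be appended to the upgrade command. The placeholder stands for whatever -e files the overcloud is already deployed and upgraded with; the exact command comes from the upgrade procedure and is not reproduced here:

# Sketch only: re-run the same command used for the "Upgrade Controller and
# Block-storage" step, with the ignore file appended last so its
# parameter_defaults take effect.
openstack overcloud deploy --templates \
  <existing -e environment files from the upgrade procedure> \
  -e /home/stack/ignore-ceph.yaml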
I don't think the puppet error of not finding the ::tripleo::trusted_cas class is related to Ceph. That class was not present in Mitaka and was introduced in OSP10. It seems that the manifests are old and need to be updated.
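If it helps to confirm that theory, the class should autoload from the puppet-tripleo module on the node. A quick check, as a sketch only (the module path below is the usual one on overcloud nodes and may differ):

# Sketch: does the puppet-tripleo module on the controller ship the class?
# Adjust the path if "puppet config print modulepath" points elsewhere.
ls -l /etc/puppet/modules/tripleo/manifests/trusted_cas.pp
rpm -q puppet-tripleo    # an OSP10-era package should provide it; a
                         # Mitaka/OSP9-era one will not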
(In reply to Juan Antonio Osorio from comment #3)
> I don't think the puppet error of not finding the ::tripleo::trusted_cas
> class is related to Ceph. That class was not present in Mitaka and was
> introduced in OSP10. It seems that the manifests are old and need to be updated.

Please ignore the ::tripleo::trusted_cas error. The issue is that when there are Ceph warnings (which we *always* have with single-Ceph deployments):

[root@controller-0 ~]# ceph status
    cluster 1a387610-8ce4-11e6-89aa-525400cc88d3
     health HEALTH_WARN
            192 pgs degraded

the upgrade will fail on the "Upgrade Controller and Block-storage" step.

Notes:
(1) This issue didn't happen during OSP8 -> OSP9 upgrades, since we didn't upgrade Ceph.
(2) Theoretically, with a 3-node Ceph cluster we should not have those warnings and the upgrade should pass (see the sketch below).
(3) I found the workaround from comment #2 valid (adding IgnoreCephUpgradeWarnings: true).

We would need a PM decision on whether we want to document it or find another solution. As I understand it, the 'IgnoreCephUpgradeWarnings' variable is for dev use only, but we might use this exception for single-Ceph environment upgrades.
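To illustrate note (2): with a single OSD and the default replicated pool size, placement groups can never be fully replicated, so the cluster stays in HEALTH_WARN indefinitely. The checks below are standard Ceph CLI calls; the pool name is only an example:

# Sketch: why one OSD can never reach HEALTH_OK with default replication.
ceph osd stat                  # "1 osds: 1 up, 1 in", as in the output above
ceph osd pool get rbd size     # replicated size (typically 3) > number of OSDs
ceph health                    # stays HEALTH_WARN: pgs undersized/degraded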
This is a Ceph-related topic; the decision needs to come from their DFG. Moving it there, raising the urgency, targeting 10, and raising the question of whether this is a blocker.
Ceph is operating as designed. Ceph requires 3 OSDs, and it is warning that an unhealthy cluster configuration is present, which it is with one node.

There is a workaround in comment #2; I do not think we need anything further.
(In reply to Federico Lucifredi from comment #7)
> Ceph is operating as designed. Ceph requires 3 OSDs, and it is warning that
> an unhealthy cluster configuration is present, which it is with one node.
>
> There is a workaround in comment #2; I do not think we need anything further.

Re-opening the bug to make sure this is going to be documented. Adding requires_doc_text?
Moving to 'NEW' to be triaged as the schedule allows.