Bug 1306845
| Summary: | rhel-osp-director: Update from 7.1 fails due to keystone errors. | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Alexander Chuzhoy <sasha> |
| Component: | openstack-tripleo-heat-templates | Assignee: | Marios Andreou <mandreou> |
| Status: | CLOSED NOTABUG | QA Contact: | yeylon <yeylon> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.0 (Kilo) | CC: | augol, dbecker, jcoufal, mandreou, mburns, morazi, ohochman, rhel-osp-director-maint, sasha, srevivo |
| Target Milestone: | y3 | | |
| Target Release: | 7.0 (Kilo) | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-02-15 10:13:11 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Alexander Chuzhoy
2016-02-11 21:36:47 UTC
(In reply to Alexander Chuzhoy from comment #0)
> | ControllerOvercloudServicesDeployment_Step6 | 6818b1d1-f41d-4a1a-9829-93174a95e11a | OS::Heat::StructuredDeployments | UPDATE_FAILED | 2016-02-11T20:15:07Z | ControllerNodesPostDeployment |
>
> | 0 | 14198409-d23a-4616-9a5b-a8ae2ee1dd6c | OS::Heat::StructuredDeployment | UPDATE_FAILED | 2016-02-11T20:15:10Z | ControllerOvercloudServicesDeployment_Step6 |
>
> +------------

^^^ At this point all the openstack-* services are managed by pacemaker and the constraints are all defined etc. Here we are doing the initialisation of the keystone admin role and heat domain:

    1698   # SERVICES INIT
    1699   # Needs to happen on all nodes (some parts do .conf file amendments)
    1700   # but it needs to happen on $pacemaker_master first (other parts make
    1701   # API calls and we don't want races, and the .conf file amendments
    1702   # often depend on the API calls having been made).
    1703   if (hiera('step') >= 5 and $pacemaker_master) or hiera('step') >= 6 {
    1704
    1705     include ::keystone::roles::admin
    1706
    1707     # TO-DO: Remove this class as soon as Keystone v3 will be fully functional
    1708     include ::heat::keystone::domain
    1709     Class['::keystone::roles::admin'] -> Exec['heat_domain_create']
    1710
    1711   } # END SERVICES INIT (STEP 5 and 6)

Not sure why this is failing at the moment; just pointing this out as a first data point, and because this sounds familiar I think we may have hit it before.

At the moment we just know that Keystone initialisation couldn't happen (in particular, errors resolving keystone_tenant/role/user), so the Admin role couldn't be initialised like we ask it to in overcloud_controller_pacemaker.pp:

    1705     include ::keystone::roles::admin

There isn't enough info on the bug to know for sure why this is happening.
It may for example be because rabbit isn't running, since we have a constraint like:

    1049   pacemaker::constraint::base { 'rabbitmq-then-keystone-constraint':

I mention this case in particular because I found a (seemingly, at least matching the error text in the report here) identical trace in my logs:

    9657:Jan 19 07:46:38 overcloud-controller-2.localdomain os-collect-config[5291]:
        Error: Could not prefetch keystone_tenant provider 'openstack': undefined method `collect' for nil:NilClass
        Error: Could not prefetch keystone_role provider 'openstack': undefined method `collect' for nil:NilClass
        Error: Could not prefetch keystone_user provider 'openstack': undefined method `collect' for nil:NilClass
        Error: /Stage[main]/Keystone::Roles::Admin/Keystone_user_role[admin@admin]: Could not evaluate: undefined method `empty?' for nil:NilClass
        Warning: /Stage[main]/Heat::Keystone::Domain/Exec[heat_domain_create]: Skipping because of failed dependencies
        "deploy_status_code": 6

The root cause in that case was Keystone not running because rabbit wasn't. In any case, I can't really do much until folks come online who can either provide access to the env or fix up permissions on the sos reports for a closer look. Hopefully once we find it, it will be an easy fix... :/ #famous_last_words

thanks, marios

Update: tl;dr, the update failed because yum did (it couldn't reach the mirrors). Not sure if the keystone error described by sasha is just a symptom of the update stopping at that point. More info below.

I just spent ages staring at the logs from the various boxes. Assuming I've read things correctly, the update started with the overcloud-cephstorage-0 node, which completed, and then subsequently failed on overcloud-objectstorage-0.
Best I can tell, it fails because it can't reach any yum repos:

    Feb 11 20:07:36 overcloud-objectstorage-0.localdomain os-collect-config[4445]: Notice: /Stage[main]/Tripleo::Packages/Exec[package-upgrade]/returns: Error downloading packages:
    Feb 11 20:07:36 overcloud-objectstorage-0.localdomain os-collect-config[4445]: Notice: /Stage[main]/Tripleo::Packages/Exec[package-upgrade]/returns: rabbitmq-server-3.3.5-15.el7ost.noarch: [Errno 256] No more mirrors to try.
    Feb 11 20:07:37 overcloud-objectstorage-0.localdomain os-collect-config[4445]: Error: yum -y update returned 1 instead of one of [0]

None of the controllers or the compute got updated, so I don't think it has anything to do with the update procedure per se (since the controllers are where it gets most interesting, with the removal of the given node from the cluster). There are various errors in the controller logs, including the one described by Sasha for the keystone heat domain, but given that yum_update fails, the rest of the puppet run doesn't happen either, so our setup is left in a half-updated, half-started state. In particular I see all pacemaker-managed services are started on controllers 1 and 2.

Just got access to the environment so will poke further. In particular I'd like to see if the yum update on the object-storage node was just a transient network issue, and if, after that completes OK, the rest of the update can continue.

thanks, marios

(In reply to marios from comment #8)
> Update: tl;dr, the update failed because yum did (couldn't reach mirrors).
> Not sure if the keystone error described by sasha is just a symptom of the
> update stopping at that point. More info below.

I just confirmed on the environment that all services can be started fine (so there is no config issue at the moment, afaics). Some services were down on controller-0 and all services were unmanaged because the cluster was set into maintenance mode.
I unset that (pcs property unset maintenance-mode), and in combination with a previous pcs cluster stop followed, after a while, by pcs cluster start on controller-0, all pacemaker-managed services seem OK again, including keystone and heat:

    [root@overcloud-controller-0 heat-admin]# pcs status | grep "keystone\|heat" -A 1
     Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
         Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    --
     Clone Set: openstack-heat-api-clone [openstack-heat-api]
         Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    --
     Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
         Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    --
     Clone Set: openstack-keystone-clone [openstack-keystone]
         Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
    --
     Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
         Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

I also got onto the objectstorage node (where yum failed) and I was able to yum install vim, so that seems to be OK (i.e. the repos are reachable).

@Sasha, do you think, now that the cluster is back in a reasonable state, you could try the update again, so we can see if that yum_update issue is transient and whether we can get further after/if it passes? thanks

(In reply to marios from comment #9)
> (In reply to marios from comment #8)
> > Update: tl;dr, the update failed because yum did (couldn't reach mirrors).
> > Not sure if the keystone error described by sasha is just a symptom of the
> > update stopping at that point. More info below.
>
> I just confirmed on the environment that all services can be started fine
> (so there is no config issue at the moment, afaics).
> Some services were down

After more debugging with Sasha on the environment on Friday evening, and since this is definitely not a config issue, I'm putting it down to transient yum repo problems for now (we can re-open if you disagree, Sasha?). We started another update, which successfully updated the object-storage node (which is what failed initially, and because of which this bug was filed), but I left before that completed.

In any case, since this doesn't appear for now to be a config issue (pacemaker cluster running fine and all managed services OK on 3 controllers), am closing for now.

closed, no need for needinfo.
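The manual recovery described in the comments (unset maintenance mode, restart the cluster on controller-0, verify the keystone and heat clone sets) can be sketched as a shell sequence. This is a hedged recap only, assuming an overcloud controller node with pcs installed; on any other host it just prints a note and does nothing.

```shell
#!/bin/bash
# Sketch of the recovery sequence from the comments above, assuming an
# overcloud controller with pcs installed; adapt hostnames to your env.
set -u

if command -v pcs >/dev/null 2>&1; then
    # 1. Take the cluster out of maintenance mode so pacemaker manages
    #    resources again (they were left unmanaged by the failed update).
    pcs property unset maintenance-mode

    # 2. Restart cluster services on the affected controller (controller-0
    #    in the report); give the resources a while to settle afterwards.
    pcs cluster stop
    pcs cluster start

    # 3. Verify the keystone and heat clone sets are Started on all three
    #    controllers (the same check run in comment 9).
    pcs status | grep "keystone\|heat" -A 1

    recovery_note="cluster recovery commands issued"
else
    # Not a controller (or pcs missing): skip the cluster commands entirely.
    recovery_note="pcs not found: run this on an overcloud controller node"
fi
echo "$recovery_note"
```

After this, installing any small package on the node where yum failed (comment 9 used yum install vim) confirms whether the repos are reachable again before retrying the update.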