Bug 1463850
| Summary: | Director attempts to delete controller ports in OpenStack Platform on stack update - need to revert heat to a good state | | |
| --- | --- | --- | --- |
| Product: | Red Hat OpenStack | Reporter: | Andreas Karis <akaris> |
| Component: | openstack-tripleo | Assignee: | Zane Bitter <zbitter> |
| Status: | CLOSED NOTABUG | QA Contact: | Arik Chernetsky <achernet> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 10.0 (Newton) | CC: | akaris, aschultz, atelang, emacchi, jslagle, mandreou, mburns, mcornea, rhel-osp-director-maint, sathlang, therve, zbitter |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-07-28 17:47:22 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Andreas Karis
2017-06-21 23:08:30 UTC
Hi,

Adding some errors discovered in the logs:

From the logs we can see a failure on controller-2 in Step4 of the deployment:

```
./sosreport-20170620-225759/overcloud-controller-2.localdomain/sos_commands/pacemaker/crm_report/overcloud-controller-2.localdomain/journal.log:Jun 20 19:38:00 Error: Duplicate declaration: Package[python-memcache] is already declared; cannot redeclare at /etc/puppet/modules/oslo/manifests/cache.pp:159 on node overcloud-controller-2.localdomain
```

This looks like https://bugzilla.redhat.com/show_bug.cgi?id=1392583, so it may be that puppet-oslo and puppet-horizon are not up to date there (from the latter BZ it looks like you need puppet-oslo-9.4.0-2.el7ost and puppet-horizon-9.4.1-2.el7ost). It doesn't happen on controller-0 or controller-1.

It looks like something is also happening in Step4 on one of the compute nodes. Running the following from the undercloud might help identify the problem:

```bash
for i in $COMPUTE_NODES; do
    ssh heat-admin@$i "journalctl -u os-collect-config | egrep 'deploy_status_code[^0-9]+[1-9]'"
done
```

The only failed server resource I see is:

```
| Controller | 36504793-379e-46c5-a813-0a1e86766eb4 | OS::TripleO::Server | 2017-06-21T23:18:32Z | overcloud-Controller-oviemczckrbn-1-hcyy7ivmh4mo |
```

It would be interesting to see the output of:

```
openstack stack event list overcloud-Controller-oviemczckrbn-1-hcyy7ivmh4mo
```

to see the history of that resource, and then to run:

```
openstack stack event show overcloud-Controller-oviemczckrbn-1-hcyy7ivmh4mo Controller <event_id>
```

on the first event showing a failure, so we can see what caused it.

If the port detach wasn't the initial failure, and the conditions causing the initial failure have gone away, then we can likely get things up and running again by marking the controller server COMPLETE. If the initial failure was the port detach, then it's a mystery why we're trying to replace the server (although the property values in the events could give us a clue), and there's every reason to think it will happen again.

Hi,

Sorry, I'm on another call right now. I forwarded all your requests to the customer. I'll forward the output as soon as I have it, and then I guess we could try setting the resource to COMPLETE and see whether that resolves it. Can you tell me how?

- Andreas

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
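To verify the package-version theory from the comments above, a quick check can be run against controller-2. This is only a sketch: the controller address placeholder is hypothetical, and the target versions are the ones quoted from bz#1392583.

```bash
# Hypothetical check, run from the undercloud. bz#1392583 suggests the fixed
# packages are puppet-oslo-9.4.0-2.el7ost and puppet-horizon-9.4.1-2.el7ost;
# compare against what controller-2 actually has installed.
ssh heat-admin@<controller-2-ip> "rpm -q puppet-oslo puppet-horizon"
```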
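The journalctl loop above assumes a populated COMPUTE_NODES variable. One way it could be built on the undercloud, assuming the default OSP 10 conventions ("compute" in the overcloud server names and a `ctlplane=<ip>` Networks column); neither assumption is confirmed by the ticket:

```bash
# Hypothetical helper: collect the ctlplane IPs of all compute nodes from
# the undercloud's server list.
source ~/stackrc
COMPUTE_NODES=$(openstack server list -f value -c Name -c Networks \
                | awk '/compute/ {sub(/^.*=/, "", $2); print $2}')
```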
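To jump straight to the first failure in the event history requested above, the event list can be filtered; a sketch assuming the default table output of the Newton-era client:

```bash
# Hypothetical shortcut: show only events whose status contains FAILED, then
# feed the earliest event id into "openstack stack event show" as described above.
openstack stack event list overcloud-Controller-oviemczckrbn-1-hcyy7ivmh4mo \
    | grep -i failed
```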
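Andreas's closing question, how to mark the resource COMPLETE, is never answered in the ticket. One approach used in the field is direct surgery on heat's database; the following is only a sketch, assuming a local MariaDB database named `heat` with the Newton-era `resource` table, and it should only be attempted after taking a database backup:

```bash
# Hypothetical recovery sketch, run on the undercloud as root. Assumes the
# Newton heat schema: resource(id, name, action, status, status_reason).

# 1. Find the FAILED resource record (the failed Controller server should
#    show up here):
mysql heat -e "SELECT id, name, action, status, status_reason
               FROM resource WHERE status = 'FAILED';"

# 2. Force that record to COMPLETE so the next stack update does not try
#    to replace the server (substitute the id found above):
mysql heat -e "UPDATE resource
               SET status = 'COMPLETE',
                   status_reason = 'Manually marked COMPLETE'
               WHERE id = <resource_id>;"
```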