Description of problem:
I tried to replace controller-2 with controller-3 using
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes

The update failed on:

overcloud.AllNodesDeploySteps.ControllerDeployment_Step3.0:
  resource_type: OS::Heat::StructuredDeployment
  physical_resource_id: 63711c38-3e69-435e-8d2c-e4fa6f81b467
  status: UPDATE_FAILED

Full failures list output: http://pastebin.test.redhat.com/518376

(undercloud) [stack@undercloud-0 ~]$ ssh heat-admin@192.168.24.11 "sudo cat /etc/corosync/corosync.conf"
totem {
    version: 2
    cluster_name: tripleo_cluster
    transport: udpu
    token: 10000
}

nodelist {
    node {
        ring0_addr: controller-0
        nodeid: 1
    }

    node {
        ring0_addr: controller-1
        nodeid: 2
    }

    node {
        ring0_addr: controller-2
        nodeid: 3
    }
}

quorum {
    provider: corosync_votequorum
}

logging {
    to_logfile: yes
    logfile: /var/log/cluster/corosync.log
    to_syslog: yes
}

Version-Release number of selected component (if applicable):
OSP12

How reproducible:
Always

Steps to Reproduce:
1. Deploy an OSP12 overcloud with 3 controllers, 1 compute, and 1 empty ironic node
2. Follow https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/11/html-single/director_installation_and_usage/#sect-Replacing_Controller_Nodes
3. Perform steps 5-8 from http://etherpad.corp.redhat.com/JtZ84Hp2nQ

Actual results:
Update failed.

Expected results:
The update fails on ControllerNodesPostDeployment, as described in the manual in step 9.4.3.

Additional info:
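A sketch of how more detail on the failed StructuredDeployment could be pulled from the undercloud, assuming the heat/tripleo client commands shipped in this release (not run against this environment; the deployment ID is the physical_resource_id quoted above):

(undercloud) [stack@undercloud-0 ~]$ openstack stack failures list --long overcloud
(undercloud) [stack@undercloud-0 ~]$ openstack software deployment show --long 63711c38-3e69-435e-8d2c-e4fa6f81b467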
openstack-keystone-12.0.1-0.20170907172639.6a67918.el7ost.noarch
python-openstackclient-lang-3.12.0-0.20170821150739.f67ebce.el7ost.noarch
openstack-neutron-common-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
openstack-tripleo-common-containers-7.6.1-0.20170912115321.el7ost.noarch
openstack-ironic-inspector-6.0.1-0.20170824132804.0e72dcb.el7ost.noarch
python-openstackclient-3.12.0-0.20170821150739.f67ebce.el7ost.noarch
openstack-tripleo-common-7.6.1-0.20170912115321.el7ost.noarch
openstack-mistral-common-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-nova-api-16.0.1-0.20170908213719.el7ost.noarch
openstack-nova-conductor-16.0.1-0.20170908213719.el7ost.noarch
openstack-glance-15.0.0-0.20170830130905.9820166.el7ost.noarch
openstack-nova-compute-16.0.1-0.20170908213719.el7ost.noarch
puppet-openstacklib-11.3.1-0.20170825142820.18ee919.el7ost.noarch
openstack-heat-api-9.0.1-0.20170911115334.0c64134.el7ost.noarch
openstack-swift-object-2.15.2-0.20170824165102.c54c6b3.el7ost.noarch
openstack-swift-proxy-2.15.2-0.20170824165102.c54c6b3.el7ost.noarch
openstack-ironic-api-9.1.1-0.20170908114346.feb64c2.el7ost.noarch
openstack-mistral-engine-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-nova-common-16.0.1-0.20170908213719.el7ost.noarch
puppet-openstack_extras-11.3.1-0.20170906070209.b99c3a4.el7ost.noarch
openstack-tripleo-puppet-elements-7.0.0-0.20170910154847.2094778.el7ost.noarch
openstack-tripleo-heat-templates-7.0.0-0.20170913050524.0rc2.el7ost.noarch
openstack-swift-container-2.15.2-0.20170824165102.c54c6b3.el7ost.noarch
openstack-neutron-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
openstack-neutron-openvswitch-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
openstack-heat-engine-9.0.1-0.20170911115334.0c64134.el7ost.noarch
openstack-ironic-conductor-9.1.1-0.20170908114346.feb64c2.el7ost.noarch
openstack-tempest-17.0.0-0.20170901201711.ad75393.el7ost.noarch
openstack-mistral-executor-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-tripleo-validations-7.3.1-0.20170907082220.efe8a72.el7ost.noarch
openstack-selinux-0.8.9-0.1.el7ost.noarch
openstack-nova-placement-api-16.0.1-0.20170908213719.el7ost.noarch
openstack-swift-account-2.15.2-0.20170824165102.c54c6b3.el7ost.noarch
openstack-heat-common-9.0.1-0.20170911115334.0c64134.el7ost.noarch
python-openstacksdk-0.9.17-0.20170821143340.7946243.el7ost.noarch
openstack-tripleo-ui-7.4.1-0.20170911164240.16684db.el7ost.noarch
openstack-ironic-common-9.1.1-0.20170908114346.feb64c2.el7ost.noarch
openstack-puppet-modules-11.0.0-0.20170828113154.el7ost.noarch
openstack-mistral-api-5.1.1-0.20170909041831.a8e648c.el7ost.noarch
openstack-nova-scheduler-16.0.1-0.20170908213719.el7ost.noarch
openstack-neutron-ml2-11.0.1-0.20170913033853.6b26bc5.el7ost.noarch
openstack-heat-api-cfn-9.0.1-0.20170911115334.0c64134.el7ost.noarch
openstack-tripleo-image-elements-7.0.0-0.20170910153513.526772d.el7ost.noarch
openstack-zaqar-5.0.1-0.20170905222047.el7ost.noarch
(undercloud) [stack@undercloud-0 ~]$ cat remove-controller.yaml
parameters:
  ControllerRemovalPolicies: [{'resource_list': ['2']}]
  CorosyncSettleTries: 5
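For context, the shape of the redeploy command this environment file gets passed to, per the replacement procedure (the set of other -e environment files is whatever was used for the original deployment and is only a placeholder here):

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud deploy --templates \
    -e <environment files used for the original deployment> \
    -e ~/remove-controller.yaml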
(undercloud) [stack@undercloud-0 ~]$ nova list
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| ID                                   | Name         | Status | Task State | Power State | Networks               |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+
| a954f4a6-57bf-4feb-b301-1675a24af625 | compute-0    | ACTIVE | -          | Running     | ctlplane=192.168.24.7  |
| 4c890019-5b13-4e10-b716-baa4a454bba4 | controller-0 | ACTIVE | -          | Running     | ctlplane=192.168.24.11 |
| 6844a375-44dd-4ad9-918a-3892b6f2d39e | controller-1 | ACTIVE | -          | Running     | ctlplane=192.168.24.12 |
| 8a7ae32c-b808-4357-86a8-db8f873aff4f | controller-3 | ACTIVE | -          | Running     | ctlplane=192.168.24.8  |
+--------------------------------------+--------------+--------+------------+-------------+------------------------+

ping 192.168.24.8
PING 192.168.24.8 (192.168.24.8) 56(84) bytes of data.
64 bytes from 192.168.24.8: icmp_seq=1 ttl=64 time=0.430 ms
Adding Test-Blocker/Blocker, as this blocks controller replacement in OSP12.
I looked at http://pastebin.test.redhat.com/518376 and I'm not sure that's the root cause of the update failure. It has a reason of "UPDATE aborted", which usually means something else failed and caused Heat to kill that resource too. It's frequently seen on timeouts in my experience, although I believe there are other things that can cause it too.
So to follow up on my previous comment, I think we need to know whether there were any other failed resources, or whether the update as a whole timed out, to explain the aborted resource status that I'm seeing. In the logs provided there isn't actually a failure that I can see, so it's impossible to say what went wrong.
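The kind of checks that would answer that, assuming the stock heat client commands on the undercloud (shown here only as the shape of the queries, not run against this environment):

(undercloud) [stack@undercloud-0 ~]$ openstack stack resource list --nested-depth 5 overcloud | grep -i failed
(undercloud) [stack@undercloud-0 ~]$ openstack stack show overcloud -c stack_status -c stack_status_reason -c timeout_mins
(undercloud) [stack@undercloud-0 ~]$ openstack stack event list --nested-depth 5 overcloud | tail -n 50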
Update stuck on memcached container on new node

[heat-admin@controller-2 ~]$ sudo docker inspect memcached |grep command
        "config_data": "{\"start_order\": 1, \"image\": \"192.168.24.1:8787/rhosp12/openstack-memcached-docker:2017-09-27.3\", \"environment\": [\"TRIPLEO_CONFIG_HASH=010a1638903ff5c9a080e00eeb23cc0f\"], \"command\": [\"/bin/bash\", \"-c\", \"source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS\"], \"volumes\": [\"/etc/hosts:/etc/hosts:ro\", \"/etc/localtime:/etc/localtime:ro\", \"/etc/puppet:/etc/puppet:ro\", \"/etc/pki/ca-trust/extracted:/etc/pki/ca-trust/extracted:ro\", \"/etc/pki/tls/certs/ca-bundle.crt:/etc/pki/tls/certs/ca-bundle.crt:ro\", \"/etc/pki/tls/certs/ca-bundle.trust.crt:/etc/pki/tls/certs/ca-bundle.trust.crt:ro\", \"/etc/pki/tls/cert.pem:/etc/pki/tls/cert.pem:ro\", \"/dev/log:/dev/log\", \"/etc/ssh/ssh_known_hosts:/etc/ssh/ssh_known_hosts:ro\", \"/sys/fs/selinux:/sys/fs/selinux\", \"/var/lib/config-data/memcached/etc/sysconfig/memcached:/etc/sysconfig/memcached:ro\"], \"net\": \"host\", \"privileged\": false, \"restart\": \"always\"}",

[heat-admin@controller-2 ~]$ sudo docker exec -it memcached /bin/bash
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ ps
  PID TTY          TIME CMD
   72 ?        00:00:00 bash
   84 ?        00:00:00 ps
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
memcach+     1  0.0  0.0  11636  1376 ?        Ss   13:20   0:00 /bin/bash -c source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS
memcach+     7  0.0  0.0 629048  2384 ?        Sl   13:20   0:00 /usr/bin/memcached -p 11211 -u memcached -m 7941 -c 8192 -l 172.17.1.23 -U 11211 -t 8 >> /var/log/memcached.log 2>&1
memcach+    72  0.5  0.0  11640  1708 ?        Ss   14:57   0:00 /bin/bash
memcach+    91  0.0  0.0  47448  1664 ?        R+   14:57   0:00 ps -aux
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ cat /var/log/memcached.log
cat: /var/log/memcached.log: No such file or directory
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found

[heat-admin@controller-3 ~]$ sudo docker ps
CONTAINER ID        IMAGE                                                                COMMAND                  CREATED             STATUS              PORTS               NAMES
55d2050d2664        192.168.24.1:8787/rhosp12/openstack-memcached-docker:2017-09-27.3   "/bin/bash -c 'source"   38 minutes ago      Up 38 minutes                           memcached
[heat-admin@controller-3 ~]$ sudo docker exec -it memcached /bin/bash
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ cat /var/log/memcached.log
cat: /var/log/memcached.log: No such file or directory
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$ ps -aux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
memcach+     1  0.0  0.0  11636  1372 ?        Ss   14:23   0:00 /bin/bash -c source /etc/sysconfig/memcached; /usr/bin/memcached -p ${PORT} -u ${USER} -m ${CACHESIZE} -c ${MAXCONN} $OPTIONS
memcach+     7  0.0  0.0 628024  3380 ?        Sl   14:23   0:00 /usr/bin/memcached -p 11211 -u memcached -m 7941 -c 8192 -l 172.17.1.16 -U 11211 -t 8 >> /var/log/memcached.log 2>&1
memcach+    44  0.3  0.0  11640  1700 ?        Ss   15:03   0:00 /bin/bash
memcach+    63  0.0  0.0  47448  1656 ?        R+   15:03   0:00 ps -aux
tput: No value for $TERM and no -T specified
tput: No value for $TERM and no -T specified
bash: hostname: command not found
()[memcached@ /]$
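Side note on the missing log file: my reading is that the ">> /var/log/memcached.log 2>&1" at the end of the memcached command line comes from $OPTIONS in /etc/sysconfig/memcached and is expanded as plain arguments rather than interpreted as a shell redirection, so the file would never be created and its absence is probably not meaningful by itself. If anything, generic places one might look on the node (not specific to this reproduction):

[heat-admin@controller-3 ~]$ sudo docker logs memcached
[heat-admin@controller-3 ~]$ sudo journalctl -u docker --no-pager | grep -i memcached | tail
[heat-admin@controller-3 ~]$ sudo journalctl -u os-collect-config --no-pager | tail -n 100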
Hey folks, this doesn't seem like a PIDONE bug to me - if it's getting stuck on memcached that is definitely outside our wheelhouse, but I also think it's too broad to say that PIDONE is automatically responsible for controller replacement issues. We will definitely always help for issues where we can, but IMHO that would generally mean that the lower level clustering (ie Pacemaker & Friends) is involved. Re-assigning to DFG:DF - please let me know if you disagree :)
I've historically done a lot of work on controller replacement, so I don't have a problem with assigning this to DF. That said, I've looked at this and my container-fu is not strong enough to figure out what's going on. As Artem posted above, it seems to be stuck on a single memcached container with no logs that either of us could find to explain what was happening. I'm not familiar enough with containers to guess at where it is stuck.
I'm told that the lack of logs from memcached is actually normal, so that's probably a red herring. That said, I've debugged this about as much as I can and haven't come up with anything. I'm sending it to the containers DFG as it seems to be a regression in the controller node replacement procedure that is happening during container deployment and I think we need someone from that team to take a look.
(In reply to Chris Jones from comment #10)
> Hey folks, this doesn't seem like a PIDONE bug to me - if it's getting stuck
> on memcached that is definitely outside our wheelhouse, but I also think
> it's too broad to say that PIDONE is automatically responsible for
> controller replacement issues.
>
> We will definitely always help for issues where we can, but IMHO that would
> generally mean that the lower level clustering (ie Pacemaker & Friends) is
> involved.
>
> Re-assigning to DFG:DF - please let me know if you disagree :)

Chris: We do run memcached under pacemaker btw but I'll have a look and see what I can figure out.
Took a look at a dev environment this afternoon and noticed the following:

On controller-0 we have the pacemaker version of memcached running:

(In reply to Dan Prince from comment #13)
> (In reply to Chris Jones from comment #10)
> > Hey folks, this doesn't seem like a PIDONE bug to me - if it's getting stuck
> > on memcached that is definitely outside our wheelhouse, but I also think
> > it's too broad to say that PIDONE is automatically responsible for
> > controller replacement issues.
> >
> > We will definitely always help for issues where we can, but IMHO that would
> > generally mean that the lower level clustering (ie Pacemaker & Friends) is
> > involved.
> >
> > Re-assigning to DFG:DF - please let me know if you disagree :)
>
> Chris: We do run memcached under pacemaker btw but I'll have a look and see
> what I can figure out.

Sorry, I was thinking of redis. Having a look anyways!
A couple of things I noticed this afternoon. On controller-0 (which is untouched by this controller replacement operation) we have the following 'pcmklatest' tagged containers:

[root@controller-0 ~]# docker ps | grep pcmklatest
bb0484adb579        192.168.24.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest    "/bin/bash /usr/local"   23 hours ago   Up 23 hours             haproxy-bundle-docker-0
a1fac4cc7caf        192.168.24.1:8787/rhosp12/openstack-redis-docker:pcmklatest      "/bin/bash /usr/local"   23 hours ago   Up 23 hours             redis-bundle-docker-0
ebb60a21be59        192.168.24.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest    "/bin/bash /usr/local"   23 hours ago   Up 23 hours             galera-bundle-docker-0
3716733347d1        192.168.24.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest   "/bin/bash /usr/local"   23 hours ago   Up 23 hours (healthy)   rabbitmq-bundle-docker-0

-------

On both controller-2 and controller-3, containers with these docker tags are missing entirely.

I'm wondering if the issue here is that a stack update was missing the docker-ha.yaml heat environment. Or perhaps, through some other means, the resource registry entries for the HA services were overridden during the upgrade.
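A sketch of what I'd check on the replacement node to confirm that (plain docker/pcs commands, nothing specific to this environment):

[heat-admin@controller-3 ~]$ sudo docker images | grep pcmklatest
[heat-admin@controller-3 ~]$ sudo docker ps -a | grep -E 'haproxy|redis|galera|rabbitmq'
[heat-admin@controller-3 ~]$ sudo pcs status --full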
> On both controller-2 and controller-3, containers with these docker tags are
> missing entirely.
>
> I'm wondering if the issue here is that a stack update was missing the
> docker-ha.yaml heat environment. Or perhaps, through some other means, the
> resource registry entries for the HA services were overridden during the
> upgrade.

That could definitely be. There should not be a situation where those tags are missing (pacemaker would not be able to spawn those containers, as the resource definition of the bundles has the 'pcmklatest' tag in it).

(On a side note, I'll be able to help more around this next week when I am back home.)
We had a first round of investigation with Michele.

According to the controller replacement procedure [1], step 9.4.2 consists of redeploying with a special environment file, "remove-controller.yaml", which should stop early at ControllerDeployment_Step1 with an "error" status, in order to perform some additional steps to reconfigure pacemaker.

What we see from the heat-deployed ansible task that I'm attaching is that this initial puppet run correctly detects that a Pacemaker resource is in error...

...
"Error: /sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2>&1 returned 1 instead of one of [0]",
"Error: /Stage[main]/Pacemaker::Corosync/Exec[wait-for-settle]/returns: change from notrun to 0 failed: /sbin/pcs status | grep -q 'partition with quorum' > /dev/null 2> ...
...

...but for some reason, either the puppet run doesn't finish in error, or the error is not reported back to the ansible task...

    "Notice: Applied catalog in 3710.85 seconds"
    ],
    "failed": false,
    "failed_when_result": false
}

...thus the controller deployment follows up and finishes in error in later steps, because pacemaker was never given a chance to be configured on that new node.

This is the ansible task that should error out (comments inlined):

- name: Write the config_step hieradata
  copy: content="{{dict(step=step|int)|to_json}}" dest=/etc/puppet/hieradata/config_step.json force=true mode=0600

- name: Run puppet host configuration for step {{step}}
  command: >-
    puppet apply
    --modulepath=/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules
    --logdest syslog --logdest console --color=false
    /var/lib/tripleo-config/puppet_step_config.pp
  changed_when: false
  check_mode: no
  register: outputs
  failed_when: false
  no_log: true
  # The above never fails *but* it registers output and return code in the 'outputs' variable

- debug: var=(outputs.stderr|default('')).split('\n')|union(outputs.stdout_lines|default([]))
  when: outputs is defined
  failed_when: outputs|failed
  # The line above is the one that should fail (but does not)

We're still investigating whether puppet itself is not returning an error, or whether this ansible task masks the error.
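For illustration only, a minimal sketch of how that last task could propagate the puppet result, assuming puppet's --detailed-exitcodes semantics (0 = no changes, 2 = changes applied, anything else = failure). This is just the shape of the missing check, not the actual fix that will land:

- name: Run puppet host configuration for step {{step}}
  command: >-
    puppet apply --detailed-exitcodes
    --modulepath=/etc/puppet/modules:/opt/stack/puppet-modules:/usr/share/openstack-puppet/modules
    --logdest syslog --logdest console --color=false
    /var/lib/tripleo-config/puppet_step_config.pp
  register: outputs
  changed_when: outputs.rc == 2
  failed_when: false
  no_log: true

- name: Show puppet output and fail on a bad return code
  debug:
    var: (outputs.stderr|default('')).split('\n')|union(outputs.stdout_lines|default([]))
  when: outputs is defined
  # outputs.rc carries puppet's exit code; with --detailed-exitcodes,
  # 0 and 2 mean success, anything else should fail the deployment step
  failed_when: outputs.rc not in [0, 2]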
I'm moving this bug to MODIFIED as it doesn't require any code change of its own; it just needs bz#1499217 and bz#1501852 to be fixed.
VERIFIED openstack-tripleo-heat-templates-7.0.3-16.el7ost.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2017:3462