Bug 1467947 - Tempest fails to complete successfully because Ceilometer causes massive load, which in turn causes Pacemaker to kill RabbitMQ [NEEDINFO]
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-ceilometer
Version: 10.0 (Newton)
Hardware: Unspecified
OS: Unspecified
Priority: low    Severity: medium
Target Milestone: ---
Target Release: 10.0 (Newton)
Assigned To: Mehdi ABAAKOUK
QA Contact: Sasha Smolyak
Keywords: Triaged, ZStream
Depends On:
Blocks: 1465529
Reported: 2017-07-05 11:07 EDT by Forrest Taylor
Modified: 2017-10-12 09:35 EDT
CC List: 15 users

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-10-12 09:35:12 EDT
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Flags: mabaakou: needinfo? (ftaylor)


Attachments
All logs on the controller node during the test run. (5.40 MB, application/x-xz)
2017-07-05 11:07 EDT, Forrest Taylor
Tempest preparation script (3.07 KB, application/x-shellscript)
2017-07-05 12:15 EDT, Forrest Taylor
Tempest preparation script (9.07 KB, application/x-shellscript)
2017-07-11 19:56 EDT, Forrest Taylor

Description Forrest Taylor 2017-07-05 11:07:51 EDT
Created attachment 1294655 [details]
All logs on the controller node during the test run.

Description of problem:
Tempest fails to complete successfully.  When testr starts running, Ceilometer generates a massive load average (60+).  This eventually causes RabbitMQ to stop responding, so Pacemaker kills RabbitMQ, which causes several tests to fail.



Version-Release number of selected component (if applicable):


How reproducible:
Always fails, but it fails in different places because RabbitMQ stops responding at different times.

Steps to Reproduce:
1. On workstation, run the following command to configure resources:
[student@workstation ~]$ lab deployment-overcloud-verif setup


2. Log in to director and authenticate as admin in OpenStack:
[student@workstation ~]$ ssh stack@director

[stack@director ~]$ source ~/overcloudrc


3. Prepare the director system for the Testing Service:

[stack@director ~]$ sudo ovs-vsctl add-port br-ctlplane vlan10 tag=10 \
-- set interface vlan10 type=internal

[stack@director ~]$ sudo ip link set dev vlan10 up

[stack@director ~]$ sudo ip addr add 172.24.1.200/24 dev vlan10

[stack@director ~]$ ip addr | grep vlan10


4. Install Tempest and prepare the environment:
[stack@director ~]$ sudo yum -y install openstack-tempest{,-all}

[stack@director ~]$ mkdir ~/tempest

[stack@director ~]$ cd ~/tempest

[stack@director tempest]$ /usr/share/openstack-tempest-13.0.0/tools/configure-tempest-directory


5. Locate the network ID for the provider-172.25.250 external network:
[stack@director tempest]$ openstack network show \
provider-172.25.250 -c id -f value


6. Run the config_tempest setup script using the external network ID:
[stack@director tempest]$ tools/config_tempest.py \
--deployer-input ~/tempest-deployer-input.conf --debug \
--create identity.uri $OS_AUTH_URL identity.admin_password $OS_PASSWORD \
--image http://materials.example.com/cirros-0.3.4-x86_64-disk.img \
--network-id PROVIDER-NETWORK-ID

(Use the real network ID found in step 5).


7. Edit ./etc/tempest.conf.  In the [service_available] section, add the following to the end of the list:
mistral = False
designate = False
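
The same edit can also be made from the command line instead of editing the file by hand. This is only a sketch and assumes the crudini package is available on director; the file path is relative to the ~/tempest directory created above:

[stack@director tempest]$ sudo yum -y install crudini
[stack@director tempest]$ crudini --set etc/tempest.conf service_available mistral False
[stack@director tempest]$ crudini --set etc/tempest.conf service_available designate False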


8. Download the skip test file to avoid running some tests that will fail:
[stack@director tempest]$ scp student@workstation:./Downloads/tempest-smoke-skip ./tempest-smoke-skip

(student@workstation password is: student)


9. Run the tests:
[stack@director tempest]$ tools/run-tests.sh --skip-file ./tempest-smoke-skip


Actual results:
Tests eventually fail.

The tests default to running as many threads as CPUs.  This was originally 2, but we have increased resources for the environment, so it is now 6.  In both cases, this eventually fails.
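
To see the worker count a run will default to (assuming, as described above, one worker per CPU on the node running the tests), check the CPU count on director:

[stack@director tempest]$ nproc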




Expected results:
All tests pass.


Additional info:

If we force a single thread for the tests:
[stack@director tempest]$ tools/run-tests.sh --skip-file ./tempest-smoke-skip --concurrency 1

It does actually pass consistently, but takes longer to run.

During the run, Ceilometer creates a massive load and slows things down.  Eventually, RabbitMQ gets stuck and Pacemaker kills RabbitMQ.  This cycle repeats several times during the test run.
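
One way to watch this cycle while the tests run (a sketch using standard tools; the controller hostname is an example, and the rabbitmq Pacemaker resource name is assumed to match the resource used elsewhere in this bug) is to keep an eye on the load average and the RabbitMQ resource state on a controller:

[heat-admin@controller0 ~]$ watch -n 10 'uptime; sudo pcs status resources | grep -A 1 rabbitmq'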
Comment 2 Forrest Taylor 2017-07-05 12:15 EDT
Created attachment 1294665 [details]
Tempest preparation script

This script should be run on director as the stack user.  It will prepare everything for the Tempest tests.  Run this script instead of following the setup steps listed in the bugzilla.  At the end, it will print out the final Tempest commands to run, but it will not run them.
Comment 4 Forrest Taylor 2017-07-11 19:56 EDT
Created attachment 1296607 [details]
Tempest preparation script

This script should be run on director as the stack user.  It will prepare everything for the Tempest tests.  Run this script instead of following the setup steps listed in the bugzilla.  This version includes a smaller test set and is self-contained: it includes the test run as well.  It should take less than 10 minutes to run in the online environment and less than 3 minutes in a physical environment.
Comment 7 Mehdi ABAAKOUK 2017-07-17 11:29:22 EDT
I have tried the lab, and Ceph does not look healthy:

$ ceph health
HEALTH_ERR 68pgs are stuck inactive....

The Ceph pools have a size of 3 replicas with min_size 1. This makes Ceph slow. The size should be 1 as well, because you have only one node.

Also, three OSDs are configured in Ceph when you have only one, so Ceph tries to reach nodes that do not exist. That also makes Ceph slow, because it waits a long time on OSDs that will never come back.
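
To check this and adjust the pools on a single-node setup (a sketch; the pool name below is only an example, and each pool would need the same change), the standard Ceph commands can be used:

[heat-admin@ceph0 ~]$ sudo ceph osd tree                              # which OSDs exist and which are up
[heat-admin@ceph0 ~]$ sudo ceph osd dump | grep 'replicated size'    # size/min_size per pool
[heat-admin@ceph0 ~]$ sudo ceph osd pool set vms size 1              # example pool name; repeat per pool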

Also, even without running Tempest, Ceph is already reporting slow requests, for example taking more than 500 seconds to write data. So adding the Tempest load is not going to work.
These slow requests most likely come from the missing OSD nodes.

So my guess is that the Ceph node is too slow and Gnocchi cannot write its backlog to it. Ceilometer also cannot post measures to Gnocchi because Ceph is too slow to write them. That leaves many messages waiting to be processed on RabbitMQ.
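
One way to confirm that backlog (a sketch; it assumes the gnocchi client's "status" subcommand is available in this release) is to check how many measures are waiting to be processed:

[stack@director ~]$ source ~/overcloudrc
[stack@director ~]$ gnocchi status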

You should fix the Ceph setup first.
Comment 8 Forrest Taylor 2017-07-17 13:26:42 EDT
(In reply to Mehdi ABAAKOUK from comment #7)
> I have tried the lab, and Ceph does not look healthy:
> 
> $ ceph health
> HEALTH_ERR 68pgs are stuck inactive....
> 
> The Ceph pools have a size of 3 replicas with min_size 1. This makes Ceph
> slow. The size should be 1 as well, because you have only one node.
> 
> Also, three OSDs are configured in Ceph when you have only one, so Ceph
> tries to reach nodes that do not exist. That also makes Ceph slow, because
> it waits a long time on OSDs that will never come back.

Mehdi,

Did you run the script for the Tempest test?  As part of that script, it checks that Ceph is healthy and restarts it if it is not:

ceph0=ceph0
ssh="ssh -q -o ConnectTimeout=10 -o BatchMode=yes -o StrictHostKeyChecking=no"
cephservice=ceph.target
connstorage="${ssh} heat-admin@${ceph0}"

# If Ceph does not report HEALTH_OK, start the OSD disk units and restart the Ceph target.
if ! ${connstorage} 'sudo ceph health | grep HEALTH_OK'; then
  for i in ceph-disk@dev-vd{b,c,d}{1,2}; do
    ${connstorage} "sudo systemctl start $i"
  done

  if ${connstorage} "sudo systemctl restart ${cephservice}"; then
    echo SUCCESS
  else
    echo FAIL
  fi
fi
After running that, the Ceph node should have all OSDs up and running.

When I start my tests, Ceph status shows healthy, and the script will check this for you.
Comment 9 Mehdi ABAAKOUK 2017-07-17 17:21:38 EDT
I have restarted the lab, and Ceph works fine now.

I have reproduced the issue, but I'm not sure I can help here.

The controller has only 6 vCPUs, while the minimum recommendation is 24 vCPUs.

Six vCPUs running ~300 processes, some of them CPU intensive, have no chance of behaving well under load. Pacemaker simply times out getting the RabbitMQ status, because the server has chosen to do something else.
Comment 10 Robert Locke 2017-07-17 17:39:46 EDT
While the real world might require 24 vCPUs, you should be aware that we run these *identical* images as VMs on local hardware with only 32 GB of RAM, and the exercises/labs are fine. In fact, the controller node is configured with only 2 vCPUs in that scenario.

Other thoughts?
Comment 11 Mehdi ABAAKOUK 2017-07-17 18:02:53 EDT
I have retried with Ceilometer stopped. Tempest still fails, although more tests are able to pass. Pacemaker still times out retrieving the RabbitMQ status when the load is above 40 and kills RabbitMQ during the Tempest run.
Comment 12 Forrest Taylor 2017-07-17 18:13:14 EDT
(In reply to Mehdi ABAAKOUK from comment #9)
> I have restarted the lab, and Ceph works fine now.
> 
> I have reproduced the issue, but I'm not sure I can help here.
> 
> The controller has only 6 vCPUs, while the minimum recommendation is 24
> vCPUs.
> 
> Six vCPUs running ~300 processes, some of them CPU intensive, have no chance
> of behaving well under load. Pacemaker simply times out getting the RabbitMQ
> status, because the server has chosen to do something else.

The documentation states no such requirement.

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/10/html-single/director_installation_and_usage/#sect-Controller_Node_Requirements

This just mentions a 64-bit x86 processor with support for the Intel 64 or AMD64 CPU extensions.

As mentioned above, we have been successful at running these tests where the controller has only 2 vCPUs (on a physical box).

(In reply to Mehdi ABAAKOUK from comment #11)
> I have retried with Ceilometer stopped. Tempest still fails, although more
> tests are able to pass. Pacemaker still times out retrieving the RabbitMQ
> status when the load is above 40 and kills RabbitMQ during the Tempest run.

First, it works consistently for me when I stop all Ceilometer services.  Running these commands on the Undercloud (director) and all of the Overcloud nodes will allow the Tempest tests to succeed every time:

sudo systemctl stop openstack-ceilometer-api.service
sudo systemctl stop openstack-ceilometer-central.service
sudo systemctl stop openstack-ceilometer-collector.service
sudo systemctl stop openstack-ceilometer-compute.service
sudo systemctl stop openstack-ceilometer-notification.service
sudo systemctl stop openstack-ceilometer-polling.service
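
A sketch of how this could be scripted across the overcloud nodes (the node names are examples, and the heat-admin user follows the convention used in the preparation script; adjust both to the actual environment):

for node in controller0 compute0; do    # example overcloud node names
  ssh -o BatchMode=yes heat-admin@${node} 'sudo systemctl stop openstack-ceilometer-*'
done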


Second, you can extend the monitor timeout in Pacemaker for RabbitMQ by running:

sudo pcs resource update rabbitmq op monitor timeout=600s

I ran that at the beginning of my troubleshooting, but running the commands above allows the tests to succeed without changing the timeout.
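
To confirm the new monitor timeout took effect (a sketch; "pcs resource show <name>" is the syntax of the pcs version shipped with RHEL 7):

sudo pcs resource show rabbitmq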
Comment 13 Mehdi ABAAKOUK 2017-07-18 05:45:59 EDT
I have used "sudo systemctl stop openstack-ceilometer*" and remove openstack-ceilometer service check in your tempest script.

I still reproduce the issue every time, and RabbitMQ is killed at least once on each Tempest run. Something makes the VM slow (at least RabbitMQ). Ceilometer just increases the load on the VM and makes the issue occur more quickly.

I have an equivalent setup on an OVB cloud (with 4-vCPU/8 GB VMs) and autoscaling works; RabbitMQ returns its status in less than 1 second.

On your setup, it takes between 5 and 10 seconds to get the RabbitMQ status even without load, which does not look good.
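
A quick way to reproduce that measurement on a controller (a sketch; run it via sudo, and check the load average at the same time):

time sudo rabbitmqctl status > /dev/null    # takes about a second on a healthy, unloaded node
uptime                                      # current load average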
Comment 14 Robert Locke 2017-07-20 09:05:40 EDT
Can we escalate this and have someone figure out what is wrong with rabbit?
Comment 21 Peter Lemenkov 2017-07-27 11:59:30 EDT
(In reply to Robert Locke from comment #14)
> Can we escalate this and have someone figure out what is wrong with rabbit?

I don't think there is anything specifically wrong with RabbitMQ. As far as I know, RabbitMQ suffers from the lack of resources like any other application in the cluster. That's why it takes 5 seconds to reply even to the "rabbitmqctl help" command.

Improve the overall performance (lower the load average so the internal clustering logic works properly), and RabbitMQ will work flawlessly.
Comment 23 Mehdi ABAAKOUK 2017-10-12 09:35:12 EDT
I am closing this issue, since we now know what is happening and there is nothing to fix. The hardware used is simply not sufficient.
