Description of problem:

A customer has reported an issue with the autoscaling deployment described in our doc [1].

[1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/auto_scaling_for_instances/index

While debugging the issue, I reproduced the very strange results that Gnocchi returns for rate:mean aggregates. Since Aodh relies on these results, it triggers the wrong alarm, which results in unexpected scale-in/scale-out of the cluster.

At first we had only one instance in the deployment, and at that point Gnocchi returned proper results.

~~~
(overcloud) [stack@undercloud-0 example1]$ openstack server list
+--------------------------------------+-------------------------------------------------------+--------+------------------------------------+-------+--------+
| ID                                   | Name                                                  | Status | Networks                           | Image | Flavor |
+--------------------------------------+-------------------------------------------------------+--------+------------------------------------+-------+--------+
| 0aa7bb29-8c10-4501-83f2-e53d5867c03d | ex-xbgc-74ypa4tgpa2w-pwc3n5yj6l5z-server-odyfkdeuo6r3 | ACTIVE | private=192.168.10.147, 10.0.0.195 |       |        |
+--------------------------------------+-------------------------------------------------------+--------+------------------------------------+-------+--------+
(overcloud) [stack@undercloud-0 example1]$ gnocchi metric list| grep cpu
| 0b815173-e7a4-438a-8e9d-cbe9d449edc9 | ceilometer-low-rate | cpu   | ns   | 0aa7bb29-8c10-4501-83f2-e53d5867c03d |
| 50c382ec-0655-4dff-8f4b-f86b846cc812 | ceilometer-low      | vcpus | vcpu | 0aa7bb29-8c10-4501-83f2-e53d5867c03d |
(overcloud) [stack@undercloud-0 example1]$ gnocchi measures show 0b815173-e7a4-438a-8e9d-cbe9d449edc9
+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2021-01-20T12:45:00+00:00 |       300.0 | 12520000000.0 |
| 2021-01-20T12:50:00+00:00 |       300.0 | 13740000000.0 |
| 2021-01-20T12:55:00+00:00 |       300.0 | 15060000000.0 |
| 2021-01-20T13:00:00+00:00 |       300.0 | 16270000000.0 |
+---------------------------+-------------+---------------+
(overcloud) [stack@undercloud-0 example1]$ openstack metric measures aggregation --resource-type instance --granularity 300 --aggregation rate:mean --metric cpu --query server_group=ef852196-edf3-43f7-9f01-cb0689cb8a04 --needed-overlap 0
+---------------------------+-------------+--------------+
| timestamp                 | granularity |        value |
+---------------------------+-------------+--------------+
| 2021-01-20T12:50:00+00:00 |       300.0 | 1220000000.0 |
| 2021-01-20T12:55:00+00:00 |       300.0 | 1320000000.0 |
| 2021-01-20T13:00:00+00:00 |       300.0 | 1210000000.0 |
+---------------------------+-------------+--------------+
~~~

However, when the cluster was scaled out and a new server was added, the aggregation results from Gnocchi contained unreasonable negative values.

~~~
(overcloud) [stack@undercloud-0 example1]$ openstack server list
+--------------------------------------+-------------------------------------------------------+--------+------------------------------------+-------+--------+
| ID                                   | Name                                                  | Status | Networks                           | Image | Flavor |
+--------------------------------------+-------------------------------------------------------+--------+------------------------------------+-------+--------+
| d83697ba-6c16-4c2a-9000-142d3d24bba8 | ex-xbgc-azosslbwyedb-jb2r5ejlot4m-server-3iicpwm7yh7z | ACTIVE | private=192.168.10.88, 10.0.0.154  |       |        |
| 0aa7bb29-8c10-4501-83f2-e53d5867c03d | ex-xbgc-74ypa4tgpa2w-pwc3n5yj6l5z-server-odyfkdeuo6r3 | ACTIVE | private=192.168.10.147, 10.0.0.195 |       |        |
+--------------------------------------+-------------------------------------------------------+--------+------------------------------------+-------+--------+
(overcloud) [stack@undercloud-0 example1]$ gnocchi metric list | grep cpu
| 0b815173-e7a4-438a-8e9d-cbe9d449edc9 | ceilometer-low-rate | cpu   | ns   | 0aa7bb29-8c10-4501-83f2-e53d5867c03d |
| 50c382ec-0655-4dff-8f4b-f86b846cc812 | ceilometer-low      | vcpus | vcpu | 0aa7bb29-8c10-4501-83f2-e53d5867c03d |
| 61628b14-d9b2-4c20-b990-1352ef060b10 | ceilometer-low-rate | cpu   | ns   | d83697ba-6c16-4c2a-9000-142d3d24bba8 |
| f6a130c6-a217-4813-8881-789648e92330 | ceilometer-low      | vcpus | vcpu | d83697ba-6c16-4c2a-9000-142d3d24bba8 |
(overcloud) [stack@undercloud-0 example1]$ gnocchi measures show 0b815173-e7a4-438a-8e9d-cbe9d449edc9
+---------------------------+-------------+----------------+
| timestamp                 | granularity |          value |
+---------------------------+-------------+----------------+
| 2021-01-20T12:45:00+00:00 |       300.0 |  12520000000.0 |
| 2021-01-20T12:50:00+00:00 |       300.0 |  13740000000.0 |
| 2021-01-20T12:55:00+00:00 |       300.0 |  15060000000.0 |
| 2021-01-20T13:00:00+00:00 |       300.0 |  16270000000.0 |
| 2021-01-20T13:05:00+00:00 |       300.0 |  17480000000.0 |
| 2021-01-20T13:10:00+00:00 |       300.0 |  42370000000.0 |
| 2021-01-20T13:15:00+00:00 |       300.0 | 340970000000.0 | (*) Here I triggered load on the instance, which is why we see a huge bump here.
+---------------------------+-------------+----------------+
(overcloud) [stack@undercloud-0 example1]$ gnocchi measures show 61628b14-d9b2-4c20-b990-1352ef060b10
+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2021-01-20T13:15:00+00:00 |       300.0 | 12460000000.0 |
+---------------------------+-------------+---------------+
(overcloud) [stack@undercloud-0 example1]$ openstack metric measures aggregation --resource-type instance --granularity 300 --aggregation rate:mean --metric cpu --query server_group=ef852196-edf3-43f7-9f01-cb0689cb8a04 --needed-overlap 0
+---------------------------+-------------+----------------+
| timestamp                 | granularity |          value |
+---------------------------+-------------+----------------+
| 2021-01-20T12:55:00+00:00 |       300.0 |    100000000.0 |
| 2021-01-20T13:00:00+00:00 |       300.0 |   -110000000.0 |
| 2021-01-20T13:05:00+00:00 |       300.0 |            0.0 |
| 2021-01-20T13:10:00+00:00 |       300.0 |  23680000000.0 |
| 2021-01-20T13:15:00+00:00 |       300.0 | 273710000000.0 |
+---------------------------+-------------+----------------+
~~~

Then a third instance was added. Gnocchi still returned broken results.

~~~
(overcloud) [stack@undercloud-0 example1]$ openstack server list
+--------------------------------------+-------------------------------------------------------+--------+------------------------------------+-------+--------+
| ID                                   | Name                                                  | Status | Networks                           | Image | Flavor |
+--------------------------------------+-------------------------------------------------------+--------+------------------------------------+-------+--------+
| c90af056-acd9-4350-9852-e0b4b39fe247 | ex-xbgc-zwegoxdaexln-fsngb3ju77rz-server-zll7q4ebvb7d | ACTIVE | private=192.168.10.173, 10.0.0.180 |       |        |
| d83697ba-6c16-4c2a-9000-142d3d24bba8 | ex-xbgc-azosslbwyedb-jb2r5ejlot4m-server-3iicpwm7yh7z | ACTIVE | private=192.168.10.88, 10.0.0.154  |       |        |
| 0aa7bb29-8c10-4501-83f2-e53d5867c03d | ex-xbgc-74ypa4tgpa2w-pwc3n5yj6l5z-server-odyfkdeuo6r3 | ACTIVE | private=192.168.10.147, 10.0.0.195 |       |        |
+--------------------------------------+-------------------------------------------------------+--------+------------------------------------+-------+--------+
(overcloud) [stack@undercloud-0 example1]$ gnocchi metric list | grep cpu
| 0b815173-e7a4-438a-8e9d-cbe9d449edc9 | ceilometer-low-rate | cpu   | ns   | 0aa7bb29-8c10-4501-83f2-e53d5867c03d |
| 50c382ec-0655-4dff-8f4b-f86b846cc812 | ceilometer-low      | vcpus | vcpu | 0aa7bb29-8c10-4501-83f2-e53d5867c03d |
| 61628b14-d9b2-4c20-b990-1352ef060b10 | ceilometer-low-rate | cpu   | ns   | d83697ba-6c16-4c2a-9000-142d3d24bba8 |
| 754f6645-210e-46c1-8f1f-ff87f464d042 | ceilometer-low-rate | cpu   | ns   | c90af056-acd9-4350-9852-e0b4b39fe247 |
| b42e4a2f-5ff3-41dd-ab0b-8a5ea0af5103 | ceilometer-low      | vcpus | vcpu | c90af056-acd9-4350-9852-e0b4b39fe247 |
| f6a130c6-a217-4813-8881-789648e92330 | ceilometer-low      | vcpus | vcpu | d83697ba-6c16-4c2a-9000-142d3d24bba8 |
(overcloud) [stack@undercloud-0 example1]$ gnocchi measures show 754f6645-210e-46c1-8f1f-ff87f464d042
+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2021-01-20T13:20:00+00:00 |       300.0 | 12680000000.0 |
+---------------------------+-------------+---------------+
(overcloud) [stack@undercloud-0 example1]$ gnocchi measures show 61628b14-d9b2-4c20-b990-1352ef060b10
+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2021-01-20T13:15:00+00:00 |       300.0 | 12460000000.0 |
| 2021-01-20T13:20:00+00:00 |       300.0 | 13730000000.0 |
+---------------------------+-------------+---------------+
(overcloud) [stack@undercloud-0 example1]$ gnocchi measures show 0b815173-e7a4-438a-8e9d-cbe9d449edc9
+---------------------------+-------------+----------------+
| timestamp                 | granularity |          value |
+---------------------------+-------------+----------------+
| 2021-01-20T12:45:00+00:00 |       300.0 |  12520000000.0 |
| 2021-01-20T12:50:00+00:00 |       300.0 |  13740000000.0 |
| 2021-01-20T12:55:00+00:00 |       300.0 |  15060000000.0 |
| 2021-01-20T13:00:00+00:00 |       300.0 |  16270000000.0 |
| 2021-01-20T13:05:00+00:00 |       300.0 |  17480000000.0 |
| 2021-01-20T13:10:00+00:00 |       300.0 |  42370000000.0 |
| 2021-01-20T13:15:00+00:00 |       300.0 | 340970000000.0 |
| 2021-01-20T13:20:00+00:00 |       300.0 | 638260000000.0 |
+---------------------------+-------------+----------------+
(overcloud) [stack@undercloud-0 example1]$ openstack metric measures aggregation --resource-type instance --granularity 300 --aggregation rate:mean --metric cpu --query server_group=ef852196-edf3-43f7-9f01-cb0689cb8a04 --needed-overlap 0
+---------------------------+-------------+-----------------+
| timestamp                 | granularity |           value |
+---------------------------+-------------+-----------------+
| 2021-01-20T12:55:00+00:00 |       300.0 |     100000000.0 |
| 2021-01-20T13:00:00+00:00 |       300.0 |    -110000000.0 |
| 2021-01-20T13:05:00+00:00 |       300.0 |             0.0 |
| 2021-01-20T13:10:00+00:00 |       300.0 |   23680000000.0 |
| 2021-01-20T13:15:00+00:00 |       300.0 |  273710000000.0 |
| 2021-01-20T13:20:00+00:00 |       300.0 | -149320000000.0 |
+---------------------------+-------------+-----------------+
~~~

After waiting a while for new measures to be added, Gnocchi still returns broken results.

~~~
(overcloud) [stack@undercloud-0 example1]$ openstack metric measures aggregation --resource-type instance --granularity 300 --aggregation rate:mean --metric cpu --query server_group=ef852196-edf3-43f7-9f01-cb0689cb8a04 --needed-overlap 0
+---------------------------+-------------+---------------------+
| timestamp                 | granularity |               value |
+---------------------------+-------------+---------------------+
| 2021-01-20T13:30:00+00:00 |       300.0 |         480000000.0 |
| 2021-01-20T13:35:00+00:00 |       300.0 |        -360000000.0 |
| 2021-01-20T13:40:00+00:00 |       300.0 |         350000000.0 |
| 2021-01-20T13:45:00+00:00 |       300.0 | -423333333.33332825 |
| 2021-01-20T13:50:00+00:00 |       300.0 |  316666666.66667175 |
+---------------------------+-------------+---------------------+
(overcloud) [stack@undercloud-0 example1]$ gnocchi measures show 61628b14-d9b2-4c20-b990-1352ef060b10
+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2021-01-20T13:15:00+00:00 |       300.0 | 12460000000.0 |
| 2021-01-20T13:20:00+00:00 |       300.0 | 13730000000.0 |
| 2021-01-20T13:25:00+00:00 |       300.0 | 15080000000.0 |
| 2021-01-20T13:30:00+00:00 |       300.0 | 16290000000.0 |
| 2021-01-20T13:35:00+00:00 |       300.0 | 17510000000.0 |
| 2021-01-20T13:40:00+00:00 |       300.0 | 18700000000.0 |
| 2021-01-20T13:45:00+00:00 |       300.0 | 19880000000.0 |
| 2021-01-20T13:50:00+00:00 |       300.0 | 21090000000.0 |
+---------------------------+-------------+---------------+
(overcloud) [stack@undercloud-0 example1]$ gnocchi measures show 0b815173-e7a4-438a-8e9d-cbe9d449edc9
+---------------------------+-------------+-----------------+
| timestamp                 | granularity |           value |
+---------------------------+-------------+-----------------+
| 2021-01-20T12:45:00+00:00 |       300.0 |   12520000000.0 |
| 2021-01-20T12:50:00+00:00 |       300.0 |   13740000000.0 |
| 2021-01-20T12:55:00+00:00 |       300.0 |   15060000000.0 |
| 2021-01-20T13:00:00+00:00 |       300.0 |   16270000000.0 |
| 2021-01-20T13:05:00+00:00 |       300.0 |   17480000000.0 |
| 2021-01-20T13:10:00+00:00 |       300.0 |   42370000000.0 |
| 2021-01-20T13:15:00+00:00 |       300.0 |  340970000000.0 |
| 2021-01-20T13:20:00+00:00 |       300.0 |  638260000000.0 |
| 2021-01-20T13:25:00+00:00 |       300.0 |  934970000000.0 |
| 2021-01-20T13:30:00+00:00 |       300.0 | 1233120000000.0 |
| 2021-01-20T13:35:00+00:00 |       300.0 | 1530310000000.0 |
| 2021-01-20T13:40:00+00:00 |       300.0 | 1828570000000.0 |
| 2021-01-20T13:45:00+00:00 |       300.0 | 2125600000000.0 |
| 2021-01-20T13:50:00+00:00 |       300.0 | 2423520000000.0 |
+---------------------------+-------------+-----------------+
(overcloud) [stack@undercloud-0 example1]$ gnocchi measures show 754f6645-210e-46c1-8f1f-ff87f464d042
+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2021-01-20T13:20:00+00:00 |       300.0 | 12680000000.0 |
| 2021-01-20T13:25:00+00:00 |       300.0 | 13850000000.0 |
| 2021-01-20T13:30:00+00:00 |       300.0 | 15160000000.0 |
| 2021-01-20T13:35:00+00:00 |       300.0 | 16340000000.0 |
| 2021-01-20T13:40:00+00:00 |       300.0 | 17530000000.0 |
| 2021-01-20T13:45:00+00:00 |       300.0 | 18690000000.0 |
| 2021-01-20T13:50:00+00:00 |       300.0 | 19880000000.0 |
+---------------------------+-------------+---------------+
~~~

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Follow the documentation and set up the auto scaling resources [1]
   [1] https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html-single/auto_scaling_for_instances/index
2. Check the aggregation results that are used in the Aodh alarm

Actual results:
The results include negative values after the cluster is scaled out.

Expected results:
The results should contain only positive values that reflect reasonable CPU usage.

Additional info:
A similar issue was reported in the community.
https://github.com/gnocchixyz/gnocchi/issues/1044

In that discussion it was mentioned that using reaggregation would solve the problem, and in fact Gnocchi returns "better" results (I have not yet confirmed that they are CORRECT) with --reaggregation mean.

(overcloud) [stack@undercloud-0 ~]$ gnocchi measures show 61628b14-d9b2-4c20-b990-1352ef060b10
+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2021-01-20T13:15:00+00:00 |       300.0 | 12460000000.0 |
| 2021-01-20T13:20:00+00:00 |       300.0 | 13730000000.0 |
| 2021-01-20T13:25:00+00:00 |       300.0 | 15080000000.0 |
| 2021-01-20T13:30:00+00:00 |       300.0 | 16290000000.0 |
| 2021-01-20T13:35:00+00:00 |       300.0 | 17510000000.0 |
| 2021-01-20T13:40:00+00:00 |       300.0 | 18700000000.0 |
| 2021-01-20T13:45:00+00:00 |       300.0 | 19880000000.0 |
| 2021-01-20T13:50:00+00:00 |       300.0 | 21090000000.0 |
| 2021-01-20T13:55:00+00:00 |       300.0 | 22290000000.0 |
| 2021-01-20T14:00:00+00:00 |       300.0 | 23500000000.0 |
| 2021-01-20T14:05:00+00:00 |       300.0 | 24700000000.0 |
+---------------------------+-------------+---------------+
(overcloud) [stack@undercloud-0 ~]$ gnocchi measures show 0b815173-e7a4-438a-8e9d-cbe9d449edc9
+---------------------------+-------------+-----------------+
| timestamp                 | granularity |           value |
+---------------------------+-------------+-----------------+
| 2021-01-20T12:45:00+00:00 |       300.0 |   12520000000.0 |
| 2021-01-20T12:50:00+00:00 |       300.0 |   13740000000.0 |
| 2021-01-20T12:55:00+00:00 |       300.0 |   15060000000.0 |
| 2021-01-20T13:00:00+00:00 |       300.0 |   16270000000.0 |
| 2021-01-20T13:05:00+00:00 |       300.0 |   17480000000.0 |
| 2021-01-20T13:10:00+00:00 |       300.0 |   42370000000.0 |
| 2021-01-20T13:15:00+00:00 |       300.0 |  340970000000.0 |
| 2021-01-20T13:20:00+00:00 |       300.0 |  638260000000.0 |
| 2021-01-20T13:25:00+00:00 |       300.0 |  934970000000.0 |
| 2021-01-20T13:30:00+00:00 |       300.0 | 1233120000000.0 |
| 2021-01-20T13:35:00+00:00 |       300.0 | 1530310000000.0 |
| 2021-01-20T13:40:00+00:00 |       300.0 | 1828570000000.0 |
| 2021-01-20T13:45:00+00:00 |       300.0 | 2125600000000.0 |
| 2021-01-20T13:50:00+00:00 |       300.0 | 2423520000000.0 |
| 2021-01-20T13:55:00+00:00 |       300.0 | 2721700000000.0 |
| 2021-01-20T14:00:00+00:00 |       300.0 | 3019660000000.0 |
| 2021-01-20T14:05:00+00:00 |       300.0 | 3316890000000.0 |
+---------------------------+-------------+-----------------+
(overcloud) [stack@undercloud-0 ~]$ gnocchi measures show 754f6645-210e-46c1-8f1f-ff87f464d042
+---------------------------+-------------+---------------+
| timestamp                 | granularity |         value |
+---------------------------+-------------+---------------+
| 2021-01-20T13:20:00+00:00 |       300.0 | 12680000000.0 |
| 2021-01-20T13:25:00+00:00 |       300.0 | 13850000000.0 |
| 2021-01-20T13:30:00+00:00 |       300.0 | 15160000000.0 |
| 2021-01-20T13:35:00+00:00 |       300.0 | 16340000000.0 |
| 2021-01-20T13:40:00+00:00 |       300.0 | 17530000000.0 |
| 2021-01-20T13:45:00+00:00 |       300.0 | 18690000000.0 |
| 2021-01-20T13:50:00+00:00 |       300.0 | 19880000000.0 |
| 2021-01-20T13:55:00+00:00 |       300.0 | 21080000000.0 |
| 2021-01-20T14:00:00+00:00 |       300.0 | 22250000000.0 |
| 2021-01-20T14:05:00+00:00 |       300.0 | 23430000000.0 |
+---------------------------+-------------+---------------+
(overcloud) [stack@undercloud-0 ~]$ openstack metric measures aggregation --resource-type instance --granularity 300 --aggregation rate:mean --metric cpu --query server_group=ef852196-edf3-43f7-9f01-cb0689cb8a04 --needed-overlap 0 --reaggregation mean
+---------------------------+-------------+--------------------+
| timestamp                 | granularity |              value |
+---------------------------+-------------+--------------------+
| 2021-01-20T13:25:00+00:00 |       300.0 |  99743333333.33333 |
| 2021-01-20T13:30:00+00:00 |       300.0 | 100223333333.33333 |
| 2021-01-20T13:35:00+00:00 |       300.0 |  99863333333.33333 |
| 2021-01-20T13:40:00+00:00 |       300.0 | 100213333333.33333 |
| 2021-01-20T13:45:00+00:00 |       300.0 |      99790000000.0 |
| 2021-01-20T13:50:00+00:00 |       300.0 | 100106666666.66667 |
| 2021-01-20T13:55:00+00:00 |       300.0 | 100193333333.33333 |
| 2021-01-20T14:00:00+00:00 |       300.0 | 100113333333.33333 |
| 2021-01-20T14:05:00+00:00 |       300.0 |      99870000000.0 |
+---------------------------+-------------+--------------------+
(overcloud) [stack@undercloud-0 ~]$ openstack metric measures aggregation --resource-type instance --granularity 300 --aggregation rate:mean --metric cpu --query server_group=ef852196-edf3-43f7-9f01-cb0689cb8a04 --needed-overlap 0
+---------------------------+-------------+---------------------+
| timestamp                 | granularity |               value |
+---------------------------+-------------+---------------------+
| 2021-01-20T13:30:00+00:00 |       300.0 |         480000000.0 |
| 2021-01-20T13:35:00+00:00 |       300.0 |        -360000000.0 |
| 2021-01-20T13:40:00+00:00 |       300.0 |         350000000.0 |
| 2021-01-20T13:45:00+00:00 |       300.0 | -423333333.33332825 |
| 2021-01-20T13:50:00+00:00 |       300.0 |  316666666.66667175 |
| 2021-01-20T13:55:00+00:00 |       300.0 |    86666666.6666565 |
| 2021-01-20T14:00:00+00:00 |       300.0 |         -80000000.0 |
| 2021-01-20T14:05:00+00:00 |       300.0 | -243333333.33332825 |
+---------------------------+-------------+---------------------+

However, even if we can solve the issue with reaggregation, the problem is that Aodh doesn't support reaggregation or any combined aggregations, IIUC.
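
As a sanity check on the numbers above, the sketch below recomputes both aggregation outputs by hand from the `gnocchi measures show` values. It assumes that the default call (no --reaggregation) ends up applying the rate twice, once per metric and once more after taking the mean across metrics, and that --reaggregation mean stops after the mean. That is an inference from the figures in this report, not something confirmed against the Gnocchi source. The helper names (`rate`, `mean_across`) are made up for illustration, and treating a missing point as "skip this metric" is my reading of --needed-overlap 0.

```python
from statistics import mean

# Cumulative cpu time (ns) per metric, copied from the `gnocchi measures show`
# output in this report; keys are HH:MM on 2021-01-20, metric IDs are shortened.
CPU = {
    "0b815173": {"13:10": 42_370_000_000, "13:15": 340_970_000_000,
                 "13:20": 638_260_000_000, "13:25": 934_970_000_000,
                 "13:30": 1_233_120_000_000, "13:35": 1_530_310_000_000},
    "61628b14": {"13:15": 12_460_000_000, "13:20": 13_730_000_000,
                 "13:25": 15_080_000_000, "13:30": 16_290_000_000,
                 "13:35": 17_510_000_000},
    "754f6645": {"13:20": 12_680_000_000, "13:25": 13_850_000_000,
                 "13:30": 15_160_000_000, "13:35": 16_340_000_000},
}


def rate(series):
    """Difference between consecutive points; the first point has no rate."""
    stamps = sorted(series)
    return {cur: series[cur] - series[prev]
            for prev, cur in zip(stamps, stamps[1:])}


def mean_across(all_series):
    """Mean over whichever metrics have a point at each timestamp
    (assumed to mirror --needed-overlap 0)."""
    stamps = sorted({t for s in all_series for t in s})
    return {t: mean(s[t] for s in all_series if t in s) for t in stamps}


per_metric_rates = [rate(s) for s in CPU.values()]
mean_of_rates = mean_across(per_metric_rates)  # compare: --reaggregation mean
rate_applied_twice = rate(mean_of_rates)       # compare: default output

for t in ("13:25", "13:30", "13:35"):
    print(t, mean_of_rates[t], rate_applied_twice[t])
```

With these inputs, `mean_of_rates` reproduces the --reaggregation mean rows (about 99743333333.33 at 13:25 and 100223333333.33 at 13:30), and `rate_applied_twice` reproduces the default rows (480000000 at 13:30, -360000000 at 13:35, and -149320000000 at 13:20). If the assumption holds, the negative values are the change in the average CPU rate between periods rather than the rate itself.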
This bug was reported for 16.1. Since there are no plans to release any more 16.1 z-streams, should we close this bug as WONTFIX?

Fixes for this bug have been backported to:
16.2 - https://bugzilla.redhat.com/show_bug.cgi?id=2133030
17.0 - https://bugzilla.redhat.com/show_bug.cgi?id=2133029
17.1 - https://bugzilla.redhat.com/show_bug.cgi?id=2133027
Yes, this totally makes sense to close it.
(In reply to Matthias Runge from comment #18)
> Yes, this totally makes sense to close it.

This has been closed, but did anyone consider the release note or documentation implications? Should there be documentation for this feature, or a release note, rather than silently closing this?
(In reply to Leif Madsen from comment #19)
> (In reply to Matthias Runge from comment #18)
> > Yes, this totally makes sense to close it.
>
> This has been closed, but did anyone consider release note or documentation
> implication? Should there be documentation for this feature, or a release
> note rather than silently closing this?

Are you suggesting that we document that this was not fixed, even though it has been fixed in 16.2, 17.0, and 17.1 (see the bugs in https://bugzilla.redhat.com/show_bug.cgi?id=1918349#c17)? Those bugs have doc texts.