Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1616237

Summary: null characters in oslo.message cause mistral to fail
Product: Red Hat OpenStack Reporter: danil.zhigalin
Component: python-oslo-messagingAssignee: Victor Stinner <vstinner>
Status: CLOSED NOTABUG QA Contact: Udi Shkalim <ushkalim>
Severity: medium Docs Contact:
Priority: medium    
Version: 13.0 (Queens)CC: apevec, chjones, danil.zhigalin, fpercoco, jeckersb, lhh, srevivo
Target Milestone: ---Keywords: Triaged, ZStream
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-08-23 06:28:18 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
all files in one archive none

Description danil.zhigalin 2018-08-15 11:07:46 UTC
Created attachment 1476113 [details]
all files in one archive

Description of problem:

We are facing problems with OSP13 overcloud installation. Director has been installed successfully and is operational. We already have experience with OSP12 and we based templates on the version of templates that worked for us in the 12 version. This is a fresh installation of director - i.e. not an upgrade of an existing OSP12 environment.

Version-Release number of selected component (if applicable):

OSP13, mistral 6.0.2, heat 10.0.1, oslo-messaging 5.35.0, python-tripleoclient 9.2.1-13 - these are the components that we are seeing problems with


How reproducible:

Steps to Reproduce:

1. Deployment is started using the following command: 

openstack overcloud deploy \
    --templates \
    --libvirt-type kvm \
    --timeout 120 \
    -r /home/stack/templates/roles_data.yaml \
    -e /home/stack/templates/fra1-ost2/network-environment.yaml \
    -e /home/stack/templates/scale-environment.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml \
    -e /home/stack/templates/fra1-ost2/overcloud_images.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/host-config-and-reboot.yaml \
    -e /usr/share/openstack-tripleo-heat-templates/environments/neutron-sriov.yaml \
    -e /home/stack/templates/sriov-env.yaml \
    -e /home/stack/templates/extraconfig-environment.yaml \
    --ntp-server 172.16.95.210

2a. In some occasions deployment is running up to this point:

Started Mistral Workflow tripleo.validations.v1.check_pre_deployment_validations. Execution ID: 7784ff99-6a0e-44fe-8193-adb07eeab91d
Waiting for messages on queue 'tripleo' with no timeout.
Removing the current plan files
Uploading new plan files
Started Mistral Workflow tripleo.plan_management.v1.update_deployment_plan. Execution ID: 845ae510-efc2-46d9-916a-ca8e7e4bbf2d
('The read operation timed out',)

2b. In some occations deployment is running up to this point:

Started Mistral Workflow tripleo.validations.v1.check_pre_deployment_validations. Execution ID: 779147aa-3ff8-450b-8588-e9a146a9b016
Waiting for messages on queue 'tripleo' with no timeout.
Removing the current plan files
Uploading new plan files
Started Mistral Workflow tripleo.plan_management.v1.update_deployment_plan. Execution ID: 6cc91812-cbd1-445f-a82d-03eec9aa8e8d
Plan updated.
Processing templates in the directory /tmp/tripleoclient-mr0HtD/tripleo-heat-templates
Unable to establish connection to https://172.21.50.11:13989/v2/action_executions: ('Connection aborted.', BadStatusLine("''",))


Actual results:

Deployment stops after this point. No stack is created and no attempt of creating the stack was done.

Expected results:

Deployment should continue and at least reach stack creation stage or gieve some meaningful answer why it can't continue.

Additional info:

in case of 2a.:

There is the following trace in /var/log/mistral/engine.log

2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit [req-508ca646-1519-4cb5-9c82-5e021f680c16 f1e33f8f3cb947408e8c15f7bc7f9677 bdfdab82c3b04e49a2b06b52a67e955d - - -] Failed to process message ... skipping it.: DecodeError: Invalid control character at: line 1 column 1694931 (char 1694930)
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 370, in _callback
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     self.callback(RabbitMessage(message))
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 249, in __init__
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     rpc_common.deserialize_msg(raw_message.payload))
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 207, in payload
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     return self._decoded_cache if self._decoded_cache else self.decode()
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 192, in decode
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     self._decoded_cache = self._decode()
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 197, in _decode
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     self.content_encoding, accept=self.accept)
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 263, in loads
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     return decode(data)
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     self.gen.throw(type, value, traceback)
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 54, in _reraise_errors
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     reraise(wrapper, wrapper(exc), sys.exc_info()[2])
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 50, in _reraise_errors
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     yield
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 263, in loads
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     return decode(data)
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/utils/json.py", line 96, in loads
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     return stdjson.loads(s)
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     return _default_decoder.decode(s)
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit     obj, end = self.scan_once(s, idx)
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit DecodeError: Invalid control character at: line 1 column 1694931 (char 1694930)
2018-08-15 12:04:30.088 59391 ERROR oslo.messaging._drivers.impl_rabbit

in case of 2b.:

There is the following trace in /var/log/heat/heat-engine.log:

2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit Traceback (most recent call last):
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 370, in _callback
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     self.callback(RabbitMessage(message))
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/oslo_messaging/_drivers/impl_rabbit.py", line 249, in __init__
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     rpc_common.deserialize_msg(raw_message.payload))
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 207, in payload
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     return self._decoded_cache if self._decoded_cache else self.decode()
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 192, in decode
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     self._decoded_cache = self._decode()
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/message.py", line 197, in _decode
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     self.content_encoding, accept=self.accept)
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 263, in loads
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     return decode(data)
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python2.7/contextlib.py", line 35, in __exit__
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     self.gen.throw(type, value, traceback)
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 54, in _reraise_errors
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     reraise(wrapper, wrapper(exc), sys.exc_info()[2])
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 50, in _reraise_errors
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     yield
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/serialization.py", line 263, in loads
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     return decode(data)
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib/python2.7/site-packages/kombu/utils/json.py", line 96, in loads
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     return stdjson.loads(s)
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python2.7/json/__init__.py", line 338, in loads
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     return _default_decoder.decode(s)
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python2.7/json/decoder.py", line 366, in decode
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     obj, end = self.raw_decode(s, idx=_w(s, 0).end())
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit   File "/usr/lib64/python2.7/json/decoder.py", line 382, in raw_decode
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit     obj, end = self.scan_once(s, idx)
2018-08-15 11:28:14.077 2880 ERROR oslo.messaging._drivers.impl_rabbit DecodeError: Invalid control character at: line 1 column 1053443 (char 1053442)

Further investigation shows that there are indeed invalid characters - null characters with code 'x00' in json that is being sent over the messaging bus. They are not present in the initial templates but get introduced by the message issuer. Attached is the file with the intercepted message. This characters can be detected using "grep -oP "\x00"  message_body.txt | wc -l" or they also get showed up in VIM. In the attached file first such character is encountered on the position 1694930, which complies with the stack trace referred in the case 2a. In the attached packet capture file amqp_mistral_messages.pcap this character is also seen in frame #90, packet bytes line  0000ddc0, there is this 00 byte. 

In this situation it is expected that json parser will fail, since \0 characters are not allowed.

Why does message sender introduce this characters and how can this be avoided?

Comment 1 danil.zhigalin 2018-08-23 06:28:18 UTC
In the end it turned out that all data corruption was due to faulty memory in the director host.