Created attachment 788640
Description of problem:
For some sequences of characters in the console log, nova console-log displays:
ERROR: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-49673c01-2765-4942-b1ad-6cc79b520b4d)
When console-log is run often enough, it appears to kill nova-compute.
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. On the guest:

cat > test.py <<EOF
for a in range(9525, 0, -1):
    print u'%d %s' % (a, unichr(a))
EOF

python test.py > /dev/ttyS0

2. On the controller:

nova console-log foo --length 9522 | head -n 3
nova console-log foo --length 9523 | head -n 3
ERROR: The server has either erred or is incapable of performing the requested operation. (HTTP 500) (Request-ID: req-5a7f4f6c-7a10-45c8-9579-cc353de1b5b3)
An error appears in the log and on the CLI.
After trying several times:
nova console-log foo --length 9523 | head -n 3 &
[root@controller ~]# nova-manage service list 2> /dev/null | grep XXX
nova-compute master-01.rhos... nova enabled XXX 2013-08-20 20:41:40
Marking this urgent because it can be used for a DoS attack. This is also why I am setting the Security keyword.
Another way to trigger this stochastically is to run the following on the guest:

cat /dev/urandom > /dev/ttyS0
It looks like the --length parameter must be > 255 in order to trigger the bug.
I've been able to narrow down the cause of this bug.
Apparently it has nothing to do with the Unicode characters but with the length of the message being sent (an instance of bug #962557).
If you check the traceback in the log, you can see that it crashes while encoding the length of the string (65540 bytes) as a 16-bit unsigned integer. I imagine that by using Unicode characters you managed to push the size of the log up to that limit.
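For illustration, the 16-bit limit can be reproduced directly with Python's struct module (a minimal sketch of the failure mode, not the actual marshalling code):

import struct

struct.pack('>H', 65535)  # fits: the maximum value of an unsigned 16-bit field
struct.pack('>H', 65540)  # raises struct.error, as in the traceback above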
In Grizzly, the fix for supporting >65K messages in qpid (https://review.openstack.org/31689) was only half-ported, to avoid changing the message format in a stable release while still being able to receive messages in the new format if they come from Havana.
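For context, one common way to carry payloads past a 16-bit length field is to split them into chunks that each fit, with a flag telling the receiver whether more chunks follow. This is an illustrative sketch only; I have not verified that the patch does exactly this:

MAX_CHUNK = 65535

def split_message(data, max_chunk=MAX_CHUNK):
    # Sender side: yield (more_follows, chunk) pairs, each chunk small
    # enough for its length to be encoded as a 16-bit unsigned integer.
    for i in range(0, len(data), max_chunk):
        yield (i + max_chunk < len(data), data[i:i + max_chunk])

def reassemble(parts):
    # Receiver side: concatenate chunks until more_follows is False.
    # A half-port like the Grizzly one would keep this side only.
    buf = []
    for more, chunk in parts:
        buf.append(chunk)
        if not more:
            break
    return b''.join(buf)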
We have two options here: either we decrease MAX_CONSOLE_BYTES in the libvirt driver to prevent this issue, or we fully backport the fix from master (https://review.openstack.org/31689) to Grizzly.
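The first option amounts to capping the console output in the driver before it ever reaches the RPC layer. Roughly (an illustrative sketch with an assumed cap value, not the actual driver code):

import os

MAX_CONSOLE_BYTES = 65535 - 1024  # assumed cap, leaving headroom for the message envelope

def get_console_output(console_log_path):
    # Return at most MAX_CONSOLE_BYTES from the tail of the console log,
    # so the serialized RPC reply stays under the 16-bit length limit.
    size = os.path.getsize(console_log_path)
    with open(console_log_path, 'rb') as f:
        if size > MAX_CONSOLE_BYTES:
            f.seek(size - MAX_CONSOLE_BYTES)
        return f.read()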
On the other hand, the second issue you detected here is that when an error occurs in the messaging layer while a connection is being cleaned up, the connection is never returned to the connection_pool, so the component becomes unresponsive after rpc_conn_pool_size (30 by default) failures because there are no more connections left in the pool.
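To make the pool issue concrete, here is a toy model of the failure mode (names and structure are illustrative, not the actual oslo/nova RPC code): if the error path skips put(), every messaging failure leaks one connection, and after rpc_conn_pool_size failures get() never succeeds again:

class ConnectionPool(object):
    def __init__(self, size=30):  # mirrors the rpc_conn_pool_size default
        self._free = [object() for _ in range(size)]  # stand-ins for connections

    def get(self):
        # The real pool blocks when empty; raising here makes the hang visible.
        if not self._free:
            raise RuntimeError('pool exhausted: the component would hang here')
        return self._free.pop()

    def put(self, conn):
        self._free.append(conn)

def call(pool, do_rpc):
    conn = pool.get()
    try:
        return do_rpc(conn)
    finally:
        pool.put(conn)  # the fix: return the connection even when do_rpc fails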