Bug 801791

Summary: nova-compute fails to start on qemu-img error
Product: [Fedora] Fedora Reporter: Derek Higgins <derekh>
Component: openstack-novaAssignee: Pádraig Brady <pbrady>
Status: CLOSED ERRATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified Docs Contact:
Priority: unspecified    
Version: 17CC: akscram, alexander.sakhnov, asalkeld, bfilippov, jonathansteffan, markmc, matt_domsch, mlvov, pbrady, p, rbryant, rkukura
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: openstack-nova-2012.1-0.10.rc1.fc17 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-04-12 03:00:56 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Derek Higgins 2012-03-09 14:23:51 UTC
While running Nova tests, (I havn't been able to pinpont exactly how), nova managed to get into a state where one of its instances didn't contain a disk file in the instance directory, causing the compute server to crash during startup. To reproduce this I manually deleted /var/lib/nova/instances/instance-0000003c/disk from an instance while the compute service was running.

This exception appears in the logs, every minute after the file is deleted but compute keeps runing and I can still destroy and create instances

2012-03-09 12:44:11 ERROR nova.manager [-] Error during ComputeManager.update_available_resource: Unexpected error while running command.
Command: qemu-img info /var/lib/nova/instances/instance-0000003c/disk
Exit code: 1
Stdout: ''
Stderr: "qemu-img: Could not open '/var/lib/nova/instances/instance-0000003c/disk': No such file or directory\n"
(nova.manager): TRACE: Traceback (most recent call last):
(nova.manager): TRACE:   File "/usr/lib/python2.7/site-packages/nova/manager.py", line 155, in periodic_tasks
(nova.manager): TRACE:     task(self, context)
(nova.manager): TRACE:   File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2293, in update_available_resource
(nova.manager): TRACE:     self.driver.update_available_resource(context, self.host)
(nova.manager): TRACE:   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/connection.py", line 1786, in update_available_resource
(nova.manager): TRACE:     'disk_available_least': self.get_disk_available_least()}
(nova.manager): TRACE:   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/connection.py", line 2136, in get_disk_available_least
(nova.manager): TRACE:     disk_infos = utils.loads(self.get_instance_disk_info(i_name))
(nova.manager): TRACE:   File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/connection.py", line 2095, in get_instance_disk_info
(nova.manager): TRACE:     out, err = utils.execute('qemu-img', 'info', path)
(nova.manager): TRACE:   File "/usr/lib/python2.7/site-packages/nova/utils.py", line 236, in execute
(nova.manager): TRACE:     cmd=' '.join(cmd))
(nova.manager): TRACE: ProcessExecutionError: Unexpected error while running command.
(nova.manager): TRACE: Command: qemu-img info /var/lib/nova/instances/instance-0000003c/disk
(nova.manager): TRACE: Exit code: 1
(nova.manager): TRACE: Stdout: ''
(nova.manager): TRACE: Stderr: "qemu-img: Could not open '/var/lib/nova/instances/instance-0000003c/disk': No such file or directory\n"
(nova.manager): TRACE:
2012-03-09 12:44:12 INFO nova.compute.manager [-] Updating host status




Restarting the compute service, results in the error below and the service fails to start


2012-03-09 12:44:33 AUDIT nova.service [-] Starting compute node (version 2012.1-LOCALBRANCH:LOCALREVISION)
2012-03-09 12:44:33 INFO nova.rpc.common [req-29839fbd-6c46-476c-8065-5359f5fbbf3a None None] Connected to AMQP server on localhost:5672

2012-03-09 12:44:33 INFO nova.virt.libvirt.firewall [req-29839fbd-6c46-476c-8065-5359f5fbbf3a None None] [instance: 5c3bbebd-cd0b-4945-b5a2-071a8044fce1] Called setup_basic_filtering in nwfilter
2012-03-09 12:44:33 INFO nova.virt.libvirt.firewall [req-29839fbd-6c46-476c-8065-5359f5fbbf3a None None] [instance: 5c3bbebd-cd0b-4945-b5a2-071a8044fce1] Ensuring static filters
2012-03-09 12:44:34 INFO nova.virt.firewall [req-29839fbd-6c46-476c-8065-5359f5fbbf3a None None] Using cidr '0.0.0.0/0'
2012-03-09 12:44:34 INFO nova.virt.firewall [req-29839fbd-6c46-476c-8065-5359f5fbbf3a None None] Using fw_rules: ['-m state --state INVALID -j DROP', '-m state --state ESTABLISHED,RELATED -j ACCEPT', '-j $provider', u'-s 10.0.0.1 -p udp --sport 67 --dport 68 -j ACCEPT', u'-s 10.0.0.0/24 -j ACCEPT', '-j ACCEPT -p tcp --dport 22 -s 0.0.0.0/0']

2012-03-09 12:44:40 INFO nova.virt.libvirt.firewall [req-29839fbd-6c46-476c-8065-5359f5fbbf3a None None] [instance: e98da2c2-14f8-41a5-ab64-d8f84185658e] Called setup_basic_filtering in nwfilter
2012-03-09 12:44:40 INFO nova.virt.libvirt.firewall [req-29839fbd-6c46-476c-8065-5359f5fbbf3a None None] [instance: e98da2c2-14f8-41a5-ab64-d8f84185658e] Ensuring static filters
2012-03-09 12:44:40 INFO nova.virt.firewall [req-29839fbd-6c46-476c-8065-5359f5fbbf3a None None] Using cidr '0.0.0.0/0'
2012-03-09 12:44:40 INFO nova.virt.firewall [req-29839fbd-6c46-476c-8065-5359f5fbbf3a None None] Using fw_rules: ['-m state --state INVALID -j DROP', '-m state --state ESTABLISHED,RELATED -j ACCEPT', '-j $provider', u'-s 10.0.0.1 -p udp --sport 67 --dport 68 -j ACCEPT', u'-s 10.0.0.0/24 -j ACCEPT', '-j ACCEPT -p tcp --dport 22 -s 0.0.0.0/0']

2012-03-09 12:44:45 CRITICAL nova [-] Unexpected error while running command.
Command: qemu-img info /var/lib/nova/instances/instance-0000003c/disk
Exit code: 1
Stdout: ''
Stderr: "qemu-img: Could not open '/var/lib/nova/instances/instance-0000003c/disk': No such file or directory\n"


the difference seems to be that update_available_resource is called from
"/usr/lib/python2.7/site-packages/nova/manager.py", line 155, in periodic_tasks
while compute is running where the exception is caught

but on startup its called from
"/usr/lib/python2.7/site-packages/nova/service.py", line 174, in start
where is isn't caught

should both case be handled the same?


using
openstack-nova-2012.1-0.7.e4.fc17.noarch
python-nova-2012.1-0.7.e4.fc17.noarch

Comment 1 Pádraig Brady 2012-03-09 14:41:19 UTC
Nice one Derek.

Well the missing disk file could happen if the process was stopped just at that time (bug, reboot, power, ...) so nova should be more resilient in this situation.

I'll look at making that file creation transactional,
and also make the startup robust in this condition.

cheers

Comment 2 Pádraig Brady 2012-03-16 16:17:52 UTC
The trigger for this issue may have been:
https://review.openstack.org/#change,5296
Also possible are transient image creation failures addressed with:
https://review.openstack.org/#change,5442

A direct avoidance of the startup issue is proposed with:
https://review.openstack.org/5456
That logs about missing files.

Note any other exceptions in this path will
cause the service to fail. I wasn't sure whether
to change that behavior.

Comment 3 Pádraig Brady 2012-03-16 16:32:46 UTC
https://bugs.launchpad.net/nova/+bug/956719 has the same symptom,
but the cause may be different?

Comment 4 Fedora Update System 2012-03-16 23:55:04 UTC
openstack-nova-2012.1-0.8.e4.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/openstack-nova-2012.1-0.8.e4.fc17

Comment 5 Fedora Update System 2012-03-17 16:50:07 UTC
Package openstack-nova-2012.1-0.8.e4.fc17:
* should fix your issue,
* was pushed to the Fedora 17 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing openstack-nova-2012.1-0.8.e4.fc17'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2012-3964/openstack-nova-2012.1-0.8.e4.fc17
then log in and leave karma (feedback).

Comment 6 Fedora Update System 2012-03-20 16:07:38 UTC
openstack-nova-2012.1-0.8.rc1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/openstack-nova-2012.1-0.8.rc1.fc17

Comment 7 Fedora Update System 2012-03-26 09:15:36 UTC
openstack-nova-2012.1-0.9.rc1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/openstack-nova-2012.1-0.9.rc1.fc17

Comment 8 Fedora Update System 2012-03-29 19:01:44 UTC
openstack-nova-2012.1-0.10.rc1.fc17 has been submitted as an update for Fedora 17.
https://admin.fedoraproject.org/updates/openstack-nova-2012.1-0.10.rc1.fc17

Comment 9 Fedora Update System 2012-04-12 03:00:56 UTC
openstack-nova-2012.1-0.10.rc1.fc17 has been pushed to the Fedora 17 stable repository.  If problems still persist, please make note of it in this bug report.