Description of problem:

All instances (Windows/Linux) fail to boot with the errors shown below.

The customer faced issues with a RabbitMQ partition, which was recovered by a restart.
The customer uses an EMC storage backend.

----------------------------------------------------------------------
RHEL:
~~~
SeaBIOS (version seabios-1.7.5-8.el7)
Machine UUID 4795a173-f718-ab4e-bc34-9a6b62a8c38b
iPXE (http://ipxe.org) 00:03.0 C900 PCI2.10 PnP PMM+BFF97C60+BFEF7C60 C900
Booting from Hard Disk...
.    <================ This "." may indicate that content is being read from disk.
error: not a correct XFS inode.
error: file '/grub2/i386-pc/normal.mod' not found
Entering rescue mode...
grub rescue>
~~~

Windows:
~~~
Windows failed to start. A recent hardware or software change might be the cause. To fix the problem:

  1. Insert your Windows installation disc and restart your computer
  2. Choose your language settings, and then click "Next."
  3. Click "Repair your computer."

If you do not have this disc, contact your system administrator or computer manufacturer for assistance.

File: \Boot\BCD
Status: 0xc000014c    <========== this may indicate that the boot disk or boot files are damaged.
Info: The Boot Configuration Data for your PC is missing or contains errors.
~~~
----------------------------------------------------------------------

Version-Release number of selected component (if applicable):

RHOS6
openstack-cinder-2014.2.3-3.1.el7ost.noarch
python-cinder-2014.2.3-3.1.el7ost.noarch
openstack-nova-common-2014.2.3-9.el7ost.noarch
openstack-nova-compute-2014.2.3-9.el7ost.noarch
python-nova-2014.2.3-9.el7ost.noarch

How reproducible:

Always, on the customer's end.

Steps to Reproduce:
1.
2.
3.

Actual results:

Instance boot fails.

Expected results:

Instance boots up the OS.

Additional info:
Created attachment 1120711 [details] console
Created attachment 1120712 [details] console
The only thing I can add is that the order of events is unclear (to me). If a guest filesystem was mounted on the compute host, that could certainly cause the corruption we're seeing. But if the corruption happened earlier, and the host mounting was an attempt to investigate and/or repair a previously corrupt filesystem, then this may not be the initial cause.
Most of their instances (both Linux and Windows) fail to boot. We need to formulate an action plan to recover them together with Engineering.

- Assess the current level of corruption and understand whether it can be recovered.
- If it can be recovered, create an action plan together with Engineering and execute it.
- I would prefer to have a Bomgar session with CEE, Engineering and the Field team (Felix Tsang) to recover one Linux and one Windows VM. Engineering can then disengage and GSS can do the same for the rest of the instances. We can then concentrate on the RCA as sev2.
I just heard from the SA that they are trying to rebuild these instances from images. The RCA is now the first priority, so that we can prevent this from happening again.
Rather than a dd of the first 10 MB of the corrupted volume, please use the xfs_metadump tool to capture all metadata. If filenames and attributes are not considered sensitive information, please use the "-o" option. Then compress the dump and attach it to this bug.

Thanks,
-Eric
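For anyone collecting the dump from several volumes, the request above could be scripted roughly as follows. This is only a sketch: the device path and output file name are placeholders (not taken from this bug), the helper names are made up for illustration, and gzip is just one possible compressor. xfs_metadump is meant to be run against an unmounted (or read-only mounted) filesystem, so the helper defaults to a dry run that only returns the commands for review.

```python
import shlex
import subprocess


def build_metadump_cmd(device, dump_path, keep_names=False):
    """Return the xfs_metadump argument list for the given XFS device."""
    cmd = ["xfs_metadump"]
    if keep_names:
        # -o disables obfuscation of file names and extended attributes;
        # only use it if those are not considered sensitive.
        cmd.append("-o")
    cmd += [str(device), str(dump_path)]
    return cmd


def capture_metadata(device, dump_path, keep_names=False, dry_run=True):
    """Dump XFS metadata, then compress the dump with gzip.

    With dry_run=True (the default) nothing is executed; the commands
    are only returned as strings so they can be reviewed first.
    """
    commands = [
        build_metadump_cmd(device, dump_path, keep_names),
        ["gzip", "-9", str(dump_path)],
    ]
    if not dry_run:
        for cmd in commands:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    # Placeholder device and output path; substitute the real volume.
    for line in capture_metadata("/dev/vdb1", "/tmp/volume.metadump",
                                 keep_names=True):
        print(line)
```

Calling capture_metadata(..., dry_run=False) would actually run both commands and leave a compressed /tmp/volume.metadump.gz to attach.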
Created attachment 1123080 [details] For the attachment to comment 26
This is not a Cinder bug, and most probably not even an OpenStack bug. Once you discover who is running os-prober, reopen this bug and assign it to the relevant component.