Bug 1304321 - Most of the instances fail to boot. Need RCA
Summary: Most of the instances fail to boot. Need RCA
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-cinder
Version: 6.0 (Juno)
Hardware: All
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assignee: Jon Bernard
QA Contact: nlevinki
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-02-03 10:06 UTC by Jaison Raju
Modified: 2019-10-10 11:05 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-02-25 13:53:06 UTC
Target Upstream Version:
Embargoed:


Attachments
console (49.75 KB, image/png)
2016-02-03 10:18 UTC, Jaison Raju
console (53.79 KB, image/png)
2016-02-03 10:19 UTC, Jaison Raju
For the attachment to comment 26 (40.76 KB, application/x-gzip)
2016-02-11 03:39 UTC, Faiaz Ahmed

Description Jaison Raju 2016-02-03 10:06:08 UTC
Description of problem:
The customer had a RabbitMQ partition issue, which was resolved by a restart.
The customer uses an EMC storage backend.
All instances (Windows and Linux) fail to boot with the following errors:

RHEL:

~~~
SeaBIOS (version seabios-1.7.5-8.el7)
Machine UUID 4795a173-f718-ab4e-bc34-9a6b62a8c38b


iPXE (http://ipxe.org) 00:03.0 C900 PCI2.10 PnP PMM+BFF97C60+BFEF7C60 C900




Booting from Hard Disk...
.                                                         <================ This "." may indicate that content is being read from disk.
error: not a correct XFS inode.
error: file '/grub2/i386-pc/normal.mod' not found
Entering rescue mode...
grub rescue>
~~~

Windows:

~~~
Windows failed to start. A recent hardware or software change might be the cause. To fix the problem:

  1. Insert your Windows installation disc and restart your computer
  2. Choose your language settings, and then click "Next."
  3. Click "Repair your computer."

If you do not have this disc, contact your system administrator or computer manufacturer for assistance.

   File: \Boot\BCD
   Status: 0xc000014c                                      <========== This may indicate that the disk or the boot files are damaged.
   Info: The Boot Configuration Data for your PC is missing or contains errors.
~~~

Version-Release number of selected component (if applicable):
RHOS6
openstack-cinder-2014.2.3-3.1.el7ost.noarch
python-cinder-2014.2.3-3.1.el7ost.noarch
openstack-nova-common-2014.2.3-9.el7ost.noarch
openstack-nova-compute-2014.2.3-9.el7ost.noarch
python-nova-2014.2.3-9.el7ost.noarch

How reproducible:
Always, on the customer's end.

Steps to Reproduce:
1.
2.
3.

Actual results:
Instance boot fails.

Expected results:
Instance boots up the OS.

Additional info:

Comment 3 Jaison Raju 2016-02-03 10:18:51 UTC
Created attachment 1120711 [details]
console

Comment 4 Jaison Raju 2016-02-03 10:19:17 UTC
Created attachment 1120712 [details]
console

Comment 16 Jon Bernard 2016-02-04 15:47:41 UTC
The only thing I can add is that the order of events is unclear (to me).  If a guest filesystem was mounted on the compute host, that could certainly cause the corruption we're seeing.  But if the corruption happened earlier, and the host mounting was an attempt to investigate and/or repair a previously corrupt filesystem, then this may not be the initial cause.

Comment 18 Sadique Puthen 2016-02-05 07:39:52 UTC
Most of their instances (both Linux and Windows) fail to boot. We need to formulate an action plan to recover them together with Engineering.

- Assess the current level of corruption and determine whether it can be recovered.
- If it can be recovered, create an action plan together with Engineering and execute it.
- I would prefer to have a Bomgar session with CEE, Engineering, and the Field team (Felix Tsang) to recover one Linux and one Windows VM. Engineering can then disengage, and GSS can do the same for the rest of the instances.

We can then concentrate on RCA as sev2.

Comment 21 Sadique Puthen 2016-02-05 10:27:12 UTC
Just heard from the SA that they are trying to rebuild these instances from images, and that the RCA is now the first priority so that we can prevent this from happening again.

Comment 26 Eric Sandeen 2016-02-05 22:02:52 UTC
Rather than a dd of the first 10 MB of the corrupted volume, please use the xfs_metadump tool to capture all metadata.  If filenames and attributes are not considered sensitive information, please use the "-o" option.

Then compress it and attach it to this bug.

Thanks,
-Eric
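
A minimal sketch of the capture-and-compress step requested in this comment, assuming Python 3 and that xfs_metadump is installed on the host where the volume is attached. The device path and file names in the usage note are hypothetical, and the helper names are illustrative only. xfs_metadump should be run against a filesystem that is unmounted or mounted read-only.

```python
import gzip
import shutil
import subprocess
from pathlib import Path


def metadump_command(device, target, keep_names=False):
    """Build the xfs_metadump argument list (does not run anything)."""
    cmd = ["xfs_metadump"]
    if keep_names:
        # -o disables obfuscation of file names and extended attributes.
        cmd.append("-o")
    cmd += [str(device), str(target)]
    return cmd


def compress(path):
    """gzip `path` into `<path>.gz` and return the new path."""
    src = Path(path)
    dst = src.with_name(src.name + ".gz")
    with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst


def capture(device, target, keep_names=False):
    """Run xfs_metadump, then compress the resulting dump file."""
    subprocess.run(metadump_command(device, target, keep_names), check=True)
    return compress(target)
```

For example, `capture("/dev/vdb", "volume.metadump", keep_names=True)` would write `volume.metadump.gz`, where `/dev/vdb` stands in for whatever device the corrupted volume is attached as.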

Comment 30 Faiaz Ahmed 2016-02-11 03:39:13 UTC
Created attachment 1123080 [details]
For the attachment to comment 26

Comment 37 Sergey Gotliv 2016-02-25 13:53:06 UTC
This is not a Cinder bug, and most probably not even an OpenStack bug. Once you discover who is running os-prober, reopen and assign to the relevant component.
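
One possible starting point for finding what runs os-prober is to search host logs for the string. The sketch below assumes Python 3; the log path in the usage note is an assumption, and whether os-prober activity is logged there depends on the host's logging configuration. The function names are illustrative only.

```python
def find_mentions(lines, needle="os-prober"):
    """Return (line_number, text) pairs for lines containing `needle`."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if needle in line:
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_file(path, needle="os-prober"):
    """Scan a text log file, tolerating undecodable bytes."""
    with open(path, errors="replace") as handle:
        return find_mentions(handle, needle)
```

For example, `scan_file("/var/log/messages")` on a compute host would list candidate lines with their line numbers; the timestamps on those lines can then be compared against the time the instances stopped booting.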

