Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1304321

Summary:

Most of the instances fail to boot. Need RCA

Product:

Red Hat OpenStack

Reporter:

Jaison Raju <jraju>

Component:

openstack-cinder

Assignee:

Jon Bernard <jobernar>

Status:

CLOSED NOTABUG

QA Contact:

nlevinki <nlevinki>

Severity:

urgent

Docs Contact:

Priority:

unspecified

Version:

6.0 (Juno)

CC:

areis, berrange, dpeacock, eharney, esandeen, fahmed, fjayalat, jobernar, jraju, knoel, kwolf, lyarwood, pbandark, sgotliv, sputhenp, swhiteho, tbarron, yeylon

Target Milestone:

---

Target Release:

8.0 (Liberty)

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2016-02-25 13:53:06 UTC

Type:

Bug

Attachments:

Description	Flags
console	none
console	none
For the attachment to comment 26	none

Description Jaison Raju 2016-02-03 10:06:08 UTC

Description of problem:
The customer had issues with a RabbitMQ partition, which was recovered by a restart.
The customer uses EMC-backed storage.
All instances (Windows and Linux) fail to boot with the following errors:

RHEL:

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
~~~
SeaBIOS (version seabios-1.7.5-8.el7)
Machine UUID 4795a173-f718-ab4e-bc34-9a6b62a8c38b


iPXE (http://ipxe.org) 00:03.0 C900 PCI2.10 PnP PMM+BFF97C60+BFEF7C60 C900




Booting from Hard Disk...
.                                                         <================ This "." may indicate that content is being read from the disk.
error: not a correct XFS inode.
error: file '/grub2/i386-pc/normal.mod' not found
Entering rescue mode...
grub rescue>
~~~

Windows:

~~~
Windows failed to start. A recent hardware or software change might be the cause. To fix the problem:

  1. Insert your Windows installation disc and restart your computer
  2. Choose your language settings, and then click "Next."
  3. Click "Repair your computer."

If you do not have this disc, contact your system administrator or computer manufacturer for assistance.

   File: \Boot\BCD
   Status: 0xc000014c                                      <========== This may imply that the boot disk or the boot files are damaged.
   Info: The Boot Configuration Data for your PC is missing or contains errors.
~~~
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Version-Release number of selected component (if applicable):
RHOS6
openstack-cinder-2014.2.3-3.1.el7ost.noarch
python-cinder-2014.2.3-3.1.el7ost.noarch
openstack-nova-common-2014.2.3-9.el7ost.noarch
openstack-nova-compute-2014.2.3-9.el7ost.noarch
python-nova-2014.2.3-9.el7ost.noarch

How reproducible:
Always, on the customer's end.

Steps to Reproduce:
1.
2.
3.

Actual results:
Instance boot fails.

Expected results:
Instance boots up the OS.

Additional info:

Comment 3 Jaison Raju 2016-02-03 10:18:51 UTC

Created attachment 1120711
console

Comment 4 Jaison Raju 2016-02-03 10:19:17 UTC

Created attachment 1120712
console

Comment 16 Jon Bernard 2016-02-04 15:47:41 UTC

The only thing I can add is that the order of events is unclear (to me).  If a guest filesystem was mounted on the compute host, that could certainly cause the corruption we're seeing.  But if the corruption happened earlier, and the host mounting was an attempt to investigate and/or repair a previously corrupt filesystem, then this may not be the initial cause.
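
One way to check the "guest filesystem mounted on the compute host" scenario is to compare the host's mount table against the block devices that back guest volumes. The following is only a minimal sketch: the function name `find_host_mounts`, the sample mount table, and the device names are hypothetical and do not come from this bug; the real device paths depend on how the storage backend is attached to the host.

```python
import re


def find_host_mounts(mounts_text, guest_devices):
    """Return (device, mountpoint) pairs from /proc/mounts-style text
    whose device is a guest volume or one of its partitions.

    guest_devices is a list of block device paths assumed to back guest
    volumes (hypothetical input; it must be gathered separately).
    """
    # Match the volume itself or a partition on it, e.g. /dev/sdb, /dev/sdb1
    # or /dev/mapper/mpatha with a "p1" partition suffix.
    patterns = [re.compile(re.escape(dev) + r"(p?\d+)?") for dev in guest_devices]
    hits = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        device, mountpoint = fields[0], fields[1]
        if any(p.fullmatch(device) for p in patterns):
            hits.append((device, mountpoint))
    return hits


# Illustrative mount table, not taken from the customer's system.
SAMPLE_MOUNTS = (
    "/dev/sda1 / xfs rw,relatime 0 0\n"
    "/dev/sdb1 /mnt/guest xfs rw,relatime 0 0\n"
    "tmpfs /run tmpfs rw,nosuid 0 0\n"
)

print(find_host_mounts(SAMPLE_MOUNTS, ["/dev/sdb"]))
```

On a live host, the text would come from reading `/proc/mounts`. A hit only shows that a volume is mounted right now; it says nothing about the order of events discussed above.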

Comment 18 Sadique Puthen 2016-02-05 07:39:52 UTC

Most of their instances (both Linux and Windows) fail to boot. We need to formulate an action plan to recover them together with Engineering.

- Assess the current level of corruption and understand if this can be recovered.
- If this can be recovered, create an action plan together with Engineering and execute.
- I would prefer to have a Bomgar session with CEE, Engineering, and the Field team (Felix Tsang) to recover one Linux and one Windows VM. Engineering can then disengage, and GSS can do the same for the rest of the instances.

We can then concentrate on RCA as sev2.

Comment 21 Sadique Puthen 2016-02-05 10:27:12 UTC

I just heard from the SA that they are trying to rebuild these instances from images. RCA is now the first priority, so that we can prevent this from happening again.

Comment 26 Eric Sandeen 2016-02-05 22:02:52 UTC

Rather than a dd of the first 10 MB of the corrupted volume, please use the xfs_metadump tool to capture all metadata.  If filenames and attributes are not considered sensitive information, please use the "-o" option.

Then compress it and attach it to this bug.

Thanks,
-Eric
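
The request above can be sketched as a small script: run xfs_metadump against the affected device, then gzip the result for attaching. This is a hedged illustration only. The helper names `metadump_command` and `capture_and_compress` and any paths passed to them are hypothetical; the only tool behaviour relied on is that xfs_metadump takes a source device and a target file, and that "-o" disables obfuscation of names. The filesystem should not be mounted read-write while the dump is taken.

```python
import gzip
import shutil
import subprocess


def metadump_command(device, target, obfuscate=True):
    """Build the xfs_metadump argument list.

    Passing obfuscate=False adds "-o", which keeps file names and
    attributes readable in the dump (only if they are not sensitive).
    """
    cmd = ["xfs_metadump"]
    if not obfuscate:
        cmd.append("-o")
    cmd += [device, target]
    return cmd


def capture_and_compress(device, target, obfuscate=True):
    """Run xfs_metadump, then gzip the dump; returns the .gz path."""
    # check=True raises CalledProcessError if xfs_metadump fails.
    subprocess.run(metadump_command(device, target, obfuscate), check=True)
    compressed = target + ".gz"
    with open(target, "rb") as src, gzip.open(compressed, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return compressed
```

For example, `capture_and_compress("/dev/sdb1", "/tmp/volume.metadump", obfuscate=False)` would produce `/tmp/volume.metadump.gz`, where both paths are placeholders for the real device and output location.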

Comment 30 Faiaz Ahmed 2016-02-11 03:39:13 UTC

Created attachment 1123080
For the attachment to comment 26

Comment 37 Sergey Gotliv 2016-02-25 13:53:06 UTC

This is not a Cinder bug, and most probably not even an OpenStack bug. Once you discover who is running os-prober, reopen this bug and assign it to the relevant component.
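
One possible starting point for finding the os-prober caller is the GRUB configuration on the hosts, since grub2-mkconfig can invoke os-prober when it is installed. That this is the path involved here is an assumption; the bug does not say who runs it. The sketch below parses the text of a GRUB defaults file (typically `/etc/default/grub`) and reports whether `GRUB_DISABLE_OS_PROBER` is set to `true`. The function name `os_prober_disabled` is hypothetical, and treating an unset variable as "not disabled" is also an assumption that may not hold for every GRUB release.

```python
def os_prober_disabled(grub_defaults_text):
    """Return True if GRUB_DISABLE_OS_PROBER is set to "true" in the
    given GRUB defaults text; unset or any other value returns False."""
    value = None
    for raw in grub_defaults_text.splitlines():
        line = raw.strip()
        if line.startswith("#") or "=" not in line:
            continue  # skip comments and lines that are not assignments
        key, _, val = line.partition("=")
        if key.strip() == "GRUB_DISABLE_OS_PROBER":
            # The last assignment wins, as it would in a shell-sourced file.
            value = val.strip().strip("\"'")
    return value == "true"


# Illustrative file content, not taken from the customer's hosts.
SAMPLE_DEFAULTS = 'GRUB_TIMEOUT=5\nGRUB_DISABLE_OS_PROBER="true"\n'
print(os_prober_disabled(SAMPLE_DEFAULTS))
```

A `False` result on a host where os-prober is installed would only suggest where to look next; it does not establish that GRUB was the caller.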