Bug 1425316
Summary: | `nova rescue` of an instance with Ceph backend fails with corrupted XFS errors | | |
---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Anil Dhingra <adhingra> |
Component: | openstack-nova | Assignee: | Lee Yarwood <lyarwood> |
Status: | CLOSED ERRATA | QA Contact: | Gabriel Szasz <gszasz> |
Severity: | urgent | Docs Contact: | |
Priority: | high | ||
Version: | 9.0 (Mitaka) | CC: | adhingra, awaugama, berrange, dasmith, dmaley, eglynn, gszasz, kchamart, lyarwood, mschuppe, rjones, sbauza, sferdjao, sgordon, skinjo, sputhenp, srevivo, vromanso |
Target Milestone: | async | Keywords: | Triaged, ZStream |
Target Release: | 9.0 (Mitaka) | ||
Hardware: | Unspecified | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | openstack-nova-13.1.2-18.el7ost | Doc Type: | If docs needed, set a value |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2017-06-19 18:30:40 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Anil Dhingra
2017-02-21 07:52:49 UTC
The instance that needs to be rescued:

```
$ nova show 286775c6-cb4c-4182-be98-7153e7fe2467
+----------------------------------------+------------------------------------------------------------------+
| Property                               | Value                                                            |
+----------------------------------------+------------------------------------------------------------------+
| OS-DCF:diskConfig                      | AUTO                                                             |
| OS-EXT-AZ:availability_zone            | nova                                                             |
| OS-EXT-SRV-ATTR:host                   | myr-eqx-sg-ocpn-06.localdomain                                   |
| OS-EXT-SRV-ATTR:hostname               | sg-eq-dpc-03                                                     |
| OS-EXT-SRV-ATTR:hypervisor_hostname    | myr-eqx-sg-ocpn-06.localdomain                                   |
| OS-EXT-SRV-ATTR:instance_name          | instance-0000006b                                                |
| OS-EXT-SRV-ATTR:kernel_id              |                                                                  |
| OS-EXT-SRV-ATTR:launch_index           | 0                                                                |
| OS-EXT-SRV-ATTR:ramdisk_id             |                                                                  |
| OS-EXT-SRV-ATTR:reservation_id         | r-mmx8iwwv                                                       |
| OS-EXT-SRV-ATTR:root_device_name       | /dev/vda                                                         |
| OS-EXT-SRV-ATTR:user_data              | -                                                                |
| OS-EXT-STS:power_state                 | 1                                                                |
| OS-EXT-STS:task_state                  | -                                                                |
| OS-EXT-STS:vm_state                    | active                                                           |
| OS-SRV-USG:launched_at                 | 2017-01-09T20:49:52.000000                                       |
| OS-SRV-USG:terminated_at               | -                                                                |
| accessIPv4                             |                                                                  |
| accessIPv6                             |                                                                  |
| config_drive                           |                                                                  |
| created                                | 2017-01-09T20:42:05Z                                             |
| description                            | SG-EQ-DPC-03                                                     |
| dns-internal-1241-provider-net network | 192.168.41.15                                                    |
| dns-mgmt-1240-provider-net network     | 192.168.40.15                                                    |
| flavor                                 | 4_vCPU_32GB_RAM_500GB_HDD (b22e7b69-e28d-4f10-af44-bd99ebb2b3af) |
| hostId                                 | 71a48e70b6f2f1002799e6e3825d999679bdb4ff80128ce80f7308f1         |
| host_status                            | UP                                                               |
| id                                     | 286775c6-cb4c-4182-be98-7153e7fe2467                             |
| image                                  | SG-EQ-DPC-03 (e89a8519-b968-41b1-9cbe-0a3d17cec2d1)              |
| key_name                               | sebastian-ssh                                                    |
| locked                                 | False                                                            |
| metadata                               | {}                                                               |
| name                                   | SG-EQ-DPC-03                                                     |
| os-extended-volumes:volumes_attached   | []                                                               |
| progress                               | 0                                                                |
| security_groups                        | whitelist-all                                                    |
| status                                 | ACTIVE                                                           |
| tenant_id                              | 97ecd21c11d14ccf857ec41ef0afa22d                                 |
| updated                                | 2017-02-17T08:29:34Z                                             |
| user_id                                | 3e289fae9b694c789b370df4b97d6e8e                                 |
+----------------------------------------+------------------------------------------------------------------+
```
The rescue was tried with an RHEL image and a Debian image:

```
$ glance image-list
+--------------------------------------+--------------------+
| ID                                   | Name               |
+--------------------------------------+--------------------+
| c321e78a-b6bb-40b5-8f54-bd8f223a2a3a | Debian-8.6.3       |
| 16e41469-1e8d-4458-a7fa-3ededf2e80bf | RHEL-7.3           |
+--------------------------------------+--------------------+

$ nova rescue --image 16e41469-1e8d-4458-a7fa-3ededf2e80bf 286775c6-cb4c-4182-be98-7153e7fe2467
```

This causes the system to boot into the original instance instead of booting a new instance from the specified image and attaching the instance disk to it as a secondary disk for repair.

```
qemu 37433 1 3 11:47 ? 00:00:22 /usr/libexec/qemu-kvm -name guest=instance-0000006b,debug-threads=on -S -object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-25-instance-0000006b/master-key.aes -machine pc-i440fx-rhel7.3.0,accel=kvm,usb=off -cpu Broadwell,+vme,+ds,+acpi,+ss,+ht,+tm,+pbe,+dtes64,+monitor,+ds_cpl,+vmx,+smx,+est,+tm2,+xtpr,+pdcm,+dca,+osxsave,+f16c,+rdrand,+arat,+tsc_adjust,+xsaveopt,+pdpe1gb,+abm,+rtm,+hle -m 32768 -realtime mlock=off -smp 4,sockets=4,cores=1,threads=1 -uuid 286775c6-cb4c-4182-be98-7153e7fe2467 -smbios type=1,manufacturer=Red Hat,product=OpenStack Nova,version=13.1.1-7.el7ost,serial=6d20bf45-0918-413b-b4bf-dcc702889837,uuid=286775c6-cb4c-4182-be98-7153e7fe2467,family=Virtual Machine -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/domain-25-instance-0000006b/monitor.sock,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc,driftfix=slew -global kvm-pit.lost_tick_policy=discard -no-hpet -no-shutdown -boot strict=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -object secret,id=virtio-disk0-secret0,data=hTuL6fGIvWXUkd8B+ouiVIsOuZ2/VZrKuQFMedM/dqo=,keyid=masterKey0,iv=lng1Sg7vZfifqkSXF6llKg==,format=base64 -drive file=rbd:vms/286775c6-cb4c-4182-be98-7153e7fe2467_disk:id=cinder:auth_supported=cephx\;none:mon_host=192.168.22.12\:6789\;192.168.22.13\:6789\;192.168.22.14\:6789,file.password-secret=virtio-disk0-secret0,format=raw,if=none,id=drive-virtio-disk0,cache=writeback,discard=unmap -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=31,id=hostnet0,vhost=on,vhostfd=33 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=fa:16:3e:45:dd:5d,bus=pci.0,addr=0x3 -netdev tap,fd=34,id=hostnet1,vhost=on,vhostfd=35 -device virtio-net-pci,netdev=hostnet1,id=net1,mac=fa:16:3e:5f:dd:ae,bus=pci.0,addr=0x4 -add-fd set=4,fd=37 -chardev file,id=charserial0,path=/dev/fdset/4,append=on -device isa-serial,chardev=charserial0,id=serial0 -chardev pty,id=charserial1 -device isa-serial,chardev=charserial1,id=serial1 -device usb-tablet,id=input0,bus=usb.0,port=1 -vnc 0.0.0.0:0 -k en-us -device cirrus-vga,id=video0,bus=pci.0,addr=0x2 -incoming defer -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 -msg timestamp=on
```

The qemu-kvm process shows only one Ceph disk mapped to the process. Does that mean the rescue just starts the VM without actually rescuing it? Is it expected to work with Ceph? We tested it with an instance using a local disk, where both disks are attached properly and the rescue works.

I'll note that errno 5 == EIO. Bad hardware?

Attaching a screenshot from the customer environment. The screenshot shows the rescue trying to mount /dev/vdb1 as the root disk instead of /dev/vda1. This looks like a bug.

We were able to reproduce it. We created an instance with a 20 GB disk. The disk is healthy. We then rescued it using "nova rescue <server>". We can see that the /dev/vdb1 (20 GB) disk is recognized as / and mounted, instead of /dev/vda1 being used as the / disk and /dev/vdb1 being left for us to repair. It has left /dev/vda1 unmounted.

I hope this would be easy to reproduce in a normal environment.
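The reproduction above can be sketched end to end. The instance and image IDs are the ones from this report; the `ps`/`grep` check at the end is an assumption about how to confirm from the compute node whether qemu-kvm attached both disks, not a command taken from the report:

```shell
# Sketch only: IDs are from this report; adjust for your environment.
SERVER=286775c6-cb4c-4182-be98-7153e7fe2467
RESCUE_IMAGE=16e41469-1e8d-4458-a7fa-3ededf2e80bf   # RHEL-7.3

nova rescue --image "$RESCUE_IMAGE" "$SERVER"

# On the compute node: a correct rescue of an RBD-backed instance should
# show two rbd drives on the qemu-kvm process, <uuid>_disk.rescue (the
# rescue boot disk) and <uuid>_disk (the disk being repaired). In this
# bug only one rbd drive appears.
ps -ef | grep "[q]emu-kvm" | grep -o "rbd:vms/${SERVER}[^:,]*"
```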
We were, however, not able to reproduce it when we used a different image with --image. To me it looks like both vda1 and vdb1 have the same UUID, and the rescue instance picks the wrong / partition to mount. How can we avoid it?

We can reliably reproduce it both with --image and without it. In most cases it detects /dev/vdb1 as the / disk and uses that. This may be because of the UUID; how can we work around it?

(In reply to Sadique Puthen from comment #15)
> Attaching a screenshot from customer environment. The screenshot shows the
> rescue is trying to mount /dev/vdb1 as the root disk instead of /dev/vda1.
> This looks like a bug.
>
> We were able to reproduce it. Just created an instance with 20GB disk. The
> disk is healthy. We then rescued it using "nova rescue <server>". We can see
> that /dev/vdb1 (20 GB) disk is recognized as the / and mounted instead of
> using /dev/vda1 as / disk and leaving /dev/vdb1 for repair. It has left
> /dev/vda1 as unmounted and left us to repair.
>
> I hope this would be easy to reproduce in a normal environment. We were
> however not able to reproduce when we use a different image using --image.
> For me it looks like both vda1 and vdb1 has same uuid and it picks the wrong
> / partition to mount it.
>
> How can we avoid it?

As I said before, use the RHEL boot ISO as the rescue image; it should avoid this behaviour.

Lee,

We used the boot.iso for RHEL-7.3, but it still detects /dev/vdb1 as its / disk and fails to boot due to XFS corruption.

We investigated further and found the following. After an unrescue we ran "sudo rbd ls -l vms" and could still see disk.rescue in the vms pool, showing a size of 300 GB. So when we unrescue, the rescue image is not deleted. When we then rescue again using boot.iso, nova uses the old disk.rescue left over from the previous rescue to boot the VM. Since that leftover disk is the original image the VM was started from, both images have the same UUID for /.
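The duplicate-UUID theory can be checked from inside the rescue environment. This is a sketch, assuming a rescue shell where both disks are visible with the device names from this report:

```shell
# If the rescue disk is a copy of the original image, both root
# filesystems carry the same filesystem UUID, so root=UUID=... in the
# bootloader can resolve to either disk -- here it picked /dev/vdb1.
blkid /dev/vda1 /dev/vdb1

# Workaround sketch: bypass UUID resolution entirely. Check the damaged
# XFS filesystem while it is still unmounted, then mount it by device
# name for repair.
xfs_repair -n /dev/vdb1   # -n: no-modify mode, report problems only
mount /dev/vdb1 /mnt
```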
So the instance detects vdb1 as the root disk every time. The problem of the rescue disk not being deleted during unrescue is reported at https://bugs.launchpad.net/nova/+bug/1478199, which is the root cause of all the problems. Now we have two things to do:

1 - Urgent: delete the rescue disk manually from the Ceph vms pool, rescue again with the correct image, and recover the VM. Any suggestions on how to delete it?
2 - Fix the bug in nova that causes unrescue not to delete the rescue image.

Another upstream report: https://bugs.launchpad.net/nova/+bug/1511123

https://bugs.launchpad.net/nova/+bug/1478199 is a duplicate of https://bugs.launchpad.net/nova/+bug/1475652, which has a fix merged and backported to Newton in nova 14.0.2. We are using OSP-9. Requesting a backport.

(In reply to Sadique Puthen from comment #21)
> We are using osp-9. Requesting a backport.

ACK, nice catch, I'll do this now.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1508
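For the urgent manual cleanup discussed above, a minimal sketch, assuming the default Nova/RBD naming where the rescue disk is stored as `<instance_uuid>_disk.rescue` in the `vms` pool (as the `rbd ls -l vms` output in this report suggests); verify the exact image name before deleting anything:

```shell
UUID=286775c6-cb4c-4182-be98-7153e7fe2467    # instance UUID from this report
RESCUE_IMAGE=16e41469-1e8d-4458-a7fa-3ededf2e80bf   # RHEL-7.3 image from this report

# Confirm the leftover rescue image and its size first.
rbd ls -l vms | grep "${UUID}_disk.rescue"

# Unrescue so nothing is using the image, then remove the stale copy.
nova unrescue "$UUID"
rbd rm "vms/${UUID}_disk.rescue"

# Rescue again; nova now has to create a fresh rescue disk from the image.
nova rescue --image "$RESCUE_IMAGE" "$UUID"
```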