Bug 2054781

Summary: Windows guest crash randomly: qemu-kvm: /builddir/build/BUILD/qemu-4.2.0/hw/rtc/mc146818rtc.c:201: periodic_timer_update: Assertion `lost_clock >= 0' failed
Product: Red Hat Enterprise Linux 8
Component: qemu-kvm
Sub component: General
Version: 8.2
Status: NEW ---
Severity: high
Priority: high
Reporter: Luigi Tamagnone <ltamagno>
Assignee: Kostiantyn Kostiuk <kkostiuk>
QA Contact: Yanhui Ma <yama>
Docs Contact: Parth Shah <pashah>
CC: alifshit, chayang, coli, gveitmic, jherrman, jinzhao, juzhang, kkostiuk, mst, pbonzini, qizhu, timao, virt-maint, yama, yvugenfi, zhguo
Keywords: CustomerScenariosInitiative
Target Milestone: rc
Target Release: ---
Flags: kkostiuk: needinfo? (mst)
       kkostiuk: needinfo? (pbonzini)
       pashah: needinfo-
       pashah: needinfo-
Hardware: x86_64
OS: Windows
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Clones: 2228406 (view as bug list)
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 2228406

Description Luigi Tamagnone 2022-02-15 17:00:11 UTC
Description of problem:
- Windows instances crash from time to time
- In the Windows OS, a power-loss event was logged for every crash
- The server is certified for RHOSP 16 and RHEL 8 - https://catalog.redhat.com/hardware/servers/detail/2941651
- Between March and June 2021 the environment was upgraded from OSP 13 to OSP 16, and the compute nodes received a BIOS/firmware update
- The issue occurs on Windows 2012 & 2016; most reports are on 2016

Version-Release number of selected component (if applicable):
[redhat-release] Red Hat Enterprise Linux release 8.2 (Ootpa)
[rhosp-release] Red Hat OpenStack Platform release 16.1.3 GA (Train)
- qemu-kvm and libvirtd are containerized, and this host is using:
  "url": "https://access.redhat.com/containers/#/registry.access.redhat.com/rhosp16/openstack-nova-libvirt/images/16.1.3-7.1614767861",
Which corresponds to this:
https://catalog.redhat.com/software/containers/rhosp-rhel8/openstack-nova-libvirt/5de6c2ddbed8bd164a0c1bbf?tag=16.1.3-7.1614767861&push_date=1615227731000&container-tabs=packages
So the qemu-kvm and libvirt versions are:
- qemu-kvm-4.2.0-29.module+el8.2.1+9791+7d72b149.6.x86_64
- libvirt-daemon-6.0.0-25.5.module+el8.2.1+8680+ea98947b.x86_64

How reproducible:
We did not find a way to reproduce it; it happens randomly.

Additional info:
gdb -e /usr/libexec/qemu-kvm -c ./core.qemu-kvm.107.5c1789ec0e454a61a539f2120495cc87.340182.1644135667000000
BFD: warning: /home/fdelorey/Desktop/./core.qemu-kvm.107.5c1789ec0e454a61a539f2120495cc87.340182.1644135667000000 is truncated: expected core file size >= 34764460032, found: 2147483648

MANY LINES DELETED:

Failed to read a valid object file image from memory.
Core was generated by `/usr/libexec/qemu-kvm -name guest=instance-00005xxx,debug-threads=on -S -object'.
Program terminated with signal SIGABRT, Aborted.
#0  0x00007ff84c48470f in ?? ()
[Current thread is 1 (LWP 340200)]
(gdb) bt
#0  0x00007ff84c48470f in ?? ()
Backtrace stopped: Cannot access memory at address 0x7ff83f7fd110

Comment 5 Yanhui Ma 2022-02-22 06:17:31 UTC
I tried to reproduce the issue with the following steps, but could not:

1. Installed a RHEL 8.2.1 host with the same packages and Windows 2016 guest as used by the customer:

# rpm -q qemu-kvm
qemu-kvm-4.2.0-29.module+el8.2.1+9791+7d72b149.6.x86_64
# uname -r
4.18.0-193.29.1.el8_2.x86_64

2. Booted the win2016 guest and left it running overnight:

/usr/libexec/qemu-kvm \
-name guest=instance-00005f1d,debug-threads=on \
-S \
-machine pc-i440fx-rhel7.6.0,accel=kvm,usb=off,dump-guest-core=on \
-cpu SandyBridge-IBRS,vme=on,f16c=on,rdrand=on,hypervisor=on,arat=on,xsaveopt=on,abm=on \
-m 32768 \
-overcommit mem-lock=off \
-smp 6,sockets=6,dies=1,cores=1,threads=1 \
-uuid 773b0d15-a735-43bb-82cb-fdefcad28ea3 \
-smbios 'type=1,manufacturer=Red Hat,product=OpenStack Compute,version=20.4.1-1.20200917173450.el8ost,serial=773b0d15-a735-43bb-82cb-fdefcad28ea3,uuid=773b0d15-a735-43bb-82cb-fdefcad28ea3,family=Virtual Machine' \
-no-user-config \
-nodefaults \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-boot strict=on \
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
-blockdev '{"driver":"file","filename":"/home/win2016-64-virtio.raw","aio":"native","node-name":"libvirt-2-storage","cache":{"direct":true,"no-flush":false},"auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-2-format","read-only":false,"cache":{"direct":true,"no-flush":false},"driver":"raw","file":"libvirt-2-storage"}' \
-device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=libvirt-2-format,id=virtio-disk0,bootindex=1,write-cache=on,serial=9b2c8658-4b54-409d-93eb-f934a8540ceb \
-netdev tap,id=hostnet0,vhost=on \
-device virtio-net-pci,rx_queue_size=512,host_mtu=9000,netdev=hostnet0,id=net0,mac=00:16:3e:09:55:49,bus=pci.0,addr=0x3 \
-device usb-tablet,id=input0,bus=usb.0,port=1 \
-vnc :0 \
-device cirrus-vga,id=video0,bus=pci.0,addr=0x2 \
-device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 \
-sandbox on \
-msg timestamp=on -monitor stdio

3. Then changed the guest time backwards and forwards, and rebooted the guest afterwards.

Comment 6 Yanhui Ma 2022-03-10 03:02:47 UTC
I ran all our Windows timer device test cases in the test environment from comment 5, with the same qemu command line as the customer.
I still cannot reproduce the issue.
Summary:
Finished=25, PASS=25

Here is the related code. Does anyone have suggestions on how to reproduce the bug?


 190     /*
 191      * if the periodic timer's update is due to period re-configuration,
 192      * we should count the clock since last interrupt.
 193      */
 194     if (old_period && period_change) {
 195         int64_t last_periodic_clock, next_periodic_clock;
 196
 197         next_periodic_clock = muldiv64(s->next_periodic_time,
 198                                 RTC_CLOCK_RATE, NANOSECONDS_PER_SECOND);
 199         last_periodic_clock = next_periodic_clock - old_period;
 200         lost_clock = cur_clock - last_periodic_clock;
 201         assert(lost_clock >= 0);
 202     }