Bug 2138178

Summary: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreate)
Product: Red Hat Enterprise Linux 9
Component: qemu-kvm
qemu-kvm sub component: PCI
Version: 9.2
Hardware: x86_64
OS: Linux
Status: CLOSED MIGRATED
Severity: low
Priority: unspecified
Keywords: MigratedToJIRA, Triaged
Target Milestone: rc
Reporter: Tom Tracy <ttracy>
Assignee: Virtualization Maintenance <virt-maint>
QA Contact: Yiqian Wei <yiwei>
CC: alex.williamson, chayang, clegoate, coli, jinzhao, juzhang, laine, lvivier, rsibley, virt-maint, zhguo
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2023-07-07 20:32:26 UTC

Description Tom Tracy 2022-10-27 14:08:23 UTC
Description of problem:
When adding any SR-IOV network device, the guest hangs and does not come online.

Version-Release number of selected component (if applicable):
RHEL 9.2
libvirt-8.7.0-1.el9

How reproducible:
Every time

Steps to Reproduce:
1. Stop Guest
2. Add PCI Host Device
3. Select virtual function device
4. Boot guest
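
(Editor's note, not part of the original report: the same steps can be scripted with virsh. This is only a sketch; the domain name "rhel9.2" and the VF address 0000:31:0a.0 are taken from later comments, and hostdev.xml is a hypothetical file containing the <hostdev> element shown under Additional info.)

# virsh shutdown rhel9.2
# virsh attach-device rhel9.2 hostdev.xml --config
# virsh start rhel9.2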

Actual results: The guest hangs, and the following message appears in journalctl:
Oct 27 09:55:24 perf184.perf.lab.eng.bos.redhat.com virtqemud[2860]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreate)



Expected results: The guest should boot with the SR-IOV network device.


Additional info: This happens with any passthrough network device (Intel, Mellanox).
Adding a virtio bridge works without issue. This is not seen with RHEL 9.1.

The hostdev XML looks like this, and is identical to a RHEL 9.1 guest:

<hostdev mode='subsystem' type='pci' managed='yes'>
  <source>
    <address domain='0x0000' bus='0x31' slot='0x0a' function='0x0'/>
  </source>
  <address type='pci' domain='0x0000' bus='0x0c' slot='0x00' function='0x0'/>
</hostdev>
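
(Editor's note: a quick sanity check, not in the original report, to confirm that 0000:31:0a.0 really is a VF of the X710 physical function; both sysfs paths exist for SR-IOV devices.)

# readlink /sys/bus/pci/devices/0000:31:0a.0/physfn
# ls -l /sys/bus/pci/devices/0000:31:00.1/virtfn*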

Comment 1 Yanghang Liu 2022-11-01 06:11:56 UTC
I can start a VM with a VF successfully in the following test environment:
5.14.0-178.el9.x86_64
qemu-kvm-7.1.0-3.el9.x86_64
libvirt-8.7.0-1.el9.x86_64
edk2-ovmf-20220826gitba0e0e4c6a-1.el9.noarch
seabios-bin-1.16.0-4.el9.noarch



The domain xml:
[1] Q35 + SEABIOS RHEL92
virt-install --machine=q35 --noreboot --name=rhel92 --memory=4096 --vcpus=4 --graphics type=vnc,port=5992,listen=0.0.0.0 --osinfo detect=on,require=off --check all=off --import --noautoconsole  --network bridge=switch,model=virtio,mac=52:54:00:00:92:92 --disk path=/home/images/RHEL92.qcow2,bus=virtio,cache=none,format=qcow2,io=threads,size=20 --hostdev 0000:1a:08.2

[2] Q35 + OVMF RHEL92
virt-install --machine=q35 --noreboot --name=rhel92 --memory=4096 --vcpus=4 --graphics type=vnc,port=5992,listen=0.0.0.0 --osinfo detect=on,require=off --check all=off --import --noautoconsole  --network bridge=switch,model=virtio,mac=52:54:00:00:92:92 --disk path=/home/images/RHEL92.qcow2,bus=virtio,cache=none,format=qcow2,io=threads,size=20  --boot=uefi --hostdev 0000:1a:08.2

Comment 2 Yanghang Liu 2022-11-01 06:24:57 UTC
Hi Tom,


Could you please help provide the following info?

[1] What is your host environment?
# rpm -q qemu-kvm libvirt edk2-ovmf seabios-bin

# uname -r

[2] What domain XML did you use?

# virsh dumpxml $domain

[3] What are the details of the virtual function and its physical function?

# lspci -s 0000:31:0a.0

# lspci -s 0000:31:00.0

[4] What does the libvirtd log show when you hit this issue?

# cat /var/log/libvirt/qemu/$domain.log

[5] Which libvirt tool did you use to start the domain? virsh, virt-manager, or something else?

Comment 4 Tom Tracy 2022-11-01 19:00:19 UTC
Yanghang

               A little background here: I test multiple network cards with both virtio and passthrough devices, ranging from 10Gb, 25Gb, and 100Gb NICs up to 200Gb BlueField cards. I created 5 virtio networks with no issues. When I try to add the 10Gb passthrough device, or any other network card, the guest hangs. I have done this multiple times with RHEL 8 and RHEL 9 in the past with no issues. The issue started with RHEL 9.2.

[1] What is your host environment?
# rpm -q qemu-kvm libvirt edk2-ovmf seabios-bin

qemu-kvm-7.1.0-3.el9.x86_64
libvirt-8.7.0-1.el9.x86_64
edk2-ovmf-20220826gitba0e0e4c6a-1.el9.noarch
seabios-bin-1.16.0-4.el9.noarch

# uname -r
5.14.0-182.el9.x86_64 (both host and guest)

Host grub line BOOT_IMAGE=(hd0,msdos1)/vmlinuz-5.14.0-182.el9.x86_64 root=/dev/mapper/rhel92_kvm_perf184-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M resume=/dev/mapper/rhel92_kvm_perf184-swap rd.lvm.lv=rhel92_kvm_perf184/root rd.lvm.lv=rhel92_kvm_perf184/swap intel_iommu=on
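
(Editor's note: intel_iommu=on is present above; a quick check, not from the original report, that the IOMMU is active and that the VF sits in its own group, using the group number 309 reported by lspci further down.)

# dmesg | grep -i -e DMAR -e IOMMU | head
# ls /sys/kernel/iommu_groups/309/devices/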

[2] What domain XML did you use?

# virsh dumpxml $domain - See attached file


[3] What are the details of the virtual function and its physical function?


[root@perf184 ~]# lspci -vvvs 31:00.1 - Host
31:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02)
	DeviceName: Integrated NIC 1 Port 2-1
	Subsystem: Intel Corporation Ethernet Converged Network Adapter X710
	Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0
	Interrupt: pin A routed to IRQ 16
	NUMA node: 0
	IOMMU group: 28
	Region 0: Memory at a7000000 (64-bit, prefetchable) [size=16M]
	Region 3: Memory at a9800000 (64-bit, prefetchable) [size=32K]
	Expansion ROM at a6480000 [disabled] [size=512K]
	Capabilities: [40] Power Management version 3
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
		Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
	Capabilities: [50] MSI: Enable- Count=1/1 Maskable+ 64bit+
		Address: 0000000000000000  Data: 0000
		Masking: 00000000  Pending: 00000000
	Capabilities: [70] MSI-X: Enable+ Count=129 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00001000
	Capabilities: [a0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr+ FatalErr+ UnsupReq+
			RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 512 bytes, MaxReadReq 4096 bytes
		DevSta:	CorrErr+ NonFatalErr- FatalErr- UnsupReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <16us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed 8GT/s (ok), Width x8 (ok)
			TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 65ms to 210ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [e0] Vital Product Data
		Product Name: X710 10GbE Controller
		Read-only fields:
			[V0] Vendor specific: FFV20.5.16
			[PN] Part number: YJYK1
			[MN] Manufacture ID: 1028
			[V1] Vendor specific: DSV1028VPDR.VER2.1
			[V3] Vendor specific: DTINIC
			[V4] Vendor specific: DCM1001FFFFFF2101FFFFFF
			[V5] Vendor specific: NPY2
			[V6] Vendor specific: PMT7
			[V7] Vendor specific: NMVIntel Corp
			[V8] Vendor specific: L1D0
			[RV] Reserved: checksum good, 4 byte(s) reserved
		Read/write fields:
			[Y1] System specific: CCF1
		End
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt+ UnxCmplt+ RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP+ FCP+ CmpltTO+ CmpltAbrt+ UnxCmplt- RxOF+ MalfTLP+ ECRC+ UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
		CEMsk:	RxErr+ BadTLP+ BadDLLP+ Rollover+ Timeout+ AdvNonFatalErr+
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn+ ECRCChkCap+ ECRCChkEn+
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [140 v1] Device Serial Number 74-36-de-ff-ff-ca-05-68
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [160 v1] Single Root I/O Virtualization (SR-IOV)
		IOVCap:	Migration-, Interrupt Message Number: 000
		IOVCtl:	Enable+ Migration- Interrupt- MSE+ ARIHierarchy-
		IOVSta:	Migration-
		Initial VFs: 64, Total VFs: 64, Number of VFs: 4, Function Dependency Link: 01
		VF offset: 79, stride: 1, Device ID: 154c
		Supported Page Size: 00000553, System Page Size: 00000001
		Region 0: Memory at 00000000a9000000 (64-bit, prefetchable)
		Region 3: Memory at 00000000a9810000 (64-bit, prefetchable)
		VF Migration: offset: 00000000, BIR: 0
	Capabilities: [1a0 v1] Transaction Processing Hints
		Device specific mode supported
		No steering table available
	Capabilities: [1b0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: i40e
	Kernel modules: i40e


[root@perf184 ~]# lspci -vvvs 31:0a.0 - virtual
31:0a.0 Ethernet controller: Intel Corporation Ethernet Virtual Function 700 Series (rev 02)
	Subsystem: Intel Corporation Device 0000
	Control: I/O- Mem- BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	NUMA node: 0
	IOMMU group: 309
	Region 0: Memory at a9000000 (64-bit, prefetchable) [virtual] [size=64K]
	Region 3: Memory at a9810000 (64-bit, prefetchable) [virtual] [size=16K]
	Capabilities: [70] MSI-X: Enable- Count=5 Masked-
		Vector table: BAR=3 offset=00000000
		PBA: BAR=3 offset=00002000
	Capabilities: [a0] Express (v2) Endpoint, MSI 00
		DevCap:	MaxPayload 2048 bytes, PhantFunc 0, Latency L0s <512ns, L1 <64us
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0.000W
		DevCtl:	CorrErr- NonFatalErr- FatalErr- UnsupReq-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop- FLReset-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 8GT/s, Width x8, ASPM L1, Exit Latency L1 <16us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes, Disabled- CommClk-
			ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
		LnkSta:	Speed unknown (downgraded), Width x0 (downgraded)
			TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR-
			 10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
			 EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
			 FRS- TPHComp- ExtTPHComp-
			 AtomicOpsCap: 32bit- 64bit- 128bitCAS-
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- OBFF Disabled,
			 AtomicOpsCtl: ReqEn-
		LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
			 EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
			 Retimer- 2Retimers- CrosslinkRes: unsupported
	Capabilities: [100 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
		AERCap:	First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
			MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
		HeaderLog: 00000000 00000000 00000000 00000000
	Capabilities: [150 v1] Alternative Routing-ID Interpretation (ARI)
		ARICap:	MFVC- ACS-, Next Function: 0
		ARICtl:	MFVC- ACS-, Function Group: 0
	Capabilities: [1a0 v1] Transaction Processing Hints
		Device specific mode supported
		No steering table available
	Capabilities: [1d0 v1] Access Control Services
		ACSCap:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
		ACSCtl:	SrcValid- TransBlk- ReqRedir- CmpltRedir- UpstreamFwd- EgressCtrl- DirectTrans-
	Kernel driver in use: vfio-pci
	Kernel modules: iavf


# lspci -s 0000:31:00.0

[4] What does the libvirtd log show when you hit this issue?

# cat /var/log/libvirt/qemu/$domain.log
2022-11-01 14:26:54.108+0000: starting up libvirt version: 8.7.0, package: 1.el9 (Red Hat, Inc. <http://bugzilla.redhat.com/bugzilla>, 2022-09-06-06:21:04, ), qemu version: 7.1.0qemu-kvm-7.1.0-3.el9, kernel: 5.14.0-182.el9.x86_64, hostname: perf184.perf.lab.eng.bos.redhat.com
LC_ALL=C \
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin \
HOME=/var/lib/libvirt/qemu/domain-1-rhel9.2 \
XDG_DATA_HOME=/var/lib/libvirt/qemu/domain-1-rhel9.2/.local/share \
XDG_CACHE_HOME=/var/lib/libvirt/qemu/domain-1-rhel9.2/.cache \
XDG_CONFIG_HOME=/var/lib/libvirt/qemu/domain-1-rhel9.2/.config \
/usr/libexec/qemu-kvm \
-name guest=rhel9.2,debug-threads=on \
-S \
-object '{"qom-type":"secret","id":"masterKey0","format":"raw","file":"/var/lib/libvirt/qemu/domain-1-rhel9.2/master-key.aes"}' \
-machine pc-q35-rhel9.0.0,usb=off,dump-guest-core=off,memory-backend=pc.ram \
-accel kvm \
-cpu host,migratable=on \
-m 1031008 \
-object '{"qom-type":"memory-backend-ram","id":"pc.ram","size":1081090244608}' \
-overcommit mem-lock=off \
-smp 64,sockets=64,cores=1,threads=1 \
-uuid 5b562de8-5436-4817-a44e-5c79f53c80e7 \
-no-user-config \
-nodefaults \
-chardev socket,id=charmonitor,fd=29,server=on,wait=off \
-mon chardev=charmonitor,id=monitor,mode=control \
-rtc base=utc,driftfix=slew \
-global kvm-pit.lost_tick_policy=delay \
-no-hpet \
-no-shutdown \
-global ICH9-LPC.disable_s3=1 \
-global ICH9-LPC.disable_s4=1 \
-boot strict=on \
-device '{"driver":"pcie-root-port","port":16,"chassis":1,"id":"pci.1","bus":"pcie.0","multifunction":true,"addr":"0x2"}' \
-device '{"driver":"pcie-root-port","port":17,"chassis":2,"id":"pci.2","bus":"pcie.0","addr":"0x2.0x1"}' \
-device '{"driver":"pcie-root-port","port":18,"chassis":3,"id":"pci.3","bus":"pcie.0","addr":"0x2.0x2"}' \
-device '{"driver":"pcie-root-port","port":19,"chassis":4,"id":"pci.4","bus":"pcie.0","addr":"0x2.0x3"}' \
-device '{"driver":"pcie-root-port","port":20,"chassis":5,"id":"pci.5","bus":"pcie.0","addr":"0x2.0x4"}' \
-device '{"driver":"pcie-root-port","port":21,"chassis":6,"id":"pci.6","bus":"pcie.0","addr":"0x2.0x5"}' \
-device '{"driver":"pcie-root-port","port":22,"chassis":7,"id":"pci.7","bus":"pcie.0","addr":"0x2.0x6"}' \
-device '{"driver":"pcie-root-port","port":23,"chassis":8,"id":"pci.8","bus":"pcie.0","addr":"0x2.0x7"}' \
-device '{"driver":"pcie-root-port","port":24,"chassis":9,"id":"pci.9","bus":"pcie.0","multifunction":true,"addr":"0x3"}' \
-device '{"driver":"pcie-root-port","port":25,"chassis":10,"id":"pci.10","bus":"pcie.0","addr":"0x3.0x1"}' \
-device '{"driver":"pcie-root-port","port":26,"chassis":11,"id":"pci.11","bus":"pcie.0","addr":"0x3.0x2"}' \
-device '{"driver":"pcie-root-port","port":27,"chassis":12,"id":"pci.12","bus":"pcie.0","addr":"0x3.0x3"}' \
-device '{"driver":"pcie-root-port","port":28,"chassis":13,"id":"pci.13","bus":"pcie.0","addr":"0x3.0x4"}' \
-device '{"driver":"pcie-root-port","port":29,"chassis":14,"id":"pci.14","bus":"pcie.0","addr":"0x3.0x5"}' \
-device '{"driver":"pcie-root-port","port":30,"chassis":15,"id":"pci.15","bus":"pcie.0","addr":"0x3.0x6"}' \
-device '{"driver":"qemu-xhci","p2":15,"p3":15,"id":"usb","bus":"pci.2","addr":"0x0"}' \
-device '{"driver":"virtio-serial-pci","id":"virtio-serial0","bus":"pci.3","addr":"0x0"}' \
-blockdev '{"driver":"file","filename":"/var/lib/libvirt/images/rhel9.2.qcow2","node-name":"libvirt-1-storage","auto-read-only":true,"discard":"unmap"}' \
-blockdev '{"node-name":"libvirt-1-format","read-only":false,"discard":"unmap","driver":"qcow2","file":"libvirt-1-storage","backing":null}' \
-device '{"driver":"virtio-blk-pci","bus":"pci.4","addr":"0x0","drive":"libvirt-1-format","id":"virtio-disk0","bootindex":1}' \
-netdev tap,fd=30,vhost=on,vhostfd=32,id=hostnet0 \
-device '{"driver":"virtio-net-pci","netdev":"hostnet0","id":"net0","mac":"52:54:00:c7:61:48","bootindex":2,"bus":"pci.1","addr":"0x0"}' \
-netdev tap,fds=33:34:35:36,vhost=on,vhostfds=37:38:39:40,id=hostnet1 \
-device '{"driver":"virtio-net-pci","mq":true,"vectors":10,"netdev":"hostnet1","id":"net1","mac":"52:54:00:e6:37:de","bus":"pci.7","addr":"0x0"}' \
-netdev tap,fds=41:42:43:44:45:46:47:48,vhost=on,vhostfds=49:50:51:52:53:54:55:56,id=hostnet2 \
-device '{"driver":"virtio-net-pci","mq":true,"vectors":18,"netdev":"hostnet2","id":"net2","mac":"52:54:00:fa:b2:03","bus":"pci.8","addr":"0x0"}' \
-netdev tap,fds=57:58:59:60:61:62:63:64:65:66:67:68:69:70:71:72:73:74:75:76:77:78:79:80:81:82:83:84:85:86:87:88:89:90:91:92:93:94:95:96:97:98:99:100:101:102:103:104:105:106:107:108:109:110:111:112:113:114:115:116:117:118:119:120,vhost=on,vhostfds=121:122:123:124:125:126:127:128:129:130:131:132:133:134:135:136:137:138:139:140:141:142:143:144:145:146:147:148:149:150:151:152:153:154:155:156:157:158:159:160:161:162:163:164:165:166:167:168:169:170:171:172:173:174:175:176:177:178:179:180:181:182:183:184,id=hostnet3 \
-device '{"driver":"virtio-net-pci","mq":true,"vectors":130,"netdev":"hostnet3","id":"net3","mac":"52:54:00:b6:ee:17","bus":"pci.9","addr":"0x0"}' \
-netdev tap,fds=185:186:187:188:189:190:191:192:193:194:195:196:197:198:199:200:201:202:203:204:205:206:207:208:209:210:211:212:213:214:215:216:217:218:219:220:221:222:223:224:225:226:227:228:229:230:231:232:233:234:235:236:237:238:239:240:241:242:243:244:245:246:247:248,vhost=on,vhostfds=249:250:251:252:253:254:255:256:257:258:259:260:261:262:263:264:265:266:267:268:269:270:271:272:273:274:275:276:277:278:279:280:281:282:283:284:285:286:287:288:289:290:291:292:293:294:295:296:297:298:299:300:301:302:303:304:305:306:307:308:309:310:311:312,id=hostnet4 \
-device '{"driver":"virtio-net-pci","mq":true,"vectors":130,"netdev":"hostnet4","id":"net4","mac":"52:54:00:27:c2:e4","bus":"pci.10","addr":"0x0"}' \
-netdev tap,fds=313:314:315:316:317:318:319:320:321:322:323:324:325:326:327:328:329:330:331:332:333:334:335:336:337:338:339:340:341:342:343:344:345:346:347:348:349:350:351:352:353:354:355:356:357:358:359:360:361:362:363:364:365:366:367:368:369:370:371:372:373:374:375:376,vhost=on,vhostfds=377:378:379:380:381:382:383:384:385:386:387:388:389:390:391:392:393:394:395:396:397:398:399:400:401:402:403:404:405:406:407:408:409:410:411:412:413:414:415:416:417:418:419:420:421:422:423:424:425:426:427:428:429:430:431:432:433:434:435:436:437:438:439:440,id=hostnet5 \
-device '{"driver":"virtio-net-pci","mq":true,"vectors":130,"netdev":"hostnet5","id":"net5","mac":"52:54:00:1d:a2:d2","bus":"pci.11","addr":"0x0"}' \
-chardev pty,id=charserial0 \
-device '{"driver":"isa-serial","chardev":"charserial0","id":"serial0","index":0}' \
-chardev socket,id=charchannel0,fd=28,server=on,wait=off \
-device '{"driver":"virtserialport","bus":"virtio-serial0.0","nr":1,"chardev":"charchannel0","id":"channel0","name":"org.qemu.guest_agent.0"}' \
-device '{"driver":"usb-tablet","id":"input0","bus":"usb.0","port":"1"}' \
-audiodev '{"id":"audio1","driver":"none"}' \
-vnc 127.0.0.1:0,audiodev=audio1 \
-device '{"driver":"virtio-vga","id":"video0","max_outputs":1,"bus":"pcie.0","addr":"0x1"}' \
-device '{"driver":"vfio-pci","host":"0000:31:0a.0","id":"hostdev0","bus":"pci.12","addr":"0x0"}' \
-device '{"driver":"virtio-balloon-pci","id":"balloon0","bus":"pci.5","addr":"0x0"}' \
-object '{"qom-type":"rng-random","id":"objrng0","filename":"/dev/urandom"}' \
-device '{"driver":"virtio-rng-pci","rng":"objrng0","id":"rng0","bus":"pci.6","addr":"0x0"}' \
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny \
-msg timestamp=on
char device redirected to /dev/pts/2 (label charserial0)



[5] Which libvirt tool did you use to start the domain? virsh, virt-manager, or something else?

Both virt-manager and the virsh command - the guest hangs either way.

To clear the issue I have to reboot the host.

Tom

Comment 5 Yanghang Liu 2022-11-02 03:31:18 UTC
Hi Tom,

I used the domain XML you provided and defined a similar domain (I had to update some of the domain configuration due to hardware limits), but I can still start the domain successfully.


Could you please apply the following domain configuration changes and test again to see whether the bug still exists in your environment?

[1] Reduce the domain memory to a smaller size, such as 4G.

[2] Use a recommended SMP vCPU count supported by KVM, such as 24 or fewer.

[3] Remove some optional domain devices, such as the 5 virtio-net-pci devices.


And if possible, please attach a screenshot of the guest hang as well.

:) I think these will help us locate the root cause.
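
(Editor's sketch, not part of the original comment: one way to apply the three changes above offline with virsh, assuming the domain is named rhel9.2, that the extra interfaces are of type 'bridge', and using one of the virtio-net MAC addresses from the QEMU command line above as an example.)

# virsh setmem rhel9.2 4G --config
# virsh setmaxmem rhel9.2 4G --config
# virsh setvcpus rhel9.2 24 --config
# virsh setvcpus rhel9.2 24 --config --maximum
# virsh detach-interface rhel9.2 bridge --mac 52:54:00:e6:37:de --config
# virsh start rhel9.2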

Comment 6 Tom Tracy 2022-11-02 14:26:50 UTC
Yanghang

Experiment #1

Reduced guest vCPUs to 24 and memory to 4GB

Removed virtio Networks

Only SRIOV device (10Gb card) 

The guest boots.

I knew the above config would boot; the issue is that having both a virtio network AND a passthrough network causes the hang. I tried that before filing the BZ.

Experiment #2

With the passthrough network working:

Add back all virtio network devices to the above configuration.

The guest boots with 5 virtio networks and 1 passthrough device, with 24 vCPUs / 4GB memory.

Experiment #3

Add remaining passthrough devices needed for testing

Intel E810  25Gb
Intel E810 100Gb
Mellanox ConnectX6 200Gb

Guest boots with all required / desired network devices

Experiment #4

Increase CPU count from 24 to 64

Guest boots with all required / desired network devices

Experiment # 5

Increase memory from 4GB to 512GB

Guest boots with all required / desired network devices

Experiment # 6

Increase memory from 512GB to 640GB

Guest boots

Experiment # 7

Increase memory from 640GB to 750GB - Boots


Note: when adding memory past 256GB you get the following messages, but the guest still boots until you get close to the host's total memory, at which point the guest will not boot.

Nov 02 10:08:22 perf184.perf.lab.eng.bos.redhat.com virtqemud[2806]: Cannot start job (query, none, none) in API remoteDispatchConnectGetAllDomainStats for domain rhel9.2; current job is (async nested, none, start) owned by (2827 remoteDispatchDomainCreate, 0 <null>, 2827 remoteDispatchDomainCreate (flags=0x0)) for (30s, 0s, 31s)
Nov 02 10:08:22 perf184.perf.lab.eng.bos.redhat.com virtqemud[2806]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreate)


So this is definitely a memory allocation issue. I can start testing with 256GB of guest memory; I do not want to see the above errors.

Note that I used the same memory configuration with RHEL 9.1 and never had this issue. We do not test migration, so I am not worried about that.

This is a Dell R750 (Ice Lake) with 1TB of total system memory.



When the guest hangs on power-up, you see nothing on the virt-manager console;

all you see is the guest in the paused state:

[root@perf184 ~]# virsh list --all
 Id   Name                  State
--------------------------------------
 1    rhel9.2               paused
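
(Editor's note, not from the original comment: when the guest sits in the paused state like this, the reason can be queried directly; a small sketch assuming the domain name above.)

# virsh domstate rhel9.2 --reason
# journalctl -u virtqemud --since -10min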

Comment 8 Alex Williamson 2022-11-07 22:11:53 UTC
(In reply to Tom Tracy from comment #6)
> This is a Dell R750 IceLake with 1Tb total system memory

  <memory unit='KiB'>1055752192</memory>

This is over 98% of the total host memory!  Frankly I'm surprised 9.1 could handle this.  The host is more than likely hanging because it's thrashing trying to swap and OOM kill whatever it can.  How close was 9.1 to reaching this same condition, another couple GB?  Increased memory size anywhere in the kernel could have resulted in this.

Device assignment requires pinning all of VM RAM into host memory.  When we're cutting the host this thin on memory, the size of page structures, IOMMU page tables, and perhaps even memory fragmentation can come into play.

Does comparing /proc/meminfo on the host between 9.1 and 9.2 suggest a significant difference in memory?
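
(Editor's note: a minimal version of the comparison suggested above, to be run on both the 9.1 and 9.2 hosts before starting the guest; the field list is only a suggestion.)

# grep -E 'MemTotal|MemFree|MemAvailable|HugePages_Total|Hugetlb|Slab' /proc/meminfo
# free -g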

Comment 9 Tom Tracy 2022-11-09 20:20:24 UTC
Zhiyi

          I just built 2 more guests on Cascade Lake systems using the latest libvirt rpms you pointed to. I can start the guests with virtio and SR-IOV networks, but if I use more than 1/4 of total memory I still see these errors:

ov 09 13:13:44 perf2.perf.lab.eng.bos.redhat.com virtqemud[2984]: Cannot start job (query, none, none) in API remoteDispatchConnectGetAllDomainStats for domain rhel9.2; current job is (async nested, none, start) owned by (2988 remoteDispatchDomainCreate, 0 <null>, 2988 remoteDispatchDomainCreate (flags=0x0)) for (60s, 0s, 60s)
Nov 09 13:13:44 perf2.perf.lab.eng.bos.redhat.com virtqemud[2984]: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainCreate)

Note that I only see these messages when adding an SR-IOV network. With only virtio networks, you do not see the message above.

Alex

       Yes, you are absolutely correct; I miscalculated the total memory on the Ice Lake system. But what concerns me is the messages above. Thanks for pointing that out.

Tom

Comment 10 Alex Williamson 2022-11-10 17:52:51 UTC
(In reply to Tom Tracy from comment #9)
> Zhiyi
> 
>           Just built 2 more guests on Cascade lake systems using the latest
> libvirt rpms you pointed to. I can start the guest with virtio and SRIOV
> networks but if I use more than 1/4 total memory still see theses errors
> 
> ov 09 13:13:44 perf2.perf.lab.eng.bos.redhat.com virtqemud[2984]: Cannot
> start job (query, none, none) in API remoteDispatchConnectGetAllDomainStats
> for domain rhel9.2; current job is (async nested, none, start) owned by
> (2988 remoteDispatchDomainCreate, 0 <null>, 2988 remoteDispatchDomainCreate
> (flags=0x0)) for (60s, 0s, 60s)
> Nov 09 13:13:44 perf2.perf.lab.eng.bos.redhat.com virtqemud[2984]: Timed out
> during operation: cannot acquire state change lock (held by
> monitor=remoteDispatchDomainCreate)
> 
> Note I only see these messages when adding an sriov network. Just having
> virtio network, you do not see the message above

Pinning memory is not instantaneous; it can certainly take more than 60s for a large VM. Is this part a regression? If this is a new watchdog in libvirt, maybe the timeout is not reasonable for such large VMs with assigned devices.

Comment 11 Laurent Vivier 2022-11-16 20:24:07 UTC
Setting subcomponent to PCI rather than Networking as it seems to be more appropriate.

Comment 12 Alex Williamson 2022-11-16 21:22:54 UTC
Hey Laine,

The below error is seen when starting VMs with an assigned device when the VM size is more than ~256GB. Pinning VM memory at QEMU startup is an intensive process that scales linearly with VM RAM size, so I suspect we're exceeding a timeout in libvirt. If so, is there anything we can do about this, such as scaling the timeout based on VM size? Is there any residual fallout from these errors relative to libvirt managing the VM, or is this essentially self-correcting once QEMU becomes responsive? Is any of this timeout behavior new to libvirt? Unfortunately it's very difficult to characterize vfio page pinning throughput, as it depends on factors such as hugepages/THP, memory fragmentation, and of course the processor performance and system load. Thanks

> ov 09 13:13:44 perf2.perf.lab.eng.bos.redhat.com virtqemud[2984]: Cannot
> start job (query, none, none) in API remoteDispatchConnectGetAllDomainStats
> for domain rhel9.2; current job is (async nested, none, start) owned by
> (2988 remoteDispatchDomainCreate, 0 <null>, 2988 remoteDispatchDomainCreate
> (flags=0x0)) for (60s, 0s, 60s)
> Nov 09 13:13:44 perf2.perf.lab.eng.bos.redhat.com virtqemud[2984]: Timed out
> during operation: cannot acquire state change lock (held by
> monitor=remoteDispatchDomainCreate)
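
(Editor's note, not from the original comment: since vfio pins, and therefore faults in, all guest RAM at startup, one rough way to watch the pinning progress described above is to track the QEMU process's resident size while the DomainCreate job runs; a sketch assuming a single qemu-kvm process on the host.)

# watch -n 5 'grep VmRSS /proc/$(pidof qemu-kvm)/status'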

Comment 13 Laine Stump 2022-11-17 15:14:08 UTC
The timeout for that lock has been set to 30 seconds since the timeout was added in commit 9b6efcfef in November 2009. If this worked previously, then I'd guess something is making the operation take longer than it previously did.

I think it would be better not to lengthen the timeout just based on VM size; if it's only needed in the case of device assignment, then we should only allow the longer timeout in those cases. (This would make the code more cluttered, but it would help eliminate the frustration of waiting in cases where it was going to time out anyway.) But if it previously worked for the same configuration (can we verify this?), maybe that's an indication that something is broken, rather than that we need to increase the timeout (or maybe both :-))

Comment 14 Laine Stump 2022-11-17 16:41:51 UTC
Looking at this in more detail, it is only a stats request that's timing out while attempting to get the lock that is currently held by DomainCreate. Is the domain creation actually failing, or is it succeeding but you see these error messages about virConnectGetAllDomainStats? If the latter, then it's likely that whoever is requesting the stats (virt-manager?) silently ignores that error and recovers the next time it requests the stats. In that case, lengthening timeouts while a DomainCreate is in progress is probably not the right thing to do, as it just causes the application requesting stats to hang for a longer period (actually, that application might want a *shorter* timeout so it could avoid becoming unresponsive during these times).

Comment 15 Alex Williamson 2022-11-17 22:24:20 UTC
(In reply to Laine Stump from comment #14)
> Looking at this in more detail, it is only a stats request that's timing out
> while attempting to get the lock that is currently held by DomainCreate. Is
> the domain creation actually failing, or is it succeeding, but you see these
> error messages about virConnectGetAllDomainStats?? If the latter, then it's
> likely that whoever is requesting the stats (virt-manager?) silently ignores
> that error and recovers the next time it requests the stats. In that case
> lengthening timeouts when a DomainCreate is in progress is probably not the
> right thing to do, as it just causes the application requesting stats to
> hang for a longer period (actually that application might want a *shorter*
> timeout so it could avoiding becoming unresponsive during these times.

Thanks, Laine.

Tom, can you please verify whether this is a regression, and in fact whether there's any misbehavior beyond unexpected warnings in the logs?

If this bug is only now describing a nuisance issue, no longer related to the bug title, please update the title, description, and severity, and/or close as appropriate.  Thanks.

Comment 16 Tom Tracy 2022-11-21 16:47:36 UTC
Alex 

        I would like to keep this ticket open, as other team members create large VMs using all of the host's resources; they just haven't gotten to RHEL 9.2 yet. I never saw these messages with RHEL 9.1 before.
Tom

Comment 18 Yanghang Liu 2022-12-17 12:45:10 UTC
 (In reply to Alex Williamson from comment #15)
> (In reply to Laine Stump from comment #14)
> > Looking at this in more detail, it is only a stats request that's timing out
> > while attempting to get the lock that is currently held by DomainCreate. Is
> > the domain creation actually failing, or is it succeeding, but you see these
> > error messages about virConnectGetAllDomainStats?? If the latter, then it's
> > likely that whoever is requesting the stats (virt-manager?) silently ignores
> > that error and recovers the next time it requests the stats. In that case
> > lengthening timeouts when a DomainCreate is in progress is probably not the
> > right thing to do, as it just causes the application requesting stats to
> > hang for a longer period (actually that application might want a *shorter*
> > timeout so it could avoiding becoming unresponsive during these times.
> 
> Thanks, Laine.
> 
> Tom, can you please verify whether this is a regression, and in fact whether
> there's any misbehavior beyond unexpected warnings in the logs?
> 
> If this bug is only now describing a nuisance issue, no longer related to
> the bug title, please update the title, description, and severity, and/or
> close as appropriate.  Thanks.

Considering that this bug is mainly related to starting a domain with large memory, I am resetting the QA Contact to the default first.