Bug 163681

Summary:	asyncore.poll3 runs longer than timeout under heavy io load
Product:	Red Hat Enterprise Linux 3	Reporter:	Andre Schubert <andre>
Component:	python	Assignee:	Jeremy Katz <katzj>
Status:	CLOSED NOTABUG	QA Contact:	Brock Organ <borgan>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	3.0	CC:	katzj
Target Milestone:	---
Target Release:	---
Hardware:	i686
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-01-06 21:13:41 UTC	Type:	---

Description Andre Schubert 2005-07-20 08:53:05 UTC

From Bugzilla Helper:
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.6) Gecko/20050323 Galeon/1.3.20

Description of problem:
We have a production server running our network management system.
One subsystem is a script that runs every 5 minutes
and gets information from around 1000 SNMP agents.
Because nearly half of these agents are offline, we need to
get this information asynchronously.
The script getting the data is written in python and uses the asyncore.poll3
function with a timeout of 1.0 second.
The server itself is running on a software RAID-1 with 2 IDE hard disks.
The average I/O wait of the system is around 15%.
Sometimes under heavy I/O load some poll3 cycles take much more time than they should; I saw cycles running up to 20 seconds. This very often happens when the daily vacuum of a large Postgres database is running, or when other processes are writing a large amount of data to the disks.
This is really bad, since it is sometimes not possible to collect data from all agents.
It seems that the whole system freezes for several seconds. That's why I think it's not only a Python problem.

I hope I can get some help with this problem.
I can give additional information if someone needs it.

python: 2.2.3-6.1
kernel: 2.4.21-32.0.1.EL

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
To reproduce the problem I took a development machine and set up a software RAID-1, then I wrote a little Python script with debugging output that tries to connect to 200 machines which are not responding. When running this script under heavy I/O writes, I see poll3 cycles running longer than the timeout.
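
The kind of debugging output described above could be produced by timing each poll cycle against its timeout. The following is only a minimal sketch in modern Python 3, not the original script: `timed_poll`, `idle_poll`, and `SLACK_FACTOR` are hypothetical names, and `idle_poll` is a stand-in for `asyncore.poll3` that simply waits on a socket which never becomes readable.

```python
import select
import socket
import time

# Assumption: a cycle counts as an overrun when it takes more than
# 1.5x the requested timeout. The threshold is arbitrary.
SLACK_FACTOR = 1.5


def timed_poll(poll_fn, timeout):
    """Run one poll cycle and return (elapsed_seconds, overran)."""
    start = time.monotonic()
    poll_fn(timeout)
    elapsed = time.monotonic() - start
    return elapsed, elapsed > timeout * SLACK_FACTOR


# Hypothetical stand-in for asyncore.poll3: one end of a socket pair
# that never receives data, so select() always runs into the timeout.
_reader, _writer = socket.socketpair()


def idle_poll(timeout):
    select.select([_reader], [], [], timeout)


if __name__ == "__main__":
    for cycle in range(5):
        elapsed, overran = timed_poll(idle_poll, 1.0)
        flag = "OVERRUN" if overran else "ok"
        print("cycle %d: %.3f s %s" % (cycle, elapsed, flag))
```

Any work done inside `poll_fn` besides waiting, such as a handler that writes to disk, is included in the measured time, so an overrun does not by itself show that the poll call is at fault.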

Additional info:

Comment 1 Mihai Ibanescu 2005-09-30 16:40:02 UTC

Sorry it took so long to get to this bug report.

Can you please attach an strace of the process while it's under heavy I/O and
poll3 fails? I'd be curious what lower-level system call it uses.

Comment 2 Andre Schubert 2005-11-04 08:42:56 UTC

Sorry too for the late answer.

I think we have solved this problem.
After several weeks of testing we have rewritten our script
which collects the information asynchronously.

The hangs were caused by the underlying write to disk,
which was called directly after some data had arrived.

The new script first collects all the data, and only after that
is all the data written out to the disk.
Since we changed to this new implementation,
we have never seen a hang in a poll3 cycle.
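
The collect-first, write-afterwards approach might look roughly like the sketch below. It is an illustration under assumptions, not the rewritten script: `BufferedCollector`, `on_data`, and `flush` are hypothetical names, and the JSON-lines output format is chosen only for the example.

```python
import json


class BufferedCollector:
    """Keep results in memory while polling; write them out once afterwards."""

    def __init__(self):
        self._results = []

    def on_data(self, agent, payload):
        # Called from inside the poll loop: memory only, no disk I/O,
        # so a busy disk cannot stall the cycle here.
        self._results.append({"agent": agent, "payload": payload})

    def flush(self, path):
        # Called once, after the poll loop has finished.
        with open(path, "w") as fh:
            for record in self._results:
                fh.write(json.dumps(record) + "\n")
        count = len(self._results)
        self._results = []
        return count
```

The write can still block under heavy I/O load, but it now happens after polling is complete, so it no longer delays the handling of replies from other agents.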

Sorry for the false alarm.