60910 – Scalability: 4-CPU performance degradation

Summary: Scalability: 4-CPU performance degradation

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Red Hat Linux
Classification:	Retired
Component:	kernel
Sub Component:
Version:	7.2
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Assignee:	Arjan van de Ven
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2002-03-08 21:51 UTC by S Glukhov
Modified:	2008-08-01 16:22 UTC
CC List:	0 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2004-09-30 15:39:25 UTC
Embargoed:

Attachments
simple code that demonstrates the problem (1.47 KB, application/octet-stream) 2002-03-08 21:55 UTC, S Glukhov

Description S Glukhov 2002-03-08 21:51:44 UTC

From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; T312461)

Description of problem:
A simple 4-thread program runs significantly slower on a 4-CPU computer than on a
1-CPU computer, which looks like a scalability problem. While the program runs, the
CPUs are mostly idle (60%), with the rest of the time split between user and system.

Version-Release number of selected component (if applicable):


How reproducible:
Always

Steps to Reproduce:
1. Take the attached code (PTest.C).
2. Compile it.
3. Run it on a 4-CPU computer, once with 1 thread and once with 4 threads.
4. Watch the CPU states while it runs.
5. Compare the results (seconds per iteration).
	

Actual Results:  STL C++ code with standard alloc

Launching 1 threads
6.1344e-06 sec per iteration

With 4 threads 
9.64e-5 sec per iteration


Expected Results:  at most 10-20% degradation due to lock contention, not an order
of magnitude.
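
For scale, the two timings reported above can be compared directly. The following is a minimal sketch of that arithmetic; the helper name slowdown is hypothetical and is not part of the attached test case.

```cpp
#include <cassert>

// Hypothetical helper (not from the attached test): ratio of the observed
// per-iteration time to the single-thread baseline.
double slowdown(double baseline_sec, double observed_sec) {
  assert(baseline_sec > 0.0);  // a zero baseline would make the ratio meaningless
  return observed_sec / baseline_sec;
}
```

Called as slowdown(6.1344e-06, 9.64e-5), this gives a factor of roughly 15.7, which is what "order of magnitude" refers to here.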

Additional info:

/////////////////
////
////  test with: c++  -pthread  PTest.C  -lpthread
////  the code does nothing but fill a std::map. Memory
////  consumption is 80-100 MB per thread for NUM_ITERATIONS=500000

#include <pthread.h>
#include <cstdlib>
#include <unistd.h>
#include <sys/time.h>

#include <map>
#include <string>
#include <utility>
#include <iostream>
#include <iomanip>
#include <bits/stl_pthread_alloc.h>
using namespace  std;

const int NUM_ITERATIONS=500000;

void *f(void *arg)
{
  using std::cout;
  using std::endl;
  using std::string;
  
  struct timeval tv_start, tv_end;

  // Alternative using the per-thread allocator (commented out):
  // typedef std::map<int,string,std::less<int>,std::pthread_allocator<std::pair<int, string> > > map_str_int;
  typedef std::map<int,string,std::less<int> > map_str_int;

  map_str_int msi;
  gettimeofday(&tv_start, NULL);
  for(int i = 0; i < NUM_ITERATIONS; i++)
    msi.insert(std::pair<int, string>(i, "value"));
  gettimeofday(&tv_end, NULL);
  // Average wall-clock seconds per insert for this thread.
  cout << ((tv_end.tv_sec - tv_start.tv_sec) + (tv_end.tv_usec - tv_start.tv_usec)/1.e6)/NUM_ITERATIONS
       << " sec per iteration" << endl;
  map_str_int::const_iterator fi;
  if((fi=msi.find(1491)) != msi.end())
    cout << "found 1491 " << fi->first << ';' << fi->second << endl;
  return 0;
}



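int main(int argc, char **argv) {
  // The thread count is taken from the first command-line argument.
  if (argc < 2) {
    cerr << "usage: " << argv[0] << " NUM_THREADS" << endl;
    return 1;
  }

  int NUM_THREADS = atoi(argv[1]);
  cerr << "Launching " << NUM_THREADS << " threads" << endl;
  void* ret=0;
  pthread_t thread[NUM_THREADS];  // variable-length array: a compiler extension in C++
  for(int i=0;i<NUM_THREADS;++i)
    pthread_create(&thread[i], NULL, f, (void*)(long)i);  // f ignores its argument
  for(int i=0 ; i < NUM_THREADS; ++i)
    pthread_join(thread[i], &ret);

  sleep(1);
  return 0;
}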

Comment 1 S Glukhov 2002-03-08 21:55:17 UTC

Created attachment 47942
simple code that demonstrates the problem

Comment 2 Arjan van de Ven 2002-03-19 17:26:38 UTC

Welcome to the term "cache line bounces".
At first sight your program has a scalability problem in itself, not the kernel.
If you write to the same memory in 2 separate threads, you'll get the "cache
line bounce" effect: basically every access to it will be a cache miss, which
makes things very slow.
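
The effect described above can be shown in isolation. The sketch below is a hypothetical illustration, not taken from the attached test case, and it does not establish that the reported program suffers from this effect. Each thread increments only its own counter; the counters are either packed next to each other, so that several are likely to share a cache line, or padded to 64 bytes apart. The 64-byte line size and all names (PackedCounter, PaddedCounter, Job, bump, run_counters) are assumptions made for the example.

```cpp
#include <pthread.h>
#include <sys/time.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Assumption: 64-byte cache lines. Real hardware may differ.
struct PackedCounter { volatile long value; };                              // neighbours may share a line
struct PaddedCounter { volatile long value; char pad[64 - sizeof(long)]; }; // counters are 64 bytes apart

template <typename Counter>
struct Job { Counter* counter; long increments; };

// Thread body: repeatedly writes to this thread's own counter only.
template <typename Counter>
void* bump(void* arg) {
  Job<Counter>* job = static_cast<Job<Counter>*>(arg);
  for (long i = 0; i < job->increments; ++i)
    job->counter->value = job->counter->value + 1;
  return 0;
}

// Runs num_threads threads, stores the elapsed wall-clock time in *seconds,
// and returns the sum of all counters.
template <typename Counter>
long run_counters(int num_threads, long increments, double* seconds) {
  assert(num_threads > 0);
  std::vector<Counter> counters(num_threads);
  std::vector<Job<Counter> > jobs(num_threads);
  std::vector<pthread_t> threads(num_threads);
  for (int t = 0; t < num_threads; ++t) {
    counters[t].value = 0;
    jobs[t].counter = &counters[t];
    jobs[t].increments = increments;
  }
  timeval start, end;
  gettimeofday(&start, NULL);
  for (int t = 0; t < num_threads; ++t)
    pthread_create(&threads[t], NULL, bump<Counter>, &jobs[t]);
  for (int t = 0; t < num_threads; ++t)
    pthread_join(threads[t], NULL);
  gettimeofday(&end, NULL);
  *seconds = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1.e6;
  long total = 0;
  for (int t = 0; t < num_threads; ++t) total += counters[t].value;
  return total;
}
```

Comparing the time reported by run_counters<PackedCounter>(4, n, &s) with run_counters<PaddedCounter>(4, n, &s) on a multi-CPU machine would show whether sharing a line matters on that hardware. No thread ever reads or writes another thread's counter, so any difference comes from the memory layout alone.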

Comment 3 S Glukhov 2002-03-19 17:37:58 UTC

The program runs faster on a 2-CPU machine than on a 4-CPU machine. The program
allocates memory for different objects; it does not write to "the same memory",
whatever that is supposed to mean. The behaviour is basically the same with the "per
thread" allocator as with the standard alloc allocator.
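
One way to narrow this down would be to time the allocations on their own, without the map. The sketch below is a hypothetical diagnostic, not part of the attached test case. Each thread only allocates std::string objects and records its own average time per allocation. The names AllocJob, allocate_only and run_alloc_test are made up for the example. A result that scales the same way as the original test would point at the allocation path, but it would not prove the cause.

```cpp
#include <pthread.h>
#include <sys/time.h>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-thread record for the allocation-only test.
struct AllocJob {
  int count;                 // number of allocations to perform
  std::size_t live;          // number of objects actually allocated
  double seconds_per_alloc;  // average wall-clock time per allocation
};

// Thread body: allocate `count` strings, time only the allocation loop,
// then free everything.
void* allocate_only(void* arg) {
  AllocJob* job = static_cast<AllocJob*>(arg);
  std::vector<std::string*> held;
  held.reserve(job->count);
  timeval start, end;
  gettimeofday(&start, NULL);
  for (int i = 0; i < job->count; ++i)
    held.push_back(new std::string("value"));
  gettimeofday(&end, NULL);
  job->live = held.size();
  job->seconds_per_alloc =
      ((end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1.e6) / job->count;
  for (std::size_t i = 0; i < held.size(); ++i) delete held[i];
  return 0;
}

// Runs the allocation-only loop in num_threads threads and returns one
// record per thread.
std::vector<AllocJob> run_alloc_test(int num_threads, int count) {
  assert(num_threads > 0 && count > 0);
  std::vector<AllocJob> jobs(num_threads);
  std::vector<pthread_t> threads(num_threads);
  for (int t = 0; t < num_threads; ++t) {
    jobs[t].count = count;
    jobs[t].live = 0;
    jobs[t].seconds_per_alloc = 0.0;
    pthread_create(&threads[t], NULL, allocate_only, &jobs[t]);
  }
  for (int t = 0; t < num_threads; ++t)
    pthread_join(threads[t], NULL);
  return jobs;
}
```

Running run_alloc_test(1, 500000) and run_alloc_test(4, 500000) on the same machine and comparing seconds_per_alloc would mirror the 1-thread and 4-thread runs of the original test.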

Comment 4 Bugzilla owner 2004-09-30 15:39:25 UTC

Thanks for the bug report. However, Red Hat no longer maintains this version of
the product. Please upgrade to the latest version and open a new bug if the problem
persists.

The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, 
and if you believe this bug is interesting to them, please report the problem in
the bug tracker at: http://bugzilla.fedora.us/
