From Bugzilla Helper:
User-Agent: Mozilla/4.0 (compatible; MSIE 5.5; Windows NT 5.0; T312461)

Description of problem:
A simple 4-thread code runs significantly slower on a 4-CPU computer than on a 1-CPU computer. Apparently a scalability problem. When the code runs CPUs are mostly in idle state (60%) while the rest is split between user and system.

Version-Release number of selected component (if applicable):

How reproducible:
Always

Steps to Reproduce:
1. Take the code
2. Compile
3. Run on a 4-CPU computer with 1 thread and 4 threads.
4. Watch CPU states
5. Compare the results (seconds per iteration)

Actual Results:
STL C++ code with standard alloc
Launching 1 threads
6.1344e-06 sec per iteration
With 4 threads
9.64e-5 sec per iteration

Expected Results:
maximum 10-20% degradation due to lock contention, not order of magnitude.

Additional info:

/////////////////
////
//// test with: c++ -pthread PTest.C -lpthread
//// actual code does nothing but fills a hash map. Memory
//// consumption is 80-100 M per thread for NUM_ITERATION=500000

#include <pthread.h>
#include <cstdlib>
#include <unistd.h>
#include <sys/time.h>
#include <map>
#include <string>
#include <utility>
#include <iostream>
#include <iomanip>
#include <bits/stl_pthread_alloc.h>

using namespace std;

const int NUM_ITERATIONS=500000;

void *f(void *arg)
{
    using std::cout;
    using std::endl;
    using std::string;

    struct timeval tv_start, tv_end;

    // typedef std::map<int,string,std::less<int>,std::pthread_allocator<std::pair<int, string> > > map_str_int;
    typedef std::map<int,string,std::less<int> > map_str_int;
    map_str_int msi;

    gettimeofday(&tv_start, NULL);
    for(int i = 0; i < NUM_ITERATIONS; i++)
        msi.insert(std::pair<int, string > (i, "value"));
    gettimeofday(&tv_end, NULL);

    cout << ( (tv_end.tv_sec - tv_start.tv_sec) +
              (tv_end.tv_usec - tv_start.tv_usec)/1.e6)/NUM_ITERATIONS
         << " sec per iteration" << endl;

    map_str_int::const_iterator fi;
    if((fi=msi.find(1491)) != msi.end())
        cout << "found 1491 " << fi->first << ';' << fi->second << endl;

    return 0;
}

main( int argc, char ** argv)
{
    int NUM_THREADS = atoi(argv[1]);
    cerr << "Launching " << NUM_THREADS << " threads" << endl;

    void* ret=0;
    pthread_t thread[NUM_THREADS];

    for(int i=0;i<NUM_THREADS;++i)
        pthread_create(&thread[i], NULL, f, (void*)i);

    for(int i=0 ; i < NUM_THREADS; ++i)
        pthread_join(thread[i], &ret);

    sleep(1);
}
Created attachment 47942 [details] simple code that demonstrates the problem
Welcome to the term "cache line bounces". At first sight, the scalability problem is in your program itself, not in the kernel. If two separate threads write to the same memory, you get the "cache line bounce" effect: basically every access to that memory becomes a cache miss, which makes things very slow.
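For readers unfamiliar with the effect described in this comment, here is a minimal sketch of cache line bouncing. It is not taken from the attached test case. It assumes C++11 and a 64-byte cache line (typical on x86, but not guaranteed), and the names PackedCounter, PaddedCounter and count_in_parallel are hypothetical. Each thread increments only its own counter; the two layouts differ only in whether neighbouring counters can share a cache line.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines. Adjust for other hardware.
constexpr std::size_t kAssumedCacheLine = 64;

// Neighbouring counters sit next to each other, so several of them
// can land in the same cache line.
struct PackedCounter {
    std::atomic<long> value{0};
};

// Padding keeps neighbouring counters at least one assumed cache
// line apart, so two counters never share a line.
struct PaddedCounter {
    std::atomic<long> value{0};
    char pad[kAssumedCacheLine - sizeof(std::atomic<long>)];
};

struct CountResult {
    double sec_per_increment;  // wall-clock time divided by increments
    long total;                // sum of all per-thread counters
};

// Each thread increments its own counter `increments` times.
template <typename Counter>
CountResult count_in_parallel(int num_threads, long increments) {
    assert(num_threads > 0 && increments > 0);
    std::vector<Counter> counters(num_threads);
    std::vector<std::thread> workers;

    auto start = std::chrono::steady_clock::now();
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, increments] {
            for (long i = 0; i < increments; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers) w.join();
    auto end = std::chrono::steady_clock::now();

    long total = 0;
    for (auto& c : counters) total += c.value.load();

    std::chrono::duration<double> elapsed = end - start;
    return CountResult{elapsed.count() / increments, total};
}
```

Calling count_in_parallel<PackedCounter>(4, 10000000) and count_in_parallel<PaddedCounter>(4, 10000000) on a multi-CPU machine and comparing sec_per_increment gives a rough idea of how much sharing a cache line costs on that hardware. The size of the difference depends on the machine.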
The program runs faster on a 2-CPU machine than on a 4-CPU machine. The program allocates memory for different objects; it does not write to "the same memory", whatever that is supposed to mean. The behaviour is basically the same with the "per thread" allocator as with the standard alloc allocator.
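To compare runs like the ones described here on a current toolchain, the measurement can be restated with std::thread and std::chrono. The following is a sketch only, assuming C++11. The names InsertTiming and time_map_inserts are hypothetical and it is not the attached test case. Each thread fills its own std::map and reports its own time per insert. It does not identify the cause of the slowdown; it only makes the 1-thread and N-thread figures easy to collect.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <map>
#include <string>
#include <thread>
#include <utility>
#include <vector>

struct InsertTiming {
    double sec_per_iteration;  // time for this thread's loop / iterations
    std::size_t entries;       // final size of this thread's map
};

// Every thread builds a private map, as in the original report, so no
// container is shared between threads. Only the result slots are
// written once per thread at the end.
std::vector<InsertTiming> time_map_inserts(int num_threads, int iterations) {
    assert(num_threads > 0 && iterations > 0);
    std::vector<InsertTiming> results(num_threads);
    std::vector<std::thread> workers;

    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&results, t, iterations] {
            std::map<int, std::string> msi;
            auto start = std::chrono::steady_clock::now();
            for (int i = 0; i < iterations; ++i)
                msi.insert(std::make_pair(i, std::string("value")));
            auto end = std::chrono::steady_clock::now();

            std::chrono::duration<double> elapsed = end - start;
            results[t] = InsertTiming{elapsed.count() / iterations, msi.size()};
        });
    }
    for (auto& w : workers) w.join();
    return results;
}
```

Running time_map_inserts(1, 500000) and time_map_inserts(4, 500000) on the same machine and printing sec_per_iteration for each thread reproduces the comparison from the report. Swapping the map's allocator template argument is one way to repeat the per-thread allocator experiment.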
Thanks for the bug report. However, Red Hat no longer maintains this version of the product. Please upgrade to the latest version and open a new bug if the problem persists. The Fedora Legacy project (http://fedoralegacy.org/) maintains some older releases, and if you believe this bug is interesting to them, please report the problem in the bug tracker at: http://bugzilla.fedora.us/