Bug 841275
Summary: | Facts slow to generate | |
---|---|---|---
Product: | OKD | Reporter: | Mike McGrath <mmcgrath>
Component: | Containers | Assignee: | Mrunal Patel <mpatel>
Status: | CLOSED CURRENTRELEASE | QA Contact: | libra bugs <libra-bugs>
Severity: | high | Docs Contact: |
Priority: | high | |
Version: | 2.x | CC: | jialiu, mpatel, twiest
Target Milestone: | --- | Keywords: | Triaged
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | devenv_1933+ | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2012-08-07 20:42:41 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Mike McGrath
2012-07-18 14:35:53 UTC
I just watched how long /usr/libexec/mcollective/update_yaml.rb takes to run on a few of our nodes in PROD that have a lot of gears, and on average it takes around 7 minutes. The longest I've seen it run was 15 minutes. Since it takes so long to run, I see several of them running at the same time, which exacerbates the problem.

Can we get timing information for calculating each fact by using the -t flag? This script will get the time spent calculating each fact - http://fpaste.org/oduh/
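For reference, a per-fact timing wrapper along the following lines would produce output like the runs below. This is only a sketch assuming the Facter 1.x Ruby API (Facter.loadfacts, Facter.list, Facter.value); it is not necessarily the script behind the fpaste link.

```ruby
#!/usr/bin/env ruby
# Sketch of a per-fact timing wrapper (illustrative only, assuming the
# Facter 1.x Ruby API): resolve each known fact individually and print
# how long it took, similar to `facter --timing`.
require 'facter'

Facter.loadfacts                      # load built-in and custom facts
Facter.list.each do |name|
  start = Time.now
  Facter.value(name)                  # force the fact to resolve
  elapsed_ms = (Time.now - start) * 1000.0
  printf("%s: %.2fms\n", name, elapsed_ms)
end
```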
# time facter -t
kernel: 104.50ms macaddress: 16.76ms git_repos: 0.03ms max_active_apps: 0.02ms max_apps: 0.03ms lsbdistid: 328.16ms operatingsystem: 332.13ms hardwaremodel: 49.01ms architecture: 382.23ms lib: 0.02ms ec2_local_hostname: 0.02ms selinux: 11.16ms selinux_config_mode: 28.40ms processorcount: 1.04ms memoryfree: 1.83ms cart_list: 0.03ms mountpoints: 0.14ms ipaddress: 22.77ms netmask: 34.36ms netmask_eth0: 31.10ms cfkey: 0.12ms ec2_profile: 0.03ms sshdsakey: 0.52ms embed_cart_list: 0.03ms macaddress_lo: 19.11ms processor1: 0.03ms memorysize: 1.51ms memorytotal: 1.80ms district_uuid: 0.03ms virtual: 132.11ms ec2_public_keys_0_openssh_key: 0.03ms timezone: 0.46ms ec2_ami_id: 0.04ms augeasversion: 2.35ms public_ip: 0.02ms facterversion: 0.02ms uptime_seconds: 15.22ms id: 32.68ms path: 0.05ms ec2_block_device_mapping_ami: 0.03ms hardwareisa: 23.73ms kernelrelease: 22.79ms ec2_block_device_mapping_root: 0.04ms lsbdistcodename: 196.37ms ec2_instance_type: 0.04ms lsbdistrelease: 151.37ms lsbmajdistrelease: 151.87ms selinux_policyversion: 0.43ms capacity: 0.03ms ipaddress6: 11.91ms ipaddress6_eth0: 18.98ms osfamily: 0.09ms ec2_local_ipv4: 0.03ms selinux_config_policy: 12.02ms selinux_mode: 12.43ms ipaddress_lo: 19.90ms swapsize: 1.36ms devices: 0.03ms ec2_public_hostname: 0.04ms sshecdsakey: 0.04ms sshecdsakey: 0.04ms sshecdsakey: 0.03ms sshecdsakey: 0.03ms arp_eth0: 35.94ms netmask_lo: 21.30ms ps: 0.03ms ec2_reservation_id: 0.03ms uniqueid: 26.60ms district_active: 0.05ms uptime_hours: 0.13ms uptime_days: 0.36ms ec2_ami_launch_index: 0.03ms rubysitedir: 0.13ms public_hostname: 0.02ms is_virtual: 0.05ms hostname: 37.08ms domain: 28.94ms fqdn: 66.73ms lsbrelease: 139.62ms ec2_block_device_mapping_ephemeral0: 0.04ms network_eth0: 40.81ms operatingsystemrelease: 0.17ms swapfree: 1.18ms kernelversion: 0.05ms ec2_hostname: 0.03ms interfaces: 11.90ms ec2_kernel_id: 0.03ms selinux_current_mode: 10.39ms sshfp: 54.55ms macaddress_eth0: 19.29ms ec2_placement_availability_zone: 0.04ms ipaddress6_lo: 18.81ms processor0: 0.03ms ec2_public_ipv4: 0.03ms sshrsakey: 0.18ms physicalprocessorcount: 50.81ms puppetversion: 461.74ms arp_lo: 110.10ms arp: 41.38ms ec2_security_groups: 0.04ms uptime: 0.11ms shortdomain: 28.31ms ec2_ami_manifest_path: 0.04ms rubyversion: 0.02ms node_profile: 0.03ms lsbdistdescription: 170.76ms ec2_block_device_mapping_ephemeral1: 0.04ms network_lo: 64.46ms active_capacity: 0.04ms vlans: 0.10ms selinux_enforced: 0.53ms kernelmajversion: 0.09ms ec2_instance_id: 0.03ms ipaddress_eth0: 31.31ms
[...snip...]
real 0m43.383s user 0m3.112s sys 0m33.628s

----------------------------------------------------------------

# time ./facter_timing_script
kernel: 32.24ms macaddress: 10.59ms git_repos: 0.03ms max_active_apps: 0.02ms max_apps: 0.03ms lsbdistid: 147.57ms operatingsystem: 151.02ms hardwaremodel: 26.98ms architecture: 179.28ms ipaddress6_eth0: 69.42ms selinux: 6.45ms selinux_config_mode: 13.30ms osfamily: 0.10ms swapsize: 1.52ms mountpoints: 0.03ms ec2_local_ipv4: 0.03ms processorcount: 0.39ms ipaddress_lo: 20.81ms sshecdsakey: 0.06ms sshecdsakey: 0.05ms sshecdsakey: 0.03ms sshecdsakey: 0.03ms domain: 47.05ms cart_list: 0.04ms ec2_public_hostname: 0.03ms lsbrelease: 154.11ms processor1: 0.03ms embed_cart_list: 0.03ms virtual: 69.32ms netmask_lo: 20.01ms timezone: 0.45ms ipaddress6: 16.58ms ec2_reservation_id: 0.03ms lsbdistrelease: 141.87ms lsbmajdistrelease: 142.36ms district_uuid: 0.03ms uptime_seconds: 14.51ms ec2_ami_launch_index: 0.04ms id: 41.66ms network_eth0: 41.50ms public_ip: 0.04ms path: 0.03ms kernelrelease: 22.09ms hostname: 32.40ms fqdn: 58.57ms ec2_block_device_mapping_ephemeral0: 0.06ms arp_eth0: 37.42ms operatingsystemrelease: 0.22ms swapfree: 1.22ms kernelversion: 0.07ms arp: 88.25ms ec2_hostname: 0.04ms interfaces: 13.81ms selinux_policyversion: 0.38ms ec2_kernel_id: 0.03ms memorysize: 1.16ms capacity: 0.04ms macaddress_eth0: 31.49ms selinux_config_policy: 23.39ms selinux_mode: 23.81ms ec2_placement_availability_zone: 0.04ms lib: 0.02ms devices: 0.03ms ipaddress6_lo: 20.05ms sshrsakey: 0.54ms physicalprocessorcount: 74.32ms ec2_public_ipv4: 0.06ms lsbdistdescription: 195.58ms ps: 0.03ms cfkey: 0.11ms uniqueid: 23.94ms district_active: 0.04ms uptime_hours: 0.12ms uptime_days: 0.34ms ec2_security_groups: 0.03ms rubysitedir: 0.15ms is_virtual: 0.05ms ec2_ami_manifest_path: 0.03ms network_lo: 49.35ms public_hostname: 0.03ms ec2_block_device_mapping_ephemeral1: 0.03ms arp_lo: 40.60ms ec2_instance_id: 0.04ms kernelmajversion: 0.13ms ipaddress_eth0: 21.18ms selinux_current_mode: 8.86ms ec2_local_hostname: 0.03ms memoryfree: 1.21ms sshfp: 51.11ms netmask_eth0: 19.98ms ec2_profile: 0.05ms lsbdistcodename: 203.44ms processor0: 0.03ms memorytotal: 0.07ms macaddress_lo: 20.52ms sshdsakey: 0.19ms ipaddress: 11.28ms ec2_public_keys_0_openssh_key: 0.04ms puppetversion: 475.06ms uptime: 0.07ms ec2_ami_id: 0.02ms shortdomain: 28.15ms netmask: 16.35ms facterversion: 0.03ms rubyversion: 0.04ms ec2_block_device_mapping_ami: 0.02ms hardwareisa: 40.59ms node_profile: 0.03ms vlans: 0.13ms ec2_block_device_mapping_root: 0.02ms augeasversion: 1.95ms active_capacity: 0.03ms selinux_enforced: 0.53ms ec2_instance_type: 0.02ms
real 0m42.341s user 0m3.286s sys 0m35.333s

The numbers we got from loaded nodes don't really indicate any issue in calculating active_capacity. However, the Dir.glob used to generate the git repos count took 2-3 minutes when run by hand. It has already been replaced by a more efficient version this sprint (it now takes 2-3 seconds). Also, https://bugzilla.redhat.com/show_bug.cgi?id=841681 fixed the other issue by preventing multiple copies of the script from running at the same time and wiping out the facts file for long periods.

Expect to close today.

Added an optimization - https://github.com/openshift/crankcase/pull/310

Verified this bug on devenv-stage_228, and it PASSES. Checking the code of the patch, it still counts the git repos in the gear dirs as the return value of git_repos. The main optimization is that the directory glob counting now happens only once (in the old logic it ran twice, once for the git_repos fact and once for the active_capacity fact).
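As a rough illustration of that caching (this is a sketch, not the actual code from openshift/crankcase#310; the gear path and the active_capacity formula are assumptions), the idea is that the expensive Dir.glob runs once when the fact file is loaded and both facts reuse the result:

```ruby
require 'facter'

# Assumed gear location; the nodes in this bug keep gear dirs under
# /var/lib/stickshift.
GEAR_BASE_DIR = '/var/lib/stickshift'

# Run the expensive directory glob exactly once, when the fact file is
# loaded, instead of once per fact as in the old logic.
git_repo_count = Dir.glob(File.join(GEAR_BASE_DIR, '*', 'git', '*.git')).length

Facter.add(:git_repos) do
  setcode { git_repo_count }
end

Facter.add(:active_capacity) do
  setcode do
    # Hypothetical formula: capacity expressed relative to max_active_apps.
    max_active = Facter.value(:max_active_apps).to_f
    format('%.2f', (git_repo_count * 100.0) / max_active)
  end
end
```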
1. Create a lot of gears. Here I copied a lot of gear dirs (~2000) into the /var/lib/stickshift dir and added memory and CPU stress via the stress tool to create a dummy env (stress --vm 1 --vm-bytes 1G --vm-keep -c 10).

Using the new code:
# for i in `seq 1 6`; do time facter active_capacity; done
2501.25 real 0m12.022s user 0m0.435s sys 0m0.405s
2501.25 real 0m11.081s user 0m0.437s sys 0m0.331s
2501.25 real 0m12.090s user 0m0.469s sys 0m0.361s
2501.25 real 0m12.327s user 0m0.488s sys 0m0.382s
2501.25 real 0m12.281s user 0m0.438s sys 0m0.397s
2501.25 real 0m10.622s user 0m0.401s sys 0m0.350s
The average value is about 11.737s.

Using the old code:
# for i in `seq 1 6`; do time facter active_capacity; done
2501.25 real 0m13.423s user 0m0.505s sys 0m0.429s
2501.25 real 0m14.739s user 0m0.541s sys 0m0.476s
2501.25 real 0m13.620s user 0m0.476s sys 0m0.473s
2501.25 real 0m11.868s user 0m0.412s sys 0m0.409s
2501.25 real 0m12.074s user 0m0.424s sys 0m0.417s
2501.25 real 0m11.481s user 0m0.404s sys 0m0.389s
The average value is about 12.867s.

Comparing the two values, the optimization is not very pronounced, at least in my testing env, but it is indeed an optimization. In the initial report, I think the root cause of active_capacity taking 30+ seconds was heavy load on that node. In the test results (comment 8), it takes about 12 seconds to get the active_capacity fact under heavy memory load; once the memory load is released, it only takes about 1.2 seconds.

# for i in `seq 1 6`; do time facter active_capacity; done
2501.25 real 0m1.103s user 0m0.401s sys 0m0.341s
2501.25 real 0m1.206s user 0m0.433s sys 0m0.370s
2501.25 real 0m1.177s user 0m0.387s sys 0m0.365s
2501.25 real 0m1.172s user 0m0.404s sys 0m0.345s
2501.25 real 0m1.102s user 0m0.402s sys 0m0.341s
2501.25 real 0m1.203s user 0m0.398s sys 0m0.345s
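For reference, the averages quoted above work out to (12.022 + 11.081 + 12.090 + 12.327 + 12.281 + 10.622) / 6 ≈ 11.74s for the new code and (13.423 + 14.739 + 13.620 + 11.868 + 12.074 + 11.481) / 6 ≈ 12.87s for the old code, i.e. roughly a 9% improvement while the node is under heavy load, compared to about 1.2s per run once the load is removed.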