durusmail: durus-users: building a large BTree efficiently
building a large BTree efficiently
mario ruggier
2005-10-17
On Oct 14, 2005, at 10:44 PM, David Binger wrote:
>
> It might shed some light on this to run this using the
> hotshot module.

OK, I have a few stats files if anyone cares to eyeball through them.

The index generation process starts out fine, using as much CPU as it
can, but after the first 100K objects it runs into memory problems.
So I have run it under the hotshot profiler, stopping at different
points. Each run starts with a fresh db connection, and all of them
use the following functions (see file: build_stocks_db_indices.py):

#--

def profile(func, *args):
    import hotshot, os
    import hotshot.stats
    stopat = 3  # <<< changed for each profiling run
    filename = 'hotshot%s.stats' % (stopat,)
    prof = hotshot.Profile(filename)
    try:
        prof.runcall(func, *args)
    except Exception, e:
        # a run that is stopped early ends by raising; catch it so
        # the stats file still gets written
        print e
    prof.close()

    stats = hotshot.stats.load(filename)

    # Print reports including all code, even dependencies.
    stats.sort_stats('cumulative', 'calls')
    stats.print_stats(50)
    stats.sort_stats('time', 'calls')
    stats.print_stats(50)

    # Print reports showing only moellus code.
    import moellus
    package_path = os.path.dirname(moellus.__file__)
    # Hotshot stores a lowercase form of the path, so we need to
    # lowercase our path or the regular expression will fail.
    package_path = package_path.lower()

    stats.sort_stats('cumulative', 'calls')
    stats.print_stats(package_path, 50)
    stats.sort_stats('time', 'calls')
    stats.print_stats(package_path, 50)

def rebuild_indices():
    # symbols, quotes and pet() are defined elsewhere in
    # build_stocks_db_indices.py
    symbols.rebuild_indices()
    pet('Symbols')
    quotes.rebuild_indices()
    pet('Quotes')

profile(rebuild_indices)

#--
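
Two bits not shown above: pet() is just a small print-elapsed-time
helper, and the "stopping" is done by raising an exception out of the
index-building loop once the target count is reached, which the
try/except in profile() then catches. Roughly like this (the names
and details here are guesses, not the actual code):

#--

import time

_t0 = time.time()

def pet(label):
    # print elapsed time since the previous checkpoint
    global _t0
    now = time.time()
    print '%s: %.1fs' % (label, now - _t0)
    _t0 = now

class StopRun(Exception):
    # raised from the indexing loop after N objects, so that
    # profile() still writes a stats file for the partial run
    pass

#--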

The hotshot stats reports for the 4 runs are:
- hotshot_s.txt : symbols index (8088 objects)
- hotshot_sq100K.txt : symbols + first 100K of quotes index
- hotshot_sq200K.txt : symbols + first 200K of quotes index
- hotshot_sq300K.txt : symbols + first 300K of quotes index

All files referred to here are at:
http://ruggier.org/software/scratch/durustress/

The more interesting ordering is the second one, by tottime. Most of
the time seems to be consumed by _p_load_state, load_state, get_state,
get_position, and __new__. This last one is surprising, as to my
understanding this process should create zero new persistent objects;
presumably those calls come from re-creating instances for objects
that have been evicted from the cache and are loaded again, rather
than from genuinely new objects. get_position's share of tottime is
high for the 100K and 200K runs, but surprisingly goes down for the
300K run (where logically it should be doing more work).

As the number of objects indexed increases, the share of time consumed
by get_state and __new__ grows to nearly all of it.
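
If the time really is going into re-loading state for objects that
keep falling out of the cache, one thing worth trying is committing
in batches during the rebuild, so that modified state is written out
and the cache stays bounded. A rough sketch (batch size and names are
invented for illustration; only connection.commit() and item
assignment are assumed of the API):

#--

def build_index(connection, pairs, index, batch=10000):
    # pairs: an iterable of (key, value) items to insert
    count = 0
    for key, value in pairs:
        index[key] = value
        count += 1
        if count % batch == 0:
            # write out modified objects so the connection cache
            # can release loaded state between batches
            connection.commit()
    connection.commit()

#--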

Not sure if anyone can see anything more useful in these... any
pointers to ideas for improvements would be appreciated.

mario


