On Oct 14, 2005, at 10:44 PM, David Binger wrote:

> It might shed some light on this to run this using the
> hotshot module.

OK, I have a few stats files if anyone cares to eyeball them.

The index generation process starts out OK, using as much CPU as it can, but
after the first 100K objects it runs into memory problems, so I have run it
under the hotshot profiler, stopping at different points. Each run starts with
a fresh db connection, and for all of them I am using the following functions
(see file: build_stocks_db_indices.py):

#--
def profile(func, *args):
    import hotshot, os
    import hotshot.stats
    stopat = 3 # <<<
    filename = 'hotshot%s.stats' % (stopat)
    prof = hotshot.Profile(filename)
    try:
        prof.runcall(func, *args)
    except Exception, e:
        print e
    prof.close()
    stats = hotshot.stats.load(filename)
    # Print reports including all code, even dependencies.
    stats.sort_stats('cumulative', 'calls')
    stats.print_stats(50)
    stats.sort_stats('time', 'calls')
    stats.print_stats(50)
    # Print reports showing only moellus code.
    import moellus
    package_path = os.path.dirname(moellus.__file__)
    # Hotshot stores a lowercase form of the path, so we need to
    # lowercase our path or the regular expression will fail.
    package_path = package_path.lower()
    stats.sort_stats('cumulative', 'calls')
    stats.print_stats(package_path, 50)
    stats.sort_stats('time', 'calls')
    stats.print_stats(package_path, 50)

def rebuild_indices():
    symbols.rebuild_indices()
    pet('Symbols')
    quotes.rebuild_indices()
    pet('Quotes')

profile(rebuild_indices)
#--

The hotshot stats reports for the 4 runs are:

- hotshot_s.txt      : symbols index (8088 objects)
- hotshot_sq100K.txt : symbols + first 100K of quotes index
- hotshot_sq200K.txt : symbols + first 200K of quotes index
- hotshot_sq300K.txt : symbols + first 300K of quotes index

All files referred to here are at:
http://ruggier.org/software/scratch/durustress/

The more interesting ordering is the second one, by tottime. Most of the time
seems to be consumed by _p_load_state, load_state, get_state, get_position,
and __new__ (this last one is surprising, as to my understanding there should
be zero new persistent objects resulting from this process). get_position's
share of tottime is high for the 100K and 200K runs, but surprisingly goes
down for the 300K run (where logically it should be doing more work). As the
number of objects indexed increases, the get_state and __new__ share of time
consumption grows to close to all of it.

Not sure if anyone can see anything more useful... pointing to ideas for
improvements.

mario
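
P.S. One thing I may try next, since the trouble shows up as memory growth
after ~100K objects: dump the process memory and the number of live
persistent instances at the same checkpoints where pet() is called. The
sketch below is minimal and untested; pmem() is a hypothetical helper, it
assumes the standard gc and resource modules, that durus.persistent.Persistent
is the right import path, and that ru_maxrss is a usable proxy for process
memory (kilobytes on Linux, bytes on Mac OS X):

#--
def pmem(label):
    # Hypothetical helper, analogous to pet(): report peak RSS and how many
    # Persistent instances the garbage collector currently tracks, so we can
    # see whether ghosts are piling up between checkpoints.
    import gc, resource
    from durus.persistent import Persistent  # assumed import path
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    live = [o for o in gc.get_objects() if isinstance(o, Persistent)]
    print '%s: maxrss=%s persistent=%s total_objects=%s' % (
        label, rss, len(live), len(gc.get_objects()))

def rebuild_indices():
    symbols.rebuild_indices()
    pet('Symbols')
    pmem('Symbols')
    quotes.rebuild_indices()
    pet('Quotes')
    pmem('Quotes')

profile(rebuild_indices)
#--

If the persistent count keeps climbing across checkpoints, that would at
least tell us where the __new__ calls in the profile are coming from.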