durusmail: durus-users: building a large BTree efficiently
building a large BTree efficiently
mario ruggier
2005-10-17
On Oct 17, 2005, at 4:20 PM, David Binger wrote:
> On Oct 17, 2005, at 8:23 AM, mario ruggier wrote:
>
>> OK, I have a few stats files if anyone cares to eyeball through them.
>>
>> The more interesting is the second ordering, by tottime. Most of the
>> time seems to be consumed by _p_load_state, load_state, get_state,
>> get_position, and __new__ (this last one is surprising, as to my
>> understanding there should be zero new persistent objects resulting
>> from this process). get_position's share of tottime is high for the
>> 100K and 200K object runs, but surprisingly goes down for the 300K
>> run (where logically it should be doing more work).
>>
>> As the number of objects indexed increases, the share of time
>> consumed by get_state and __new__ grows to nearly all of it.
>
> When an object is loaded, the __new__ method is called.
> All of the persistent instances are "new" to this process.

Ah, ok.

> I think your tests are measuring read() and paging times
> more than anything else.  Your rebuild_index function apparently
> requires every object to be loaded and kept in RAM.  I don't think
> you have enough RAM to do that.
>
> You might see some change (could be better or could be worse)
> by calling Connection.shrink_cache() occasionally, or by using
> a BTree with a different degree.

I will experiment with the shrink_cache() call (sketched after the
code below). I am using the default cache_size of 8000.

To play with the BTree degree, I'd need to regenerate the db from
scratch... maybe on another machine ;) I assume I should try a
higher degree?
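
Something like this, I suppose, for the container's tree (a sketch
only; I'm assuming the per-degree BNode classes that durus.btree
generates, e.g. BNode32, and that BTree takes the node constructor
as its argument):

from durus.btree import BTree, BNode32

# Assumption: durus.btree defines BNode32 and BTree(node_constructor).
# A larger minimum degree gives a shallower tree with fewer node
# loads per lookup, at the cost of bigger node records on disk.
items = BTree(BNode32)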

As for requiring all objects in memory, here's the pertinent code in
the container class for the rebuild_index function. Here the variables
self._items and index are BTrees of the same degree.

The 3-line iteration over self._items is where it all happens... and it
is (almost) pure BTree code. self.get_item_key() just builds a tuple of
attr values from the item. Looking at the __iter__ methods of BTree and
BNode, it is not clear to me why this would require all items to be in
memory.

from durus.persistent import Persistent
from durus.persistent_dict import PersistentDict

class PersistentContainer(Persistent):
    ...
    def __init__(self, minimum_degree=16):
        self._minimum_degree = minimum_degree
        self._items = self.mBTree() # uses class BNode%(minimum_degree)s
        self._indices = PersistentDict()
        ...

    def rebuild_index(self, index_key):
        if index_key == 'id':
            return None # not allowed to rebuild the 'id' index
        print 'Rebuilding %s index: %s' % (self.__class__.__name__,
                                           index_key)
        index = self._indices[index_key]
        index.clear()
        # Walk the primary BTree and re-add every item to the
        # secondary index, keyed by the item's attribute tuple.
        for idkey in self._items:
            item = self._items[idkey]
            index.add(self.get_item_key(item, index_key), item)
        assert len(self._items.keys()) == len(index.keys())
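
If the cache filling up is the problem, I guess the loop could shrink
it periodically. A rough sketch (the 10000 interval is a guess, and
I'm assuming _p_connection is the right way to reach the Connection
from inside the container):

    def rebuild_index(self, index_key):
        ...
        count = 0
        for idkey in self._items:
            item = self._items[idkey]
            index.add(self.get_item_key(item, index_key), item)
            count += 1
            if count % 10000 == 0:
                # Assumption: shrink_cache() lets the connection
                # ghostify loaded objects beyond cache_size, keeping
                # RAM use bounded during the walk.
                self._p_connection.shrink_cache()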


mario
