durusmail: durus-users: building a large BTree efficiently
building a large BTree efficiently
2005-10-14
2005-10-14
2005-10-14
2005-10-17
2005-10-17
2005-10-17
2005-10-17
2005-10-17
2005-10-19
2005-10-20
2005-10-20
2005-10-20
2005-10-20
2005-10-20
2005-10-26
2005-10-26
2005-10-26
building a large BTree efficiently
mario ruggier
2005-10-14
Hi,

I am wondering if anyone can suggest any reasonable way to optimize
building a BTree: the objects (values) already exist, and are added to
the BTree with a calculated key.

In particular, I am referring to the method below, that builds a unique
index given an index key (that is a tuple of attribute names, on the
item value), and where:
- when an item is first added to the database, it is added to the its
container's "reference" data BTree (container._items), using an
auto-generated integer id key, that will never change
- the reference data BTree is also one of the indices: container._items
== container._indices['id']

The container method below is used to rebuild any supplementary unique
index defined on the container:

     def rebuild_index(self, index_key):
         if index_key == 'id':
             return None # not allowed to rebuild 'id' index
         print 'Rebuilding %s index: %s' %(self.__class__.__name__,
index_key)
         index = self._indices[index_key]
         index.clear()
         for idkey in self._items:
             item = self._items[idkey]
             index.add(self.get_item_key(item, index_key), item)
         assert len(self._items.keys()) == len(index.keys())

It has worked just fine, until recently... ;)  Now, letting it rip on a
BTree with 2.2 million objects, takes a while. Given my machine is so
slow, it is pointless to give actual time values, but, one remarkable
metric is that the time required to process 100K objects seems to
roughly double with each 100K processed.

In this particular case, I am using degree 16. The constant re-ordering
when an item is added probably accounts for the increasing times...
plus, something that is elusive to me, is that the container's
__cmp__() (I mailed this in another thread yesterday) should give the
same item ordering across all indices?

Can anything be done to alleviate this?

mario

reply