durusmail: durus-users: building a large BTree efficiently
building a large BTree efficiently
David Binger
2005-10-20
On Oct 19, 2005, at 6:14 PM, mario ruggier wrote:

>
> On Oct 17, 2005, at 5:15 PM, David Binger wrote:
>
>
>>
>> On Oct 17, 2005, at 10:56 AM, mario ruggier wrote:
>>
>>
>>> The 3-line iteration over self._items is where it all happens...
>>> and it is (almost) pure BTree code. self.get_item_key() just
>>> builds a tuple of attr values from item. Looking at BTree's and
>>> BNode's __iter__ methods, not clear to me why this will require
>>> all items to be in memory.
>>>
>>
>> It is because you can't get the attr values from the item without
>> the item being loaded,
>> and Durus does not automatically flush object state from memory
>> except when shrink_cache()
>> is called.  (Note that commit() and abort() both call shrink_cache(),
>> so this method is not normally called directly by an application.)
>>
>
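
A minimal sketch of the loading behavior described above, assuming a
Durus BTree named `btree`; the key and attribute names are hypothetical,
not from the thread:

    # Touching any attribute of a ghost object forces Durus to load
    # its full state from storage into the connection cache.
    item = btree[some_key]      # some_key is hypothetical
    key = (item.a, item.b)      # attribute access loads the item
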
> I have tinkered further with this. There seem to be very different
> behaviors between shrink_cache() and commit(). Here's a specific
> illustration. The 2 functions below (rebuild and repair index) have
> each been run on a container with 2.2 million objects. I am running
> durus with:
> - logging level 10, to write out all the info messages
> - cache size of 20000
> - chunk size of cache_size/2 (frequency of commits/shrinking => 10000)
>
> connection.commit() calls connection.shrink_cache()... however, the
> two behave very differently, as can be seen from the info logging
> from shrink_cache() in each case. In the case of commit(), the "size"
> of the cache (len(cache.objects)) just keeps increasing with every
> 10K iterations, at the same rate, causing a machine with limited RAM
> to quickly become memory-bound. The "loaded" amount (len(loaded_oids))
> stabilizes in both cases at around 60000 (even though cache_size is
> set to 20000).
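
A hedged sketch of the kind of loop being compared; the actual rebuild
and repair functions are not shown in this excerpt, and get_item_key()
is the thread's hypothetical key-building helper:

    CHUNK = 10000  # cache_size / 2, per the setup above

    def walk_items(connection, items, index, get_item_key, flush):
        # flush is either connection.commit or connection.shrink_cache,
        # depending on which variant is being measured.
        for count, item in enumerate(items):
            # Computing the key reads item attributes, which loads the
            # item's state into the cache.
            index[get_item_key(item)] = item
            if count and count % CHUNK == 0:
                flush()
        connection.commit()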

shrink_cache() won't remove any object that has been changed since
the last commit().  It has the best chance to actually remove objects
when it is called at the end of a commit() or abort() call.
Each time you call shrink_cache(), it only looks at a fraction of the
total available objects to see if they can be removed: this avoids
long delays.
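
One hedged consequence of that fractional scan: a few extra passes
after a commit() may evict more objects than a single call would (the
exact fraction examined per pass is an implementation detail):

    connection.commit()              # saves changes, then shrinks once
    for _ in range(5):               # each extra pass examines another
        connection.shrink_cache()    # slice of the cache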

Do the items in your indices have uncommitted changes during these
loops?  You might also consider calling _p_set_status_ghost() on each
item, if you are sure that you don't have anything there that needs
saving.
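
A minimal sketch of that suggestion, using the same hypothetical loop
as above and assuming the items have no unsaved changes:

    for count, item in enumerate(items):
        index[get_item_key(item)] = item
        # Drop the item's in-memory state right away; Durus reloads
        # it from storage if it is touched again.  Only safe when the
        # item has nothing that needs saving.
        item._p_set_status_ghost()
        if count and count % CHUNK == 0:
            connection.commit()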
