durusmail: durus-users: building a large BTree efficiently
building a large BTree efficiently
mario ruggier
2005-10-20
On Oct 20, 2005, at 4:13 PM, David Binger wrote:
> On Oct 19, 2005, at 6:14 PM, mario ruggier wrote:
>>
>> connection.commit() calls connection.shrink_cache()... however, as
>> can be seen from the info logging from shrink_cache() in each case,
>> with commit() the "size" of the cache (len(cache.objects)) just
>> keeps increasing at the same rate with every 10K iterations, causing
>> a machine with limited RAM to quickly become memory-bound. The
>> "loaded" amount (len(loaded_oids)) stabilizes in both cases at
>> around 60000 (but with cache_size set to 20000).
>
> shrink_cache() won't remove any object that has been changed since
> the last commit().  It has the best chance to actually remove objects
> when it is called at the end of a commit() or abort() call.

That seems an important point to be aware of, that I was not.
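A toy model of that rule makes the effect visible (ToyCache and its methods are invented for illustration here; this is not the durus API, just the eviction rule described above: only objects unchanged since the last commit can be removed):

```python
# Toy model of the shrink_cache() rule: an object can only be evicted
# if it has NOT been changed since the last commit().
# ToyCache is invented for illustration; it is not the durus API.

class ToyCache:
    def __init__(self, target_size):
        self.target_size = target_size
        self.objects = {}       # oid -> object
        self.modified = set()   # oids changed since last commit

    def load(self, oid, obj):
        self.objects[oid] = obj

    def modify(self, oid):
        self.modified.add(oid)

    def shrink(self):
        # Evict unmodified objects until we reach the target size;
        # modified objects must stay until they are committed.
        evictable = [oid for oid in self.objects if oid not in self.modified]
        while len(self.objects) > self.target_size and evictable:
            del self.objects[evictable.pop()]

    def commit(self):
        self.modified.clear()   # changes are now saved ...
        self.shrink()           # ... so shrink() can evict them

# Without intermediate commits, every object stays modified and the
# cache grows without bound, no matter how often we shrink:
cache = ToyCache(target_size=100)
for oid in range(1000):
    cache.load(oid, object())
    cache.modify(oid)
    cache.shrink()
assert len(cache.objects) == 1000

# Committing periodically lets shrink() do its job:
cache = ToyCache(target_size=100)
for oid in range(1000):
    cache.load(oid, object())
    cache.modify(oid)
    if oid % 100 == 99:
        cache.commit()
assert len(cache.objects) <= 100
```

This matches the symptom reported above: modifying objects in a long loop without committing keeps the whole working set pinned in the cache.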

> Each time you call shrink_cache(), it only looks at a fraction of the
> total available objects to see if they can be removed; this avoids
> long delays.

OK.

> Do the items in your indices have uncommitted changes during these
> loops?
> You might also consider calling _p_set_status_ghost() on each item,
> if you are sure that you don't have anything there that needs saving.

For this case that should work brilliantly, as there are no uncommitted
changes...
However, the whole exercise is to understand better what can be done,
and how much. Normally I cannot assume there are no uncommitted
changes, though I guess that could easily be made a precondition for
launching an index rebuild (but this may not hold for other situations,
see below).

Can this be made safer by simply checking whether an object's
_p_status == SAVED, and if so freely ghosting it?
Furthermore, can this behaviour be kicked in selectively, such as when
the cache approaches its configured capacity?
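That guarded-ghosting idea can be sketched as below. ToyPersistent stands in for durus's Persistent class so the sketch is self-contained; _p_status and _p_set_status_ghost() are the names used in this thread, but the SAVED/UNSAVED/GHOST values are placeholders, not necessarily durus's actual constants:

```python
# Sketch of guarded ghosting: ghost an object only when it has no
# unsaved changes. ToyPersistent is a stand-in for durus's Persistent;
# the status values below are placeholders for illustration.

SAVED, UNSAVED, GHOST = "saved", "unsaved", "ghost"

class ToyPersistent:
    def __init__(self):
        self._p_status = UNSAVED

    def _p_set_status_ghost(self):
        # In durus this would drop the object's loaded state; here we
        # just record the transition.
        self._p_status = GHOST

def ghost_if_saved(obj):
    """Ghost an object only if its state is safely committed."""
    if obj._p_status == SAVED:
        obj._p_set_status_ghost()
        return True
    return False

clean, dirty = ToyPersistent(), ToyPersistent()
clean._p_status = SAVED           # pretend this one was committed
assert ghost_if_saved(clean)      # safe to ghost
assert not ghost_if_saved(dirty)  # unsaved changes: left alone
assert dirty._p_status == UNSAVED
```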

Can such an algorithm also be applied to other common situations that
require iterating over a whole BTree or other container? One example
is calling BTree.get_count(), which, given a big enough BTree (or a
small enough machine), will never return... Another example in my case
is rebuilding the inverses of many-to-one relationships.
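One shape such an algorithm could take is a chunked traversal that periodically gives the cache a chance to shed clean objects. This is a hypothetical sketch, not a durus API: count_in_chunks and the release callback are invented names, and in practice release might call connection.shrink_cache() or ghost saved items as discussed above:

```python
# Hypothetical chunked traversal: iterate a large container and, every
# `chunk` items, invoke a caller-supplied `release` callback (e.g.
# something that ghosts clean objects or shrinks the cache).
# All names here are invented for illustration.

def count_in_chunks(items, chunk, release):
    count = 0
    for _value in items:
        count += 1
        if count % chunk == 0:
            release()  # let the cache shed clean objects periodically
    return count

released = []
n = count_in_chunks(iter(range(25)), chunk=10,
                    release=lambda: released.append(1))
assert n == 25
assert len(released) == 2  # released after items 10 and 20
```

The same skeleton would apply to an inverse-rebuild pass: do a bounded amount of work, then release, so memory stays roughly constant regardless of container size.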

mario
