durusmail: durus-users: building a large BTree efficiently
building a large BTree efficiently
mario ruggier
2005-10-17
On Oct 17, 2005, at 5:15 PM, David Binger wrote:
> On Oct 17, 2005, at 10:56 AM, mario ruggier wrote:
>
>> The 3-line iteration over self._items is where it all happens... and
>> it is (almost) pure BTree code. self.get_item_key() just builds a
>> tuple of attr values from item. Looking at BTree's and BNode's
>> __iter__ methods, not clear to me why this will require all items to
>> be in memory.
>
> It is because you can't get the attr values from the item without the
> item being loaded,
> and Durus does not automatically flush object state from memory except
> when shrink_cache()
> is called.

Ah, I was not aware of that!

I am playing with shrink_cache()... first of all, I am setting an
iteration chunk size of:

         chunk = int(self._p_connection.cache.get_size()/2)

It is half the cache_size because, for each key,value added to the
BTree, the objects that must be loaded include not only the value but
also the persistent objects referenced by the key tuple.

Then inside the iteration loop, when I hit a chunk count, I am doing:

                 cache_count = self._p_connection.cache.get_count()
                 self._p_connection.shrink_cache()
                 shrunk_count = self._p_connection.cache.get_count()

However, cache_count and shrunk_count are almost always the same, and
well above the value of the connection's cache_size (even the first
time around). After a few chunks, some objects seem to be removed from
the cache, but very few... The Virtual Memory size of the process
continues to increase slowly.
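For concreteness, here is a minimal runnable sketch of the chunked
pattern described above. ToyCache and ToyConnection are stand-ins I
made up for this sketch, not Durus classes (in real code, conn would be
a durus.connection.Connection, and get_size(), get_count() and
shrink_cache() are the actual Durus calls used above); the eviction
policy here is a toy, and deliberately more aggressive than Durus's.

```python
# Toy stand-ins for Durus's Connection and its cache, only to exercise
# the chunked-shrink pattern from the message above.
class ToyCache:
    def __init__(self, size):
        self._size = size       # target cache size (cf. cache_size=50000)
        self._objects = []      # "loaded" objects, oldest first

    def get_size(self):
        return self._size

    def get_count(self):
        return len(self._objects)

    def hold(self, obj):
        self._objects.append(obj)

    def shrink(self):
        # Toy eviction: drop the oldest objects over the target size.
        # Durus's real shrink_cache() is more conservative (it will not
        # evict objects that are still referenced), which is why the
        # counts reported in this message barely move.
        excess = len(self._objects) - self._size
        if excess > 0:
            del self._objects[:excess]


class ToyConnection:
    def __init__(self, cache_size):
        self.cache = ToyCache(cache_size)

    def shrink_cache(self):
        self.cache.shrink()


def rebuild_index(conn, items, key_func, btree):
    """Chunked rebuild: shrink the cache every cache_size/2 items."""
    chunk = int(conn.cache.get_size() / 2)
    for i, item in enumerate(items, 1):
        conn.cache.hold(item)            # simulate loading the object
        btree[key_func(item)] = item
        if i % chunk == 0:
            cache_count = conn.cache.get_count()
            conn.shrink_cache()
            shrunk_count = conn.cache.get_count()
            print(f"{i // chunk} * {chunk} : {cache_count} ~ {shrunk_count}")


conn = ToyConnection(cache_size=10)
index = {}  # plain dict standing in for a BTree
rebuild_index(conn, range(25), lambda x: (x,), index)
```

With the toy eviction the cache count stays bounded near cache_size;
the numbers below show that the real shrink_cache() does not behave
this way on this workload.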

Here are some specific numbers (running under the profiler, actually),
for a connection with a cache_size of 50000. It seems that around the
200K object mark (iteration) it really starts to flail... The data
columns printed below are:
chunk_num * chunk_size : cache_count ~ shrunk_count : chunk time ~ total time

$ python build_stocks_db_indices.py
....
Rebuilding Quotes index: ('symbol', 'date')
1 * 25000 : 26804 ~ 26804 : 106.8546 ~ 106.8593
2 * 25000 : 53482 ~ 53482 : 129.3596 ~ 236.2189
3 * 25000 : 80191 ~ 80191 : 139.9835 ~ 376.2024
4 * 25000 : 106875 ~ 106875 : 151.0538 ~ 527.2562
5 * 25000 : 133570 ~ 133093 : 158.1503 ~ 685.4065
6 * 25000 : 159771 ~ 158076 : 203.4847 ~ 888.8912
7 * 25000 : 184775 ~ 183403 : 224.9891 ~ 1113.8803
8 * 25000 : 210126 ~ 208997 : 248.6730 ~ 1362.5533
9 * 25000 : 235657 ~ 234545 : 401.5871 ~ 1764.1404
10 * 25000 : 261258 ~ 259992 : 661.5155 ~ 2425.6560


mario
