durusmail: durus-users: building a large BTree efficiently
mario ruggier
2005-10-19
On Oct 17, 2005, at 5:15 PM, David Binger wrote:

>
> On Oct 17, 2005, at 10:56 AM, mario ruggier wrote:
>
>> The 3-line iteration over self._items is where it all happens... and
>> it is (almost) pure BTree code. self.get_item_key() just builds a
>> tuple of attr values from item. Looking at BTree's and BNode's
>> __iter__ methods, not clear to me why this will require all items to
>> be in memory.
>
> It is because you can't get the attr values from the item without the
> item being loaded, and Durus does not automatically flush object state
> from memory except when shrink_cache() is called.  (Note that commit()
> and abort() both call shrink_cache(), so this method is not normally
> called directly by an application.)

I have tinkered further with this. There seem to be very different
behaviors between shrink_cache() and commit(). Here's a specific
illustration. The 2 functions below (rebuild_index and repair_index)
have each been run on a container with 2.2 million objects. I am
running durus with:
- logging level 10, to write out all the info messages
- cache size of 20000
- chunk size of cache_size/2 (a commit or shrink every 10000 items)

connection.commit() calls connection.shrink_cache(). However, as can
be seen from the info logging from shrink_cache() in each case, the two
do not behave the same. When shrink_cache() is called directly
(rebuild_index), the "size" of the cache (len(cache.objects)) just
keeps increasing by roughly 10K with every 10K iterations, causing a
machine with limited RAM to quickly become memory-bound. With commit()
(repair_index), the "size" levels off at around 85000. The "loaded"
amount (len(loaded_oids)) stabilizes in both cases at around 60000,
even though cache_size is set to 20000.

Here are the 2 functions, and the shrink_cache() log output from the
first 140K iterations of each:

     def rebuild_index(self, index_key):
         if index_key == 'id':
             return None # not allowed to modify 'id' index
         log(20, 'Rebuilding %s index: %s' % (
             self.__class__.__name__, index_key))
         index = self._indices[index_key]
         index.clear()
         # shrink the cache every half cache-size items
         chunk = int(self._p_connection.cache.get_size()/2)
         for i, idkey in enumerate(self._items):
             item = self._items.get(idkey)
             index.add(self.get_item_key(item, index_key), item)
             if (i % chunk) == 0:
                 # shrink only; nothing is committed in this loop
                 self._p_connection.shrink_cache()
                 log(20, 'Shrunk, chunk %s' % i)
         assert self._items.get_count() == index.get_count()

     def repair_index(self, index_key):
         if index_key == 'id':
             return None # not allowed to modify 'id' index
         log(20, 'Repairing %s index: %s' % (
             self.__class__.__name__, index_key))
         max_key = self._items.get_max_item()[0]
         log(20, 'Maximal key: %s' % max_key)
         index = self._indices[index_key]
         # commit every half cache-size keys
         chunk = int(self._p_connection.cache.get_size()/2)
         for i in xrange(0, 1 + max_key):
             item = self._items.get(i)
             if item is None:
                 continue
             item_key = self.get_item_key(item, index_key)
             if item is not index.get(item_key):
                 index.add(item_key, item)
             if (i % chunk) == 0:
                 # commit() itself calls shrink_cache()
                 self._p_connection.commit()
                 log(20, 'Committed, chunk %s' % i)
         self._p_connection.commit()
         log(20, 'Committed, chunk %s' % max_key)
         assert self._items.get_count() == index.get_count()
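
Both functions share the same shape: walk the items, do some work on
each, and periodically call either shrink_cache() or commit(). A
minimal sketch of that shared loop, with the periodic call passed in so
the two variants differ in one argument only, might look like the
following. The names process_in_chunks, handle and flush are
hypothetical and are not part of Durus; flush stands in for whichever
connection method is being compared.

```python
def process_in_chunks(items, handle, flush, chunk):
    """Call handle(item) for each item and flush() after every `chunk` items.

    `flush` is a caller-supplied stand-in (hypothetical) for the call under
    test, e.g. a connection's commit or shrink_cache method.
    Returns the number of items processed.
    """
    count = 0
    for count, item in enumerate(items, 1):
        handle(item)
        if count % chunk == 0:
            flush()
    if count % chunk != 0:
        flush()  # flush the final, partial chunk
    return count


# Usage with plain lists instead of a real connection:
seen, flushes = [], []
n = process_in_chunks(range(25), seen.append,
                      lambda: flushes.append(len(seen)), chunk=10)
# n is 25 and flushes is [10, 20, 25]
```

Unlike the two functions above, this sketch does not flush at item 0
and always flushes once more at the end.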

#### shrink_cache() log output

Rebuilding Quotes index: ('symbol', 'date')
[840] cache size 10980 loaded 10686
Shrunk, chunk 10000
[840] shrink 0.028878s aged 1638 removed 0 ghosted 0 loaded 21358 size 21638
Shrunk, chunk 20000
[840] shrink 0.140190s aged 7359 removed 1 ghosted 719 loaded 31312 size 32326
Shrunk, chunk 30000
[840] shrink 0.390878s aged 8202 removed 0 ghosted 1834 loaded 40148 size 42980
Shrunk, chunk 40000
[840] shrink 0.300888s aged 8456 removed 16 ghosted 4886 loaded 45934 size 53651
Shrunk, chunk 50000
[840] shrink 0.375727s aged 10179 removed 55 ghosted 5814 loaded 50790 size 64252
Shrunk, chunk 60000
[840] shrink 0.623882s aged 11259 removed 0 ghosted 7460 loaded 54005 size 74974
Shrunk, chunk 70000
[840] shrink 0.715707s aged 12838 removed 655 ghosted 7881 loaded 56197 size 84991
Shrunk, chunk 80000
[840] shrink 0.809534s aged 9028 removed 401 ghosted 10438 loaded 56044 size 95248
Shrunk, chunk 90000
[840] shrink 5.728094s aged 8206 removed 185 ghosted 6897 loaded 59667 size 105748
Shrunk, chunk 100000
[840] shrink 4.636201s aged 11545 removed 278 ghosted 8520 loaded 61677 size 116124
Shrunk, chunk 110000
[840] shrink 0.974375s aged 16053 removed 677 ghosted 8178 loaded 63916 size 126134
Shrunk, chunk 120000
[840] shrink 1.104684s aged 14608 removed 586 ghosted 7957 loaded 66481 size 136206
Shrunk, chunk 130000
[840] shrink 2.781937s aged 9486 removed 294 ghosted 10408 loaded 66686 size 146630
Shrunk, chunk 140000
....

Repairing Quotes index: ('symbol', 'date')
Maximal key: 2200000
[667] shrink 0.068419s aged 7253 removed 0 ghosted 0 loaded 20561 size 29013
Committed, chunk 10000
[667] shrink 0.345206s aged 8331 removed 198 ghosted 1014 loaded 30682 size 40125
Committed, chunk 20000
[667] shrink 0.267090s aged 9482 removed 283 ghosted 2532 loaded 39249 size 51187
Committed, chunk 30000
[667] shrink 0.540040s aged 8758 removed 1496 ghosted 4420 loaded 45133 size 61003
Committed, chunk 40000
[667] shrink 0.555359s aged 9536 removed 947 ghosted 6357 loaded 49268 size 71403
Committed, chunk 50000
[667] shrink 0.751708s aged 8240 removed 5448 ghosted 5279 loaded 50609 size 77267
Committed, chunk 60000
[667] shrink 0.698542s aged 8379 removed 7552 ghosted 2842 loaded 54132 size 81090
Committed, chunk 70000
[667] shrink 0.974925s aged 8105 removed 7499 ghosted 5456 loaded 55953 size 84908
Committed, chunk 80000
[667] shrink 0.620938s aged 7610 removed 8687 ghosted 5311 loaded 58470 size 87535
Committed, chunk 90000
[667] shrink 0.933104s aged 6873 removed 12988 ghosted 3287 loaded 54961 size 85896
Committed, chunk 100000
[667] shrink 0.496229s aged 7372 removed 11629 ghosted 4013 loaded 52873 size 85586
Committed, chunk 110000
[667] shrink 0.624497s aged 5664 removed 13171 ghosted 4742 loaded 50292 size 83768
Committed, chunk 120000
[667] shrink 0.494200s aged 6898 removed 9909 ghosted 4605 loaded 51103 size 85183
Committed, chunk 130000
[667] shrink 0.836211s aged 6437 removed 9592 ghosted 5593 loaded 51572 size 86976
Committed, chunk 140000
....
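
To compare the two runs numerically rather than by eye, the "size"
column can be pulled out of the shrink lines and differenced. The
following is a rough sketch that assumes only the line format shown in
the logs above; the function names and the two sample constants are
made up for illustration, and the samples are excerpts copied from
those logs (including the mail line wraps).

```python
import re

# Matches shrink lines of the form shown in the logs, e.g.
# "[840] shrink 0.028878s aged 1638 removed 0 ghosted 0 loaded 21358 size 21638"
SHRINK_RE = re.compile(
    r"shrink\s+(?P<secs>[\d.]+)s\s+aged\s+(?P<aged>\d+)"
    r"\s+removed\s+(?P<removed>\d+)\s+ghosted\s+(?P<ghosted>\d+)"
    r"\s+loaded\s+(?P<loaded>\d+)\s+size\s+(?P<size>\d+)"
)


def parse_shrink_log(text):
    """Return one dict per shrink entry, tolerating hard-wrapped lines."""
    flat = " ".join(text.split())  # undo wraps such as "size\n21638"
    return [
        {k: (float(v) if k == "secs" else int(v))
         for k, v in m.groupdict().items()}
        for m in SHRINK_RE.finditer(flat)
    ]


def size_deltas(entries):
    """Change in cache 'size' between consecutive shrink entries."""
    sizes = [e["size"] for e in entries]
    return [b - a for a, b in zip(sizes, sizes[1:])]


# First three shrink entries of the rebuild_index (shrink_cache) run.
REBUILD_SAMPLE = """
[840] shrink 0.028878s aged 1638 removed 0 ghosted 0 loaded 21358 size
21638
Shrunk, chunk 20000
[840] shrink 0.140190s aged 7359 removed 1 ghosted 719 loaded 31312
size 32326
Shrunk, chunk 30000
[840] shrink 0.390878s aged 8202 removed 0 ghosted 1834 loaded 40148
size 42980
"""

# Last three shrink entries shown for the repair_index (commit) run.
REPAIR_SAMPLE = """
[667] shrink 0.624497s aged 5664 removed 13171 ghosted 4742 loaded
50292 size 83768
Committed, chunk 120000
[667] shrink 0.494200s aged 6898 removed 9909 ghosted 4605 loaded 51103
size 85183
Committed, chunk 130000
[667] shrink 0.836211s aged 6437 removed 9592 ghosted 5593 loaded 51572
size 86976
"""

print(size_deltas(parse_shrink_log(REBUILD_SAMPLE)))  # [10688, 10654]
print(size_deltas(parse_shrink_log(REPAIR_SAMPLE)))   # [1415, 1793]
```

On these excerpts the shrink_cache() run gains a little over 10K
entries per chunk, while the commit() run changes by under 2K.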


Either I am missing the logic behind this, or it is a problem....

mario
