Updated: Comparison of File, Shelf, Postgresql storages.
Michael Watkins
2007-09-14
Here's the last version of this (I hope). Changes include:

- Adding SqliteStorage into the mix
- Providing more accurate numbers for BerkeleyDBStorage,
  PostgresqlStorage, SqliteStorage
- Adding access time results using ClientStorage against a
  StorageServer using each of the storages. The intent was to see
  how well they play together.
- Running "stress.py" from durus.test against StorageServer with
  each of the storage backends.

If I have any conclusion to draw from this, it's that StorageServer /
ClientStorage tends to smooth out the differences.

With all these alternatives to FileStorage2, notably ShelfStorage,
which will be the new default, perhaps it's time to update the Durus
README and change "best suited to collections of less than a million
instances" to reflect the new situation. It is certainly now possible
to have millions of instances running on fairly cheap commodity
hardware without gobbling up all the server's RAM, although with all
storages you will have to devise a sane strategy for getting the
information in there if a bulk load is required (a sketch follows).
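
For what it's worth, batching commits keeps memory use bounded during
a bulk load. A minimal sketch using core Durus APIs; "NewsItem" here
is a stand-in for my actual class, and the file name is arbitrary:

    from durus.connection import Connection
    from durus.file_storage import FileStorage
    from durus.persistent import Persistent
    from durus.btree import BTree

    class NewsItem(Persistent):
        def __init__(self, number):
            self.number = number
            self.title = "News item %d" % number

    # Open (or create) a file-backed storage and fetch the root mapping.
    connection = Connection(FileStorage("news.durus"))
    root = connection.get_root()
    root["news"] = items = BTree()

    for n in range(500000):
        items[n] = NewsItem(n)
        if n % 10000 == 0:
            connection.commit()   # commit in batches to bound RAM use
    connection.commit()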


1. Create a new storage
-----------------------

Create a new Durus db with 500,000 "NewsItem" instances (contained
in a NewsDatabase, which is built around a BTree):

                    Seconds   RAM consumed (at max)
FileStorage2        341       341MB
ShelfStorage        535       375
PgsqlStorage        807       296 (Python); 36MB (Postgres process)
SqliteStorage       607       490
BerkeleyDBStorage   951       346

Note on file space consumed: the Shelf and File storages consume the
least disk space, as one might imagine; SqliteStorage adds a little
more overhead. I didn't attempt to measure Postgresql. When
performing the initial import, BerkeleyDBStorage filled up my /tmp
partition: during the initial import a number of transaction log
files are still present, bringing the storage's disk consumption to
~250MB; as transaction logs were rolled up, this dropped to ~150MB.
By way of comparison, the file storages consume approximately
98-105MB.

2. Time to Pack
---------------

Times are for a pack following the initial commit of 500,000 new
object instances; they do not include start-up time (see the table in
section 3 for start-up times).

                     Seconds   RAM Consumed During
FileStorage2          52       214MB
*ShelfStorage        248        75
PgsqlStorage         122        44 (Postgresql server)
                                36 (Python process)
SqliteStorage         84       254
**BerkeleyDBStorage    0.014   negligible
                      59        30 (6655 'garbage objects')

*ShelfStorage: pack times (even for a just-packed storage) were
observed to vary more than one would expect.
**BerkeleyDBStorage tracks objects for garbage collection during
normal operation; pack() has nothing to do unless there is garbage to
clean up, since it does not examine every record in the storage the
way all the other storages do. I deleted several thousand objects to
give it something to do, and the second row reflects that.
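
For the record, the pack figures are simple wall-clock timings of the
pack call; a minimal sketch, assuming a connection opened as in the
bulk-load sketch above:

    import time

    start = time.time()
    connection.pack()   # ask the storage to collect garbage / compact
    print("pack took %.3f seconds" % (time.time() - start))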


3. Start Up Times
-----------------
(time to get "root")

                    Seconds     RAM Consumed
FileStorage2
  Before pack       12.316       75MB
  After pack         3.923      104
ShelfStorage
  Before pack       18.696       75
  After pack         0.006       11
PgsqlStorage         0.029       15 (+ Postgres 18MB)
SqliteStorage
  Before pack        0.011       22
  After pack         0.011       22
BerkeleyDBStorage
  Before pack        0.49*       26
  After pack         0.081       20
  *Not so sure about this number - may have been my system
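
"Time to get root" is just the cost of opening the storage and
fetching the root object; a minimal sketch of the measurement, with
the file name assumed:

    import time
    from durus.connection import Connection
    from durus.file_storage import FileStorage

    start = time.time()
    connection = Connection(FileStorage("news.durus"))
    root = connection.get_root()
    print("get_root took %.3f seconds" % (time.time() - start))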


4. Time to Access Objects
-------------------------

Best of three runs. The "K" column returns one constant object
(NewsItem 123); the "Random" columns select random objects from among
the 500,000 news items.

                    (All times after pack, in seconds)
                    K       |--- Random ---------
                    1       10      100     1000
FileStorage2        0.006   0.054   0.304   2.005
ShelfStorage        0.007   0.059   0.323   2.119
PgsqlStorage        0.009   0.072   0.414   2.690
SqliteStorage       0.021   0.060   0.398   2.491
BerkeleyDBStorage   0.008   0.042   0.343   2.229

ClientStorage accessing StorageServer running:
 FileStorage2       0.010   0.074   0.430   2.853
 ShelfStorage       0.011   0.075   0.499   3.171
 PostgresqlStorage  0.025   0.084   0.565   3.647
 SqliteStorage      0.012   0.089   0.514   3.483
 BerkeleyDBStorage  0.024   0.105   0.946   6.725

Editorial note: Durus caching seems to level the playing field for
most storages.
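
The lookups themselves are plain BTree accesses; a minimal sketch of
the random case over ClientStorage, assuming a StorageServer already
running on the default port (2972) and the key layout from the
bulk-load sketch above:

    import random
    import time
    from durus.client_storage import ClientStorage
    from durus.connection import Connection

    # Server started separately, e.g.: durus -s --file=news.durus
    connection = Connection(ClientStorage(host="localhost", port=2972))
    items = connection.get_root()["news"]

    start = time.time()
    for i in range(1000):
        item = items[random.randrange(500000)]   # or items[123] for "K"
    print("1000 random reads took %.3f seconds" % (time.time() - start))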


5. StorageServer - stress.py
----------------------------

Results of durus/test/stress.py - 50 loops, 2 runs:
    (% time python stress.py --max-loops=50)
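
(For those unfamiliar with csh's time output, the fields below are:
user and system CPU seconds, elapsed time, CPU percentage, average
shared + unshared memory in kilobytes, block input + output
operations, and page faults + swaps.)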

FileStorage2
 run 1 - 8.479u 3.503s 0:17.34 69.0% 1011+61333k 2+0io 0pf+0w
 run 2 - 1.382u 0.388s 0:04.89 35.9% 1012+7491k 0+0io 0pf+0w
ShelfStorage
 run 1 - 8.277u 3.509s 0:17.44 67.4% 1003+60590k 2+0io 0pf+0w
 run 2 - 1.578u 0.635s 0:06.88 31.9% 1011+8185k 0+0io 0pf+0w
PostgresqlStorage
 run 1 - 8.187u 3.417s 0:24.09 48.1% 1006+59575k 0+0io 0pf+0w
 run 2 - 1.684u 0.604s 0:06.99 32.6% 982+7865k 0+0io 0pf+0w
SqliteStorage
 run 1 - 8.376u 3.783s 0:25.17 48.2% 1002+59796k 2+0io 0pf+0w
 run 2 - 1.845u 0.792s 0:10.68 24.6% 1009+8466k 0+0io 0pf+0w
BerkeleyDBStorage
 run 1 - 8.431u 3.448s 1:04.15 18.5% 999+60854k 0+0io 0pf+0w
 run 2 - 1.726u 0.551s 0:08.51 26.6% 1048+7933k 16+0io 0pf+0w