==============
Durus Overview
==============

:Author: Mike Orr
:Date: 2006-10-18

Durus is a simple object database for Python.  This analysis covers
Durus 3.5.

Database theory
===============

A Durus database behaves like a Python dictionary.  You can store any
picklable value in it: strings, numbers, dicts, lists, class instances,
etc. [#pickle]_  This makes it much easier to work with than SQL databases,
with their peculiar language, rigid data structures, and foreign keys, and
the performance is often the same if not better.  (Durus is recommended
only for "read often, write occasionally" databases of less than a million
records.)  Here's a typical interactive use::

    1  from durus.connection import Connection
    2  from durus.file_storage import FileStorage
    3  from durus.persistent import Persistent
    4  from durus.persistent_dict import PersistentDict
    5  class MyClass(Persistent):
    6      ...
    7  storage = FileStorage("test.durus")
    8  conn = Connection(storage)
    9  root = conn.get_root()
    10 root["nonpersist_dict"] = {"A": 1, "B": 2}
    11 root["persist_dict"] = PersistentDict({"C": 3, "D": 4})
    12 root["persist_obj"] = MyClass()
    13 conn.commit()

All database management is via connection methods (lines 9 & 13); all data
access is via the root object (lines 10-12).  Durus is transactional, so
you must commit your changes (line 13) in order to save them.  (If you
decide not to make the changes after all, call ``conn.abort()`` to undo
them -- if you haven't committed them yet.)

Durus makes a distinction between "persistent" objects (anything that
inherits from Persistent) and "nonpersistent" objects (anything else).
The value in line 10 is an ordinary Python dict (nonpersistent).  This
means it's stored in its parent's pickle and loaded into memory along with
it.  The value in line 11 is a PersistentDict, which behaves the same but
is stored in a separate pickle and loaded only when one of its instance
attributes is accessed.
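The pickling Durus relies on can be tried directly with the standard
library.  This stand-alone sketch uses plain ``pickle`` (no Durus) to show
the round trip every stored value goes through:

```python
import pickle

# Any picklable value -- here a dict mixing strings, numbers, and a list.
record = {"A": 1, "B": [2, 3], "name": "example"}

# Serialize to a byte string; Durus stores such pickles in its file.
data = pickle.dumps(record)

# Deserialize: the copy is equal to, but independent of, the original.
copy = pickle.loads(data)
assert copy == record
assert copy is not record
```

Anything ``pickle.dumps()`` rejects (an open file, a socket) cannot be
stored in the database either.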
Persistent is often preferable but not always; see "When to use Persistent
objects" below for the tradeoffs.

A persistent object has three states: SAVED, UNSAVED, and GHOST.  State
refers to the relationship between the object's .__dict__ in memory (the
cached value) and the pickle on disk.  SAVED means the two are identical.
UNSAVED means the cache contains unsaved changes.  GHOST means the cache
is empty (the memory has been released): the state will be loaded from
disk at the next access.  Every object is initially in GHOST
state. [#state]_

Every persistent object has a unique object ID ("OID").  When a
transaction is committed, a new pickle is appended for every modified
persistent object, and an OID-to-pickle index is written.  Old pickles and
indexes remain in the file but are unused, though they can be accessed by
"time travelling".  Running the "pack" operation deletes all old pickles
and indexes and releases the disk space they occupied.  Durus is thus a
typical "append-only" database.  This gives it excellent performance,
reliability, and recoverability, but it also means the file size can
become large if objects are frequently modified, such as by appending to a
list or changing a dict value.

When to use Persistent objects
==============================

*These are just the opinions of one user; they may not take into account
all situations.*

Persistent is great for large objects that are used less often than their
parent, because they can be ghosted separately, saving memory.  Persistent
is also good when an object is modified more frequently than its parent,
because a smaller pickle is written.  However, in both cases a persistent
object will be loaded more frequently than a nonpersistent one, which may
slow the system down.  A persistent object also takes more disk space than
a nonpersistent one.  However, a mutable nonpersistent object has the
disadvantage that *Durus won't notice if it's modified*.
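The problem can be demonstrated without Durus.  Here is a toy stand-in
(illustrative only, not Durus code) that, like Persistent, flags itself
changed from ``__setattr__``; mutating a nested plain dict in place never
triggers that hook:

```python
class ToyPersistent:
    """Toy stand-in for Persistent: flags itself dirty on attribute writes."""
    def __init__(self):
        object.__setattr__(self, "_changed", False)
        object.__setattr__(self, "data", {})

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        object.__setattr__(self, "_changed", True)

    def _p_note_change(self):
        object.__setattr__(self, "_changed", True)

obj = ToyPersistent()
obj.data = {"A": 1}          # Attribute write: the hook fires.
assert obj._changed

object.__setattr__(obj, "_changed", False)   # Pretend we just committed.
obj.data["A"] = 99           # In-place mutation: no attribute write occurs,
assert not obj._changed      # so the change goes unnoticed...
obj._p_note_change()         # ...until we report it by hand.
assert obj._changed
```

Real Persistent objects work the same way: rebinding an attribute is
noticed, but mutating a plain container held *inside* an attribute is not.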
If you modify an attribute of a nonpersistent instance, or
add/modify/delete a value in a nonpersistent list or dict, you must call
``._p_note_change()`` on the persistent parent or Durus may not save the
change.

With dict-like collections there are two levels of persistency, the dict
itself and its values.  You can have a:

- Nonpersistent dict
- PersistentDict
- BTree with nonpersistent values
- BTree with persistent values

The nonpersistent dict and PersistentDict have the same (dis)advantages as
any (non)persistent object.  They load the values all at once.  A BTree
loads the values in small groups, which saves memory if you access only a
few values at a time or iterate the values, though it may slow the
application due to the additional loads.

Between a BTree of nonpersistent values and a BTree of persistent values,
the choice depends on the size of the values, the redundancy between them,
and whether you frequently iterate them or access only a few disparate
ones.  Every pickle is compressed, meaning that duplicate strings between
values will compress better if they are in the same pickle.  There is also
some overhead in both size and load speed for every pickle.  On the other
hand, why unpickle 16 large values when you'll only access one of them?

If memory, speed, or file size becomes critical for an application, try
replacing some persistent objects with nonpersistent ones or vice-versa.
If the objects are large or there are many of them, it may make a
measurable difference.  Python's .__slots__ feature can also cut memory
consumption significantly if you have thousands of the same type of object
in memory at once.  (An object using .__slots__ must be nonpersistent in
the current version of Durus.)

Module relationship
===================

Durus consists of the following modules:

btree
    The **BTree** class is like a dict except its values are loaded in
    chunks of 16 (**BNode**'s) rather than all at once.
    Other chunk sizes are available in powers of 2, from **BNode2**,
    **BNode4**, **BNode8**, ... to **BNode512**.  BTree keys are ordered,
    so they iterate alphabetically/numerically.

client
    Implementation of the "durus -c" option in the command-line tool.  It
    starts an interactive Python session with a Durus database open.

client_storage
    **ClientStorage** is a Storage subclass that connects to a remote
    Durus server.

connection
    The **Connection** class is the mediator between your Python code and
    a Storage object.  Its superclass ConnectionBase is defined in the
    'persistent' module.  The constructor has two arguments:

    1. storage: a Storage subclass instance.
    2. cache_size=100000: the maximum number of Persistent objects to
       hold in the memory cache even if they aren't being used.

    See "Connection methods" below for the interesting methods you can
    call.  **Cache** is an internal class that manages the memory cache.

convert_file_storage
    This module is for backward compatibility.  It contains functions to
    convert between older and newer FileStorage formats.

durus
    The top-level script for the "durus" command-line tool.  If you
    install Durus the old way ("python setup.py install"), it will be the
    actual bin/ script.  If you install Durus as an egg ("easy_install
    durus"), it will be located in
    "Durus-VERSION-PYTHON_VERSION-ARCH.egg/EGG-INFO/scripts/durus", and a
    stub calling it will be created for the bin/ script.

error
    Exceptions.

file_storage
    **FileStorage** is the most commonly-used Storage subclass.  It
    manages a database file located on disk.  The constructor has three
    arguments:

    1. filename: the database to open or create.  Default is ``None``,
       meaning it will create a temporary file.
    2. readonly: true to open the database read-only.  This activates
       certain optimizations and allows other callers to open the
       database read-only simultaneously.  Default ``False``.
    3. repair: true to repair the database if it's corrupt.
    Repairing usually involves truncating the file to eliminate
       incomplete transactions.  This just sets an instance attribute; I
       don't see how the actual repairing is done.  Default ``False``.

    **FileStorage1** and **FileStorage2** are concrete subclasses that
    define the structure of the physical file.  **TempFileStorage**
    handles the case of a temporary file.  Do not instantiate these
    classes directly: FileStorage will choose the appropriate one and
    switch your instance to it.  The FileStorage2 docstring explains the
    layout of the pickles, indexes, and transactions in the database
    file.

history
    **HistoryConnection** is a special read-only connection that lets you
    time-travel to previous versions of the database's state.  You can
    step through previous/next transactions, or jump to the nearest
    transaction in which a certain persistent object was modified.
    Internal classes: **HistoryFileStorage** and **_HistoryIndex**.
    (Note: packing a database erases all previous history.)

logger
    Front end to Python's logging facility.

pack_storage
    Implementation of the pack option in the command-line tool.  It packs
    a database.

persistent
    **Persistent** is the base class of persistent objects.  A persistent
    object is stored in a separate pickle and loaded into memory only
    when accessed.  See "When to use Persistent objects" above.  Every
    persistent object has a unique object ID (OID), which is an int or
    long.  Persistent inherits from **PersistentBase**, which contains
    __slots__ for required attributes, both for speed and to avoid having
    them clutter your object's __dict__ (they shouldn't be stored there
    anyway).  There's also a **ComputedAttribute** class, which is
    explained below.

persistent_dict, persistent_list, persistent_set
    **PersistentDict**, **PersistentList**, and **PersistentSet** act
    like the corresponding Python types but are persistent.  See "When to
    use Persistent objects" above.

run_durus
    Implementation of the "durus -s" option in the command-line tool.
    It runs a daemon that serves a database to multiple local and/or
    remote clients simultaneously, listening via a TCP socket or a Unix
    domain socket.  This is the only way for multiple clients to open a
    database for writing.  (If all clients are read-only, they should use
    FileStorage directly for efficiency.)

serialize
    Internal module, not meant to be used directly.  **ObjectReader** and
    **ObjectWriter** transfer a Persistent object between memory and the
    database.

storage
    **Storage** is the abstract base class for FileStorage et al.
    **MemoryStorage** is a concrete class that saves the database in
    memory.  The data is lost when the database goes out of scope, so
    it's useful only for testing and to demonstrate how easy it is to
    write a Storage backend.

storage_server
    **StorageServer** is used by run_durus.  Contains classes to
    encapsulate a host/port address and a Unix domain socket address.

utils
    Internal functions.  The current ones pack OID numbers into
    C-language byte strings.

Connection methods
==================

Only methods useful in typical applications are listed.

abort()
    Discard (roll back) all changes to the data since the last commit.

commit()
    Save all changes permanently.  If you modify the data without calling
    .commit(), the changes will be lost.  Also, others won't see your
    changes until you commit.

get_storage()
    Return the actual Storage object.

get_root()
    Return the root data object.  Almost every application needs to call
    this.

get_crawler(start_oid=ROOT_OID, batch_size=100)
    Iterate all persistent objects in an efficient manner.  'batch_size'
    is a hint to the Storage.

pack()
    Shrink the database file by removing old versions of the data --
    anything not necessary to represent the current state.  Calls
    .abort() implicitly.

There is also a useful **function** in the ``connection`` module:

gen_every_instance(conn, \*classes)
    Iterate every persistent object that's an instance of any of the
    specified classes.
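The commit/abort contract above can be mimicked with a toy in-memory
stand-in (illustrative only, not Durus code): changes to the root survive
``commit()`` and are rolled back by ``abort()``.

```python
import copy

class ToyConnection:
    """Minimal commit/abort semantics over an in-memory root dict."""
    def __init__(self):
        self._saved = {}                          # last committed state
        self._root = copy.deepcopy(self._saved)   # working copy

    def get_root(self):
        return self._root

    def commit(self):
        # Snapshot the working copy as the new committed state.
        self._saved = copy.deepcopy(self._root)

    def abort(self):
        # Restore the committed state in place, so callers holding a
        # reference to the root keep seeing the rolled-back data.
        self._root.clear()
        self._root.update(copy.deepcopy(self._saved))

conn = ToyConnection()
root = conn.get_root()
root["kept"] = 1
conn.commit()                 # "kept" is now permanent.
root["lost"] = 2
conn.abort()                  # Roll back to the last commit.
assert root == {"kept": 1}
```

The real Connection does the same bookkeeping against a Storage object and
pickles, but the visible contract is this one.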
ComputedAttribute
=================

ComputedAttribute is a way to have an attribute in a persistent object
whose value is calculated from other persistent objects, then cached but
not stored.  It's a little cumbersome to use, and really useful only if
you have multiple clients sharing a read-write storage.  It works
something like this::

    from durus.persistent import ComputedAttribute, Persistent

    def get_attrib_value():
        return "SOME_VALUE_CALCULATED_FROM_OTHER_PERSISTENT_OBJECTS"

    class MyClass(Persistent):
        def __init__(self):
            self.my_attrib = ComputedAttribute()

    obj = MyClass()
    print(obj.my_attrib.get(get_attrib_value))  # Calls the function.
    print(obj.my_attrib.get(get_attrib_value))  # Uses the cached value.
    print(obj.my_attrib.value)                  # Uses the cached value only.
    obj.my_attrib.invalidate()
    # If we accessed obj.my_attrib.value now we'd get AttributeError.
    print(obj.my_attrib.get(get_attrib_value))
    print(obj.my_attrib.value)

For simpler cases you're probably better off using a Python property.

Database file format
====================

The database file format is described in the file_storage.FileStorage2
docstring.  It essentially looks like this:

Database file
    header + transactions + index + transactions

Transaction
    List of persistent objects: all objects added/changed since the last
    transaction.  (Deleted objects are simply not referenced anymore.)

Object
    OID + pickled class + pickled state + list of referenced OIDs

    - OID is a positive int packed in a "C" byte string
    - Class is the object's .__class__
    - State is the pickled .__dict__
    - The list tells which persistent objects are contained in this one

Index
    Dict of oid : byte offset of object in file

The history connection reads all transactions and reconstructs an index
for every transaction.

.. [#pickle] Pickle is a way to store Python objects on disk.  See the
   ``pickle`` module in the Python Library Reference for more info.
.. [#state] When a persistent object's parent is loaded, the child object
   is in GHOST state, with only a tiny stub in memory.  Accessing an
   instance attribute loads the .__dict__ from disk and changes the state
   to SAVED.  Modifying an attribute changes it to UNSAVED.  Committing
   the transaction writes a new pickle and changes the state to SAVED.
   When the cache manager needs memory, it discards some unused
   .__dict__'s and changes the state to GHOST.