==============
Durus Overview
==============

:Author: Mike Orr
:Date: 2006-10-18

Durus is a simple object database for Python.  This analysis covers
Durus 3.5.

Database theory
===============

A Durus database behaves like a Python dictionary.  You can store any
picklable value in it: strings, numbers, dicts, lists, class instances,
etc. [#pickle]_  This makes it much easier to work with than SQL databases,
with their peculiar language, rigid data structures, and foreign keys, and
the performance is often the same if not better.  (Durus is recommended
only for "read often, write occasionally" databases of less than a million
records.)  Here's a typical interactive use::

    1  from durus.connection import Connection
    2  from durus.file_storage import FileStorage
    3  from durus.persistent import Persistent
    4  from durus.persistent_dict import PersistentDict
    5  class MyClass(Persistent):
    6      ...
    7  storage = FileStorage("test.durus")
    8  conn = Connection(storage)
    9  root = conn.get_root()
    10 root["nonpersist_dict"] = {"A": 1, "B": 2}
    11 root["persist_dict"] = PersistentDict({"C": 3, "D": 4})
    12 root["persist_obj"] = MyClass()
    13 conn.commit()

All database management is via connection methods (lines 9 & 13); all data
access is via the root object (lines 10-12).  Durus is transactional, so
you must commit your changes (line 13) in order to save them.  (If you
decide not to make the changes after all, call ``conn.abort()`` to undo
them -- if you haven't committed them yet.)

Durus makes a distinction between "persistent" objects (anything that
inherits from Persistent) and "nonpersistent" objects (anything else).
The value in line 10 is an ordinary Python dict (nonpersistent).  This
means it's stored in its parent's pickle and loaded into memory along with
it.  The value in line 11 is a PersistentDict, which behaves the same but
is stored in a separate pickle and loaded only when one of its instance
attributes is accessed.
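The pickling Durus relies on can be tried directly with the standard
library.  This stand-alone sketch uses plain ``pickle`` (no Durus) to show
the round trip every stored value goes through:

```python
import pickle

# Any picklable value -- here a dict mixing strings, numbers, and a list.
record = {"A": 1, "B": [2, 3], "name": "example"}

# Serialize to a byte string; Durus stores such pickles in its file.
data = pickle.dumps(record)

# Deserialize: the copy is equal to, but independent of, the original.
copy = pickle.loads(data)
assert copy == record
assert copy is not record
```

Anything ``pickle.dumps()`` rejects (an open file, a socket) cannot be
stored in the database either.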
Persistent is often preferable but not always; see "When to use Persistent
objects" below for the tradeoffs.

A persistent object has three states: SAVED, UNSAVED, and GHOST.  State
refers to the relationship between the object's .__dict__ in memory (the
cached value) and the pickle on disk.  SAVED means the two are identical.
UNSAVED means the cache contains unsaved changes.  GHOST means the cache
is empty (the memory has been released): the state will be loaded from
disk at the next access.  Every object is initially in GHOST
state. [#state]_

Every persistent object has a unique object ID ("OID").  When a
transaction is committed, a new pickle is appended for every modified
persistent object, and an OID-to-pickle index is written.  Old pickles and
indexes remain in the file but are unused, though they can be accessed by
"time travelling".  Running the "pack" operation deletes all old pickles
and indexes and releases the disk space they occupied.  Durus is thus a
typical "append-only" database.  This gives it excellent performance,
reliability, and recoverability, but it also means the file size can
become large if objects are frequently modified, such as by appending to a
list or changing a dict value.

When to use Persistent objects
==============================

*These are just the opinions of one user; they may not take into account
all situations.*

Persistent is great for large objects that are used less often than their
parent, because they can be ghosted separately, saving memory.  Persistent
is also good when an object is modified more frequently than its parent,
because a smaller pickle is written.  However, in both cases a persistent
object will be loaded more frequently than a nonpersistent one, which may
slow the system down.  A persistent object also takes more disk space than
a nonpersistent one.  However, a mutable nonpersistent object has the
disadvantage that *Durus won't notice if it's modified*.
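The problem can be demonstrated without Durus.  Here is a toy stand-in
(illustrative only, not Durus code) that, like Persistent, flags itself
changed from ``__setattr__``; mutating a nested plain dict in place never
triggers that hook:

```python
class ToyPersistent:
    """Toy stand-in for Persistent: flags itself dirty on attribute writes."""
    def __init__(self):
        object.__setattr__(self, "_changed", False)
        object.__setattr__(self, "data", {})

    def __setattr__(self, name, value):
        object.__setattr__(self, name, value)
        object.__setattr__(self, "_changed", True)

    def _p_note_change(self):
        object.__setattr__(self, "_changed", True)

obj = ToyPersistent()
obj.data = {"A": 1}          # Attribute write: the hook fires.
assert obj._changed

object.__setattr__(obj, "_changed", False)   # Pretend we just committed.
obj.data["A"] = 99           # In-place mutation: no attribute write occurs,
assert not obj._changed      # so the change goes unnoticed...
obj._p_note_change()         # ...until we report it by hand.
assert obj._changed
```

Real Persistent objects work the same way: rebinding an attribute is
noticed, but mutating a plain container held *inside* an attribute is not.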
If you modify an attribute of a nonpersistent instance, or
add/modify/delete a value in a nonpersistent list or dict, you must call
``._p_note_change()`` on the persistent parent or Durus may not save the
change.

With dict-like collections there are two levels of persistency, the dict
itself and its values.  You can have a:

- Nonpersistent dict
- PersistentDict
- BTree with nonpersistent values
- BTree with persistent values

The nonpersistent dict and PersistentDict have the same (dis)advantages as
any (non)persistent object.  They load the values all at once.  A BTree
loads the values in small groups, which saves memory if you access only a
few values at a time or iterate the values, though it may slow the
application due to the additional loads.

Between a BTree of nonpersistent values and a BTree of persistent values,
the choice depends on the size of the values, the redundancy between them,
and whether you frequently iterate them or access only a few disparate
ones.  Every pickle is compressed, meaning that duplicate strings between
values will compress better if they are in the same pickle.  There is also
some overhead in both size and load speed for every pickle.  On the other
hand, why unpickle 16 large values when you'll only access one of them?

If memory, speed, or file size becomes critical for an application, try
replacing some persistent objects with nonpersistent ones or vice-versa.
If the objects are large or there are many of them, it may make a
measurable difference.  Python's .__slots__ feature can also cut memory
consumption significantly if you have thousands of the same type of object
in memory at once.  (An object using .__slots__ must be nonpersistent in
the current version of Durus.)

Module relationship
===================

Durus consists of the following modules:

btree
    The **BTree** class is like a dict except its values are loaded in
    chunks of 16 (**BNode**'s) rather than all at once.
    Other chunk sizes are available in powers of 2, from **BNode2**,
    **BNode4**, **BNode8**, ... to **BNode512**.  BTree keys are ordered,
    so they iterate alphabetically/numerically.

client
    Implementation of the "durus -c" option in the command-line tool.  It
    starts an interactive Python session with a Durus database open.

client_storage
    **ClientStorage** is a Storage subclass that connects to a remote
    Durus server.

connection
    The **Connection** class is the mediator between your Python code and
    a Storage object.  Its superclass ConnectionBase is defined in the
    'persistent' module.  The constructor has two arguments:

    1. storage: a Storage subclass instance.
    2. cache_size=100000: the maximum number of Persistent objects to
       hold in the memory cache even if they aren't being used.

    See "Connection methods" below for the interesting methods you can
    call.  **Cache** is an internal class that manages the memory cache.

convert_file_storage
    This module is for backward compatibility.  It contains functions to
    convert between older and newer FileStorage formats.

durus
    The top-level script for the "durus" command-line tool.  If you
    install Durus the old way ("python setup.py install"), it will be the
    actual bin/ script.  If you install Durus as an egg ("easy_install
    durus"), it will be located in
    "Durus-VERSION-PYTHON_VERSION-ARCH.egg/EGG-INFO/scripts/durus", and a
    stub calling it will be created for the bin/ script.

error
    Exceptions.

file_storage
    **FileStorage** is the most commonly-used Storage subclass.  It
    manages a database file located on disk.  The constructor has three
    arguments:

    1. filename: the database to open or create.  Default is ``None``,
       meaning it will create a temporary file.
    2. readonly: true to open the database read-only.  This activates
       certain optimizations and allows other callers to open the
       database read-only simultaneously.  Default ``False``.
    3. repair: true to repair the database if it's corrupt.
    Repairing usually involves truncating the file to eliminate
       incomplete transactions.  This just sets an instance attribute; I
       don't see how the actual repairing is done.  Default ``False``.

    **FileStorage1** and **FileStorage2** are concrete subclasses that
    define the structure of the physical file.  **TempFileStorage**
    handles the case of a temporary file.  Do not instantiate these
    classes directly: FileStorage will choose the appropriate one and
    switch your instance to it.  The FileStorage2 docstring explains the
    layout of the pickles, indexes, and transactions in the database
    file.

history
    **HistoryConnection** is a special read-only connection that lets you
    time-travel to previous versions of the database's state.  You can
    step through previous/next transactions, or jump to the nearest
    transaction in which a certain persistent object was modified.
    Internal classes: **HistoryFileStorage** and **_HistoryIndex**.
    (Note: packing a database erases all previous history.)

logger
    Front end to Python's logging facility.

pack_storage
    Implementation of the pack option in the command-line tool.  It packs
    a database.

persistent
    **Persistent** is the base class of persistent objects.  A persistent
    object is stored in a separate pickle and loaded into memory only
    when accessed.  See "When to use Persistent objects" above.  Every
    persistent object has a unique object ID (OID), which is an int or
    long.  Persistent inherits from **PersistentBase**, which contains
    __slots__ for required attributes, both for speed and to avoid having
    them clutter your object's __dict__ (they shouldn't be stored there
    anyway).  There's also a **ComputedAttribute** class, which is
    explained below.

persistent_dict, persistent_list, persistent_set
    **PersistentDict**, **PersistentList**, and **PersistentSet** act
    like the corresponding Python types but are persistent.  See "When to
    use Persistent objects" above.

run_durus
    Implementation of the "durus -s" option in the command-line tool.
    It runs a daemon that serves a database to multiple local and/or
    remote clients simultaneously, listening via a TCP socket or a Unix
    domain socket.  This is the only way for multiple clients to open a
    database for writing.  (If all clients are read-only, they should use
    FileStorage directly for efficiency.)

serialize
    Internal module, not meant to be used directly.  **ObjectReader** and
    **ObjectWriter** transfer a Persistent object between memory and the
    database.

storage
    **Storage** is the abstract base class for FileStorage et al.
    **MemoryStorage** is a concrete class that saves the database in
    memory.  The data is lost when the database goes out of scope, so
    it's useful only for testing and to demonstrate how easy it is to
    write a Storage backend.

storage_server
    **StorageServer** is used by run_durus.  Contains classes to
    encapsulate a host/port address and a Unix domain socket address.

utils
    Internal functions.  The current ones pack OID numbers into
    C-language byte strings.

Connection methods
==================

Only methods useful in typical applications are listed.

abort()
    Discard (roll back) all changes to the data since the last commit.

commit()
    Save all changes permanently.  If you modify the data without calling
    .commit(), the changes will be lost.  Also, others won't see your
    changes until you commit.

get_storage()
    Return the actual Storage object.

get_root()
    Return the root data object.  Almost every application needs to call
    this.

get_crawler(start_oid=ROOT_OID, batch_size=100)
    Iterate all persistent objects in an efficient manner.  'batch_size'
    is a hint to the Storage.

pack()
    Shrink the database file by removing old versions of the data --
    anything not necessary to represent the current state.  Calls
    .abort() implicitly.

There is also a useful **function** in the ``connection`` module:

gen_every_instance(conn, \*classes)
    Iterate every persistent object that's an instance of any of the
    specified classes.
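The commit/abort contract above can be mimicked with a toy in-memory
stand-in (illustrative only, not Durus code): changes to the root survive
``commit()`` and are rolled back by ``abort()``.

```python
import copy

class ToyConnection:
    """Minimal commit/abort semantics over an in-memory root dict."""
    def __init__(self):
        self._saved = {}                          # last committed state
        self._root = copy.deepcopy(self._saved)   # working copy

    def get_root(self):
        return self._root

    def commit(self):
        # Snapshot the working copy as the new committed state.
        self._saved = copy.deepcopy(self._root)

    def abort(self):
        # Restore the committed state in place, so callers holding a
        # reference to the root keep seeing the rolled-back data.
        self._root.clear()
        self._root.update(copy.deepcopy(self._saved))

conn = ToyConnection()
root = conn.get_root()
root["kept"] = 1
conn.commit()                 # "kept" is now permanent.
root["lost"] = 2
conn.abort()                  # Roll back to the last commit.
assert root == {"kept": 1}
```

The real Connection does the same bookkeeping against a Storage object and
pickles, but the visible contract is this one.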
ComputedAttribute
=================

ComputedAttribute is a way to have an attribute in a persistent object
whose value is calculated from other persistent objects, then cached but
not stored.  It's a little cumbersome to use, and really useful only if
you have multiple clients sharing a read-write storage.  It works
something like this::

    from durus.persistent import ComputedAttribute, Persistent

    def get_attrib_value():
        return "SOME_VALUE_CALCULATED_FROM_OTHER_PERSISTENT_OBJECTS"

    class MyClass(Persistent):
        def __init__(self):
            self.my_attrib = ComputedAttribute()

    obj = MyClass()
    print(obj.my_attrib.get(get_attrib_value))  # Calls the function.
    print(obj.my_attrib.get(get_attrib_value))  # Uses the cached value.
    print(obj.my_attrib.value)                  # Uses the cached value only.
    obj.my_attrib.invalidate()
    # If we accessed obj.my_attrib.value now we'd get AttributeError.
    print(obj.my_attrib.get(get_attrib_value))
    print(obj.my_attrib.value)

For simpler cases you're probably better off using a Python property.

Database file format
====================

The database file format is described in the file_storage.FileStorage2
docstring.  It essentially looks like this:

Database file
    header + transactions + index + transactions

Transaction
    List of persistent objects: all objects added/changed since the last
    transaction.  (Deleted objects are simply not referenced anymore.)

Object
    OID + pickled class + pickled state + list of referenced OIDs

    - OID is a positive int packed in a "C" byte string
    - Class is the object's .__class__
    - State is the pickled .__dict__
    - The list tells which persistent objects are contained in this one

Index
    Dict of oid : byte offset of object in file

The history connection reads all transactions and reconstructs an index
for every transaction.

.. [#pickle] Pickle is a way to store Python objects on disk.  See the
   ``pickle`` module in the Python Library Reference for more info.
.. [#state] When a persistent object's parent is loaded, the child object
   is in GHOST state, with only a tiny stub in memory.  Accessing an
   instance attribute loads the .__dict__ from disk and changes the state
   to SAVED.  Modifying an attribute changes it to UNSAVED.  Committing
   the transaction writes a new pickle and changes the state to SAVED.
   When the cache manager needs memory, it discards some unused
   .__dict__'s and changes the state to GHOST.