Re: A newcomer and BerkeleyDB
Neil Schemenauer
2005-12-15
Hi Jesus,

A few comments.

Jesus Cea wrote:
>  While you are packing a database, access to objects is blocked.

Not true in version 3.

>  - BerkeleyDB doesn't need packing to remove outdated instances.
>
>  - With proper code, BerkeleyDB doesn't need packing to remove
> unreachable instances. For example, keeping a reference count in
> objects. If this problem were solved, you wouldn't need to pack the
> storage, ever.

With an object database, there must be some sort of garbage
collection in order to determine which objects are no longer
reachable.  It might be possible to make the gc more efficient when
BerkeleyDB is used, but there is no way to eliminate it.
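To make that concrete, here is a rough sketch of the refcount
bookkeeping such a storage would do at commit time (the helpers and
names are hypothetical, not Durus code):

    # Hypothetical sketch: keep a per-object reference count and
    # update it as each transaction commits.  stored_refs(oid) and
    # new_refs(oid) are assumed helpers returning the oids referenced
    # by the old and new versions of a record.

    def commit_refcounts(refcounts, changed_oids, stored_refs, new_refs):
        dead = []
        for oid in changed_oids:
            for ref in new_refs(oid):
                refcounts[ref] = refcounts.get(ref, 0) + 1
            for ref in stored_refs(oid):
                refcounts[ref] -= 1
                if refcounts[ref] == 0:
                    dead.append(ref)  # reclaimable now (deletes cascade)
        return dead

A count hitting zero lets you reclaim that record immediately, with
no pack, but objects that form a cycle never reach zero.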

For your application it sounds like reference counting would work
best.  However, other applications may have reference cycles within
the persistent set of objects.  If those applications have similar
space efficiency requirements then they probably need a non-copying
collector (e.g. mark and sweep).  I suspect some current gc research
would be applicable since a recent objective is to minimize virtual
memory paging.
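For example, a non-copying pack over the storage could be as simple
as this sketch (get_refs and delete are assumed storage helpers, not
actual Durus code):

    # Hypothetical mark-and-sweep pack: mark everything reachable
    # from the root, then delete unmarked records in place.  Unlike
    # reference counting, this also reclaims cycles.

    def mark_and_sweep(root_oid, all_oids, get_refs, delete):
        marked = set()
        stack = [root_oid]
        while stack:                        # mark phase
            oid = stack.pop()
            if oid in marked:
                continue
            marked.add(oid)
            stack.extend(get_refs(oid))
        for oid in all_oids:                # sweep phase
            if oid not in marked:
                delete(oid)

Deleting in place instead of copying live records to a new file is
what keeps the space overhead small.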

>  - BerkeleyDB is not affected by write ratio.

I find that hard to believe.  In any case, a Durus client will be
affected by a high write rate.  If the write rate is high enough
then it's probably more efficient to have a database that supports
locking.
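Durus uses optimistic concurrency: a writer that loses a race must
abort and redo its work.  Every client ends up running a retry loop
something like this (Durus 3 API, roughly from memory), and at a
high write rate the retries are pure waste:

    from durus.client_storage import ClientStorage
    from durus.connection import Connection
    from durus.error import ConflictError

    connection = Connection(ClientStorage(port=2972))

    def update(apply_change):
        # Retry until the change commits without a conflicting writer.
        while True:
            try:
                apply_change(connection.get_root())
                connection.commit()
                return
            except ConflictError:
                connection.abort()  # drop invalidated state, retry

A locking database avoids the redone work by blocking writers up
front instead.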

> Moreover, BerkeleyDB has several advantages:
>
> * BerkeleyDB is transactional, so you can guarantee no corruption (except
> hard disk failure, of course). I know that the append-only file that
> Durus currently uses is resilient to most failures too.

Durus filestorage is transactional and no less safe than BerkeleyDB.
IMHO, it's safer, because the storage format is so simple that
you have a chance of recovering the database even when things go
horribly wrong (e.g. bad memory that causes random scribbling over
the file).
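To illustrate with a made-up layout (not the real filestorage
format): if records were just length-prefixed blobs in an
append-only file, a recovery tool would only have to scan forward
and truncate at the first record that doesn't check out:

    import struct

    # Illustration only: assumes a simplified append-only layout of
    # length-prefixed records ("<I" byte count, then the bytes).
    # The real filestorage format differs, but the principle holds.

    def recover(path):
        good = []
        with open(path, 'rb') as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break                   # clean end of file
                (size,) = struct.unpack('<I', header)
                record = f.read(size)
                if len(record) < size:
                    break                   # truncated tail, stop here
                good.append(record)
        return good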

> * By backing up the log files, BerkeleyDB allows you to make hot and
> incremental backups. Yes, I know that you can use rsync with current
> Durus to incrementally back up the file.

Again, no advantage to BerkeleyDB, AFAIK.  Durus does not include a
tool for incremental backups, but writing one would be trivial.
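For an append-only file it amounts to "copy the bytes past where the
last backup stopped", something like:

    import os

    # Sketch of an incremental backup for an append-only storage
    # file: append to the backup whatever lies beyond its current
    # size.  (After a pack rewrites the file, start a fresh backup.)

    def incremental_backup(src, dst):
        start = os.path.getsize(dst) if os.path.exists(dst) else 0
        with open(src, 'rb') as s, open(dst, 'ab') as d:
            s.seek(start)
            while True:
                chunk = s.read(1 << 16)
                if not chunk:
                    break
                d.write(chunk)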

> 500 GB of hard disk :-p. I'm planning to migrate my mail system
> (currently about 300 GB of user mailboxes, stored under an UGLY mbox
> format) to Durus, but I need better write behaviour, since mailbox
> read/write is 1:1. I was planning to migrate the mboxes to BerkeleyDB
> but developing a new backend for Durus could be more general, more
> useful to other people, more "pythonic", and nearly equal cost in
> development time.

Sounds like a fun project.  You might also look into
DirectoryStorage as an alternative to BerkeleyDB.  In any case, the
storage backend will not be the only limitation you run into; 500 GB
is far more data than Durus is designed to handle.  Durus was
designed to be simple and we sacrificed scalability to achieve
that.  I see no reason why a Python object database couldn't handle
that much data but you are going to have to make some different
design tradeoffs.

Good luck with your project.

  Neil
