[Durus-users] Re: OODB vs SQL

* Patrick K. O'Brien wrote [2005-10-11 16:58:52 -0500]:
> Michael Watkins wrote:
> >
> > Perhaps I should resurrect the stock data experiment first; I've kept the
raw
> > data and experiment around; I guess its time to dust it off again...
>
> If you'd be willing to share the data, I'd be willing to populate a
> Schevo database with it and compare performance against your
> hand-crafted Durus database.  You up for the challenge?  :-)

Ooh, a challenge. Let me have a look... the compressed sql dumps are 200+MB;
am restoring to Postgres now to have a look at them.

.... somewhat later ...

OK, I've created a sample data set for you - see further on in this note.

.... while I waited for data ...

In this example, each Symbol has a List of quote records that will be
approximately 240 days * the number of years the stock has been listed. Assuming
the average symbol has 8 years of data = 1960 some odd records (rings a bell
for me - I believe I grabbed all the data Yahoo had at the time from 1998
on-ward) times 10,000 symbols = 19.6 million quote record instances, approx.

Here's a question:

I assume that over time the "quote" records will be all over the durus file and
that alone will hurt performance in large files. Is that a reasonable
assumption? I've always assumed that databases employ tricks like keeping
related sets of data physically closer together on disk such that common
queries can benefit from the head traveling over the same space.

Ah, the restore is done. Before doing any complex scans that might load data up
in to cache, I did a simple query:

qs=# select id, symbol from symbol where symbol = 'IBM';
  id  | symbol
------+--------
 1742 | IBM
(1 row)

qs=# select count(*) from quote where symbol_id = 1742;
 count
-------
  2517

Near instantaneous response.

To be fair,

qs=# select count(*) from quote;

Takes a number of seconds to process. I guess it doesn't rely upon counters in
its indexes either!

If you are game to build a schevo example, here's what I've done - created two
tar.bz2 files, located at:

http://mikewatkins.net/files/quotes.tar.bz2  115,373,745 Oct 11 16:43
http://mikewatkins.net/files/symbols.tar.bz2      89,498 Oct 11 16:53

Once you've downloaded the files, or if you do not have time to, please let me
know and I'll move the files so that random list-scanning bots and other
downloaders don't use up my bank account and bandwidth downloading the sucker.

It looks to be taking about 10 minutes at 230k/second to get from the server in
New Jersey to the wet west coast here in Vancouver. In either case, once folks
have downloaded the files please send me a note so I can move them off-line.

symbols.tar.bz2 contains a file containing 8088 (how odd) symbols, in a tab
delimited format that looks like:

id              symbol  description
171     GG      Goldcorp Inc
172     MBD     MBIA Inc
173     MYK     Mykrolis Corp
179     ABMD    Abiomed Inc
180     ALCD    Alcide Corporation
181     BNCC    Bnccorp Inc
182     CARSP   Capital Automotive Reit
183     FFDB    Firstfed Bancorp
184     FTDI    Ftd, Inc.

quotes.tar.bz2 contains a file containing 13,353,602 quote records (ok, so I
was off in my guesstimate) that looks like:

symbol_id date                  open    high    low             close   volume
6086    1994-02-23      2.060   2.060   1.880   1.880   41500
6086    1994-02-24      1.880   1.970   1.880   1.880   9800
6086    1994-02-25      1.880   1.970   1.880   1.880   25200
6086    1994-02-28      1.940   1.940   1.880   1.880   12000
6086    1994-03-01      1.880   1.940   1.810   1.880   21600
6086    1994-03-02      1.880   1.880   1.810   1.810   14600
6086    1994-03-03      1.810   1.940   1.810   1.880   5100
6086    1994-03-04      1.940   1.940   1.810   1.910   48300
6086    1994-03-07      1.810   1.880   1.810   1.810   13600
6086    1994-03-08      1.810   1.880   1.690   1.810   25200

So there you go... some real live data, a big swack of it. Perhaps too
simplistic an example to build a test case on but hey, simple is good
sometimes.