* Patrick K. O'Brien wrote [2005-10-11 16:58:52 -0500]: > Michael Watkins wrote: > > > > Perhaps I should resurrect the stock data experiment first; I've kept the raw > > data and experiment around; I guess its time to dust it off again... > > If you'd be willing to share the data, I'd be willing to populate a > Schevo database with it and compare performance against your > hand-crafted Durus database. You up for the challenge? :-) Ooh, a challenge. Let me have a look... the compressed sql dumps are 200+MB; am restoring to Postgres now to have a look at them. .... somewhat later ... OK, I've created a sample data set for you - see further on in this note. .... while I waited for data ... In this example, each Symbol has a List of quote records that will be approximately 240 days * the number of years the stock has been listed. Assuming the average symbol has 8 years of data = 1960 some odd records (rings a bell for me - I believe I grabbed all the data Yahoo had at the time from 1998 on-ward) times 10,000 symbols = 19.6 million quote record instances, approx. Here's a question: I assume that over time the "quote" records will be all over the durus file and that alone will hurt performance in large files. Is that a reasonable assumption? I've always assumed that databases employ tricks like keeping related sets of data physically closer together on disk such that common queries can benefit from the head traveling over the same space. Ah, the restore is done. Before doing any complex scans that might load data up in to cache, I did a simple query: qs=# select id, symbol from symbol where symbol = 'IBM'; id | symbol ------+-------- 1742 | IBM (1 row) qs=# select count(*) from quote where symbol_id = 1742; count ------- 2517 Near instantaneous response. To be fair, qs=# select count(*) from quote; Takes a number of seconds to process. I guess it doesn't rely upon counters in its indexes either! If you are game to build a schevo example, here's what I've done - created two tar.bz2 files, located at: http://mikewatkins.net/files/quotes.tar.bz2 115,373,745 Oct 11 16:43 http://mikewatkins.net/files/symbols.tar.bz2 89,498 Oct 11 16:53 Once you've downloaded the files, or if you do not have time to, please let me know and I'll move the files so that random list-scanning bots and other downloaders don't use up my bank account and bandwidth downloading the sucker. It looks to be taking about 10 minutes at 230k/second to get from the server in New Jersey to the wet west coast here in Vancouver. In either case, once folks have downloaded the files please send me a note so I can move them off-line. symbols.tar.bz2 contains a file containing 8088 (how odd) symbols, in a tab delimited format that looks like: id symbol description 171 GG Goldcorp Inc 172 MBD MBIA Inc 173 MYK Mykrolis Corp 179 ABMD Abiomed Inc 180 ALCD Alcide Corporation 181 BNCC Bnccorp Inc 182 CARSP Capital Automotive Reit 183 FFDB Firstfed Bancorp 184 FTDI Ftd, Inc. quotes.tar.bz2 contains a file containing 13,353,602 quote records (ok, so I was off in my guesstimate) that looks like: symbol_id date open high low close volume 6086 1994-02-23 2.060 2.060 1.880 1.880 41500 6086 1994-02-24 1.880 1.970 1.880 1.880 9800 6086 1994-02-25 1.880 1.970 1.880 1.880 25200 6086 1994-02-28 1.940 1.940 1.880 1.880 12000 6086 1994-03-01 1.880 1.940 1.810 1.880 21600 6086 1994-03-02 1.880 1.880 1.810 1.810 14600 6086 1994-03-03 1.810 1.940 1.810 1.880 5100 6086 1994-03-04 1.940 1.940 1.810 1.910 48300 6086 1994-03-07 1.810 1.880 1.810 1.810 13600 6086 1994-03-08 1.810 1.880 1.690 1.810 25200 So there you go... some real live data, a big swack of it. Perhaps too simplistic an example to build a test case on but hey, simple is good sometimes.