Titus Brown wrote:

>-> My CMS project has a need for a deferred-request system, but I want it to
>-> scale and distribute across multiple processes/servers, so I'm going to
>-> try the tuple-space approach. It makes development a bit tougher -- it's
>-> hard (for me) to re-imagine a 'monolithic' program in terms of multiple
>-> processes running on multiple machines -- but I have hopes that it will
>-> work for me in the long run.
>
>Hi, Graham,
>
>in Cartwheel, a bioinformatics framework that's been running for several
>years, I use a PostgreSQL database as the hub of a system that distributes
>jobs across a Beowulf cluster. The underlying mechanism used is a tuple
>space.

Thanks very much for this, Titus. It's especially good to hear of other
real-world tuple-space examples.

>Jobs are submitted via several mechanisms: either a Web site, or Web services,
>or direct database access. The jobs then get picked up by nodes polling
>for available jobs. I use PostgreSQL to deal with transactions and database
>locking: certainly overkill, but quite effective nonetheless.

Yes. I expect to have both multiple producers and multiple consumers as
well, not all of which are known at this time.

I also persist the space; in the very short term, I'm using ZODB, since
I'm also using it in other parts of the project. But since tuple spaces
are rather write-intensive, ZODB is a poor choice for the long term: it's
just a stop-gap measure.

>Overall, the system works quite well, with only one exception: most of
>the jobs involve calling out to an external binary program (I need to run
>several closed source binaries, groan), and if that program uses up a lot
>of memory or otherwise dies badly, the job can die w/o any record. There
>are ways to control for that, but they all involve adding complexity;
>controlling and monitoring the remote processes seems to be one of the
>places where Linda tuple spaces restrict your options.

As I feared.
:-( I also will need calls out to remote binaries.

The first part of my nefarious plan was for each agent (worker/consumer)
to add a tuple representing itself (and its purpose) to the space, with a
short lease. (I'm implementing leases, similar to JavaSpaces' idea; a
tuple has a lifespan in the space, after which it "vanishes".) If an agent
hangs, it won't renew its lease in time, and will disappear from view. In
my implementation, I can count the number of tuples matching a given
pattern, so I could find out how many worker agents I have running, and
could detect if one or more went down. Part of each agent's tuple would be
a name representing the function it fulfils, so I should be able to find
out, for example, how many image-conversion agents I have running at any
given time.

But this doesn't solve the "lost job" problem. That's a bad one. I've
considered having the agent process, which communicates with the tuple
space, spawn a second process to do the actual work. If anything is going
to die, my reasoning goes, it will be the heavy-lifting process. If that
process doesn't return an OK to the communications process within an
expected time, then the comms process could push the job's tuple back into
the space, and perhaps kill itself. Essentially it's a "rollback on
timeout" in a long-running transaction. I don't have a whole lot of
heavily synchronized stuff going on, so this is about as transactional as
I would need to get.

The last idea I had was to do "real transactions" on the tuple-server
side. Takes are non-permanent until the taker commits the transaction
within a specified timeout period. For example, a worker takes a tuple,
and promises a commit within two minutes. If the worker fails, the server
itself returns the tuple to the space, and invalidates the transaction
(preventing the worker from writing any potentially incorrect
response-tuples).
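
For what it's worth, here is a rough sketch of how the lease and
server-side transaction ideas above might fit together. It is only an
illustration under assumptions: the `TupleSpace` class, its method names
(`write`, `count`, `take`, `commit`), the `None` wildcard, and the
injectable `clock` are all made up for this example and are not the actual
implementation. Lease renewal, persistence, and networking are left out.

```python
import itertools
import time


class TupleSpace:
    """Toy in-memory tuple space with leases and transactional takes."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so time can be faked in a demo
        self._tuples = []     # list of (tuple, lease expiry or None)
        self._pending = {}    # txn id -> (tuple, lease expiry, commit deadline)
        self._ids = itertools.count(1)

    @staticmethod
    def _matches(pattern, tup):
        # None in a pattern is a wildcard; other fields must be equal.
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup))

    def _sweep(self):
        now = self._clock()
        # Takes whose commit deadline has passed are rolled back.
        for txn, (tup, expiry, deadline) in list(self._pending.items()):
            if deadline <= now:
                del self._pending[txn]
                self._tuples.append((tup, expiry))
        # Tuples whose lease has run out "vanish".
        self._tuples = [(t, e) for t, e in self._tuples
                        if e is None or e > now]

    def write(self, tup, lease=None):
        self._sweep()
        expiry = None if lease is None else self._clock() + lease
        self._tuples.append((tup, expiry))

    def count(self, pattern):
        self._sweep()
        return sum(1 for t, _ in self._tuples if self._matches(pattern, t))

    def take(self, pattern, commit_within):
        """Tentatively remove a match; returns (txn, tuple) or None."""
        self._sweep()
        for i, (tup, expiry) in enumerate(self._tuples):
            if self._matches(pattern, tup):
                del self._tuples[i]
                txn = next(self._ids)
                deadline = self._clock() + commit_within
                self._pending[txn] = (tup, expiry, deadline)
                return txn, tup
        return None

    def commit(self, txn, results=()):
        """Make a take permanent and write results; False if it expired."""
        self._sweep()
        if self._pending.pop(txn, None) is None:
            return False      # too late: the job tuple is back in the space
        for tup in results:
            self.write(tup)
        return True


# Demo with a fake clock: a worker takes a job, then hangs.
now = [0.0]
space = TupleSpace(clock=lambda: now[0])
space.write(("agent", "image-conversion", "worker-1"), lease=30)
space.write(("job", "convert", "photo.png"))
txn, job = space.take(("job", None, None), commit_within=120)
now[0] = 121                                       # two-minute promise broken
print(space.count(("job", None, None)))            # 1: job was rolled back
print(space.commit(txn, [("done", "photo.png")]))  # False: late commit refused
print(space.count(("agent", None, None)))          # 0: agent lease ran out
```

The sweep runs lazily on every call rather than on a timer, which keeps
the sketch short; a real server would presumably expire leases and
transactions in the background.
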
I could see how such transactions might be a bad thing for systems whose
agents don't return their output to the tuple space itself -- sometimes,
work would get done, but the agent would hang at the end, so the system
would not think it's done -- but it should work for systems where the
space really acts as the "shared memory" of the system.

I'm a bit leery of my ideas, and that's why I'm sharing them with you;
feedback is most welcome! The whole tuple space idea is clean and elegant,
and my ideas are messy and roughshod. I feel like a Visigoth draping
bearskins in the temple of Pallas, to make it feel more like home...

>I've thought about using Pyro to do some inter-node communication, but
>I haven't had great luck with Pyro and am unwilling to add it in. You
>might consider taking a look at it, though; I suspect I just haven't put
>in the time to understand it properly.

I really would like to leave the door open for agents to be written in any
language (though certainly in Python for the near future), so Pyro might
be a step in the wrong direction for me. And I'm really taken with the
simplicity factor of tuple spaces. If I can write or find a more efficient
implementation down the road, it should be relatively easy to integrate
into my other work, since the tuple-space semantics (and API) should be
almost unchanged.

>Hope this helps ;). I'm very happy with the tuple space mechanism and
>I think it's a nice lightweight way to distribute jobs.

It helps a lot! Thanks again. I hope to post a reply a few months from
now, agreeing completely with your last statement. ;-)

-- Graham