durusmail: quixote-users: Re: Deferrer: helper class for spawning single use threads
Graham Fawcett
2004-03-04
Titus Brown wrote:

>-> My CMS project has a need for a deferred-request system, but I want it to
>-> scale and distribute across multiple processes/servers, so I'm going to
>-> try the tuple-space approach. It makes development a bit tougher -- it's
>-> hard (for me) to re-imagine a 'monolithic' program in terms of multiple
>-> processes running on multiple machines -- but I have hopes that it will
>-> work for me in the long run.
>
>Hi, Graham,
>
>in Cartwheel, a bioinformatics framework that's been running for several
>years, I use a PostgreSQL database as the hub of a system that distributes
>jobs across a Beowulf cluster.  The underlying mechanism used is a tuple
>space.

Thanks very much for this, Titus. It's especially good to hear of other
real-world tuple-space examples.

>Jobs are submitted via several mechanisms: either a Web site, or Web services,
>or direct database access.  The jobs then get picked up by nodes polling
>for available jobs.  I use PostgreSQL to deal with transactions and database
>locking: certainly overkill, but quite effective nonetheless.

Yes. I expect to have both multiple producers and multiple consumers as
well, not all of which are known at this time. I also persist the space;
in the very short term, I'm using ZODB, since I'm also using it in other
parts of the project. But since tuple spaces are rather write-intensive,
ZODB is a poor choice for the long term: it's just a stop-gap measure.

>Overall, the system works quite well, with only one exception: most of
>the jobs involve calling out to an external binary program (I need to run
>several closed source binaries, groan), and if that program uses up a lot
>of memory or otherwise dies badly, the job can die w/o any record.  There
>are ways to control for that, but they all involve adding complexity;
>controlling and monitoring the remote processes seems to be one of the
>places where Linda tuple spaces restrict your options.
As I feared. :-(   I will also need calls out to remote binaries.

The first part of my nefarious plan was for each agent (worker/consumer)
to add a tuple representing itself (and its purpose) to the space, with
a short lease. (I'm implementing leases, similar to JavaSpaces' idea; a
tuple has a lifespan in the space, after which it "vanishes".) If an
agent hangs, it won't renew its lease in time and will disappear from view.

In my implementation, I can count the number of tuples matching a given
pattern; so I could find out how many worker agents I have running, and
could detect if one or more went down. Part of each agent's tuple would
be a name representing the function it fulfils, so I should be able to
find out, for example, how many image-conversion agents I have running
at any given time.
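To make the lease and pattern-counting ideas concrete, here is a minimal in-memory sketch. The class and method names (TupleSpace, write, renew, count) are my own invention for illustration, not any real library's API; a real implementation would also need locking and persistence.

```python
import time

class TupleSpace:
    """In-memory tuple space sketch with JavaSpaces-style leases."""

    def __init__(self):
        self._entries = []  # list of (expiry_time, tuple) pairs

    def write(self, tup, lease=30.0):
        # A tuple "vanishes" from the space once its lease expires.
        self._entries.append((time.time() + lease, tup))

    def renew(self, tup, lease=30.0):
        # An agent must renew its own tuple before expiry to stay visible.
        self._entries = [(e, t) for e, t in self._entries if t != tup]
        self._entries.append((time.time() + lease, tup))

    def count(self, pattern):
        # None acts as a wildcard field; expired tuples are purged first.
        now = time.time()
        self._entries = [(e, t) for e, t in self._entries if e > now]
        return sum(1 for _, t in self._entries if self._matches(pattern, t))

    @staticmethod
    def _matches(pattern, tup):
        return (len(pattern) == len(tup) and
                all(p is None or p == v for p, v in zip(pattern, tup)))
```

With worker tuples shaped like ("worker", function_name, agent_id), a monitor could ask count(("worker", "image-conversion", None)) to see how many image-conversion agents are currently alive, and a hung agent simply drops out of the count when its lease lapses.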

But this doesn't solve the "lost job" problem. That's a bad one.

I've considered having the agent process, which communicates with the
tuple space, spawn a second process to do the actual work. If anything
is going to die, my reasoning goes, it will be the heavy-lifting
process. If that process doesn't return an OK to the communications
process within an expected time, then the comms process could push the
job's tuple back into the space, and perhaps kill itself. Essentially
it's a "rollback on timeout" in a long-running transaction. I don't have
a whole lot of heavily synchronized stuff going on, so this is about as
transactional as I would need to get.

The last idea I had was to do "real transactions" on the tuple-server
side. Takes are non-permanent until the taker commits the transaction
within a specified timeout period. For example, a worker takes a tuple,
and promises a commit within two minutes. If the worker fails, the
server itself returns the tuple to the space, and invalidates the
transaction (preventing the worker from writing any potentially
incorrect response-tuples). I could see how this might be a bad thing
for systems whose agents don't return their output to the tuple space
itself -- sometimes the work would get done, but the agent would hang at
the end, so the system would never see it as finished -- but it should
work for systems where the space really acts as the "shared memory" of
the system.
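Sketched out, the server-side transaction might look like this; all of the names (TransactionalSpace, take, commit) are hypothetical, and a real server would run the expiry sweep on a timer rather than lazily as here.

```python
import time
import uuid

class TransactionalSpace:
    """Server-side sketch: a take is provisional until committed."""

    def __init__(self):
        self._tuples = []
        self._pending = {}  # txn_id -> (deadline, taken_tuple)

    def take(self, tup, txn_timeout=120.0):
        # The worker promises to commit within txn_timeout seconds.
        self._expire()
        self._tuples.remove(tup)
        txn = uuid.uuid4().hex
        self._pending[txn] = (time.time() + txn_timeout, tup)
        return txn

    def commit(self, txn):
        self._expire()
        if txn not in self._pending:
            # Invalidated: the worker may not write response-tuples.
            raise ValueError("transaction expired; take was rolled back")
        del self._pending[txn]

    def _expire(self):
        # The server itself returns overdue tuples to the space.
        now = time.time()
        for txn, (deadline, tup) in list(self._pending.items()):
            if deadline <= now:
                self._tuples.append(tup)
                del self._pending[txn]
```

The key point is that recovery lives entirely in the server: a worker that dies or hangs needs no cooperation to have its job restored, and a late commit is refused so it cannot write stale results back into the space.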

I'm a bit leery of my ideas, and that's why I'm sharing them with you;
feedback is most welcome! The whole tuple space idea is clean and
elegant, and my ideas are messy and roughshod. I feel like a Visigoth
draping bearskins in the temple of Pallas, to make it feel more like home...

>I've thought about using Pyro to do some inter-node communication, but
>I haven't had great luck with Pyro and am unwilling to add it in.  You
>might consider taking a look at it, though; I suspect I just haven't put
>in the time to understand it properly.
I really would like to leave the door open for agents to be written in
any language (though certainly in Python for the near future), so Pyro
might be a step in the wrong direction for me. And I'm really taken with
the simplicity factor of tuple spaces. If I can write or find a more
efficient implementation down the road, it should be relatively easy to
integrate into my other work, since the tuple-space semantics (and API)
should be almost unchanged.

>Hope this helps ;).  I'm very happy with the tuple space mechanism and
>I think it's a nice lightweight way to distribute jobs.

It helps a lot! Thanks again. I hope to post a reply a few months from
now, agreeing completely with your last statement. ;-)

-- Graham


