On Jul 2, 2009, at 8:06 AM, Peter Wilkinson wrote:
> Hi David,
>
> Thanks for having a look and sorry for the slow response.
>
> Underlying a lot of my ongoing experimentation is growing databases,
> many GBs so far and continuing to grow. Getting foolproof, fast
> replication running against big databases is a priority. I currently
> use full rsyncs but have been trying to think of ways to be more
> efficient.
>
> One of my hats in my day job is as a sysadmin and we use rsync
> heavily on busy machines and I've grown to be wary of the storm of
> IO it can generate on lots of data. Getting rsync to not read all of
> the source and destination requires the append option that has been
> spoken about before which leads to the issue of ensuring that the
> destination file is a strict subset of the source one.
Agreed, and we are dealing with that by monitoring for changes in the
stat of the prepack file on the remote machine, and by using
--append-verify instead of --append, always or occasionally.
This does not seem to generate excessive traffic.
#!/usr/bin/env python
"""
Backup remote Durus database.
This assumes that you have configured your system so that
this user can ssh to the remote system without entering a
password.
"""
try:
    from subprocess import getoutput  # Python 3
except ImportError:
    from commands import getoutput  # Python 2
import sys, os

try:
    remote_host, remote_path, local_path = sys.argv[1:]
except ValueError:
    # Wrong number of arguments: show usage (argument names taken from
    # the unpacking above) and exit.
    print("%s remote_host remote_path local_path" % sys.argv[0])
    raise SystemExit

local_prepackstat_path = local_path + '.prepackstat'
local_prepackstat = ''
rsync_flag = '' # By default, just copy the whole thing.
if os.path.exists(local_prepackstat_path):
    local_prepackstat = open(local_prepackstat_path).read()
stat_cmd = "ssh %s stat %s.prepack" % (remote_host, remote_path)
#print(stat_cmd)
remote_prepackstat = getoutput(stat_cmd)
if local_prepackstat:
    # Append only if the stat of the remote prepack file is unchanged
    # since the last run.
    if local_prepackstat == remote_prepackstat:
        rsync_flag = '--append-verify' # --append if you need faster.
# Remember the current stat for the next run.
f = open(local_prepackstat_path, 'w')
f.write(remote_prepackstat)
f.close()
command = 'rsync %s --rsh=ssh %s:%s %s' % (
    rsync_flag, remote_host, remote_path, local_path)
#print(command)
os.system(command)
>
> This is where my experimentation started; how can we know that the
> destination is that strict subset? My first thought was to just make
> a unique header for each file and compare those but then realised
> that to append anything to the slave the data has to come from the
> same point in the master, which is hard to know after any
> interruption of the appending, e.g. when the slave server is getting an
> update and is offline for 10 minutes: from where in the master does
> the data get read?
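To make the "strict subset" condition concrete: what matters for appending
is that the destination is a byte-for-byte prefix of the source. Below is a
minimal sketch of such a check for two files on the same machine. The
function names digest_prefix and is_prefix_of are made up for illustration
and are not part of Durus or rsync, and SHA-256 is just one reasonable
choice of digest.

```python
import hashlib
import os


def digest_prefix(path, length, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the first `length` bytes of `path`.

    Hypothetical helper, for illustration only.
    """
    h = hashlib.sha256()
    remaining = length
    with open(path, 'rb') as f:
        while remaining > 0:
            chunk = f.read(min(chunk_size, remaining))
            if not chunk:
                break  # The file is shorter than `length`.
            h.update(chunk)
            remaining -= len(chunk)
    return h.hexdigest()


def is_prefix_of(dest_path, source_path):
    """True if dest's bytes equal the first len(dest) bytes of source."""
    dest_size = os.path.getsize(dest_path)
    if dest_size > os.path.getsize(source_path):
        return False  # The destination is longer, so it cannot be a prefix.
    return (digest_prefix(dest_path, dest_size) ==
            digest_prefix(source_path, dest_size))
```

Between two machines, the source digest would have to be computed on the
master so that only the digest crosses the network. Note that this still
reads the whole destination and the same number of bytes of the source,
so it does not avoid the IO concern by itself.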
>
> The two options for this issue, as I see it, are to run a full rsync
> for each replication run so that the slave state can be anything at
> all and it will be cleaned up or to keep track of some structure in
> the slave and master and compare where the slave is at with the
> master and therefore be able to append cleanly. I was very
> pleasantly surprised at how little change to the shelf code was
> needed to get enough structure to use.
>
> I hope that clears up where I'm coming from. If nothing else, I
> continue to learn a lot about Durus which is delightfully small for
> what it does, small enough for the concepts and code to fit in my
> head ;-)
I understand.