[Quixote-users] RFC: SCGI graceful restart

FWIW, I switched LWN over to the SCGI mode of operation (from mod_python) a
week or so ago.  I've been well pleased with the change; the performance
improvement is significant.  The 1.0b1 release works well.

One thing that has been missing, for me, is a "graceful restart"
mechanism.  I've not found a reliable way of doing the "kill and restart"
routine that does not return server errors for a second or two.  On LWN, a
second's downtime fails 5-20 requests, and that kind of bugs me.  It's not
the image we like to project... normally we may not know what we're talking
about, but at least we can say it reliably...

In thinking about this, it occurred to me that you don't really have to
kill the main SCGI server process - you just have to make it kill off its
children and start over.  As long as nothing of importance has been imported
into the dispatcher process, the new children will pick up any code
changes.  And the whole thing goes forward with a brief hiccup, and no
failed requests.
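
To make the idea concrete, here is a minimal standalone sketch of the
same pattern, written against current Python 3 on Unix rather than the
scgi code itself.  The Dispatcher class and demo() function are
hypothetical names made up for illustration, and a plain os.pipe()
stands in for the socketpair the real server uses.  The SIGHUP handler
only sets a flag; the parent then closes its pipes, reaps the old
children, and forks a fresh one.

```python
import os
import signal
import time


class Dispatcher:
    """Toy stand-in for the SCGI parent process (hypothetical, not scgi code)."""

    def __init__(self):
        self.children = {}    # { pid : write end of the pipe to that child }
        self.restart = False

    def hup_signal(self, signum, frame):
        # Keep the handler trivial: just note that a restart was requested.
        self.restart = True

    def spawn_child(self):
        read_fd, write_fd = os.pipe()
        pid = os.fork()
        if pid == 0:
            try:
                # Child: drop the write ends it inherited, then block until
                # the parent closes its end (read returns b"" at EOF).
                os.close(write_fd)
                for fd in self.children.values():
                    os.close(fd)
                while os.read(read_fd, 1):
                    pass      # a real worker would handle a request here
            finally:
                os._exit(0)   # never fall back into the parent's code path
        os.close(read_fd)
        self.children[pid] = write_fd
        return pid

    def do_restart(self):
        # Closing the pipes tells each child to exit once it is idle.
        for fd in self.children.values():
            os.close(fd)
        # Reap every old child before starting a fresh one.
        for pid in list(self.children):
            os.waitpid(pid, 0)
        self.children = {}
        self.spawn_child()
        self.restart = False


def demo():
    """Spawn a child, send ourselves SIGHUP, restart; return (old, new, flag)."""
    d = Dispatcher()
    old_handler = signal.signal(signal.SIGHUP, d.hup_signal)
    try:
        first = d.spawn_child()
        os.kill(os.getpid(), signal.SIGHUP)
        deadline = time.time() + 5
        while not d.restart and time.time() < deadline:
            time.sleep(0.01)  # give the handler a chance to run
        if d.restart:
            d.do_restart()
        second = next(iter(d.children))
        return first, second, d.restart
    finally:
        # Shut the remaining child down so the demo leaves nothing behind.
        for pid, fd in d.children.items():
            os.close(fd)
            os.waitpid(pid, 0)
        signal.signal(signal.SIGHUP, old_handler)


if __name__ == "__main__":
    old_pid, new_pid, flag = demo()
    print("old child", old_pid, "replaced by", new_pid)
```

The parent never goes away in this sketch, so anything like a listening
socket held by the parent would stay open across the restart; only the
workers are replaced.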

Here's the patch, in case anybody can think of a better way.  I'm not
running this in production yet, but it performs nicely under heavy load on my
test box.  I doubt it's something that anybody would want to merge
into a 1.0beta release, but, if it's useful to others, maybe it could go
in post 1.0.

(As an aside, does anybody know why the signal module doesn't have
sigaction() and sigprocmask()?  Is it just that nobody has ever implemented
it, or is there a deeper reason?)

jon

Jonathan Corbet
Executive editor, LWN.net
corbet@lwn.net


--- scgi/scgi_server.py.1.0b1   2003-02-19 16:18:59.000000000 -0700
+++ scgi/scgi_server.py 2003-02-20 10:29:18.000000000 -0700
@@ -12,6 +12,7 @@
 import select
 import errno
 import fcntl
+import signal
 from scgi import passfd

 # netstring utility functions
@@ -48,11 +49,11 @@

     def serve(self):
         while 1:
-            os.write(self.parent_fd, "1") # indicates that child is ready
             try:
+                os.write(self.parent_fd, "1") # indicates that child is ready
                 fd = passfd.recvfd(self.parent_fd)
-            except IOError:
-                # parent probably exited
+            except (IOError, OSError):
+                # parent probably exited  (EPIPE comes thru as OSError)
                 raise SystemExit
             conn = socket.fromfd(fd, socket.AF_INET, socket.SOCK_STREAM)
             # Make sure the socket is blocking.  Appearently, on FreeBSD the
@@ -100,7 +101,14 @@
         self.max_children = max_children
         self.children = {} # { pid : fd }
         self.spawn_child()
-
+        self.restart = 0
+
+    #
+    # Deal with a hangup signal.  All we can really do here is
+    # note that it happened.
+    #
+    def hup_signal(self, signum, frame):
+        self.restart = 1

     def spawn_child(self, conn=None):
         parent_fd, child_fd = passfd.socketpair(socket.AF_UNIX,
@@ -128,6 +136,27 @@
             os.close(self.children[pid])
             del self.children[pid]

+    def do_restart(self):
+        #
+        # First close connections to the children, which will cause them
+        # to exit after finishing what they are doing.
+        #
+        for fd in self.children.values():
+            os.close(fd)
+        #
+        # Then do a blocking wait on each until we have cleared the
+        # slate.
+        #
+        for pid in self.children.keys():
+            (pid, status) = os.waitpid(pid, 0)
+        self.children = {}
+        #
+        # Fire off a new child, we'll be wanting it soon.
+        #
+        self.spawn_child()
+        self.restart = 0
+
+
     def delegate_request(self, conn):
         """Pass a request fd to a child process to handle.  This method
         blocks if all the children are busy and we have reached the
@@ -147,7 +176,12 @@
         timeout = 0

         while 1:
-            r, w, e = select.select(self.children.values(), [], [], timeout)
+            try:
+                r, w, e = select.select(self.children.values(), [], [], timeout)
+            except select.error, e:
+                if e[0] == errno.EINTR:  # got a signal, try again
+                    continue
+                raise
             if r:
                 # One or more children look like they are ready.  Sort
                 # the file descriptions so that we keep preferring the
@@ -209,10 +243,17 @@
         s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
         s.bind((self.host, self.port))
         s.listen(40)
+        signal.signal(signal.SIGHUP, self.hup_signal)
         while 1:
-            conn, addr = s.accept()
-            self.delegate_request(conn)
-            conn.close()
+            try:
+                conn, addr = s.accept()
+                self.delegate_request(conn)
+                conn.close()
+            except socket.error, e:
+                if e[0] != errno.EINTR:
+                    raise  # something weird
+            if self.restart:
+                self.do_restart()


 def main():