g.raphaelli's weblog

Entries in the Category “Technical”

Circumventing security policy with gearman

written by g, on Jan 12, 2009 8:42:00 PM.

Gearman can be handy for defeating security in situations like the one pictured:

Here, Host A can initiate connections to Host B, but Host B is blocked by a firewall, router ACL, or similar from initiating connections to Host A.

With gearman, you can call a service on Host A from Host B with a setup like this:

  1. Run a gearmand job server on Host B (or any host that A and B can both reach).
  2. Register a gearman worker running on Host A with that job server.
  3. Submit work from a client running on Host B to that job server.

This idea can be taken another step by chaining workers together such that calling abc() as depicted above creates a job that Host C, which can't reach or be reached by Host B at all, can ultimately execute. That kind of setup makes it increasingly difficult to track an individual task's real status but it can be handy in a pinch.
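The reason this works is that the worker on Host A opens the connection outbound and then waits for jobs on it, so nothing ever has to connect in to Host A. Here is a minimal sketch of that pattern with plain sockets rather than gearman. Both "hosts" are threads on localhost, the one-line-per-message framing is made up for the demo, and abc() is only a stand-in for whatever service lives on Host A.

```python
import socket
import threading


def serve_one_job(listener, payload, results):
    """Host B side: wait for a worker to dial in, hand it a job, read the reply."""
    conn, _ = listener.accept()
    with conn, conn.makefile("rwb") as stream:
        stream.write(payload + b"\n")
        stream.flush()
        results.append(stream.readline().rstrip(b"\n"))


def run_worker(address, handler):
    """Host A side: open the connection outbound, then wait for work on it."""
    with socket.create_connection(address) as conn, conn.makefile("rwb") as stream:
        job = stream.readline().rstrip(b"\n")
        stream.write(handler(job) + b"\n")
        stream.flush()


def abc(data):
    """Stand-in for the service that lives on Host A."""
    return data.upper()


# Both "hosts" run on localhost here, purely for demonstration.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
listener.listen(1)

results = []
server = threading.Thread(target=serve_one_job, args=(listener, b"hello", results))
server.start()
run_worker(listener.getsockname(), abc)  # the only connect() call goes from A to B
server.join()
listener.close()
```

A real gearmand adds queuing, function names, and many workers on top of this, but the direction of the TCP connection is the part that gets around the ACL.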

Gearmand Rewrite Released

written by g, on Jan 9, 2009 5:43:00 PM.

There is some big news for Gearman fans out today - the first release of the C-rewrite of the gearmand job server has been announced.

Gearman is an excellent framework for farming out tasks to pools of machines, parallelizing tasks, and making calls between programming languages that speak gearman. Like memcached, it's extremely easy to get started using it. Once you do start using it, you'll never understand how you lived without it.

The new release does not provide python bindings but the code in sixapart's svn should be compatible with the new server*. The sample code provided with both the perl and C versions is pretty trivial: it simply echoes or reverses some strings. To pass interesting data between client and worker you'll need to create your own convention. YAML, or simply JSON, is very handy for this.

Let's take a contrived example. Of course, error handling will be omitted for clarity.

This is a gearman worker that expects a list of urls and will do something to each of them.

""" A Sample Gearman Worker """
import logging

from functools import wraps
import os
import simplejson as json
import urllib

from gearman import GearmanWorker

def job_in(fn):
    """ Decorates worker functions by calling them with a job's arguments """
    @wraps(fn)
    def new(job):
        # do something with the job object
        return fn(job.arg)
    return new

def json_in(fn):
    """ Decorates a function that may be called with a
        JSON-formatted string but expects a python object """
    @wraps(fn)
    def new(arg):
        # convert the args in JSON to a python object
        arg = json.loads(arg)
        return fn(arg)
    return new

@job_in
@json_in
def fetch(urls):
    success = 0
    for url in urls:
        logging.debug("fetching %s" % url)
        # do something with the url
        success += 1

    return json.dumps({'fetched': success})

# placeholder: list the address(es) of your gearmand job server(s) here
jobservers = ["127.0.0.1"]

worker = GearmanWorker(jobservers)
worker.register_function("fetch", fetch)
worker.work()

A client for this worker can look like:

import simplejson as json
from gearman import GearmanClient, Task

urls = ['http://www.flickr.com', 'http://www.yahoo.com']

# placeholder: list the address(es) of your gearmand job server(s) here
jobservers = ["127.0.0.1"]

client = GearmanClient(jobservers)
response = client.do_task(Task("fetch", arg=json.dumps(urls)))

print "%i urls fetched successfully" % json.loads(response)['fetched']

While this example is still quite simple, it does illustrate the kind of convention needed to pass real information between clients and workers. The python libraries for gearman also support tasksets: sets of tasks submitted to a job server at once for parallel execution. This is a simple yet powerful way to speed up work and make the most of hardware investments.
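The same convention can also carry failures, which the example leaves out. As a rough sketch only: the json_io decorator and the "ok", "result", and "error" keys below are made up for illustration and are not part of any gearman library. It uses the standard library json module in place of simplejson and needs no job server to try out.

```python
import json
from functools import wraps


def json_io(fn):
    """Decode a JSON argument, call fn, and encode the result or the error as JSON."""
    @wraps(fn)
    def new(arg):
        try:
            return json.dumps({"ok": True, "result": fn(json.loads(arg))})
        except Exception as exc:
            # Report the failure to the client instead of dying in the worker.
            return json.dumps({"ok": False, "error": str(exc)})
    return new


@json_io
def fetch(urls):
    if not isinstance(urls, list):
        raise TypeError("expected a list of urls")
    return {"fetched": len(urls)}


# Exercise the function directly with JSON strings, as a worker would receive them.
good = json.loads(fetch(json.dumps(["http://www.flickr.com", "http://www.yahoo.com"])))
bad = json.loads(fetch("not json"))
```

In a worker, a function like this would still sit under a decorator such as job_in, so the client can check the "ok" flag before trusting the payload.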

I hope that this rewrite of gearmand renews the python community's interest in enhancing the current bindings. I'd be very interested in a gearman protocol implementation for twisted (and might just start working on it soon).

* On Mac OS X 10.4.11 I'm getting a bus error from gearmand after successfully running a job. I haven't tracked it down or submitted a bug yet. Update: bug reported via IRC (yes, I'm gilad on freenode) and fixed with a two-line patch. A new release should be announced shortly.

Ganglia for MySQL Metrics

written by g, on Jan 5, 2009 12:43:00 PM.

I just made some minor updates to the metric collection module I use for monitoring MySQL with Ganglia. It collects over 100 metrics and includes a basic report like:

[Graph: MySQL selects, inserts, updates, and deletes on one graph]
This blog is the only thing using this MySQL server and I don't generate much traffic (hi mom!)

It works either as a ganglia 3.1-style gmond python module or as a gmetric script, for use with any decent version of ganglia. These are the relevant files:

Comments and improvements are welcome.
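For anyone who hasn't written a 3.1-style gmond python module, the interface is small: gmond calls metric_init(params) once to get a list of metric descriptors, calls each descriptor's call_back with the metric name when it polls, and calls metric_cleanup() on shutdown. The sketch below is not the module described in this post. The metric names, the per-second rate calculation, and the stubbed read_status() are all assumptions for illustration; a real module would query MySQL (for example with SHOW GLOBAL STATUS) where the stub is.

```python
import time

COUNTERS = ("Com_select", "Com_insert", "Com_update", "Com_delete")
_last = {}  # metric name -> (timestamp, counter value) from the previous poll


def read_status():
    """Placeholder: a real module would run SHOW GLOBAL STATUS here."""
    return dict.fromkeys(COUNTERS, 0)


def counter_rate(name, now=None, status=None):
    """gmond call_back: per-second rate of a cumulative counter since the last poll.

    gmond passes only the metric name; now and status exist so the function
    can be driven by hand.
    """
    now = time.time() if now is None else now
    status = read_status() if status is None else status
    value = int(status[name[len("mysql_"):]])
    prev = _last.get(name)
    _last[name] = (now, value)
    if prev is None or now <= prev[0] or value < prev[1]:
        return 0.0  # first poll, clock oddity, or counter reset
    return (value - prev[1]) / float(now - prev[0])


def metric_init(params):
    """Called once by gmond; returns one descriptor per metric."""
    return [{
        "name": "mysql_" + counter,
        "call_back": counter_rate,
        "time_max": 60,
        "value_type": "float",
        "units": "ops/s",
        "slope": "both",
        "format": "%f",
        "description": counter + " per second",
        "groups": "mysql",
    } for counter in COUNTERS]


def metric_cleanup():
    """Called by gmond on shutdown; nothing to release in this sketch."""
```

The same callbacks can be reused from a gmetric script by looping over the descriptors and shelling out to gmetric with each name and value.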

Packaging Python Modules

written by g, on Jan 1, 2009 4:05:00 PM.

Tools like EasyInstall and pip are excellent for frequently changing environments. I use virtualenv extensively in my development environments, where I can easy_install or pip -E install away. These tools, however, are not appropriate for production environments. In production, repeatability and minimal dependency on development tools are essential. When rolling out hundreds of servers, installing MySQL-python bindings with easy_install would require setting up a build environment on each host. The operating system's package manager is sufficient for managing the installation of these modules.

What if you need different versions of the same package on a particular host? If two applications running on a host require different versions of the same package, move one of the applications to a separate host or VM. If one application needs two versions of a particular package, fix the application. In an emergency, running an application out of a temporary virtualenv can do the job at the expense of operational headaches.

I run Red Hat-based systems, so RPMs are the way to go for me. python setup.py bdist_rpm is handy but there are a number of things I don't like about it:

  • Packages are not python-versioned. pyOpenSSL-0.8-1.i386.rpm does not indicate which python it is built against. Also, what if you want that package for both python2.4 and python2.5?
  • Similarly, I prefer that python library packages be clearly identified in the list of installed packages. A 'python-' prefix is effective. rrdtool-1.2.27-5.i386.rpm is the application package; python25-rrdtool.i386.rpm holds the python2.5 bindings.
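To make the naming scheme concrete, here is a small sketch of how such a package name could be derived. The helper name versioned_rpm_name is hypothetical; it is not part of distutils or of bdist_rpm_ver.py, and stripping an existing 'python-' prefix is my own assumption about how to avoid doubled prefixes.

```python
import sys


def versioned_rpm_name(dist_name, version_info=None, prefix="python"):
    """Return e.g. 'python25-rrdtool' for the dist 'rrdtool' built for Python 2.5."""
    # Default to the interpreter running this code when no version is given.
    major, minor = (version_info or sys.version_info)[:2]
    tag = "%s%d%d" % (prefix, major, minor)
    # Avoid names like python24-python-ldap: drop an existing 'python-' prefix.
    if dist_name.lower().startswith("python-"):
        dist_name = dist_name[len("python-"):]
    return "%s-%s" % (tag, dist_name)


name_25 = versioned_rpm_name("rrdtool", (2, 5))
name_24 = versioned_rpm_name("python-ldap", (2, 4))
```

With a scheme like this, the python2.4 and python2.5 builds of the same library get distinct package names and can be installed side by side.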

Since those things are not easy to override in the stock bdist_rpm, I use a lightly modified distutils/command/bdist_rpm.py arbitrarily named bdist_rpm_ver.py. Building production packages is usually as easy as:

  1. easy_install --build-directory ~/rpm/BUILD/ --editable <pkg>
  2. cd ~/rpm/BUILD/<pkg>
  3. pythonXX setup.py bdist_rpm_ver --fix-python --binary-only

This certainly is not perfect yet - sometimes I have to break down and do a --spec-only and tweak things by hand. Fortunately, this happens once per package and only takes a few minutes to fix, and then I have a stable package ready for deployment. Also, I currently handle dependencies separately from the rpms, but they can be added in step 3 with --requires, --conflicts, etc.

I'm interested in hearing more about:

“I’ve been meaning to write a post on why I think using system packaging for libraries is counter-productive, but that’ll wait for another time.”
Ian Bicking, A Few Corrections To “On Packaging”

In my experience, system packaging is the way to go for simple, repeatable installations to a single host or 1000s of hosts.