Tom Moertel's Blog

Finally! I have blogged 100 thousand words.

Posted by Tom Moertel Mon, 30 Jan 2012 04:48:00 GMT

I have finally done it! With my recent post on tree traversals, I have managed to write 100 thousand words for my blog:

>> Article.find(:all).inject(0) { |sum,a| sum +=
?>        (a.body + a.extended.to_s).split(/\s+/).length }
=> 100334

That sounds impressive until you realize that my first blog post, Fun with Asterisk, was about nine years ago. So we’re only talking, on average, about 11 thousand words per year. And that’s not hard, if you stick with it.

For me, the trick has been sticking with it. I joined a startup at the end of 2007, and my blogging abruptly lost about four fifths of its pace:

So I need to discipline myself to blog more frequently. I hope the next 100 thousand words won’t take so long to write.

Finally, I’d like to take this opportunity to thank you for reading and commenting. You’re the reason I wrote those words in the first place. You made the first 100 thousand words fun.

Thank you!

Your pal,
Tom Moertel

Posted in site news
Tags 100k, blog, blogging, statistics, writing
1 comment
no trackbacks

The inner beauty of tree traversals

Posted by Tom Moertel Fri, 27 Jan 2012 04:24:00 GMT

This has been done a million times before, but if you haven’t seen it, it’s pretty neat. Let’s say you have a simple recursive data structure like a binary tree:

module Tree where

data Tree a = Empty
            | Node a (Tree a) (Tree a)
            deriving (Eq, Show)

-- some binary trees

t0 = Empty
t1 = Node 1 Empty Empty
t3 = Node 2 (Node 1 Empty Empty) (Node 3 Empty Empty)

Now say you want to traverse the nodes, in “preorder.” So you write a traversal function. It folds a binary operator f through the values in the tree, starting with an initial accumulator value of z. First, it visits the current node, then the left subtree, and finally the right subtree, combining each value it encounters with the current accumulator value to get the next accumulator value. The final accumulator value is the result.

Spelling out the steps, it looks like this:

preorder_traversal f z tree = go tree z
  where
    go Empty        z = z
    go (Node v l r) z = let z'   = f v z     -- current node
                            z''  = go l z'   -- left subtree
                            z''' = go r z''  -- right subtree
                        in z'''

But if you wanted an “inorder” traversal instead, you’d write a slightly different function, visiting the left subtree before the current node:

inorder_traversal f z tree = go tree z
  where
    go Empty        z = z
    go (Node v l r) z = let z'   = go l z    -- left subtree
                            z''  = f v z'    -- current node
                            z''' = go r z''  -- right subtree
                        in z'''

To see how the two traversal functions work, let’s use both to flatten the 3-node tree t3 we defined earlier.

flatten traversal = reverse . traversal (:) []

test0i = flatten inorder_traversal t3   -- [1,2,3]
test0p = flatten preorder_traversal t3  -- [2,1,3]

Great.

But now you start thinking about post-order traversal. Do you want to write a third traversal function that’s almost the same as the first two?

So you go back to the inorder traversal and start staring real hard at that let expression. Can you rewrite it? Yes:

inorder_traversal2 f z tree = go tree z
  where
    go Empty        z = z
    go (Node v l r) z = go r . f v . go l $ z

And now you’ve got it. Just by reordering the applications of (go l), (f v), and (go r), you can control the traversal.

Which leads to a general traversal function that lets some other function control that ordering:

traverse step f z tree = go tree z
  where
    go Empty        z = z
    go (Node v l r) z = step (f v) (go l) (go r) z

Doesn’t that look beautiful?

Now you can just spell out your desired traversals:

preorder   = traverse $ \n l r -> r . l . n
inorder    = traverse $ \n l r -> r . n . l
postorder  = traverse $ \n l r -> n . r . l

A test drive:

test1p = flatten preorder t3   -- [2,1,3]
test1i = flatten inorder t3    -- [1,2,3]
test1o = flatten postorder t3  -- [1,3,2]

And you’re happy.

But can you go further? What if your trees are binary search trees and you just want to find the minimum or maximum value? Can you just traverse to the left or right?

Yes:

leftorder  = traverse $ \n l r -> l . n
rightorder = traverse $ \n l r -> r . n

treemin = leftorder min maxBound
treemax = rightorder max minBound

test2l = treemin t3 :: Int  -- 1
test2r = treemax t3 :: Int  -- 3

Isn’t that neat?

Posted in functional programming
Tags fun, haskell, recursive, refactoring, trees
5 comments
no trackbacks

A flyweight mocking helper for Python

Posted by Tom Moertel Mon, 07 Nov 2011 05:05:00 GMT

Recently, I needed a lightweight mocking solution when adding unit tests to some existing Python code. I could have used one of the many Python mocking libraries, but I only had to replace a few module functions during testing, so I just wrote a tiny mocking helper rather than add a dependency to the project.

The helper is simple and surprisingly versatile:

import contextlib
import functools

@contextlib.contextmanager
def mocked(func,
           expecting=None, returning=None, raising=None,  # specs
           replacement_logic=None, called=1):
    """Stub out and mock a function for a yield's duration.""" 

    if (returning, raising, replacement_logic).count(None) < 2:
        raise ValueError("returning, raising, and replacement_logic " 
                         "are incompatible with each other")

    # default logic for implementing mock fn: return or raise per specs
    def default_logic(*_args, **_kwds):
        if raising:
            raise raising
        return returning

    # prepare wrapper to replace mocked function for duration of yield
    invocations = [0]
    @functools.wraps(func)
    def replacement(*args, **kwds):
        if expecting is not None:
            assert expecting == (args, kwds)  # did we get expected args?
        invocations[0] += 1
        return (replacement_logic or default_logic)(*args, **kwds)

    # replace mocked function, yield to test, and then check & clean up
    module = sys.modules.get(func.__module__)
    setattr(module, func.__name__, replacement)
    try:
        yield  # give control back to test for a while
        assert invocations[0] == called  # was mock called enough?
    finally:
        setattr(module, func.__name__, func)

def uncalled(func):
    """Require that a function not be called for a yield's duration.""" 
    return mocked(func, called=0)

The idea is to wrap test code that requires mocked external functions with a mocking helper. Here’s an example:

def test_ml_errors_must_be_reported(self):
    """When an error occurs, it must be reported; nothing must be sent.""" 
    data = self._request_data()
    exception = mls.MlError()
    exception.exc = '[[[error description]]]'
    with mocked(mls.subscriber_count, raising=exception):
        with uncalled(sms.send):
            resp = self.app.request("/mailings", method='POST', data=data)
    self.assertEqual(resp.status, '200 OK')
    self.assertIn(exception.exc, resp.data)  # must be reported

In this test method, I’m making a simulated POST request to a web service that’s supposed to check how many users are in a mailing list and then, possibly, send some SMS messages. In this case, however, I want to simulate that an error occurs when talking to the mailing-list service. The test must verify that the error is reported and, crucially, that no messages are sent.

The library modules for talking to the mailing-list service and the SMS service are called mls and sms. So, for the duration of the simulated request, I’m replacing two functions in these modules with mock versions. The mock version of mls.subscriber_count, when called, will raise the exception I’m trying to simulate. The mock version of sms.send, however, must not be called; if it is, the uncalled mocking helper will alert me by raising an exception.

So the mocking helpers not only temporarily install mock implementations of functions but also assert that those mock implementations are (or are not) called as expected. In the following code, for example, I use this capability to make sure that the mailing-list service is asked to get the subscriber count for the mailing list having the right key. I also simulate the service returning a subscriber count of 123.

def test_unconfirmed_msgs_must_be_confirmed(self):
    """An unconfirmed msg must be confirmed and not sent.""" 
    mlkey = 'TEST_ML_KEY'
    body = 'Test message!'
    data = self._request_data(mlkey=mlkey, body=body)
    with mocked(mls.subscriber_count, expecting=((mlkey,),{}), returning=123):
        with uncalled(sms.send):
            resp = self.app.request("/mailings", method='POST', data=data)
    self.assertEqual(resp.status, '200 OK')
    self.assertIn('Reply CONFIRM', resp.data)    # we must ask for confirmation
    self.assertIn('123 subscribers', resp.data)  # and supply subscriber count
    self.assertIn(body, resp.data)               # and the msg to be sent

The mocked helper lets you do a few more things, too, but you get the idea: Sometimes a short helper function can take you a long way.

Tags mocks, python, testing
no comments
no trackbacks

The most surprisingly helpful thing I have written

Posted by Tom Moertel Wed, 02 Nov 2011 02:55:00 GMT

Back in 2007, I repaired an aging and fairly obscure A/V receiver that had lost the ability to respond to its remote control. This I did by re-soldering some hard-to-find solder joints that had broken on its circuit board.

On the chance that someone else had a similar problem, I posted some instructions and photos on my blog. I didn’t think much of it at the time.

But since then, every week or so, another comment shows up, thanking me for writing it. Some typical examples:

Fixed my Kenwood V6030D 10 minutes ago. Life is good again, Thanks mate….

Ditto! Worked on my VR-209 like a champ! Thanks!

Thank you for this posting, which I stumbled upon when I was researching the problem of my remote control no longer working for this receiver (VR-507). These pictures were invaluable to locate the faulty pins. (They sure are small.) Re-soldering them restored full functionality to the receiver and the original remote control. Good job!

Thanks, Fixed my KRF-8010D with this, been 5 years fighting with remote working now and then.

There are now about 60 comments like that. I never would have imagined that 60 people would have read the post let alone get out a soldering iron because of it. But they did! And it helped them!

Now, every time I’m feeling down, I Google up that post and read the thank-yous. It cheers me up.

So here’s the lesson: Write it down. If you’ve figured something out, even if it seems unimportant, write it down. Maybe someone else will find it helpful. Maybe a lot of someone elses will find it helpful.

You never know. It might even cheer you up someday.

Posted in interesting stuff
Tags helping, lessons, surprising, writing
5 comments
no trackbacks

Almost there: 100K words

Posted by Tom Moertel Thu, 21 Jul 2011 00:16:00 GMT

In 2007, I counted how many words I had written for my blog and was surprised to discover it was 60 thousand, about a short novel’s worth. Just now, I stumbled upon that earlier count and wondered what the count was today.

>> Article.find(:all).inject(0) { |sum,a| sum +=
?>        (a.body + a.extended.to_s).split(/\s+/).length }
=> 98702

I’m about 1300 words shy of 100 thousand.

Now I’ve got to write a few more posts for the blog.

:-)

Posted in writing
Tags blog, writing
no comments
no trackbacks

Good enough to steal

Posted by Tom Moertel Sun, 10 Jul 2011 22:06:00 GMT

Recently, I wrote a simple plagiarism detector as a fun programming exercise. Then, merely a few days later, some company gave me cause to use it.

This company, it seems, was looking to hire a programmer. So they posted a job ad that was more or less word-for-word copied from a job ad that I had written for the company where I work. The duplication, being so extensive, was hard to miss. (The offending company, to its credit, promptly removed the copied ad from its web site when we let them know about it.)

I had originally written the text used in that ad back in December 2010, when we were starting another round of hiring. I had hoped that when the right kind of programmers read it, they would discern that we were programmers just like them, programmers who cared for their craft enough and who cared for their team enough to take hiring other programmers seriously. I didn’t want our ads to seem anything like those spat out by people just mouthing the words that everyone else was mouthing to “get talent.” So I worked on getting the words right, thinking the investment would somehow help us stand apart, if just a bit, when hiring.

But I had failed to consider that authenticity in job ads can be faked by just copying what seems authentic. So, a few days ago, when Googling for statistically unlikely phrases from the text I had written, I was actually surprised to discover that a number of companies and recruiters were now using my words, more or less unchanged, to signal how “authentic” they were.

At first, I was annoyed. But, upon reflection, I realized that the plagiarism was telling me that I had written something worth stealing. That’s a good thing, right?

After all, in a society where all too many people are willing to claim your words as their own, which is worse: to write something and have it stolen or to write something and have it not?

P.S. For the pedants who would point out that nothing was actually “stolen” here, please understand that steal has a well-established sense that means basically to plagiarize, as in T.S. Eliot’s quip that “Mediocre writers borrow. Great writers steal.”

Posted in writing
Tags hiring, plagiarism, recruiters, weaselry, writing
5 comments
no trackbacks

Writing a simple plagiarism detector in Haskell

Posted by Tom Moertel Thu, 16 Jun 2011 04:13:00 GMT

Recently, I wrote a simple plagiarism detector in Haskell using n-grams. The problem it solves is this: You have a document, the suspect, which you believe may be derived from other documents, the sources. You want to find regions in the suspect that are identical to those in the sources.

One simple solution is to use n-grams, which are just the n-word neighborhoods taken from some sequence of words. For example, the 4-grams of the sequence a b c d e f are a b c d, b c d e, and c d e f.

The idea for plagiarism detection with n-grams is that, as you increase n, the probability that a given n-gram will be shared between two arbitrary documents rapidly vanishes. For example, Google says, right now, that the 2-gram “rapidly vanishes” appears in about 11,000 documents, but the 5-gram “two different documents rapidly vanishes” occurs in no documents. (Once Google indexes this post, however, that will change.)

But when one document is copied partly from another, they will probably share many long sequences of words. So, the sharing of longer n-grams is evidence of copying.

It’s not proof, however. If two papers about U.S. history both contain the 5-gram “The United States of America,” that coincidence is not surprising. But when lots of unlikely sharing occurs, it’s strong evidence of some form of common ancestry.

Fortunately, when copying occurs, it’s often easy to see, once you know where to look. So my detector just annotates the suspect document to indicate which portions of it share n-grams with the source documents. Shared regions are converted to ALL CAPS. For example, looking for shared 4-grams, my detector finds ample evidence that the following paragraph was copied from the text above:

$ ./Plag 4 suspect.txt blog-post-plagiarism-detector.txt

But it's not rock solid. If two documents ABOUT U.S. HISTORY
BOTH CONTAIN THE 5-GRAM "THE UNITED STATES OF AMERICA," 
that's not shocking. BUT WHEN LOTS OF sharing occurs, that's
pretty good EVIDENCE OF SOME FORM OF COMMON ANCESTRY.

In fact, it was copied: I duplicated an earlier paragraph and then changed a few words.

Using n-grams to detect sharing is not a perfect system, but it’s a simple system, and, because its output is easily interpreted, it’s an effective system. When your eyeballs see a lot of ALL CAPS TEXT, it’s easy to judge what it represents.

Anyway, here’s the code.

{-

Simple plagiarism detector that detects n-gram duplication.

Usage: ./Plag N suspect.txt source.txt... > annotated-suspect.txt
(where N is the n-gram size; try 3, 4, and 5).

Tom Moertel <tmoertel@gmail.com>
2011-06-11

-}

module Main where

import Control.Applicative ((<$>))
import Data.Char (toLower, toUpper, isAlphaNum, isSpace)
import Data.List (tails)
import qualified Data.Set as S
import System (getArgs)

main = do
  n:suspect:sources <- getArgs
  putStrLn =<< findMatches (read n) suspect sources

findMatches n suspect sources = do
  db <- S.fromList . concat  <$> mapM (loadNgrams n) sources
  stxt <- readFile suspect
  return $ concat (match n db (nGrams n stxt) (whiteWords stxt))

match :: Int -> S.Set [String] -> [[String]] -> [String] -> [String]
match n db ngs wws = mapAt (map toUpper) matchlocs wws
  where
    matchlocs = uniq . concat $ zipWith isMatch [0..] ngs
    isMatch loc ng | ng `S.member` db = [loc .. loc + n - 1]
                   | otherwise        = []

mapAt :: (a -> a) -> [Int] -> [a] -> [a]
mapAt f locs ws = go 0 locs ws
  where
    go _   [