Embedded in Academia


How Many C Programs Are There?

If I choose a size S, can you tell me how many valid C programs exist that are no larger than that size? I’m actually interested in the answer — it’ll help me make a point in a paper I’m writing. Shockingly, the Internet (or at least, the part of it that I looked at based on a few searches) does not provide a good answer.

Let’s start with a few premises:

  1. Since it would be exceedingly difficult to construct the exact answer, we’re looking for a respectably tight lower bound.
  2. S is measured in bytes.
  3. Since it seems obvious that there’s an exponential number of C programs, our answer will be of the form N^S.

But how to compute N? My approach (and I’m not claiming it’s the best one) is to consider only programs of this form:

int main() {
  int id1;
  int id2;
  ....
}

The C99 standard guarantees that the first 63 characters of each identifier are significant, so (to drop into Perl regex mode for a second) each identifier is characterized as: [a-zA-Z_][a-zA-Z0-9_]{62}. The number of different identifiers of this form is 53 * 63^62, but we spend 69 bytes to generate that many programs (4 bytes for “int ”, 63 for the identifier, 1 for the semicolon, and 1 for the newline), in addition to 15 bytes of overhead to declare main(). The 69th root of that large number is between 43 and 44, so I claim that 43^(S-15) is a not-incredibly-bad lower bound on the number of C programs of size S. Eventually we’ll hit various other implementation limits and be forced to declare additional functions, but I suspect this can be done without invalidating the lower bound.
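
(As a quick sanity check of the “between 43 and 44” claim, the 69th root can be computed with logarithms, since 53 * 63^62 overflows every integer type. The little program below is my own addition, not part of the original argument.)

#include <math.h>
#include <stdio.h>

/* sanity check: the 69th root of 53 * 63^62, computed via logarithms */
int main (void)
{
  double log_count = log (53.0) + 62.0 * log (63.0);   /* ln(53 * 63^62) */
  printf ("69th root = %f\n", exp (log_count / 69.0)); /* prints ~43.8   */
  return 0;
}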

Of course, the programs I’ve enumerated are not very interesting. Perhaps it would have been better to ask:

  • How many non-alpha-equivalent C programs are there?

or even:

  • How many functionally inequivalent C programs are there?

But these sound hard and don’t serve my narrow purpose of making a very blunt point.

Anyway, I’d be interested to hear people’s thoughts on this, and in particular on the 43^S (claimed) lower bound.

2012 02 20 | Compilers, Computer Science, Random

Bonneville Salt Flats and Silver Island Mountains

2012 02 18 | Outdoors, Utah

C Puzzle: Double Trouble

I ran into an interesting C program that both of my C oracles (KCC and Frama-C) consider to be well-defined, but that is inconsistently compiled by the current development versions of GCC and Clang on x86-64.

int printf (const char *, ...);

void fn2 (int p1)
{
  printf ("%d\n", p1);
}

union U0 {
  short f0;
  int f1;
};

union U0 b;
int *c = &b.f1;

int main ()
{
  short *d = &b.f0;
  *d = 0;
  *c = 1;
  fn2 (b.f0);
  return 0;
}

The results:

[regehr@gamow 1]$ current-gcc -O1 small.c ; ./a.out
1
[regehr@gamow 1]$ current-gcc -O2 small.c ; ./a.out
0
[regehr@gamow 1]$ clang -O1 small.c ; ./a.out
1
[regehr@gamow 1]$ clang -O2 small.c ; ./a.out
0

GCC is revision 183992 and Clang is 150180.

The question is: Is this program well-defined (in which case I get to report two compiler bugs) or is it undefined (in which case I get to report two bugs in C analysis tools)?

UPDATE:

Just to confuse things a bit more, I’ll add this program. GCC prints 1 at all optimization levels but Clang changes its answer to 0 at higher levels.

int printf (const char *, ...);

union U0 {
  short f0;
  int f1;
} b;

union U0 *c = &b;

int main ()
{
  b.f0 = 0;
  c->f1 = 1;
  printf ("%d\n", b.f0);
  return 0;
}

I decided to report this as an LLVM bug just to make sure that the folks there have thought this kind of example over. I suspect it’ll get closed as a WONTFIX, which is OK — real compiler design involves a lot of hard tradeoffs.

2012 02 09 | Compilers, Computer Science, Software Correctness

C99 Language Lawyer Needed

The program below came up during some tests. The question is, is it well-defined by the C99 language? It appears to clearly be undefined behavior by C11 and C++11.

struct {
  int f0:4;
  int f1:4;
} s;

int main (void) {
  (s.f0 = 0) + (s.f1 = 0);
  return 0;
}
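
For contrast, here is a rewrite (mine, not part of the puzzle) in which the two bit-field assignments are fully sequenced in separate statements. This version is uncontroversially well-defined; the question above is entirely about the unsequenced assignments to f0 and f1, which may share a single memory location because they are adjacent bit-fields.

struct {
  int f0:4;
  int f1:4;
} t;

int main (void) {
  /* fully sequenced: no unsequenced updates to possibly-shared storage */
  t.f0 = 0;
  t.f1 = 0;
  return 0;
}
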
2012 02 06 | Compilers

Randomly Testing a Static Analyzer

Static analyzers are intended to find bugs in code, and to show that certain kinds of bugs don’t exist. However, static analyzers are themselves large, complicated programs, leading to a “who watches the watchmen” problem. Pascal Cuoq, one of the people behind the excellent Frama-C analyzer, took it upon himself to run the fuzz-fix cycle for Frama-C and Csmith all the way to its fixpoint — where no bugs can be found.

Fuzz testing relies on knowing some invariants for the system under test; often, this is just the trivial “program shouldn’t crash.” Luckily, for compiler testing we can do much better and Pascal did some hacking to turn Frama-C into a (relatively) efficient interpreter for C programs, making it possible to compare Frama-C’s interpretation of a program against regular C compilers.

The aspect of this exercise that I found most interesting was that Frama-C found some nasty bugs in Csmith at the same time that Csmith was finding problems in Frama-C. You might say that Csmith fuzzed itself, but without Frama-C’s deep inspection of the generated programs’ behavior, we couldn’t see the bugs. Recall that Csmith’s key guarantee is that its output is well-defined by the C standard. Frama-C found five bugs where we failed to provide that guarantee.

The bugs found in Frama-C are listed here and more details can be found in a short paper that we (but mainly Pascal) wrote. We hypothesize that random testing would reveal a similar set of issues in other static analysis tools for C. If these tools are being used as part of safety arguments for critical systems, somebody should do this fuzzing.

2012 02 03 | Compilers, Computer Science, Software Correctness

Avoidable Failures of Peer Review

This piece is about a specific kind of peer review failure where a paper is rejected despite there being sufficient evidence to warrant acceptance. In other words, all the facts are available but the wrong decision gets made anyway. In my experience this is extremely common at selective computer science conferences. The idea here is to identify and discuss some of the causes for this kind of peer review failure, in order to make these easier to identify and fight.

Not in the Current Problem Set

In most academic communities, most of the researchers are attacking a fairly small set of problems at any given time. Moving as a herd is good because it increases the problem-solving power of a research community as people compete and build upon each other’s solutions. Drawbacks of this style of research are that the research topic may remain alive well beyond the point of diminishing returns, and that the area can suffer mission drift into irrelevant subtopics.

Hot topics can also crowd out other work. At best, papers not on a hot topic are implicitly held to a higher standard because the reviewers must perform a more significant context switch in order to understand the nuances of the problem being solved. Or, to flip that around, the authors are unable to exploit the fact that everyone already considers the problem to be worth attacking and understands not only its structure but also the state of its major solutions.

A pernicious variant of “not in the current problem set” occurs when the paper being reviewed addresses a problem that the community views as solved. It should not come as a shock if I say that we see some non-standard definitions of “solved” in academia. One of my most jaw-dropping moments in the research world (and this is saying a lot) happened at an NSF workshop some years ago where a well-known formal verification researcher got up in front of a group of industrial people, government people, and academics and said more or less: “We have solved all of the key problems. Why aren’t you people adopting our work?” As another example, a researcher working on verified compilation has told me horror stories about reviews that attempt to invalidate his work by denying the existence of compiler bugs.

Solutions:

  • Conferences should actively pursue eclectic work. This is easier said than done, but I often feel that ASPLOS is reasonably eclectic and USENIX ATC used to be.
  • Reviewers need to be sensitive to the “we solved that problem” phenomenon. Of course, in some fraction of cases this reaction is justified — but not always.

Vocal Detractor

Sometimes the reviewers pretty much like a paper — perhaps it even has one or two strong supporters — but one person just wants to kill it. If this person is loud, determined, and forceful this tactic will usually succeed. A frequent vocal detractor will subvert the peer review process. Furthermore, one has to suspect that these people are sometimes politically or ideologically motivated.

Solutions:

  • The committee as a whole (and particularly the chair) needs to create an environment in which abusive behavior is not tolerated. Frequent vocal detractors should not be invited to be on conference committees.
  • Electronic meetings probably make it more difficult to be a vocal detractor since force of personality is blunted by the medium. On the other hand, electronic meetings tend to fragment the discussion, making systematic use of this tactic harder to detect.

Conservatism

Ground-breaking research requires boldness and risk-taking, at least in some cases. The NSF explicitly wants research to be “transformative” as opposed to incremental. It follows that just a bit of boldness and risk-taking is also required on the part of the program committees. But too often, we see exactly the opposite: professors display a huge amount of conservatism in the peer review process. Nobody wants to accept a paper that might be embarrassing or wrong, nor do they want to accept a paper where the authors have displayed a less-than-masterful command of the related work, nor do they want to accept a paper where the evaluation section uses a geometric mean instead of arithmetic, or fails to explain Figure 17 in sufficient detail.

A common rhetorical tactic seen at program committee meetings is for someone to say: “Of course we can’t accept this paper, since two reviewers gave negative recommendations.” This happens so frequently that I have a few stock responses:

  • On the contrary — we can do whatever we like with this paper.
  • Can you tell me again why we all flew out here? To not discuss this paper?
  • Might we consider the possibility that some of these reviewers are inexpert, biased, or wrong?

Although conservatism can come from an individual, just as often it emerges out of group dynamics. Someone makes an offhand remark during the discussion of a paper and suddenly the conversation enters a death spiral from which it is impossible to reach a positive consensus.

Conservatism is rooted in the fact that so much of academia is about status. When playing status games, we fear above all losing face. Conservative decisions minimize the risk of looking like idiots, but of course they also minimize the risk of learning something new.

Would it really be that bad to risk publishing some wrong results, if this kind of risk-taking had other benefits such as accepting a larger amount of interesting new work? I don’t think so. Informal post-publication peer review should be able to deal with these problems — and it’s not like conservative acceptance can catch all instances of bad research.

A sub-problem of conservatism is that papers that fail to conform get rejected. In computer systems, paper submissions from non-PhDs in industry are not uncommon. Usually these papers don’t quite look right and they often get rejected. In a number of cases I’ve seen reviews so blisteringly ugly that nobody in their right mind who received them would ever submit another paper to that community.

Solutions:

  • Accept more papers. At a 10% acceptance rate, conservatism dominates all other considerations since only papers that look perfect will be accepted. At a 40% acceptance rate this is much less likely. Given the trend away from paper conference proceedings, there is little downside.
  • Give each program committee member a limited number of whiteballs: unilateral acceptances for papers that have merit, but that are dying in committee. The OOPSLA 2010 frontmatter has interesting thoughts on this.

Conclusion

Avoidable rejections have two main root causes: human nature and the artificial scarcity of accepted papers at prestigious conferences and journals. Human nature can’t be fixed but it can be worked around, for example using the “unilateral accept” mechanism. Artificial scarcity can be fixed.

I wrote up this piece for two reasons. First, in the last couple of years I have found myself on the losing side of many arguments that a paper should be accepted to a good conference; I’m trying to better understand why this happens. Second, I’m co-chairing an embedded software conference this year and need to figure out how I can keep my program committee from rejecting too many papers that could otherwise be accepted.

2012 01 30 | Academia, Computer Science

Tricking a Whitebox Testcase Generator

The awesome but apparently discontinued Underhanded C Contest invited us to write C code that looks innocent but acts malicious. Today’s post is an alternate take on the same idea: we don’t really care what the malicious code looks like, but it needs to escape detection by an automated whitebox testcase generator. In some respects this is much easier than tricking a person, since whitebox tools are (at best) idiot savants. On the other hand, a number of tried-and-true methods for tricking humans will be stopped cold by this kind of tool.

As a simple example, we’ll start with two solutions to my bit puzzle from last month: the slow reference implementation and a much faster optimized version. We’ll also include a little test harness.

uint64_t high_common_bits_reference(uint64_t a, uint64_t b)
{
  uint64_t mask = 0x8000000000000000LLU;
  uint64_t output = 0;
  int i;
  for (i = 63; i >= 0; i--) {
    if ((a & mask) == (b & mask)) {
      output |= (a & mask);
    } else {
      output |= mask;
      goto out;
    }
    mask >>= 1;
  }
 out:
  return output;
}

uint64_t high_common_bits_portable(uint64_t a, uint64_t b)
{
  uint64_t x = a ^ b;
  x |= x >> 1;
  x |= x >> 2;
  x |= x >> 4;
  x |= x >> 8;
  x |= x >> 16;
  x |= x >> 32;
  return (a & ~x) | (x & ~(x >> 1));
}

int main(void)
{
  srand(time(NULL));
  int i;
  for (i=0; i < 100000000; i++) {
    uint64_t a = rand64();
    uint64_t b = mutate(a);
    assert (high_common_bits_reference (a,b) ==
            high_common_bits_portable (a,b));
  }
  return 0;
}

The code in main() is just a trivial fuzz tester: rand64() gives uniformly distributed 64-bit random integers and mutate() randomly flips 0..63 random bits in its argument (making both arguments independently random does not do a very good job testing this code).
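
The post uses rand64() and mutate() without showing them; here is one plausible way to write them (a sketch under my own assumptions, in particular that rand() supplies at least 15 random bits per call; the versions in the downloadable source may differ).

#include <stdint.h>
#include <stdlib.h>

/* hypothetical helpers -- the real ones live in the linked source */
static uint64_t rand64 (void)
{
  /* concatenate 15-bit chunks from rand() into a roughly uniform 64-bit value */
  uint64_t r = 0;
  int i;
  for (i = 0; i < 5; i++) {
    r = (r << 15) | (uint64_t)(rand () & 0x7fff);
  }
  return r;
}

static uint64_t mutate (uint64_t x)
{
  /* flip between 0 and 63 randomly chosen bits; positions may repeat */
  int flips = rand () % 64;
  int i;
  for (i = 0; i < flips; i++) {
    x ^= (uint64_t)1 << (rand () % 64);
  }
  return x;
}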

Can we add malicious code that this fuzzer will not detect? Of course this is easy, for example we can add this code to the first line of either function:

if (a==0x12345678 && b==0x9abcdef0) return -77;

Why return -77? Let’s assume that the return value from this function (which is guaranteed to be in the range 0..63) is used as an array index, and offset -77 lets us turn an array write into a stack smash.

First, will our fuzz tester discover this modification? Nope… even with 100 million random trials, the probability of detection is on the order of 10^-30.

Next, let’s try to discover the modification using Klee, an excellent open source whitebox test case generator. We’ll change the test harness to this:

int main(void)
{
  uint64_t a,b;
  klee_make_symbolic (&a, sizeof (a), "a");
  klee_make_symbolic (&b, sizeof (b), "b");
  assert (high_common_bits_reference (a,b) == high_common_bits_portable (a,b));
  return 0;
}

As expected, Klee discovers the problem almost immediately:

KLEE: ERROR: /mnt/z/svn/code/bitdiff-klee.c:129: ASSERTION FAIL: high_common_bits_reference (a,b) == high_common_bits_portable (a,b)

How did it find these inputs? The point of a whitebox testcase generator is to take advantage of the structure of the code while generating test cases. So here, it simply asked “What values of the arguments lead to the assertion-failed branch being taken?” The answer, in this case, is trivial.

Do tools like Klee pose a serious problem for our effort to insert a backdoor? Let’s try to stop Klee from seeing our bug. The canonical way to fool a whitebox testcase generator is to exploit the fact that these tools solve hard inverse problems — so we insert a computation that is hard to invert. A checksum computation should do nicely. Our bug code now reads:

if (obfuscate_crc(a)==0xdf590898 && obfuscate_crc(b)==0x970b5f84) return -77;

where:

uint64_t obfuscate_crc (uint64_t x)
{
  return crc32 ((unsigned char *)&x, sizeof (x));
}

crc32() is simply a standard 32-bit checksum function that my stupid WordPress plugin for rendering C code cannot render, but you can find it in the full source code for this example.
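
For readers who do not want to chase the link, a standard bit-at-a-time CRC-32 (reflected polynomial 0xEDB88320) looks roughly like the following. This is my own rendering, not the post’s; the magic constants in the bug trigger above depend on whichever CRC variant the real source uses, so treat this as illustrative only.

#include <stdint.h>
#include <stddef.h>

/* standard reflected CRC-32, computed bit by bit -- illustrative only */
static uint32_t crc32 (const unsigned char *buf, size_t len)
{
  uint32_t crc = 0xffffffffU;
  size_t i;
  int j;
  for (i = 0; i < len; i++) {
    crc ^= buf[i];
    for (j = 0; j < 8; j++) {
      crc = (crc & 1) ? (crc >> 1) ^ 0xedb88320U : (crc >> 1);
    }
  }
  return ~crc;
}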

Now we run Klee again and — damn! — after about 30 minutes (on my slowish home machine) it finds an input that triggers our buggy code. This is impressive; I did not expect Klee to be able to invert a couple of CRC32 operations. While it’s clear that we could push this line of attack further, and eventually confuse Klee to the point where it gives up, let’s try a different tactic. Our new bug code is:

 if (obfuscate_pipe(a)==0x12345678 && obfuscate_pipe(b)==0x9abcdef0) return -77;

where:

uint64_t obfuscate_pipe (uint64_t x)
{
  int pipefd[2];
  int res = pipe (pipefd);
  // assert (res==0);
  res = write (pipefd[1], &x, sizeof(x));
  // assert (res==sizeof(x));
  uint64_t ret;
  res = read (pipefd[0], &ret, sizeof(x));
  // assert (res==sizeof(x));
  close (pipefd[0]);
  close (pipefd[1]);
  return ret;
}

This time, rather than trying to overcome Klee using brute force, we’re counting on its inability to follow data through certain system calls. This time we get the desired result: not only does Klee not find anything amiss with our code, but it reports the same number of total paths (65) as it reports for the unmodified code.

A third way to trick Klee is to exploit a weakness in its goal of achieving full path coverage of the system under test. It is not hard to create a code path that is malicious for particular values of variables, but not for other choices. For example, Klee does not see the problem in this code:

int indices[] = { 0, 2, 1, 5, 3, 0, -77, 0, 3, 6 };

value_t storage[7];

void stash (int i, value_t v)
{
  if (i<0 || i>=10) {
     error();
  } else {
    storage[indices[i]] = v;
  }
}

When Klee tests the path containing the array store, it won’t notice the malicious index unless it happens to generate inputs that result in stash() being invoked with i==6 (it doesn’t, at least for me).

I feel certain there are additional ways to fool this kind of tool and will be interested to hear people’s thoughts.

Note that Microsoft does massive whitebox testing of its products at the binary level (Klee, in contrast, operates on LLVM bitcodes). Anyone who wants to put a backdoor into these products had better be covering their tracks carefully using techniques like the ones in this post (of course other strategies, such as ensuring that the affected modules are never tested by SAGE, would seem simpler and more reliable).

Finally, I might be able to come up with a new underhanded C contest where the goal is to fool both humans and a whitebox tester; let me know if that sounds interesting.

[Notes: Source code is available here. While preparing this piece I used this self-contained Klee distribution which is super easy to use and is described in a bit more detail on the Klee "getting started" page.]

Update from Jan 25:

As arrowdodger points out below, Klee should find the problem with the third method in this post. The fact that I was unable to get it to do so stemmed from a silly mistake I made: giving Klee the “--optimize” flag, which runs LLVM’s optimizers, which deleted the array write as dead code. Thanks to the people on the Klee mailing list for setting me straight!

2012 01 24 | Compilers, Computer Science, Software Correctness

Discovering New Instructions

Sometimes I wonder what instruction sets are supposed to look like. That is, what instructions would there be if computers were redesigned by smart people who understood our fabrication capabilities and who knew what we wanted to accomplish using computers, but who didn’t care about backwards compatibility and who haven’t seen our architectures? We can get little glimpses into that world by looking at network processors, DSPs, GPUs, ISA extensions like SSE4 and NEON, extensible processors like Tensilica’s, and others. But still, these are too rooted in the past.

Although the machine code emitted by our compilers is inextricably tied to our processors, perhaps this code can still be leveraged to discover new instructions. As a thought experiment, let’s start with a collection of executables whose performance we care about. Preferably, some of these will have been compiled from programs in Haskell, OCaml, and other languages that are not well-represented in today’s benchmark suites. We’ll run these programs in a heavily instrumented execution environment that creates a dynamic dataflow graph for the computation; the excellent Redux paper shows how this can be done. Next, we’ll need to clean up the dataflow graphs. First, we rewrite processor-specific operations (condition code dependencies, CISC instructions, etc.) into a simpler, portable form. Next, we optimize away as much dead, redundant, and vacuous code as possible, including, hopefully, all irrelevancies such as stack frame manipulation, dynamic linking, and garbage collection. The result — perhaps — will be something beautiful: the essence of the original computation, stripped of all sequential constraints, processor-specific operations, and bookkeeping.

Of course this dataflow graph has some limitations. First, it only encodes the meaning of a single execution of the program. Second, it encodes a lot of incidental facts about the computation such as the bitwidths of all integer operations, the specific hashing methods used, etc. We’ll just have to live with these problems. The Redux paper contains a great example where factorial codes written in C, in Haskell, and in a stack machine are shown to all produce basically the same dataflow graph.
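
To make the idea slightly more concrete, here is a minimal sketch (my own invention, not taken from the Redux paper or any real tool) of the kind of node such a dynamic dataflow graph might be built from: one node per executed operation, with edges pointing at the nodes that produced its operands.

#include <stdint.h>

enum df_op { DF_CONST, DF_ADD, DF_XOR, DF_SHIFT, DF_LOAD, DF_STORE /* ... */ };

struct df_node {
  enum df_op op;          /* processor-neutral operation                    */
  int width;              /* bitwidth of the operation, e.g. 32 or 64       */
  uint64_t value;         /* value this operation produced at runtime       */
  struct df_node *src[2]; /* producers of the operands (NULL for constants) */
};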

So now we have a collection of giant dataflow graphs: one for each execution of each program that we’re interested in. Our goal is to design an instruction set that can compute these dataflow graphs. Trivially, this can be done by partitioning the graphs into very small units of computation corresponding to a RISC instruction set. But that ISA is boring and won’t show any performance wins. To do better we’ll use a search-based strategy to find subgraphs that:

  • occur a large number of times across the collection of dataflow graphs — these are operations that are executed frequently by our workloads
  • contain a lot of parallelism — making them good candidates for hardware acceleration
  • contain a lot of internal symmetry — supporting SIMD-style execution
  • have a small number of dynamic inputs and outputs
  • rely on a small number of constants
  • do not contain dependency chains that are too long — we don’t want to create instructions that are too slow

I think this can be done; none of these properties seems particularly difficult to test for. The major problem necessitating cleverness will be the huge size of the dataflow graphs. We’ll end up with a list of candidate instructions ranked by some fitness function, such as performance or code density. We can build as many of these into our new ISA as we have hardware budget for.
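
As a toy illustration of the ranking step, a fitness function over candidate subgraphs might combine the properties listed above roughly like this (the fields and the weighting are invented purely for the sake of the example):

#include <stdint.h>

struct candidate {
  uint64_t occurrences; /* how often the subgraph appears across all graphs */
  int node_count;       /* number of operations in the subgraph             */
  int critical_path;    /* longest dependency chain, assumed >= 1           */
  int inputs, outputs;  /* dynamic inputs and outputs                       */
};

static double fitness (const struct candidate *c)
{
  double parallelism = (double) c->node_count / c->critical_path;
  double io_penalty  = 1.0 / (1.0 + c->inputs + c->outputs);
  return (double) c->occurrences * parallelism * io_penalty;
}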

Would this method discover saturating arithmetic instructions when applied to signal processing codes? Would it find clever ways to optimize bounds checks and exception handling in high-level programming languages? It’s possible (though I’d be super disappointed) that the new machines are just incidental variations on existing RISC and CISC designs. If this happened, I would suspect that we had failed to abstract away a sufficient number of processor artifacts. Or, perhaps it was a mistake to compile our computations to an existing machine architecture before building the dataflow graphs. Rather, perhaps we should start with a source-level program and its operational semantics, unrolling it into a dataflow graph without any compilation to machine code. This avoids ties to our existing processors, but also risks coming up with graphs that are very hard to map back onto actual hardware. Of course, many languages don’t even have a real semantics, but researchers are diligently working on that sort of thing. An even more difficult option would be to build up the functional representation of a source program (or executable) without running it, but this has the disadvantage of losing the “profile data” that is built into a dynamic dataflow graph — we’d need to add that in separately.

An aspect of this exercise that I find interesting is that it gives insight into what our processors really do. Many years ago (I don’t have a reference handy, unfortunately) a study showed that computers spend most of their time in sorting algorithms. That cannot be true anymore — but what does the average mobile phone do? What about the average supercomputer chip? The average machine in Google or Amazon’s cloud? Of course we know the answers at a high level, but are there surprises lurking inside the computations? I would expect so — it’s tough to take a close look at a complicated system and not learn something new. Are there new instructions out there, waiting to be discovered, that can help these workloads execute more efficiently? I have to think so, at least for workloads that are not well-represented in Intel’s, AMD’s, and ARM’s benchmark suites.

2012 01 23 | Computer Science, Futurist

Wanted: Epitaphs for Hot Topics

Any given research community always has a few hot topics that attract an inordinate number of paper submissions. Sometimes these are flashes in the pan; other times they mature into full-fledged areas having their own workshops and such — but most often they endure for a few years, result in a pile of PhDs, and then slowly die off. After this happens I often find myself with unanswered questions:

  • Were the underlying research problems solved or were they too hard?
  • Did the solutions migrate into practice or are they withering on the academic vine?
  • What lessons were learned?

For example, consider software-based distributed shared memory, where MMU tricks are used to provide a shared address space for a cluster of machines. My former colleague John Carter wrote some of the basic papers in this area and they are hugely cited.
