Dont Repeat Yourself

Context:

Duplication (inadvertent or purposeful duplication) can lead to maintenance nightmares, poor factoring, and logical contradictions.

Duplication, and the strong possibility of eventual contradiction, can arise anywhere: in architecture, requirements, code, or documentation. The effects can range from mis-implemented code and developer confusion to complete system failure.

One could argue that most of the difficulty in Y2K remediation is due to the lack of a single date abstraction within any given system; the knowledge of dates and date-handling is widely spread.

The Problem:

But what exactly counts as duplication? CloneAndModifyProgramming is generally cited as the chief culprit (see OnceAndOnlyOnce, etc.), but there is more to it than that. Whether in code, architecture, requirements documents, or user documentation, duplication of knowledge - not just text - is the real culprit.

Therefore:

The DRY (Don't Repeat Yourself) Principle states:

: Every piece of knowledge must have a single, unambiguous, authoritative representation within a system.

It's okay to have mechanical, textual duplication (the equivalent of caching values: a repeatable, automatic derivation of one source file from some meta-level description), as long as the authoritative source is well known.

For example, in a mixed-language CORBA environment you may choose to treat the IDL definitions as authoritative. These definitions will be used to automatically generate source code files which duplicate the knowledge in the IDL. But that's okay: the IDL form of the knowledge meets the requirements of our definition.

Where different levels of abstraction are involved, a consistent conflict-resolution scheme must be used. This could be as simple as identifying one level as authoritative in all cases, or always deferring to the high level, or whatever, as long as it is consistently applied.

For example, in C++ the interface and implementation for a class are typically specified in separate files, duplicating knowledge. You may consider the header file to be authoritative for the contract of the class as viewed by its clients, and the source code to be authoritative regarding issues of implementation which are hidden by the implementation.

Notes: This principle is similar to OnceAndOnlyOnce, but with a different objective. With OnceAndOnlyOnce, you are encouraged to refactor to eliminate duplicated code and functionality. With DRY, you try to identify the single, definitive source of every piece of knowledge used in your system, and then use that source to generate applicable instances of that knowledge (code, documentation, tests, etc).

The principle also assumes that your projects have a high degree of automation, allowing the generation of the derivative knowledge artifacts whenever required.

-- AndrewHunt

The following conversations get much simpler when we recognize the most important thing to DRY up is...

declarations of behavior

Keeping the structure of a program DRY is harder, and lower value. It's the business rules, the if statements, the math formulas, and the metadata that should appear in only one place. The WET stuff - HTML pages, redundant test data, commas and {} delimiters - are all easy to ignore, so DRYing them is less important.

--PhlIp

The opposite is WET:

We
Edit
Terribly, Tumultuously, Tempestuously, Tenaciously, Too much, Timidly, Tortuously, Terrifiedly...

I think WET also stands for "We Enjoy Typing"

-- DuncanBayne

But:

It may take time and effort to select and/or create a definitive source - see YouArentGonnaNeedIt. -- DickBotting

Hmm. "Every piece of knowledge must have a single, unambiguous, authoritative representation within a system." Is this not a reasonable definition for a Singleton?

No. DRY refers to your source code, not your running program. ThirdNormalForm is the analogous principle for data.

Ugh. I think you're both barking up the wrong trees there. It doesn't refer to source code, nor to the running program. It refers to a system. Needn't have computers involved at all.

In TheArtOfUnixProgramming, EricRaymond calls this the SinglePointOfTruth rule (SPOT).

I agree with this, "in principle". However, taken literally, this principle is impossible to achieve. For example, a function signature is duplicated at the function call site and the function definition site. Even if you don't declare types, the number of arguments is a fact that is duplicated. I prefer to think of getting rid of duplication (i.e. OnceAndOnlyOnce) instead of never having it in the first place, because you keep discovering that something that you didn't think was a duplicate, really is. Also, sometimes it is more trouble than it is worth to eliminate some duplication. Moreover, it is often better to wait til you have several examples of the duplication so you can better see how to eliminate it. -- RalphJohnson

Ralph, that's a really good way of looking at things. My biggest programming headaches have always come from abstractly struggling with "how can I write a good general solution to this problem", even though I only know of one place where it's definitely going to be used. "I'd better think of a general solution now," I think, "or I'll have to copy-and-paste-and-change code!" But it's absurd to try to come up with a general solution without knowing more about the different varieties of the problem that exist (or will exist) in the system.

It's a battle of two really strong urges - OnceAndOnlyOnce vs avoiding PrematureGeneralization. Do I duplicate for now and try to live with the duplication for a while, or violate YagNi and come up with some half-cocked generalized solution? It's a tough one, because almost all programmers hate duplication; it's a sort of primordial programming urge.

However, even though CopyAndPasteProgramming can be expensive to clean up, so is a botched "general solution" - and copy and paste is far cheaper up front. So I also am in favour of temporary duplication, to be refactored when you have a clearer view of the situation. -- MatthewBennett

Alas, just exactly how I thought about it. Now the temporary duplication is a hellish permanent duplication. I have no time to remove it, and the more I have to update the evil twins of code, the less time to ever do something about it i get. -- L.A.Nilsson

(For more in the vein of "let a duplication live for a while, in order to see how to best eliminate it", see CodeHarvesting.)

I think there is also a collision with DoTheSimplestThingThatCouldPossiblyWork where I believe that latter should win most of the time (aside from the situations where refactoring is too complicated, database schemas?)

The DRY principle is really a philosophical issue (although it does rear its head retroactively during coding). The idea is to try to plan ahead to prevent duplication, rather that to waste time removing stuff you've already done. Function signatures are a form of duplication, albeit a fairly minor one, as the compiler should be able to spot times where one side gets out of step. It's the more insidious duplication we hope to stop with DRY. -- Dave

	''Why is this duplication? Neither the protoype alone nor the function definition carries 
	the full message. Without the protoype errors will creep in, and with out the function definition errors 
	creep in. This is not duplication it is virtuous reduncancy which inceases the total error traping power 
	synergisticly. Did I miss the point somewhere ?''

I like this principle. Perhaps related to the idea of being able to trace any piece of coding back to the reasons for its existence and form. For me a key point is the selection of one item as a definitive source for certain purposes.

For example, as in the above description, I'd normally take a function prototype/header/declaration as guiding the function calls. This is why I like header files to contain comments describing the contract (make sure the first argument points at real storage and this call will do ..... for you). The function bodies are definitive for the implementation decisions but might be derived from more definitive generic algorithms perhaps? -- DickBotting

If code does not conform to the DRY principle can we say that it is wet?

I certainly like this principle. The applications in embedded systems should be obvious to the casual observer. Can the pump dispense medication or not? Only one entity knows the absolute answer to this puzzler. Every component that needs to know what the machine's current status is goes to a well-known method (a resource) to find out. What units are we dealing with, ml or mg/cc? Each component may need a local copy of this setting to figure out how to convert, but they all go to a central repository to acquire that knowledge. Most excellent, wouldn't you say?

Let's try a simple XP coding example:

-- are all these sqrt examples really making the point? Should we delete them or move them? -- AnonymousCoward

Agreed the square root example kinda went off on a tangent, the discussion itself wasn't too bad but it doesn't belong here --AndrewRicketts {be bold about your opinions}

P1: OK, we need a square root function. Lets code the first test case

	test1: assert_equal(sqrt(4), 2);

P2: and the code:

	sub sqrt(x) { return 2; }

P1: Hmm. OK, perhaps we need another test to get a better implementation

	test2: assert_equal(sqrt(9), 3);

P2: OK

  sub sqrt (x) { if (x==4) { return 2 } elseif (x==9) { return 3 } }

P1: You can't do that, you're just repeating the information in the test: it violates the DRY principle

P2: That doesn't matter. Code must only obey OAOO. DRY is just philosophical claptrap. The tests don't provide enough information to generalize, so the simplest thing is to duplicate the information.

P1: No, it does matter. A test and its code must use different algorithms when expressing the same information. If they are the same then the test adds no value.

Who is correct? What implementation would enable the tests + code to conform to DRY?

''Here's my resolution for this particular example:

Start with a good statement about the process to design and test.

Documented requirement:

produce the square root of (x).
Where sqrt(x)*sqrt(x) = x.
Error or imaginary numbers must be returned for x < 0.
x = 0 should return 0.
Inserted requirement: x is in the Real Number Domain.
Inserted performance requirement: Function Overflow must be considered and reported as an error.
Inserted performance requirement: As Fast As Possible/Do Not Bottle Neck.

Then write your test:

	for(y=1 to 100) {
		x = random_in_range(1, some_max) -- why not replace 1 by some_min - possibly large and -ve?
		assert_equal(x,sqrt( (x*x) ))

you might also want to consider assert_equal(x,sqrt(x)*sqrt(x)) which is a different proposition. -- DavidMartland

	}
	assert_error_or_imaginary(sqrt(-1))
	assert_zero(sqrt(0))

notice that the requirement does not specify an upper domain. So modified the test for random values > 0. This way Mr. Guru Coder doesn't get lazy and implement any complex if statements. If statement nothing. A guard clause and a table lookup. -- MrGuruCoder. I happen to know from experience that such a table tends to get expensive with 32-bit and larger integers (several optimizations and I couldn't get a PrimeNumber table below a quarter gig before giving up - but hey, what a weekend of coding).
finally code and test the sqrt function.

-- WyattMatthews

[Noted in passing: This test doesn't seem to do much toward verifying that the code works when the argument to sqrt is not an integer. Indeed, if x is a floating-point number that is not an integer, the square root of x can never be represented exactly as a floating-point number (Counterexample: sqrt(0.25) = 0.5 is perfectly representable in binary floating point; RV) (because the square root of a non-integer is always irrational - definitely wrong: sqrt(0.01) == 0.1, where 0.01 is non-integer and 0.1 is rational -- MrMathematician). So testing a square-root program outside the integer domain is a much harder problem. -- AndrewKoenig]

[Second note in passing: I misspoke when I said that the square root of a non-integer is always irrational. What I meant to say is that the square root of an integer is either an integer or irrational. Now, any floating-point value x can be multiplied by an even power of 2 to yield an integer. Call the smallest such number N. Then sqrt(x*N) == sqrt(x) * sqrt(N), and because N is an even power of 2, sqrt(N) is a power of 2. Because x*N is an integer, sqrt(x*N) is either an integer or irrational. If sqrt(x*N) is an integer, then sqrt(x) is an exact floating-point number, and it can be represented with half the significant bits of x. If sqrt(x*N) is irrational, then sqrt(x) must also be irrational, so it cannot have an exact floating-point representation.

So what I'm trying to say is that if you take the square root of a floating-point number, then either the result is exact, and has lots of low-order zeroes, or it is a necessarily inexact approximation to an irrational. -- AndrewKoenig]

[The assertion that sqrt(x) can be represented with half the significant bits of x is not always precisely correct.]

Thanks to the comment inserted and the block below, I will note, that I didn't write the code for the sqrt function - a simple table may still suffice to bypass my domainless test as it may be proven that sqrt(x) = sqrt(y)*sqrt(z). Anyone up on a proof that this might (or might not) work? Also, instead of doing a stupid integral search (as shown below - it is only stupid because it does not check for y*y > x and break) a binary search of the Real Numbers Domain will at least afford some additional speed.

-- WyattMatthews

If x is an integer...

 int sqrt(int x) {
	int y = 1;
	while (y*y != x) ++y;
	return y;
 }

I'd like to see a proof that this will halt when x is an odd number other than 1.

Here's your proof: If x=9, then: y=1, y*y=1; y=2, y*y=4; y=3, y*y=9. returns y=3

OK - but it doesn't halt if x is odd and also not a square. -- DavidMartland

More appropriately, you should have asked for a proof that the function will pass the random test if x = random_in_range(1, some_max) can specify non-integral values or values <=0 with respect to the requirement. As it stands, this function is not coded to survive a large portion of the Integral Domain, much less the Real Numbers Domain. This function should be renamed to something such as "IsXaPerfectSquare?" and modified a little.

At the least, the stated requirement wants:

error_code sqrt(int x, double result) {

	error_code res;
	double test;

	if(x<0) res = error_or_imaginary;
	else
	{
		res = no_error;
		if(x=0)
			result = double(0);
		else
		{
			 //Some code to search the real number space.
			 //A binary search to a certain depth for example.
		}
	}

	return res;

}

-- Wyatt Matthews

If you program it in C style, at least make it work in C (pass variables by value, etc)

The Single Authority afforded by the DRY principle and the duplication of the source as a basis for the program environment adds inertia to the system (see RedundancyIsInertia) for the sake of its stability. Here's a little list that I believe shows how RedundancyIsInertia, the DRY principle, OnceAndOnlyOnce, and testing should work together:

the single authority is the interface's signature and documented requirements/goals. (RequirementsAndGoals are another issue)
the authority needs to be readable by any potential direct user. (ie: a textual or graphical representation for the software writer and a binary representation for the software)
the interface's body (the process) may duplicate the signature if and only if the implementation environment requires it.
The process is a fact - It must abide by OnceAndOnlyOnce and test true against the authority.
R&G markers may be placed in the textual copy of the process much as edit notes are sometimes placed there.
the user duplicates only the part of the signature required to establish communications. (the compilation process is allowed to duplicate the interface's process.)
the user of the interface should only need test the interface against the R&G once per life cycle. (see CategoryTesting)

-- WyattMatthews

There is a pattern to the exceptions to the principle. It is ok to have more than one representation of a piece of knowledge provided an effective mechanism for ensuring consistency between them is engaged.

Definitions and declarations of C functions: these are usually in sync because the compiler flags inconsistencies, forcing the programmer to take action.
Unit tests: inconsistency means the tests will fail, again forcing someone to take action.
Autogenerated stuff: periodic regeneration ensures consistency.

Often the effective mechanism is automated but it doesn't need to be. An example is end-user documentation: an automated mechanism for verifying software/documentation consistency is hard to come by (AiComplete?), so you need a human to review this. A mechanism to ensure consistency could in this case be an instruction on what to do before each software release.

-- AndersMunch

DRY doesn't just combat inconsistency, though. Yes, a passed test shows that code and test are consistent, but they may still both be wrong. If the test and the code have the same underlying structure/knowledge, it's more likely that a double fail will make a pass. However, in many functions, it's easy to eliminate the double errors because their inverse is radically different. For example, consider error correcting codes: encoding is just a hash while decoding is a complex operation. Implementing the encoder as a test reliably finds errors in the decoder.

DRY says that one should strive to use a different implementation in test and code to avoid the same error in both causing a false positive. This actually brings up another point - what do you do when the only way of implementing something is identical to the way you test it? I must admit I'm never quite happy when I have to do that. Luckily, it usually only occurs when the operation is so simple that it barely needs testing at all. -- JamesKeogh

See www.computerworld.com/news/2004/story/0,11280,94817,00.html for an amusing demonstration of the evils of duplicate code.

-- AristotlePagaltzis

Posting a link is itself an amusing demonstration of the evils of duplicate code (this link is a valid example). --M Jones

I'm struck by the fact that this is the principle that pushes people to normalize relational databases. It's not without its downside considering the reasons why people choose star schemas. There needs to be some discussion of when it is OK to abandon this rule. -- meta@pobox.com

Actually, normalized databases rely on data duplication and one usually has to write a set of constraints to ensure that keys linking tables are consistent.

	  HUH ? The fact that the same  data is in  the FK and the PK of the linked tables is not dupliction it is a 
	  necessary method of implementing the link.

Meta's comment about databases reminds me of the fact that the schema of the database and the code that access the database are two representations of data structures. Ideally, both the code and the database schema would be driven off the same source of data structure information.

This is the attraction of Ruby's ActiveRecord and the reason I hate Hiberate and Struts. Both Hiberate and Struts have duplicate information that needs to be in sync. Without something like XDoclet they will get out of sync and will violate DRY. SaS

Merely hoping to learn what "DRY" is, I came. The thought-inspiring philosophical discussion was too much, merely to read.

Epidemiology (in short, the study of disease and disease distribution in populations) promulgates the idea that a given disease has single cause -- a root cause. In systems, Root Cause Analysis inevitably takes us back to a design error and this is often traceable to an error in the requirements definition. (Data from the Software Engineering Institute (SEI) at Carnegie-Mellon University (CMU) shows a number over 60%, for the percentage of errors caused within the steps part of Requirements Definition.)

Earlier computer engineering concepts (say, early-1980s) had matured sufficiently that students (and managers) might believe people other than those who are anti-social can be as good (or better) than a "guru". The coming generation of engineers/programmers would value writing their ideas even if they didn't work for IBM. Some might even plan ahead. This allowed design to avoid circumstantial redundancy. The DoD-funded work at CMU that gave rise to the SEI in the 1980s revealed that "best practices" begin with "planning" of some sort.

To sum up what can easily occupy volumes: We need to learn from our mistakes. This requires conscious effort and time-commitment, which makes it rare. No acronym can represent every project, so a cheat sheet might simply read as, "DRY". This is redundant with the starting post -- but I didn't write it which means DRY has not been violated. (Though this contrib /may be/ a bit dry.)

-- Kernel.package

OnceAndOnlyOnce is hard to apply when consumers appreciate PutThingsWhereYouLook. And consumers look for everything everywhere. Translation, "Why isn't this email a link I can click. Your email should be on the contacts page, the user page, the help page, and the home page." -- LeeLouviere

It must be visable everywhere. It can live in only on place. A view is not the data.

Customer requests trump DRY. If source code on the inside is dry, then those email clickers will all use the same code, trivially plugged into several different view locations --PhlIp

Hmm. Let me rephrase. PutThingsWhereYouLook (applied to the concept of putting things everywhere the user might look), may lead to accidentally breaking OnceAndOnlyOnce, and can accidentally be viewed as contrast to DontRepeatYourself. However, you should really implement a program data as uncoupled from the interface. Don't confuse DRY to mean that information can only have one location as opposed to one source.

I take the approach to design data and interface as if they were separate programs, then have their "communication" derived last. That way I ensure that data is implemented "native" to the machine language of simplicity, and the user interface is designed as most conforming to how the user would use it; meet the user demand of PutThingsWhereYouLook and the myriad of different arrangements of the data to accommodate every users distinct way of understanding data. (EditMe need a better term here, native is bad, communication is horrible.) -- LeeLouviere

If this principle exists at all it is more likely to be called 'eliminate needless redundancy' than 'do not repeat yourself.' A = A is not a redundant statement. A is not needless repetition. A= B is not the same as A=A and A by itself does not always mean A=A if it is test and not an analytic truth. There are numerous instances of claims of repetition in the discussion above where removing one of the instance of the alleged duplication will change the meaning and behavior of a system. If you can not remove one of the duplicates without changing meaning it is not duplication.

The notion that normalization requires duplication because the keys must exist in both tables, suggest that removing one of the dubs is possible but it is not.

Even If I conceded that duplication is a better term than redundancy the key is that it is only needless duplication that must be purged. The acid test for needless, for those whose intuition is not a sufficient guide is this. If you can't remove it with out loosing functionality it is not needless.

-- Annon

OK, but a lot of functionality should still be generalized in many cases since they're very similar, and there could be a better (more elegant) implementation of it. -- Anonymous

See: UnnecessaryHolography, OnceAndOnlyOnce

CategoryModelingLawsAndPrinciples, CategoryExtremeProgramming, CategoryPlanning, CategorySimplification