
The Epistemology of Truth

January 9th, 2011

Every person who deals with data and data integration, especially at large scale, sooner or later faces a key but scary choice: deciding whether truth is discovered or invented.

Sure, there are various shades between those options, but either you believe in a metaphysical reality of absolute truth that you just have to find a way to discover, or you don’t, and what’s left is human creation: social contracts, collective subjectivity distilled by hundreds of years of echo chambers.

Where you stand on this choice dramatically influences how you think, how you work, who you like to work with, and what efforts you feel drawn to and want to be associated with.

What is also surprising about this choice is how easy it is to make: just like with religion, you either have faith in something you can only aspire to know, or you don’t. And neither side can really understand how the other fails to see what seems so obvious.

This debate about invention vs. discovery, objectivity vs. subjectivity, physical vs. metaphysical, embodied vs. abstract has been raging for thousands of years and takes many forms, but what’s striking is how divisive it is and how incredibly persistent over time, like a sort of benign and widespread memetic infection (a philosophical cold, if you allow me).

What interests me about this debate is its dramatic impact on knowledge representation technologies.

The question of truth seems easy enough at first, but it gets tricky very quickly. Let me show you.

Let’s start with an apparently bland and obvious statement:

Washington, D.C. is the capital of the United States of America

True or false? Most people would say true without even pausing to think, but people who have dealt with knowledge modeling problems will probably ask “when?” When is this statement supposed to be true: now, or at some other time? So, for example:

Washington, D.C. is the capital of the United States of America in 1799

True or false? It gets trickier. One has to define what “capital” means and has to know enough history to understand that the newly formed government of the US was actually assembling in Philadelphia that year. But one could very well claim that Washington, D.C. was being built and therefore was already the capital even if nobody was living there yet.

As you can see, something as factual, benign and obvious as knowledge every elementary school kid knows by heart can quickly become tricky to model symbolically in a way that encompasses disagreements and captures all the nuances of interpretation.
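
To make the problem concrete, here is a minimal sketch, in Python, of what it takes to qualify even one “obvious” assertion with a validity interval. The class, the field names and the choice of 1800 as a start year are illustrative assumptions of mine, not the schema of any particular system:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Assertion:
    """A single statement, qualified with the context needed to judge it."""
    subject: str
    predicate: str
    obj: str
    valid_from: Optional[int] = None   # year the statement starts being true
    valid_until: Optional[int] = None  # year it stops being true (None = still true)

    def holds_in(self, year: int) -> bool:
        """Is this assertion claimed to be true in the given year?"""
        after_start = self.valid_from is None or year >= self.valid_from
        before_end = self.valid_until is None or year <= self.valid_until
        return after_start and before_end

# The "obvious" fact, now carrying the temporal context it silently assumed.
# 1800 is when the US federal government actually moved to Washington; whether
# the assertion should instead start in 1790 (when the site was designated) is
# exactly the kind of interpretive nuance discussed above.
capital = Assertion("Washington, D.C.", "capital of", "United States of America",
                    valid_from=1800)

print(capital.holds_in(2011))  # True
print(capital.holds_in(1799))  # False, unless you pick the other interpretation
```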

But there are cases where one statement rings more clearly true or false than another:

Washington, D.C. was the capital of the Roman Empire

Given that the Roman Empire ended before Washington ever existed, it is operationally safe to assume this one is false. And yet there are statements whose truth is unknown or unknowable:

Washington, D.C. is the capital of the United States of America in 4323

and statements (/me winks at Gödel) whose validity we are certain can’t be known:

This statement is false

Unfortunately, when you model knowledge and try to condense it into a form that is mechanically digestible and symbolically consistent, finding a sustainable operational definition of truth becomes unavoidable.

This is where the epistemological debate on truth enters the picture: if you think that truth is discovered, you won’t accept compromises; you want a process that finds it. But since a metaphysically existing truth is probably never reachable, it is extremely difficult to embrace this philosophy and still obtain an actionable definition of truth.

On the other hand, for those who believe that truth is merely distilled subjectivity and a series of ever-evolving collective social contracts, one solution is to avoid thinking of truth in absolutes and instead treat statements as ‘true enough’. Examples of ‘true enough’ are “true enough to be understood and agreed upon by a sufficiently large number of people”, “true enough to pass the scrutiny of a large enough group of people considered experts in the domain”, “true enough to remain in place over years of public edits in a highly visible and publicly edited wiki”, and so on.

This ‘true enough’ modus operandi makes it possible to build knowledge representations that are useful for certain tasks, that model the world and answer questions in a way that rings true to users often enough to build trust in all the other results they can’t judge directly.
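
As a rough illustration of what such an operational definition might look like in practice, here is a minimal sketch; the function name and threshold values are hypothetical, meant only to show that ‘true enough’ is an acceptance rule over scrutiny rather than a truth predicate:

```python
def true_enough(endorsements: int, rejections: int,
                min_reviewers: int = 5, min_agreement: float = 0.8) -> bool:
    """Accept a statement once enough reviewers have looked at it and a
    large enough fraction of them agree. This is an operational stand-in
    for truth, not a claim about truth itself."""
    total = endorsements + rejections
    if total < min_reviewers:
        return False  # not enough scrutiny yet to say anything
    return endorsements / total >= min_agreement

# "Washington, D.C. is the capital of the USA" after plenty of review:
print(true_enough(endorsements=97, rejections=3))  # True
# A contested edit that half the reviewers push back on:
print(true_enough(endorsements=6, rejections=6))   # False
```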

The operational definition of truth as a byproduct of emergent coordination has the huge benefit of being actionable, but it has the equally huge problem of degrading rapidly with the number of people who have to share the same knowledge and understanding. This is simply a result of the combination of the finite resources available to people (energy, time, communication bandwidth) and the inherent variability of the world.

While this is so obvious as to be almost tautological, it is nevertheless the biggest problem with knowledge representation: the number of assertions that can be considered true, independently and without coordination, decreases rapidly with the number of people who need to naturally resonate with their truth. Even if we find a way to exclude the fraction of the population made up of divergent naysayers or the plain ignorant, the trend remains the same: the larger the population, the harder it is for everyone to agree on something.
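
A back-of-the-envelope calculation shows why the degradation is so rapid: if each person independently accepts an assertion with some high probability, the chance that an entire population accepts it decays exponentially with its size. The probability used below is purely illustrative:

```python
# If each person independently accepts an assertion with probability p,
# the probability that all N people accept it is p**N, which collapses
# quickly even for assertions that almost everyone would accept.
p = 0.99  # illustrative: a statement 99% of individuals would accept

for n in (10, 100, 1_000, 10_000):
    print(f"population {n:>6}: everyone agrees with probability {p**n:.2e}")

# prints roughly 0.90, 0.37, 4.3e-05 and 2.2e-44 respectively
```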

Yet, such ‘eigenvectors of knowledge’ are incredibly valuable as they form the conduit, the skeleton, upon which the rest of discourse and data modeling can take place; the plumbing on top of which integration can flow and exchange of knowledge over information can happen.

Natural languages are examples of such sets of culturally established eigenvectors of knowledge. They are not modeled strictly or centrally, but they have rules and patterns, a core of which we all pretty much share when we know the language, and that shared core is what allows my brain to inject knowledge all the way into yours right this very moment (which is nothing short of incredible, if you stop and think about it).

There is an implicit assumption out there, shared by many who work to build large-scale knowledge representation systems, that it is possible to bypass this problem by focusing on modeling only “factual” information: that is, information that descends directly from facts and therefore would be true in both views of the world, emergent and metaphysical.

Unfortunately, as I showed above, it’s relatively easy to blur the ‘factual’ quality of an assertion by simply considering time, the world’s inherent variability and cultural biases of interpretation.

So, if we can’t even model truth, what hope is left of modeling human knowledge in symbolic representations well enough that computers can operate on and interrogate it on our behalf and generate results that human users would find useful?

Personally, I think we need to learn from the evolution of natural languages and stop thinking in terms of ‘true/false’ or ‘correct/incorrect’ and focus merely on practical usefulness instead.

Unfortunately, this is easy to say but incredibly hard to execute, as engineers and ontologists alike have a natural tendency to abhor the amount of doubt and uncertainty that such systems would have to model alongside the information they carry in order to follow the principles I outlined above.

Also, users won’t like to receive answers that have associated truth probabilities.

Think about it: if you asked such a system “Is Elvis Presley dead?” and the answer you got was “yes (98%) / no (2%)”, would you be satisfied? Would you use the system again, or would you rather prefer a system that just told you “yes”, hid all the doubts from you and made your life easier?

Last but not least, people in this field hate to talk about this problem, because they feel it brings too much doubt into the operational nature of what they’re trying to accomplish and undermines the entire validity of their work so far. It’s the elephant in the room, and everybody wants to avoid it, hoping that with enough data collected and usefulness provided it will go away, or at least won’t bother their corner.

I’m more and more convinced that this problem can’t be ignored without sacrificing the entire feasibility of useful knowledge representation, and even more convinced that the social processes needed to acquire, curate and maintain the data over time need to be at the very core of the design of such systems, not a secondary afterthought used to reduce data maintenance costs.

Unfortunately, even the most successful examples of such systems are still very far away from that… but it’s an exciting problem to have nonetheless.

Permalink | Posted in Article
 

On The Impact of Damage non-locality in Incentive Economies around Data Sharing

June 17th, 2010

For centuries, it was common for scientists to exchange ideas through epistolary discussions. These days, remotely located scientists collaborate via email, or exchange digital documents when they don’t meet face to face. These are far faster and easier to exchange than hand-written letters sent via postal services. Unfortunately, they still retain that ‘after the fact’ property: they are often revealed only when some scholar later decides they were important enough to dig out and organize.

With that in mind, I find myself excited every time I get the chance to participate in ‘blog rebuttals’ like the ones that David Karger and I have been having lately about the requirements, motives and incentives for people to share structured data on the web. Both of us care a great deal about this problem and we still cross paths and cross-pollinate ideas even after I left MIT. We also have very different backgrounds, but they overlap enough that we can understand each other’s language even when we try to explain our own (sometimes still foggy) thinking.

It is a rare situation when people from different backgrounds cross paths and earn each other’s respect. It is even rarer when their discussions are aired publicly as they are happening; this creates a very healthy and stimulating environment not only for the participants but also for readers.

In any case, the point of contention in the current discussion is why people would want to share structured data and what can facilitate it.

It seems to me that the basic (and implicit) assumption of David’s thinking is that because a web of hyperlinked web pages came to exist, it would be enough to understand why it did, replicate the technological substrate (and its social lubrication properties), and the same growth would apply to a different kind of content.

I question that assumption and I’m frankly surprised that questioning whether the nature of the content can influence the growth dynamics of a sharing ecosystem makes him dismiss it as being related to a particular class of people (programmers) or to a particular class of business models (my employer’s).

It might well be that David is right and the exact same principles apply… but it seems a rather risky thing to take for granted. People post pictures on public sites, write public tweets, contribute to Wikipedia, write public blogs, and create personal web sites; all this is shared and all this is public. These are facts. They don’t publish nearly as much structured data, and this is another fact. But believing that people would do the same with structured data if only there were technology that made it easier or made it transparent is an assumption, not a fact. It implicitly assumes that the nature of the content being contributed has no impact on the incentive economies around it.

And it seems to me a rather strong assumption considering, for example, that it doesn’t hold true for open sharing of software code.

Is it because software programmers are more capricious about sharing? Is it because what’s being shared is considered more valuable? Or is it because the incentive economies around sharing change dramatically when collaboration becomes a necessary condition to sustainability?

Could it be that sharing for independent and dispersed consumption (say, a picture, a tweet, a blog post) is governed by economies of incentives that are different from those governing sharing for collaborative and reciprocal consumption (say, software source code, Wikipedia, designs for Lego Mindstorms robots or electronic circuitry)?

I am the first to admit that it is reasonable to dismiss my questioning as philosophical or academic, or too ephemeral to provide valuable practical benefits, but recent insights that crystallized collectively inside Metaweb (my employer) make me think otherwise. The trivial, yet far-reaching insight is this:

the impact of mistakes in hypertext is localized,
while the impact of mistakes in structured data or software is not

If somebody writes something false, misleading or spammy on a web page, that action impacts the perceived value of that page but it doesn’t impact any other page. Pages have different relevance depending on their location or rank, so the negative impact of that action changes depending on the page’s importance. But the ‘locality of negative impact’ property remains the same: no other page is directly influenced by that action.

This is not true for data or software: a change in one line of code, or one structured assertion, could potentially trigger a cascading effect of damage.
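
As a toy illustration (the data and names below are entirely made up, and no real system is implied), consider how a single wrong assertion in a small knowledge base corrupts every answer derived from it, in a way that a single bad paragraph on a web page never does:

```python
# A toy knowledge base: company -> country of headquarters, plus a view
# derived from it. One wrong assertion silently corrupts every answer
# computed downstream; the damage is not confined to the bad "line".
headquarters = {
    "AcmeCorp": "Germany",
    "Globex": "Germany",
    "Initech": "France",   # suppose this single assertion is wrong (it is really Italy)
}

def companies_in(country: str) -> list[str]:
    """A derived answer built on top of the raw assertions."""
    return [c for c, hq in headquarters.items() if hq == country]

print(companies_in("France"))  # ['Initech'] -- wrong
print(companies_in("Italy"))   # []          -- the correct answer disappeared
```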

This explains very clearly, for example, why there are no successful software projects that use a Wikipedia model for collaboration and allow anybody who shows up to modify the central code repository.

Is that prospect equally unstable for collaborative development over structured data? Or is there something in between, some hybrid collaboration model that takes the best practices from the wiki model (which shines at lowering the barrier to entry) and the open software development model (which manages to distill quality in an organic way)?

I understand these questions don’t necessarily apply to the economy of incentives of individuals wanting to publish their structured datasets without the need for collaboration, but I present them here as a cautionary tale about taking the applicability of models for granted.

More than programmers vs. professors, I think the tension between David and me is about the nature of our work: he’s focusing on facilitating the sharing of results from individual entities (including groups), while I’m focusing on fostering collaboration and catalyzing network effects between such entities.

Still, I believe that understanding the motives and the incentive economies around sharing, even for purely individualistic reasons, is the only way to provide solutions that meet people’s real needs. Taking them for granted is a very risky thing to do.

Permalink | Posted in Commentary
 

Drivers vs. Enablers

June 5th, 2010

I’ve heard many people say that the web exists because of “view source”.

“view source”, if you don’t know what I mean, is the ability that web browsers have to show you the source HTML of the web page you are currently browsing. If you ask around, pretty much everybody who worked on the web early on will tell you that they learned HTML by example, by viewing the source of other people’s pages. Tricks and techniques were found by somebody, applied, and spread quickly.

There is wide and general consensus that ‘view source’ was instrumental in propagating knowledge and simplifying the adoption of the web as a platform, yet its role is often confused.

“view source” was an enabler, a catalyst: something that makes it easier for a reaction or a process to take place and thus increases its rate, effectiveness, adoption, or whatever metric you want to use.

But it is misleading to mistake “view source” for a driver: something that makes it beneficial and sustainable for the process to take place. The principal driver for the web was the ability for people to publish something to the entire world with dramatically reduced startup costs and virtually zero marginal costs. “view source” made things easier and reduced those startup costs, but it had nothing to do with lowering marginal costs and very little to do with the intrinsic world-wide publishing features of the web.

You might think that the current HTML5 vs. Flash debate is what’s sparking these considerations, but it’s not: it’s something that Prof. David Karger wrote about my previous post (we deeply enjoy these blog-based conversations). He suggests that while my approach of looking for sustainable models for open data contributions is good and worthwhile, a more effective strategy might be to convince the tool builders to basically add a “view source” for data; once that is in place, we wouldn’t have to care, as the data would be revealed simply by people using the tools.

It’s easy to see the appeal of such a strategy: the coordination costs are greatly reduced, as you have to talk to and convince a much smaller population, composed entirely of people who already care about surfacing data and see potential benefits in further adoption of their toolsets.

On the other hand, it feels to me that this confuses enablers with drivers.

The order in which I pose questions in my mind when engineering adoption strategies is normally “why” then “how”: taking for granted that because you have a driver, everybody else must share it or have a similar one can easily lead you astray. The question of motive, of “what’s in it for me?”, might feel materialistic, un-intellectual and limiting, but an understandable and predictable reward is the basis for behavioral sustainability.

David is basing his thoughts around Exhibit, and I assume he considers the driver to be the tool itself and its usefulness: it takes your data and presents it neatly and interactively without you having to do much work or bother your IT administrators to set up and maintain server-side software. That’s appealing, that’s valuable and that’s easy to explain.

The enabler for the network effect is that “cut/paste data” icon that people can click to obtain the underlying data representation of the model… and do whatever they want with it.

But here is where things start to get interesting when you consider drivers and enablers separately: ‘view source’ was a great enabler for the web because it was useful for other people’s adoption but didn’t impact your own adoption drivers. The fact that others had access to the HTML code of your pages didn’t hurt you in any way, mostly because the complexity of the system was locked away on your end, in your servers, and your domain name is something you control and they can’t replicate. What you had access to was a thin surface of a much more complicated system running on somebody else’s servers. It was convenient for you and your developers to have that view source, and the fact that others benefited from it posed no threat to you.

This is dramatically different in the Exhibit situation (or in many other open data scenarios): not only can you take the data with you, you can take the entire exhibit. Some people are not bothered by this, but you can assume that normal people get a weird feeling when they think that others can just take their entire work and run with it.

This need to ‘prevent people from benefiting from your work without you benefiting from theirs’ is precisely the leverage used by reciprocal copyright licenses (the GPL first, CC Share-Alike later) to promote themselves, but there is nothing in the Exhibit adoption model that addresses this issue explicitly.

If your business is to tell or synthesize stories that emerge from piles of data (journalists, historians, researchers, politicians, teachers, curators, analysts, etc.), we need to think about a contribution ecosystem where sharing your data benefits you in a way that is obvious for you to understand (and to explain to your boss!). Or, as David suggests, a ‘view source’-style model where the individualistic driver is clear and obvious and the collaborative enabler is transparent, meaning that it doesn’t require extra work and is not perceived as a threat to that individualistic driver.

The thing is: with Exhibit, or with any other system that makes the entire dataset available (this includes Freebase), the immediate perception people have is that making their entire dataset available to others clearly benefits others but doesn’t seem to offer clear benefits to them (which was the central issue of my previous post).

Sure, you can try to guilt-trip them into releasing their data (cultural pressure) or use reciprocal licensing models (legal pressure), but really, the driver that works best is when people want to collaborate with one another (or are not bothered by others building on their work) because they immediately perceive value in doing so.

Both Exhibit and Gridworks were designed with the explicit goal of being, at first, drivers for individual adoption (so that you have a social platform to work with) and potential enablers for collaborative action later (so that you can experiment with trying to build these network effects); but a critical condition for the collaborative enabler is that it must not reduce the benefit of individual adoption, or it will undermine its own ability to drive network effects.

Think for a second about a web where a ‘view source’ command in a browser pulled the entire codebase out of the website you’re visiting: do you really think it would have survived this long? Remember how heated the debate was when GPLv3 was proposed to contain reciprocal constraints even for software that was merely executed and not redistributed (which would have impacted all the web sites and web services that are now exempt)?

It is incredibly valuable to be inspired by systems and strategies that worked in the past and by the dynamics that made them sustainable… but we must do so by appreciating both the similarities and the differences if we want to be successful in replicating their impact.

Counterintuitively, what might be required to bootstrap a more sustainable open data ecosystem is not being more open but less: building tools that focus first on protecting individual investments, and then on fostering selective disclosure and collaboration over the disclosed parts.

We certainly can (and did) engineer systems that act as Trojan horses for openness (Exhibit is one obvious example), but they have so far failed to create sustainable network effects because, I think, we have not yet identified the dynamics that entice stable and sustainable collaborative models around data sharing.

Permalink | Posted in Article, Commentary
 