
The Epistemology of Truth

January 9th, 2011

Every person who deals with data and data integration, especially at large scales, sooner or later faces a key but scary choice: deciding whether truth is discovered or invented.

Sure, there are various shades between those options, but either you believe in a metaphysical reality that is absolute truth, and you just have to find a way to discover it, or you don’t, and what’s left is just human creation: social contracts, collective subjectivity distilled by hundreds of years of echo chambers.

Deciding where you stand on this choice influences dramatically how you think, how you work, who you like to work with and what efforts you feel drawn to and want to be associated with.

What is also surprising about this choice is how easy it is to make: just like with religion, you either have faith in something you can only aspire to know, or you don’t. And both sides can’t really understand how the other can’t see what’s so obvious to them.

This debate about invention vs. discovery, objectivity vs. subjectivity, physical vs. metaphysical, embodied vs. abstract has been raging for thousands of years and takes many forms, but what’s striking is how divisive it is and how incredibly persistent over time, like a sort of benign and widespread memetic infection (a philosophical cold, if you’ll allow me).

What interests me about this debate is its dramatic impact on knowledge representation technologies.

The question of truth seems easy enough at first, but it gets tricky very quickly. Let me show you.

Let’s start with an apparently bland and obvious statement:

Washington, D.C. is the capital of the United States of America

True or false? Most people would say true without even pausing to think, but people who have dealt with knowledge modeling problems will probably ask “when?”: when is this statement supposed to be true? Now, or at some other time? So, for example:

Washington, D.C. is the capital of the United States of America in 1799

True or false? It gets trickier. One has to define what “capital” means and know enough history to understand that the newly formed government of the US was actually assembling in Philadelphia that year. But one could very well claim that Washington, D.C. was being built and was therefore already the capital, even if nobody lived there yet.

As you can see, something as apparently factual, benign and obvious as knowledge every elementary school kid knows by heart can immediately become tricky to model symbolically in a way that encompasses disagreements and captures all nuances of interpretation.
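
To make the problem concrete, here is a minimal sketch of what temporal qualification alone forces on a modeler. The names and the 1800 cutoff are my own illustrative choices, not any particular system’s schema: every “fact” becomes a claim with a validity interval, and the query layer has to decide what to answer when the interval is incomplete or contested.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass(frozen=True)
    class Claim:
        subject: str
        predicate: str
        value: str
        valid_from: Optional[int] = None  # year; None means unknown/unbounded
        valid_to: Optional[int] = None

    def holds_at(claim: Claim, year: int) -> Optional[bool]:
        """Return True/False when the interval settles it, None when it can't."""
        if claim.valid_from is not None and year < claim.valid_from:
            return False
        if claim.valid_to is not None and year > claim.valid_to:
            return False
        if claim.valid_from is None and claim.valid_to is None:
            return None  # no temporal knowledge at all
        return True

    # Congress first convened in Washington in 1800, but one could argue for
    # 1790, when the site was designated (the model forces us to pick a side).
    capital = Claim("Washington, D.C.", "capital of", "United States of America",
                    valid_from=1800)

    print(holds_at(capital, 1799))  # False (unless you model the 1790 view)
    print(holds_at(capital, 2011))  # True, as far as this model knows
    print(holds_at(capital, 4323))  # True! the model cannot say "unknowable"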

But there are cases where one statement rings more true (or false) than another:

Washington, D.C. was the capital of the Roman Empire

Given that the Roman Empire ended before Washington ever existed, it is operationally safe to assume this one to be false. And yet there are statements that are unknown or unknowable:

Washington, D.C. is the capital of the United States of America in 4323

and statements (/me winks at Gödel) whose validity we are certain can’t be known:

This statement is false

Unfortunately, when you’re modeling knowledge and trying to condense it into a form that is mechanically digestible and symbolically consistent, finding a sustainable operational definition of truth becomes unavoidable.

This is where the epistemological debate on truth actually manages to enter the picture: if you think that truth is discovered, you won’t accept compromises; you want a process that finds it. But since a metaphysically existing truth is probably never reachable, it is extremely difficult to embrace this philosophy and still obtain an actionable definition of truth.

On the other hand, for those who believe that truth is merely distilled subjectivity and a series of ever-evolving collective social contracts, one solution is to avoid thinking of truth in absolutes and instead consider statements as merely ‘true enough’. Examples of ‘true enough’ are “true enough to be understood and agreed upon by a sufficiently large number of people”, “true enough to pass the scrutiny of a large enough number of people considered experts in the domain”, “true enough to survive years of public edits in a highly visible and publicly edited wiki”, etc.

This ‘true enough’ modus operandi allows one to build knowledge representations that are useful for certain tasks, model the world, and answer questions in a way that rings true to their users often enough to build trust in all the other results the users can’t judge directly.
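
To sketch how crude such operational definitions look in code (the signals and thresholds below are entirely invented for illustration), a ‘true enough’ test is really just a disjunction of social-contract heuristics:

    def true_enough(public_endorsements: int,
                    expert_endorsements: int,
                    days_unchallenged_on_wiki: int) -> bool:
        """Accept an assertion if it passes any one of the social tests
        from the text; all thresholds are arbitrary and tunable."""
        return (public_endorsements >= 1000           # broad agreement
                or expert_endorsements >= 5           # expert scrutiny
                or days_unchallenged_on_wiki >= 365)  # survived public edits

    # A statement vetted by 7 experts but unknown to the public still passes:
    print(true_enough(12, 7, 0))  # True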

The operational definition of truth as the byproduct of emergent coordination has the huge benefit of being actionable, but it has the equally huge problem of degrading rapidly with the number of people who have to share the same knowledge and understanding. This is simply a result of combining the finite resources available to people (energy, time, communication bandwidth) with the inherent variability of the world.

While this is so obvious as to be almost tautological, it is nevertheless the biggest problem with knowledge representation: the number of assertions that can be considered true, independently and without coordination, decreases rapidly with the number of people who need to naturally resonate with their truth. Even if we find a way to exclude the fraction of the population made of divergent naysayers or the plain ignorant, the trend remains the same: the larger the population, the harder it is for everyone to agree on something.
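
The decay is easy to quantify under a toy independence assumption (mine, purely for illustration): if each person independently resonates with an assertion with probability p, the chance that all N of them do is p^N, which collapses quickly even for very agreeable assertions.

    # Probability that an assertion "rings true" to everybody, assuming
    # each of N people independently agrees with probability p = 0.99.
    p = 0.99
    for n in (10, 100, 1_000, 10_000):
        print(n, p ** n)
    # 10      0.904...
    # 100     0.366...
    # 1000    ~4.3e-05
    # 10000   ~2.2e-44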

Yet, such ‘eigenvectors of knowledge’ are incredibly valuable as they form the conduit, the skeleton, upon which the rest of discourse and data modeling can take place; the plumbing on top of which integration can flow and exchange of knowledge over information can happen.

Natural languages, for example, are sets of such culturally established eigenvectors of knowledge. They are not modeled strictly and centrally, but they have rules and patterns, a core of which we all pretty much share when we know the language, and that shared core is what allows my brain to inject knowledge all the way into yours right this very moment (which is nothing short of incredible, if you stop and think about it).

There is an implicit assumption out there, shared by many who work to build large-scale knowledge representation systems, that it is possible to bypass this problem by focusing on modeling only “factual” information; that is, information that descends directly from facts and would therefore be true in both views of the world, emergent and metaphysical.

Unfortunately, as I showed above, it’s relatively easy to blur the ‘factual’ quality of an assertion by simply considering time, the world’s inherent variability and cultural biases of interpretation.

So, if we can’t even model truth, what hope is left of modeling human knowledge in symbolic representations well enough that computers can operate on them and interrogate them on our behalf, generating results that human users would find useful?

Personally, I think we need to learn from the evolution of natural languages and stop thinking in terms of ‘true/false’ or ‘correct/incorrect’ and focus merely on practical usefulness instead.

Unfortunately, this is easy to say but incredibly hard to execute: engineers and ontologists alike have a natural tendency to abhor the incredible amount of doubt and uncertainty that such systems would have to model alongside the information they carry in order to follow the principles I outlined above.

Users also won’t like receiving answers that come with attached truth probabilities.

Think about it: if you asked such a system “Is Elvis Presley dead?” and the answer you got was “yes (98%) / no (2%)”, would you be satisfied? Would you use the system again, or would you prefer a system that just told you “yes”, hid all the doubts from you, and made your life easier?
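
This is why designers of such systems are tempted to collapse the distribution behind a confidence threshold: show the bare answer when the system is sure enough, expose the doubt only when it isn’t. A hypothetical sketch:

    def present(answers: dict, cutoff: float = 0.95) -> str:
        """answers maps candidate answers to probabilities summing to ~1."""
        ranked = sorted(answers.items(), key=lambda kv: kv[1], reverse=True)
        best, confidence = ranked[0]
        if confidence >= cutoff:
            return best  # hide the doubt, keep the user happy
        return " / ".join(f"{a} ({p:.0%})" for a, p in ranked)

    print(present({"yes": 0.98, "no": 0.02}))  # -> yes
    print(present({"yes": 0.60, "no": 0.40}))  # -> yes (60%) / no (40%)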

Last but not least, people in this field hate to talk about this problem, because they feel it brings too much doubt into the operational nature of what they’re trying to accomplish and undermines the entire validity of their work so far. It’s the elephant in the room that everybody wants to avoid, hoping that with enough data collected and usefulness provided it will go away, or at least won’t bother their corner.

I’m more and more convinced that this problem can’t be ignored without sacrificing the entire feasibility of useful knowledge representation, and even more convinced that the social processes needed to acquire, curate and maintain the data over time need to be at the very core of the design of such systems, not a secondary afterthought used to reduce data maintenance costs.

Unfortunately, even the most successful examples of such systems are still very far from that… but it’s an exciting problem to have nonetheless.


Drivers vs. Enablers

June 5th, 2010

I’ve heard people say many times that the web exists because of “view source”.

“view source”, if you don’t know what I mean, is the ability web browsers have to show you the HTML source of the web page you are currently browsing. If you ask around, pretty much everybody who worked on the web early on will tell you that they learned HTML by example, by viewing the source of other people’s pages. Tricks and techniques were found by somebody, applied, and spread quickly.

There is wide and general consensus that ‘view source’ was instrumental in propagating knowledge easily and in simplifying adoption of the web as a platform, yet its role is often confused.

“view source” was an enabler, a catalyst; something that makes it easier for a reaction or a process to take place and thus increases rate, effectiveness, adoption, or whatever metric you want to use.

But it is misleading to mistake “view source” for a driver: something that makes it beneficial and sustainable for the process to take place. The principal driver for the web was the ability for people to publish something to the entire world with dramatically reduced startup costs and virtually zero marginal costs. “view source” made things easier and reduced those startup costs, but it had nothing to do with lowering marginal costs and very little to do with the intrinsic world-wide publishing features of the web.

You might think that the current HTML5 vs. Flash diatribe is what’s sparking these considerations, but it’s not: it’s something that Prof. David Karger wrote about my previous post (we deeply enjoy these blog-based conversations). He suggests that while my approach of looking for sustainable models for open data contributions is good and worthwhile, a more effective strategy might be to convince the tool builders to basically add a “view source” for data; once that is in place, we wouldn’t have to care, as the data would be revealed simply by people using the tools.

It’s easy to see the appeal of such a strategy: the coordination costs are greatly reduced, as you have to talk to and convince a much smaller population, composed entirely of people who already care about surfacing data and see potential benefits in further adoption of their toolsets.

On the other hand, it feels to me that this confuses enablers for drivers.

The order in which I pose questions in my mind when engineering adoption strategies is normally “why”, then “how”: taking for granted that because you have a driver, everybody else must share it or have a similar one can easily lead you astray. The question of motive, of “what’s in it for me?”, might feel materialistic, un-intellectual and limiting, but an understandable and predictable reward is the basis for behavioral sustainability.

David bases his thoughts on Exhibit, and I assume he considers the driver to be the tool itself and its usefulness: it can take your data and present it neatly and interactively without you having to do much work or bother your IT administrators to set up and maintain server-side software. That’s appealing, that’s valuable, and that’s easy to explain.

The enabler for the network effect is that “cut/paste data” icon that people can click to obtain the underlying data representation of the model… and do whatever they want with it.

But here is where things start to get interesting when you consider drivers and enablers separately: ‘view source’ was a great enabler for the web because it was useful for other people’s adoption but didn’t impact your own adoption drivers. The fact that others had access to the HTML code of your pages didn’t hurt you in any way, mostly because the complexity of the system was locked on your end, in your servers, and your domain name was something you controlled and they couldn’t replicate. What others had access to was only a thin surface of a much more complicated system running on your servers. It was convenient for you and your developers to have that view-source, and the fact that others benefited from it posed no threat to you.

This is dramatically different in the Exhibit situation (and in many other open data scenarios): not only can you take the data with you, you can take the entire exhibit. Some people are not bothered by this, but you can assume that normal people get a weird feeling when they think that others can just take their entire work and run with it.

This need to ‘prevent people from benefitting from your work without you benefitting from theirs’ is precisely the leverage used by reciprocal copyright licenses (the GPL first, CC-Share-Alike later) to promote themselves, but there is nothing in the Exhibit adoption model that addresses this issue explicitly.

If your business is to tell or synthesize stories that emerge from piles of data (journalists, historians, researchers, politicians, teachers, curators, analysts, etc.), we need to think about a contribution ecosystem where sharing your data benefits you in a way that is obvious for you to understand (and to explain to your boss!). Or, as David suggests, a ‘view source’-style model where the individualistic driver is clear and obvious and the collaborative enabler is transparent, meaning that it doesn’t require extra work and is not perceived as a threat to that individualistic driver.

The thing is: with Exhibit, or with any other system that makes the entire dataset available (this includes Freebase), the immediate perception people have is that making their entire dataset available clearly benefits others while offering no clear benefit to themselves (which was the central issue of my previous post).

Sure, you can try to guilt-trip them into releasing their data (cultural pressure) or use reciprocal licensing models (legal pressure), but really, the driver that works best is when people want to collaborate with one another (or are not bothered by others building on their own work) because they immediately perceive value in doing so.

Both Exhibit and Gridworks were designed with the explicit goal of being drivers for individual adoption first (so that you have a social platform to work with) and potential enablers for collaborative action later (so that you can experiment with trying to build these network effects); but a critical condition for the collaborative enabler is that it must not reduce the benefit of individual adoption, or it will undermine its own ability to drive network effects.

Think for a second about a web where a ‘view source’ command in a browser pulled the entire codebase out of the website you’re visiting: do you really think it would have survived this long? Remember how heated the debate was when GPLv3 drafts proposed reciprocal constraints even for software that was merely executed and not redistributed (which would have impacted all the web sites and web services that are now exempt)?

It is incredibly valuable to be inspired by systems and strategies that worked in the past and by the dynamics that made them sustainable… but we must do so by appreciating both the similarities and the differences if we want to be successful in replicating their impact.

Counterintuitively, what might be required to bootstrap a more sustainable open data ecosystem is not more openness but less: building tools that focus first on protecting individual investments, and only then on fostering selective disclosure and collaboration over the disclosed parts.

We sure can (and did) engineer systems that act as Trojan horses for openness (Exhibit is one obvious example), but they have so far failed to create sustainable network effects because, I think, we have not yet identified the dynamics that entice stable and sustainable collaborative models around data sharing.


Freebase Gridworks, Data-Journalism and Open Data Network Effects

May 24th, 2010

Earlier this year, David pinged me over IRC and prodded me to look at a new software prototype he had just created. Just like many times before, I was blown away: what I had in front of me was a game changer. Not only was it a wonderfully executed prototype of obvious usefulness (a rare thing on its own), it was solid yet flexible in design, and it let me plug in many of my own ideas and code prototypes that had been lying around, disconnected, in various random projects over the years.

The months after became a wonderful and exciting development collaboration between the two of us (much like in the good old days of SIMILE, when we were both at MIT) to take the outstanding ideas and foundation he had built, sprinkle in a few of mine, and bake it all together into a software product we could be proud of and could try to use to bootstrap a network effect around the problem of enticing substantial data contributions to Freebase.

Several months later, that early prototype became Freebase Gridworks.

We knew we were onto something valuable because, while developing it, we started using the tool itself for daily situations that had nothing to do with our development effort. We were writing software we wanted to use ourselves, for daily tasks, and that’s the best (and rarest!) kind of software.

What we didn’t expect was how much people resonated with it.

In Praise of Gridworks

Jon Udell wrote a post entitled “PowerPivot + Gridworks = Wow!” where he marveled at the possibilities of mixing the latest data powertool from Microsoft with Gridworks, and in another post wrote (emphasis is mine):

[...] Freebase Gridworks will make you weep with joy.

As the open data juggernaut picks up steam, a lot of folks are going to discover what some of us have known all along. Much of the data that’s lying around is a mess. That’s partly because nobody has ever really looked at it. As a new wave of visualization tools arrives, there will be more eyeballs on more data, and that’s a great thing. But we’ll also need to be able to lay hands on the data and clean up the messes we can begin to see. As we do, we’ll want to be using tools that do the kinds of things shown in the Gridworks screencasts.

The News Applications team at the Chicago Tribune wrote on their blog:

The genius of Gridworks is that it is generic enough to work for a wide variety of datasets without the need to write any code at all [...] We really can’t say enough about what a great application Gridworks is and about its myriad uses for hacker journalists and data-nerds of all stripe.

Chris Amico of PBS NewsHour tweeted “Gridworks is like crack for data junkies”;  Scott Klein of ProPublica tweeted “I think @thejefflarson is going to name a dog after Gridworks.” speaking of his colleague Jeff Larson; Rich Vázquez of ImpactNews tweeted “I just got to know old data all over again using Freebase Gridworks” and many others we have collected from the Gridworks twitter stream.

Data-Journalism

From The Guardian, Simon Rogers asks this question:

[What] is data journalism? If you need to ask yourself the question, then you are about to miss out on an information bonanza.

Unfortunately, as I have written before, people will soon realize (as Jon also warned above) that the ‘information bonanza’ Simon is talking about looks a lot more like somebody’s gigantic basement than the well-ordered shelves of a library or the heavily curated archives of a museum. Most importantly, they won’t hear any mention of this from governments or open data advocates, since both have all the intentions (and the incentives) to make you believe otherwise.

At the same time, a new breed of electronic investigative journalism is emerging, and it feeds on the perception that there must be golden stories buried in the giant pile of digital ore that open data advocates have helped surface. The problem is now shifting: before, it was getting your hands on the data; now it is surviving information overload, sieving through all that noise to find the golden digital nuggets worthy of a story.

It’s also important to realize that gold is not only a metaphor here: ProPublica, for example, won no less than the Pulitzer Prize this year (a first for an independent, non-profit newsroom producing investigative journalism in the public interest) for work done in partnership with the New York Times. The advantages and rewards for digital journalists are real and tangible, especially in an era when anybody can publish to the world with the click of a button and individual bloggers can’t afford a “news application development team”.

But there are tons of data manipulation tools out there, including free ones: why are people (and data journalists in particular) so excited about Gridworks? What is so special about it?

We don’t know for sure (it’s hard to reconstruct resonance, even after it happens), but this is my personal take:

  1. Unlike most data tools, which assume that data inconsistencies are mistakes and therefore rare, Gridworks was designed around the concept that data inconsistencies are a fundamental property of any dataset; alignment, consistency and curation are first-order tasks that need to be done every time a dataset is used for an application different from its original one. Quality is not an absolute property of a dataset, and it is misleading to assume so. Gridworks makes complex data manipulation and curation operations natural and uniform, while in other tools, even the most popular ones like Excel, there is a huge gap between trivial search-and-replace operations and fully programmable scripting solutions. Most curation operations require complexity that lies in that no man’s land and in most tools is reachable only through complex scripting and programming; not so in Gridworks (see the sketch after this list).
  2. Gridworks follows the principle that data-first designs are more aligned with natural human cognitive abilities and are also easier to bootstrap, because the return on the invested effort is easier to predict (and forecast) at each step of the way. Coupled with the previous point, this means it should be easy and natural for users to re-structure data to follow whatever mental model fits them best and feels most empowering, rewarding and liberating, in contrast with tools that know better and tell them what to do, and for that reason feel rigid and taxing.
  3. Unlike most data mining tools out there, which focus on creating summarized executive reports or spotting numerical trends and correlations, Gridworks focuses less on numbers and more on relations. Numbers and dates are not the focus of the data model; they are decorations on a relational model between more abstract data points. This makes Gridworks fit a special (and in our opinion extremely fertile but mostly unexplored) functional space between a spreadsheet and a relational database, retaining the data-first incremental familiarity of the former and the querying and filtering capacity of the latter.
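
To give a flavor of the kind of operation that lives in that no man’s land, here is a simplified, standalone Python approximation of the “fingerprint” key-collision method behind Gridworks’ clustering feature (the real implementation also normalizes accents and encodings; this sketch keeps only the core idea of grouping inconsistent spellings of the same value):

    import re
    from collections import defaultdict

    def fingerprint(value: str) -> str:
        """Lowercase, strip punctuation, sort the unique tokens."""
        tokens = re.sub(r"[^\w\s]", "", value.lower()).split()
        return " ".join(sorted(set(tokens)))

    def cluster(values):
        """Group values whose fingerprints collide."""
        groups = defaultdict(list)
        for v in values:
            groups[fingerprint(v)].append(v)
        return [vs for vs in groups.values() if len(vs) > 1]

    names = ["Tribune, Chicago", "Chicago Tribune", "chicago tribune.",
             "The Guardian", "Guardian, The", "ProPublica"]
    print(cluster(names))
    # [['Tribune, Chicago', 'Chicago Tribune', 'chicago tribune.'],
    #  ['The Guardian', 'Guardian, The']]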

Open Data Network Effects

Resonance and traction are great, rewarding properties of a successful product launch, but at first they usually only paint a picture of individual interest.

As we have seen with the accent on the social aspects of the web in recent years, even the simplest and most trivial of services (say, microblogging services like Twitter) can assume a completely different scope of impact and importance once sustainable network effects come into play. So how does Gridworks fare in the realm of open data network effects, and what are the obstacles on its path?

First of all, it’s worth noting that all successful and sustainable network effects share one fundamental property: the system needs to be beneficial to the individual independently of how many others use it. If this is not the case, a chicken-and-egg problem surfaces: the system is beneficial only if many people use it, but nobody wants to use it until it’s beneficial to them.

The regular web, the blogosphere and microblogging all share this fundamental property: people find expressing themselves rewarding, independently of how many others read what they write. But these systems also naturally create self-sustaining network effects: once other people read what you wrote, they often want to write something too; if it’s easy and cheap enough for them to do so, this starts a chain reaction that sustains the network effect.

Because David and I have been working on untangling the chicken-and-egg problems of the web of data for years (more than 7, now that I think about it) and gained a lot of experience with previous tools (Timeline, Exhibit and Timeplot) that data lovers really liked, we knew that first and foremost a data tool should feel immediately powerful and rewarding even for purely individual use, and that was our major focus for the first phase of Gridworks.

At the same time, no network effect will emerge unless Gridworks becomes even more useful when others use it too.

It is in that spirit that this tweet today from @delineator made me stop and ponder (emphasis is mine):

@symroe I’m making a lot of use of gridworks too – are you uploading your data back into freebase? not sure if I want to give them the scoop

This is something that was in the back of my mind but that I had never put in such clear terms: the people digging for open data gold might be keen to praise and support every effort that makes more free data and free tools available (as it makes it easier for them to find their digital gold), but while they have clear and established incentives to reveal their findings (what the story is and where they found it, which is the foundation of their credibility as journalists), they do not (yet) have incentives to reveal how they got to them, or to share the results of their data curation effort with others. They worry that doing so would only make it easier for others to find other stories in that pile of already-cleaned data and thus, de facto, ‘steal’ them.

This is not much different, for example, from what happened with the human genome project, when public and private institutions raced to compile the entire map of human DNA: only when the costs of DNA sequencing became so low as to make the proprietary advantage of data hoarding marginal did private institutions start to share their data with the public efforts.

The principal network effect attractor for Gridworks is the notion that internal consistency, external reconciliation and data integration between heterogeneous datasets are surprisingly expensive even for the most trivial and well-covered data domains (something Metaweb learned the hard way while building Freebase).
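
Here is a toy example of why even plain reconciliation is expensive (the thresholds are hypothetical and the fuzzy matching comes from Python’s standard library): every raw spelling has to be scored against the canonical candidates, and anything that lands in the ambiguous middle band still needs human judgment, which is exactly where the cost piles up.

    from difflib import SequenceMatcher

    canonical = ["Washington, D.C.", "Washington (state)", "George Washington"]

    def reconcile(raw, candidates, accept=0.85, reject=0.50):
        """Score raw text against canonical entities; route the ambiguous
        middle band to a human (the expensive part)."""
        scored = sorted(((SequenceMatcher(None, raw.lower(), c.lower()).ratio(), c)
                         for c in candidates), reverse=True)
        score, best = scored[0]
        if score >= accept:
            return best                  # safe to auto-match
        if score <= reject:
            return None                  # safe to treat as a new entity
        return ("needs human review", scored[:3])

    print(reconcile("washington dc", canonical))  # -> Washington, D.C.
    print(reconcile("Washington", canonical))     # -> needs human review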

This fact makes “curated open data hoarding” an unstable equilibrium: all it takes to disrupt the proprietary advantage of hoarding is one person being a little less selfish and sharing their partially curated datasets in an open shared space, spreading the curation cost across many. This is very similar to creating a vendor branch of an open source project and making money off the proprietary fork: it works only if the vendor branch is as effective as the open community at keeping up with the innovation and evolution of the ecosystem (and the history of open source shows this is hardly ever a sustainable business model when the underlying community is healthy and vibrant).

Unfortunately, another chicken-and-egg problem surfaces here: Metaweb and the people relying on Freebase data for their applications won’t be keen on letting people enter badly or partially curated data into the main shared data pool, to avoid diluting the perception of data quality for all other users. On the other hand, curated data hoarding will remain stable unless Gridworks provides simple and effective ways for people to collaborate on the curation of datasets in an incremental and immediately rewarding way (just as proprietary software development models were perfectly stable before the internet and open development processes lowered coordination costs enough for sharing network effects to become sustainable).

Unlocking this conundrum and lowering coordination costs enough to make open data curation sharing sustainable is what the Gridworks team (and the user-facing side of Metaweb) is going to focus on next.

In the meantime, we can’t wait to see what kind of digital gold people will extract from the open data piles using Gridworks, and how they will decorate and augment it with data coming from Freebase.
