We're excited to announce that we will be starting a new series of posts on Veracity here at the Forge.AI blog. The series will explore quite a few aspects, including:
- Machine learning models for veracity
- Moral and ethical aspects of modeling veracity
- Experimental design and processes for creating machine learning models
- System design
This series is being written in tandem with the development of our system and models at Forge.AI and, as such, it is more of a journey than a destination. I will explore the rationale behind our choices and structure in order to illuminate hidden assumptions that may be baked into our models. I will also include my thoughts on being a Modeler (as opposed to other viewpoints and stances one can take when designing, building, or using machine learning).
At Forge.AI we consume many types of textual data as input including, but not limited to: news articles, research and analytical reports, tweets, marketing materials, shareholder presentations, SEC filings, and transcripts. We are working to turn this “soup” of data into coherent, succinct, and salient nuggets we call Events. These Events are consumed by our customers to make decisions, take actions, and reason about concepts relevant to their business. Before we can transform our input data into such Events we need to consider one of the most fundamental principles of any computational system, including machine learning systems: the concept of input quality. This principle is most succinctly represented as: Garbage In, Garbage Out.
Well, easy enough. We just won't give our systems bad inputs! Done. End-of-Post.
Yet this post is still going...
So, what did we miss? To continue our metaphor, what people sometimes leave out is that one man's trash is another man's treasure. Whether so-called bad inputs should be ignored or treated as informative is a matter of perspective and use-case. This is especially true for us at Forge.AI since we do not, and cannot, know every single use-case for every single possible consumer of our Events. Knowing the quality of the input is key to being able to decide what to do with it -- whether to not even use the input or to treat it as highly informative and useful in a particular system.
One such quality of textual data that is often of interest is whether the meaning of the text is actually true or not. For numerical data we might measure quality with a signal-to-noise ratio, but textual data quality is not as easily quantified. So, we turn to veracity in hopes of being able to understand and reason about the quality of textual input.
In addition to our own machine learning algorithms here at Forge.AI, we must also be cognizant of our customers’ machine learning systems. Do our customers’ systems handle “bad” quality data well? Is it useful for consumers of our Events data to know about things that are perhaps low-veracity? Some algorithms can leverage low-veracity Events such as tabloid hearsay or FUD campaigns; for others, ignorance is bliss. Our Events are designed to be inputs to machine learning systems far and wide; systems which themselves have to reason about their input data quality. Veracity is not just of interest to our internal machine learning algorithms; it is paramount for all consumers of our Events. Our models of data quality not only allow us to reason about *our* machine learning systems, they allow our consumers to reason about *their* machine learning systems.
Our initial definition for veracity comes from the Merriam-Webster dictionary: conformity with truth or fact. Even though this definition uses only five words, we can already see the complexity of trying to model veracity. To start, we have a choice in how to interpret conformity: is this a boolean property or are there gradations of conformity? We are a client-facing company; therefore the results of our model of veracity (and our chosen definition of veracity) will be used by our clients to reason about their very own machine learning systems. At Forge.AI we pursue simplicity when building machine learning models, both to improve our own reasoning and so that the users of our model results can reason more efficiently about whether and how to use them. For these reasons we will treat conformity as being all-or-nothing.
Rewriting the definition, we now have: Veracity = "Does [a blob of text] fully conform to truth or facts? [ Yes | No ]." Notice that I have now made veracity a boolean property and I have further focused our models to work on blobs of text. It is important to narrow the scope of the model early on so that we can:
- Construct a valid and useful problem statement
- Create hypotheses and design experiments
- Gather relevant data
Let's tackle the last two salient words: truth and fact. Wow. These words are loaded with potential meanings, nuances, politics, and philosophical pitfalls -- entire fields are devoted to these two words. We need a base definition that is widely adopted from a trusted source, so we turn to one of the most reputable dictionaries of the English language: The Oxford English Dictionary. From the Oxford English Dictionary:
- Truth: That which is true or in accordance with fact or reality
- Fact: The truth about events as opposed to interpretation
Seems clear enough -- even with the circular definition; veracity is a boolean property that represents whether a blob of text contains things, or Events, that are true according to reality. This definition is scoped to what the Event(s) represent. For example, one can have a “true” Event that asserts “Javier *said* I cannot write” and a second, false, Event that incorrectly asserts “Javier cannot write”.
Knowledge, Infinities, and Truth
We now have a definition of what we want to model, but no model yet. How do we begin to model veracity, as defined in the previous section? It seems like any model will have to ascertain whether something, call it X, is true. How many things can be true? How do we represent facts? How do we query something, call it Y, for whether X is true or not?
There will never be a Y that can store "all true things", or, at least, practically never. We can have local regions of knowledge stored in Knowledge Bases, and such objects would allow us to query for whether X was true or not according to the Knowledge Base, but having a global, sufficient Knowledge Base seems like an impossible task. It is not even clear whether the number of truths or facts is finite or infinite. Additionally, what is known changes over time: non-stationarity will get you every time!
Of course, we can always assume certain things about truth. We could assume that truth is finite, or we could go down the non-parametric Bayesian route and assume truth is infinite but only a finite representation is needed at any point in time. We can also assume that truth and facts change slowly over time, perhaps not at all, in order to ignore the non-stationarity or make the growth rate manageable. How far do these assumptions take us? We can reduce the problem of veracity down to the following: given a blob of text representing X, is X contained in our Knowledge Base Y? Unreservedly glossing over how we get X from the blob of text (where X is something we can then query our Knowledge Base Y about), we have now turned our model of veracity into a query of a Knowledge Base. The caveats:
- Our Knowledge Base must contain all (finite, current, pertinent) truths
- Our Knowledge Base contains no falsehoods
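To make the reduction above concrete, here is a minimal sketch in Python of veracity as a Knowledge Base query. The `Fact` triple, the `KnowledgeBase` class, and the `veracity` function are all hypothetical names invented for illustration; they are not part of any Forge.AI system, and the sketch skips entirely the step of extracting X from a blob of text.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """A hypothetical normalized assertion X: (subject, predicate, object)."""
    subject: str
    predicate: str
    obj: str


@dataclass
class KnowledgeBase:
    """A toy Knowledge Base Y: a finite set of facts assumed to be true."""
    facts: set = field(default_factory=set)

    def contains(self, fact: Fact) -> bool:
        return fact in self.facts


def veracity(fact: Fact, kb: KnowledgeBase) -> bool:
    """Boolean veracity as a membership query: X is true iff X is in Y.

    Anything missing from the Knowledge Base is reported as False, which is
    only correct if Y holds all pertinent truths and no falsehoods.
    """
    return kb.contains(fact)


# The example Events from earlier in the post.
kb = KnowledgeBase({Fact("Javier", "said", "I cannot write")})

print(veracity(Fact("Javier", "said", "I cannot write"), kb))  # True
print(veracity(Fact("Javier", "cannot", "write"), kb))         # False
```

The second query returns False only because the fact is absent from the toy Knowledge Base. The sketch cannot tell "false" apart from "unknown", which is exactly why the two caveats carry so much weight.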
So, even with two very strong assumptions regarding finiteness of truth and stationarity, we still have to somehow construct a Knowledge Base that contains all truths and nothing but the truth. A back-of-the-envelope calculation is enough to see the sheer magnitude, and impracticality, of storing all possible facts: there are 7.6 billion people on earth (source: worldometers) and each makes at least one decision daily, which gives rise to at least one fact … and this does not include facts about things that are not people!
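The back-of-the-envelope estimate can be written out as a few lines of Python. The only inputs are the population figure quoted above and the lower bound of one fact per person per day; the constant and function names are made up for this sketch.

```python
PEOPLE = 7_600_000_000         # world population figure quoted in the post
FACTS_PER_PERSON_PER_DAY = 1   # lower bound: at least one decision per day
DAYS_PER_YEAR = 365


def min_new_facts(days: int) -> int:
    """Lower bound on new person-related facts created over `days` days."""
    return PEOPLE * FACTS_PER_PERSON_PER_DAY * days


print(f"{min_new_facts(1):,}")              # 7,600,000,000
print(f"{min_new_facts(DAYS_PER_YEAR):,}")  # 2,774,000,000,000
```

Even this deliberately low bound gives billions of new facts per day and trillions per year, before counting anything that is not a person.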
This line of thought has given me pause. Let us say, for now, that we will not have a Knowledge Base big enough to contain all truth. Can we, perhaps, only apply veracity towards subjects that we care about? How, then, do we define what we care about? Is it based on some form of utility function, perhaps applied to a future state? However we define it, we can try to build a Knowledge Base whose facts and truths cover a certain region or domain of knowledge well. We can quantify what "well" means, as well as "cover". In fact, this idea of how one builds a Knowledge Base is already being discussed here at Forge.AI.
There will invariably come a time, though, when we must apply a model of veracity to something which is not in our Knowledge Base; either because it is a brand new thing or because our Knowledge Base is just incomplete. What recourse, then? Do we throw up our hands and say that anything not in our Knowledge Base is, by construction and definition, unreasonable (unable-to-be-reasoned-about)? What then of consumers who are on the bleeding edge, whose needs require reasoning over new and unexplored regimes? The real question, then, is: can we make a model of veracity that does not depend on our, or any, Knowledge Base? [1]
Veracity Devoid of Fact
Eschewing a Knowledge Base, what is left? Let us consider why we want a model for veracity in the first place. I can come up with the following reasons:
- We want to know when someone or something is lying in order to "count lies". It does not matter what the lie is about, or why, just that we see a lie
- We want to know if we can trust something
- We want to know if someone or something is lying (or not) because we want to know why they are lying (or telling the truth): was it simply a mistake, or is there some other game afoot?
Option #1 seems like we really do need a Knowledge Base. This option is all about whether the text is truthful or not and is a direct application of our definition of veracity. I can see, for example, academics wanting to study the statistics of lies, or the dynamics of lies, being interested in this option. I do not see a way to create a model of veracity for option #1 without a Knowledge Base; it is a valid problem but not one we will consider from now on.
Option #2 is all about trust. Now, trust certainly has a similar feel to facts and truth, but it is not exactly the same thing. You may be able to trust the content of a blob of text by trusting the source of that text. In fact, it is very reasonable to not know the truth behind a piece of text and to trust the source and therefore learn what the text says as a fact. Here I see several ways that we can try to model a quantity much like veracity but whose goal is actually trust:
- Expertise (global)
- Expertise (specific)
- Community Membership/"Points"
- Multiple Sources and Corroboration
- Past Experience
- Intention of Source
Option #3 is all about the intention of the source for the target audience. Notice that the last bullet point for the previous option also has the intention of the source. Coincidence? I think not!
Intention: An Aim or Plan
The section heading comes straight out of the definition of intention from the Oxford English Dictionary. Intention is all about the aim or the goal of a particular piece of text. Is the textual piece trying to persuade? Is the piece trying to inform as objectively as possible? Is the piece trying to get itself re-syndicated somehow (sensational writing and/or clickbait headlines)? Are there, possibly hidden, agendas for the textual information? If so, what are the agendas? Are there utility feedback loops in the environment which can inform the intention of a piece of writing, for example:
- Web ads + click revenue = clickbait
- Academic publication + grant likelihood = doomed to succeed
One striking property of the examples on intention above is that none of them revolve around truth. In fact, they are all relevant and interesting questions regardless of the factual content (or lack thereof) of a piece of text. This looks like a promising direction since intention, it seems, does not require a Knowledge Base.
Stepping Back and Stepping Forward: The Modeler's Dance
Okay, let’s step back and see where we have arrived. We started with the idea of veracity and a solid and clear definition: conformity to truth or facts. However, when we began digging for a model of veracity we stumbled upon Knowledge Bases and the seemingly impossible task of ever having "the right Knowledge Base containing all truth and nothing but the truth". So, what does a modeler do? We started dissecting the reasons why we would want a model of veracity. It turns out that for two of the three reasons we quickly came up with, the idea of intention was paramount. And intention has nothing to do with truth, or Knowledge Bases, so we can sidestep that whole mess altogether.
Such back-and-forth reasoning I call the modeler's dance because it is an unavoidable process when creating models. Modelers are not, strictly, scientists in the sense of a search for truth; a model may be an approximation of reality, or it may be a useful tool to compute something. As such, modelers are not explicitly tied to the truth of the universe but sometimes are like engineers and create tools that are useful for specific functions.
Now, you may think that modeling anything but truth will always be worse than if you modeled truth itself. If that thought crossed your mind, I ask you this: have you ever found a self-deception to be useful, perhaps even necessary, during some part of your life (e.g. when dealing with fears or anxieties self-deception is a useful ego defense)? We do not have the freedom to wait until we have fully understood reality to live in it (we are alive right now!). Similarly, a modeler cannot fail to consider that the model needs to be used, the questions need to be asked and answered, even if the truth of the matter is as yet unknown to all of humanity (or even just to the modeler). There is always a part of a modeling process where we need to determine both what we believe to be truth, and what we believe we need to model; the two are not always the same thing.
Dance of the Modeler 1: Intention and Syntax
Our journey so far has taken us to a new goal: creating a model of intention. We want this model to explicitly distance itself from Knowledge Bases, facts, and truth. Further, we would like this model to apply as generally as possible to pieces of text. How do we do this? We turn to the somewhat contentious ideas, authored by James E. Pennebaker in "The Secret Life Of Pronouns", that the use and structure of language carries information about the emotions and self-concepts of the author of said language. As an example, people who obsess about the past use high rates of past-tense verbs. Pennebaker’s ideas are alluring because they tie together language and the creator of that language so that analyzing the language reveals something about the creator -- such as their intentions. [2] In order to be general to types of text, is there something in the syntax that can help infer the intention of the piece of text? According to Pennebaker: Yes. "Function word" (pronouns, articles, auxiliary verbs) usage may give insight into the speaker, the audience, or both. We will go a step further and try to use the syntax structures themselves (not the actual word choice) as features with which to build a model of intention. An example and teaser of how syntax differs by intention is the following: the average number of adverbs used in survey questions (designed by statisticians to be unbiased) differs from that in reading comprehension questions (where presumably statisticians were not employed to create unbiased questions).
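As a rough illustration of the kind of signal Pennebaker describes, the sketch below computes the rate of pronouns, articles, and auxiliary verbs in a piece of text using only the Python standard library. The word lists are tiny, hand-picked, and purely hypothetical, and the tokenizer is a simple regular expression. This shows function-word usage only; it is not the syntax-structure feature set this series goes on to explore, and it is not a model of intention.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists; a real study would use a
# curated lexicon rather than these hand-picked examples.
FUNCTION_WORDS = {
    "pronoun": {"i", "me", "my", "we", "us", "our", "you", "your",
                "he", "she", "it", "they", "them"},
    "article": {"a", "an", "the"},
    "auxiliary": {"am", "is", "are", "was", "were", "be", "been", "have",
                  "has", "had", "do", "does", "did", "will", "would",
                  "can", "could"},
}


def tokenize(text: str) -> list:
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def function_word_rates(text: str) -> dict:
    """Return the fraction of tokens falling into each function-word category."""
    tokens = tokenize(text)
    if not tokens:
        return {category: 0.0 for category in FUNCTION_WORDS}
    counts = Counter()
    for token in tokens:
        for category, words in FUNCTION_WORDS.items():
            if token in words:
                counts[category] += 1
    return {category: counts[category] / len(tokens)
            for category in FUNCTION_WORDS}


rates = function_word_rates("We have seen the report and we will review it.")
print(rates)
```

Rates like these depend on how a text is written rather than on what it asserts, which is why features of this kind need no Knowledge Base.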
While syntax seems to align with intention, we must be extremely cautious not to end up creating a model for voice. Voice, sometimes erroneously referred to as the writing style of a piece, is not fully understood. Different communities and cultures may have norms determining the voice that is "proper" or "good" for a particular piece of text. Human readers are often swayed by the voice of a piece of text, especially if the voice goes against the norms of the reader's community. While voice may be an interesting part of a piece of text, it is not the intention of the text. Our wish is for a model of intention, regardless of voice. Why? Because as modelers, we are responsible for the decisions, actions, and possible misuses of our models. We want to be very careful not to make a model that is unfairly biased towards or against a particular culture, race, gender, or voice.
For us, we will use the following definition: voice is all stylistic elements that are based on the individual speaker/writer and are inherent to the speaker/writer; voice encompasses intrinsic properties of the speaker/writer devoid of intention. Style, on the other hand, is a choice made by a writer and has an inherent intentionality behind it; I may choose to write in a persuasive style, or an analytical style. For those further interested in voice, Dell Hymes' "Ethnography, Linguistics, Narrative Inequality: Toward an Understanding of Voice" is an interesting read.
Now that we have traded one seemingly impossible modeling task (veracity) for another just as seemingly impossible modeling task (intention), look forward to the next step towards a model of intention using syntax in a future post.
I started this post talking about the importance of knowing the “quality” (by which I meant veracity) of data being given to machine learning algorithms -- not just here at Forge.AI but also for any consumers of our Events data stream. We have danced, we have thought, and we have even shifted our original goal from a purely truth-based veracity model to one whose use-case of trust is explicit from the start: a model of intention. We begin by using syntax, the underlying structure of language itself, as a way to generalize our model into one useful over many types of texts. But there are dangers that we must explicitly guard against; dangers of bias against culture or race, voice or gender. In this way, a Modeler is always willing to create the future, reality be damned! As always, my hope is that our models perform better than a human audience. Is this an attainable goal? Perhaps; but it is our goal nonetheless.
[1] Going into this exercise, I had a strong leaning towards one particular outcome. However, it's important to understand when your feelings are tied to your intuition. It is even more important to understand that intuition is sometimes wrong. The process of creating a machine learning model is sometimes touted as an art or a creative process. That may be true for some but, in my view, models are rational and the process for creating a machine learning model should also be rational. The best place to live is right at the edge of intuition, and a little beyond!
[2] In case you were wondering, I did just throw in the concept of intention without any real definition. As it turns out, intention is a confusing concept, and one which deserves its own post (or two or three as we progress down our journey). For now, let us say that intention boils down to one of a small set of future changes to the reader a writer may have hoped for, including staples such as: deception, persuasion, information transfer, objective query-and-response, and emotional modification.