The real definition of probability

Over the centuries many great minds have pondered the meaning of probability, trying to apply it to profound questions such as "is the harvest likely to be good this year?", "how likely is my stock portfolio to appreciate?", "will I get a hangover after drinking this bottle?", etc. Some even went so far as to claim that the term "probability" is undefinable in terms of simpler concepts, and nevertheless fearlessly proceeded to derive all the correct rules for relating one probability to another without ever revealing the secret of determining the value of either one.

Thanks to the power of the Interwebs and Inkscape you no longer have to wonder alone. Probability is just glorified counting and taking ratios (e.g. counting things of one type within the set of things of another type). In the end, even the supposedly more general Bayesian view of probability reduces to just one elementary operation: counting. This means that with enough perseverance one can reduce any probabilistic problem to counting balls in an imaginary urn. Like so (click to enlarge):

Definition of probability in terms of counting

By the way, when you hear talk about "prior information", what is really meant is "counts". The natural question to ask is: which counts, exactly? If they can't tell you, they are safe to ignore. Also, keep in mind that some counts don't count as much as others. Probably...
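As a minimal sketch of "probability as counting" (in Python; the urn contents are made up for illustration, not taken from the figure):

```python
from fractions import Fraction

# A hypothetical urn: 3 red balls and 2 blue balls.
urn = ["red", "red", "red", "blue", "blue"]

def prob(event, outcomes):
    """Probability as a ratio of counts: |matching outcomes| / |all outcomes|."""
    return Fraction(sum(1 for o in outcomes if event(o)), len(outcomes))

p_red = prob(lambda ball: ball == "red", urn)
assert p_red == Fraction(3, 5)

# "Counting things of one type within the set of things of another type":
# conditional probability is counting inside a sub-collection.
urn2 = [("red", "large"), ("red", "large"), ("red", "small"),
        ("blue", "large"), ("blue", "small")]
reds = [b for b in urn2 if b[0] == "red"]
p_large_given_red = prob(lambda b: b[1] == "large", reds)
assert p_large_given_red == Fraction(2, 3)
```

Nothing but counting and one division - which is the whole point.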

(Coming soon: the real definition of maximum entropy - it's all about counting too!)

More seriously, it may be helpful to realize that every probabilistic model corresponds exactly to some such urn-based setup. Any reasoning using the urn model can be mapped back to the situation described by the probabilistic model - and vice versa. Moreover, urn-based setups may be formally transformed into one another while preserving their meaning. While it's difficult to juggle probabilistic formulae in one's mind, ball-filled urns are quite easy to visualize and quick to check for surprising contents. The continuous probability case also fits in nicely if you imagine the limiting process: shrinking the balls ad infinitum.

Shrinking balls. Oh well...
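To make the model-to-urn correspondence concrete, here is a small sketch (Python; the biased coin and its counts are my own illustrative choice):

```python
from fractions import Fraction

# A biased coin with P(heads) = 3/10, recast as an urn of 10 balls.
urn = ["H"] * 3 + ["T"] * 7
p_heads = Fraction(urn.count("H"), len(urn))
assert p_heads == Fraction(3, 10)

# A meaning-preserving transformation of the setup: scaling every count
# by the same factor changes the urn but not a single probability.
bigger_urn = urn * 5  # 15 heads-balls, 35 tails-balls
assert Fraction(bigger_urn.count("H"), len(bigger_urn)) == p_heads
```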

The curious word 'why'

The same little word "why" may be used to obtain a causal explanation ("by what process has X come to be?") or a goal ("what purpose does X serve, as a part of a larger mechanism?"). These are obviously quite opposite ways of looking at things - backwards vs. forwards in time - one of them neutral, the other postulating agency. So I wonder, why does the word "why" combine both meanings and what are the consequences for everyday reasoning? Is it so in all languages, or do some exist in which inquiries as to cause and purpose must be represented by two distinct question words?

Accelerating Genetic Engineering

Throughout the history of science, disciplines capable of controlled experimentation have advanced rapidly (e.g. physics), while those with limited capabilities of this sort have hardly made any progress in comparison (e.g. the social sciences or economics). It would be no exaggeration to say that the tweak-it-and-see-what-happens approach is the key to gaining insight into how systems operate and how they can be changed to our advantage. Beyond science, it also appears to be the basic principle behind learning of any kind (consider language development in children, for example). For efficiency, it is crucial that the tweaking occurs in a controlled fashion, ceteris paribus, entirely at our will, undisturbed by what statisticians call "confounding factors".

Consider debugging computer software as another example (or troubleshooting of any kind, if you are not into software). If the computer program under inspection only changes its behavior in response to the programmer's modifications and inputs under her control, then the task of understanding it and shaping it into whatever form is desired is mostly trivial. However, if there are unknown, varying inputs that influence the program's behavior on each run and mask the programmer's corrective actions, the debugging task becomes a nightmare, or at least calls for statistical analysis (not commonly available to real-life programmers). The same sort of problem arises if the modifications available to the programmer are too coarse-grained, e.g. if she can only replace large (and needed) components rather than "dig inside" and fix them.

It appears that researchers in genetic engineering have very recently made a breakthrough by gaining the ability not just to observe, but also to tweak their "programs" in a piecewise, controlled fashion. Watch this presentation by Craig Venter to learn more: From Darwin to New Fuels (In A Very Short Time). They now expect that progress will be greatly accelerated by this capability, and looking at history, there is every reason to believe them. The potential for grim accidents is also there, of the same sort present in software systems. The same tweak-and-see techniques that are so helpful in offline development environments can wreak havoc when (or rather if) applied to production systems. (Most) programmers are smart enough to make the distinction. The same must be expected of genetic engineers.

No fuss about causality

Throughout history and up to the present day, a big fuss has been made among philosophers about defining and dealing with causality. For a nice overview, see these lecture slides, which illustrate the troubled history of the concept. In recent times, formal approaches have been developed to connect causality to probabilistic/statistical reasoning (Rubin) or to do just the opposite, treating causality as an extension supposedly entirely beyond the scope of probability theory (Pearl). It seems that the causality debate still rages on, apparently now on the battlefield of notations. For example, listen to Pearl's recent lecture in which he quips that "mere mortals" not trained by Rubin cannot verify certain expressions required within Rubin's framework. Pearl himself advocates a graphical representation of causality (little wonder in light of his past work). Even so, when asked about modeling just slightly complicated scenarios (A causes B, but only given C), he grudgingly admits that graphs cannot directly express such constraints. Instead, the constraint must be hidden within the probability distribution associated with a graph.

Hearing all this, I wonder whether the award-winning philosopher is not now in the business of shooting sparrows with cannons. I agree with Pearl's assessment that given a set of structural equations or a graphical model (like his electric circuit example), all causal and counterfactual questions can be readily answered by simply running the model (simulation). I'm puzzled why Pearl does not go one step further and point out that nowadays (and for 50+ years) we have had very elaborate and wildly popular tools for expressing causal models, and the equipment for running them. They are imperative programming languages and computers, of course. Every program written in an imperative language is an intricate causal model, in which expressing constraints of the sort mentioned above comes effortlessly and in which the notion of time (so central to all causal reasoning) is given by the execution semantics.

For example:

if (c == C)
        if (a == A)
                b = B;

which is of course equivalent to

if (c == C && a == A) { b = B; }

which is of course equivalent to stating "A and C (combined) cause B". Given such a model, we may call A and C separately "necessary causes" if we so prefer. We may call "A and C" the "sufficient cause". Finally, given a particular run and a different expression of the sort "A or C", we may speak of the "actual" cause having been either "A" or "C" or both. What I wish to say is that there are no doubts about causality given a model in the form of a computer program. It also makes obvious how pointing to a single variable as "the" cause of something could be incorrect. Finally, modeling runs of computer programs has been a topic in computer science for decades, even if researchers have never bothered to use the word "causality" in this context.
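These causal queries really can be answered by just running the program. A minimal sketch, here in Python rather than the C-style notation above, with the illustrative concrete values A = C = B = 1:

```python
# The causal model "A and C (combined) cause B", as a runnable program.
# Concrete values A = 1, C = 1, B = 1 are chosen purely for illustration.
def run(a, c, b=0):
    if c == 1 and a == 1:
        b = 1
    return b

# Causal queries, answered by execution:
assert run(a=1, c=1) == 1  # both necessary causes present -> effect
assert run(a=0, c=1) == 0  # A withheld -> no effect (A is necessary)
assert run(a=1, c=0) == 0  # C withheld -> no effect (C is necessary)
```

No special notation is needed: whether a variable "caused" the effect is settled by rerunning the program with that variable changed.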

Of course, computer programs are entirely deterministic and hardly "statistical" beasts. However, who says that "real-world" causality is not, or at least may not be treated as, such? If you view probability, as I do, as a means of modeling epistemic (that is, the modeler's own) uncertainty rather than some ontological "stochastic randomness" of nature, then you can apply it without hesitation to deterministic computer programs in circumstances where parts of the state or code are unknown. For example, you could model an unknown variable value as a probability distribution over possible values, or an unknown segment of code as a probability distribution over possible segments. (If you can't even enumerate the possibilities, or if they appear "infinite", you are in trouble; ask yourself whether and why you know so little and how you could find out more.)
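As a toy sketch of epistemic probability applied to a deterministic program (Python; the program and the candidate input values are hypothetical):

```python
from collections import Counter
from fractions import Fraction

# A perfectly deterministic program whose input x is unknown to us.
def program(x):
    return x % 2

# Model our ignorance of x as a uniform distribution over candidates.
candidates = [0, 1, 2]

# Push that distribution through the program by counting outcomes.
counts = Counter(program(x) for x in candidates)
p_output = {v: Fraction(n, len(candidates)) for v, n in counts.items()}
# The output distribution: P(0) = 2/3, P(1) = 1/3 -- uncertainty about
# the output reflects our uncertainty about the input, nothing more.
```

The "randomness" here lives entirely in our heads; the program itself never flips a coin.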

The challenge of science is, as Pearl rightly points out, that we seldom know the causal model. That is, we either don't know what program has (or may have) generated our observations, or the same set of observations might equally have been generated by many different programs. In the latter case we have a uniform probability distribution over programs. Our task then is to somehow infer the program from the observations and from "causal assumptions" - the data and the prior. The "somehow" should be plausible reasoning according to the rules of probability theory, and so we have a connection (not of the sort contemplated by Pearl/Rubin).

The causal assumptions correspond to our estimate about which models (programs) are possible at all, and which are consistent with other models (programs) that we already deem as accurate and useful representations of reality. Interventions before observation help enormously by lowering probabilities for sets of programs not compatible with the intervention+observation data.

For example, given the following set of observations:

a = 1, b = 0
a = 0, b = 0
a = 0, b = 1
a = 1, b = 0
a = 0, b = 1
a = 1, b = 0
a = 1, b = 0

we could just as well fit the following two causal models (and many others):

if (a == 1) { b = 0; }


if (b == 1) { a = 0; }

However, if we perform a set of interventions setting b = 1 and observe a != 0, and another set of interventions setting a = 1 and observing b == 0, the first model will stand the test while the second one will become very implausible. However, we should be careful not to proclaim it impossible, as there could still be hidden variables, not accounted for within the model, contributing to the observed outcomes. One day, we might find these factors and control for them, and setting b = 1 might then indeed begin causing a == 0. And so we see that:
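The intervention test can be sketched in code (Python; representing an intervention as clamping a variable before the model's mechanisms run is my representation, not notation from the snippets above):

```python
# The two candidate models, each run under the intervention do(b = b_forced).

def model1_do_b(a_init, b_forced):
    a, b = a_init, b_forced   # clamp b; model 1 has no mechanism writing a
    # model 1's own mechanism "if (a == 1) { b = 0; }" is overridden
    # by the clamp on b, so nothing further happens
    return a

def model2_do_b(a_init, b_forced):
    a, b = a_init, b_forced   # clamp b
    if b == 1:                # model 2: if (b == 1) { a = 0; }
        a = 0
    return a

# Intervention do(b = 1) with a initially 1:
assert model1_do_b(1, 1) == 1   # model 1 predicts a != 0
assert model2_do_b(1, 1) == 0   # model 2 predicts a == 0
# Actually observing a != 0 after the intervention leaves model 1
# standing and makes model 2 very implausible.
```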

  • causality, much like probability, is in the eye of the beholder
  • (incomplete) causal models may be treated as if they generated data according to some probability distributions
  • causal models may be assigned probabilities

That said, there is little reason to make a big fuss about finding the "one true definition" of causality, the "one true notation" for representing causal arguments, or "measurement methods" for determining the strength of "causal connections". We have no need for a grand philosophy of causal reasoning, but great need for good, sufficiently granular and computationally cheap causal models that reliably deliver predictions about the effects of actions to their users.

The intellectual dishonesty of "stochastics"

Describing inference problems in the language of stochastics does not necessarily yield poor results, but it seems inherently intellectually dishonest. To see what I mean, consider the typical language used in stochastics: "Given that we are dealing with a random process of the sort X, we can infer that Y is true ... [a valid argument follows]". The intellectual dishonesty is concealed in the "given that" introduction, as users of stochastic arguments hardly ever feel obliged to demonstrate that the premise is fulfilled. A particularly frequent example is the assumption of normally distributed errors.

A satisfactory demonstration of an assumption's validity would usually require many empirical measurements, which might be outright impossible (e.g. to determine the error of an instrument you need an even more accurate instrument, which might be unavailable), too expensive, or simply out of reach of the person making the stochastic argument. When confronted with that inconvenient fact, several lame tactics are available:

  • refer to the literature (claim that the actual measurements have already been made... by someone else... sometime);
  • refer to others behaving the same way (if everybody does it, it must be right);
  • vaguely proclaim that we are dealing with idealized models, so we're OK after all;
  • if the normal distribution is questioned, refer to its natural occurrence and the central limit theorem - that is, claim that it is very likely to be the right distribution after all.
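For what it's worth, the central-limit tendency itself is just counting: the exact distribution of a sum piles up in the middle because more outcome combinations land there. A small sketch (Python; fair dice are an illustrative choice):

```python
from itertools import product
from collections import Counter

# Exact distribution of the sum of n fair dice, obtained by counting
# every one of the 6**n equally likely outcomes -- no sampling involved.
def sum_counts(n):
    return Counter(sum(roll) for roll in product(range(1, 7), repeat=n))

flat = sum_counts(1)   # one die: every total occurs exactly once (uniform)
bell = sum_counts(3)   # three dice: the counts already pile up in the middle
assert set(flat.values()) == {1}
assert bell[10] == bell[11] == max(bell.values())  # modes at 10 and 11
```

So even the strongest card in the deck, the central limit theorem, reduces to counting combinations - it says nothing about whether your errors actually arise as sums of many small independent contributions.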

Why are these tactics lame? Because they are simply attempts to conceal, at all costs, the speaker's lack of information; socially conditioned grabs at retaining authority on a subject. However, we can and ought to be smarter than that. Consider this:

  1. Knowing is generally preferred to not knowing.
  2. Knowing that you don't know is generally preferred to pretending (to yourself and others) that you do. Even if it makes you feel good and shuts up critics.

It turns out that "stochastic" statements about a random process can easily be translated into statements about the speaker's (and perhaps everyone else's!) lack of information about the exact characteristics of the deterministic process. In other words, we assume a particular distribution because we don't know a better one - the alternatives are even worse, given what we do know. If you think about it for a second, this is quite a different (and better) approach than lying to yourself about what you know, for the very simple reason that the former way of thinking invites the possibility of learning more, while the latter has precisely the opposite effect.

It is very possible that experienced users of stochastics realize all of the above, and I am belaboring a trivial point. If so, it remains somewhat puzzling why their language does not mirror their thinking. A case of professional jargon abuse, maybe? Needless to say, this sort of language definitely misleads the uninitiated student of probability theory and statistics. The sooner you see through it, the better.

On zero probability in the continuous variable case

Here is a quote from a SIGGRAPH course by Welch and Bishop:

In the case of continuous random variables, the probability of any single discrete event A is in fact 0.

The same quote could be taken from many introductory texts on probability theory. It seemed absurd to me the first time I read it. Back then, I got over it, attributing the feeling to my own inexperience. Well, after some years and, I dare say, improved understanding, I know that it is in fact an absurd - or at least uncomfortably sloppy - statement. Moreover, I can explain why, and get rid of the confusion.

There are two main reasons for the intuitively perceived absurdity:

  • Zero probability is synonymous with "impossible event". If the quoted statement were true, it would follow that whatever value of the random variable you choose, that value is impossible (and I really mean any value). Yet we know from experience, which our model is supposed to reflect, that the random variable does assume some value in reality.
  • The positive probability of a value falling in a given interval arises from summing the probabilities (integrating) of all discrete values within that interval. However, adding together zeros - even in an infinite loop - yields zero.

Of course, one could ask: if the probability P(A=x) is not 0 in the continuous case, then how big is it? The simple answer is: there is no continuous case. It is a figment of the mathematician's imagination, a model primarily intended to ease calculations rather than a representation of reality. The zero probability "exists" in the same sense that a mathematical point "exists". On the other hand, when we talk about "possible" and "impossible" events, we talk about [our perceptions of] reality. We'd also like the connection to reality to remain intact when we use the notion of probability, continuous or not. Of course, if the continuous case is discretized (and you can choose to do it using as many discrete events as you desire), the "paradox" of possible zero-probability events is resolved at once.

Where do the idea and the bold assertions about P(A=x) = 0 come from, then? They are but a sloppy description of the limiting process of increasing the number of events without bound. That is, a way of saying that "the more equiprobable events we have, the smaller the per-event probability". It is correct to say that we approach zero probability, which is quite a different thing from saying that we (ever) reach this value. In all practical thinking, we may safely ignore the infinite processes and infinite "things" the mathematician is so fond of, or better yet, accept them as a convenient approximation of our discrete reality, to which our actual reasoning applies.
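The limiting process can be sketched directly (Python; the uniform distribution on [0, 1) and the bin counts are illustrative choices):

```python
from fractions import Fraction

# Discretize the uniform distribution on [0, 1) into n equal bins;
# each bin is one "discrete event".
def per_event_probability(n):
    return Fraction(1, n)

# Finer and finer discretizations: each event's probability shrinks
# toward 0 but, for any finite n, never reaches it -- and the events
# still sum to certainty.
for n in (10, 1000, 10**6):
    p = per_event_probability(n)
    assert p > 0          # no event is "impossible"
    assert n * p == 1     # probabilities still sum to 1
```

For every discretization you might actually use, no paradox arises; the "P = 0" statement only describes where the sequence is headed, not any member of it.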

Introduction to Probability Theory, Part 4

Having discussed how different persons might assign different probabilities to the same proposition, and having for the time being disposed of the notion of the "one true probability", let's turn to another intriguing question: how does one person know whether her assigned probability is correct - that it reflects her own background information? Obviously, if what you know remains unchanged, so should the probability assignments that you make for any proposition on that basis. In other words, it would not be sensible to assign, at your whim, two different probabilities to the same proposition, unless you have in the meanwhile learned something new that is somehow related to that proposition. However, even if you agree with me that probability assignments are stable given some state of knowledge, the question remains: which probability assignment among the infinitely many between 0 and 1 is appropriate - and what does "appropriate" even mean, precisely?

The answer to the second question given by probability theory is intuitive and satisfactory. An "appropriate" or "correct" probability assignment is one that is consistent with all the other probability assignments you might make. That is, you cannot have your cake and eat it too: because all considered propositions are either true or false, and because they are interwoven (their meanings are related to each other), there is some risk that you might come up with an internally contradictory probability assignment - based on which you'd have to conclude that some proposition is both true and false. The "correct" assignment, on the other hand, does not evoke any such absurd conclusions.

For example, you cannot and would not at the same time believe that I am both younger and older than any given age; if you felt 75% sure that I'm older, you'd also feel 25% sure that I'm younger and vice versa. However, if you were to assign probabilities to some related and indirect propositions instead, from which my age could be derived (say, propositions about myself having witnessed certain historical events during my lifetime, propositions about my friends' and parents' ages etc.), it could happen by accident that your combined probability assignment would imply that you do believe in that absurdity. You would then have to reject such a probability assignment and find out at which particular subproposition it went wrong (including the possibility of going wrong many times).

As an analogy, it helps to consider "financial arithmetic". If you were an accountant tasked with summing percentage fractions representing parts of a whole and arrived at a sum either greater or less than 100%, you'd know that you must have made a mistake somewhere along the way. Note, however, that this is a rather weak criterion of correctness: not all sums that come to 100% contain the right components. Indeed, you can produce an infinite number of artificial sums that all come to 100% by tweaking the individual components relative to each other. So what you'd need to become more certain that the arithmetical calculation reflects reality would be some additional means of checking consistency, such as partial sums. Depending on your level of paranoia, you could introduce more and more partial sums ad infinitum. The point is, they would all have to be consistent in order for you to be satisfied that the calculation - analogous to a probability assignment - was true. Just one slight deviation from their expected relationship would mean that an error had slipped in.
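The accountant's checks translate directly into checks on probability assignments. A tiny sketch (Python; the age brackets and the numbers are made up for illustration):

```python
from fractions import Fraction

# A hypothetical assignment over an exhaustive, mutually exclusive
# partition (age brackets). The whole must sum to exactly 1 -- the
# accountant's 100%.
assignment = {
    "younger than 30": Fraction(1, 4),
    "30 to 60":        Fraction(1, 2),
    "older than 60":   Fraction(1, 4),
}
assert sum(assignment.values()) == 1

# A "partial sum" check, analogous to the accountant's subtotals:
# P(older than 30) computed two ways must agree.
p_older_than_30 = assignment["30 to 60"] + assignment["older than 60"]
assert p_older_than_30 == 1 - assignment["younger than 30"]
```

Any failed assertion here plays the role of the deviating subtotal: it tells you the assignment is internally contradictory, though it cannot tell you which component is the culprit.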

Now that we have introduced self-consistency as a means of checking whether a given probability assignment is the correct one, have we also answered the first question - how do we find this correct assignment? Yes and no. In principle, we could "simply" write down hundreds of thousands of probability assignments, go through each one noting the inconsistencies it contains, and in the end accept the assignment with the fewest inconsistencies. Obviously, given the infinite number of possible assignments, this would be a formidable task (for any human and for any machine), and also nothing like how we are used to solving real problems. It would be comparable to an accountant randomly generating hundreds of thousands of balance statements and then going through the heap to check which of them reflects the company's finances. Fortunately, that's not how accountants work, and it's not a sensible use of probability theory either. What we need instead are reliable, mechanical rules that allow us to construct internally consistent assignments as long as we stick to them, much like the rules of arithmetic never let you down. Such rules do indeed exist, and they form the very core of probability theory - or, as R. T. Cox called it, "an algebra of probable inference".

To be continued...