Plausible Reasoning

The real definition of probability

2010-01-11T18:31:00.004+01:00

Over centuries many great minds have pondered the meaning of probability, trying to apply it to profound questions such as "is the harvest likely to be good this year", "how likely is my stock portfolio to appreciate", "will I get a hangover after drinking this bottle", etc. Some even went as far as to claim that the term "probability" is undefinable using simpler concepts and nevertheless fearlessly proceeded to derive all the correct rules for relating one probability to another without ever revealing the secret of determining the value of either one.

Thanks to the power of Interwebs and Inkscape you no longer have to wonder. Probability is just glorified counting and taking ratios (e.g. counting things of one type within the set of things of another type). In the end, even the supposedly more general Bayesian view on probability reduces to just one elementary operation, counting. This means that with enough perseverance one can reduce any probabilistic problem to counting balls in an imaginary urn. Like so:

By the way, when you hear talk about "prior information", what is really meant is "counts". The natural question to ask is which counts exactly. If they can't tell you, they are safe to ignore. Also, keep in mind that some counts don't count as much as the others. Probably...

(Coming soon: the real definition of maximum entropy - it's all about counting too!)

More seriously, it may be helpful to realize that every probabilistic model corresponds exactly to some such urn-based setup. Any reasoning using the urn model can be mapped back to the situation described by the probabilistic model - and vice versa. Moreover, urn-based setups may be transformed formally into one another while preserving their meaning. While it's difficult to juggle probabilistic formulae in one's mind, ball-filled urns are quite easy to visualize and quick-check for surprising contents. The continuous probability case also fits nicely by imagining the "going into the limit process", that is, shrinking balls ad infinitum.
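To make the urn picture concrete, here is a minimal sketch in Python. The urn contents and the helper name `probability` are invented for illustration; the only point is that every probability below, including the conditional ones, is a ratio of two counts.

```python
from fractions import Fraction

# Hypothetical urn: each ball is a (color, size) pair. The contents are invented.
urn = (
    [("red", "small")] * 3
    + [("red", "large")] * 1
    + [("blue", "small")] * 2
    + [("blue", "large")] * 4
)

def probability(event, given=lambda ball: True):
    """Ratio of counts: balls satisfying `event` among balls satisfying `given`."""
    pool = [ball for ball in urn if given(ball)]
    hits = [ball for ball in pool if event(ball)]
    return Fraction(len(hits), len(pool))

def is_red(ball):
    return ball[0] == "red"

def is_small(ball):
    return ball[1] == "small"

p_red = probability(is_red)                              # 4 of 10 balls
p_small = probability(is_small)                          # 5 of 10 balls
p_red_given_small = probability(is_red, given=is_small)  # 3 of 5 small balls
p_small_given_red = probability(is_small, given=is_red)  # 3 of 4 red balls

# Bayes' rule falls out of the counting; nothing extra is needed.
bayes_holds = p_red_given_small == p_small_given_red * p_red / p_small
```

Changing the "prior information" here means nothing more than changing which balls are in the urn.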

Shrinking balls. Oh well...

The curious word 'why'

2009-10-31T17:26:00.003+01:00

The same little word "why" may be used to obtain a causal explanation ("by what process has X come to be?") or a goal ("what purpose does X serve, as a part of a larger mechanism?"). These are obviously quite opposite ways of looking at things - backwards vs. forwards in time - one of them neutral, the other postulating agency. So I wonder, why does the word "why" combine both meanings and what are the consequences for everyday reasoning? Is it so in all languages, or do some exist in which inquiries as to cause and purpose must be represented by two distinct question words?

Accelerating Genetic Engineering

2009-08-31T22:36:00.002+02:00

Throughout the history of science, those disciplines capable of controlled experimentation have advanced rapidly (e.g. physics) while those with limited capability of this sort have hardly made any progress in comparison (e.g. social sciences or economics). It would be no exaggeration to say that the tweak-it-and-see-what-happens-then approach is the key to gaining insight into how systems operate and how they can be changed to our advantage. Beyond science, it also appears to be the base principle for learning of any kind (consider language development in children, for example). For efficiency, it is crucial that the tweaking occurs in a controlled fashion, ceteris paribus, completely at our will, not disturbed by what is known as "confounding factors" in statistics.

Consider debugging computer software as another example (or troubleshooting of any kind, if you are not into software). If the computer program under inspection only changes its behavior in response to the programmer's modifications and inputs under her control, then the task of understanding and shaping it into whatever form is desired is mostly trivial. However, if there are unknown varying inputs that influence the program's behavior on each run, which mask the programmer's corrective actions, the debugging task becomes a nightmare, or at least calls for statistical analysis (not commonly available to real-life programmers). The same sort of problems arises if the modifications available to the programmer are too coarse-grained, e.g. if she can only replace large (and needed) components rather than "dig inside" and fix them.

It appears that researchers in Genetic Engineering have very recently made a breakthrough by gaining the ability to not just observe, but also tweak their "programs" in a piecewise, controlled fashion. Watch this presentation by Craig Venter to learn more: From Darwin to New Fuels (In A Very Short Time). They now expect that the progress will be greatly accelerated by this capability, and looking at the history, there is every reason to believe them. The potential for grim accidents is also there, of the same sort which is present in software systems. The same tweak-and-see techniques that are so helpful in offline development environments can wreak havoc when (or rather if) applied in production systems. (Most) programmers are smart enough to make the distinction. The same must be expected from genetic engineers.

No fuss about causality

2009-01-05T20:12:00.002+01:00

Throughout history and up into modern days, a big fuss has been made among philosophers about defining and dealing with causality. For a nice overview, see these lecture slides, which illustrate the troubled history of the concept. In recent times, formal approaches have been developed to connect causality to probabilistic/statistical reasoning (Rubin) or to do just the opposite, treating causality as an extension supposedly completely out of scope of probability theory (Pearl). It seems that the causality debate still rages on, apparently now on the battlefield of notations. For example, listen to Pearl's recent lecture in which he quips that "mere mortals" not trained by Rubin cannot verify certain expressions required within Rubin's framework. Pearl himself advocates a graphical representation of causality (little wonder in light of his past work). Even so, when asked about modeling just slightly complicated scenarios (A causes B, but only given C), he grudgingly admits that graphs do not directly express such constraints. Instead, the constraint can be hidden within the probability distribution associated with a graph.

Hearing all this, I wonder whether the award-winning philosopher is not now in the business of shooting sparrows with cannons. I agree with Pearl's assessment that given a set of structural equations or a graphical model (like his electric circuit example), all causal and counterfactual questions can be readily answered by simply running the model (simulation). I'm puzzled why Pearl does not go one step further and point out that nowadays (and for 50+ years already) we have very elaborate and wildly popular tools for expressing causal models and the equipment for running them. They are imperative programming languages and computers, of course. Every program written in an imperative language is an intricate causal model, in which expressing constraints of the sort mentioned above comes effortlessly and the notion of time (so central to all causal reasoning) is given by the execution semantics.

For example:

if (c == C)
{
        if (a == A)
        {
                b = B;
        }
}

which is of course equivalent to

if (c == C && a == A) { b = B; }

which is of course equivalent to stating "A and C (combined) cause B". Given such a model, we may call A and C separately "necessary causes" if we so prefer. We may call "A and C" the "sufficient cause". Finally, given a particular run and a different expression of the sort "A or C", we may speak of the "actual" cause having been either "A" or "C" or both. What I wish to say is that there are no doubts about causality given a model in the form of a computer program. It also makes it obvious how pointing to a single variable as "the" cause of something could be incorrect. Finally, modeling runs of computer programs has been a topic in computer science for decades, even if the researchers have never bothered to use the word "causality" in this context.
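As a rough illustration of "simply running the model", here is the same two-condition program as a small Python sketch. The function name `model` and the boolean encoding of A, B and C are my own choices for this example. Counterfactual questions are answered by re-running the program with one input changed.

```python
from itertools import product

def model(a, c):
    """Return True if B happens. Mirrors: if (c == C && a == A) { b = B; }"""
    b = False
    if c and a:
        b = True
    return b

# The actual run: both conditions hold, so B happens.
actual = model(a=True, c=True)

# Counterfactuals: what if A (or C) had not been the case?
without_a = model(a=False, c=True)
without_c = model(a=True, c=False)

# Each of A and C is necessary: removing either one prevents B.
a_is_necessary = actual and not without_a
c_is_necessary = actual and not without_c

# Enumerating all inputs shows the only sufficient setting is "A and C".
sufficient_settings = [
    (a, c) for a, c in product([False, True], repeat=2) if model(a, c)
]
```

Neither input alone deserves to be called "the" cause, which the enumeration makes plain.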

Of course, computer programs are entirely deterministic and hardly "statistical" beasts. However, who says that the "real-world" causality is not or at least may not be treated as such? If you view probability, as I do, as a means for modeling epistemic (that is, modeler's own) uncertainty rather than some ontologic "stochastic randomness" of nature, then you can apply it without hesitation to deterministic computer programs, in circumstances where parts of the state or code are unknown. For example, you could model an unknown variable value as a probability distribution over possible values, or you could model an unknown segment of code as a probability distribution over possible segments. (If you can't even enumerate the possibilities or if they appear "infinite", you are in trouble; ask yourself whether and why you know so little and how you could find out more.)
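A minimal sketch of the first suggestion, modeling unknown variable values as probability distributions and pushing them through a deterministic program. The numbers in `belief_a` and `belief_c` are invented, and treating the two beliefs as independent is an extra assumption made only to keep the example short.

```python
from collections import defaultdict
from fractions import Fraction

def program(a, c):
    """A deterministic program; returns the final value of b."""
    b = 0
    if c == 1 and a == 1:
        b = 1
    return b

# The modeler's uncertainty about the inputs (invented numbers).
belief_a = {0: Fraction(1, 4), 1: Fraction(3, 4)}
belief_c = {0: Fraction(1, 2), 1: Fraction(1, 2)}

# Run the program once per possible input state and add up the
# probability mass that lands on each output value.
belief_b = defaultdict(Fraction)
for a, p_a in belief_a.items():
    for c, p_c in belief_c.items():
        belief_b[program(a, c)] += p_a * p_c  # assumes a and c are independent
```

The program itself never behaves randomly here; the distribution over `b` only reflects what the modeler does not know about `a` and `c`.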

The challenge of science is, as Pearl rightly points out, that we seldom know the causal model. That is, we either don't know what program has (or may have) generated our observations, or the same set of observations might equally well have been generated by many different programs. In this latter case we have a uniform probability distribution over programs. Our task then is to somehow infer the program from the observations and from "causal assumptions" - data and the prior. The "somehow" should be plausible reasoning according to the rules of probability theory, and so we have a connection (not of the sort contemplated by Pearl/Rubin).

The causal assumptions correspond to our estimate about which models (programs) are possible at all, and which are consistent with other models (programs) that we already deem as accurate and useful representations of reality. Interventions before observation help enormously by lowering probabilities for sets of programs not compatible with the intervention+observation data.

For example, given the following set of observations:

a = 1, b = 0
a = 0, b = 0
a = 0, b = 1
a = 1, b = 0
a = 0, b = 1
a = 1, b = 0
a = 1, b = 0

we could just as well fit the following two causal models (and many others):

if (a == 1) { b = 0; }

if (b == 1) { a = 0; }

However, if we perform a set of interventions of setting b = 1 and observing a != 0, and another set of interventions of setting a = 1 and observing b == 0, the first model will stand the test while the second one will become very implausible. However, we should be careful to not proclaim it impossible, as there still could be hidden variables not accounted for within the model contributing to the observed outcomes. One day, we might find these factors and control for them and setting b = 1 might then indeed begin causing a == 0. And so we see that:

causality, much like probability, is in the eye of the beholder
(incomplete) causal models may be treated as if they generated data according to some probability distributions
causal models may be assigned probabilities
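The last point can be sketched as a small Bayesian update over the two candidate programs. Everything numeric here is an assumption of mine: `EPSILON` stands for the chance that hidden variables override a model's prediction, `SILENT` is the likelihood used when a model says nothing about the observed variable, the prior is uniform, and five runs of each intervention are assumed.

```python
EPSILON = 0.05  # assumed chance that hidden variables override a prediction
SILENT = 0.5    # assumed likelihood when a model predicts nothing either way

MODEL_1 = "if (a == 1) { b = 0; }"
MODEL_2 = "if (b == 1) { a = 0; }"

# P(observation | intervention, model) for the two experiments in the text.
likelihood = {
    MODEL_1: {
        ("set b = 1", "a != 0"): SILENT,       # model 1 says nothing about a
        ("set a = 1", "b == 0"): 1 - EPSILON,  # exactly what model 1 predicts
    },
    MODEL_2: {
        ("set b = 1", "a != 0"): EPSILON,      # contradicts model 2
        ("set a = 1", "b == 0"): SILENT,       # model 2 says nothing about b
    },
}

# Assumed experimental record: five runs of each intervention.
experiments = [("set b = 1", "a != 0")] * 5 + [("set a = 1", "b == 0")] * 5

posterior = {MODEL_1: 0.5, MODEL_2: 0.5}  # uniform prior over the programs
for experiment in experiments:
    for name in posterior:
        posterior[name] *= likelihood[name][experiment]
    total = sum(posterior.values())
    posterior = {name: p / total for name, p in posterior.items()}
```

With these assumptions the second model ends up with a tiny but nonzero probability: very implausible, not impossible.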

That said, there is little reason to make a big fuss about finding the "one true definition" of causality, the "one true notation" for representing causal arguments, or "measurement methods" for determining strength of "causal connections". We have no need for big philosophy of causal reasoning, but great need for good, sufficiently granular and computationally cheap causal models that reliably deliver predictions about effects of actions to their users.

The intellectual dishonesty of "stochastics"

2008-11-01T15:28:00.010+01:00

To describe inference problems using the language of stochastics does not necessarily yield poor results, but it seems inherently intellectually dishonest. To see what I mean, consider a typical language used in stochastics: "Given that we are dealing with a random process of the sort X, we can infer that Y is true ... [a valid argument follows]". The intellectual dishonesty is concealed in the "given that" introduction, as users of stochastics arguments hardly ever feel obliged to demonstrate that the premise is fulfilled. A particularly frequent example is the assumption of normally distributed errors.

A satisfactory demonstration of the assumptions' validity would usually require many empirical measurements, which might be outright impossible (e.g. to determine an error of an instrument you need an even more accurate instrument, which might be unavailable), too expensive, or simply out-of-reach of the person who is making the stochastic argument. If confronted with that inconvenient fact, several lame tactics are possible:

refer to the literature (claim that the actual measurements have been made already.. by someone else.. sometime);
refer to others behaving the same way (if everybody does it, then it must be right);
vaguely proclaim that we are dealing with idealized models, so we're ok after all;
if the normal distribution is questioned, refer to its natural occurrence and the central limit theorem - that is, claim that it is very likely to be the right distribution, after all.

Why are these tactics lame? Because they are simply attempts to conceal, at all cost, the speaker's lack of information; socially conditioned grasps to retain authority about a subject. However, we can and ought to be smarter than that. Consider this:

Knowing is generally preferred to not knowing.
Knowing that you don't know is generally preferred to pretending (to yourself and others) that you do. Even if it makes you feel good and shuts up critics.

It turns out that the "stochastic" statements about the random process can be easily translated into statements about the speaker's (and perhaps everyone else's!) lack of information about the exact characteristics of the deterministic process. In other words, we assume a particular distribution because we don't know any better one - the alternatives are even worse given what we do know. If you think about it for a second, it is quite a different (and better) approach than lying to yourself about what you know, for the very simple reason that the former way of thinking invites the possibility of learning more while the latter way of thinking has the precisely opposite effect.

It is very possible that experienced users of stochastics do realize all of the above, and so I am belaboring a trivial point. If so, it remains somewhat puzzling as to why their language does not mirror their thinking. A case of professional jargon abuse, maybe? Needless to say, this sort of language is definitely misleading to the uninitiated student of probability theory/statistics. The sooner you see through it, the better.

On zero probability in the continuous variable case

2008-11-01T12:10:00.014+01:00

Here is a quote from a SIGGRAPH course by Welch and Bishop:

In the case of continuous random variables, the probability of any single discrete event A is in fact 0.

The same quote could be taken from many introductory texts on probability theory. It seemed absurd to me the first time I read it. Back then, I got over it, attributing the feeling to my own inexperience. Well, after some years and I dare say improved understanding, I know that it is in fact an absurd - or at least an uncomfortably sloppy - statement. Moreover, I can explain why and get rid of the confusion.

There are two main reasons for the intuitively perceived absurdity:

Zero probability is synonymous with "impossible event". If the quoted statement were true, it would follow that, regardless of what value of the random variable you choose, it is impossible (and I really mean any value). Yet we know from experience, which our model is supposed to reflect, that the random variable does assume some value in reality.
The positive probability of a value falling in a given interval arises from summing probabilities (integration) of all discrete values within that interval. However, adding together zeros - even in an infinite loop - yields zero.

Of course, one could ask: if the probability P(A=x) is not 0 in the continuous case, then how big is this probability? The simple answer to that is: there is no continuous case, it is a figment of a mathematician's imagination, a model primarily intended to ease calculations, rather than a representation of reality. The zero probability "exists" in the same sense as a mathematical point "exists". On the other hand, when we talk about "possible" and "impossible" events, we talk about [our perceptions of] reality. We'd also like the connection to reality to remain intact when we use the notion of probability, continuous or not. Of course, if the continuous case is discretized (and you can choose to do it using as many discrete events as you desire), the "paradox" of possible zero-probability events is resolved at once.

Where do the idea and the bold assertions about P(A=x) = 0 come from, then? They are but a sloppy description of the limiting process, of increasing the number of events without bounds. That is, a way to say that "the more equiprobable events we have, the smaller the per-event probability". It is correct to say that we approach zero probability, which is quite a different thing from saying that we (ever) reach this value. In all practical thinking, we may safely ignore infinite processes and infinite "things" a mathematician is so fond of, or better yet, accept them as a convenient approximation of our discrete reality to which our actual reasoning applies.
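The limiting process is easy to watch numerically. The sketch below assumes a variable spread uniformly over [0, 1) and cut into n equiprobable events; the function name `discretize` and the interval [1/5, 1/2) are picked only for illustration.

```python
from fractions import Fraction

def discretize(n):
    """Split [0, 1) into n equiprobable events.

    Returns the probability of a single event and the probability of
    landing in [1/5, 1/2), obtained by summing the events inside it.
    """
    per_event = Fraction(1, n)
    lo, hi = Fraction(1, 5), Fraction(1, 2)
    in_interval = sum(
        per_event for k in range(n) if lo <= Fraction(k, n) < hi
    )
    return per_event, in_interval

results = {n: discretize(n) for n in (10, 100, 1000)}
```

For each n in this sketch the per-event probability shrinks (1/10, 1/100, 1/1000) but is never zero, while the interval probability stays at 3/10, because it is always a sum of positive terms.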

Introduction to Probability Theory, Part 4

2008-08-24T16:45:00.010+02:00

Continued from Part 3...

Having discussed how different persons might assign different probabilities to the same proposition and for the time being disposed of the notion of the "one true probability", let's turn to another intriguing question: how does one person know whether her assigned probability is correct - that it reflects her own background information? Obviously, if what you know remains unchanged, so should the probability assignments that you make for any proposition on that basis. In other words, it would not be sensible to assign at your whim two different probabilities to the same proposition, unless you have in the meanwhile learned something new which is somehow related to that proposition. However, if you agree with me that the probability assignments are stable given some state of knowledge, then still the question remains: which probability assignment among the infinite number of assignments between 0 and 1 is appropriate - and what does "appropriate" even mean precisely?

The answer to the second question given by probability theory is intuitive and satisfactory. An "appropriate" or "correct" probability assignment is one that is consistent with all the other probability assignments you might make. That is, you cannot have your cake and eat it too: because all considered propositions are either true or false, and because they are interwoven (their meanings are related to each other), there is some risk that you might come up with an internally contradictory probability assignment - based on which you'd have to conclude that some proposition is both true and false. The "correct" assignment, on the other hand, does not evoke any such absurd conclusions.

For example, you cannot and would not at the same time believe that I am both younger and older than any given age; if you felt 75% sure that I'm older, you'd also feel 25% sure that I'm younger and vice versa. However, if you were to assign probabilities to some related and indirect propositions instead, from which my age could be derived (say, propositions about myself having witnessed certain historical events during my lifetime, propositions about my friends' and parents' ages etc.), it could happen by accident that your combined probability assignment would imply that you do believe in that absurdity. You would then have to reject such a probability assignment and find out at which particular subproposition it went wrong (including the possibility of going wrong many times).

As an analogy, it helps to consider financial arithmetic. If you were an accountant tasked with summing percentages representing parts of a whole amount and arrived at a sum which was either greater or less than 100%, you'd know that you must have made a mistake somewhere along the way. Note, however, that it is a rather weak criterion of correctness: for not all sums that end up with 100% contain all the right components. Indeed, you can produce an infinite number of artificial sums that all end up with 100% by tweaking the individual components relative to each other. So what you'd need to become more certain that the arithmetical calculation reflects reality would be some additional means of checking consistency, such as partial sums. Depending on your level of paranoia, you could introduce more and more partial sums ad infinitum. The point is, they would all have to be consistent in order for you to be satisfied that the calculation - analogous to a probability assignment - was true. Just one slight deviation from their expected relationship would mean that an error slipped in.
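Here is a small sketch of such a consistency check, applied to the age example. It uses two standard identities that this series has not yet derived, so take them as assumptions at this point: a proposition and its negation must sum to 1, and the law of total probability must hold as a "partial sum" across an indirect proposition (here, "witnessed a certain historical event"). The function name and all numbers are invented.

```python
from fractions import Fraction as F

def inconsistencies(p_older, p_younger, p_witnessed,
                    p_older_if_witnessed, p_older_if_not_witnessed):
    """Return a list of contradictions found in a probability assignment."""
    problems = []
    # Check 1: the whole must add up to 100%.
    if p_older + p_younger != 1:
        problems.append("P(older) + P(younger) is not 1")
    # Check 2: a "partial sum" via the indirect proposition must agree.
    implied = (p_older_if_witnessed * p_witnessed
               + p_older_if_not_witnessed * (1 - p_witnessed))
    if implied != p_older:
        problems.append(
            f"indirect assignments imply P(older) = {implied}, not {p_older}"
        )
    return problems

coherent = inconsistencies(F(3, 4), F(1, 4), F(1, 2), F(1), F(1, 2))
incoherent = inconsistencies(F(3, 4), F(1, 4), F(1, 2), F(1), F(1, 5))
```

The second assignment passes the 100% test and still fails the partial-sum test, which is exactly why the first criterion alone is weak.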

Now that we have introduced self-consistency as a means of checking whether a given probability assignment is the correct one, have we also answered the first question - how do we find this correct assignment? Yes and no. In principle, we could "simply" write down hundreds of thousands of probability assignments and then go through each one and note the inconsistencies it contains, and in the end accept the probability assignment which has the least amount of inconsistency. Obviously, given the infinite number of different possible assignments, this would be a formidable task (for any human and for any machine), and also nothing like what we are used to in solving real problems. This would be comparable to an accountant generating randomly hundreds of thousands of balance statements and then going through the heap to check which of them reflects the company's finances. Fortunately, it's not how accountants work and not a sensible use of probability theory either. What we need instead is a set of reliable, mechanical rules that allow us to construct internally consistent assignments, as long as we stick to them, much like the rules of arithmetic don't ever let you down. Such rules indeed do exist, and they form the very core of probability theory, or, as R. T. Cox called them, "an algebra of probable inference".

To be continued...

Introduction to Probability Theory, Part 3

2008-08-17T13:28:00.011+02:00

Continued from Part 2...

All probabilities are conditional

Ok, now we come to an interesting point. If probability is attached to propositions, and propositions are about objective things that can be true or false, is it right to say that "the objective probability of proposition X is so and so"? What if you and I disagree in our probability assignments for the same proposition - I feel that this page is top-notch and you feel that it is mediocre, without either of us knowing the actual public rating?

Regardless of what you might have been taught about "events" having some inherent "probabilities" that "we" are trying to calculate, the above example of disagreement about probability of a proposition is a perfectly normal situation. We all know that it happens all the time. Just turn on your TV and look at some programme with folks arguing like crazy about different issues. Obviously, independently of how concrete a proposition is, people may disagree on its probability - it is a measure of their degree of confidence, not your degree of confidence after all! Now, the next question naturally is: why can different people have different degrees of confidence in truth of the same thing?

The answer is their different information context. Whether or not you consider some proposition likely strongly depends on what other propositions you believe in. In a way, all the different propositions are related in our heads, and we are usually quite ready to change our opinion on one proposition after learning something about another. For example, you might be somewhat certain that I'm a native English speaker after reading my text, but if you could throw a brief glance at my passport, it would change your assessment. If you saw an entry "American" under nationality in it, you'd become (almost) certain about the truth of that proposition. On the other hand, if you saw some other nationality, you would become almost certain that the proposition is false. Now, someone else might not have had the same opportunity of looking at my passport and therefore assign a different probability.

Generally, what probability we assign to some proposition depends on what we already know about some other propositions. In mathematical speak, we refer to conditional probability - the probability assigned to X given that we already know that Y is true. In fact, for all practical purposes, all probabilities are conditional. Instead of saying that two different people assign a different probability to the same proposition, we may just as well say that they are just giving us two different probabilities concerning this proposition. The first person is giving us the probability conditional on A (her state of information), the other person is giving the probability conditional on B (her different state of information). There is nothing strange or disturbing about the discrepancy in numbers that arises then. On the contrary, if we could bring the two persons to believe exactly the same set of the "remaining" relevant propositions, they would agree perfectly on the probability assigned to the one uncertain proposition because they would effectively think exactly the same and thus lack any reason to disagree. This convergence of opinions is not easy to achieve, but it is not as far-fetched as it might seem. It can and routinely does happen during practical investigations and in science.
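The passport example can be written down as a tiny calculation. The table of counts below is entirely made up (imagine 100 authors whose writing looks like mine); the helper `p_native` is hypothetical too. The same proposition gets three different probabilities, one per state of information.

```python
from fractions import Fraction

# Invented counts of authors, keyed by
# (is a native English speaker?, passport says "American"?).
counts = {
    (True, True): 60,
    (True, False): 2,
    (False, True): 3,
    (False, False): 35,
}

def p_native(known_passport=None):
    """P(native speaker | information).

    known_passport=None means the passport has not been seen;
    True/False means the nationality entry was seen to be American/other.
    """
    relevant = {
        key: n for key, n in counts.items()
        if known_passport is None or key[1] == known_passport
    }
    native = sum(n for (is_native, _), n in relevant.items() if is_native)
    return Fraction(native, sum(relevant.values()))

reader_without_passport = p_native()          # 62 of 100
reader_saw_american = p_native(True)          # 60 of 63
reader_saw_other_nationality = p_native(False)  # 2 of 37
```

None of the three readers is wrong. They are answering three different conditional questions, and they would agree at once if they shared the same information.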

The important point to take from this part is: probabilities are assigned to propositions, but they are not properties of the propositions alone. Instead, a probability is a property of the proposition in question together with all other propositions held to be true by the person who assigned the probability. In fact, we can forget about the person altogether and just represent her by the totality of all propositions she knows to be true.

Continued in Part 4...

Introduction to Probability Theory, Part 2

2008-08-17T13:27:00.007+02:00

Continued from Part 1...

Propositions - the carriers of probability

A probability is a number between 0 and 1 which expresses someone's degree of confidence in the truth of some proposition. A proposition is simply a statement of fact like "This page is over 1000 words long". In reality, every such statement can be either true or false. You would only talk about a "probability" if you were unsure which of the two (true or false) was the case. However, what you always do know up-front, is that the proposition is either false or true, but not both, and not something in between either.

What about statements like "This page is entertaining and informative"? How can it really be either true or false? Doesn't it just depend on who is judging it? Well, it does, until you define some way of measuring "entertaining and informative" which does not involve a single person's tastes. But let's say that we agreed on some voting scheme in which all potential evaluators would participate. Then the "entertaining and informative" would no longer be up to your or my opinion only - it would become more of an objective property of this page. And yes, without having seen the actual ratings, you could be unsure about this property (how everyone has rated it). So you could assign different probabilities to all the possible "entertaining and informative" ratings it might have. In other words, you would then have propositions like "The entertainment rating is 0/10" or "The informativeness rating is 9/10", and of course each of them could be true or false, but not at the same time. You might feel more confident that this page has good ratings than bad ratings and express this by numbers using your probability assignment when asked about it.

The thing I'd like you to consider is that when we are discussing probabilities, we are talking about our degree of uncertainty about some concrete propositions. If the propositions appear fuzzy and their truth seems undecidable in principle, then we have to become more specific first and clarify what we mean before we can even start talking about and asking questions about probabilities. Obviously, if we don't even know what our questions are about, we cannot expect any definite and useful answers.

Incidentally, propositions like "a die throw result is 4" or "a coin throw outcome is heads" are very clear. Pretty much everyone agrees on what they mean and could check their truth just like anyone else. Now you see one reason why these sorts of propositions are so eagerly used in classroom introductions to probability. Still, there are many other propositions that are just as concrete and a lot more fun to think about than these trivial examples.

Finally, note that the very reason why we talk about probabilities of propositions is that, although they are verifiable in principle (their truth could be checked - and we know how), they may be quite hard to verify in practice. Maybe the proposition is about something that has not happened yet; it could also just as well be about some past event. If we were able to directly find out whether it's true or false, we would of course just do it and we wouldn't waste time talking about its "probability". Probability is for situations where we have to infer the truth of a proposition from whatever indirect clues we can collect without doing miracles or spending a fortune.

Continued in Part 3...

Introduction to Probability Theory, Part 1

2008-08-17T13:15:00.018+02:00

In this series of tutorial-style articles I recap what I have learned about probability theory from studying the work of E. T. Jaynes (available online here (book) and here (lectures)), which I recommend - with some reservations. The introductory parts are easy to read and enjoy. However, the later chapters are dominated by references to physics and mathematical formulae whose explanations are rather too brief for my taste. Jaynes seemed to write for students of physics at graduate level (even though I believe it was not his intention). I feel that his ideas are so intriguing and general that they deserve a broader audience. The goal of these posts is to introduce the most important concepts with fewer assumptions about the reader's level of mathematical sophistication; and to verify my own understanding in the process.

It's not just about coins and dice!

If you are like most people, you were introduced to the concept of probability at school with examples such as throwing dice, flipping coins, selecting cards from a deck, spinning lottery wheels, pulling colored balls from urns and other such. You will find plenty of such examples in various tutorials on the web, too. While there is nothing wrong about them in general, they can leave the impression that this is what "probability theory" is all about. A rather boring application of basic arithmetic to some idealized useless "random experiments" that no one cares about in real life. That is, unless they are after good marks for mechanical answers to silly questions like "what is the probability of scoring more than 2 but fewer than 8 with two dice". It appears just about as exciting and thought-provoking as solving quadratic equations for sport.

What they usually don't tell you is that probability theory describes what you - and everyone else - have been doing for your whole life with more or less success, without even realizing. All kinds of reasoning and decision making depend on probabilities that people assign to various propositions:

Whenever you look at something (like Escher's drawing of waterfall on the left), you unconsciously figure out the probabilities of seeing different scenes. You make up your mind what the scene is about and whether it is "real" or not;
Before you cross a street, you unconsciously figure out the probability of being hit by a car and getting to the other side safely;
Whenever you decide to buy something, you figure out the probability of getting good value for your money;
Detectives figure out who dun it based on probabilities of finding particular criminal evidence;
Criminals figure out how to reduce the probability of getting caught;
Scientists figure out which explanation is more probable than others for an observed phenomenon;
Businessmen figure out which deals are more likely to bring them profits;
Politicians figure out which public statements are more likely to bring them voters;
and so on, and so forth.

The really important thing to notice here is that we are almost never 100% certain about anything. We can be rather sure or rather doubtful about different things, but we can hardly ever honestly proclaim: "I know it's a sure thing" or "I know it's completely impossible" - except perhaps when trivial and uninteresting stuff is concerned. To put it in a slightly different way, whenever we need to think and make choices, there is always some uncertainty involved.

Real applied probability theory is about systematically improving our everyday thinking and decisions:

It's about drawing the best conclusions from whatever we already know and understand;
It's about not getting fooled and confused;
It's also about knowing how to act to become more knowledgeable about stuff that matters.

The concept of probability is quite difficult to grasp, though mathematically very simple. A tiny little part of it is about throwing dice and shaking urns in the classroom.

Continued in Part 2...