Why probabilistic programming matters

But don't get distracted by the fact that PPL programs look a lot like ordinary software implementations, where the goal is to run the program and get some kind of output. The goal in PP is analysis, not execution. As a running example, let's imagine that we're building a system to recommend research papers to students based on the classes they take.

Every paper is either a PL paper, a statistics paper, or both. It's pretty easy to imagine that machine learning should work for this problem: the mixture of areas revealed by your class schedule should say something about the papers you want to read. The problem is that the exact relationship can be hard to reason about directly. Clearly, registering for a statistics class means you're more likely to be interested in statistics, but exactly how much more likely? And what if you registered for that class only because it was the only one that fit into your schedule?

The machine-learning way of approaching this problem is to model the situation using random variables, some of which are latent. The key insight is that the arrows in Figure 1 don't make much sense: they don't really represent causality!

It's not that taking a particular class makes you more interested in a given paper; there's some other factor that probabilistically causes both events. These other factors are the latent random variables in a model for explaining the situation. Allowing yourself latent variables makes it much easier to reason directly about the problem.

Here's a model that introduces a couple of latent variables for each person's interest in statistics and programming languages. We'll get more specific about the model, but now the arrows at least make sense: they mean that one variable influences another in some way.

Since we all know that you don't take every class you're interested in, we also include a third hidden factor: how busy you are, which makes you less likely to take any given class.

This diagram depicts a Bayesian network, which is a graph where each vertex is a random variable and each edge is a statistical dependence. Variables that don't have edges between them are statistically independent: knowing something about one of the variables tells you nothing about the outcome of the other. To complete the model, we'll also draw nodes and edges to depict how our latent interest variables affect paper relevance.

The idea isn't that we'll ask people what their interest levels and busyness are: we'll try to infer them from what we can observe. Then we can use this inferred information to do what we actually want to do: guess the paper relevance for a given student. So far, we've drawn pictures of the dependencies in our model, but we need to get specific about what we mean. Here's how it normally goes: you write down a bunch of math that relates the random variables.
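For instance, the model sketched above might be spelled out along these lines (the particular distributions and probability tables below are illustrative placeholders, not the original article's numbers):

$$
\begin{aligned}
I_{\text{stats}} \sim \mathrm{Bernoulli}(0.5), \qquad
I_{\text{PL}} \sim \mathrm{Bernoulli}(0.5), \qquad
B &\sim \mathrm{Bernoulli}(0.3), \\
\Pr[\text{registers for the statistics class} \mid I_{\text{stats}}, B] &=
\begin{cases}
0.8 & \text{if } I_{\text{stats}} = 1 \text{ and } B = 0, \\
0.4 & \text{if } I_{\text{stats}} = 1 \text{ and } B = 1, \\
0.1 & \text{otherwise},
\end{cases} \\
\Pr[\text{a given paper is relevant} \mid I_{\text{stats}}, I_{\text{PL}}] &= \text{another table like the one above},
\end{aligned}
$$

and so on, with one such conditional table for every class and every paper topic.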

That's a lot of math for such a simplistic model! And it's not even the hard part. The hard (and useful) bit is statistical inference, where we guess the latent variables based on our observations.

Statistical inference is a cornerstone of machine-learning research, and it's not easy. Traditionally, experts design bespoke inference algorithms by hand for each new model they devise.

Even this tiny example should demonstrate the drudgery of by-hand statistical modeling. It's like writing assembly code: we're doing something that feels a bit like programming, but there are no abstractions, no reuse, no descriptive variable names, no comments, no debugger, and no type system.

Look at the equations for the class registration, for example: I got tired of writing out all that math because it's so repetitive. This is clearly a job for an old-fashioned programming-language abstraction: a function. The goal of PPLs is to bring the old and powerful magic of programming languages, which you already know and love, to the world of statistics. To introduce the basic concepts of a probabilistic programming language, I'll use a project called webppl, a PPL embedded in JavaScript.

It's a nice language to use as an introduction because you can play with it right in your browser. The first thing that makes a language a probabilistic programming language (PPL) is a set of primitives for drawing random numbers.

At this point, a PPL looks like any old imperative language with a rand call. Here's an incredibly boring webppl program: it just uses the outcome of a fair coin toss to return one string or another, and it works exactly like an ordinary program with access to a flip function for producing random Booleans. Functions like flip are sometimes called elementary random primitives, and they're the source of all randomness in these programs.
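In webppl, such a program might look something like this (the two strings are placeholders invented for illustration):

```js
// flip(0.5) is an elementary random primitive: true with probability 0.5.
var coin = flip(0.5);

// The value of the last expression is the program's result.
coin ? "read a PL paper" : "read a statistics paper";
```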

Things get slightly more interesting when we realize that webppl can represent entire distributions, not just individual values. The webppl language has an Enumerate operation, which prints out all the probabilities in the distribution defined by a function. For example, if the function rolls a couple of dice and returns their sum, you get a printout with all the possible die values between 2 and 14 and their associated probabilities.
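Here is a sketch of such a roll function; the die sizes are an assumption (one six-sided and one eight-sided die is what makes the sums run from 2 to 14), and the exact printing helper has varied across webppl versions:

```js
// Roll two dice and return their sum (2 through 14 for a d6 plus a d8).
var roll = function () {
  var die1 = randomInteger(6) + 1;  // uniform on 1..6
  var die2 = randomInteger(8) + 1;  // uniform on 1..8
  return die1 + die2;
};

// Enumerate explores every possible execution path of roll and builds
// the exact distribution over its return values.
var dist = Enumerate(roll);
display(dist);  // print each possible sum with its probability
```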

This may not look all that surprising, since you could imagine writing Enumerate in your favorite language by just running the roll function over and over. But in fact, Enumerate is doing something a bit more powerful. The real power of a probabilistic programming language lies in its compiler or runtime environment (like other languages, probabilistic ones can be either compiled or interpreted).

In addition to its usual duties, the compiler or runtime needs to figure out how to perform inference on the program.

Inference answers the question: of all of the ways in which a program containing random choices could execute, which of those execution paths provides the best explanation for the data? Another way of thinking about this: unlike a traditional program, which only runs in the forward direction, a probabilistic program is run in both the forward and backward direction.
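As a toy illustration in the same style as the dice sketch above, we can condition on an observed fact about the outcome and ask what it implies about one of the random choices; condition is webppl's primitive for discarding executions that contradict the data:

```js
// Suppose all we observed is that the two dice summed to at least 10.
// What does that observation tell us about the first die?
var firstDieGivenBigSum = Enumerate(function () {
  var die1 = randomInteger(6) + 1;
  var die2 = randomInteger(8) + 1;
  condition(die1 + die2 >= 10);  // keep only executions consistent with the data
  return die1;                   // marginal over die1 given the observation
});
display(firstDieGivenBigSum);
```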

A probabilistic program runs forward to compute the consequences of the assumptions it contains about the world (i.e., the model space it represents), but it also runs backward from the data to constrain the possible explanations. In practice, many probabilistic programming systems cleverly interleave these forward and backward operations to efficiently home in on the best explanations.

Probabilistic programming will unlock narrative explanations of data, one of the holy grails of business analytics and the unsung hero of scientific persuasion. People think in terms of stories; hence the unreasonable power of the anecdote to drive decision-making, well-founded or not.

But existing analytics largely fails to provide this kind of story; instead, numbers seemingly appear out of thin air, with little of the causal context that humans prefer when weighing their options. A probabilistic program, by contrast, describes the space of possible stories that could have generated the data; the specific solutions that are chosen from this space then constitute specific causal and narrative explanations for the data. The dream here is to combine the best aspects of anecdotal and statistical reasoning: the persuasive power of storytelling, and the predictiveness and generalization abilities of the larger body of data.

Probabilistic programming decouples modeling and inference. Just as modern databases separate querying from indexing and storage, and high-level languages and compilers separate algorithmic issues from hardware execution, probabilistic programming languages provide a crucial abstraction boundary that is missing in existing learning systems.

Matrix Factorization

The example in this section is adapted from a post on the pymc mailing list. Matrix factorizations are models that assume a given matrix was created as the product of two low-rank or sparse matrices.

These models often come up in recommendation engines, where the matrix represents how a given user (row) will rate a certain item (column). When used for recommendation, these approaches are usually called collaborative filtering methods. They have been extended to allow, among other things, grouping of the items into topics.
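One standard way to write the basic model down (the notation below is generic, not taken from the original pymc example): treat the $n \times m$ rating matrix $R$ as the product of low-rank user factors $U$ and item factors $V$,

$$
U_i \sim \mathcal{N}(0, \sigma_U^2 I_k), \qquad
V_j \sim \mathcal{N}(0, \sigma_V^2 I_k), \qquad
R_{ij} \mid U_i, V_j \sim \mathcal{N}(U_i^{\top} V_j, \sigma^2),
$$

where $k \ll \min(n, m)$ is the latent rank. Inference recovers plausible factor vectors $U_i$ and $V_j$ from the observed ratings, and missing entries are then predicted by $U_i^{\top} V_j$.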

Sparsity and Sparse Bayes

Part of the power of probabilistic programming languages is that they can be updated as our understanding of different distributions grows.

This allows new models to be fit that may have been cumbersome before. Note that this is work the library user benefits from without having to be actively involved in the development. As an example, suppose we wish to use sparse priors for our linear regression example instead of L2 norms. We may now represent arbitrary sparse priors such as the Laplace (L1) and spike-and-slab (L0) priors. Writing such a prior down is not valid JAGS code today, but an extension to support these generalized normal distributions is easy to add to any of these languages.
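As a rough sketch of the prior family being described (generic notation, not any particular language's syntax), a regression weight $w$ can be given a generalized normal prior

$$
p(w) \propto \exp\!\left( -\left| \frac{w}{\alpha} \right|^{\beta} \right),
$$

where $\beta = 2$ recovers the usual Gaussian (L2) prior, $\beta = 1$ gives the Laplace (L1) prior, and pushing $\beta$ toward 0 concentrates the prior ever more sharply at zero, in the spirit of spike-and-slab (L0-style) sparsity.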

Conditional Random Fields (CRF)

Conditional random fields allow you to create machine learning models where the label for a piece of data depends not just on local features of that data point but also on the features and labels of neighboring pieces of data.
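Concretely, for a linear-chain CRF over an input sequence $x$ and a label sequence $y$ (a standard textbook formulation, not Factorie-specific notation), the model is the conditional distribution

$$
p(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{t} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \right),
$$

where the feature functions $f_k$ may inspect neighboring labels as well as the input, $\lambda_k$ are learned weights, and $Z(x)$ normalizes over all possible label sequences.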

As an example, the part-of-speech of a word depends on the parts-of-speech of the words around it. Unfortunately, I could not express CRFs in the generative languages above, because a CRF is not a generative model. I include it here to show how succinctly this model is expressed in Factorie, which supports undirected graphical models. I chose some of the more popular models just to show how much flexibility we obtain with probabilistic programming languages. These languages make complex modeling and many machine learning tasks accessible to a wide audience.

In subsequent articles I will show how to fit these models in different probabilistic programming languages and how well they perform.



Stay tuned.


