Please consider applying for **graduate studies in computer science** (or encourage others to apply if, like me, your grad-school days are behind you).

In recent years, I’ve taken a special interest in the **theory of machine learning**, and there are **several opportunities at Harvard** in this area. In particular, I’ve been collaborating with people both in and outside CS, including incoming CS faculty **Sham Kakade**, as well as **Demba Ba** (Electrical Engineering and Bioengineering), **Lucas Janson** (Statistics), and **Cengiz Pehlevan** (Applied Mathematics). I’m very much open to co-advising students outside computer science as well.

Students interested in **quantum computation and information** can apply to any of our programs (including computer science, physics, and others) as well as to the new **quantum science and engineering degree**. There are a number of Harvard faculty interested in quantum computing including new incoming faculty member **Anurag Anshu**. In this area as well I am open to co-advising, including students outside CS.

If you are applying to graduate studies and are potentially interested in working with me, you can indicate this in the application form and statement of interest. This is the best way to ensure that I read your application (and in particular better than emailing me separately: for fairness' sake I read all the applications together in a batch, regardless of whether the candidate emailed me or not).

We are also looking for **postdocs**! We have the general **Rabin postdoc position**, as well as specific positions in **privacy**, **fairness**, and **theory of machine learning**, all **described in the following ad**. There is also a separate **networking+theory** postdoc position. Several other machine-learning theory related postdocs are linked in our **ML theory opportunities page**. See also the **quantum initiative postdoctoral fellowship**. There may be more CS, ML and quantum related opportunities that I am missing. If I hear of more relevant opportunities then I will post them here and/or **on Twitter**.

Finally, Harvard welcomes applications from candidates of all backgrounds, regardless of disability, gender identity and expression, physical appearance, race, religion, or sexual orientation. I would like to **especially encourage** applications from **members of under-represented groups**. Computer Science, and in particular the areas I work in, have a significant diversity problem. We are trying at Harvard to create an environment that is welcoming and inclusive for people from all backgrounds. I am sincerely grateful and appreciative of people placing their trust in us by applying, and we will do our best to live up to that trust.

The goal of the SIGACT Research Highlights Committee is to help promote top computer science theory research by identifying results that are of high quality and broad appeal to the general computer science audience. These results would then be recommended for consideration for the CACM *Research Highlights* section as well as other general-audience computer science research outlets.

**Nomination and Selection Process:**

The committee solicits two types of nominations:

1) **Conference nominations.** Each year, the committee will ask the PC chairs of theoretical computer science conferences to send a selection of up to three top papers from these conferences (selected based on both their technical merit and breadth of interest to a non-theory audience) and forward them to the committee for consideration.

2) **Community nominations.** The committee will accept nominations from members of the community. Each such nomination should summarize the contribution of the nominated paper and also argue why this paper is suitable for broader outreach. The nomination should be no more than a page in length and can be submitted at any time by emailing it to sigact.highlights.nominations@outlook.com. Self-nominations are discouraged.

The nomination deadline is **Friday, Oct 22, 2021**.

**Committee:**

The SIGACT Research Highlights Committee currently comprises the following members:

Irit Dinur, Weizmann Institute of Science

Yuval Ishai, Technion

Jelani Nelson, UC Berkeley (chair)

Ronald de Wolf, CWI Amsterdam

During the week of July 11-15, 2022, Cynthia Dwork and Guy Rothblum will be teaching a graduate course on algorithmic fairness at IPAM: Graduate Summer School on Algorithmic Fairness (ucla.edu)

The first three days will focus on theory, and the last two days will transition towards practice, featuring lectures by domain experts in machine learning, medicine, and economics. Participants will need to apply here. The application closes on **March 11, 2022**. See the website and the application link for further details.

**Call for Participation: APPROX/RANDOM 2021**

The 25th International Workshop on Randomization and Computation (RANDOM 2021) and the 24th International Workshop on Approximation Algorithms for Combinatorial Optimization Problems (APPROX 2021) will be starting on Monday August 16! The conferences will be held as parallel virtual conferences, August 16-18, 2021. RANDOM 2021 focuses on applications of randomness to computational and combinatorial problems while APPROX 2021 focuses on algorithmic and complexity theoretic issues relevant to the development of efficient approximate solutions to computationally difficult problems.

To learn more about the conferences and program, visit:

APPROX: https://approxconference.wordpress.com/approx-2021/

RANDOM: https://randomconference.com/random-2021-home/

In addition to an exciting program of live talks and discussion (to complement pre-recorded talks), the conference will feature two invited talks, by Jelani Nelson (UC Berkeley) and Vera Traub (ETH Zurich), as well as a social event with trivia and a cartoon caption contest!

Registration is only $10 for general audience members. You can register here: https://www.eventbrite.com/e/approx-2021-and-random-2021-tickets-162840998811?discount=audience

See part 1 of this series, and the pdf version of both parts. See also all seminar posts.

In the previous post we described the outline of the replica method, and outlined the analysis per this figure:

Specifically, we reduced the task of evaluating the expectation of a (potentially modified) log partition function to evaluating expectation over replicas which in turn amount to

where is the matrix of _overlaps_ (inner products between pairs of replicas), and is some nice analytical function depending on and the log probability obtaining this overlap . Then since this is an integral of exponentials, it turns out to be dominated by the maximizer , arriving at

reducing our desired quantity to and so if we’re lucky, all we need to do is run Mathematica or SymPy to find the maximizer of over the space of all?? (not really: see below) matrices.

Unfortunately, the description of above is a gross simplification. cannot be any matrix — it needs to satisfy particular constraints. In particular, it cannot be a matrix that appears with probability tending to zero with as the overlap matrix of a tuple of replicas from the Gibbs distribution.

Hence, we need to understand the space of potential matrices that could arise from the probability distribution, and find the global minimum under these constraints.

The most important constraint on is the **replica symmetry** (RS), or the lack thereof (**replica symmetry breaking**, or **RSB**). Recall that encodes the overlap between , where each element is a Gibbs random variable. On a high level, the structure of describes the geometry of the Gibbs distribution. An in depth description of the relationship between the two is beyond the scope of this post (check out Mean Field Models for Spin Glasses by Michel Talagrand). We will give some intuitions that apply in the zero-temperature limit.

The **replica symmetric ansatz** studies the following special form of matrix

where is the Kronecker delta. In other words, this ansatz corresponds to the guess that if we pick random replicas then they will satisfy that for all , and for .

This ansatz is especially natural for problems with unique minimizers for a fixed problem instance .

In such a problem we might imagine that the replicas are all random vectors that have the same correlation with the true minimizer, and since they are random and in high dimension, this correlation explains their correlation with one another (see example below).

>**What have we done?** It is worthwhile to pause and take stock of what we have done here. We have reduced computing into finding an expression for and then reduced this to computing where the expectation is taken over the induced distribution of the overlap matrix. Now for every fixed , we reduce the task to optimizing over just two parameters and . Once we find the matrix that optimizes this bound, we obtain the desired quantity by taking .

Annealed langevin dynamics on a convex and non-convex objective below illustrate how the geometry of the learning problem influences the structure of the overlap matrix .

We see that even in low-dimensional problems, the structure of the loss landscape influences the resulting matrices. Since the replica method works only in high dimensions, these animations cannot be taken too seriously as a justification of the symmetry ansatz, but below we discuss in what kinds of models we could expect the symmetry ansatz to be a good idea.

We will now discuss a simple model where the replica symmetry ansatz is especially natural. For a fixed problem instance , suppose that the vectors are distributed in a point cloud about some mean vector .

where are zero-mean noise independently sampled across different replicas with covariance . This is equivalent to stipulating a Gibbs measure with energy where is the distribution of each noise variable. In this case, the matrix has elements

By the central limit theorem, these sums of independently sampled random variables are approximately Gaussian (remember ), so we can estimate how behaves in the large limit

This implies that in the thermodynamic limit, the matrix concentrates around a replica symmetric structure. Note that the emergence of this RS structure relied on the fact that high dimensional random vectors are approximately orthogonal. For many supervised learning problems such as least squares fitting, this toy model is actually relevant by specifically taking .
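As a sanity check (not from the original post; all names and parameter values below are our own illustrative choices), we can sample such a point cloud of replicas numerically and watch the overlap matrix concentrate onto a replica-symmetric structure:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_rep, sigma = 5000, 6, 0.5   # dimension, number of replicas, noise scale

s_star = rng.standard_normal(N) / np.sqrt(N)   # fixed "mean" vector, |s*|^2 ~ 1
noise = sigma * rng.standard_normal((n_rep, N)) / np.sqrt(N)
replicas = s_star + noise                      # replica a is s* plus its own noise

Q = replicas @ replicas.T                      # overlap matrix: Q_ab = <s_a, s_b>
diag = np.diag(Q)
off = Q[~np.eye(n_rep, dtype=bool)]

# diagonal entries cluster near |s*|^2 + sigma^2, off-diagonal near |s*|^2,
# with fluctuations of order 1/sqrt(N): the replica-symmetric pattern
print(diag.mean(), off.mean(), off.std())
```

In high dimension the cross terms between independent noise vectors vanish, which is exactly the "approximately orthogonal" mechanism described above.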

To show these tools in action we will first study the simplest possible example with that has an interesting outcome. We will study the generalization performance of ridge regression on Gaussian distributed random features. In particular we will study a thermodynamic limit where the number of samples and the number of features are both tending to infinity but with finite ratio . We will observe a phase transition at the point , where the learning problem transitions from over-parameterized () to under-parameterized (). In the presence of noise this leads to an overfitting peak which can be eliminated through explicit regularization.

Hertz, Krogh, and Thorbergsson first studied this problem and noted the phase transition at . Advani and Ganguli examine this model as a special case of M-estimation. Analysis of this model can also be obtained as a special case of kernel regression, for which general learning curves were obtained by Canatar, Bordelon, and Pehlevan with the replica method. Similar overfitting peaks were recently observed in nonlinear two layer neural networks by Belkin, Hsu, Ma, and Mandal and modeled with the replica method by d’Ascoli, Refinetti, Biroli, and, Krzakala allowing them to clarify the two possible types of overfitting peaks in random feature models. This problem can also be studied with tools from random matrix theory as in the work of Mei and Montanari and several others.

Our problem instance is a dataset with drawn i.i.d. from a Gaussian distribution . The target values are generated by a noisy linear teacher

where and noise is Gaussian distributed . We will compute, not the generalization error for a particular problem instance , but the *average* performance over random datasets! The energy function we study is the ridge regression loss function

The ridge parameter controls the trade-off between training accuracy and regularization of the weight vectors. When , the training data are fit perfectly while the limit gives as the minimizer of . The limit of the Gibbs distribution corresponds to studying the performance of the ridge regression solution which minimizes . The generalization error is an average over possible test points drawn from the same distribution

We introduce a partition function for the Gibbs distribution on

We can rewrite the integral through a simple change of variables since . represents the discrepancy between the learned weights and the target weights . We will now replicate and average over the training data , i.e., compute .

**Warning:** Notice that by writing these integrals, we are implicitly assuming that is an integer. Eventually, we need to take limit to obtain the generalization error from . After computation of at integer , we will get an analytic expression of which will allow us to non-rigorously take .

The randomness from the dataset is present in the first term only, and appears through mean-zero Gaussian variables which have covariance structure

where is the vector of all ones and we introduced overlap order parameters defined as

The average over the randomness in the dataset is therefore converted into a routine Gaussian integral. Exploiting the independence over each data point, we break the average into a product of averages.

Each average is a multivariate Gaussian integral of the form

This integral can be derived by routine integration of Gaussian functions, which we derive in the Appendix.

To enforce the definition of the order parameters, we insert delta-functions into the expression for which we write as Fourier integrals over dual order parameters

This trick is routine and is derived in the Appendix of this post.

After integration over and , we are left with an expression of the form

where is a function which arises from the average over and is calculated through integration over the variables.

**Warning:** The functions and have complicated formulas and we omit them here to focus on the conceptual steps in the replica method. Interested readers can find explicit expressions for these functions in the references above.

To make progress on the integral above, we will make the replica symmetry assumption, leveraging the fact that the ridge regression loss is convex and has a unique minimizer for . Based on our simulations and arguments above, we will assume that the and matrices satisfy *replica symmetry*

After the replica symmetry ansatz, the replicated partition function has the form

In the limit, this integral is dominated by the order parameters which satisfy the saddle point equations

**Warning:** Notice that is small (we are working in limit to study ) but is large (we are studying the “thermodynamic” limit). The order of taking these limits matters. It is important that we take first before taking so that, at finite value of , the integral for is dominated by the saddle point of .

We can solve the saddle point equations symbolically with Mathematica (see this notebook) in the limit. We notice that must scale like and must scale like . After factoring out the dependence on the temperature, we can compute the saddle point conditions through partial differentiation.

This symbolically gives us the order parameters at the saddle point. For example, the overlap parameter . After solving the saddle point equations, the generalization error can be written entirely in terms of the first order parameter at the saddle point. For replica , the generalization error is merely . Thus

where at the saddle point.

When the generalization error decreases linearly with : for and for . This indicates the target weights are perfectly estimated when the number of samples equals the number of features . A finite ridge parameter increases the generalization error when noise is zero . Asymptotically, the generalization error scales like for large .

In the presence of noise , the story is different. In this case, the generalization error exhibits a peak at before falling at a rate at large . In this regime, accurate estimation requires reducing the variance of the estimator by increasing the number of samples.
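The overfitting peak can be checked with a short Monte Carlo experiment. This is a sketch in the spirit of the post, not the authors' notebook; all parameter choices below are ours:

```python
import numpy as np

rng = np.random.default_rng(1)
N, lam, noise = 200, 1e-4, 0.5   # features, (tiny) ridge parameter, label-noise std

def gen_error(alpha, trials=5):
    """Average |w - w*|^2 over random datasets; this equals the generalization
    error for isotropic Gaussian inputs."""
    P = int(alpha * N)           # number of samples for the given ratio alpha = P/N
    errs = []
    for _ in range(trials):
        w_star = rng.standard_normal(N) / np.sqrt(N)   # teacher weights, |w*|^2 ~ 1
        X = rng.standard_normal((P, N))
        y = X @ w_star + noise * rng.standard_normal(P)
        # ridge solution w = (X^T X + lam I)^{-1} X^T y
        w = np.linalg.solve(X.T @ X + lam * np.eye(N), X.T @ y)
        errs.append(float(np.sum((w - w_star) ** 2)))
    return float(np.mean(errs))

# with noise > 0 and nearly-zero ridge, the error peaks near alpha = 1
for alpha in [0.5, 1.0, 2.0]:
    print(alpha, gen_error(alpha))
```

At small ridge the error blows up near the interpolation threshold and falls on either side, reproducing the sample-wise double descent shape discussed above.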

In the small limit, the order parameter behaves like for and for .

The free energy exhibits a discontinuous first derivative as , a phenomenon known as first-order phase transition. Let be the value of the free energy at the saddle point . Then we find

which indicates a discontinuous first derivative in the limit as . We plot this free energy for varying values of , showing that as a discontinuity in the free energy occurs at . The non-zero ridge parameter prevents the strict phase transition at .

Using the analysis of the saddle point, we are now prepared to construct a full picture of the possibilities. A figure below from this paper provides all of the major insights. We plot experimental values of generalization error in a dimensional problem to provide a comparison with the replica prediction.

( a ) When , the generalization error either falls like if noise is zero, or it exhibits a divergence at if noise is non-zero.

( b ) When noise is zero, increasing the explicit ridge increases the generalization error. At large , .

( c ) When there is noise, explicit regularization can prevent the overfitting peak and give optimal generalization. At large , .

( d ) In the plane, there are multiple possibilities for the learning curve . Monotonic learning curves are guaranteed provided is sufficiently large compared to . If regularization is too small, then two critical points can exist in the learning curve, i.e., two values where (sample-wise double descent). For very large noise, a single local maximum exists in the learning curve , which is followed by monotonically decreasing error.

_Detailed calculations can be found in this excellent introduction of the problem by Song Mei._

Suppose we have a -by- rank-1 matrix, , where is a norm-1 column vector constituting the signal that we would like to recover. The input we receive is corrupted by symmetric Gaussian i.i.d. noise, i.e.,

where ( is drawn from a Gaussian Orthogonal Ensemble). At large , the eigenvalues of follow Wigner’s semicircle law on an interval of the real line. Thus, the best estimate (which we call ) of

from is the eigenvector associated with the largest eigenvalue. In other words

The *observable* of interest is how well the estimate, , matches , as measured by . We would like to know its average over different .

In the problem setup, is a constant controlling the signal-to-noise ratio. Intuitively, the larger is, the better the estimate should be (when averaged over ). This is indeed true. Remarkably, for large , is almost surely for . For , it grows quickly as . This discontinuity at is a **phase transition**. This dependence on can be derived using the replica method.
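Here is a minimal simulation sketch of this transition (not the post's figure; the normalization conventions are our own: the planted vector has unit norm and the GOE is scaled to have semicircle support [-2, 2], so the transition sits at signal strength 1 and the squared overlap approaches 1 - 1/lambda^2 above it):

```python
import numpy as np

rng = np.random.default_rng(2)

def top_overlap(lam, N=600):
    """Squared overlap of the planted unit vector with the top eigenvector
    of (lam * u u^T + GOE noise)."""
    u = rng.standard_normal(N)
    u /= np.linalg.norm(u)
    A = rng.standard_normal((N, N)) / np.sqrt(2 * N)
    W = A + A.T                                  # GOE, semicircle on [-2, 2]
    Y = lam * np.outer(u, u) + W
    vals, vecs = np.linalg.eigh(Y)               # eigh returns ascending order
    return float((u @ vecs[:, -1]) ** 2)         # overlap with top eigenvector

# below the transition the overlap vanishes; above it, it tracks 1 - 1/lam^2
print(top_overlap(0.5), top_overlap(2.0))
```

Increasing N sharpens the transition, matching the self-averaging behavior noted below.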

In the simulations above, we see two trends with increasing . First, the average curve approaches the theory, which is good. In addition, trial-to-trial variability (as reflected by the error bars) shrinks. This reflects the fact that our observable is indeed self-averaging.

Here, we give a brief overview of how the steps of a replica calculation can be set up and carried out.

Here, is the problem parameter () that we average over. The minimized function is

This energy function already contains a “source term” for our observable of interest. Thus, the vanilla partition function will be used as the augmented partition function. In addition, this function does not scale with . To introduce the appropriate scaling, we add an factor to , yielding the partition function

It follows that (again using angular brackets to denote average over )

Since we are ultimately interested in this observable for the best estimate, , at the large , limit, we seek to compute

Why don’t we evaluate the derivative only at ? Because is not a source term that we introduced. Another way to think about it is that this result needs to be a function of , so of course we don’t just evaluate it at one value.

Per the replica trick, we need to compute

**Warning:** Hereafter, our use of “” is loose. When performing integrals, we will ignore the constants generated and only focus on getting the exponent right. This is because we will eventually take of and take the derivative w.r.t. . A constant in in front of the integral expression for does not affect this integral. This is often the case in replica calculations.

where is a uniform measure on and is the probability measure for , as described above.

We will not carry out the calculation in detail in this note as the details are problem-specific. But the overall workflow is rather typical of replica calculations:

1. Integrate over . This can be done by writing as the sum of a Gaussian i.i.d. matrix with its own transpose. The integral is then over the i.i.d. matrix and thus a standard Gaussian integral. After this step, we obtain an expression that no longer contains , a major simplification.

2. Introduce the order parameter . After the last integral, the exponent only depends on through and . These dot products can be described by a matrix , where we define and .

3. Replace the integral over with one over . A major inconvenience of the integral over is that it is not over the entire real space but over a hypersphere. However, we can demand by requiring . Now, we rewrite the exponent in terms of and integrate over instead, but we add many Dirac delta functions to enforce the definition of . We get an expression in the form

4. After some involved simplifications, we have

where does not depend on and is . By the saddle point method,

where needs to satisfy the various constraints we proposed (e.g., its diagonal is all and it is symmetric).

The optimization over is not trivial. Hence, we make some guesses about the structure of , which maximizes the exponent. This is where the *replica symmetry (RS) ansatz* comes in. Since the indices of are arbitrary, one guess is that for all the off-diagonal entries take the same value. In addition, all diagonal entries are equal. This is the RS ansatz: it assumes an equivalence between replicas. Rigorously showing whether this is indeed the case is challenging, but we can proceed with this assumption and see if the results are correct.

The maximization of is now over two scalars, and . Writing the maximum as , and using the replica identity

Setting the derivative of w.r.t. them to zero yields two solutions.

**Bad Math Warning:** Maximizing w.r.t. requires checking that the solutions are indeed global minima, a laborious effort that has been done for some models. We will assume them to be global minima.

For each solution, we can compute , which will become what we are looking for after being differentiated w.r.t. . We obtain an expression of . The two solutions yield and , respectively. Differentiating each w.r.t. to get , we have and .

Which one is correct? We decide by checking whether the solutions (black line and blue line above) are sensible (“physical”). It can be verified that , which is the largest eigenvalue of . Clearly, it should be non-decreasing as a function of . Thus, for , we choose the solution, and for the solution. Thus, is and in the two regimes, respectively.

We frequently encounter Gaussian integrals when using the replica method and it is often convenient to rely on basic integration results which we provide in this Appendix.

The simplest Gaussian integral is the following one dimensional integral

We can calculate the square of this quantity by changing to polar coordinates

We thus conclude that . It follows that the function is a normalized function over . This integral can also be calculated with Mathematica or SymPy. Below is the result in SymPy.

```python
from sympy import integrate, exp, oo
from sympy.abc import a, x

# SymPy returns a Piecewise answer that is valid when the integral converges
print(integrate(exp(-a / 2 * x**2), (x, -oo, oo)))
```

This agrees with our result since we were implicitly assuming real and positive ().

We can generalize this result to accommodate slightly more involved integrals which contain both quadratic and linear terms in the exponent. This exercise reduces to the previous case through simple completion of the square

We can turn this equality around to find an expression

Viewed in this way, this formula allows one to transform a term quadratic in in the exponential function into an integral involving a term linear in . This is known as the Hubbard-Stratonovich transformation. Taking to be imaginary ( for real ), we find an alternative expression of a Gaussian function
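The completed-square formula can be verified with SymPy (a quick check under the assumption that the quadratic coefficient is positive; symbol names are ours):

```python
from sympy import symbols, integrate, exp, sqrt, pi, oo

# a > 0 for convergence; b > 0 only to keep the symbolic output simple
a, b, x = symbols('a b x', positive=True)

lhs = integrate(exp(-a * x**2 / 2 + b * x), (x, -oo, oo))
rhs = sqrt(2 * pi / a) * exp(b**2 / (2 * a))   # completed-square closed form

# difference evaluates to zero (up to numerical precision)
print((lhs - rhs).subs({a: 2, b: 1}).evalf())
```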

A delta function can be considered as a limit of a normalized mean-zero Gaussian function with variance taken to zero

We can now use the Hubbard-Stratonovich trick to rewrite the Gaussian function

Thus we can relate the delta function to an integral

This trick is routinely utilized to represent delta functions with integrals over exponential functions during a replica calculation. In particular, this identity is often used to enforce definitions of the order parameters in the problem. For example, in the least squares problem where we used

We commonly encounter integrals of the following form

where the matrix is symmetric and positive definite. An example is the data average in the least squares problem studied in this blog post where . We can reduce this to a collection of one dimensional problems by computing the eigendecomposition of . From this decomposition, we introduce variables

The transformation from to is orthogonal so the determinant of the Jacobian has absolute value one. After changing variables, we therefore obtain the following decoupled integrals

Using the fact that the determinant is the product of eigenvalues , we have the following expression for the multivariate Gaussian integral

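A quick numerical check of the multivariate Gaussian integral formula, using an arbitrary 2-by-2 positive definite matrix of our own choosing:

```python
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])   # symmetric positive definite, N = 2

# closed form: (2*pi)^(N/2) / sqrt(det A)
closed_form = (2 * np.pi) ** (len(A) / 2) / np.sqrt(np.linalg.det(A))

# brute-force Riemann sum; [-8, 8] is wide enough since the tails decay fast
t = np.linspace(-8.0, 8.0, 801)
dx = t[1] - t[0]
X, Y = np.meshgrid(t, t)
quad = np.sum(np.exp(-0.5 * (A[0, 0] * X**2 + 2 * A[0, 1] * X * Y
                             + A[1, 1] * Y**2))) * dx**2

print(closed_form, quad)   # the two agree closely
```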

*[Boaz’s note: Blake and Haozhe were students in the ML theory seminar this spring; in that seminar we touched on the replica method in the lecture on inference and statistical physics but here Blake and Haozhe (with a little help from the rest of us) give a great overview of the method and its relations to ML. See also all seminar posts.]*

See also: PDF version of both parts and part 2 of this post.

In computer science and machine learning, we are often interested in solving optimization problems of the form

where is an objective function which depends on our decision variables as well as on a set of problem-specific parameters . Frequently, we encounter problems relevant to machine learning, where is a random variable. The replica method is a useful tool to analyze *large problems* and their *typical* behavior over the distribution of .

Here are a few examples of problems that fit this form:

- In supervised learning, may be a training loss, a set of neural network weights and the data points and their labels
- We may want to find the most efficient way to visit all nodes on a graph. In this case describes nodes and edges of the graph, is a representation of the set of chosen edges, and can be the cost of if encodes a valid path and (or a very large number ) if it doesn’t encode a valid path.
- Satisfiability: is a collection of booleans which must satisfy a collection of constraints. In this case the logical constraints (clauses) are the parameters . can be the number of constraints violated by .
- Recovery of structure in noisy data: is our guess of the structure and are instances of the observed noisy data. For example PCA attempts to identify the directions of maximal variation in the data. With the replica method, we could ask how the accuracy of the estimated top eigenvector degrades with noise.

The **replica method** is a way to calculate the value of some statistic (**observable** in physics-speak) of where is a “typical” minimizer of and is a “typical” value for the parameters (which are also known in physics-speak as the **disorder**).

In Example 1 (supervised learning), the observable may be the generalization error of a chosen algorithm (e.g. a linear classifier) on a given dataset. For Example 2 (path), this could be the cost of the best path. For Example 3 (satisfiability), the observable might be whether or not a solution exists at all for the problem. In Example 4 (noisy data), the observable might be the quality of decoded data (distance from ground truth under some measure).

An observable like generalization error obviously depends on , the problem instance. However, can we say something more general about this *type of problem*? In particular, if obeys some probability distribution, is it possible to characterize the typical observable over different problem instances ?

For instance, in Example 1, we can draw all of our training data from a distribution. For each random sample of data points , we find the set of which minimize and compute a generalization error. We then repeat this procedure many times and average the results. Sometimes, there are multiple that minimize for a given sample of ; this requires averaging the observable over all global minima for each first, before averaging over different .

To give away a “spoiler”, towards the end of this note, we will see how to use the replica method to give accurate predictions of performance for noisy least squares fitting and spiked matrix recovery.

*Generalization gap in least squares ridge regression, figure taken from Canatar, Bordelon, and Pehlevan*

*Performance (agreement with planted signal) as function of signal strength for spiked matrix recovery, as the dimension grows, the experiment has stronger agreement with theory. See also Song Mei’s exposition*

Now that we are motivated, let’s see what quantities the replica method attempts to obtain. In general, given some observable , we take the average of over a minimizer chosen at random from the set of minimizers, and then take the average of this quantity over the choice of the disorder .

In other words, we want to compute the following quantity:

The above equation has two types of expectation: over the disorder , and over the minimizers .

The physics convention is to

- use for the expectation of a function over the disorder
- use for the expectation of a function over chosen according to some measure .

Using this notation, we can write the above as

where denotes an average over the probability measure for problem parameters , and is a uniform distribution over the set of that minimize with zero probability mass placed on sub-optimal points.

The ultimate goal of the replica method is to express

but it will take some time to get there.

Above, we glossed over an important distinction between the “typical” value of and the *average* value of . This is OK only when we have *concentration* in the sense that with high probability over the choice of , is close to its expected value. We define this as the property that with probability at least , the quantity is within a multiplicative factor of its expectation, where is some quantity that goes to zero as the system size grows. A quantity that concentrates in this sense is called **self-averaging**.

For example, suppose that where each equals with probability and with probability independently. Standard Chernoff bounds show that with high probability or . Hence is a self-averaging quantity.

In contrast, the random variable is not self-averaging. Since and these random variables are independent, we know that . However, with high probability a typical value of will be of the form . Since we see that the typical value of is exponentially smaller than the expected value of .

The example above is part of a more general pattern. Often even if a variable is not self averaging, the variable will be self-averaging. Hence if we are interested in the typical value of , the quantity is more representative than the quantity .
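A quick simulation illustrates this gap. We assume the intended example is X = x_1 + ... + x_n with independent uniform signs x_i = ±1, so that E[2^X] = (5/4)^n (each term has E[2^{x_i}] = (2 + 1/2)/2 = 5/4); the parameter choices are ours:

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 400, 20000
X = rng.choice([-1, 1], size=(trials, n)).sum(axis=1)   # X = sum of n fair signs

# X concentrates: typical |X| is O(sqrt(n)), tiny compared to n
print(np.abs(X).mean(), np.sqrt(n))

# 2^X does not self-average: the true mean (5/4)^n is astronomically larger
# than any value we are at all likely to sample
Y = 2.0 ** X
print(np.median(Y), Y.mean(), 1.25 ** n)
```

The empirical mean of 2^X sits far below (5/4)^n yet far above the median, because it is dominated by rare large deviations of X, exactly the failure of self-averaging described above.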

Suppose that we want to compute a quantity of the form $\overline{\langle A \rangle}$ as above. When is it a good idea to use the replica method to do so?

Generally, we would want it to satisfy the following conditions:

- The learning problem is high dimensional with a large budget of data. The replica method describes a *thermodynamic limit* where the system size and data budget are taken to infinity with some fixed ratio between the two quantities. Such a limit is obviously never achieved in reality, but in practice sufficiently large learning problems can be accurately modeled by the method.
- The loss or the constraints are convenient functions of $w$ and $D$. Typically the loss will be a low-degree polynomial or a sum of local functions (each depending on a small number of variables) in $w$.
- Averages over the disorder in $D$ are tractable analytically. That is, we can compute marginals of the distribution over $D$.
- The statistic that we are interested in is self-averaging.

If the above conditions aren’t met, it is unlikely that the replica method will yield much analytical insight into the problem.

We now describe the conceptual steps that are involved in calculating a quantity using the replica method.

They are also outlined in this figure:

The uniform measure on minimizers is often difficult to work with. To make progress, we can think of it as a special case of what is known as the Gibbs measure, defined as

$$\mu_\beta(w) = \frac{e^{-\beta L(w; D)}}{Z(\beta)},$$

where $Z(\beta) = \int e^{-\beta L(w; D)}\, dw$ is the normalization factor, or **partition function**. $\beta$ is called the **inverse temperature**, a name that comes from thermodynamics. It is easy to see that when $\beta \to \infty$ (i.e., when the temperature tends to absolute zero), the Gibbs measure converges to the uniform distribution on the minimizers of $L$: $\mu_\beta \to \mu_\star$.
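This limiting behavior is easy to check numerically on a toy landscape (the five energy values below are made up for illustration):

```python
import math

# Toy energy landscape on five states; states 0 and 1 tie for the minimum.
L = [0.0, 0.0, 1.0, 2.0, 5.0]

def gibbs(beta):
    weights = [math.exp(-beta * energy) for energy in L]
    Z = sum(weights)                 # partition function
    return [w / Z for w in weights]

low_temp = gibbs(50.0)    # large beta: mass concentrates on the minimizers
high_temp = gibbs(0.0)    # beta = 0: uniform over all five states

print([round(p, 4) for p in low_temp])    # [0.5, 0.5, 0.0, 0.0, 0.0]
print([round(p, 4) for p in high_temp])   # [0.2, 0.2, 0.2, 0.2, 0.2]
```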

Hence we can write

$$\overline{\langle A \rangle_{\mu_\star}} = \overline{\lim_{\beta \to \infty} \langle A \rangle_{\mu_\beta}}.$$

Physicists often exchange the order of limits and expectations at will, which generally makes sense in this setting, and so assume

$$\overline{\lim_{\beta \to \infty} \langle A \rangle_{\mu_\beta}} = \lim_{\beta \to \infty} \overline{\langle A \rangle_{\mu_\beta}}.$$

Thus the general approach taken in the replica method is to derive an expression for the average observable $\overline{\langle A \rangle_{\mu_\beta}}$ for any $\beta$ and then take the $\beta \to \infty$ limit. The quantity $\langle A \rangle_{\mu_\beta}$ is also known as the *thermal average* of $A$, since it is taken with respect to the Gibbs distribution at some positive temperature.

To compute the thermal average of $A$, we define the following *augmented partition function*:

$$Z_\epsilon(\beta) = \int e^{-\beta L(w; D) + \epsilon A(w)}\, dw.$$

One can then check that

$$\langle A \rangle_{\mu_\beta} = \frac{\partial}{\partial \epsilon} \log Z_\epsilon(\beta) \Big|_{\epsilon = 0}.$$

Hence our desired quantity can be obtained as

$$\overline{\langle A \rangle_{\mu_\beta}} = \overline{\frac{\partial}{\partial \epsilon} \log Z_\epsilon(\beta) \Big|_{\epsilon = 0}},$$

or (assuming we can again exchange limits at will):

$$\overline{\langle A \rangle_{\mu_\beta}} = \frac{\partial}{\partial \epsilon} \overline{\log Z_\epsilon(\beta)} \Big|_{\epsilon = 0}.$$
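The derivative identity behind this step can be checked numerically on a tiny system; the energies, observable values, and inverse temperature below are arbitrary choices for illustration:

```python
import math

# Four states with energies L[i] and observable values A[i].  We check that the
# thermal average of A equals d/d(eps) log Z_eps at eps = 0, where
# Z_eps = sum_i exp(-beta * L[i] + eps * A[i]).
L = [0.0, 0.5, 1.0, 3.0]
A = [1.0, 2.0, 3.0, 4.0]
beta = 1.3

def log_Z(eps):
    return math.log(sum(math.exp(-beta * l + eps * a) for l, a in zip(L, A)))

weights = [math.exp(-beta * l) for l in L]
thermal_avg = sum(w * a for w, a in zip(weights, A)) / sum(weights)

eps = 1e-6
numeric_deriv = (log_Z(eps) - log_Z(-eps)) / (2 * eps)  # central difference

print(thermal_avg, numeric_deriv)   # the two values agree closely
```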

We see that ultimately computing the desired quantity reduces to computing quantities of the form

$$\overline{\log Z}$$

for the original or modified partition function $Z$. Hence our focus from now on will be on computing $\overline{\log Z}$.

Averaging over the disorder $D$ is known as a “configurational average” or **quenched average**. All together, we obtain the observable by first taking the thermal average (over $\mu_\beta$) and then the quenched average over $D$.

**What concentrates?** It is not just an algebraic convenience to average $\log Z$ instead of averaging $Z$ itself. When the system size is large, $\log Z$ concentrates around its average. Thus, the typical behavior of the system can be understood by studying the quenched average $\overline{\log Z}$. The partition function itself often does not concentrate, and in general the values $\log \overline{Z}$ (known as the “annealed average”) and $\overline{\log Z}$ (known as the “quenched average”) could differ substantially. For more information, please consult Chapter 5 of Mezard and Montanari’s excellent book.

Hereafter, we use $Z$ to denote the partition function and drop its dependence on $\beta$ and $\epsilon$. To compute $\overline{\log Z}$, we use the identity

$$\overline{\log Z} = \lim_{n \to 0} \frac{\overline{Z^n} - 1}{n}.$$

For the limit to make sense, $n$ should range over the real numbers. However, the expression for $\overline{Z^n}$ is only easily computable for natural numbers $n$. This step is non-rigorous: we will obtain an expression for $\overline{Z^n}$ for natural numbers $n$, and then take the $n \to 0$ limit after the fact.
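Although the analytic continuation in $n$ is the non-rigorous step, the identity itself is easy to verify numerically for a generic positive random variable (a log-normal $Z$ is an arbitrary choice here):

```python
import math
import random

random.seed(2)

# Z is a positive random variable (here log-normal).  The replica identity says
#   E[log Z] = lim_{n -> 0} (E[Z^n] - 1) / n,
# because Z^n = exp(n log Z) is approximately 1 + n log Z for small n.
samples = [math.exp(random.gauss(0.0, 1.0)) for _ in range(50_000)]

quenched = sum(math.log(z) for z in samples) / len(samples)

def replica_estimate(n):
    moment = sum(z ** n for z in samples) / len(samples)
    return (moment - 1.0) / n

approx = replica_estimate(1e-4)
print(quenched, approx)   # the two agree to several decimal places
```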

Recall that under the Gibbs distribution $\mu_\beta$, the probability density on a state $w$ is equal to $e^{-\beta L(w; D)}/Z$. Denote by $\mu_\beta^{\otimes n}$ the probability distribution over a tuple $(w^1, \ldots, w^n)$ of $n$ independent samples (also known as **replicas**) chosen from $\mu_\beta$.

Since the partition function is an integral (or a sum in the discrete case) of the form $Z = \int e^{-\beta L(w; D)}\, dw$, we can write $Z^n$ as the integral of $e^{-\beta \sum_{a=1}^n L(w^a; D)}$ where $w^1, \ldots, w^n$ are independent variables.

Now since each $w^a$ is weighted with a factor of $e^{-\beta L(w^a; D)}$, this expression can be shown to equal the expectation of some exponential function over a tuple of $n$ independent samples, or **replicas**, all coming from the same Gibbs distribution corresponding to the same instance $D$.

(The discussion above is just for intuition – we will not care about the particular form of this expectation, since we will soon average it over $D$.)

Hence

$$\overline{Z^n} = \overline{\int \cdots \int e^{-\beta \sum_{a=1}^n L(w^a; D)}\, dw^1 \cdots dw^n}.$$

The above expression is an expectation of an integral, and so we can switch the order of integration and expectation, and write it also as

$$\overline{Z^n} = \int \cdots \int \overline{e^{-\beta \sum_{a=1}^n L(w^a; D)}}\, dw^1 \cdots dw^n.$$

It turns out that for natural energy functions (for example when $L$ is quadratic, such as when it corresponds to mean squared error loss), for any tuple $(w^1, \ldots, w^n)$, the expectation over $D$ of $e^{-\beta \sum_a L(w^a; D)}$ only depends on the angles between the $w^a$’s.

That is, rather than depending on all $n$ of these $N$-dimensional vectors, it only depends on the $n \times n$ matrix of coefficients $Q_{a,b} = \langle w^a, w^b \rangle / N$. The matrix $Q$ is known as the **overlap matrix** or the **order parameters**, and one can often find a nice analytical function $f$ whose values are bounded (independently of $N$) such that

$$\overline{e^{-\beta \sum_a L(w^a; D)}} = e^{-N \cdot f(Q)}.$$

Hence we can replace the integral over $(w^1, \ldots, w^n)$ with an integral over $Q$ and write

$$\overline{Z^n} = \int e^{-N \cdot f(Q)}\, d\mu(Q),$$

where the measure $\mu$ is the one induced by the overlap distribution of a tuple $(w^1, \ldots, w^n)$ taken for a random choice of the parameters $D$.

Since $Q$ only ranges over a small ($n \times n$ dimensional) set, in the large $N$ limit the integral is dominated by the maximum of its integrand (the “method of steepest descent”, also known as the “saddle point method”). Let $Q^\star$ be the global minimum of $f$ (within some space of matrices). We have

$$\overline{Z^n} \approx e^{-N \cdot f(Q^\star)}.$$

Once we arrive at this expression, the configurational average is simply $\overline{\log Z} \approx -N \lim_{n \to 0} \frac{1}{n} f(Q^\star)$. These steps constitute the replica method. The ability to compute the configurational average by creating an appropriate $f(Q)$ is one of the factors determining whether the replica method can be used. For example, in the supervised learning example, the loss is almost always assumed to be quadratic; cross-entropy loss, for instance, is generally not amenable.
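The saddle-point step can be illustrated on a one-dimensional toy integral; the quadratic $f$ below is a made-up stand-in for the true free-energy functional:

```python
import math

# Laplace / saddle-point: (1/N) * log INT exp(-N f(q)) dq  ->  -min_q f(q).
def f(q):
    return (q - 0.3) ** 2 + 0.1    # minimum value 0.1, attained at q = 0.3

def log_integral(N, lo=-2.0, hi=2.0, steps=20_000):
    h = (hi - lo) / steps
    logs = [-N * f(lo + i * h) for i in range(steps + 1)]
    m = max(logs)                  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(x - m) for x in logs)) + math.log(h)

values = {N: log_integral(N) / N for N in (10, 100, 1000)}
print(values)   # approaches -min f = -0.1 as N grows
```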

**Bad math warning:** there are three limits, $N \to \infty$, $\beta \to \infty$, and $n \to 0$. In replica calculations, we assume that we can take these limits in whichever order is arithmetically convenient.

**Coming up:** In part two of this blog post, we will explain the replica symmetric assumption (or “Ansatz” in Physics-speak) on the order parameters and then demonstrate how to use the replica method for two simple examples: *least squares regression* and *spiked matrix model*.

Registration to the conference is free (but required!)

Visit the webpage https://itcrypto.github.io/2021/index.html for more information and for the full program.

Hope to see you there!

– The Organising Committee

We would like to ask for your feedback on the conference. Whether or not you attended this virtual edition of STOC, please fill out this form to help us make future TheoryFests even better!

Moreover, as part of one of the STOC social events (and in line with initiatives in 2017 by Shuchi Chawla and the Edit-a-Thon from STOC 2019), a spreadsheet aiming to crowdsource which TCS Wikipedia pages need improvement or creation has been set up:

https://docs.google.com/spreadsheets/d/1ZswaweUvsjVHnItMBnIxaiBZxcs-gEaUoODFoH4FGms/edit.

Feel free to use or edit it in view of improving the TCS coverage in Wikipedia.

Clément Canonne (on behalf of the STOC and TheoryFest organizers)

In recent years there has been increasing interest in using machine learning to improve the performance of classical algorithms in computer science, by fine-tuning their behavior to adapt to the properties of the input distribution. This “data-driven” or “learning-based” approach to algorithm design has the potential to significantly improve the efficiency of some of the most widely used algorithms. For example, it has been used to design better data structures, online algorithms, streaming and sketching algorithms, market mechanisms, and algorithms for combinatorial optimization, similarity search, and inverse problems. This virtual workshop will feature talks from experts at the forefront of this exciting area.

The workshop will take place virtually on **July 13-14, 2021**.

Registration is **free but mandatory**. Link to register: https://fodsi.us/ml4a.html

**Confirmed Speakers:**

- Alex Dimakis (UT Austin)
- Yonina Eldar (Weizmann)
- Anna Goldie (Google Brain, Stanford)
- Reinhard Heckel (Technical University of Munich)
- Stefanie Jegelka (MIT)
- Tim Kraska (MIT)
- Benjamin Moseley (CMU)
- David Parkes (Harvard)
- Ola Svensson (EPFL)
- Tuomas Sandholm (CMU, Optimized Markets, Strategy Robot, Strategic Machine)
- Sergei Vassilvitskii (Google)
- Ellen Vitercik (CMU/UC Berkeley)
- David Woodruff (CMU)

**Organizers:**

- Costis Daskalakis (MIT)
- Paul Hand (Northeastern)
- Piotr Indyk (MIT)
- Michael Mitzenmacher (Harvard)
- Ronitt Rubinfeld (MIT)
- Jelani Nelson (UC Berkeley)

(h/t Salil Vadhan)

The Journal of the ACM is looking for a new editor in chief: see the call for nominations. The (soft) deadline to submit nominations (including self nominations) is **July 19th** and you can do so by emailing Chris Hankin at c.hankin@imperial.ac.uk

**Previous post:** Toward a theory of generalization learning **Next post:** TBD.

See also all seminar posts and course webpage.

lecture slides (pdf) – lecture slides (Powerpoint with animation and annotation) – video

Much of the material on causality is taken from the wonderful book by Hardt and Recht.

For fairness, a central source was the book in preparation by Barocas, Hardt, and Narayanan as well as the related NeurIPS 2017 tutorial, and other papers mentioned below.

We may have heard that **“correlation does not imply causation”**. How can we mathematically represent this statement, and furthermore differentiate the two rigorously?

Roughly speaking, $X$ and $Y$ are correlated if $\Pr[Y = y \mid X = x]$ is different for different values of $x$. To represent causation, we change the second part of the formula: $X$ causes $Y$ if *intervening* to change $X$ to some value $x$ changes the probability of $Y$. That is,

$\Pr[Y = y \mid do(X = x)]$ depends on $x$.

Suppose we have the random variables $X, W, H$ (taken over the choice of a random person), which represent e**X**ercising, being over**W**eight, and having **H**eart disease, respectively. We put forth the following (hypothetical! this is not medical advice!) scenarios for their relationships:

**Scenario 1.** $X \sim Bernoulli(1/2)$: each person exercises with probability $1/2$. Now $W$, the overweight indicator, follows the causal relation:

$W = 0$ if $X = 1$, and $W \sim Bernoulli(1/2)$ if $X = 0$,

and the heart disease indicator follows the same rule (independently of $W$):

$H = 0$ if $X = 1$, and $H \sim Bernoulli(1/2)$ if $X = 0$.

So, in this scenario, exercise prevents heart disease and being overweight, while if we don’t exercise, we may be overweight or suffer from heart disease with probability 1/2 independently.

**Scenario 2.** $W \sim Bernoulli(1/4)$, with $X = 0$ if $W = 1$ and $X \sim Bernoulli(2/3)$ if $W = 0$,

and $H$ still depends on $X$ by the same rule as in the previous scenario. So, in this scenario, people are naturally prone to being overweight with probability 1/4, and being overweight makes you less likely to exercise, rather than the causal relation being the other way around. As before, exercise prevents heart disease, and someone who did not exercise will get heart disease with probability 1/2.

We find that in scenario 1, $\Pr[W = 1 \mid X = 0] = 1/2$. In scenario 2, we also have $\Pr[W = 1 \mid X = 0] = 1/2$.

In fact, as this table shows, the probabilities for all combinations of $X, W, H$ are identical in the two scenarios!

Now, consider the intervention of setting $X = 0$, i.e., to stop exercising. That is, we change the generating model for $X$ to be the constant $X = 0$. In scenario 1, $\Pr[W = 1 \mid do(X = 0)]$ is still $1/2$. In scenario 2, $X$ now tells us nothing about $W$, so we get $\Pr[W = 1 \mid do(X = 0)] = 1/4$. Now that we added in an intervention, the two scenarios are different!

This is an example of why **correlation is not causation**: while the conditional probabilities are identical in the two scenarios, the causal probabilities are different.

**NOTE:** Working out this example, and understanding **(a)** why the two scenarios induce identical probabilities, and in particular all conditional probabilities are identical and **(b)** why the causal probabilities differ from the conditional probabilities in Scenario 2, is a great way to get intuition for causality and its pitfalls.
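Following the note above, both scenarios can be checked exhaustively by enumerating the eight outcomes. The exact constants below (e.g. $\Pr[X = 1] = 1/2$ in scenario 1 and the $Bernoulli(2/3)$ rule in scenario 2) are reconstructed from the prose and may differ from the original tables:

```python
from itertools import product

# Scenario 1: X ~ Bern(1/2); W = 0 if X=1 else Bern(1/2); H = 0 if X=1 else Bern(1/2).
# Scenario 2: W ~ Bern(1/4); X = 0 if W=1 else Bern(2/3); H as in scenario 1.

def scenario1(do_x=None):
    dist = {}
    for x, w, h in product([0, 1], repeat=3):
        px = 0.5 if do_x is None else (1.0 if x == do_x else 0.0)
        pw = (1.0 if w == 0 else 0.0) if x == 1 else 0.5
        ph = (1.0 if h == 0 else 0.0) if x == 1 else 0.5
        dist[(x, w, h)] = px * pw * ph
    return dist

def scenario2(do_x=None):
    dist = {}
    for x, w, h in product([0, 1], repeat=3):
        pw = 0.25 if w == 1 else 0.75
        if do_x is None:
            px = (1.0 if x == 0 else 0.0) if w == 1 else (2/3 if x == 1 else 1/3)
        else:
            px = 1.0 if x == do_x else 0.0   # intervention overrides the mechanism
        ph = (1.0 if h == 0 else 0.0) if x == 1 else 0.5
        dist[(x, w, h)] = pw * px * ph
    return dist

def prob(dist, pred):
    return sum(p for k, p in dist.items() if pred(*k))

d1, d2 = scenario1(), scenario2()
same_joint = all(abs(d1[k] - d2[k]) < 1e-12 for k in d1)   # observationally identical

w1 = prob(scenario1(do_x=0), lambda x, w, h: w == 1)   # P[W=1 | do(X=0)] = 1/2
w2 = prob(scenario2(do_x=0), lambda x, w, h: w == 1)   # P[W=1 | do(X=0)] = 1/4
print(same_joint, w1, w2)
```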

Consider Scenario 1, where the causal structure is as follows:

Looking at the table above, we see that the unconditional probability $\Pr[H = 1]$ equals $1/4$. Since in this scenario there is no causal relation between being overweight and suffering from heart disease, the causal probability $\Pr[H = 1 \mid do(W = 0)]$ is also equal to $1/4$.

However, we can calculate the conditional probability from the table and see that $\Pr[H = 1 \mid W = 0] = 1/6 < 1/4$.

That means that even though in this scenario, there is no causal relation between being overweight and getting heart disease, conditioning on not being overweight reduces the probability of getting heart disease.

Once again we see here a **gap** between the **conditional** and **causal** probabilities.

The reason for this gap is that there is a **confounding variable**, namely $X$, which is a common cause of both $W$ and $H$.

**Definition:** $X, Y$ are *confounded* if there are values $x, y$ such that $\Pr[Y = y \mid do(X = x)] \neq \Pr[Y = y \mid X = x]$.

To fix the effect of a confounder $Z$, we condition on it. This also allows us to find the probability of an intervention. The general **deconfounding formula** is

$$\Pr[Y = y \mid do(X = x)] = \sum_z \Pr[Z = z] \cdot \Pr[Y = y \mid X = x, Z = z] \quad (★),$$

where $Z$ ranges over all the immediate causes of $X$.

Contrast this with the formula for computing the conditional probability, which is

$$\Pr[Y = y \mid X = x] = \sum_z \Pr[Z = z \mid X = x] \cdot \Pr[Y = y \mid X = x, Z = z].$$
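Here is the difference between the two formulas in action on the exercise/heart-disease example; the scenario 1 constants are the reconstructed ones (exercise with probability $1/2$, and $W, H$ independent $Bernoulli(1/2)$ for non-exercisers):

```python
# X confounds W and H, and there is no W -> H edge, so the causal probability
# P[H=1 | do(W=0)] should equal the unconditional P[H=1] = 1/4.

p_x = {0: 0.5, 1: 0.5}                       # P[X = x]
p_w0_given_x = {0: 0.5, 1: 1.0}              # P[W = 0 | X = x]
p_h1_given_xw = {(0, 0): 0.5, (1, 0): 0.0}   # P[H = 1 | X = x, W = 0]

# Deconfounding formula: weight the confounder X by its *prior* probability.
causal = sum(p_x[x] * p_h1_given_xw[(x, 0)] for x in (0, 1))          # = 1/4

# Naive conditional: weight X by its *posterior* given the observation W = 0.
p_w0 = sum(p_x[x] * p_w0_given_x[x] for x in (0, 1))                  # = 3/4
conditional = sum(p_x[x] * p_w0_given_x[x] / p_w0 * p_h1_given_xw[(x, 0)]
                  for x in (0, 1))                                    # = 1/6

print(causal, conditional)
```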

Using the deconfounding formula (★) requires **(a)** knowing the causal graph, and **(b)** observing the confounders. If we get this wrong and control for the wrong confounders we can get the causal probabilities wrong, as demonstrated by the following example.

One way to describe causality theory is that it aims to clarify the situations under which **correlation does in fact equal causation** (i.e., the **conditional probabilities are equal to the causal probabilities**), and how (by appropriately controlling for confounders) we can get to such a situation.

**Example (two diseases).** Consider the diagram below, where there are two diseases $D_1$ and $D_2$ such that each occurs independently with some small probability $p$. We assume each will send you to the hospital (variable $H$), and these are the only reasons to arrive at the hospital.

If you control for $H$ (i.e., look only at people who went to the hospital), we find that the diseases are now correlated: a priori the probability of $D_2$ is $p$, and conditioned on $H = 1$, the probability is roughly $1/2$, yet conditioned on $H = 1$ and $D_1 = 1$, it drops back down to $p$.

This relates to the joke “the probability of having 2 bombs on a plane is very low, so if I bring a bomb then it is very unlikely that there will be another bomb.”
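A quick enumeration makes the collider effect explicit (the base rate $p = 0.01$ is an assumed value for illustration):

```python
# Collider / Berkson's paradox: two independent diseases, hospital H = D1 or D2.
p = 0.01   # assumed base rate of each disease

joint = {}
for d1 in (0, 1):
    for d2 in (0, 1):
        weight = (p if d1 else 1 - p) * (p if d2 else 1 - p)
        h = 1 if (d1 or d2) else 0
        joint[(d1, d2, h)] = weight

def pr(pred):
    return sum(v for k, v in joint.items() if pred(*k))

prior = pr(lambda d1, d2, h: d2 == 1)                          # p
in_hospital = (pr(lambda d1, d2, h: d2 == 1 and h == 1)
               / pr(lambda d1, d2, h: h == 1))                 # roughly 1/2
given_d1 = (pr(lambda d1, d2, h: d2 == 1 and d1 == 1 and h == 1)
            / pr(lambda d1, d2, h: d1 == 1 and h == 1))        # back to p

print(prior, in_hospital, given_d1)
```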

In general, the causal graph can look as one of the following shapes:

If $Z$ is a fork (a common cause) then controlling for $Z$ can tease out the causal relation. If $Z$ is a mediator or a collider then controlling for $Z$ can actually make things worse.

**Backdoor paths:** If $X$ and $Y$ are two random variables, we say that there is a “backdoor path” from $X$ to $Y$ if there is a direct ancestor of $X$ that is connected to $Y$ in the undirected version of the causal graph by a path not going through $X$.

We can show the following theorem:

**Theorem:** If there is no backdoor path from $X$ to $Y$, then

$$\Pr[Y = y \mid do(X = x)] = \Pr[Y = y \mid X = x].$$

Here is a “proof by picture”:

If there isn’t a backdoor path, we sort the graph in topological order, so that all the events that happen before $X$ are not connected to $Y$ except through $X$. So we can first generate all the variables that result in $X$. Then the probability distribution of the events between $X$ and $Y$ only depends on the value of $X$, and so similarly $Y$ is generated from some probability distribution that only depends on $X$.

When we design experiments, we often want to estimate *causal effects*, and to do so we try to make sure we eliminate backdoor paths.

Consider the example of a COVID vaccine trial.

We let $V$ be the event that a trial participant obtained a vaccine, and $C$ be the event that the participant was infected with COVID.

We want to figure out $\Pr[C \mid do(V = 1)]$.

However, there is a “backdoor path”.

You will not get the vaccine if you don’t participate in the trial (which we denote by $P$), but participating in the trial could change your behavior and hence have a causal effect on $C$.

To fix this we can cut the backdoor path using a placebo: it removes the confounding effect of participation, since it ensures that (conditioned on participating, $P = 1$), $V$ is independent of any behavioral changes that might impact $C$.

In general, how does conditioning on some variable $Z$ affect correlations? It may introduce correlations between events that occur before $Z$, but it cuts any path that passes through $Z$.

Suppose we have some treatment variable $T$ that we don’t get to control (e.g., in a natural experiment). Let $Y$ be the outcome; we hope to estimate $\mathbb{E}[Y \mid do(T = 1)] - \mathbb{E}[Y \mid do(T = 0)]$, which is known as the **treatment effect**.

However, we worry that some underlying variable $Z$ (e.g., healthy lifestyle) can affect both $T$ and $Y$.

The *propensity score*, defined as $e(z) = \Pr[T = 1 \mid Z = z]$, allows us to calculate the treatment effect. We claim that as long as $Z$ is a valid confounder (for which the formula (★) holds),

$$\mathbb{E}[Y \mid do(T = 1)] = \mathbb{E}\left[\frac{Y \cdot T}{e(Z)}\right].$$

The proof is obtained by expanding out the claim; see below.

Intuitively, knowing the probability that different groups of people get the treatment allows us to make $T$ independent of $Z$ and calculate the treatment effect.

**Calculating treatment effect using ML.** Suppose that the treatment effect is and . Now, if we learn a model , then

Since both and are calculable, we only need to do a linear regression.

When we cannot observe the confounding variable, we can still sometimes use *instrumental variables* to estimate a causal effect.

Assume a linear model $Y = aT + bZ + \text{noise}$, where $Z$ is the stuff we don’t observe. If $W$ is some variable (an *instrument*) that satisfies $\mathrm{Cov}(W, Z) = 0$, $\mathrm{Cov}(W, \text{noise}) = 0$, and $\mathrm{Cov}(W, T) \neq 0$, then

$$a = \frac{\mathrm{Cov}(W, Y)}{\mathrm{Cov}(W, T)},$$

which is the ratio between two observable quantities.
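A small simulation illustrates the estimator; the particular coefficients and noise scales below are hypothetical choices, not from the original discussion:

```python
import random

random.seed(3)
n = 100_000
a, b = 2.0, 1.5        # true causal effect a; confounder coefficient b

# Z is the unobserved confounder; W is the instrument: it moves T but is
# independent of Z and of the outcome noise.
W = [random.gauss(0, 1) for _ in range(n)]
Z = [random.gauss(0, 1) for _ in range(n)]
T = [w + z + random.gauss(0, 0.1) for w, z in zip(W, Z)]
Y = [a * t + b * z + random.gauss(0, 0.1) for t, z in zip(T, Z)]

def cov(u, v):
    mu, mv = sum(u) / n, sum(v) / n
    return sum((x - mu) * (y - mv) for x, y in zip(u, v)) / n

iv_estimate = cov(W, Y) / cov(W, T)   # recovers a
naive = cov(T, Y) / cov(T, T)         # confounded regression: biased upward

print(iv_estimate, naive)
```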

We focus on fairness in classification problems, rather than fairness in learning generative models or representation (which also have their own issues, see in particular this paper by Bender, Gebru, McMillan-Major, and “Shmitchell”).

In the public image, AI has been perceived as very successful for some tasks, and some people might hope that it is more “objective” or “impartial” than human decisions (which are known to be fraught with bias). However, some works suggest this might not be the case:

- Usage in predicting recidivism for bail decisions. For example, in ProPublica, Angwin, Larson, Mattu, and Kirchner showed that 44.9% of African Americans who didn’t reoffend were labeled higher risk, whereas only 23.5% of white defendants who didn’t reoffend were labeled as such.
- Machine vision can sometimes work better on some segments of the population than others. For example, Buolamwini and Gebru showed that some “gender classifiers” achieve 99.7% accuracy on white men but only 65.3% accuracy (not much better than a coin toss) on Black women.
- Lum and Isaac gave an example of a “positive feedback loop” in predictive policing in Oakland, CA. While drug use is fairly uniform across the city, arrests are concentrated in particular neighborhoods (which have more minority residents). In predictive policing, more police would be sent out to the places where arrests occurred, hence only exacerbating this disparate treatment.

Drug use in Oakland:

Drug arrests in Oakland:

While algorithms can sometimes also help, the populations they help might not be distributed equally. For example, see this table from Gates, Perry, and Zorn. A more accurate underwriting model (one that can better predict the default probability) enables a lender to use a more aggressive risk cutoff and so end up lending to more people.

However, this is true within each subpopulation too, so it may be that if the model is less accurate in a certain subpopulation, then a profit-maximizing lender will unfairly offer fewer loans to this subpopulation.

In the case of employment discrimination in the U.S., we have the following components:

- A protected class: categories such as race, sex, nationality, citizenship, veteran status, etc.
- An unfairness metric, measuring either:
  - disparate treatment
  - disparate impact.

Employers are not allowed to discriminate across protected classes when hiring. The unfairness metric gives us a way to measure whether there is discrimination with respect to a protected class. In particular, disparate impact across different protected classes is often *necessary* but *not sufficient* evidence of discrimination.

To see why algorithms, which at first glance seem agnostic to group membership, may exhibit disparate treatment or impact, we consider the following Google visualization by Wattenberg, Viégas, and Hardt.

Consider a blue population and an orange population for which there is no difference in the probability of a member of either population paying back the loan, but for which our model has different accuracies: in particular, the model is more accurate on the orange population. This is described by the plot below, in which the scores correspond to the model’s prediction of the probability of paying back the loan; empty circles correspond to those who do not actually pay back the loan, whereas filled-in circles correspond to those who do.

Suppose we are in charge of making a lending decision given the model prediction.

A scenario in which we give everyone a loan would be fair, but would be bad for us: we would go bankrupt!

Profit when giving everyone a loan:

If we wanted to **maximize profit**, we would, however, give more loans to the orange population (since we’re more sure about which members of the orange population would actually pay back their loans) by setting a lower threshold (in terms of the score given by our algorithm) above which we give out loans.

This maximizes profit but is blatantly unfair. We are treating the identical blue and orange groups differently, just because our model is more accurate on one than the other, and we also have disparate impact on the two groups. A non-defaulting applicant would be 78% likely to get a loan if they are a member of the orange group, but only 60% likely if they are a member of the blue group.

This “profit maximization” is likely the end result of any sufficiently complex lending algorithm in the absence of a fairness intervention. Even if the algorithm does not explicitly rely on the group membership attribute, by simply optimizing it to maximize profit, it may well pick up on attributes that are correlated with group membership.

Suppose on the other hand that we wanted to mandate “equal treatment” in the sense of keeping the same thresholds for the blue and orange group. The result would be the following:

In this case, since the thresholds are identical, the algorithm will be **calibrated**: 79% of the decisions we make will be the correct ones, for both the blue and the orange population. So, from our point of view, the algorithm is fair and treats the blue and orange populations identically. However, from the point of view of the applicants, this is not the case. If you are a blue applicant who will pay back your loan, you have an 81% chance of getting a loan, but if you are an orange applicant you only have a 60% chance of getting it. This demonstrates that defining fairness is quite delicate. In particular, the above “color blind” algorithm is still arguably unfair.

This difference between the point of view of the lender and the lendee also arose in the recidivism case mentioned above. From the point of view of a defendant who would not recidivate, the algorithm was more likely to label them “high risk” if they were Black than if they were white. From the point of view of the decision maker, the algorithm was calibrated, and if anything, it was a bit more likely that a white defendant labeled high risk would not recidivate than a Black defendant. See the (slightly rounded and simplified) data below:

If we wanted to achieve demographic parity (both populations get the same total number of loans) or equal opportunity (the true positive rate is the same for both), then we can do so, but again using different thresholds for each group:

While the above was a hypothetical scenario, a real-life example was shown by Hardt, Price, and Srebro using credit (also known as FICO) scores, as described by the plot below:

For a single threshold, around 75% of Asian candidates will get loans, whereas only around 20% of Black candidates will. To ensure that all groups get loans at the same rate, we would need to set the thresholds differently. In order to equalize opportunity, we would again need different thresholds for each group.

We see that we have different notions of what it means to be *fair*, and each of these notions results in a different algorithm.

Berkeley graduate admissions in 1973 had the following statistics:

- 44% male applicants admitted, 35% female applicants admitted;
- However, the female acceptance rate was higher at the *department level*, for most departments.

This paradox is commonly referred to as Simpson’s Paradox.
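The reversal is easy to reproduce with made-up numbers (these are illustrative, not the actual 1973 Berkeley figures):

```python
# Hypothetical two-department illustration: women apply mostly to the more
# competitive department, so the aggregate rate reverses the per-department rates.
#                 (applicants, admitted)
data = {
    ("easy", "M"): (800, 500), ("easy", "F"): (100, 70),
    ("hard", "M"): (200, 20),  ("hard", "F"): (900, 110),
}

def rate(sex, dept=None):
    cells = [(apps, adm) for (d, s), (apps, adm) in data.items()
             if s == sex and (dept is None or d == dept)]
    return sum(adm for _, adm in cells) / sum(apps for apps, _ in cells)

# Women are admitted at a higher rate in EVERY department...
per_dept = {d: (rate("M", d), rate("F", d)) for d in ("easy", "hard")}
# ...yet at a lower rate in aggregate.
aggregate = (rate("M"), rate("F"))

print(per_dept)    # easy: (0.625, 0.7)   hard: (0.1, 0.1222...)
print(aggregate)   # (0.52, 0.18)
```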

A “fair” causal model for this scenario might be as follows:

In the above, perhaps gender has a causal impact on the choice of department to which the applicant applies. However, a fair application process would, conditional on the department, be independent of gender of the applicant.

However, not all models that follow this causal structure are necessarily *fair*. In the case Griggs v. Duke Power Co., 1971, the court ruled that decision-making under the following causal model was *unfair*:

While the model appears to be fair, since the job offer is conditionally independent of race given the diploma, the court ruled that the job did not actually require a high school diploma. Hence, using the diploma as a factor in hiring decisions was really just a proxy for race, resulting in essentially purposeful unfair discrimination based on race. This creation of proxies is referred to as redlining.

We cannot come up with universal fairness criteria. The notion of fairness itself is based on assumptions about:

- representation of data
- relationships to unmeasured inputs and outcomes
- causal relation of inputs, predictions, outcomes.

Fairness depends on what we choose to measure and observe, in both inputs and outputs, and how we choose to act upon them. In particular, we have the following causal structure, wherein measured inputs, decision-making, and measured outcomes all play a role in affecting the real world and function together in a feedback cycle:

A more comprehensive illustration is given in this paper of Friedler, Scheidegger, and Venkatasubramanian:
