Friday 27 February 2015

Lecture 19

The fictional example in Section 19.2 about the heights of UK and Netherlands basketball players showed something striking. Can you think of any real life instances of this sort of effect?

The graphs on page 7 of this report on climate change are interesting: they show how a small change in the mean can give rise to an amplified impact in the tail of a normal distribution. See Full SREX report: Managing the Risks of Extreme Events and Disasters to Advance Climate Change Adaptation, Special Report of the Intergovernmental Panel on Climate Change.

In this graph we see "hot weather" increasing by a factor of 2, but "extreme hot weather" increasing by an even greater factor of 3.

Wednesday 25 February 2015

Lecture 18

The Wikipedia article on the Bertrand paradox has more interesting information and pictures. There is also a Wolfram demonstration project called the Random chord paradox that you can play with.

Examples sheet 4, #2 is this classic question: "A stick is broken in two places, independently uniformly distributed along its length. What is the probability that the three pieces will make a triangle?" For a more complicated geometric probability problem see the blog posting on Planet Zog problems.
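If you want to check the answer empirically before solving it, here is a small Mathematica simulation (a sketch of my own, not part of the examples sheet):

(* Break a stick at two independent U[0,1] points and test the triangle inequality *)
triangleQ[{a_, b_, c_}] := a + b > c && a + c > b && b + c > a;
pieces[] := Differences[Sort[{0, RandomReal[], RandomReal[], 1}]];
N[Count[Table[pieces[], {10000}], _?triangleQ]/10000]  (* should be close to 1/4 *)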

Monday 23 February 2015

Table of distributions

Here is a table of 5 discrete and 6 continuous distributions. It is supplied to second year students studying Statistics IB, as a reminder of things they are to know from Probability IA.

For each distribution the table shows its usual notation, p.m.f. or p.d.f., mean, variance, and p.g.f. or m.g.f.

You will want to have at your fingertips most of the information in this table (or the method by which to produce it). You might find it helpful to make your own table, using this one as a guide.

Some of these 11 distributions are not explicitly mentioned in the schedules for Probability IA. Rather they appear as examples of other things which are mentioned in the schedules. For example, the gamma distribution arises as an example of summing i.i.d. r.vs (in this case exponential r.vs). The multivariate normal provides an example of the p.d.f. of jointly distributed continuous r.vs. The beta distribution appears as an exercise on the joint distribution of order statistics of i.i.d. uniform r.vs. We have not mentioned in a lecture the negative binomial, but this is simply the sum of i.i.d. geometric r.vs. (as done in Examples sheet 3, #5).

Lecture 17

An important special case of Theorem 17.1 is when $Y=aX$, for some constant $a>0$.
The p.d.f. of $Y$ is then $f_Y(y)=(1/a)f_X(y/a)$. (For a general nonzero constant the factor is $1/|a|$.)

There are many ways to define transitive ordering relations between random variables. Now you know about $EX\geq EY$ and $X\geq_{\text{st}} Y$. Such relationships are important in applications, such as when $X$ and $Y$ are lifetimes of components and we are interested in ordering their reliabilities. (Reliability engineering is concerned with determining the ability of a system or component to function under some stated conditions for a specified period of time.)

There are other ordering relations, such as "hazard rate ordering" and "convex ordering". These are described in the Wikipedia article on stochastic ordering. I explained in the lecture that $X\geq_{\text{hr}} Y\implies X\geq_{\text{st}} Y\implies EX\geq EY$. The first implication is from
\[
1-F(x) = \exp\left(-\int_0^x h(t)dt\right)
\]and the second from Theorem 17.4. An important order relation in statistics is "likelihood ratio ordering". If $X$ and $Y$ have p.d.f.s $f$ and $g$ such that $f(t)/g(t)$ increases with $t$ then $X\geq_{\text{lr}}Y$. It is a stronger ordering than the two we have met, since $X\geq_{\text{lr}}Y\implies X\geq_{\text{st}}Y$. For example, if $X$ and $Y$ are exponential random variables with parameters $1/2$ and $1$, then $X\geq_{\text{lr}}Y$, since $f(t)/g(t)=\tfrac{1}{2}e^{-t/2}/e^{-t}=\tfrac{1}{2}e^{t/2}$ is increasing in $t$.

The treatment of the inspection paradox in this lecture derives from a paper by Sheldon Ross, The inspection paradox, Probability in the Engineering and Informational Sciences, 17, 2003, 47–51. Of course the use of the word "paradox" is somewhat inappropriate, since it is not hard to understand why $X_J\geq_{\text{st}}X_1$.

There are other such inspection paradoxes. Suppose $X_1,X_2,\dotsc$ are i.i.d. $U[0,1]$ and $N$ is the least integer such that $X_1+\cdots+X_N\geq 1$. Then conditional on this,

$X_N\geq_{\text{st}}X_{N-1}\geq_{\text{st}}\cdots \geq_{\text{st}}X_1$.
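Here is a quick Mathematica experiment (my own sketch, not from the lecture) comparing the first and last summands; the last one should be stochastically larger, and in particular have the larger mean.

(* Sample X_1, ..., X_N with N the least n such that X_1 + ... + X_n >= 1 *)
sample[] := Module[{xs = {}, s = 0},
  While[s < 1, AppendTo[xs, RandomReal[]]; s += Last[xs]];
  {First[xs], Last[xs]}];
Mean /@ Transpose[Table[sample[], {10000}]]  (* mean of X_1, then mean of X_N *)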

I did not discuss in the lecture the coda at the end of page 69. I leave you with this puzzle, "Do girls have more brothers than boys do?" The "proof" that the answer is "yes" is false, though initially it looks almost convincing. What is wrong?

What do you think is the truth of the matter? Let $G$ and $B$ be the mean number of brothers possessed by a typical girl and typical boy, respectively.

If every family were to consist of some number of boys and just one girl then $G > B$.

If every family were to consist of only boys or only girls then $G < B$.

What are the minimal conditions under which $G=B$? Must you assume that each child is independently a boy or girl with probability $1/2$?

Friday 20 February 2015

Probability distributions

The chart I showed you, with the distributions we are studying marked in yellow, can be downloaded here. It is from the paper Univariate Distribution Relationships, L. M. Leemis and J. T. McQueston. In that paper you can find explanations of the various relationships that are indicated by the arrows.

Here also is a simpler chart about Relationships between probability distributions. This one shows some relations that the above chart misses out, such as how the binomial can come from the Poisson as $X \mid X+Y$ (see Examples sheet 2, #5).

Wikipedia has a nice List of probability distributions.

In 16.5 of the notes I mention that if $X$ and $Y$ are i.i.d. exponential r.vs then $X/(X+Y)$ is a uniformly distributed r.v. You will prove this in Examples sheet 4, #8. A general method will be given in Lecture 20. But it can also be done this way:

$P(X/(X+Y) \leq t)=P(X(1-t)/t\leq Y)=\int_{x=0}^\infty\left(\int_{y=x(1-t)/t}^{\infty}\lambda e^{-\lambda  y}dy\right)\lambda e^{-\lambda x}dx = t$

which is the c.d.f. of $U[0,1]$.
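A quick numerical sanity check in Mathematica (just a sketch; the rate 2 is arbitrary, since the answer does not depend on $\lambda$):

(* X/(X+Y) for i.i.d. exponentials should look like U[0,1] *)
xs = RandomVariate[ExponentialDistribution[2], 10000];
ys = RandomVariate[ExponentialDistribution[2], 10000];
{Mean[xs/(xs + ys)], Variance[xs/(xs + ys)]}  (* should be near 1/2 and 1/12 *)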

Lecture 16

The human life plots of force of mortality that I showed in today's lecture can be viewed on the Understanding Uncertainty page: How long are you going to live?

David Spiegelhalter (Winton Professor for the Public Understanding of Risk) makes this comment:

When 7 years old, there's only 1 in 10,000 chance of not making your 8th birthday - this is the safest that anyone has been, anywhere, ever, in the history of humanity. But the jump from 16 to 17 is the biggest relative increase in risk in your life (all this, of course, assumes you're utterly average). 

But I think the most amazing thing is the old Gompertz observation of log-linearity - the log-hazard is quite extraordinarily straight over such a long range (it would be 7 to 90 if it were not for idiotic youth). So, on average, every year older means an extra 9% chance of dying before your next birthday - so your force of mortality doubles every 8 years, like a nasty compound interest. And it will get you in the end.
He makes reference to the Gompertz distribution, which is one whose hazard rate increases exponentially (so the log-hazard is linear).

Notice that the hazard rate curve (force of mortality) has a 'bath-tub' shape: there is an early high-risk period immediately following birth, followed by a long low period of annual risk, until accidents start kicking in for 17-year-old boys, after which it steadily increases. At the above link there are some tools for interactively exploring how your life expectancy changes as a function of your present age, gender and lifestyle (non-smoker, moderate alcohol, 5 a day, exercise).

Most things (such as car engines, tyres, light bulbs) have an increasing hazard rate. But some things have a decreasing hazard rate. A business that has lasted two centuries is less likely to go bankrupt next year than one that only started two years ago.

Notice that if $X\sim U[0,1]$ then $h(x)=f(x)/(1-F(x))=1/(1-x)$, which tends to $\infty$ as $x\to 1$.
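If you want to compute hazard rates symbolically, here is a minimal Mathematica sketch of $h(x)=f(x)/(1-F(x))$ for the uniform and exponential distributions:

(* Hazard rate h(x) = f(x)/(1 - F(x)) *)
h[f_, F_, x_] := f[x]/(1 - F[x]);
Simplify[h[1 &, # &, x], 0 < x < 1]                       (* U[0,1]: 1/(1 - x) *)
Simplify[h[la Exp[-la #] &, 1 - Exp[-la #] &, x], x > 0]  (* exponential: constant la *)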

Some fine print. I said that a continuous random variable is one that has a continuous c.d.f., and it is this that puts the "continuous" in its name. However, the fine print (which you can find in my notes near the middle of page 63) is that we actually need the c.d.f. to be absolutely continuous, and this will ensure that a p.d.f. exists. The qualifier "absolutely" rules out weirdly pathological c.d.fs. For example, the Cantor distribution has a continuous c.d.f., but it has no p.d.f. Of course none of this fine print is examinable in Part IA. But I know that some of you find this type of thing interesting.

Wednesday 18 February 2015

Simulation of a random walk

Here is something for Lecture 15 (on random walk).

Simulating the Simple Random Walk
from the Wolfram Demonstrations Project by Heikki Ruskeepää

Click above to play with a simulation of random walk produced by tossing a fair coin. Changing the "seed" reruns the simulation with a different set of random coin tosses. The curves are confidence intervals, within which the walk will lie with high probability. Random walks are very important in physics, biology, and finance.



There are many interesting questions that one can ask about symmetric random walk, i.e. the random paths produced by tossing a fair coin (up 1 for a head, down 1 for a tail).
  • How many times on-average does a walk of length $n$ cross the $x$-axis?
  • What is the distribution of the last crossing of the $x$-axis? (recall Lecture 1!)
  • What is the distribution of the terminal value $X_n$?
  • As you try different seeds, do you think some walks look more "typical" than others?
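If you would like to experiment with these questions numerically, here is a minimal Mathematica sketch (the walk length 100 and the number of trials are arbitrary choices of mine):

(* Simulate many symmetric random walks; look at returns to 0 and the terminal value *)
n = 100; trials = 10000;
walks = Accumulate /@ RandomChoice[{-1, 1}, {trials, n}];
returns = Count[#, 0] & /@ walks;          (* visits to the x-axis *)
terminal = Last /@ walks;
N[{Mean[returns], Mean[terminal], Variance[terminal]}]
(* the terminal value has mean 0 and variance n; the number of returns grows like Sqrt[n] *)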

Lecture 15

The Wikipedia page about random walk is worth a look. Today, we focused upon the Gambler's ruin problem. However, there are many more interesting things that can be said about random walks. For example, one can consider random walks on arbitrary graphs. More about random walks will be presented to you in the course Markov Chains, Part IB.

If you wish to ensure you never go bankrupt then rather than bet £1 at each go you could bet £$fz$, where $0<f<1$. That way your fortune moves next to $(1+f)z$ or $(1-f)z$. Kelly betting is about choosing the optimal value of $f$.
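As a small aside (not something we did in the lecture): if each even-odds bet is won with probability $p>1/2$, the Kelly fraction maximizes the expected logarithm of the growth factor, and a one-line Mathematica computation recovers the well-known answer $f=2p-1$.

(* Maximise p Log[1+f] + (1-p) Log[1-f] over f *)
Solve[D[p Log[1 + f] + (1 - p) Log[1 - f], f] == 0, f]  (* gives f -> 2p - 1 *)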

The continuous version of random walk is Brownian motion, which is presented in Stochastic Financial Models, Part II. One of its beautiful properties is that it is self-similar. The graphic below shows how Brownian motion looks when we place it under the microscope and gradually increase the magnification. It is possible to say of Brownian motion in 2 dimensions that "almost surely every arc of the path contains the complete works of Shakespeare in the handwriting of Bacon." (Peter Whittle). Just as we might say in dimension 1 that at some magnifications the path below looks like a graph of FTSE all-share index for the past 5 years.

Monday 16 February 2015

Simulation of a branching process

Here is something for Lecture 14 (on branching processes).

Readme first. You can run this demo in your browser, but you will need to install the Wolfram CDF player. It may take a few moments to load, so please be patient. Also, there seems to be a problem at present with the plugin running in Chrome, so you might have to use a different browser. The best thing is to download the demo from the link on the page ("Download Demonstration as CDF") and then run the demo in Mathematica on your own computer. Here is a screenshot from the demo.

This Demonstration shows up to six simulated generations of a branching process. The distribution of the number of descendants of each member of the population can have either a Bernoulli, Poisson, or geometric distribution; the parameters of these distributions can be set with sliders. From the three figures, the left one shows the simulated branching process, the upper-right one shows the probability of ultimate extinction as a function of a parameter of the distribution, and the lower-right one shows the probabilities of the size of the population at the selected generation.

To run the demo click on Simulating the Branching Process
[from the Wolfram Demonstrations Project by Heikki Ruskeepää]

Lecture 14

In this lecture you have had a taste of the theory of branching processes. The calculations we did today in Theorem 14.2 are similar to the sort of thing you need to do in answering Examples Sheet 3, #10, 14, 15. Question #16 does not look like one about branching processes, but you might think about trying to map it into a version of a branching process so that you can apply theorems from this lecture.

If you would like to read something more about branching processes, I recommend A Short Introduction to Branching Processes by Gesine Reinert (a professor of statistics at Oxford).

Branching processes have interesting applications beyond the obvious ones of population biology, e.g. in rumours and advertising. I asked Gesine if she might recommend something beyond the basics. She has written to me, "One application of branching processes is to assess the shortest path length between two randomly chosen vertices in sparse networks; but that may be a bit much for first-year students."

In talking about branching processes with more than one type of individual I was drawing on Gesine's slides 26-28 and 30-32. In these you can see the idea of a multi-type branching process. We see that it is possible to prove a theorem similar to our Theorem 14.3, but now the probability of extinction depends on the maximum eigenvalue of the matrix $M=(m_{ij})$, where $m_{ij}$ is the mean number of offspring of type $j$ produced by a parent of type $i$. The branching process becomes extinct with probability 1 or with probability less than 1 according as this eigenvalue is $\leq1$ or $>1$.

Suppose there are two types. The key quantities are the two multi-dimensional p.g.f.s

$F_i(z_1,z_2)=E_i\Bigl[z_1^{X_{i1}}z_2^{X_{i2}}\Bigr]$, $i=1,2$,

where $X_{ij}$ is a r.v. having the distribution of the number of offspring of type $j$ produced by a parent of type $i$.

Let $q_i$ be the probability of extinction when we start with one individual of type $i$. Then, analogous to Theorem 14.3, $(q_1,q_2)$ is the smallest solution in $[0,1]^2$ to

$q_1 = F_1(q_1,q_2)$
$q_2 = F_2(q_1,q_2)$.
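Here is a small Mathematica illustration with an invented example (mine, not Gesine's): a type-$i$ parent has a Poisson number of type-$j$ offspring with mean $m_{ij}$, so $F_i(z_1,z_2)=\exp\bigl(m_{i1}(z_1-1)+m_{i2}(z_2-1)\bigr)$. Iterating the p.g.f.s from $(0,0)$ converges to the smallest solution, just as in the proof of Theorem 14.3.

(* Extinction probabilities for a two-type branching process with Poisson offspring *)
m = {{1.2, 0.3}, {0.2, 0.9}};
F[i_, z1_, z2_] := Exp[m[[i, 1]] (z1 - 1) + m[[i, 2]] (z2 - 1)];
Max[Eigenvalues[m]]                             (* > 1, so extinction is not certain *)
iterate[{q1_, q2_}] := {F[1, q1, q2], F[2, q1, q2]};
Nest[iterate, {0., 0.}, 200]                    (* converges to (q1, q2) *)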

Saturday 14 February 2015

Examples Sheet 2

Most of you will now be doing Examples Sheet 2. Let me make a few comments about some things I hope you learn from that sheet. Some of what I say below is "in brief", and you may need to fill in details or ask your supervisor.

 #3,4. There is often more than one nice way to solve a problem. I think that both of these problems have at least two nice methods of solution. Have you been able to think of more than one way to solve them? For example, #4 can be solved by summing a geometric series. It can also be solved via a simple order 0 recurrence relation.

 #6 (ii). You should discover that $Y$ and $X$ are independent. So $X=10$ provides no information about $Y$. What is the intuition for this? Well, we might imagine that $X$ and $Y$ are initially numbers of "detectable" and "non-detectable" misprints, created as independent samples from the Poisson distribution with parameter $\lambda/2$. The page has $X+Y$ misprints (which is distributed Poisson, parameter $\lambda$). The proofreader finds the detectable ones. In doing this she discovers nothing about $Y$. This model gives the same distribution of $X$ and $Y$ as one in which there are $X+Y$ misprints and each is detected with probability $1/2$.

 #7,8. The results in these questions are often proved by students in something like 2 pages of working. But they can also be set out in perhaps only 2-4 lines of working. Make sure you learn (from your supervisor) the fast way to derive these results.

 #10. There are several ways to show that $Y_1,\dotsc,Y_n$ are independent. One way is via the fact that $P(Y_1=1,Y_2=1,\dotsc,Y_n=1)=1/n!=P(Y_1=1)\cdots P(Y_n=1)$. Then apply the result of question #2.

 #12. The coupon collector's problem. This question is one that should lodge permanently in your brain. There are many problems in probability, design of computer science algorithms, and elsewhere, which at some point involve this problem. I think it is interesting that the expected time to collect the $n$ figures is about $n\log n$. However, the expected time needed to collect $99\%$ of the figures is only about $n\log 100$. So it is the last fraction of figures that are costly to collect.
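You can check the two approximations against exact expectations (a quick sketch; $n=1000$ is an arbitrary choice): the expected time to collect all $n$ figures is $nH_n$, and the expected time to collect the first $99\%$ of them is $n(H_n-H_{n/100})$, where $H_k$ is the $k$th harmonic number.

(* Coupon collector: exact expectations versus the approximations in the text *)
n = 1000;
N[{n HarmonicNumber[n], n Log[n]}]                              (* all n figures *)
N[{n (HarmonicNumber[n] - HarmonicNumber[n/100]), n Log[100]}]  (* first 99% of them *)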

 #14. Did you notice that this method would work for simulating any probability $\alpha\in (0,1)$? For example, if we want $P(A)=1/5$, we write this in binary as $\alpha=0.0011001100...$ and then toss a coin. If we see TTHHH we would say $A$ has not occurred, because translating this to $.00111$, it differs from the binary form of $\alpha$ by representing a number $>\alpha$.

 #16. This question has been set to IA students for many years. In my experience few students are able to solve it. But the solution is so beautiful that it would be a pity if your supervisor were not given the opportunity to explain it to you.

 #17,18. When I say, "This section is for enthusiasts", I am hoping that most students will think "This means me". After all, if you are reading Mathematics at Cambridge you are by definition a mathematics enthusiast. Even if you cannot solve these questions fully you should at least make some attempt. I would expect almost all students to be able to do #17 for the case $k=2$, $n=3$. It is a good general principle to tackle a difficult problem by first trying to solve a simple special case.

Discrete Fourier transforms and p.g.f.s

In the non-examinable Sections 13.4 and 13.5 I showed you some moderately sophisticated applications of the simple ideas we are learning about p.g.f.s and conditional expectation. I have added an Appendix B to the notes to say a bit more about the fast Fourier transform.

The brief story is this. To find the distribution of $Y$, the sum of two i.i.d. r.vs $X_1$ and $X_2$ taking values in $\{0,1,\dotsc,N-1\}$, we might find its p.g.f. $$p_Y(z)=p(z)^2=(p_0+p_1z+\cdots+p_{N-1}z^{N-1})^2.$$ This would take $O(N^2)$ multiplications of the form $p_ip_j$.

However, it is quicker to evaluate $p(z)$ at $2N$ complex values $z=\omega^0,\dotsc,\omega^{2N-1}$, square those values, and then recover the real coefficients of the polynomial $p(z)^2$. The calculation simplifies because by taking $\omega$ as a $2N$th root of unity, $\omega=\exp(-i\pi/N)$, the calculation of the $p(\omega^k)$, $k=0,\dotsc,2N-1$, can be done by a clever method, the fast Fourier transform, which needs only $O(N\log N)$ multiplications.

Further details are in Appendix B of the notes, and the Part II course, Numerical Analysis.
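As a small illustration of the idea (a sketch using Mathematica's built-in discrete Fourier transform, whose default normalisation introduces the factor of $\sqrt{n}$ below), here is the distribution of the total of two fair dice computed by transforming, squaring, and transforming back:

(* Distribution of the sum of two dice via the discrete Fourier transform *)
n = 12;
p = PadRight[ConstantArray[1/6., 6], n];      (* p.m.f. of one die, zero-padded *)
q = Chop[Sqrt[n] Re[InverseFourier[Fourier[p]^2]]];
q[[1 ;; 11]]                                  (* q[[r]] = P(total = r + 1) *)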

Friday 13 February 2015

Lecture 13

The tower property of conditional expectation (also  called the law of iterated expectations) is akin to the law of total probability that we met in Section 6.3. Another name is law of total expectation. A special case is

$E[X]=\sum_{i:P(A_i)>0} E[X \mid A_i]P(A_i)$,

where $A_1,\dotsc,A_n$ are events giving a partition of the sample space $\Omega$. Compare the above to

$E[X]=E\bigl[E[X \mid Y]\bigr]=\sum_y E[X \mid Y=y]P(Y=y)$.

Sometimes we might write $E\Bigl[E[X \mid Y]\Bigr]$ as $E_Y\Bigl[E[X \mid Y]\Bigr]$, just to remind ourselves that the outer expectation is being applied to the variable $Y$. Of course it is implicit in all this that $E[X]$ exists.
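As a toy numerical illustration (my own example): take $Y\sim\text{exponential}(1)$ and, given $Y$, let $X\sim\text{Poisson}(Y)$; then $E[X]=E\bigl[E[X\mid Y]\bigr]=E[Y]=1$.

(* Monte Carlo check of E[X] = E[E[X | Y]] for X | Y ~ Poisson(Y), Y ~ exponential(1) *)
ys = RandomVariate[ExponentialDistribution[1], 10000];
xs = RandomVariate[PoissonDistribution[#]] & /@ ys;
N[Mean[xs]]  (* should be close to 1 *)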

Remember that $P(A | B)$ is only defined for $P(B)>0$.  If you want to see something that illustrates this in a cute way, then read this nice description of Borel's paradox. The paradox occurs because of a mistaken step in conditioning on an event of probability $0$, which can be reached as a limit of a sequence of events in different ways.

Wednesday 11 February 2015

Lecture 12

Let me remind you that you can install copies of Mathematica and Matlab software. It can be fun to use these to help with an appreciation of things we are doing. Today I used this Mathematica program to calculate some probability generating functions.

(* Roll a fair die 10 times *) 
(* What is the probability the sum is 50? *)

p[z_] = (1/6) (z + z^2 + z^3 + z^4 + z^5 + z^6)
SeriesCoefficient[p[z]^10, {z, 0, 50}]
% // N

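(* The coefficients of this power series are the Fibonacci numbers *)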
Series[1/(1 - x - x^2), {x, 0, 10}]

Table[Fibonacci[n], {n, 0, 10}]

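(* The coefficients of this power series are the Catalan numbers *)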
Series[(1 - Sqrt[1 - 4 x])/(2 x), {x, 0, 10}]

Table[CatalanNumber[n], {n, 10}]

Mathematica is available here http://www.maths.cam.ac.uk/computing/software/mat... 
Matlab is here http://www.maths.cam.ac.uk/undergrad/catam/softwa...


A very nice tripos question on probability generating functions is the following, 2004/2/II/9F. You should now be able to do this (and could discuss it in your next supervision).
    A non-standard pair of dice is a pair of six-sided unbiased dice whose faces are numbered with strictly positive integers in a non-standard way (for example, (2,2,2,3,5,7) and (1,1,5,6,7,8)). Show that there exists a non-standard pair of dice A and B such that when thrown

    P(total shown by A and B is $n$) = P(total shown by pair of ordinary dice is $n$) for all $2\leq n\leq 12$.

    [Hint: $(x+x^2+x^3+x^4+x^5+x^6)=x(1+x)(1+x^2+x^4)=x(1+x+x^2)(1+x^3)$.]
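If you want to experiment, here is a small Mathematica helper (a sketch of mine, which does not give the answer away) for testing whether a candidate pair of dice has the required property, by comparing p.g.f.s:

(* Compare the p.g.f. of the total for a candidate pair against two ordinary dice *)
gf[faces_] := Total[x^faces]/6;
sameTotals[a_, b_] := Expand[gf[a] gf[b] - gf[Range[6]]^2] === 0;
sameTotals[{3, 3, 3, 3, 3, 6}, {2, 2, 2, 5, 5, 5}]  (* False: not a solution *)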
We saw today that generating functions are useful in realms beyond probability, an example being that

"the $n$th Fibonacci number, $F_n$, is the coefficient of $x^n$ in the expansion of the
function $1/(1−x−x^2)$ as a power series about the origin."

That is a quote from chapter 1 of Herbert Wilf's fun book called generatingfunctionology, whose second edition is free to download. In his book you can read about many weird and wonderful things that can be done with generating functions. He starts Chapter 1 by writing

"A generating function is a clothesline on which we hang up a sequence of numbers for display."

Generating functions can encapsulate complicated things in a lovely way.  For example, the coefficient of $x^n/n!$ in the power series expansion of $e^{e^x-1}$ is the Bell number $B(n)$, i.e. the number of different partitions that can be made from a set of $n$ elements. In Chapter 1 of Wilf's book you can see a derivation of this fact.
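You can verify the first few cases in Mathematica with a one-line check:

(* Coefficients of x^n/n! in Exp[Exp[x] - 1] are the Bell numbers *)
Table[n! SeriesCoefficient[Exp[Exp[x] - 1], {x, 0, n}], {n, 0, 6}]
Table[BellB[n], {n, 0, 6}]  (* {1, 1, 2, 5, 15, 52, 203} *)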

Examples Sheet 1

Most of you will have now done Examples Sheet 1. Let me make a few comments about some things I hope you learn from that sheet.

#7. (about infinite sequence of coin tosses and limsup of events) This was done in Section 4.4 of the notes. An even quicker solution can be done using the fact that $P$ is a continuous function (proved in Section 7.1 of notes).

$P(C)=P(\bigcap_{n=1}^\infty\bigcup_{k=n}^\infty A_k)=\lim_{n\to\infty}P(\bigcup_{k=n}^\infty A_k)\leq\lim_{n\to\infty}(p_n+p_{n+1}+\cdots)=0$.

This was part of the tripos question 2014, Paper 2, II.9.

#8. (sets none of which is a subset of the other) Think how this question would look if everything except the first and last sentences of its statement were removed. It would look hard! See how adding a layer of probability helps prove a theorem that seems to have nothing to do with probability. We saw this in Example 4.5 (Erdos's probabilistic method) and we will see this a couple more times in the course, next in Section 11.4 (Weierstrass approximation theorem).

#13 (Mary tosses more heads than John) and #18 (41 bags of red and blue balls). Both these questions have very beautiful solutions. Please make sure your supervisor shows you the beautiful solutions, if you have not found them yourself.

#16 This question relates to what I did in Lecture 10 about information entropy.

#17 (goat ate the cards!) I invented this question as a sort of joke - but I am pleased that it has turned out to be very thought provoking for many students. If the "penny has dropped" for you about this question then you are on your way to developing some good intuition in the field of probability.

Comments are welcome if you have any ideas for good questions on this part of the course.

Monday 9 February 2015

Lecture 11

The probabilistic proof of the Weierstrass approximation theorem was to show you how probability theory can be used to prove an important theorem in a different field. We did something similar in Example 4.5 (Erdos probabilistic method in combinatorics) and in Examples Sheet 1, #8. The Wikipedia page on Bernstein polynomials has a list of Bernstein polynomials and an animation showing how they are used to approximate a continuous function on $[0,1]$.

I cheated today by using a result from the schedules for Analysis II ("a continuous function on a closed bounded interval is uniformly continuous"). I had thought about giving an easier proof by not appealing to Analysis II, but I decided it would be nicer to show you the proper full result.

The proof used Chebyshev's inequality and illustrated a nice idea that is frequently used in probability calculations. We partitioned the sample space into $A$ and $A^c$ and reasoned differently in each. For $A$ we had $P(A)$ small and $|f(X/n)-f(x)|\leq 2$, whereas for $A^c$ we had $P(A^c)\leq 1$ and $|f(X/n)-f(x)|$ small.
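If you want to see the Bernstein approximation in action, here is a minimal sketch (the function $f(x)=|x-1/2|$ and the degree 50 are arbitrary choices of mine):

(* Bernstein polynomial B_n f(x) = Sum f(k/n) Binomial[n,k] x^k (1-x)^(n-k) *)
bernstein[f_, n_, x_] := Sum[f[k/n] Binomial[n, k] x^k (1 - x)^(n - k), {k, 0, n}];
Table[bernstein[Abs[# - 1/2] &, 50, x], {x, 0.1, 0.9, 0.2}]
Table[Abs[x - 1/2], {x, 0.1, 0.9, 0.2}]  (* the target values, for comparison *)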

Benford's law is surprising when first encountered. The fact that the leading digit should be "1" with a frequency of 0.301 (rather than $1/9$) has been used as a test to check for fraud in financial records. The idea is that if a person is making up numbers, he is unlikely to arrange that his distribution of first digits agrees with Benford's law. You can read more about that, and many other examples of Benford's law, in the Wikipedia article referenced above.

In discussing why Benford's Law applies to leading digits of Fibonacci numbers I used a result that if $\beta=\log_{10}\alpha$ is irrational, then the sequence of fractional parts $\{n\beta\}$, $n=1,2,\dotsc$, is uniformly distributed on $[0,1]$. This result is certainly very plausible, but a proper proof is beyond our scope. It involves an analysis via the discrete Fourier transform of the sequence of length $N$, showing that this approaches the Fourier transform of a sequence of numbers that are uniformly distributed on $[0,1]$.
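Here is a quick empirical check (a sketch) that the leading digits of the Fibonacci numbers follow Benford's law:

(* Leading digits of the first 1000 Fibonacci numbers versus Benford's prediction *)
lead[k_] := First[IntegerDigits[k]];
Sort[Tally[Table[lead[Fibonacci[j]], {j, 1, 1000}]]]
Table[{d, Round[1000 Log10[1 + 1/d]]}, {d, 1, 9}]  (* predicted counts *)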

Friday 6 February 2015

Lecture 10

I mentioned a 318-page book completely devoted to the Cauchy-Schwarz inequality. It is The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Its author, Michael J. Steele, is a past President of the Institute of Mathematical Statistics. One reviewer wrote "... this is one of that handful of mathematics books that you can read almost like a novel." Chapter 1 is free to download and really interesting to read. I can also recommend his highly entertaining web site, which includes material for advanced courses in probability and finance, and also Semi-Random Rants, Videos and Favorite Quotes.

There are many proofs of this inequality. Here are two in addition to the one I did in lectures.

1. Consider $E[(tX-Y)^2]=t^2E[X^2]-2t E[XY]+E[Y^2]\geq 0$, with strict inequality unless $tX=Y$. This quadratic in $t$ has at most one real root and so the discriminant must be nonpositive. This is essentially the proof that was given by Schwarz and is a textbook favorite.

2. We started the lecture with Jensen's inequality. It is natural to ask if the Cauchy-Schwarz inequality can be deduced from it. One version of the Cauchy-Schwarz inequality is

$(\sum_i a_ib_i)^2\leq \sum_ia_i^2\sum_ib_i^2$,

for all numbers $(a_1,\dotsc,a_n)$ and $(b_1,\dotsc,b_n)$. WLOG we may assume all $a_j,b_j> 0$. Consider a random variable $X$ for which $P(X=a_j/b_j)=b_j^2/\sum_ib_i^2$. Apply Jensen's inequality with the choice of convex function $f(x)=x^2$. Then $Ef(X)\geq f(EX)$ reads $\sum_j a_j^2/\sum_ib_i^2\geq\bigl(\sum_j a_jb_j/\sum_ib_i^2\bigr)^2$, which rearranges to give the result.

This paper presents 12 different proofs of the Cauchy-Schwarz inequality, and this web page has 10.

Wednesday 4 February 2015

Non-transitive dice

A set of 3 non-transitive dice can be bought from Maths Gear. They have sides like this:

Die A: 333336
Die B: 222555
Die C: 144444   $P(A>B)=P(B>C)=21/36$ and $P(C>A)=25/36$.

These are optimal, in that they maximize the smallest winning probability. However, there are other dice for which the sum of the 3 winning probabilities is greater.

Die A: 114444
Die B: 333333
Die C: 222255   $P(A>B)=P(B>C)=24/36$ and $P(C>A)=20/36$. 
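These probabilities are easy to check by brute force in Mathematica (a quick sketch, shown for the first set of dice):

(* P(first die beats second), enumerating all 36 equally likely outcomes *)
win[a_, b_] := Count[Tuples[{a, b}], {x_, y_} /; x > y]/36;
{win[{3, 3, 3, 3, 3, 6}, {2, 2, 2, 5, 5, 5}],
 win[{2, 2, 2, 5, 5, 5}, {1, 4, 4, 4, 4, 4}],
 win[{1, 4, 4, 4, 4, 4}, {3, 3, 3, 3, 3, 6}]}   (* 7/12 = 21/36, 7/12 = 21/36, 25/36 *)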

There is an interesting article by James Grime (of the Millennium Mathematics Project) in which he says lots more about non-transitive dice. He has devised a set of 5 non-transitive dice. In another interesting article there is a video in which James discusses non-transitive dice with David Spiegelhalter.

James made a contribution to this blog last year, and you can read his comment at the end of the post here: http://weberprobability.blogspot.co.uk/2014/02/nontransitive-dice.html

Lecture 9

The account of Warren Buffett, Bill Gates and Efron's dice that I gave in lectures can be found here.

The "Names in Boxes Problem", which I talked about today, is a puzzle which I first heard described around a lunch table at AT&T Laboratories, New Jersey, around 2003. The problem was posed in a paper, with a less catchy title:

Anna Gal and Peter Bro Miltersen. The Cell Probe Complexity of Succinct Data Structures.
Proceedings of the 30th International Colloquium on Automata, Languages and Programming (ICALP) 2003, 332-344.

It won a best paper award at this ICALP conference. The solution was found by Sven Skyum, a colleague of Miltersen's at the University of Aarhus. If you would like to read more about the history of the problem, I recommend this paper: The Locker Problem (by Max Warshauer, Eugene Curtin, Michael Kleber, Ravi Vakil, The Mathematical Intelligencer, 2006, Volume 28, Issue 1, pp 28-31).

The puzzle is also described in a collection compiled by Peter Winkler called "Seven Puzzles You Think You Must Not Have Heard Correctly". I think it is remarkable that we can work out all the needed facts about cycle lengths in a random permutation by using only the simple ideas from probability that we have encountered in our course thus far.
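If you want to see the cycle-length fact numerically, here is a small simulation sketch (my own): the cycle-following strategy succeeds precisely when a uniformly random permutation of 100 has no cycle longer than 50, which happens with probability about $1-\ln 2\approx 0.31$.

(* Estimate P(longest cycle of a random permutation of 100 is at most 50) *)
longestCycle[perm_] := Max[1, Length /@ PermutationCycles[perm][[1]]];
N[Count[Table[longestCycle[RandomSample[Range[100]]], {5000}], _?(# <= 50 &)]/5000]
(* compare with 1 - Log[2], about 0.307 *)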

Peter is a great collector of good puzzles. He has some nice ones that are geographic. How about these (answer without looking at a map):
  1. If you travel due south from Miami, Florida, what part of the South American continent do you first hit?
  2. What two cities, one in North America and one in Africa, are closest?
I have been told that Question 2 in "Seven Puzzles You Think You Must Not Have Heard Correctly" has been used by some colleagues as an entrance interview question. But I think that is rather tough!

Monday 2 February 2015

Lecture 8

In Example 8.3 we came to the conclusion that the expected number of couples sitting together is $EN=2$. However, as someone pointed out to me afterwards, this is wrong for $n=1$, since we would be double-counting when at a table of two places Mr Smith has Mrs Smith on both his left and right. I have changed the notes so that this example specifies $n\geq 2$.

Cycling home I was thinking how to prove that as $n\to\infty$ the distribution of $N$ tends to the Poisson distribution with mean 2. One way to do this would be to calculate $EN^r$ for $r=1,2,\dotsc\,$, just as we have already done for $r=1,2$. That will be difficult, but not impossible. Then we show that as $n\to\infty$ these moments tend to the moments of the Poisson random variable. As we shall see in Lecture 21, there is a 1-1 relationship between distributions and their moments.

Zipf's law.
Simply as a fun fact, I told you today about Zipf's law. This is a probability mass function such that $P(X=i)\propto 1/i^s$, $i=1,2,\dotsc,m$, for some $s > 0$, typically $s=1$. It is very mysterious why this distribution should be a good fit to frequencies with which words are used in a language.

According to the Wikipedia article linked above, "Zipf himself proposed that neither speakers nor hearers using a given language want to work any harder than necessary to reach understanding, and the process that results in approximately equal distribution of effort leads to the observed Zipf distribution". My guess is that he is thinking that to retrieve the $i$th most popular word from your memory has a cost proportional to $i$ (as if you were digging down into a list), and that the average workload that each word puts on the brain should be constant, so $ip(i)$ should be constant. But this is clearly hand-waving. Maybe you can think of a better explanation.

In queueing theory and other theories of congestion, delays and long waiting times can be caused by variability in service times. Suppose customers enter a queue at a constant rate of 1 every 5 minutes. They are served one at a time; service times are independent, taking $1,2,\dotsc$ minutes with probabilities $p_1,p_2,\dotsc$, respectively.

An interesting fact (researched extensively during the past 30 years) is that there are large qualitative differences between how such queues behave when $p_i$ decays geometrically or exponentially (as it does for the Poisson and geometric distributions; the geometric has $p_i= q^ip$), and when it decays according to a power law, like $p_i\propto 1/i$ or $\propto 1/i^{0.9}$.