Project Page

This semester's project is on modeling movie gross income: is it exponential decay? Details to come. The data comes from the Internet Movie Database.
Here is the data on the top 10 movies of 1999 and 2003. Besides the raw data, each plot shows a green line, a non-linear fit to exp(a*x+b) constructed with the publicly available program gnuplot. You can see that in many (most?) cases the decaying exponential is not a bad fit to the data. Some movies open only in NY and LA, and a few others build an audience by word of mouth. By and large, blockbusters have a huge advertising budget to make the opening week the biggest grossing week, and the following weeks fall off at a more-or-less constant (percentage) rate.

Why would exponential decay be a good model of box office income?

The best-known example of exponential decay is radioactivity. Suppose there is a large number of atoms. Over a short time interval each atom has a certain probability of decaying. Each atom is an independent agent and decides whether to decay by flipping some nano-coin. Thus the rate of decrease of y, the number of atoms, is proportional to the number of atoms: y' = -ky. This ODE can be solved, and all solutions have the form y = A*exp(-k*t).
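The coin-flipping picture can be checked numerically. The sketch below is only an illustration: the atom count, the per-step decay probability p, and the function name simulate_decay are made up for the example. It flips a coin for each surviving atom at every time step and compares the survivors with A*exp(-k*t), using k = -ln(1 - p) to match a discrete per-step probability.

```python
import math
import random

def simulate_decay(n_atoms, p, steps, seed=0):
    """Flip a 'nano-coin' for every surviving atom at each time step.

    n_atoms, p, steps and seed are illustrative choices, not values
    from the text. Returns the surviving counts at t = 0, 1, ..., steps.
    """
    rng = random.Random(seed)
    counts = [n_atoms]
    alive = n_atoms
    for _ in range(steps):
        # Each atom decays independently with probability p.
        alive = sum(1 for _ in range(alive) if rng.random() >= p)
        counts.append(alive)
    return counts

if __name__ == "__main__":
    n_atoms, p, steps = 100_000, 0.1, 10
    k = -math.log(1.0 - p)  # continuous rate matching a per-step probability p
    counts = simulate_decay(n_atoms, p, steps)
    for t, observed in enumerate(counts):
        predicted = n_atoms * math.exp(-k * t)
        print(f"t={t:2d}  simulated={observed:6d}  A*exp(-k*t)={predicted:10.1f}")
```

With a large number of atoms the simulated counts should stay close to the exponential curve; with only a handful of atoms the random scatter becomes visible.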

By analogy, we can think of the collection of people who will see the movie as independent agents. Over short time periods, each person independently decides whether to see the movie by flipping some coin. The people who flip the coin are exactly the population that will eventually see the film, not all people. The size of this population is what determines the eventual gross. The coin-flipping probability just determines the `slope' of the curve, which gives k, and then A can be computed from k and the total gross.

How to see that the data is exponential

Plot the ln (or log_10) of the gross versus the time. Exponential decay shows up if the data mostly falls on a straight line. You can buy semi-log graph paper, which automatically does the log function for you. Since ln y = ln(A*exp(-k*t)) = -k*t + ln A, the slope of this line is -k and its intercept is ln A. This link gives directions for finding these constants using a TI-89.
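The semi-log recipe amounts to an ordinary least-squares line through the points (t, ln y). A minimal sketch follows. The function name fit_exponential and the weekly figures are hypothetical, chosen so that the gross falls by 20% each week, and the code assumes every gross is positive.

```python
import math

def fit_exponential(times, grosses):
    """Least-squares line through (t, ln y); returns (A, k) for y = A*exp(-k*t).

    Assumes all grosses are positive so the logarithm is defined.
    """
    logs = [math.log(y) for y in grosses]
    n = len(times)
    t_mean = sum(times) / n
    l_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times)
    sxy = sum((t - t_mean) * (l - l_mean) for t, l in zip(times, logs))
    slope = sxy / sxx                    # slope of the line is -k
    intercept = l_mean - slope * t_mean  # intercept is ln A
    return math.exp(intercept), -slope

if __name__ == "__main__":
    weeks = [0, 1, 2, 3, 4, 5]
    grosses = [50e6 * 0.8 ** w for w in weeks]  # made-up data
    A, k = fit_exponential(weeks, grosses)
    print(f"A = {A:,.0f}, k = {k:.4f}")
```

A line fitted to the logarithms weights the small late-week grosses more heavily than a non-linear fit to the raw grosses does, so on real data the two approaches can give somewhat different constants.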

Sometimes the double population model is a better fit

Going back to the radioactive example, suppose there are two radioactive isotopes with different half-lives. On a semi-log plot, the observed decay curve will show two roughly straight line segments: an initial one dominated by the element with the shorter half-life and, once essentially only the element with the longer half-life remains, a second segment with a less steep slope.

One can think of the movie-going public as divided into two groups, one of which is much more likely to see the movie sooner. Again we are looking only at the people who will purchase tickets, but we have subdivided this population.

The curve would be of the form A_0*exp(-k_0*t) + A_1*exp(-k_1*t); for example, here is one attempt. (Perhaps a 3-audience model would be better for this example?)
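To see the two line segments numerically, one can tabulate the local slope of ln y for a sum of two exponentials. All constants below are invented for illustration, as are the function names; nothing here is fitted to real box office data.

```python
import math

def two_population(t, A0, k0, A1, k1):
    """Curve from two audiences: A0*exp(-k0*t) + A1*exp(-k1*t)."""
    return A0 * math.exp(-k0 * t) + A1 * math.exp(-k1 * t)

def log_slope(t, A0, k0, A1, k1, h=1e-4):
    """Central-difference slope of ln y at time t."""
    hi = math.log(two_population(t + h, A0, k0, A1, k1))
    lo = math.log(two_population(t - h, A0, k0, A1, k1))
    return (hi - lo) / (2 * h)

if __name__ == "__main__":
    # Hypothetical constants: a large eager audience and a small patient one.
    A0, k0 = 40e6, 0.9
    A1, k1 = 5e6, 0.1
    for t in range(0, 21, 4):
        y = two_population(t, A0, k0, A1, k1)
        s = log_slope(t, A0, k0, A1, k1)
        print(f"t={t:2d}  ln y={math.log(y):7.3f}  slope={s:7.3f}")
```

Early on the slope sits near -k_0 (pulled somewhat toward -k_1 by the second group), and at late times it settles at -k_1, which is the bend between the two segments on semi-log paper.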

Word of mouth and increasing grosses

Some movies grow an audience. First assume the best k > 0 has been found using the data after things have settled into the usual exponential decay. We can estimate the size a_n of the remaining audience at the beginning of week n by setting g_n = a_n * (integral of k*exp(-k*t) dt from 0 to 1) = a_n*(1 - exp(-k)), where g_n is the gross in week n. Usually a_1 = A, a_2 = A - g_1, a_3 = A - g_1 - g_2, and so on. But if the audience is growing, a_2 could be much bigger than a_1. The estimate of the total audience as of week n is then g_1 + ... + g_(n-1) + a_n, and we would see some growth in these numbers.
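The bookkeeping can be sketched in a few lines. This reads g_n as the gross in week n and a_n as the audience still remaining at the start of week n, which is one interpretation of the paragraph above. The function name and the weekly grosses are hypothetical.

```python
import math

def audience_estimates(grosses, k):
    """For each week n, estimate the remaining audience a_n and the total audience.

    From g_n = a_n * (1 - exp(-k)) we get a_n = g_n / (1 - exp(-k)).
    The total audience implied at week n is everything already collected
    (g_1 + ... + g_{n-1}) plus what is still expected (a_n).
    """
    weekly_fraction = 1.0 - math.exp(-k)  # integral of k*exp(-k*t) from 0 to 1
    collected = 0.0
    rows = []
    for g in grosses:
        a_n = g / weekly_fraction
        rows.append((a_n, collected + a_n))
        collected += g
    return rows

if __name__ == "__main__":
    k = 0.2
    # Made-up grosses: week 2 is larger than week 1, as with word of mouth.
    grosses = [10e6, 14e6, 11.5e6, 9.4e6]
    for n, (a_n, total) in enumerate(audience_estimates(grosses, k), start=1):
        print(f"week {n}: remaining a_n = {a_n:14,.0f}   total = {total:14,.0f}")
```

For pure exponential decay the total column stays constant from week to week; a rising total is the signal of a growing audience.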

For example, the following data truncates the first week and some trailing weeks in order to get a good fit. The fit suggests that k = 0.19768 and the audience is exp(17.656)/0.19768 = 46,548,252/0.19768 = 235,472,742. But the first week's take was only 26,681,262, which would imply A(1-exp(-0.19768)) = 26,681,262, or A of about 148,751,860. So the audience grew by about 235,472,742 - 148,751,860 = 86,720,882, or roughly 58%, in one week!


Raw Data

Warning: the data from imdb.com has shortcomings that are obvious if you look hard enough. The data below has been slightly modified and was not carefully re-checked. The raw data files are in directory/folder