osdl.net!shemminger | e7f02a5 | 2004-08-23 20:21:21 +0000 | [diff] [blame] | 1 | I. About the distribution tables |
| 2 | |
| 3 | The table used for "synthesizing" the distribution is essentially a scaled, |
| 4 | translated, inverse to the cumulative distribution function. |
| 5 | |
| 6 | Here's how to think about it: Let F() be the cumulative distribution |
| 7 | function for a probability distribution X. We'll assume we've scaled |
| 8 | things so that X has mean 0 and standard deviation 1, though that's not |
| 9 | so important here. Then: |
| 10 | |
| 11 | F(x) = P(X <= x) = \int_{-inf}^x f |
| 12 | |
| 13 | where f is the probability density function. |
| 14 | |
| 15 | F is monotonically increasing, so has an inverse function G, with range |
| 16 | 0 to 1. Here, G(t) = the x such that P(X <= x) = t. (In general, G may |
| 17 | have singularities if X has point masses, i.e., points x such that |
| 18 | P(X = x) > 0.) |
| 19 | |
| 20 | Now we create a tabular representation of G as follows: Choose some table |
| 21 | size N, and for the ith entry, put in G(i/N). Let's call this table T. |
| 22 | |
| 23 | The claim now is, I can create a (discrete) random variable Y whose |
| 24 | distribution has the same approximate "shape" as X, simply by letting |
| 25 | Y = T(U), where U is a discrete uniform random variable with range 1 to N. |
| 26 | To see this, it's enough to show that Y's cumulative distribution function, |
| 27 | (let's call it H), is a discrete approximation to F. But |
| 28 | |
| 29 | H(x) = P(Y <= x) |
| 30 | = (# of entries in T <= x) / N -- as Y chosen uniformly from T |
| 31 | = i/N, where i is the largest integer such that G(i/N) <= x |
| 32 | = i/N, where i is the largest integer such that i/N <= F(x) |
| 33 | -- since G and F are inverse functions (and F is |
| 34 | increasing) |
| 35 | = floor(N*F(x))/N |
| 36 | |
| 37 | as desired. |
| 38 | |
| 39 | II. How to create distribution tables (in theory) |
| 40 | |
| 41 | How can we create this table in practice? In some cases, F may have a |
| 42 | simple expression which allows evaluating its inverse directly. The |
| 43 | pareto distribution is one example of this. In other cases, and |
| 44 | especially for matching an experimentally observed distribution, it's |
| 45 | easiest simply to create a table for F and "invert" it. Here, we give |
| 46 | a concrete example, namely how the new "experimental" distribution was |
| 47 | created. |
| 48 | |
| 49 | 1. Collect enough data points to characterize the distribution. Here, I |
| 50 | collected 25,000 "ping" roundtrip times to a "distant" point (time.nist.gov). |
| 51 | That's far more data than is really necessary, but it was fairly painless to |
| 52 | collect it, so... |
| 53 | |
| 54 | 2. Normalize the data so that it has mean 0 and standard deviation 1. |
| 55 | |
| 56 | 3. Determine the cumulative distribution. The code I wrote creates a table |
| 57 | covering the range -10 to +10, with granularity .00005. Obviously, this |
| 58 | is absurdly over-precise, but since it's a one-time only computation, I |
| 59 | figured it hardly mattered. |
| 60 | |
| 61 | 4. Invert the table: for each table entry F(x) = y, make the y*TABLESIZE |
| 62 | (here, 4096) entry be x*TABLEFACTOR (here, 8192). This creates a table |
| 63 | for the ("normalized") inverse of size TABLESIZE, covering its domain 0 |
| 64 | to 1 with granularity 1/TABLESIZE. Note that even with the granularity |
| 65 | used in creating the table for F, it's possible not all the entries in |
| 66 | the table for G will be filled in. So, make a pass through the |
| 67 | inverse's table, filling in any missing entries by linear interpolation. |
| 68 | |
| 69 | III. How to create distribution tables (in practice) |
| 70 | |
| 71 | If you want to do all this yourself, I've provided several tools to help: |
| 72 | |
| 73 | 1. maketable does the steps 2-4 above, and then generates the appropriate |
| 74 | header file. So if you have your own time distribution, you can generate |
| 75 | the header simply by: |
| 76 | |
| 77 | maketable < time.values > header.h |
| 78 | |
| 79 | 2. As explained in the other README file, the somewhat sleazy way I have |
| 80 | of generating correlated values needs correction. You can generate your |
| 81 | own correction tables by compiling makesigtable and makemutable with |
| 82 | your header file. Check the Makefile to see how this is done. |
| 83 | |
| 84 | 3. Warning: maketable, makesigtable and especially makemutable do |
| 85 | enormous amounts of floating point arithmetic. Don't try running |
| 86 | these on an old 486. (NIST Net itself will run fine on such a |
| 87 | system, since in operation, it just needs to do a few simple integral |
| 88 | calculations. But getting there takes some work.) |
| 89 | |
| 90 | 4. The tables produced are all normalized for mean 0 and standard |
| 91 | deviation 1. How do you know what values to use for real? Here, I've |
| 92 | provided a simple "stats" utility. Give it a series of floating point |
| 93 | values, and it will return their mean (mu), standard deviation (sigma), |
| 94 | and correlation coefficient (rho). You can then plug these values |
| 95 | directly into NIST Net. |