| I. About the distribution tables |
| |
| The table used for "synthesizing" the distribution is essentially a scaled, |
| translated, inverse to the cumulative distribution function. |
| |
| Here's how to think about it: Let F() be the cumulative distribution |
| function for a probability distribution X. We'll assume we've scaled |
| things so that X has mean 0 and standard deviation 1, though that's not |
| so important here. Then: |
| |
| F(x) = P(X <= x) = \int_{-inf}^x f |
| |
| where f is the probability density function. |
| |
| F is monotonically increasing, so has an inverse function G, with range |
| 0 to 1. Here, G(t) = the x such that P(X <= x) = t. (In general, G may |
| have singularities if X has point masses, i.e., points x such that |
| P(X = x) > 0.) |
| |
| Now we create a tabular representation of G as follows: Choose some table |
| size N, and for the ith entry, put in G(i/N). Let's call this table T. |
| |
| The claim now is, I can create a (discrete) random variable Y whose |
| distribution has the same approximate "shape" as X, simply by letting |
| Y = T(U), where U is a discrete uniform random variable with range 1 to N. |
| To see this, it's enough to show that Y's cumulative distribution function, |
| (let's call it H), is a discrete approximation to F. But |
| |
| H(x) = P(Y <= x) |
| = (# of entries in T <= x) / N -- as Y chosen uniformly from T |
| = i/N, where i is the largest integer such that G(i/N) <= x |
| = i/N, where i is the largest integer such that i/N <= F(x) |
| -- since G and F are inverse functions (and F is |
| increasing) |
| = floor(N*F(x))/N |
| |
| as desired. |
| |
| II. How to create distribution tables (in theory) |
| |
| How can we create this table in practice? In some cases, F may have a |
| simple expression which allows evaluating its inverse directly. The |
| Pareto distribution is one example of this. In other cases, and |
| especially for matching an experimentally observed distribution, it's |
| easiest simply to create a table for F and "invert" it. Here, we give |
| a concrete example, namely how the new "experimental" distribution was |
| created. |
| |
| 1. Collect enough data points to characterize the distribution. Here, I |
| collected 25,000 "ping" roundtrip times to a "distant" point (time.nist.gov). |
| That's far more data than is really necessary, but it was fairly painless to |
| collect it, so... |
| |
| 2. Normalize the data so that it has mean 0 and standard deviation 1. |
| |
| 3. Determine the cumulative distribution. The code I wrote creates a table |
| covering the range -10 to +10, with granularity .00005. Obviously, this |
| is absurdly over-precise, but since it's a one-time only computation, I |
| figured it hardly mattered. |
| |
| 4. Invert the table: for each table entry F(x) = y, make the y*TABLESIZE |
| (here, 4096) entry be x*TABLEFACTOR (here, 8192). This creates a table |
| for the ("normalized") inverse of size TABLESIZE, covering its domain 0 |
| to 1 with granularity 1/TABLESIZE. Note that even with the granularity |
| used in creating the table for F, it's possible not all the entries in |
| the table for G will be filled in. So, make a pass through the |
| inverse's table, filling in any missing entries by linear interpolation. |
| |
| III. How to create distribution tables (in practice) |
| |
| If you want to do all this yourself, I've provided several tools to help: |
| |
| 1. maketable does the steps 2-4 above, and then generates the appropriate |
| header file. So if you have your own time distribution, you can generate |
| the header simply by: |
| |
| maketable < time.values > header.h |
| |
| 2. As explained in the other README file, the somewhat sleazy way I have |
| of generating correlated values needs correction. You can generate your |
| own correction tables by compiling makesigtable and makemutable with |
| your header file. Check the Makefile to see how this is done. |
| |
| 3. Warning: maketable, makesigtable and especially makemutable do |
| enormous amounts of floating point arithmetic. Don't try running |
| these on an old 486. (NIST Net itself will run fine on such a |
| system, since in operation, it just needs to do a few simple integral |
| calculations. But getting there takes some work.) |
| |
| 4. The tables produced are all normalized for mean 0 and standard |
| deviation 1. How do you know what values to use for real? Here, I've |
| provided a simple "stats" utility. Give it a series of floating point |
| values, and it will return their mean (mu), standard deviation (sigma), |
| and correlation coefficient (rho). You can then plug these values |
| directly into NIST Net. |