blob: 157500ed4b4a107c5f66327ebad7d15d997eaf48 [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
5 :synopsis: mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
22numeric (:class:`Real`-valued) data.
23
Nick Coghlan73afe2a2014-02-08 19:58:04 +100024.. note::
25
26 Unless explicitly noted otherwise, these functions support :class:`int`,
27 :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
28 Behaviour with other types (whether in the numeric tower or not) is
29 currently unsupported. Mixed types are also undefined and
30 implementation-dependent. If your input data consists of mixed types,
31 you may be able to use :func:`map` to ensure a consistent result, e.g.
32 ``map(float, input_data)``.
33
Larry Hastingsf5e987b2013-10-19 11:50:09 -070034Averages and measures of central location
35-----------------------------------------
36
37These functions calculate an average or typical value from a population
38or sample.
39
40======================= =============================================
41:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080042:func:`fmean` Fast, floating point arithmetic mean.
Steven D'Aprano22873182016-08-24 02:34:25 +100043:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070044:func:`median` Median (middle value) of data.
45:func:`median_low` Low median of data.
46:func:`median_high` High median of data.
47:func:`median_grouped` Median, or 50th percentile, of grouped data.
48:func:`mode` Mode (most common value) of discrete data.
49======================= =============================================
50
Georg Brandleb2aeec2013-10-21 08:57:26 +020051Measures of spread
52------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070053
Georg Brandleb2aeec2013-10-21 08:57:26 +020054These functions calculate a measure of how much the population or sample
55tends to deviate from the typical or average values.
56
57======================= =============================================
58:func:`pstdev` Population standard deviation of data.
59:func:`pvariance` Population variance of data.
60:func:`stdev` Sample standard deviation of data.
61:func:`variance` Sample variance of data.
62======================= =============================================
63
64
65Function details
66----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070067
Georg Brandle051b552013-11-04 07:30:50 +010068Note: The functions do not require the data given to them to be sorted.
69However, for reading convenience, most of the examples show sorted sequences.
70
Larry Hastingsf5e987b2013-10-19 11:50:09 -070071.. function:: mean(data)
72
Raymond Hettinger6da90782016-11-21 16:31:02 -080073 Return the sample arithmetic mean of *data* which can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070074
Georg Brandleb2aeec2013-10-21 08:57:26 +020075 The arithmetic mean is the sum of the data divided by the number of data
76 points. It is commonly called "the average", although it is only one of many
77 different mathematical averages. It is a measure of the central location of
78 the data.
79
80 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070081
82 Some examples of use:
83
84 .. doctest::
85
86 >>> mean([1, 2, 3, 4, 4])
87 2.8
88 >>> mean([-1.0, 2.5, 3.25, 5.75])
89 2.625
90
91 >>> from fractions import Fraction as F
92 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
93 Fraction(13, 21)
94
95 >>> from decimal import Decimal as D
96 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
97 Decimal('0.5625')
98
99 .. note::
100
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200101 The mean is strongly affected by outliers and is not a robust estimator
Georg Brandleb2aeec2013-10-21 08:57:26 +0200102 for central location: the mean is not necessarily a typical example of the
103 data points. For more robust, although less efficient, measures of
104 central location, see :func:`median` and :func:`mode`. (In this case,
105 "efficient" refers to statistical efficiency rather than computational
106 efficiency.)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700107
Georg Brandleb2aeec2013-10-21 08:57:26 +0200108 The sample mean gives an unbiased estimate of the true population mean,
109 which means that, taken on average over all the possible samples,
110 ``mean(sample)`` converges on the true mean of the entire population. If
111 *data* represents the entire population rather than a sample, then
112 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700113
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700114
Raymond Hettinger47d99872019-02-21 15:06:29 -0800115.. function:: fmean(data)
116
117 Convert *data* to floats and compute the arithmetic mean.
118
119 This runs faster than the :func:`mean` function and it always returns a
120 :class:`float`. The result is highly accurate but not as perfect as
121 :func:`mean`. If the input dataset is empty, raises a
122 :exc:`StatisticsError`.
123
124 .. doctest::
125
126 >>> fmean([3.5, 4.0, 5.25])
127 4.25
128
129 .. versionadded:: 3.8
130
131
Steven D'Aprano22873182016-08-24 02:34:25 +1000132.. function:: harmonic_mean(data)
133
134 Return the harmonic mean of *data*, a sequence or iterator of
135 real-valued numbers.
136
137 The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Warec019bd32016-08-23 13:23:31 -0500138 reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano22873182016-08-24 02:34:25 +1000139 data. For example, the harmonic mean of three values *a*, *b* and *c*
140 will be equivalent to ``3/(1/a + 1/b + 1/c)``.
141
142 The harmonic mean is a type of average, a measure of the central
143 location of the data. It is often appropriate when averaging quantities
144 which are rates or ratios, for example speeds. For example:
145
146 Suppose an investor purchases an equal value of shares in each of
147 three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
148 What is the average P/E ratio for the investor's portfolio?
149
150 .. doctest::
151
152 >>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
153 3.6
154
155 Using the arithmetic mean would give an average of about 5.167, which
156 is too high.
157
Zachary Warec019bd32016-08-23 13:23:31 -0500158 :exc:`StatisticsError` is raised if *data* is empty, or any element
Steven D'Aprano22873182016-08-24 02:34:25 +1000159 is less than zero.
160
Zachary Warec019bd32016-08-23 13:23:31 -0500161 .. versionadded:: 3.6
162
Steven D'Aprano22873182016-08-24 02:34:25 +1000163
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700164.. function:: median(data)
165
Georg Brandleb2aeec2013-10-21 08:57:26 +0200166 Return the median (middle value) of numeric data, using the common "mean of
167 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger6da90782016-11-21 16:31:02 -0800168 *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700169
Georg Brandleb2aeec2013-10-21 08:57:26 +0200170 The median is a robust measure of central location, and is less affected by
171 the presence of outliers in your data. When the number of data points is
172 odd, the middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700173
174 .. doctest::
175
176 >>> median([1, 3, 5])
177 3
178
Georg Brandleb2aeec2013-10-21 08:57:26 +0200179 When the number of data points is even, the median is interpolated by taking
180 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700181
182 .. doctest::
183
184 >>> median([1, 3, 5, 7])
185 4.0
186
Georg Brandleb2aeec2013-10-21 08:57:26 +0200187 This is suited for when your data is discrete, and you don't mind that the
188 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700189
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300190 If your data is ordinal (supports order operations) but not numeric (doesn't
191 support addition), you should use :func:`median_low` or :func:`median_high`
192 instead.
193
Berker Peksag9c1dba22014-09-28 00:00:58 +0300194 .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700195
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700196
197.. function:: median_low(data)
198
Georg Brandleb2aeec2013-10-21 08:57:26 +0200199 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger6da90782016-11-21 16:31:02 -0800200 :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700201
Georg Brandleb2aeec2013-10-21 08:57:26 +0200202 The low median is always a member of the data set. When the number of data
203 points is odd, the middle value is returned. When it is even, the smaller of
204 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700205
206 .. doctest::
207
208 >>> median_low([1, 3, 5])
209 3
210 >>> median_low([1, 3, 5, 7])
211 3
212
Georg Brandleb2aeec2013-10-21 08:57:26 +0200213 Use the low median when your data are discrete and you prefer the median to
214 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700215
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700216
217.. function:: median_high(data)
218
Georg Brandleb2aeec2013-10-21 08:57:26 +0200219 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800220 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700221
Georg Brandleb2aeec2013-10-21 08:57:26 +0200222 The high median is always a member of the data set. When the number of data
223 points is odd, the middle value is returned. When it is even, the larger of
224 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700225
226 .. doctest::
227
228 >>> median_high([1, 3, 5])
229 3
230 >>> median_high([1, 3, 5, 7])
231 5
232
Georg Brandleb2aeec2013-10-21 08:57:26 +0200233 Use the high median when your data are discrete and you prefer the median to
234 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700235
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700236
Georg Brandleb2aeec2013-10-21 08:57:26 +0200237.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700238
Georg Brandleb2aeec2013-10-21 08:57:26 +0200239 Return the median of grouped continuous data, calculated as the 50th
240 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800241 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700242
243 .. doctest::
244
245 >>> median_grouped([52, 52, 53, 54])
246 52.5
247
Georg Brandleb2aeec2013-10-21 08:57:26 +0200248 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200249 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
250 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
251 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200252 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700253
254 .. doctest::
255
256 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
257 3.7
258
Georg Brandleb2aeec2013-10-21 08:57:26 +0200259 Optional argument *interval* represents the class interval, and defaults
260 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700261
262 .. doctest::
263
264 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
265 3.25
266 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
267 3.5
268
269 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200270 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700271
272 .. impl-detail::
273
Georg Brandleb2aeec2013-10-21 08:57:26 +0200274 Under some circumstances, :func:`median_grouped` may coerce data points to
275 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700276
277 .. seealso::
278
Georg Brandleb2aeec2013-10-21 08:57:26 +0200279 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
280 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700281
Georg Brandleb2aeec2013-10-21 08:57:26 +0200282 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100283 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200284 function in the Gnome Gnumeric spreadsheet, including `this discussion
285 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700286
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700287
288.. function:: mode(data)
289
Georg Brandleb2aeec2013-10-21 08:57:26 +0200290 Return the most common data point from discrete or nominal *data*. The mode
291 (when it exists) is the most typical value, and is a robust measure of
292 central location.
293
294 If *data* is empty, or if there is not exactly one most common value,
295 :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700296
297 ``mode`` assumes discrete data, and returns a single value. This is the
298 standard treatment of the mode as commonly taught in schools:
299
300 .. doctest::
301
302 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
303 3
304
305 The mode is unique in that it is the only statistic which also applies
306 to nominal (non-numeric) data:
307
308 .. doctest::
309
310 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
311 'red'
312
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700313
Georg Brandleb2aeec2013-10-21 08:57:26 +0200314.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700315
Georg Brandleb2aeec2013-10-21 08:57:26 +0200316 Return the population standard deviation (the square root of the population
317 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700318
319 .. doctest::
320
321 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
322 0.986893273527251
323
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700324
Georg Brandleb2aeec2013-10-21 08:57:26 +0200325.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700326
Georg Brandleb2aeec2013-10-21 08:57:26 +0200327 Return the population variance of *data*, a non-empty iterable of real-valued
328 numbers. Variance, or second moment about the mean, is a measure of the
329 variability (spread or dispersion) of data. A large variance indicates that
330 the data is spread out; a small variance indicates it is clustered closely
331 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700332
Georg Brandleb2aeec2013-10-21 08:57:26 +0200333 If the optional second argument *mu* is given, it should be the mean of
334 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700335 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700336
Georg Brandleb2aeec2013-10-21 08:57:26 +0200337 Use this function to calculate the variance from the entire population. To
338 estimate the variance from a sample, the :func:`variance` function is usually
339 a better choice.
340
341 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700342
343 Examples:
344
345 .. doctest::
346
347 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
348 >>> pvariance(data)
349 1.25
350
Georg Brandleb2aeec2013-10-21 08:57:26 +0200351 If you have already calculated the mean of your data, you can pass it as the
352 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700353
354 .. doctest::
355
356 >>> mu = mean(data)
357 >>> pvariance(data, mu)
358 1.25
359
Georg Brandleb2aeec2013-10-21 08:57:26 +0200360 This function does not attempt to verify that you have passed the actual mean
361 as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
362 results.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700363
364 Decimals and Fractions are supported:
365
366 .. doctest::
367
368 >>> from decimal import Decimal as D
369 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
370 Decimal('24.815')
371
372 >>> from fractions import Fraction as F
373 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
374 Fraction(13, 72)
375
376 .. note::
377
Georg Brandleb2aeec2013-10-21 08:57:26 +0200378 When called with the entire population, this gives the population variance
379 σ². When called on a sample instead, this is the biased sample variance
380 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700381
Georg Brandleb2aeec2013-10-21 08:57:26 +0200382 If you somehow know the true population mean μ, you may use this function
383 to calculate the variance of a sample, giving the known population mean as
384 the second argument. Provided the data points are representative
385 (e.g. independent and identically distributed), the result will be an
386 unbiased estimate of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700387
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700388
Georg Brandleb2aeec2013-10-21 08:57:26 +0200389.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700390
Georg Brandleb2aeec2013-10-21 08:57:26 +0200391 Return the sample standard deviation (the square root of the sample
392 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700393
394 .. doctest::
395
396 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
397 1.0810874155219827
398
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700399
Georg Brandleb2aeec2013-10-21 08:57:26 +0200400.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700401
Georg Brandleb2aeec2013-10-21 08:57:26 +0200402 Return the sample variance of *data*, an iterable of at least two real-valued
403 numbers. Variance, or second moment about the mean, is a measure of the
404 variability (spread or dispersion) of data. A large variance indicates that
405 the data is spread out; a small variance indicates it is clustered closely
406 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700407
Georg Brandleb2aeec2013-10-21 08:57:26 +0200408 If the optional second argument *xbar* is given, it should be the mean of
409 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700410 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700411
Georg Brandleb2aeec2013-10-21 08:57:26 +0200412 Use this function when your data is a sample from a population. To calculate
413 the variance from the entire population, see :func:`pvariance`.
414
415 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700416
417 Examples:
418
419 .. doctest::
420
421 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
422 >>> variance(data)
423 1.3720238095238095
424
Georg Brandleb2aeec2013-10-21 08:57:26 +0200425 If you have already calculated the mean of your data, you can pass it as the
426 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700427
428 .. doctest::
429
430 >>> m = mean(data)
431 >>> variance(data, m)
432 1.3720238095238095
433
Georg Brandleb2aeec2013-10-21 08:57:26 +0200434 This function does not attempt to verify that you have passed the actual mean
435 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700436 impossible results.
437
438 Decimal and Fraction values are supported:
439
440 .. doctest::
441
442 >>> from decimal import Decimal as D
443 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
444 Decimal('31.01875')
445
446 >>> from fractions import Fraction as F
447 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
448 Fraction(67, 108)
449
450 .. note::
451
Georg Brandleb2aeec2013-10-21 08:57:26 +0200452 This is the sample variance s² with Bessel's correction, also known as
453 variance with N-1 degrees of freedom. Provided that the data points are
454 representative (e.g. independent and identically distributed), the result
455 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700456
Georg Brandleb2aeec2013-10-21 08:57:26 +0200457 If you somehow know the actual population mean μ you should pass it to the
458 :func:`pvariance` function as the *mu* parameter to get the variance of a
459 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700460
461Exceptions
462----------
463
464A single exception is defined:
465
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400466.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700467
Benjamin Peterson44c30652013-10-20 17:52:09 -0400468 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700469
Raymond Hettinger11c79532019-02-23 14:44:07 -0800470
471:class:`NormalDist` objects
472===========================
473
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800474:class:`NormalDist` is a tool for creating and manipulating normal
475distributions of a `random variable
476<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
477composite class that treats the mean and standard deviation of data
478measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800479
480Normal distributions arise from the `Central Limit Theorem
481<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800482of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800483
484.. class:: NormalDist(mu=0.0, sigma=1.0)
485
486 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800487 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800488 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800489 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800490
491 If *sigma* is negative, raises :exc:`StatisticsError`.
492
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800493 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800494
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800495 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800496 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
497 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800498
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800499 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800500
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800501 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800502 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
503 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800504
505 .. attribute:: variance
506
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800507 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800508 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
509 distribution. Equal to the square of the standard deviation.
510
511 .. classmethod:: NormalDist.from_samples(data)
512
513 Class method that makes a normal distribution instance
514 from sample data. The *data* can be any :term:`iterable`
515 and should consist of values that can be converted to type
516 :class:`float`.
517
518 If *data* does not contain at least two elements, raises
519 :exc:`StatisticsError` because it takes at least one point to estimate
520 a central value and at least two points to estimate dispersion.
521
522 .. method:: NormalDist.samples(n, seed=None)
523
524 Generates *n* random samples for a given mean and standard deviation.
525 Returns a :class:`list` of :class:`float` values.
526
527 If *seed* is given, creates a new instance of the underlying random
528 number generator. This is useful for creating reproducible results,
529 even in a multi-threading context.
530
531 .. method:: NormalDist.pdf(x)
532
533 Using a `probability density function (pdf)
534 <https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800535 compute the relative likelihood that a random variable *X* will be near
Raymond Hettinger11c79532019-02-23 14:44:07 -0800536 the given value *x*. Mathematically, it is the ratio ``P(x <= X <
537 x+dx) / dx``.
538
539 Note the relative likelihood of *x* can be greater than `1.0`. The
540 probability for a specific point on a continuous distribution is `0.0`,
541 so the :func:`pdf` is used instead. It gives the probability of a
542 sample occurring in a narrow range around *x* and then dividing that
543 probability by the width of the range (hence the word "density").
544
545 .. method:: NormalDist.cdf(x)
546
547 Using a `cumulative distribution function (cdf)
548 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800549 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800550 equal to *x*. Mathematically, it is written ``P(X <= x)``.
551
Raymond Hettinger318d5372019-03-06 22:59:40 -0800552 .. method:: NormalDist.overlap(other)
553
554 Compute the `overlapping coefficient (OVL)
555 <http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf>`_
556 between two normal distributions.
557
558 Measures the agreement between two normal probability distributions.
559 Returns a value between 0.0 and 1.0 giving the overlapping area for
560 two probability density functions.
561
562 In this `example from John M. Linacre
563 <https://www.rasch.org/rmt/rmt101r.htm>`_ about 80% of each
564 distribution overlaps the other:
565
566 .. doctest::
567
568 >>> N1 = NormalDist(2.4, 1.6)
569 >>> N2 = NormalDist(3.2, 2.0)
570 >>> ovl = N1.overlap(N2)
571 >>> f'{ovl * 100.0 :.1f}%'
572 '80.4%'
573
Raymond Hettinger11c79532019-02-23 14:44:07 -0800574 Instances of :class:`NormalDist` support addition, subtraction,
575 multiplication and division by a constant. These operations
576 are used for translation and scaling. For example:
577
578 .. doctest::
579
580 >>> temperature_february = NormalDist(5, 2.5) # Celsius
581 >>> temperature_february * (9/5) + 32 # Fahrenheit
582 NormalDist(mu=41.0, sigma=4.5)
583
584 Dividing a constant by an instance of :class:`NormalDist` is not supported.
585
586 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800587 variables, it is possible to `add and subtract two independent normally
588 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800589 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
590 represented as instances of :class:`NormalDist`. For example:
591
592 .. doctest::
593
594 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
595 >>> drug_effects = NormalDist(0.4, 0.15)
596 >>> combined = birth_weights + drug_effects
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800597 >>> f'mean: {combined.mean :.1f} standard deviation: {combined.stdev :.1f}'
598 'mean: 3.1 standard deviation: 0.5'
Raymond Hettinger11c79532019-02-23 14:44:07 -0800599
600 .. versionadded:: 3.8
601
602
603:class:`NormalDist` Examples and Recipes
604----------------------------------------
605
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800606:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800607
608For example, given `historical data for SAT exams
609<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800610are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger11c79532019-02-23 14:44:07 -0800611determine the percentage of students with scores between 1100 and 1200:
612
613.. doctest::
614
615 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800616 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800617 >>> f'{fraction * 100 :.1f}% score between 1100 and 1200'
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800618 '18.4% score between 1100 and 1200'
Raymond Hettinger11c79532019-02-23 14:44:07 -0800619
Raymond Hettinger318d5372019-03-06 22:59:40 -0800620What percentage of men and women will have the same height in `two normally
621distributed populations with known means and standard deviations
622<http://www.usablestats.com/lessons/normal>`_?
623
624 >>> men = NormalDist(70, 4)
625 >>> women = NormalDist(65, 3.5)
626 >>> ovl = men.overlap(women)
627 >>> round(ovl * 100.0, 1)
628 50.3
629
Raymond Hettinger11c79532019-02-23 14:44:07 -0800630To estimate the distribution for a model than isn't easy to solve
631analytically, :class:`NormalDist` can generate input samples for a `Monte
632Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_ of the
633model:
634
635.. doctest::
636
637 >>> n = 100_000
638 >>> X = NormalDist(350, 15).samples(n)
639 >>> Y = NormalDist(47, 17).samples(n)
640 >>> Z = NormalDist(62, 6).samples(n)
641 >>> model_simulation = [x * y / z for x, y, z in zip(X, Y, Z)]
642 >>> NormalDist.from_samples(model_simulation) # doctest: +SKIP
643 NormalDist(mu=267.6516398754636, sigma=101.357284306067)
644
645Normal distributions commonly arise in machine learning problems.
646
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800647Wikipedia has a `nice example of a Naive Bayesian Classifier
648<https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_. The challenge is to
649predict a person's gender from measurements of normally distributed features
650including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800651
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800652We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800653measurements are assumed to be normally distributed, so we summarize the data
654with :class:`NormalDist`:
655
656.. doctest::
657
658 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
659 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
660 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
661 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
662 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
663 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
664
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800665Next, we encounter a new person whose feature measurements are known but whose
666gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800667
668.. doctest::
669
670 >>> ht = 6.0 # height
671 >>> wt = 130 # weight
672 >>> fs = 8 # foot size
673
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800674Starting with a 50% `prior probability
675<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
676we compute the posterior as the prior times the product of likelihoods for the
677feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800678
679.. doctest::
680
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800681 >>> prior_male = 0.5
682 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800683 >>> posterior_male = (prior_male * height_male.pdf(ht) *
684 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
685
686 >>> posterior_female = (prior_female * height_female.pdf(ht) *
687 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
688
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800689The final prediction goes to the largest posterior. This is known as the
690`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800691<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
692
693.. doctest::
694
695 >>> 'male' if posterior_male > posterior_female else 'female'
696 'female'
697
698
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700699..
700 # This modelines must appear within the last ten lines of the file.
701 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;