blob: 1d52d98b29971b9a7e58d607205d370cc3d651c7 [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
5 :synopsis: mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
22numeric (:class:`Real`-valued) data.
23
Nick Coghlan73afe2a2014-02-08 19:58:04 +100024.. note::
25
26 Unless explicitly noted otherwise, these functions support :class:`int`,
27 :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
28 Behaviour with other types (whether in the numeric tower or not) is
29 currently unsupported. Mixed types are also undefined and
30 implementation-dependent. If your input data consists of mixed types,
31 you may be able to use :func:`map` to ensure a consistent result, e.g.
32 ``map(float, input_data)``.
33
Larry Hastingsf5e987b2013-10-19 11:50:09 -070034Averages and measures of central location
35-----------------------------------------
36
37These functions calculate an average or typical value from a population
38or sample.
39
Raymond Hettingerfc06a192019-03-12 00:43:27 -070040======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070041:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080042:func:`fmean` Fast, floating point arithmetic mean.
Steven D'Aprano22873182016-08-24 02:34:25 +100043:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070044:func:`median` Median (middle value) of data.
45:func:`median_low` Low median of data.
46:func:`median_high` High median of data.
47:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070048:func:`mode` Single mode (most common value) of discrete or nominal data.
49:func:`multimode` List of modes (most common values) of discrete or nomimal data.
50======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070051
Georg Brandleb2aeec2013-10-21 08:57:26 +020052Measures of spread
53------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070054
Georg Brandleb2aeec2013-10-21 08:57:26 +020055These functions calculate a measure of how much the population or sample
56tends to deviate from the typical or average values.
57
58======================= =============================================
59:func:`pstdev` Population standard deviation of data.
60:func:`pvariance` Population variance of data.
61:func:`stdev` Sample standard deviation of data.
62:func:`variance` Sample variance of data.
63======================= =============================================
64
65
66Function details
67----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070068
Georg Brandle051b552013-11-04 07:30:50 +010069Note: The functions do not require the data given to them to be sorted.
70However, for reading convenience, most of the examples show sorted sequences.
71
Larry Hastingsf5e987b2013-10-19 11:50:09 -070072.. function:: mean(data)
73
Raymond Hettinger6da90782016-11-21 16:31:02 -080074 Return the sample arithmetic mean of *data* which can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070075
Georg Brandleb2aeec2013-10-21 08:57:26 +020076 The arithmetic mean is the sum of the data divided by the number of data
77 points. It is commonly called "the average", although it is only one of many
78 different mathematical averages. It is a measure of the central location of
79 the data.
80
81 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070082
83 Some examples of use:
84
85 .. doctest::
86
87 >>> mean([1, 2, 3, 4, 4])
88 2.8
89 >>> mean([-1.0, 2.5, 3.25, 5.75])
90 2.625
91
92 >>> from fractions import Fraction as F
93 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
94 Fraction(13, 21)
95
96 >>> from decimal import Decimal as D
97 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
98 Decimal('0.5625')
99
100 .. note::
101
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200102 The mean is strongly affected by outliers and is not a robust estimator
Georg Brandleb2aeec2013-10-21 08:57:26 +0200103 for central location: the mean is not necessarily a typical example of the
104 data points. For more robust, although less efficient, measures of
105 central location, see :func:`median` and :func:`mode`. (In this case,
106 "efficient" refers to statistical efficiency rather than computational
107 efficiency.)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700108
Georg Brandleb2aeec2013-10-21 08:57:26 +0200109 The sample mean gives an unbiased estimate of the true population mean,
110 which means that, taken on average over all the possible samples,
111 ``mean(sample)`` converges on the true mean of the entire population. If
112 *data* represents the entire population rather than a sample, then
113 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700114
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700115
Raymond Hettinger47d99872019-02-21 15:06:29 -0800116.. function:: fmean(data)
117
118 Convert *data* to floats and compute the arithmetic mean.
119
120 This runs faster than the :func:`mean` function and it always returns a
121 :class:`float`. The result is highly accurate but not as perfect as
122 :func:`mean`. If the input dataset is empty, raises a
123 :exc:`StatisticsError`.
124
125 .. doctest::
126
127 >>> fmean([3.5, 4.0, 5.25])
128 4.25
129
130 .. versionadded:: 3.8
131
132
Steven D'Aprano22873182016-08-24 02:34:25 +1000133.. function:: harmonic_mean(data)
134
135 Return the harmonic mean of *data*, a sequence or iterator of
136 real-valued numbers.
137
138 The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Warec019bd32016-08-23 13:23:31 -0500139 reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano22873182016-08-24 02:34:25 +1000140 data. For example, the harmonic mean of three values *a*, *b* and *c*
141 will be equivalent to ``3/(1/a + 1/b + 1/c)``.
142
143 The harmonic mean is a type of average, a measure of the central
144 location of the data. It is often appropriate when averaging quantities
145 which are rates or ratios, for example speeds. For example:
146
147 Suppose an investor purchases an equal value of shares in each of
148 three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
149 What is the average P/E ratio for the investor's portfolio?
150
151 .. doctest::
152
153 >>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
154 3.6
155
156 Using the arithmetic mean would give an average of about 5.167, which
157 is too high.
158
Zachary Warec019bd32016-08-23 13:23:31 -0500159 :exc:`StatisticsError` is raised if *data* is empty, or any element
Steven D'Aprano22873182016-08-24 02:34:25 +1000160 is less than zero.
161
Zachary Warec019bd32016-08-23 13:23:31 -0500162 .. versionadded:: 3.6
163
Steven D'Aprano22873182016-08-24 02:34:25 +1000164
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700165.. function:: median(data)
166
Georg Brandleb2aeec2013-10-21 08:57:26 +0200167 Return the median (middle value) of numeric data, using the common "mean of
168 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger6da90782016-11-21 16:31:02 -0800169 *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700170
Georg Brandleb2aeec2013-10-21 08:57:26 +0200171 The median is a robust measure of central location, and is less affected by
172 the presence of outliers in your data. When the number of data points is
173 odd, the middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700174
175 .. doctest::
176
177 >>> median([1, 3, 5])
178 3
179
Georg Brandleb2aeec2013-10-21 08:57:26 +0200180 When the number of data points is even, the median is interpolated by taking
181 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700182
183 .. doctest::
184
185 >>> median([1, 3, 5, 7])
186 4.0
187
Georg Brandleb2aeec2013-10-21 08:57:26 +0200188 This is suited for when your data is discrete, and you don't mind that the
189 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700190
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300191 If your data is ordinal (supports order operations) but not numeric (doesn't
192 support addition), you should use :func:`median_low` or :func:`median_high`
193 instead.
194
Berker Peksag9c1dba22014-09-28 00:00:58 +0300195 .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700196
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700197
198.. function:: median_low(data)
199
Georg Brandleb2aeec2013-10-21 08:57:26 +0200200 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger6da90782016-11-21 16:31:02 -0800201 :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700202
Georg Brandleb2aeec2013-10-21 08:57:26 +0200203 The low median is always a member of the data set. When the number of data
204 points is odd, the middle value is returned. When it is even, the smaller of
205 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700206
207 .. doctest::
208
209 >>> median_low([1, 3, 5])
210 3
211 >>> median_low([1, 3, 5, 7])
212 3
213
Georg Brandleb2aeec2013-10-21 08:57:26 +0200214 Use the low median when your data are discrete and you prefer the median to
215 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700216
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700217
218.. function:: median_high(data)
219
Georg Brandleb2aeec2013-10-21 08:57:26 +0200220 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800221 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700222
Georg Brandleb2aeec2013-10-21 08:57:26 +0200223 The high median is always a member of the data set. When the number of data
224 points is odd, the middle value is returned. When it is even, the larger of
225 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700226
227 .. doctest::
228
229 >>> median_high([1, 3, 5])
230 3
231 >>> median_high([1, 3, 5, 7])
232 5
233
Georg Brandleb2aeec2013-10-21 08:57:26 +0200234 Use the high median when your data are discrete and you prefer the median to
235 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700236
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700237
Georg Brandleb2aeec2013-10-21 08:57:26 +0200238.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700239
Georg Brandleb2aeec2013-10-21 08:57:26 +0200240 Return the median of grouped continuous data, calculated as the 50th
241 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800242 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700243
244 .. doctest::
245
246 >>> median_grouped([52, 52, 53, 54])
247 52.5
248
Georg Brandleb2aeec2013-10-21 08:57:26 +0200249 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200250 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
251 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
252 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200253 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700254
255 .. doctest::
256
257 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
258 3.7
259
Georg Brandleb2aeec2013-10-21 08:57:26 +0200260 Optional argument *interval* represents the class interval, and defaults
261 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700262
263 .. doctest::
264
265 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
266 3.25
267 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
268 3.5
269
270 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200271 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700272
273 .. impl-detail::
274
Georg Brandleb2aeec2013-10-21 08:57:26 +0200275 Under some circumstances, :func:`median_grouped` may coerce data points to
276 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700277
278 .. seealso::
279
Georg Brandleb2aeec2013-10-21 08:57:26 +0200280 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
281 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700282
Georg Brandleb2aeec2013-10-21 08:57:26 +0200283 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100284 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200285 function in the Gnome Gnumeric spreadsheet, including `this discussion
286 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700287
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700288
289.. function:: mode(data)
290
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700291 Return the single most common data point from discrete or nominal *data*.
292 The mode (when it exists) is the most typical value and serves as a
293 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200294
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700295 If there are multiple modes, returns the first one encountered in the *data*.
296 If *data* is empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700297
298 ``mode`` assumes discrete data, and returns a single value. This is the
299 standard treatment of the mode as commonly taught in schools:
300
301 .. doctest::
302
303 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
304 3
305
306 The mode is unique in that it is the only statistic which also applies
307 to nominal (non-numeric) data:
308
309 .. doctest::
310
311 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
312 'red'
313
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700314 .. versionchanged:: 3.8
315 Now handles multimodal datasets by returning the first mode encountered.
316 Formerly, it raised :exc:`StatisticsError` when more than one mode was
317 found.
318
319
320.. function:: multimode(data)
321
322 Return a list of the most frequently occurring values in the order they
323 were first encountered in the *data*. Will return more than one result if
324 there are multiple modes or an empty list if the *data* is empty:
325
326 .. doctest::
327
328 >>> multimode('aabbbbccddddeeffffgg')
329 ['b', 'd', 'f']
330 >>> multimode('')
331 []
332
333 .. versionadded:: 3.8
334
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700335
Georg Brandleb2aeec2013-10-21 08:57:26 +0200336.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700337
Georg Brandleb2aeec2013-10-21 08:57:26 +0200338 Return the population standard deviation (the square root of the population
339 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700340
341 .. doctest::
342
343 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
344 0.986893273527251
345
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700346
Georg Brandleb2aeec2013-10-21 08:57:26 +0200347.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700348
Georg Brandleb2aeec2013-10-21 08:57:26 +0200349 Return the population variance of *data*, a non-empty iterable of real-valued
350 numbers. Variance, or second moment about the mean, is a measure of the
351 variability (spread or dispersion) of data. A large variance indicates that
352 the data is spread out; a small variance indicates it is clustered closely
353 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700354
Georg Brandleb2aeec2013-10-21 08:57:26 +0200355 If the optional second argument *mu* is given, it should be the mean of
356 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700357 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700358
Georg Brandleb2aeec2013-10-21 08:57:26 +0200359 Use this function to calculate the variance from the entire population. To
360 estimate the variance from a sample, the :func:`variance` function is usually
361 a better choice.
362
363 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700364
365 Examples:
366
367 .. doctest::
368
369 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
370 >>> pvariance(data)
371 1.25
372
Georg Brandleb2aeec2013-10-21 08:57:26 +0200373 If you have already calculated the mean of your data, you can pass it as the
374 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700375
376 .. doctest::
377
378 >>> mu = mean(data)
379 >>> pvariance(data, mu)
380 1.25
381
Georg Brandleb2aeec2013-10-21 08:57:26 +0200382 This function does not attempt to verify that you have passed the actual mean
383 as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
384 results.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700385
386 Decimals and Fractions are supported:
387
388 .. doctest::
389
390 >>> from decimal import Decimal as D
391 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
392 Decimal('24.815')
393
394 >>> from fractions import Fraction as F
395 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
396 Fraction(13, 72)
397
398 .. note::
399
Georg Brandleb2aeec2013-10-21 08:57:26 +0200400 When called with the entire population, this gives the population variance
401 σ². When called on a sample instead, this is the biased sample variance
402 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700403
Georg Brandleb2aeec2013-10-21 08:57:26 +0200404 If you somehow know the true population mean μ, you may use this function
405 to calculate the variance of a sample, giving the known population mean as
406 the second argument. Provided the data points are representative
407 (e.g. independent and identically distributed), the result will be an
408 unbiased estimate of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700409
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700410
Georg Brandleb2aeec2013-10-21 08:57:26 +0200411.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700412
Georg Brandleb2aeec2013-10-21 08:57:26 +0200413 Return the sample standard deviation (the square root of the sample
414 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700415
416 .. doctest::
417
418 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
419 1.0810874155219827
420
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700421
Georg Brandleb2aeec2013-10-21 08:57:26 +0200422.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700423
Georg Brandleb2aeec2013-10-21 08:57:26 +0200424 Return the sample variance of *data*, an iterable of at least two real-valued
425 numbers. Variance, or second moment about the mean, is a measure of the
426 variability (spread or dispersion) of data. A large variance indicates that
427 the data is spread out; a small variance indicates it is clustered closely
428 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700429
Georg Brandleb2aeec2013-10-21 08:57:26 +0200430 If the optional second argument *xbar* is given, it should be the mean of
431 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700432 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700433
Georg Brandleb2aeec2013-10-21 08:57:26 +0200434 Use this function when your data is a sample from a population. To calculate
435 the variance from the entire population, see :func:`pvariance`.
436
437 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700438
439 Examples:
440
441 .. doctest::
442
443 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
444 >>> variance(data)
445 1.3720238095238095
446
Georg Brandleb2aeec2013-10-21 08:57:26 +0200447 If you have already calculated the mean of your data, you can pass it as the
448 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700449
450 .. doctest::
451
452 >>> m = mean(data)
453 >>> variance(data, m)
454 1.3720238095238095
455
Georg Brandleb2aeec2013-10-21 08:57:26 +0200456 This function does not attempt to verify that you have passed the actual mean
457 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700458 impossible results.
459
460 Decimal and Fraction values are supported:
461
462 .. doctest::
463
464 >>> from decimal import Decimal as D
465 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
466 Decimal('31.01875')
467
468 >>> from fractions import Fraction as F
469 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
470 Fraction(67, 108)
471
472 .. note::
473
Georg Brandleb2aeec2013-10-21 08:57:26 +0200474 This is the sample variance s² with Bessel's correction, also known as
475 variance with N-1 degrees of freedom. Provided that the data points are
476 representative (e.g. independent and identically distributed), the result
477 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700478
Georg Brandleb2aeec2013-10-21 08:57:26 +0200479 If you somehow know the actual population mean μ you should pass it to the
480 :func:`pvariance` function as the *mu* parameter to get the variance of a
481 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700482
483Exceptions
484----------
485
486A single exception is defined:
487
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400488.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700489
Benjamin Peterson44c30652013-10-20 17:52:09 -0400490 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700491
Raymond Hettinger11c79532019-02-23 14:44:07 -0800492
493:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700494---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800495
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800496:class:`NormalDist` is a tool for creating and manipulating normal
497distributions of a `random variable
498<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
499composite class that treats the mean and standard deviation of data
500measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800501
502Normal distributions arise from the `Central Limit Theorem
503<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800504of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800505
506.. class:: NormalDist(mu=0.0, sigma=1.0)
507
508 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800509 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800510 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800511 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800512
513 If *sigma* is negative, raises :exc:`StatisticsError`.
514
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800515 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800516
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800517 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800518 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
519 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800520
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800521 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800522
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800523 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800524 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
525 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800526
527 .. attribute:: variance
528
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800529 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800530 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
531 distribution. Equal to the square of the standard deviation.
532
533 .. classmethod:: NormalDist.from_samples(data)
534
Raymond Hettingercc353a02019-03-10 23:43:33 -0700535 Makes a normal distribution instance computed from sample data. The
536 *data* can be any :term:`iterable` and should consist of values that
537 can be converted to type :class:`float`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800538
539 If *data* does not contain at least two elements, raises
540 :exc:`StatisticsError` because it takes at least one point to estimate
541 a central value and at least two points to estimate dispersion.
542
543 .. method:: NormalDist.samples(n, seed=None)
544
545 Generates *n* random samples for a given mean and standard deviation.
546 Returns a :class:`list` of :class:`float` values.
547
548 If *seed* is given, creates a new instance of the underlying random
549 number generator. This is useful for creating reproducible results,
550 even in a multi-threading context.
551
552 .. method:: NormalDist.pdf(x)
553
554 Using a `probability density function (pdf)
555 <https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800556 compute the relative likelihood that a random variable *X* will be near
Raymond Hettinger11c79532019-02-23 14:44:07 -0800557 the given value *x*. Mathematically, it is the ratio ``P(x <= X <
558 x+dx) / dx``.
559
Raymond Hettingercc353a02019-03-10 23:43:33 -0700560 The relative likelihood is computed as the probability of a sample
561 occurring in a narrow range divided by the width of the range (hence
562 the word "density"). Since the likelihood is relative to other points,
563 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800564
565 .. method:: NormalDist.cdf(x)
566
567 Using a `cumulative distribution function (cdf)
568 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800569 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800570 equal to *x*. Mathematically, it is written ``P(X <= x)``.
571
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700572 .. method:: NormalDist.inv_cdf(p)
573
574 Compute the inverse cumulative distribution function, also known as the
575 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
576 or the `percent-point
577 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
578 function. Mathematically, it is written ``x : P(X <= x) = p``.
579
580 Finds the value *x* of the random variable *X* such that the
581 probability of the variable being less than or equal to that value
582 equals the given probability *p*.
583
Raymond Hettinger318d5372019-03-06 22:59:40 -0800584 .. method:: NormalDist.overlap(other)
585
586 Compute the `overlapping coefficient (OVL)
587 <http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf>`_
Raymond Hettinger14bab7a2019-03-07 08:54:31 -0800588 between two normal distributions, giving a measure of agreement.
589 Returns a value between 0.0 and 1.0 giving `the overlapping area for
590 two probability density functions
591 <https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800592
Raymond Hettinger11c79532019-02-23 14:44:07 -0800593 Instances of :class:`NormalDist` support addition, subtraction,
594 multiplication and division by a constant. These operations
595 are used for translation and scaling. For example:
596
597 .. doctest::
598
599 >>> temperature_february = NormalDist(5, 2.5) # Celsius
600 >>> temperature_february * (9/5) + 32 # Fahrenheit
601 NormalDist(mu=41.0, sigma=4.5)
602
Raymond Hettingercc353a02019-03-10 23:43:33 -0700603 Dividing a constant by an instance of :class:`NormalDist` is not supported
604 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800605
606 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800607 variables, it is possible to `add and subtract two independent normally
608 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800609 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
610 represented as instances of :class:`NormalDist`. For example:
611
612 .. doctest::
613
614 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
615 >>> drug_effects = NormalDist(0.4, 0.15)
616 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700617 >>> round(combined.mean, 1)
618 3.1
619 >>> round(combined.stdev, 1)
620 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800621
622 .. versionadded:: 3.8
623
624
625:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700626^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800627
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800628:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800629
630For example, given `historical data for SAT exams
631<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800632are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettingercc353a02019-03-10 23:43:33 -0700633determine the percentage of students with scores between 1100 and 1200, after
634rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800635
636.. doctest::
637
638 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800639 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700640 >>> round(fraction * 100.0, 1)
641 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800642
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700643Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
644<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
645
646.. doctest::
647
648 >>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
649 [928, 1060, 1192]
650 >>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
651 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
652
Raymond Hettinger318d5372019-03-06 22:59:40 -0800653What percentage of men and women will have the same height in `two normally
654distributed populations with known means and standard deviations
655<http://www.usablestats.com/lessons/normal>`_?
656
657 >>> men = NormalDist(70, 4)
658 >>> women = NormalDist(65, 3.5)
659 >>> ovl = men.overlap(women)
660 >>> round(ovl * 100.0, 1)
661 50.3
662
Raymond Hettinger11c79532019-02-23 14:44:07 -0800663To estimate the distribution for a model than isn't easy to solve
664analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700665Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800666
667.. doctest::
668
Raymond Hettingercc353a02019-03-10 23:43:33 -0700669 >>> def model(x, y, z):
670 ... return (3*x + 7*x*y - 5*y) / (11 * z)
671 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800672 >>> n = 100_000
Raymond Hettingercc353a02019-03-10 23:43:33 -0700673 >>> X = NormalDist(10, 2.5).samples(n)
674 >>> Y = NormalDist(15, 1.75).samples(n)
675 >>> Z = NormalDist(5, 1.25).samples(n)
676 >>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
677 NormalDist(mu=19.640137307085507, sigma=47.03273142191088)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800678
679Normal distributions commonly arise in machine learning problems.
680
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800681Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800682<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
683The challenge is to predict a person's gender from measurements of normally
684distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800685
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800686We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800687measurements are assumed to be normally distributed, so we summarize the data
688with :class:`NormalDist`:
689
690.. doctest::
691
692 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
693 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
694 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
695 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
696 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
697 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
698
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800699Next, we encounter a new person whose feature measurements are known but whose
700gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800701
702.. doctest::
703
704 >>> ht = 6.0 # height
705 >>> wt = 130 # weight
706 >>> fs = 8 # foot size
707
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800708Starting with a 50% `prior probability
709<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
710we compute the posterior as the prior times the product of likelihoods for the
711feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800712
713.. doctest::
714
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800715 >>> prior_male = 0.5
716 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800717 >>> posterior_male = (prior_male * height_male.pdf(ht) *
718 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
719
720 >>> posterior_female = (prior_female * height_female.pdf(ht) *
721 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
722
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800723The final prediction goes to the largest posterior. This is known as the
724`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800725<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
726
727.. doctest::
728
729 >>> 'male' if posterior_male > posterior_female else 'female'
730 'female'
731
732
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700733..
734 # This modelines must appear within the last ten lines of the file.
735 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;