blob: 8bb2bdf7b697deabd8e6850f3e11f983fc4e8e40 [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
5 :synopsis: mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
22numeric (:class:`Real`-valued) data.
23
Nick Coghlan73afe2a2014-02-08 19:58:04 +100024.. note::
25
26 Unless explicitly noted otherwise, these functions support :class:`int`,
27 :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
28 Behaviour with other types (whether in the numeric tower or not) is
29 currently unsupported. Mixed types are also undefined and
30 implementation-dependent. If your input data consists of mixed types,
31 you may be able to use :func:`map` to ensure a consistent result, e.g.
32 ``map(float, input_data)``.
33
Larry Hastingsf5e987b2013-10-19 11:50:09 -070034Averages and measures of central location
35-----------------------------------------
36
37These functions calculate an average or typical value from a population
38or sample.
39
Raymond Hettingerfc06a192019-03-12 00:43:27 -070040======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070041:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080042:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger6463ba32019-04-07 09:20:03 -070043:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano22873182016-08-24 02:34:25 +100044:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070045:func:`median` Median (middle value) of data.
46:func:`median_low` Low median of data.
47:func:`median_high` High median of data.
48:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070049:func:`mode` Single mode (most common value) of discrete or nominal data.
50:func:`multimode` List of modes (most common values) of discrete or nomimal data.
51======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070052
Georg Brandleb2aeec2013-10-21 08:57:26 +020053Measures of spread
54------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070055
Georg Brandleb2aeec2013-10-21 08:57:26 +020056These functions calculate a measure of how much the population or sample
57tends to deviate from the typical or average values.
58
59======================= =============================================
60:func:`pstdev` Population standard deviation of data.
61:func:`pvariance` Population variance of data.
62:func:`stdev` Sample standard deviation of data.
63:func:`variance` Sample variance of data.
64======================= =============================================
65
66
67Function details
68----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070069
Georg Brandle051b552013-11-04 07:30:50 +010070Note: The functions do not require the data given to them to be sorted.
71However, for reading convenience, most of the examples show sorted sequences.
72
Larry Hastingsf5e987b2013-10-19 11:50:09 -070073.. function:: mean(data)
74
Raymond Hettinger6da90782016-11-21 16:31:02 -080075 Return the sample arithmetic mean of *data* which can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070076
Georg Brandleb2aeec2013-10-21 08:57:26 +020077 The arithmetic mean is the sum of the data divided by the number of data
78 points. It is commonly called "the average", although it is only one of many
79 different mathematical averages. It is a measure of the central location of
80 the data.
81
82 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070083
84 Some examples of use:
85
86 .. doctest::
87
88 >>> mean([1, 2, 3, 4, 4])
89 2.8
90 >>> mean([-1.0, 2.5, 3.25, 5.75])
91 2.625
92
93 >>> from fractions import Fraction as F
94 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
95 Fraction(13, 21)
96
97 >>> from decimal import Decimal as D
98 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
99 Decimal('0.5625')
100
101 .. note::
102
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200103 The mean is strongly affected by outliers and is not a robust estimator
Georg Brandleb2aeec2013-10-21 08:57:26 +0200104 for central location: the mean is not necessarily a typical example of the
105 data points. For more robust, although less efficient, measures of
106 central location, see :func:`median` and :func:`mode`. (In this case,
107 "efficient" refers to statistical efficiency rather than computational
108 efficiency.)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700109
Georg Brandleb2aeec2013-10-21 08:57:26 +0200110 The sample mean gives an unbiased estimate of the true population mean,
111 which means that, taken on average over all the possible samples,
112 ``mean(sample)`` converges on the true mean of the entire population. If
113 *data* represents the entire population rather than a sample, then
114 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700115
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700116
Raymond Hettinger47d99872019-02-21 15:06:29 -0800117.. function:: fmean(data)
118
119 Convert *data* to floats and compute the arithmetic mean.
120
121 This runs faster than the :func:`mean` function and it always returns a
122 :class:`float`. The result is highly accurate but not as perfect as
123 :func:`mean`. If the input dataset is empty, raises a
124 :exc:`StatisticsError`.
125
126 .. doctest::
127
128 >>> fmean([3.5, 4.0, 5.25])
129 4.25
130
131 .. versionadded:: 3.8
132
133
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700134.. function:: geometric_mean(data)
135
136 Convert *data* to floats and compute the geometric mean.
137
138 Raises a :exc:`StatisticsError` if the input dataset is empty,
139 if it contains a zero, or if it contains a negative value.
140
141 No special efforts are made to achieve exact results.
142 (However, this may change in the future.)
143
144 .. doctest::
145
146 >>> round(geometric_mean([54, 24, 36]), 9)
147 36.0
148
149 .. versionadded:: 3.8
150
151
Steven D'Aprano22873182016-08-24 02:34:25 +1000152.. function:: harmonic_mean(data)
153
154 Return the harmonic mean of *data*, a sequence or iterator of
155 real-valued numbers.
156
157 The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Warec019bd32016-08-23 13:23:31 -0500158 reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano22873182016-08-24 02:34:25 +1000159 data. For example, the harmonic mean of three values *a*, *b* and *c*
160 will be equivalent to ``3/(1/a + 1/b + 1/c)``.
161
162 The harmonic mean is a type of average, a measure of the central
163 location of the data. It is often appropriate when averaging quantities
164 which are rates or ratios, for example speeds. For example:
165
166 Suppose an investor purchases an equal value of shares in each of
167 three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
168 What is the average P/E ratio for the investor's portfolio?
169
170 .. doctest::
171
172 >>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
173 3.6
174
175 Using the arithmetic mean would give an average of about 5.167, which
176 is too high.
177
Zachary Warec019bd32016-08-23 13:23:31 -0500178 :exc:`StatisticsError` is raised if *data* is empty, or any element
Steven D'Aprano22873182016-08-24 02:34:25 +1000179 is less than zero.
180
Zachary Warec019bd32016-08-23 13:23:31 -0500181 .. versionadded:: 3.6
182
Steven D'Aprano22873182016-08-24 02:34:25 +1000183
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700184.. function:: median(data)
185
Georg Brandleb2aeec2013-10-21 08:57:26 +0200186 Return the median (middle value) of numeric data, using the common "mean of
187 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger6da90782016-11-21 16:31:02 -0800188 *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700189
Georg Brandleb2aeec2013-10-21 08:57:26 +0200190 The median is a robust measure of central location, and is less affected by
191 the presence of outliers in your data. When the number of data points is
192 odd, the middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700193
194 .. doctest::
195
196 >>> median([1, 3, 5])
197 3
198
Georg Brandleb2aeec2013-10-21 08:57:26 +0200199 When the number of data points is even, the median is interpolated by taking
200 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700201
202 .. doctest::
203
204 >>> median([1, 3, 5, 7])
205 4.0
206
Georg Brandleb2aeec2013-10-21 08:57:26 +0200207 This is suited for when your data is discrete, and you don't mind that the
208 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700209
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300210 If your data is ordinal (supports order operations) but not numeric (doesn't
211 support addition), you should use :func:`median_low` or :func:`median_high`
212 instead.
213
Berker Peksag9c1dba22014-09-28 00:00:58 +0300214 .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700215
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700216
217.. function:: median_low(data)
218
Georg Brandleb2aeec2013-10-21 08:57:26 +0200219 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger6da90782016-11-21 16:31:02 -0800220 :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700221
Georg Brandleb2aeec2013-10-21 08:57:26 +0200222 The low median is always a member of the data set. When the number of data
223 points is odd, the middle value is returned. When it is even, the smaller of
224 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700225
226 .. doctest::
227
228 >>> median_low([1, 3, 5])
229 3
230 >>> median_low([1, 3, 5, 7])
231 3
232
Georg Brandleb2aeec2013-10-21 08:57:26 +0200233 Use the low median when your data are discrete and you prefer the median to
234 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700235
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700236
237.. function:: median_high(data)
238
Georg Brandleb2aeec2013-10-21 08:57:26 +0200239 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800240 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700241
Georg Brandleb2aeec2013-10-21 08:57:26 +0200242 The high median is always a member of the data set. When the number of data
243 points is odd, the middle value is returned. When it is even, the larger of
244 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700245
246 .. doctest::
247
248 >>> median_high([1, 3, 5])
249 3
250 >>> median_high([1, 3, 5, 7])
251 5
252
Georg Brandleb2aeec2013-10-21 08:57:26 +0200253 Use the high median when your data are discrete and you prefer the median to
254 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700255
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700256
Georg Brandleb2aeec2013-10-21 08:57:26 +0200257.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700258
Georg Brandleb2aeec2013-10-21 08:57:26 +0200259 Return the median of grouped continuous data, calculated as the 50th
260 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800261 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700262
263 .. doctest::
264
265 >>> median_grouped([52, 52, 53, 54])
266 52.5
267
Georg Brandleb2aeec2013-10-21 08:57:26 +0200268 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200269 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
270 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
271 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200272 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700273
274 .. doctest::
275
276 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
277 3.7
278
Georg Brandleb2aeec2013-10-21 08:57:26 +0200279 Optional argument *interval* represents the class interval, and defaults
280 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700281
282 .. doctest::
283
284 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
285 3.25
286 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
287 3.5
288
289 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200290 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700291
292 .. impl-detail::
293
Georg Brandleb2aeec2013-10-21 08:57:26 +0200294 Under some circumstances, :func:`median_grouped` may coerce data points to
295 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700296
297 .. seealso::
298
Georg Brandleb2aeec2013-10-21 08:57:26 +0200299 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
300 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700301
Georg Brandleb2aeec2013-10-21 08:57:26 +0200302 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100303 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200304 function in the Gnome Gnumeric spreadsheet, including `this discussion
305 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700306
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700307
308.. function:: mode(data)
309
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700310 Return the single most common data point from discrete or nominal *data*.
311 The mode (when it exists) is the most typical value and serves as a
312 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200313
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700314 If there are multiple modes, returns the first one encountered in the *data*.
315 If *data* is empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700316
317 ``mode`` assumes discrete data, and returns a single value. This is the
318 standard treatment of the mode as commonly taught in schools:
319
320 .. doctest::
321
322 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
323 3
324
325 The mode is unique in that it is the only statistic which also applies
326 to nominal (non-numeric) data:
327
328 .. doctest::
329
330 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
331 'red'
332
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700333 .. versionchanged:: 3.8
334 Now handles multimodal datasets by returning the first mode encountered.
335 Formerly, it raised :exc:`StatisticsError` when more than one mode was
336 found.
337
338
339.. function:: multimode(data)
340
341 Return a list of the most frequently occurring values in the order they
342 were first encountered in the *data*. Will return more than one result if
343 there are multiple modes or an empty list if the *data* is empty:
344
345 .. doctest::
346
347 >>> multimode('aabbbbccddddeeffffgg')
348 ['b', 'd', 'f']
349 >>> multimode('')
350 []
351
352 .. versionadded:: 3.8
353
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700354
Georg Brandleb2aeec2013-10-21 08:57:26 +0200355.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700356
Georg Brandleb2aeec2013-10-21 08:57:26 +0200357 Return the population standard deviation (the square root of the population
358 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700359
360 .. doctest::
361
362 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
363 0.986893273527251
364
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700365
Georg Brandleb2aeec2013-10-21 08:57:26 +0200366.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700367
Georg Brandleb2aeec2013-10-21 08:57:26 +0200368 Return the population variance of *data*, a non-empty iterable of real-valued
369 numbers. Variance, or second moment about the mean, is a measure of the
370 variability (spread or dispersion) of data. A large variance indicates that
371 the data is spread out; a small variance indicates it is clustered closely
372 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700373
Georg Brandleb2aeec2013-10-21 08:57:26 +0200374 If the optional second argument *mu* is given, it should be the mean of
375 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700376 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700377
Georg Brandleb2aeec2013-10-21 08:57:26 +0200378 Use this function to calculate the variance from the entire population. To
379 estimate the variance from a sample, the :func:`variance` function is usually
380 a better choice.
381
382 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700383
384 Examples:
385
386 .. doctest::
387
388 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
389 >>> pvariance(data)
390 1.25
391
Georg Brandleb2aeec2013-10-21 08:57:26 +0200392 If you have already calculated the mean of your data, you can pass it as the
393 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700394
395 .. doctest::
396
397 >>> mu = mean(data)
398 >>> pvariance(data, mu)
399 1.25
400
Georg Brandleb2aeec2013-10-21 08:57:26 +0200401 This function does not attempt to verify that you have passed the actual mean
402 as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
403 results.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700404
405 Decimals and Fractions are supported:
406
407 .. doctest::
408
409 >>> from decimal import Decimal as D
410 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
411 Decimal('24.815')
412
413 >>> from fractions import Fraction as F
414 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
415 Fraction(13, 72)
416
417 .. note::
418
Georg Brandleb2aeec2013-10-21 08:57:26 +0200419 When called with the entire population, this gives the population variance
420 σ². When called on a sample instead, this is the biased sample variance
421 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700422
Georg Brandleb2aeec2013-10-21 08:57:26 +0200423 If you somehow know the true population mean μ, you may use this function
424 to calculate the variance of a sample, giving the known population mean as
425 the second argument. Provided the data points are representative
426 (e.g. independent and identically distributed), the result will be an
427 unbiased estimate of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700428
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700429
Georg Brandleb2aeec2013-10-21 08:57:26 +0200430.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700431
Georg Brandleb2aeec2013-10-21 08:57:26 +0200432 Return the sample standard deviation (the square root of the sample
433 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700434
435 .. doctest::
436
437 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
438 1.0810874155219827
439
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700440
Georg Brandleb2aeec2013-10-21 08:57:26 +0200441.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700442
Georg Brandleb2aeec2013-10-21 08:57:26 +0200443 Return the sample variance of *data*, an iterable of at least two real-valued
444 numbers. Variance, or second moment about the mean, is a measure of the
445 variability (spread or dispersion) of data. A large variance indicates that
446 the data is spread out; a small variance indicates it is clustered closely
447 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700448
Georg Brandleb2aeec2013-10-21 08:57:26 +0200449 If the optional second argument *xbar* is given, it should be the mean of
450 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700451 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700452
Georg Brandleb2aeec2013-10-21 08:57:26 +0200453 Use this function when your data is a sample from a population. To calculate
454 the variance from the entire population, see :func:`pvariance`.
455
456 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700457
458 Examples:
459
460 .. doctest::
461
462 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
463 >>> variance(data)
464 1.3720238095238095
465
Georg Brandleb2aeec2013-10-21 08:57:26 +0200466 If you have already calculated the mean of your data, you can pass it as the
467 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700468
469 .. doctest::
470
471 >>> m = mean(data)
472 >>> variance(data, m)
473 1.3720238095238095
474
Georg Brandleb2aeec2013-10-21 08:57:26 +0200475 This function does not attempt to verify that you have passed the actual mean
476 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700477 impossible results.
478
479 Decimal and Fraction values are supported:
480
481 .. doctest::
482
483 >>> from decimal import Decimal as D
484 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
485 Decimal('31.01875')
486
487 >>> from fractions import Fraction as F
488 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
489 Fraction(67, 108)
490
491 .. note::
492
Georg Brandleb2aeec2013-10-21 08:57:26 +0200493 This is the sample variance s² with Bessel's correction, also known as
494 variance with N-1 degrees of freedom. Provided that the data points are
495 representative (e.g. independent and identically distributed), the result
496 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700497
Georg Brandleb2aeec2013-10-21 08:57:26 +0200498 If you somehow know the actual population mean μ you should pass it to the
499 :func:`pvariance` function as the *mu* parameter to get the variance of a
500 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700501
502Exceptions
503----------
504
505A single exception is defined:
506
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400507.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700508
Benjamin Peterson44c30652013-10-20 17:52:09 -0400509 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700510
Raymond Hettinger11c79532019-02-23 14:44:07 -0800511
512:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700513---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800514
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800515:class:`NormalDist` is a tool for creating and manipulating normal
516distributions of a `random variable
517<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
518composite class that treats the mean and standard deviation of data
519measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800520
521Normal distributions arise from the `Central Limit Theorem
522<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800523of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800524
525.. class:: NormalDist(mu=0.0, sigma=1.0)
526
527 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800528 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800529 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800530 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800531
532 If *sigma* is negative, raises :exc:`StatisticsError`.
533
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800534 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800535
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800536 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800537 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
538 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800539
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800540 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800541
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800542 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800543 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
544 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800545
546 .. attribute:: variance
547
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800548 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800549 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
550 distribution. Equal to the square of the standard deviation.
551
552 .. classmethod:: NormalDist.from_samples(data)
553
Raymond Hettingercc353a02019-03-10 23:43:33 -0700554 Makes a normal distribution instance computed from sample data. The
555 *data* can be any :term:`iterable` and should consist of values that
556 can be converted to type :class:`float`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800557
558 If *data* does not contain at least two elements, raises
559 :exc:`StatisticsError` because it takes at least one point to estimate
560 a central value and at least two points to estimate dispersion.
561
562 .. method:: NormalDist.samples(n, seed=None)
563
564 Generates *n* random samples for a given mean and standard deviation.
565 Returns a :class:`list` of :class:`float` values.
566
567 If *seed* is given, creates a new instance of the underlying random
568 number generator. This is useful for creating reproducible results,
569 even in a multi-threading context.
570
571 .. method:: NormalDist.pdf(x)
572
573 Using a `probability density function (pdf)
574 <https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800575 compute the relative likelihood that a random variable *X* will be near
Raymond Hettinger11c79532019-02-23 14:44:07 -0800576 the given value *x*. Mathematically, it is the ratio ``P(x <= X <
577 x+dx) / dx``.
578
Raymond Hettingercc353a02019-03-10 23:43:33 -0700579 The relative likelihood is computed as the probability of a sample
580 occurring in a narrow range divided by the width of the range (hence
581 the word "density"). Since the likelihood is relative to other points,
582 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800583
584 .. method:: NormalDist.cdf(x)
585
586 Using a `cumulative distribution function (cdf)
587 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800588 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800589 equal to *x*. Mathematically, it is written ``P(X <= x)``.
590
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700591 .. method:: NormalDist.inv_cdf(p)
592
593 Compute the inverse cumulative distribution function, also known as the
594 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
595 or the `percent-point
596 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
597 function. Mathematically, it is written ``x : P(X <= x) = p``.
598
599 Finds the value *x* of the random variable *X* such that the
600 probability of the variable being less than or equal to that value
601 equals the given probability *p*.
602
Raymond Hettinger318d5372019-03-06 22:59:40 -0800603 .. method:: NormalDist.overlap(other)
604
605 Compute the `overlapping coefficient (OVL)
606 <http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf>`_
Raymond Hettinger14bab7a2019-03-07 08:54:31 -0800607 between two normal distributions, giving a measure of agreement.
608 Returns a value between 0.0 and 1.0 giving `the overlapping area for
609 two probability density functions
610 <https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800611
Raymond Hettinger11c79532019-02-23 14:44:07 -0800612 Instances of :class:`NormalDist` support addition, subtraction,
613 multiplication and division by a constant. These operations
614 are used for translation and scaling. For example:
615
616 .. doctest::
617
618 >>> temperature_february = NormalDist(5, 2.5) # Celsius
619 >>> temperature_february * (9/5) + 32 # Fahrenheit
620 NormalDist(mu=41.0, sigma=4.5)
621
Raymond Hettingercc353a02019-03-10 23:43:33 -0700622 Dividing a constant by an instance of :class:`NormalDist` is not supported
623 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800624
625 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800626 variables, it is possible to `add and subtract two independent normally
627 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800628 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
629 represented as instances of :class:`NormalDist`. For example:
630
631 .. doctest::
632
633 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
634 >>> drug_effects = NormalDist(0.4, 0.15)
635 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700636 >>> round(combined.mean, 1)
637 3.1
638 >>> round(combined.stdev, 1)
639 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800640
641 .. versionadded:: 3.8
642
643
644:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700645^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800646
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800647:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800648
649For example, given `historical data for SAT exams
650<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800651are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettingercc353a02019-03-10 23:43:33 -0700652determine the percentage of students with scores between 1100 and 1200, after
653rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800654
655.. doctest::
656
657 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800658 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700659 >>> round(fraction * 100.0, 1)
660 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800661
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700662Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
663<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
664
665.. doctest::
666
667 >>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
668 [928, 1060, 1192]
669 >>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
670 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
671
Raymond Hettinger318d5372019-03-06 22:59:40 -0800672What percentage of men and women will have the same height in `two normally
673distributed populations with known means and standard deviations
674<http://www.usablestats.com/lessons/normal>`_?
675
676 >>> men = NormalDist(70, 4)
677 >>> women = NormalDist(65, 3.5)
678 >>> ovl = men.overlap(women)
679 >>> round(ovl * 100.0, 1)
680 50.3
681
Raymond Hettinger11c79532019-02-23 14:44:07 -0800682To estimate the distribution for a model than isn't easy to solve
683analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700684Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800685
686.. doctest::
687
Raymond Hettingercc353a02019-03-10 23:43:33 -0700688 >>> def model(x, y, z):
689 ... return (3*x + 7*x*y - 5*y) / (11 * z)
690 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800691 >>> n = 100_000
Raymond Hettingercc353a02019-03-10 23:43:33 -0700692 >>> X = NormalDist(10, 2.5).samples(n)
693 >>> Y = NormalDist(15, 1.75).samples(n)
694 >>> Z = NormalDist(5, 1.25).samples(n)
695 >>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
696 NormalDist(mu=19.640137307085507, sigma=47.03273142191088)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800697
698Normal distributions commonly arise in machine learning problems.
699
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800700Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800701<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
702The challenge is to predict a person's gender from measurements of normally
703distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800704
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800705We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800706measurements are assumed to be normally distributed, so we summarize the data
707with :class:`NormalDist`:
708
709.. doctest::
710
711 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
712 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
713 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
714 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
715 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
716 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
717
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800718Next, we encounter a new person whose feature measurements are known but whose
719gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800720
721.. doctest::
722
723 >>> ht = 6.0 # height
724 >>> wt = 130 # weight
725 >>> fs = 8 # foot size
726
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800727Starting with a 50% `prior probability
728<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
729we compute the posterior as the prior times the product of likelihoods for the
730feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800731
732.. doctest::
733
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800734 >>> prior_male = 0.5
735 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800736 >>> posterior_male = (prior_male * height_male.pdf(ht) *
737 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
738
739 >>> posterior_female = (prior_female * height_female.pdf(ht) *
740 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
741
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800742The final prediction goes to the largest posterior. This is known as the
743`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800744<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
745
746.. doctest::
747
748 >>> 'male' if posterior_male > posterior_female else 'female'
749 'female'
750
751
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700752..
753 # This modelines must appear within the last ten lines of the file.
754 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;