blob: 62d856bf248c5ad512b5a09792280f417779b311 [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
5 :synopsis: mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
22numeric (:class:`Real`-valued) data.
23
Nick Coghlan73afe2a2014-02-08 19:58:04 +100024.. note::
25
26 Unless explicitly noted otherwise, these functions support :class:`int`,
27 :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
28 Behaviour with other types (whether in the numeric tower or not) is
Raymond Hettingere4810b22019-09-05 00:18:47 -070029 currently unsupported. Collections with a mix of types are also undefined
30 and implementation-dependent. If your input data consists of mixed types,
31 you may be able to use :func:`map` to ensure a consistent result, for
32 example: ``map(float, input_data)``.
Nick Coghlan73afe2a2014-02-08 19:58:04 +100033
Larry Hastingsf5e987b2013-10-19 11:50:09 -070034Averages and measures of central location
35-----------------------------------------
36
37These functions calculate an average or typical value from a population
38or sample.
39
Raymond Hettingerfc06a192019-03-12 00:43:27 -070040======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070041:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080042:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger6463ba32019-04-07 09:20:03 -070043:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano22873182016-08-24 02:34:25 +100044:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070045:func:`median` Median (middle value) of data.
46:func:`median_low` Low median of data.
47:func:`median_high` High median of data.
48:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070049:func:`mode` Single mode (most common value) of discrete or nominal data.
50:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -070051:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070052======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070053
Georg Brandleb2aeec2013-10-21 08:57:26 +020054Measures of spread
55------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070056
Georg Brandleb2aeec2013-10-21 08:57:26 +020057These functions calculate a measure of how much the population or sample
58tends to deviate from the typical or average values.
59
60======================= =============================================
61:func:`pstdev` Population standard deviation of data.
62:func:`pvariance` Population variance of data.
63:func:`stdev` Sample standard deviation of data.
64:func:`variance` Sample variance of data.
65======================= =============================================
66
67
68Function details
69----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070070
Georg Brandle051b552013-11-04 07:30:50 +010071Note: The functions do not require the data given to them to be sorted.
72However, for reading convenience, most of the examples show sorted sequences.
73
Larry Hastingsf5e987b2013-10-19 11:50:09 -070074.. function:: mean(data)
75
Raymond Hettinger6da90782016-11-21 16:31:02 -080076 Return the sample arithmetic mean of *data* which can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070077
Georg Brandleb2aeec2013-10-21 08:57:26 +020078 The arithmetic mean is the sum of the data divided by the number of data
79 points. It is commonly called "the average", although it is only one of many
80 different mathematical averages. It is a measure of the central location of
81 the data.
82
83 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070084
85 Some examples of use:
86
87 .. doctest::
88
89 >>> mean([1, 2, 3, 4, 4])
90 2.8
91 >>> mean([-1.0, 2.5, 3.25, 5.75])
92 2.625
93
94 >>> from fractions import Fraction as F
95 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
96 Fraction(13, 21)
97
98 >>> from decimal import Decimal as D
99 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
100 Decimal('0.5625')
101
102 .. note::
103
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200104 The mean is strongly affected by outliers and is not a robust estimator
Raymond Hettingere4810b22019-09-05 00:18:47 -0700105 for central location: the mean is not necessarily a typical example of
106 the data points. For more robust measures of central location, see
107 :func:`median` and :func:`mode`.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700108
Georg Brandleb2aeec2013-10-21 08:57:26 +0200109 The sample mean gives an unbiased estimate of the true population mean,
110 which means that, taken on average over all the possible samples,
111 ``mean(sample)`` converges on the true mean of the entire population. If
112 *data* represents the entire population rather than a sample, then
113 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700114
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700115
Raymond Hettinger47d99872019-02-21 15:06:29 -0800116.. function:: fmean(data)
117
118 Convert *data* to floats and compute the arithmetic mean.
119
120 This runs faster than the :func:`mean` function and it always returns a
Raymond Hettingere4810b22019-09-05 00:18:47 -0700121 :class:`float`. The *data* may be a sequence or iterator. If the input
122 dataset is empty, raises a :exc:`StatisticsError`.
Raymond Hettinger47d99872019-02-21 15:06:29 -0800123
124 .. doctest::
125
126 >>> fmean([3.5, 4.0, 5.25])
127 4.25
128
129 .. versionadded:: 3.8
130
131
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700132.. function:: geometric_mean(data)
133
134 Convert *data* to floats and compute the geometric mean.
135
Raymond Hettingere4810b22019-09-05 00:18:47 -0700136 The geometric mean indicates the central tendency or typical value of the
137 *data* using the product of the values (as opposed to the arithmetic mean
138 which uses their sum).
139
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700140 Raises a :exc:`StatisticsError` if the input dataset is empty,
141 if it contains a zero, or if it contains a negative value.
Raymond Hettingere4810b22019-09-05 00:18:47 -0700142 The *data* may be a sequence or iterator.
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700143
144 No special efforts are made to achieve exact results.
145 (However, this may change in the future.)
146
147 .. doctest::
148
Raymond Hettingere4810b22019-09-05 00:18:47 -0700149 >>> round(geometric_mean([54, 24, 36]), 1)
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700150 36.0
151
152 .. versionadded:: 3.8
153
154
Steven D'Aprano22873182016-08-24 02:34:25 +1000155.. function:: harmonic_mean(data)
156
157 Return the harmonic mean of *data*, a sequence or iterator of
158 real-valued numbers.
159
160 The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Warec019bd32016-08-23 13:23:31 -0500161 reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano22873182016-08-24 02:34:25 +1000162 data. For example, the harmonic mean of three values *a*, *b* and *c*
163 will be equivalent to ``3/(1/a + 1/b + 1/c)``.
164
165 The harmonic mean is a type of average, a measure of the central
166 location of the data. It is often appropriate when averaging quantities
167 which are rates or ratios, for example speeds. For example:
168
169 Suppose an investor purchases an equal value of shares in each of
170 three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
171 What is the average P/E ratio for the investor's portfolio?
172
173 .. doctest::
174
175 >>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
176 3.6
177
178 Using the arithmetic mean would give an average of about 5.167, which
Raymond Hettingere4810b22019-09-05 00:18:47 -0700179 is well over the aggregate P/E ratio.
Steven D'Aprano22873182016-08-24 02:34:25 +1000180
Zachary Warec019bd32016-08-23 13:23:31 -0500181 :exc:`StatisticsError` is raised if *data* is empty, or any element
Steven D'Aprano22873182016-08-24 02:34:25 +1000182 is less than zero.
183
Zachary Warec019bd32016-08-23 13:23:31 -0500184 .. versionadded:: 3.6
185
Steven D'Aprano22873182016-08-24 02:34:25 +1000186
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700187.. function:: median(data)
188
Georg Brandleb2aeec2013-10-21 08:57:26 +0200189 Return the median (middle value) of numeric data, using the common "mean of
190 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger6da90782016-11-21 16:31:02 -0800191 *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700192
Georg Brandleb2aeec2013-10-21 08:57:26 +0200193 The median is a robust measure of central location, and is less affected by
194 the presence of outliers in your data. When the number of data points is
195 odd, the middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700196
197 .. doctest::
198
199 >>> median([1, 3, 5])
200 3
201
Georg Brandleb2aeec2013-10-21 08:57:26 +0200202 When the number of data points is even, the median is interpolated by taking
203 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700204
205 .. doctest::
206
207 >>> median([1, 3, 5, 7])
208 4.0
209
Georg Brandleb2aeec2013-10-21 08:57:26 +0200210 This is suited for when your data is discrete, and you don't mind that the
211 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700212
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300213 If your data is ordinal (supports order operations) but not numeric (doesn't
214 support addition), you should use :func:`median_low` or :func:`median_high`
215 instead.
216
Berker Peksag9c1dba22014-09-28 00:00:58 +0300217 .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700218
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700219
220.. function:: median_low(data)
221
Georg Brandleb2aeec2013-10-21 08:57:26 +0200222 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger6da90782016-11-21 16:31:02 -0800223 :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700224
Georg Brandleb2aeec2013-10-21 08:57:26 +0200225 The low median is always a member of the data set. When the number of data
226 points is odd, the middle value is returned. When it is even, the smaller of
227 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700228
229 .. doctest::
230
231 >>> median_low([1, 3, 5])
232 3
233 >>> median_low([1, 3, 5, 7])
234 3
235
Georg Brandleb2aeec2013-10-21 08:57:26 +0200236 Use the low median when your data are discrete and you prefer the median to
237 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700238
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700239
240.. function:: median_high(data)
241
Georg Brandleb2aeec2013-10-21 08:57:26 +0200242 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800243 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700244
Georg Brandleb2aeec2013-10-21 08:57:26 +0200245 The high median is always a member of the data set. When the number of data
246 points is odd, the middle value is returned. When it is even, the larger of
247 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700248
249 .. doctest::
250
251 >>> median_high([1, 3, 5])
252 3
253 >>> median_high([1, 3, 5, 7])
254 5
255
Georg Brandleb2aeec2013-10-21 08:57:26 +0200256 Use the high median when your data are discrete and you prefer the median to
257 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700258
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700259
Georg Brandleb2aeec2013-10-21 08:57:26 +0200260.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700261
Georg Brandleb2aeec2013-10-21 08:57:26 +0200262 Return the median of grouped continuous data, calculated as the 50th
263 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800264 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700265
266 .. doctest::
267
268 >>> median_grouped([52, 52, 53, 54])
269 52.5
270
Georg Brandleb2aeec2013-10-21 08:57:26 +0200271 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200272 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
273 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
274 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200275 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700276
277 .. doctest::
278
279 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
280 3.7
281
Georg Brandleb2aeec2013-10-21 08:57:26 +0200282 Optional argument *interval* represents the class interval, and defaults
283 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700284
285 .. doctest::
286
287 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
288 3.25
289 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
290 3.5
291
292 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200293 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700294
295 .. impl-detail::
296
Georg Brandleb2aeec2013-10-21 08:57:26 +0200297 Under some circumstances, :func:`median_grouped` may coerce data points to
298 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700299
300 .. seealso::
301
Georg Brandleb2aeec2013-10-21 08:57:26 +0200302 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
303 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700304
Georg Brandleb2aeec2013-10-21 08:57:26 +0200305 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100306 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200307 function in the Gnome Gnumeric spreadsheet, including `this discussion
308 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700309
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700310
311.. function:: mode(data)
312
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700313 Return the single most common data point from discrete or nominal *data*.
314 The mode (when it exists) is the most typical value and serves as a
315 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200316
Raymond Hettingere4810b22019-09-05 00:18:47 -0700317 If there are multiple modes with the same frequency, returns the first one
318 encountered in the *data*. If the smallest or largest of those is
319 desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
320 If the input *data* is empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700321
322 ``mode`` assumes discrete data, and returns a single value. This is the
323 standard treatment of the mode as commonly taught in schools:
324
325 .. doctest::
326
327 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
328 3
329
Raymond Hettingere4810b22019-09-05 00:18:47 -0700330 The mode is unique in that it is the only statistic in this package that
331 also applies to nominal (non-numeric) data:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700332
333 .. doctest::
334
335 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
336 'red'
337
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700338 .. versionchanged:: 3.8
339 Now handles multimodal datasets by returning the first mode encountered.
340 Formerly, it raised :exc:`StatisticsError` when more than one mode was
341 found.
342
343
344.. function:: multimode(data)
345
346 Return a list of the most frequently occurring values in the order they
347 were first encountered in the *data*. Will return more than one result if
348 there are multiple modes or an empty list if the *data* is empty:
349
350 .. doctest::
351
352 >>> multimode('aabbbbccddddeeffffgg')
353 ['b', 'd', 'f']
354 >>> multimode('')
355 []
356
357 .. versionadded:: 3.8
358
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700359
Georg Brandleb2aeec2013-10-21 08:57:26 +0200360.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700361
Georg Brandleb2aeec2013-10-21 08:57:26 +0200362 Return the population standard deviation (the square root of the population
363 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700364
365 .. doctest::
366
367 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
368 0.986893273527251
369
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700370
Georg Brandleb2aeec2013-10-21 08:57:26 +0200371.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700372
Raymond Hettingere4810b22019-09-05 00:18:47 -0700373 Return the population variance of *data*, a non-empty sequence or iterator
374 of real-valued numbers. Variance, or second moment about the mean, is a
375 measure of the variability (spread or dispersion) of data. A large
376 variance indicates that the data is spread out; a small variance indicates
377 it is clustered closely around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700378
Raymond Hettingere4810b22019-09-05 00:18:47 -0700379 If the optional second argument *mu* is given, it is typically the mean of
380 the *data*. It can also be used to compute the second moment around a
381 point that is not the mean. If it is missing or ``None`` (the default),
382 the arithmetic mean is automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700383
Georg Brandleb2aeec2013-10-21 08:57:26 +0200384 Use this function to calculate the variance from the entire population. To
385 estimate the variance from a sample, the :func:`variance` function is usually
386 a better choice.
387
388 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700389
390 Examples:
391
392 .. doctest::
393
394 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
395 >>> pvariance(data)
396 1.25
397
Georg Brandleb2aeec2013-10-21 08:57:26 +0200398 If you have already calculated the mean of your data, you can pass it as the
399 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700400
401 .. doctest::
402
403 >>> mu = mean(data)
404 >>> pvariance(data, mu)
405 1.25
406
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700407 Decimals and Fractions are supported:
408
409 .. doctest::
410
411 >>> from decimal import Decimal as D
412 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
413 Decimal('24.815')
414
415 >>> from fractions import Fraction as F
416 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
417 Fraction(13, 72)
418
419 .. note::
420
Georg Brandleb2aeec2013-10-21 08:57:26 +0200421 When called with the entire population, this gives the population variance
422 σ². When called on a sample instead, this is the biased sample variance
423 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700424
Raymond Hettingere4810b22019-09-05 00:18:47 -0700425 If you somehow know the true population mean μ, you may use this
426 function to calculate the variance of a sample, giving the known
427 population mean as the second argument. Provided the data points are a
428 random sample of the population, the result will be an unbiased estimate
429 of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700430
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700431
Georg Brandleb2aeec2013-10-21 08:57:26 +0200432.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700433
Georg Brandleb2aeec2013-10-21 08:57:26 +0200434 Return the sample standard deviation (the square root of the sample
435 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700436
437 .. doctest::
438
439 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
440 1.0810874155219827
441
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700442
Georg Brandleb2aeec2013-10-21 08:57:26 +0200443.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700444
Georg Brandleb2aeec2013-10-21 08:57:26 +0200445 Return the sample variance of *data*, an iterable of at least two real-valued
446 numbers. Variance, or second moment about the mean, is a measure of the
447 variability (spread or dispersion) of data. A large variance indicates that
448 the data is spread out; a small variance indicates it is clustered closely
449 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700450
Georg Brandleb2aeec2013-10-21 08:57:26 +0200451 If the optional second argument *xbar* is given, it should be the mean of
452 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700453 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700454
Georg Brandleb2aeec2013-10-21 08:57:26 +0200455 Use this function when your data is a sample from a population. To calculate
456 the variance from the entire population, see :func:`pvariance`.
457
458 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700459
460 Examples:
461
462 .. doctest::
463
464 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
465 >>> variance(data)
466 1.3720238095238095
467
Georg Brandleb2aeec2013-10-21 08:57:26 +0200468 If you have already calculated the mean of your data, you can pass it as the
469 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700470
471 .. doctest::
472
473 >>> m = mean(data)
474 >>> variance(data, m)
475 1.3720238095238095
476
Georg Brandleb2aeec2013-10-21 08:57:26 +0200477 This function does not attempt to verify that you have passed the actual mean
478 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700479 impossible results.
480
481 Decimal and Fraction values are supported:
482
483 .. doctest::
484
485 >>> from decimal import Decimal as D
486 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
487 Decimal('31.01875')
488
489 >>> from fractions import Fraction as F
490 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
491 Fraction(67, 108)
492
493 .. note::
494
Georg Brandleb2aeec2013-10-21 08:57:26 +0200495 This is the sample variance s² with Bessel's correction, also known as
496 variance with N-1 degrees of freedom. Provided that the data points are
497 representative (e.g. independent and identically distributed), the result
498 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700499
Georg Brandleb2aeec2013-10-21 08:57:26 +0200500 If you somehow know the actual population mean μ you should pass it to the
501 :func:`pvariance` function as the *mu* parameter to get the variance of a
502 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700503
Raymond Hettingere4810b22019-09-05 00:18:47 -0700504.. function:: quantiles(data, *, n=4, method='exclusive')
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700505
Raymond Hettingere4810b22019-09-05 00:18:47 -0700506 Divide *data* into *n* continuous intervals with equal probability.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700507 Returns a list of ``n - 1`` cut points separating the intervals.
508
509 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
510 *n* to 100 for percentiles which gives the 99 cuts points that separate
Raymond Hettingere4810b22019-09-05 00:18:47 -0700511 *data* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700512 is not least 1.
513
Raymond Hettingere4810b22019-09-05 00:18:47 -0700514 The *data* can be any iterable containing sample data or it can be an
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700515 instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
Raymond Hettingere4810b22019-09-05 00:18:47 -0700516 results, the number of data points in *data* should be larger than *n*.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700517 Raises :exc:`StatisticsError` if there are not at least two data points.
518
519 For sample data, the cut points are linearly interpolated from the
520 two nearest data points. For example, if a cut point falls one-third
521 of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700522 cut-point will evaluate to ``104``.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700523
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700524 The *method* for computing quantiles can be varied depending on
Raymond Hettingere4810b22019-09-05 00:18:47 -0700525 whether the data in *data* includes or excludes the lowest and
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700526 highest possible values from the population.
527
528 The default *method* is "exclusive" and is used for data sampled from
529 a population that can have more extreme values than found in the
530 samples. The portion of the population falling below the *i-th* of
Raymond Hettingerb530a442019-07-21 16:32:00 -0700531 *m* sorted data points is computed as ``i / (m + 1)``. Given nine
532 sample values, the method sorts them and assigns the following
533 percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700534
535 Setting the *method* to "inclusive" is used for describing population
Raymond Hettingerb530a442019-07-21 16:32:00 -0700536 data or for samples that are known to include the most extreme values
Raymond Hettingere4810b22019-09-05 00:18:47 -0700537 from the population. The minimum value in *data* is treated as the 0th
Raymond Hettingerb530a442019-07-21 16:32:00 -0700538 percentile and the maximum value is treated as the 100th percentile.
539 The portion of the population falling below the *i-th* of *m* sorted
540 data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
541 values, the method sorts them and assigns the following percentiles:
542 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700543
Raymond Hettingere4810b22019-09-05 00:18:47 -0700544 If *data* is an instance of a class that defines an
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700545 :meth:`~inv_cdf` method, setting *method* has no effect.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700546
547 .. doctest::
548
549 # Decile cut points for empirically sampled data
550 >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
551 ... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
552 ... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
553 ... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
554 ... 103, 107, 101, 81, 109, 104]
555 >>> [round(q, 1) for q in quantiles(data, n=10)]
556 [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
557
Min ho Kimc4cacc82019-07-31 08:16:13 +1000558 >>> # Quartile cut points for the standard normal distribution
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700559 >>> Z = NormalDist()
560 >>> [round(q, 4) for q in quantiles(Z, n=4)]
561 [-0.6745, 0.0, 0.6745]
562
563 .. versionadded:: 3.8
564
565
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700566Exceptions
567----------
568
569A single exception is defined:
570
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400571.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700572
Benjamin Peterson44c30652013-10-20 17:52:09 -0400573 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700574
Raymond Hettinger11c79532019-02-23 14:44:07 -0800575
576:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700577---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800578
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800579:class:`NormalDist` is a tool for creating and manipulating normal
580distributions of a `random variable
581<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
Raymond Hettingere4810b22019-09-05 00:18:47 -0700582class that treats the mean and standard deviation of data
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800583measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800584
585Normal distributions arise from the `Central Limit Theorem
586<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800587of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800588
589.. class:: NormalDist(mu=0.0, sigma=1.0)
590
591 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800592 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800593 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800594 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800595
596 If *sigma* is negative, raises :exc:`StatisticsError`.
597
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800598 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800599
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800600 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800601 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
602 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800603
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800604 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800605
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800606 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800607 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
608 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800609
610 .. attribute:: variance
611
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800612 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800613 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
614 distribution. Equal to the square of the standard deviation.
615
616 .. classmethod:: NormalDist.from_samples(data)
617
Raymond Hettingere4810b22019-09-05 00:18:47 -0700618 Makes a normal distribution instance with *mu* and *sigma* parameters
619 estimated from the *data* using :func:`fmean` and :func:`stdev`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800620
Raymond Hettingere4810b22019-09-05 00:18:47 -0700621 The *data* can be any :term:`iterable` and should consist of values
622 that can be converted to type :class:`float`. If *data* does not
623 contain at least two elements, raises :exc:`StatisticsError` because it
624 takes at least one point to estimate a central value and at least two
625 points to estimate dispersion.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800626
Raymond Hettingerfb8c7d52019-04-23 01:46:18 -0700627 .. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800628
629 Generates *n* random samples for a given mean and standard deviation.
630 Returns a :class:`list` of :class:`float` values.
631
632 If *seed* is given, creates a new instance of the underlying random
633 number generator. This is useful for creating reproducible results,
634 even in a multi-threading context.
635
636 .. method:: NormalDist.pdf(x)
637
638 Using a `probability density function (pdf)
Raymond Hettingere4810b22019-09-05 00:18:47 -0700639 <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
640 the relative likelihood that a random variable *X* will be near the
641 given value *x*. Mathematically, it is the limit of the ratio ``P(x <=
642 X < x+dx) / dx`` as *dx* approaches zero.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800643
Raymond Hettingercc353a02019-03-10 23:43:33 -0700644 The relative likelihood is computed as the probability of a sample
645 occurring in a narrow range divided by the width of the range (hence
646 the word "density"). Since the likelihood is relative to other points,
647 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800648
649 .. method:: NormalDist.cdf(x)
650
651 Using a `cumulative distribution function (cdf)
652 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800653 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800654 equal to *x*. Mathematically, it is written ``P(X <= x)``.
655
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700656 .. method:: NormalDist.inv_cdf(p)
657
658 Compute the inverse cumulative distribution function, also known as the
659 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
660 or the `percent-point
661 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
662 function. Mathematically, it is written ``x : P(X <= x) = p``.
663
664 Finds the value *x* of the random variable *X* such that the
665 probability of the variable being less than or equal to that value
666 equals the given probability *p*.
667
Raymond Hettinger318d5372019-03-06 22:59:40 -0800668 .. method:: NormalDist.overlap(other)
669
Raymond Hettingere4810b22019-09-05 00:18:47 -0700670 Measures the agreement between two normal probability distributions.
671 Returns a value between 0.0 and 1.0 giving `the overlapping area for
672 the two probability density functions
673 <https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800674
Raymond Hettinger11c79532019-02-23 14:44:07 -0800675 Instances of :class:`NormalDist` support addition, subtraction,
676 multiplication and division by a constant. These operations
677 are used for translation and scaling. For example:
678
679 .. doctest::
680
681 >>> temperature_february = NormalDist(5, 2.5) # Celsius
682 >>> temperature_february * (9/5) + 32 # Fahrenheit
683 NormalDist(mu=41.0, sigma=4.5)
684
Raymond Hettingercc353a02019-03-10 23:43:33 -0700685 Dividing a constant by an instance of :class:`NormalDist` is not supported
686 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800687
688 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800689 variables, it is possible to `add and subtract two independent normally
690 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800691 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
692 represented as instances of :class:`NormalDist`. For example:
693
694 .. doctest::
695
696 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
697 >>> drug_effects = NormalDist(0.4, 0.15)
698 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700699 >>> round(combined.mean, 1)
700 3.1
701 >>> round(combined.stdev, 1)
702 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800703
704 .. versionadded:: 3.8
705
706
707:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700708^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800709
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800710:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800711
712For example, given `historical data for SAT exams
713<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800714are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700715determine the percentage of students with test scores between 1100 and
7161200, after rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800717
718.. doctest::
719
720 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800721 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700722 >>> round(fraction * 100.0, 1)
723 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800724
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700725Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
726<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
727
728.. doctest::
729
730 >>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
731 [928, 1060, 1192]
732 >>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
733 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
734
Raymond Hettinger11c79532019-02-23 14:44:07 -0800735To estimate the distribution for a model than isn't easy to solve
736analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700737Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800738
739.. doctest::
740
Raymond Hettingercc353a02019-03-10 23:43:33 -0700741 >>> def model(x, y, z):
742 ... return (3*x + 7*x*y - 5*y) / (11 * z)
743 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800744 >>> n = 100_000
Raymond Hettingere4810b22019-09-05 00:18:47 -0700745 >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
746 >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
747 >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
748 >>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
749 [1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Raymond Hettinger11c79532019-02-23 14:44:07 -0800750
751Normal distributions commonly arise in machine learning problems.
752
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800753Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800754<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
755The challenge is to predict a person's gender from measurements of normally
756distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800757
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800758We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800759measurements are assumed to be normally distributed, so we summarize the data
760with :class:`NormalDist`:
761
762.. doctest::
763
764 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
765 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
766 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
767 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
768 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
769 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
770
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800771Next, we encounter a new person whose feature measurements are known but whose
772gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800773
774.. doctest::
775
776 >>> ht = 6.0 # height
777 >>> wt = 130 # weight
778 >>> fs = 8 # foot size
779
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800780Starting with a 50% `prior probability
781<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
782we compute the posterior as the prior times the product of likelihoods for the
783feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800784
785.. doctest::
786
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800787 >>> prior_male = 0.5
788 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800789 >>> posterior_male = (prior_male * height_male.pdf(ht) *
790 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
791
792 >>> posterior_female = (prior_female * height_female.pdf(ht) *
793 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
794
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800795The final prediction goes to the largest posterior. This is known as the
796`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800797<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
798
799.. doctest::
800
801 >>> 'male' if posterior_male > posterior_female else 'female'
802 'female'
803
804
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700805..
806 # This modelines must appear within the last ten lines of the file.
807 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;