blob: a702b2463c39b1678801723dbe2698d3d3105361 [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
5 :synopsis: mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070022numeric (:class:`~numbers.Real`-valued) data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070023
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070024The module is not intended to be a competitor to third-party libraries such
25as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
26proprietary full-featured statistics packages aimed at professional
27statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
28graphing and scientific calculators.
Nick Coghlan73afe2a2014-02-08 19:58:04 +100029
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070030Unless explicitly noted, these functions support :class:`int`,
31:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
32Behaviour with other types (whether in the numeric tower or not) is
33currently unsupported. Collections with a mix of types are also undefined
34and implementation-dependent. If your input data consists of mixed types,
35you may be able to use :func:`map` to ensure a consistent result, for
36example: ``map(float, input_data)``.
Nick Coghlan73afe2a2014-02-08 19:58:04 +100037
Larry Hastingsf5e987b2013-10-19 11:50:09 -070038Averages and measures of central location
39-----------------------------------------
40
41These functions calculate an average or typical value from a population
42or sample.
43
Raymond Hettingerfc06a192019-03-12 00:43:27 -070044======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070045:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080046:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger6463ba32019-04-07 09:20:03 -070047:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano22873182016-08-24 02:34:25 +100048:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070049:func:`median` Median (middle value) of data.
50:func:`median_low` Low median of data.
51:func:`median_high` High median of data.
52:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070053:func:`mode` Single mode (most common value) of discrete or nominal data.
54:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -070055:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070056======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070057
Georg Brandleb2aeec2013-10-21 08:57:26 +020058Measures of spread
59------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070060
Georg Brandleb2aeec2013-10-21 08:57:26 +020061These functions calculate a measure of how much the population or sample
62tends to deviate from the typical or average values.
63
64======================= =============================================
65:func:`pstdev` Population standard deviation of data.
66:func:`pvariance` Population variance of data.
67:func:`stdev` Sample standard deviation of data.
68:func:`variance` Sample variance of data.
69======================= =============================================
70
71
72Function details
73----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070074
Georg Brandle051b552013-11-04 07:30:50 +010075Note: The functions do not require the data given to them to be sorted.
76However, for reading convenience, most of the examples show sorted sequences.
77
Larry Hastingsf5e987b2013-10-19 11:50:09 -070078.. function:: mean(data)
79
Raymond Hettinger6da90782016-11-21 16:31:02 -080080 Return the sample arithmetic mean of *data* which can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070081
Georg Brandleb2aeec2013-10-21 08:57:26 +020082 The arithmetic mean is the sum of the data divided by the number of data
83 points. It is commonly called "the average", although it is only one of many
84 different mathematical averages. It is a measure of the central location of
85 the data.
86
87 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070088
89 Some examples of use:
90
91 .. doctest::
92
93 >>> mean([1, 2, 3, 4, 4])
94 2.8
95 >>> mean([-1.0, 2.5, 3.25, 5.75])
96 2.625
97
98 >>> from fractions import Fraction as F
99 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
100 Fraction(13, 21)
101
102 >>> from decimal import Decimal as D
103 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
104 Decimal('0.5625')
105
106 .. note::
107
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200108 The mean is strongly affected by outliers and is not a robust estimator
Raymond Hettingere4810b22019-09-05 00:18:47 -0700109 for central location: the mean is not necessarily a typical example of
110 the data points. For more robust measures of central location, see
111 :func:`median` and :func:`mode`.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700112
Georg Brandleb2aeec2013-10-21 08:57:26 +0200113 The sample mean gives an unbiased estimate of the true population mean,
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700114 so that when taken on average over all the possible samples,
Georg Brandleb2aeec2013-10-21 08:57:26 +0200115 ``mean(sample)`` converges on the true mean of the entire population. If
116 *data* represents the entire population rather than a sample, then
117 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700118
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700119
Raymond Hettinger47d99872019-02-21 15:06:29 -0800120.. function:: fmean(data)
121
122 Convert *data* to floats and compute the arithmetic mean.
123
124 This runs faster than the :func:`mean` function and it always returns a
Raymond Hettingere4810b22019-09-05 00:18:47 -0700125 :class:`float`. The *data* may be a sequence or iterator. If the input
126 dataset is empty, raises a :exc:`StatisticsError`.
Raymond Hettinger47d99872019-02-21 15:06:29 -0800127
128 .. doctest::
129
130 >>> fmean([3.5, 4.0, 5.25])
131 4.25
132
133 .. versionadded:: 3.8
134
135
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700136.. function:: geometric_mean(data)
137
138 Convert *data* to floats and compute the geometric mean.
139
Raymond Hettingere4810b22019-09-05 00:18:47 -0700140 The geometric mean indicates the central tendency or typical value of the
141 *data* using the product of the values (as opposed to the arithmetic mean
142 which uses their sum).
143
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700144 Raises a :exc:`StatisticsError` if the input dataset is empty,
145 if it contains a zero, or if it contains a negative value.
Raymond Hettingere4810b22019-09-05 00:18:47 -0700146 The *data* may be a sequence or iterator.
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700147
148 No special efforts are made to achieve exact results.
149 (However, this may change in the future.)
150
151 .. doctest::
152
Raymond Hettingere4810b22019-09-05 00:18:47 -0700153 >>> round(geometric_mean([54, 24, 36]), 1)
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700154 36.0
155
156 .. versionadded:: 3.8
157
158
Steven D'Aprano22873182016-08-24 02:34:25 +1000159.. function:: harmonic_mean(data)
160
161 Return the harmonic mean of *data*, a sequence or iterator of
162 real-valued numbers.
163
164 The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Warec019bd32016-08-23 13:23:31 -0500165 reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano22873182016-08-24 02:34:25 +1000166 data. For example, the harmonic mean of three values *a*, *b* and *c*
167 will be equivalent to ``3/(1/a + 1/b + 1/c)``.
168
169 The harmonic mean is a type of average, a measure of the central
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700170 location of the data. It is often appropriate when averaging
171 rates or ratios, for example speeds.
172
173 Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
174 What is the average speed?
175
176 .. doctest::
177
178 >>> harmonic_mean([40, 60])
179 48.0
Steven D'Aprano22873182016-08-24 02:34:25 +1000180
181 Suppose an investor purchases an equal value of shares in each of
182 three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
183 What is the average P/E ratio for the investor's portfolio?
184
185 .. doctest::
186
187 >>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
188 3.6
189
Zachary Warec019bd32016-08-23 13:23:31 -0500190 :exc:`StatisticsError` is raised if *data* is empty, or any element
Steven D'Aprano22873182016-08-24 02:34:25 +1000191 is less than zero.
192
Zachary Warec019bd32016-08-23 13:23:31 -0500193 .. versionadded:: 3.6
194
Steven D'Aprano22873182016-08-24 02:34:25 +1000195
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700196.. function:: median(data)
197
Georg Brandleb2aeec2013-10-21 08:57:26 +0200198 Return the median (middle value) of numeric data, using the common "mean of
199 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger6da90782016-11-21 16:31:02 -0800200 *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700201
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700202 The median is a robust measure of central location and is less affected by
203 the presence of outliers. When the number of data points is odd, the
204 middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700205
206 .. doctest::
207
208 >>> median([1, 3, 5])
209 3
210
Georg Brandleb2aeec2013-10-21 08:57:26 +0200211 When the number of data points is even, the median is interpolated by taking
212 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700213
214 .. doctest::
215
216 >>> median([1, 3, 5, 7])
217 4.0
218
Georg Brandleb2aeec2013-10-21 08:57:26 +0200219 This is suited for when your data is discrete, and you don't mind that the
220 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700221
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700222 If the data is ordinal (supports order operations) but not numeric (doesn't
223 support addition), consider using :func:`median_low` or :func:`median_high`
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300224 instead.
225
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700226.. function:: median_low(data)
227
Georg Brandleb2aeec2013-10-21 08:57:26 +0200228 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger6da90782016-11-21 16:31:02 -0800229 :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700230
Georg Brandleb2aeec2013-10-21 08:57:26 +0200231 The low median is always a member of the data set. When the number of data
232 points is odd, the middle value is returned. When it is even, the smaller of
233 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700234
235 .. doctest::
236
237 >>> median_low([1, 3, 5])
238 3
239 >>> median_low([1, 3, 5, 7])
240 3
241
Georg Brandleb2aeec2013-10-21 08:57:26 +0200242 Use the low median when your data are discrete and you prefer the median to
243 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700244
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700245
246.. function:: median_high(data)
247
Georg Brandleb2aeec2013-10-21 08:57:26 +0200248 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800249 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700250
Georg Brandleb2aeec2013-10-21 08:57:26 +0200251 The high median is always a member of the data set. When the number of data
252 points is odd, the middle value is returned. When it is even, the larger of
253 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700254
255 .. doctest::
256
257 >>> median_high([1, 3, 5])
258 3
259 >>> median_high([1, 3, 5, 7])
260 5
261
Georg Brandleb2aeec2013-10-21 08:57:26 +0200262 Use the high median when your data are discrete and you prefer the median to
263 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700264
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700265
Georg Brandleb2aeec2013-10-21 08:57:26 +0200266.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700267
Georg Brandleb2aeec2013-10-21 08:57:26 +0200268 Return the median of grouped continuous data, calculated as the 50th
269 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800270 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700271
272 .. doctest::
273
274 >>> median_grouped([52, 52, 53, 54])
275 52.5
276
Georg Brandleb2aeec2013-10-21 08:57:26 +0200277 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200278 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
279 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
280 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200281 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700282
283 .. doctest::
284
285 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
286 3.7
287
Georg Brandleb2aeec2013-10-21 08:57:26 +0200288 Optional argument *interval* represents the class interval, and defaults
289 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700290
291 .. doctest::
292
293 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
294 3.25
295 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
296 3.5
297
298 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200299 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700300
301 .. impl-detail::
302
Georg Brandleb2aeec2013-10-21 08:57:26 +0200303 Under some circumstances, :func:`median_grouped` may coerce data points to
304 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700305
306 .. seealso::
307
Georg Brandleb2aeec2013-10-21 08:57:26 +0200308 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
309 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700310
Georg Brandleb2aeec2013-10-21 08:57:26 +0200311 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100312 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200313 function in the Gnome Gnumeric spreadsheet, including `this discussion
314 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700315
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700316
317.. function:: mode(data)
318
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700319 Return the single most common data point from discrete or nominal *data*.
320 The mode (when it exists) is the most typical value and serves as a
321 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200322
Raymond Hettingere4810b22019-09-05 00:18:47 -0700323 If there are multiple modes with the same frequency, returns the first one
324 encountered in the *data*. If the smallest or largest of those is
325 desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
326 If the input *data* is empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700327
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700328 ``mode`` assumes discrete data and returns a single value. This is the
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700329 standard treatment of the mode as commonly taught in schools:
330
331 .. doctest::
332
333 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
334 3
335
Raymond Hettingere4810b22019-09-05 00:18:47 -0700336 The mode is unique in that it is the only statistic in this package that
337 also applies to nominal (non-numeric) data:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700338
339 .. doctest::
340
341 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
342 'red'
343
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700344 .. versionchanged:: 3.8
345 Now handles multimodal datasets by returning the first mode encountered.
346 Formerly, it raised :exc:`StatisticsError` when more than one mode was
347 found.
348
349
350.. function:: multimode(data)
351
352 Return a list of the most frequently occurring values in the order they
353 were first encountered in the *data*. Will return more than one result if
354 there are multiple modes or an empty list if the *data* is empty:
355
356 .. doctest::
357
358 >>> multimode('aabbbbccddddeeffffgg')
359 ['b', 'd', 'f']
360 >>> multimode('')
361 []
362
363 .. versionadded:: 3.8
364
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700365
Georg Brandleb2aeec2013-10-21 08:57:26 +0200366.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700367
Georg Brandleb2aeec2013-10-21 08:57:26 +0200368 Return the population standard deviation (the square root of the population
369 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700370
371 .. doctest::
372
373 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
374 0.986893273527251
375
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700376
Georg Brandleb2aeec2013-10-21 08:57:26 +0200377.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700378
Raymond Hettingere4810b22019-09-05 00:18:47 -0700379 Return the population variance of *data*, a non-empty sequence or iterator
380 of real-valued numbers. Variance, or second moment about the mean, is a
381 measure of the variability (spread or dispersion) of data. A large
382 variance indicates that the data is spread out; a small variance indicates
383 it is clustered closely around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700384
Raymond Hettingere4810b22019-09-05 00:18:47 -0700385 If the optional second argument *mu* is given, it is typically the mean of
386 the *data*. It can also be used to compute the second moment around a
387 point that is not the mean. If it is missing or ``None`` (the default),
388 the arithmetic mean is automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700389
Georg Brandleb2aeec2013-10-21 08:57:26 +0200390 Use this function to calculate the variance from the entire population. To
391 estimate the variance from a sample, the :func:`variance` function is usually
392 a better choice.
393
394 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700395
396 Examples:
397
398 .. doctest::
399
400 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
401 >>> pvariance(data)
402 1.25
403
Georg Brandleb2aeec2013-10-21 08:57:26 +0200404 If you have already calculated the mean of your data, you can pass it as the
405 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700406
407 .. doctest::
408
409 >>> mu = mean(data)
410 >>> pvariance(data, mu)
411 1.25
412
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700413 Decimals and Fractions are supported:
414
415 .. doctest::
416
417 >>> from decimal import Decimal as D
418 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
419 Decimal('24.815')
420
421 >>> from fractions import Fraction as F
422 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
423 Fraction(13, 72)
424
425 .. note::
426
Georg Brandleb2aeec2013-10-21 08:57:26 +0200427 When called with the entire population, this gives the population variance
428 σ². When called on a sample instead, this is the biased sample variance
429 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700430
Raymond Hettingere4810b22019-09-05 00:18:47 -0700431 If you somehow know the true population mean μ, you may use this
432 function to calculate the variance of a sample, giving the known
433 population mean as the second argument. Provided the data points are a
434 random sample of the population, the result will be an unbiased estimate
435 of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700436
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700437
Georg Brandleb2aeec2013-10-21 08:57:26 +0200438.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700439
Georg Brandleb2aeec2013-10-21 08:57:26 +0200440 Return the sample standard deviation (the square root of the sample
441 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700442
443 .. doctest::
444
445 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
446 1.0810874155219827
447
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700448
Georg Brandleb2aeec2013-10-21 08:57:26 +0200449.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700450
Georg Brandleb2aeec2013-10-21 08:57:26 +0200451 Return the sample variance of *data*, an iterable of at least two real-valued
452 numbers. Variance, or second moment about the mean, is a measure of the
453 variability (spread or dispersion) of data. A large variance indicates that
454 the data is spread out; a small variance indicates it is clustered closely
455 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700456
Georg Brandleb2aeec2013-10-21 08:57:26 +0200457 If the optional second argument *xbar* is given, it should be the mean of
458 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700459 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700460
Georg Brandleb2aeec2013-10-21 08:57:26 +0200461 Use this function when your data is a sample from a population. To calculate
462 the variance from the entire population, see :func:`pvariance`.
463
464 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700465
466 Examples:
467
468 .. doctest::
469
470 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
471 >>> variance(data)
472 1.3720238095238095
473
Georg Brandleb2aeec2013-10-21 08:57:26 +0200474 If you have already calculated the mean of your data, you can pass it as the
475 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700476
477 .. doctest::
478
479 >>> m = mean(data)
480 >>> variance(data, m)
481 1.3720238095238095
482
Georg Brandleb2aeec2013-10-21 08:57:26 +0200483 This function does not attempt to verify that you have passed the actual mean
484 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700485 impossible results.
486
487 Decimal and Fraction values are supported:
488
489 .. doctest::
490
491 >>> from decimal import Decimal as D
492 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
493 Decimal('31.01875')
494
495 >>> from fractions import Fraction as F
496 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
497 Fraction(67, 108)
498
499 .. note::
500
Georg Brandleb2aeec2013-10-21 08:57:26 +0200501 This is the sample variance s² with Bessel's correction, also known as
502 variance with N-1 degrees of freedom. Provided that the data points are
503 representative (e.g. independent and identically distributed), the result
504 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700505
Georg Brandleb2aeec2013-10-21 08:57:26 +0200506 If you somehow know the actual population mean μ you should pass it to the
507 :func:`pvariance` function as the *mu* parameter to get the variance of a
508 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700509
Raymond Hettingere4810b22019-09-05 00:18:47 -0700510.. function:: quantiles(data, *, n=4, method='exclusive')
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700511
Raymond Hettingere4810b22019-09-05 00:18:47 -0700512 Divide *data* into *n* continuous intervals with equal probability.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700513 Returns a list of ``n - 1`` cut points separating the intervals.
514
515 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
516 *n* to 100 for percentiles which gives the 99 cuts points that separate
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700517 *data* into 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700518 is not least 1.
519
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700520 The *data* can be any iterable containing sample data. For meaningful
Raymond Hettingere4810b22019-09-05 00:18:47 -0700521 results, the number of data points in *data* should be larger than *n*.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700522 Raises :exc:`StatisticsError` if there are not at least two data points.
523
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700524 The cut points are linearly interpolated from the
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700525 two nearest data points. For example, if a cut point falls one-third
526 of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700527 cut-point will evaluate to ``104``.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700528
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700529 The *method* for computing quantiles can be varied depending on
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700530 whether the *data* includes or excludes the lowest and
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700531 highest possible values from the population.
532
533 The default *method* is "exclusive" and is used for data sampled from
534 a population that can have more extreme values than found in the
535 samples. The portion of the population falling below the *i-th* of
Raymond Hettingerb530a442019-07-21 16:32:00 -0700536 *m* sorted data points is computed as ``i / (m + 1)``. Given nine
537 sample values, the method sorts them and assigns the following
538 percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700539
540 Setting the *method* to "inclusive" is used for describing population
Raymond Hettingerb530a442019-07-21 16:32:00 -0700541 data or for samples that are known to include the most extreme values
Raymond Hettingere4810b22019-09-05 00:18:47 -0700542 from the population. The minimum value in *data* is treated as the 0th
Raymond Hettingerb530a442019-07-21 16:32:00 -0700543 percentile and the maximum value is treated as the 100th percentile.
544 The portion of the population falling below the *i-th* of *m* sorted
545 data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
546 values, the method sorts them and assigns the following percentiles:
547 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700548
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700549 .. doctest::
550
551 # Decile cut points for empirically sampled data
552 >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
553 ... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
554 ... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
555 ... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
556 ... 103, 107, 101, 81, 109, 104]
557 >>> [round(q, 1) for q in quantiles(data, n=10)]
558 [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
559
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700560 .. versionadded:: 3.8
561
562
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700563Exceptions
564----------
565
566A single exception is defined:
567
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400568.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700569
Benjamin Peterson44c30652013-10-20 17:52:09 -0400570 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700571
Raymond Hettinger11c79532019-02-23 14:44:07 -0800572
573:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700574---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800575
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800576:class:`NormalDist` is a tool for creating and manipulating normal
577distributions of a `random variable
578<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
Raymond Hettingere4810b22019-09-05 00:18:47 -0700579class that treats the mean and standard deviation of data
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800580measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800581
582Normal distributions arise from the `Central Limit Theorem
583<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800584of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800585
586.. class:: NormalDist(mu=0.0, sigma=1.0)
587
588 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800589 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800590 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800591 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800592
593 If *sigma* is negative, raises :exc:`StatisticsError`.
594
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800595 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800596
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800597 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800598 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
599 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800600
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700601 .. attribute:: median
602
603 A read-only property for the `median
604 <https://en.wikipedia.org/wiki/Median>`_ of a normal
605 distribution.
606
607 .. attribute:: mode
608
609 A read-only property for the `mode
610 <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
611 distribution.
612
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800613 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800614
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800615 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800616 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
617 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800618
619 .. attribute:: variance
620
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800621 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800622 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
623 distribution. Equal to the square of the standard deviation.
624
625 .. classmethod:: NormalDist.from_samples(data)
626
Raymond Hettingere4810b22019-09-05 00:18:47 -0700627 Makes a normal distribution instance with *mu* and *sigma* parameters
628 estimated from the *data* using :func:`fmean` and :func:`stdev`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800629
Raymond Hettingere4810b22019-09-05 00:18:47 -0700630 The *data* can be any :term:`iterable` and should consist of values
631 that can be converted to type :class:`float`. If *data* does not
632 contain at least two elements, raises :exc:`StatisticsError` because it
633 takes at least one point to estimate a central value and at least two
634 points to estimate dispersion.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800635
Raymond Hettingerfb8c7d52019-04-23 01:46:18 -0700636 .. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800637
638 Generates *n* random samples for a given mean and standard deviation.
639 Returns a :class:`list` of :class:`float` values.
640
641 If *seed* is given, creates a new instance of the underlying random
642 number generator. This is useful for creating reproducible results,
643 even in a multi-threading context.
644
645 .. method:: NormalDist.pdf(x)
646
647 Using a `probability density function (pdf)
Raymond Hettingere4810b22019-09-05 00:18:47 -0700648 <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
649 the relative likelihood that a random variable *X* will be near the
650 given value *x*. Mathematically, it is the limit of the ratio ``P(x <=
651 X < x+dx) / dx`` as *dx* approaches zero.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800652
Raymond Hettingercc353a02019-03-10 23:43:33 -0700653 The relative likelihood is computed as the probability of a sample
654 occurring in a narrow range divided by the width of the range (hence
655 the word "density"). Since the likelihood is relative to other points,
656 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800657
658 .. method:: NormalDist.cdf(x)
659
660 Using a `cumulative distribution function (cdf)
661 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800662 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800663 equal to *x*. Mathematically, it is written ``P(X <= x)``.
664
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700665 .. method:: NormalDist.inv_cdf(p)
666
667 Compute the inverse cumulative distribution function, also known as the
668 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
669 or the `percent-point
670 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
671 function. Mathematically, it is written ``x : P(X <= x) = p``.
672
673 Finds the value *x* of the random variable *X* such that the
674 probability of the variable being less than or equal to that value
675 equals the given probability *p*.
676
Raymond Hettinger318d5372019-03-06 22:59:40 -0800677 .. method:: NormalDist.overlap(other)
678
Raymond Hettingere4810b22019-09-05 00:18:47 -0700679 Measures the agreement between two normal probability distributions.
680 Returns a value between 0.0 and 1.0 giving `the overlapping area for
681 the two probability density functions
682 <https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800683
Raymond Hettinger8a6cbf82019-10-13 19:53:30 -0700684 .. method:: NormalDist.quantiles(n=4)
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700685
686 Divide the normal distribution into *n* continuous intervals with
687 equal probability. Returns a list of (n - 1) cut points separating
688 the intervals.
689
690 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles.
691 Set *n* to 100 for percentiles which gives the 99 cuts points that
692 separate the normal distribution into 100 equal sized groups.
693
Raymond Hettinger11c79532019-02-23 14:44:07 -0800694 Instances of :class:`NormalDist` support addition, subtraction,
695 multiplication and division by a constant. These operations
696 are used for translation and scaling. For example:
697
698 .. doctest::
699
700 >>> temperature_february = NormalDist(5, 2.5) # Celsius
701 >>> temperature_february * (9/5) + 32 # Fahrenheit
702 NormalDist(mu=41.0, sigma=4.5)
703
Raymond Hettingercc353a02019-03-10 23:43:33 -0700704 Dividing a constant by an instance of :class:`NormalDist` is not supported
705 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800706
707 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800708 variables, it is possible to `add and subtract two independent normally
709 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800710 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
711 represented as instances of :class:`NormalDist`. For example:
712
713 .. doctest::
714
715 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
716 >>> drug_effects = NormalDist(0.4, 0.15)
717 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700718 >>> round(combined.mean, 1)
719 3.1
720 >>> round(combined.stdev, 1)
721 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800722
723 .. versionadded:: 3.8
724
725
726:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700727^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800728
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800729:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800730
731For example, given `historical data for SAT exams
732<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800733are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700734determine the percentage of students with test scores between 1100 and
7351200, after rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800736
737.. doctest::
738
739 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800740 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700741 >>> round(fraction * 100.0, 1)
742 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800743
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700744Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
745<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
746
747.. doctest::
748
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700749 >>> list(map(round, sat.quantiles()))
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700750 [928, 1060, 1192]
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700751 >>> list(map(round, sat.quantiles(n=10)))
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700752 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
753
Raymond Hettinger11c79532019-02-23 14:44:07 -0800754To estimate the distribution for a model than isn't easy to solve
755analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700756Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800757
758.. doctest::
759
Raymond Hettingercc353a02019-03-10 23:43:33 -0700760 >>> def model(x, y, z):
761 ... return (3*x + 7*x*y - 5*y) / (11 * z)
762 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800763 >>> n = 100_000
Raymond Hettingere4810b22019-09-05 00:18:47 -0700764 >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
765 >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
766 >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
767 >>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
768 [1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Raymond Hettinger11c79532019-02-23 14:44:07 -0800769
770Normal distributions commonly arise in machine learning problems.
771
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800772Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800773<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
774The challenge is to predict a person's gender from measurements of normally
775distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800776
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800777We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800778measurements are assumed to be normally distributed, so we summarize the data
779with :class:`NormalDist`:
780
781.. doctest::
782
783 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
784 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
785 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
786 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
787 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
788 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
789
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800790Next, we encounter a new person whose feature measurements are known but whose
791gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800792
793.. doctest::
794
795 >>> ht = 6.0 # height
796 >>> wt = 130 # weight
797 >>> fs = 8 # foot size
798
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800799Starting with a 50% `prior probability
800<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
801we compute the posterior as the prior times the product of likelihoods for the
802feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800803
804.. doctest::
805
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800806 >>> prior_male = 0.5
807 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800808 >>> posterior_male = (prior_male * height_male.pdf(ht) *
809 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
810
811 >>> posterior_female = (prior_female * height_female.pdf(ht) *
812 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
813
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800814The final prediction goes to the largest posterior. This is known as the
815`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800816<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
817
818.. doctest::
819
820 >>> 'male' if posterior_male > posterior_female else 'female'
821 'female'
822
823
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700824..
825 # This modelines must appear within the last ten lines of the file.
826 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;