:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`Real`-valued) data.

.. note::

   Unless explicitly noted otherwise, these functions support :class:`int`,
   :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
   Behaviour with other types (whether in the numeric tower or not) is
   currently unsupported. Mixed types are also undefined and
   implementation-dependent. If your input data consists of mixed types,
   you may be able to use :func:`map` to ensure a consistent result, e.g.
   ``map(float, input_data)``.
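
As a minimal sketch of the advice in the note above, the following coerces
mixed input to :class:`float` before averaging. The sample values and the name
``input_data`` are illustrative assumptions, not part of the module:

```python
from fractions import Fraction
from statistics import mean

# Hypothetical mixed-type input: an int, a float and a Fraction together.
input_data = [1, 2.5, Fraction(7, 2)]

# Coerce every element to float so the calculation sees a single type.
result = mean(map(float, input_data))
print(type(result).__name__, result)
```

Converting to :class:`float` trades exactness for consistency; converting
everything to :class:`fractions.Fraction` instead would be another option when
exact results matter.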

Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  =============================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Mode (most common value) of discrete data.
=======================  =============================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data* which can be a sequence or iterator.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of the
      data points. For more robust, although less efficient, measures of
      central location, see :func:`median` and :func:`mode`. (In this case,
      "efficient" refers to statistical efficiency rather than computational
      efficiency.)

      The sample mean gives an unbiased estimate of the true population mean,
      which means that, taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.

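The sensitivity of the mean to outliers can be illustrated with a short
sketch. The data values below are made up for the purpose of the comparison:

```python
from statistics import mean, median

# Hypothetical sample: five typical values, then the same with one outlier.
typical = [10, 11, 12, 13, 14]
with_outlier = typical + [1000]

# Without the outlier the two measures agree.
print(mean(typical), median(typical))

# With the outlier the mean jumps to roughly 176.67,
# while the median only moves from 12 to 12.5.
print(mean(with_outlier), median(with_outlier))
```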

.. function:: harmonic_mean(data)

   Return the harmonic mean of *data*, a sequence or iterator of
   real-valued numbers.

   The harmonic mean, sometimes called the subcontrary mean, is the
   reciprocal of the arithmetic :func:`mean` of the reciprocals of the
   data. For example, the harmonic mean of three values *a*, *b* and *c*
   will be equivalent to ``3/(1/a + 1/b + 1/c)``.

   The harmonic mean is a type of average, a measure of the central
   location of the data. It is often appropriate when averaging quantities
   that are rates or ratios, such as speeds. For example:

   Suppose an investor purchases an equal value of shares in each of
   three companies, with P/E (price/earnings) ratios of 2.5, 3 and 10.
   What is the average P/E ratio for the investor's portfolio?

   .. doctest::

      >>> harmonic_mean([2.5, 3, 10])  # For an equal investment portfolio.
      3.6

   Using the arithmetic mean would give an average of about 5.167, which
   is too high.

   :exc:`StatisticsError` is raised if *data* is empty, or any element
   is less than zero.

   .. versionadded:: 3.6

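The definition above ("reciprocal of the arithmetic mean of the reciprocals")
can be sketched directly. The helper name ``harmonic_mean_sketch`` is
hypothetical, it assumes strictly positive inputs, and it is not how the
module itself is implemented:

```python
from fractions import Fraction
from statistics import harmonic_mean, mean

def harmonic_mean_sketch(values):
    """Reciprocal of the arithmetic mean of the reciprocals.

    Illustrative only: assumes every value is strictly positive.
    """
    # Fractions keep the intermediate reciprocals exact.
    reciprocals = [1 / Fraction(v) for v in values]
    return float(1 / mean(reciprocals))

# Two legs of equal distance, driven at 40 km/h and 60 km/h.
speeds = [40, 60]
print(harmonic_mean_sketch(speeds))  # average speed over the whole trip
print(harmonic_mean(speeds))         # the library function, for comparison
```

The arithmetic mean of the two speeds would be 50, which overstates the
average because more time is spent on the slower leg.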

.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterator.

   The median is a robust measure of central location, and is less affected by
   the presence of outliers in your data. When the number of data points is
   odd, the middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`

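The "mean of middle two" method described above can be sketched as follows.
The function ``median_sketch`` is a hypothetical illustration of the rule, not
the module's source code:

```python
from statistics import StatisticsError, median

def median_sketch(data):
    """Illustrative "mean of middle two" median."""
    ordered = sorted(data)          # accepts any iterable, sorted or not
    n = len(ordered)
    if n == 0:
        raise StatisticsError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd count: the middle point
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average of two

print(median_sketch([5, 1, 3]), median([5, 1, 3]))
print(median_sketch(iter([7, 1, 5, 3])), median(iter([7, 1, 5, 3])))
```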

.. function:: median_low(data)

   Return the low median of numeric data. If *data* is empty,
   :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If *data* is empty, :exc:`StatisticsError`
   is raised. *data* can be a sequence or iterator.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.

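A short comparison of the three medians on the same even-sized data set may
help when choosing between them. The readings are made up and deliberately
unsorted:

```python
from statistics import median, median_high, median_low

# Hypothetical discrete readings; sorted they are 1, 3, 5, 7.
readings = [7, 1, 5, 3]

low = median_low(readings)    # smaller of the two middle values
mid = median(readings)        # average of the two middle values
high = median_high(readings)  # larger of the two middle values

print(low, mid, high)  # 3 4.0 5

# Only the low and high medians are guaranteed to be actual data points.
print(low in readings, mid in readings, high in readings)  # True False True
```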

.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
   is raised. *data* can be a sequence or iterator.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
   given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1. Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * Calculating the `median <https://www.ualberta.ca/~opscan/median.html>`_.

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.

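The interpolation can be sketched with the usual textbook formula for the
median of grouped data, ``L + interval * (n/2 - cf) / f``, where ``L`` is the
lower limit of the class containing the median, ``cf`` is the number of values
below that class and ``f`` is the number of values in it. The helper
``median_grouped_sketch`` is hypothetical, assumes non-empty numeric data, and
is not guaranteed to match the module's implementation in every detail:

```python
from statistics import median_grouped

def median_grouped_sketch(data, interval=1):
    """Textbook grouped-median interpolation (illustrative only)."""
    ordered = sorted(data)
    n = len(ordered)
    x = ordered[n // 2]                       # midpoint of the median class
    lower = x - interval / 2                  # L: lower limit of that class
    below = sum(1 for v in ordered if v < x)  # cf: values below the class
    freq = ordered.count(x)                   # f: values inside the class
    return lower + interval * (n / 2 - below) / freq

rounded = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
print(median_grouped_sketch(rounded), median_grouped(rounded))
print(median_grouped_sketch([1, 3, 3, 5, 7], interval=2))
```

For the ten rounded values, the median class is 3.5--4.5 with four values
below it and five inside it, so the estimate is ``3.5 + (5 - 4) / 5``.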

.. function:: mode(data)

   Return the most common data point from discrete or nominal *data*. The mode
   (when it exists) is the most typical value, and is a robust measure of
   central location.

   If *data* is empty, or if there is not exactly one most common value,
   :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data, and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this module that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

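The "exactly one most common value" rule can be sketched with
:class:`collections.Counter`. The helper ``unique_mode_sketch`` is
hypothetical and raises a plain :exc:`ValueError`; it illustrates the rule
described above rather than reproducing the module's code:

```python
from collections import Counter

def unique_mode_sketch(data):
    """Return the single most common value, or raise if there is none."""
    # most_common() lists (value, count) pairs, highest count first.
    counts = Counter(data).most_common()
    if not counts:
        raise ValueError("no mode for empty data")
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        raise ValueError("no unique mode; the top count is tied")
    return counts[0][0]

print(unique_mode_sketch([1, 1, 2, 3, 3, 3, 3, 4]))
print(unique_mode_sketch(["red", "blue", "blue", "red", "green", "red", "red"]))
```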

.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance). See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty iterable of real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *mu* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   This function does not attempt to verify that you have passed the actual mean
   as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
   results.

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ². When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this function
      to calculate the variance of a sample, giving the known population mean as
      the second argument. Provided the data points are representative
      (e.g. independent and identically distributed), the result will be an
      unbiased estimate of the population variance.

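The population variance is the mean of the squared deviations from the mean.
A direct sketch of that definition follows; ``pvariance_sketch`` is a
hypothetical helper that assumes non-empty data and is less careful about
rounding than the module:

```python
from fractions import Fraction
from statistics import mean, pvariance

def pvariance_sketch(data, mu=None):
    """Mean of squared deviations from mu (illustrative only)."""
    values = list(data)
    if mu is None:
        mu = mean(values)   # fall back to the mean of the data itself
    return sum((x - mu) ** 2 for x in values) / len(values)

exact = [Fraction(1, 4), Fraction(5, 4), Fraction(1, 2)]
print(pvariance_sketch(exact), pvariance(exact))

floats = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
print(pvariance_sketch(floats), pvariance(floats, mean(floats)))
```

With :class:`fractions.Fraction` inputs every step is exact, so the sketch and
the library function agree precisely.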

.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827

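The relationship between the standard deviation functions and the variance
functions can be checked with a brief sketch, using the same data as the
examples above:

```python
import math
from statistics import pstdev, pvariance, stdev, variance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

# Each standard deviation is the square root of the matching variance.
sample_ok = math.isclose(stdev(data), math.sqrt(variance(data)))
population_ok = math.isclose(pstdev(data), math.sqrt(pvariance(data)))
print(sample_ok, population_ok)  # True True

# Dividing by n - 1 rather than n makes the sample figure the larger one.
print(stdev(data) > pstdev(data))  # True
```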

.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.
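
Bessel's correction only changes the divisor: the same sum of squared
deviations is divided by N-1 instead of N. A minimal sketch of that
difference, reusing the sample from the examples above (the variable names
are illustrative):

```python
import math
from statistics import mean, pvariance, variance

sample = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
n = len(sample)
xbar = mean(sample)

# Sum of squared deviations from the sample mean.
ss = sum((x - xbar) ** 2 for x in sample)

biased = ss / n          # divides by N, as pvariance() does
unbiased = ss / (n - 1)  # Bessel's correction, as variance() does

print(math.isclose(biased, pvariance(sample)))   # True
print(math.isclose(unbiased, variance(sample)))  # True
```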

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.
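
Because :exc:`StatisticsError` subclasses :exc:`ValueError`, it can be caught
under either name. The following sketch wraps :func:`mean` in a hypothetical
helper, ``safe_mean``, that returns a default for empty input:

```python
from statistics import StatisticsError, mean

def safe_mean(data, default=None):
    """Hypothetical helper: return *default* instead of raising on empty data."""
    try:
        return mean(data)
    except StatisticsError:
        return default

print(safe_mean([]))         # None
print(safe_mean([1, 2, 3]))

# Existing handlers written for ValueError also catch StatisticsError.
print(issubclass(StatisticsError, ValueError))  # True
```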

..
   # These modelines must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;