:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

**Source code:** :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`Real`-valued) data.

.. note::

   Unless explicitly noted otherwise, these functions support :class:`int`,
   :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
   Behaviour with other types (whether in the numeric tower or not) is
   currently unsupported. Behaviour with collections of mixed types is also
   undefined and implementation-dependent. If your input data consists of
   mixed types, you may be able to use :func:`map` to ensure a consistent
   result, e.g. ``map(float, input_data)``.

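As a hedged illustration of the note above, the sketch below coerces a mixed collection to :class:`float` before averaging it. The name ``input_data`` and its contents are assumptions made for this example only.

```python
from fractions import Fraction
from statistics import mean

# Hypothetical input that mixes int, float and Fraction values.
input_data = [1, 2.5, Fraction(3, 2)]

# Coerce every value to float first so the result type is predictable.
result = mean(map(float, input_data))
print(type(result).__name__)  # float
```

Converting to :class:`float` trades exactness for consistency; converting everything to :class:`fractions.Fraction` instead would be an equally valid choice when exact results matter.
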
Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  =============================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Mode (most common value) of discrete data.
=======================  =============================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, which can be a sequence or
   iterator.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of the
      data points. For more robust, although less efficient, measures of
      central location, see :func:`median` and :func:`mode`. (In this case,
      "efficient" refers to statistical efficiency rather than computational
      efficiency.)

      The sample mean gives an unbiased estimate of the true population mean,
      so that, taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.

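The definition and the outlier caveat for :func:`mean` can be sketched as follows. ``mean_sketch`` is a hypothetical helper written for this illustration, not the module's implementation, and the sample lists are made up.

```python
from statistics import mean, median

def mean_sketch(data):
    """Sum of the values divided by their count (illustrative only)."""
    values = list(data)  # accept iterators as well as sequences
    return sum(values) / len(values)

typical = [1, 2, 3, 4, 4]
with_outlier = [1, 2, 3, 4, 400]

print(mean_sketch(typical), mean(typical))       # both give 2.8
print(mean(with_outlier), median(with_outlier))  # the mean jumps to 82, the median stays at 3
```

A single extreme value moves the mean far from the bulk of the data while the median is unchanged, which is the sense in which the mean is not robust.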

.. function:: harmonic_mean(data)

   Return the harmonic mean of *data*, a sequence or iterator of
   real-valued numbers.

   The harmonic mean, sometimes called the subcontrary mean, is the
   reciprocal of the arithmetic :func:`mean` of the reciprocals of the
   data. For example, the harmonic mean of three values *a*, *b* and *c*
   is equivalent to ``3/(1/a + 1/b + 1/c)``.

   The harmonic mean is a type of average, a measure of the central
   location of the data. It is often appropriate when averaging quantities
   which are rates or ratios, such as speeds. For example:

   Suppose an investor purchases an equal value of shares in each of
   three companies, with P/E (price/earnings) ratios of 2.5, 3 and 10.
   What is the average P/E ratio for the investor's portfolio?

   .. doctest::

      >>> harmonic_mean([2.5, 3, 10])  # For an equal investment portfolio.
      3.6

   Using the arithmetic mean would give an average of about 5.167, which
   is too high.

   :exc:`StatisticsError` is raised if *data* is empty, or any element
   is less than zero.

   .. versionadded:: 3.6

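The formula ``3/(1/a + 1/b + 1/c)`` generalises to any number of values. The sketch below is a minimal version that assumes every value is strictly positive; ``harmonic_mean_sketch`` is a hypothetical name, and the speeds are invented for the example.

```python
from math import isclose
from statistics import harmonic_mean

def harmonic_mean_sketch(data):
    """Count divided by the sum of reciprocals; assumes all values are > 0."""
    values = list(data)
    return len(values) / sum(1 / x for x in values)

pe_ratios = [2.5, 3, 10]
print(isclose(harmonic_mean_sketch(pe_ratios), harmonic_mean(pe_ratios)))  # True

# Averaging speeds over equal distances: 40 km/h out, 60 km/h back.
round_trip = harmonic_mean([40, 60])
print(isclose(round_trip, 48))  # True
```

The comparisons use ``isclose`` because the results are floats and may differ from the exact value in the last digit.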

.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterator.

   The median is a robust measure of central location, and is less affected by
   the presence of outliers in your data than the mean. When the number of data
   points is odd, the middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suitable when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If your data is ordinal (supports order operations) but not numeric (doesn't
   support addition), you should use :func:`median_low` or :func:`median_high`
   instead.

   .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`

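The "mean of middle two" rule can be written out by hand. This is an illustrative sketch only; ``median_sketch`` is a hypothetical name and it raises a plain :exc:`ValueError` rather than reproducing the module's error handling.

```python
from statistics import median

def median_sketch(data):
    """Middle value of the sorted data, averaging the two middle values if needed."""
    values = sorted(data)  # the input does not need to be pre-sorted
    n = len(values)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        return values[mid]                       # odd count: the middle point
    return (values[mid - 1] + values[mid]) / 2   # even count: interpolate

print(median_sketch([5, 1, 3]), median([5, 1, 3]))        # 3 3
print(median_sketch([7, 1, 5, 3]), median([7, 1, 5, 3]))  # 4.0 4.0
```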

.. function:: median_low(data)

   Return the low median of numeric data. If *data* is empty,
   :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If *data* is empty, :exc:`StatisticsError`
   is raised. *data* can be a sequence or iterator.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.

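Because the low and high medians always return an actual data point, they also work for ordinal data that can be sorted but not averaged. The letter grades below are an assumed example, not part of the module.

```python
from statistics import median, median_high, median_low

# Ordinal labels: they can be ordered, but they cannot be averaged.
grades = ["C", "A", "D", "B"]

low = median_low(grades)    # smaller of the two middle values
high = median_high(grades)  # larger of the two middle values
print(low, high)            # B C

try:
    median(grades)          # would need to average "B" and "C"
    averaged = True
except TypeError:
    averaged = False
print(averaged)             # False
```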

.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
   is raised. *data* can be a sequence or iterator.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of a data class, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
   given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1. Changing the class interval naturally changes the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.

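The interpolation can be sketched with the usual textbook formula for grouped data: ``L + interval * (n/2 - cf) / f``, where ``L`` is the lower limit of the median class, ``cf`` the number of values below that class and ``f`` the number of values in it. This is an assumption about the method, written as the hypothetical helper ``median_grouped_sketch``; it reproduces the examples above but is not guaranteed to match the module in every case.

```python
from math import isclose
from statistics import median_grouped

def median_grouped_sketch(data, interval=1):
    """Interpolated median of grouped data (illustrative, non-empty input only)."""
    values = sorted(data)
    n = len(values)
    x = values[n // 2]                       # midpoint of the median class
    lower = x - interval / 2                 # lower limit L of that class
    cf = sum(1 for v in values if v < x)     # values below the median class
    f = values.count(x)                      # values inside the median class
    return lower + interval * (n / 2 - cf) / f

sample = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
print(isclose(median_grouped_sketch(sample), median_grouped(sample)))  # True
```

For ``sample``, the median class is 3.5--4.5 with four values below it and five inside it, so the estimate is ``3.5 + 1 * (5 - 4) / 5``, or about 3.7.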

.. function:: mode(data)

   Return the most common data point from discrete or nominal *data*. The mode
   (when it exists) is the most typical value, and is a robust measure of
   central location.

   If *data* is empty, or if there is not exactly one most common value,
   :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data, and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this module that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

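When data may have several equally common values, one possible approach is to count occurrences directly and inspect the ties. The helper ``all_modes`` below is hypothetical and is built on :class:`collections.Counter`; it is not part of this module.

```python
from collections import Counter

def all_modes(data):
    """Return every value tied for the highest count (hypothetical helper)."""
    counts = Counter(data)
    if not counts:
        return []                      # no data, no modes
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

print(all_modes([1, 1, 2, 3, 3, 3, 3, 4]))                   # [3]
print(sorted(all_modes(["red", "blue", "blue", "red"])))     # ['blue', 'red']
```

A result with exactly one element corresponds to the case in which ``mode`` has a well-defined answer.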

.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance). See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251

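The relationship between :func:`pstdev` and :func:`pvariance` can be checked directly. This is a small sketch using the data from the example above; the comparison uses ``isclose`` because the two computations may round differently in the last digit.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

# The population standard deviation is the square root of the population variance.
from_variance = sqrt(pvariance(data))
print(isclose(pstdev(data), from_variance))  # True
```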

.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty iterable of real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *mu* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   This function does not attempt to verify that you have passed the actual mean
   as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
   results.

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ². When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this function
      to calculate the variance of a sample, giving the known population mean as
      the second argument. Provided the data points are representative
      (e.g. independent and identically distributed), the result will be an
      unbiased estimate of the population variance.

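The population variance is the mean of the squared deviations from *mu*. A minimal sketch of that definition follows; ``pvariance_sketch`` is a hypothetical name, and the straightforward summation used here makes no attempt at the numerical care a library implementation may take.

```python
from math import isclose
from statistics import mean, pvariance

def pvariance_sketch(data, mu=None):
    """Mean squared deviation from mu, dividing by N (illustrative only)."""
    values = list(data)
    if mu is None:
        mu = mean(values)  # compute the mean when the caller does not supply it
    return sum((x - mu) ** 2 for x in values) / len(values)

data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
print(isclose(pvariance_sketch(data), pvariance(data)))  # True
```

As with the library function, passing a value for *mu* that is not the actual mean gives a result that is no longer the variance of the data.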

.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ, you should pass it to
      the :func:`pvariance` function as the *mu* parameter to get the variance
      of a sample.

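Bessel's correction amounts to dividing the sum of squared deviations by N-1 instead of N. The sketch below checks that relationship using the sample from the example above; it is an illustration of the arithmetic, not a description of how the module computes its result.

```python
from math import isclose
from statistics import pvariance, variance

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
n = len(data)

# Rescaling the N-denominator result by n / (n - 1) gives the N-1 version.
rescaled = pvariance(data) * n / (n - 1)
print(isclose(rescaled, variance(data)))  # True
```
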
Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.

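Because :exc:`StatisticsError` subclasses :exc:`ValueError`, it can be caught under either name. The wrapper ``safe_mean`` below is a hypothetical convenience function shown only to illustrate the handling.

```python
from statistics import StatisticsError, mean

def safe_mean(data, default=None):
    """Hypothetical wrapper: return *default* instead of raising on empty data."""
    try:
        return mean(data)
    except StatisticsError:
        return default

print(safe_mean([1, 2, 3]))  # 2
print(safe_mean([]))         # None
```
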
..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;