:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: mathematical statistics functions
.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

**Source code:** :source:`Lib/statistics.py`

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`Real`-valued) data.

Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  =============================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Mode (most common value) of discrete data.
=======================  =============================================

:func:`mean`
~~~~~~~~~~~~

The :func:`mean` function calculates the arithmetic mean, commonly known
as the average, of its iterable argument:

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, a sequence or iterator
   of real-valued numbers.

   The arithmetic mean is the sum of the data divided by the number of
   data points. It is commonly called "the average", although it is only
   one of many different mathematical averages. It is a measure of the
   central location of the data.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust
      estimator for central location: the mean is not necessarily a
      typical example of the data points. For more robust, although less
      efficient, measures of central location, see :func:`median` and
      :func:`mode`. (In this case, "efficient" refers to statistical
      efficiency rather than computational efficiency.)

      The sample mean gives an unbiased estimate of the true population
      mean, which means that, taken on average over all the possible
      samples, ``mean(sample)`` converges on the true mean of the entire
      population. If *data* represents the entire population rather than
      a sample, then ``mean(data)`` is equivalent to calculating the true
      population mean μ.

   If *data* is empty, :exc:`StatisticsError` will be raised.
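
As a rough illustration of the definition and of the note on outliers, the
following sketch computes the mean by hand and compares it with the median on
data containing one extreme value. The helper name ``arithmetic_mean`` and the
sample values are assumptions made for this example; they are not part of the
module.

.. code-block:: python

   from statistics import mean, median


   def arithmetic_mean(data):
       """Hypothetical helper: sum of the data divided by the number of points."""
       values = list(data)  # accept iterators as well as sequences
       if not values:
           raise ValueError("arithmetic_mean requires at least one data point")
       return sum(values) / len(values)


   typical = [10, 11, 12, 13, 14]
   with_outlier = typical + [1000]  # one extreme value added

   # The single outlier pulls the mean far away from the bulk of the data,
   # while the median moves only slightly.
   mean_typical, median_typical = mean(typical), median(typical)
   mean_outlier, median_outlier = mean(with_outlier), median(with_outlier)

Here the mean and median of ``typical`` agree, whereas for ``with_outlier``
the mean is more than ten times larger than the median.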

:func:`median`
~~~~~~~~~~~~~~

The :func:`median` function calculates the median, or middle, data point,
using the common "mean of middle two" method.

   .. seealso::

      :func:`median_low`

      :func:`median_high`

      :func:`median_grouped`

.. function:: median(data)

   Return the median (middle value) of numeric data.

   The median is a robust measure of central location, and is less affected
   by the presence of outliers in your data. When the number of data points
   is odd, the middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by
   taking the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suitable when your data are discrete and you don't mind that
   the median may not be an actual data point.

   If *data* is empty, :exc:`StatisticsError` is raised.
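
The "mean of middle two" method described above can be sketched in a few
lines. This is an illustrative reimplementation under the name
``median_sketch`` (a hypothetical helper), not the module's actual source.

.. code-block:: python

   def median_sketch(data):
       """Illustrative median using the "mean of middle two" rule."""
       values = sorted(data)
       n = len(values)
       if n == 0:
           raise ValueError("no median for empty data")
       mid = n // 2
       if n % 2 == 1:
           # Odd number of points: the middle value itself.
           return values[mid]
       # Even number of points: average the two middle values.
       return (values[mid - 1] + values[mid]) / 2

With the data from the examples above, ``median_sketch([1, 3, 5])`` picks the
middle point, and ``median_sketch([1, 3, 5, 7])`` averages 3 and 5.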

:func:`median_low`
~~~~~~~~~~~~~~~~~~

The :func:`median_low` function calculates the low median without
interpolation.

.. function:: median_low(data)

   Return the low median of numeric data.

   The low median is always a member of the data set. When the number
   of data points is odd, the middle value is returned. When it is
   even, the smaller of the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median
   to be an actual data point rather than interpolated.

   If *data* is empty, :exc:`StatisticsError` is raised.

:func:`median_high`
~~~~~~~~~~~~~~~~~~~

The :func:`median_high` function calculates the high median without
interpolation.

.. function:: median_high(data)

   Return the high median of numeric data.

   The high median is always a member of the data set. When the number of
   data points is odd, the middle value is returned. When it is even, the
   larger of the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median
   to be an actual data point rather than interpolated.

   If *data* is empty, :exc:`StatisticsError` is raised.
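
The low and high medians differ only in which of the two middle positions is
chosen when the number of data points is even. A minimal sketch, assuming the
hypothetical helper names ``median_low_sketch`` and ``median_high_sketch``:

.. code-block:: python

   def median_low_sketch(data):
       """Illustrative low median: the smaller of the two middle values."""
       values = sorted(data)
       if not values:
           raise ValueError("no median for empty data")
       # For odd n this is the middle index; for even n, the lower middle.
       return values[(len(values) - 1) // 2]


   def median_high_sketch(data):
       """Illustrative high median: the larger of the two middle values."""
       values = sorted(data)
       if not values:
           raise ValueError("no median for empty data")
       # For odd n this is the middle index; for even n, the upper middle.
       return values[len(values) // 2]

Both helpers always return an element of the input, which is the property
that distinguishes them from the interpolating :func:`median`.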

:func:`median_grouped`
~~~~~~~~~~~~~~~~~~~~~~

The :func:`median_grouped` function calculates the median of grouped data
as the 50th percentile, using interpolation.

.. function:: median_grouped(data [, interval])

   Return the median of grouped continuous data, calculated as the
   50th percentile.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value
   represents the midpoint of data classes, e.g. 1 is the midpoint of the
   class 0.5-1.5, 2 is the midpoint of 1.5-2.5, 3 is the midpoint of
   2.5-3.5, etc. With the data given, the middle value falls somewhere in
   the class 3.5-4.5, and interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and
   defaults to 1. Changing the class interval naturally will change the
   interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data
      points to floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter
        and Larry B Wallnau (8th Edition).

      * Calculating the `median <http://www.ualberta.ca/~opscan/median.html>`_.

      * The `SSMEDIAN <https://projects.gnome.org/gnumeric/doc/gnumeric-function-SSMEDIAN.shtml>`_
        function in the Gnome Gnumeric spreadsheet, including
        `this discussion <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.

   If *data* is empty, :exc:`StatisticsError` is raised.
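
The interpolation can be sketched with the usual grouped-median formula
``L + interval * (n/2 - cf) / f``, where ``L`` is the lower limit of the class
containing the median, ``cf`` is the number of points below that class and
``f`` is the number of points inside it. The helper below is an assumption
made for illustration (the name ``median_grouped_sketch`` is hypothetical),
not the module's implementation, although it reproduces the examples above.

.. code-block:: python

   def median_grouped_sketch(data, interval=1):
       """Illustrative grouped median: 50th percentile by interpolation."""
       values = sorted(data)
       n = len(values)
       if n == 0:
           raise ValueError("no median for empty data")
       x = values[n // 2]                       # midpoint of the median class
       lower = x - interval / 2                 # L: lower limit of that class
       below = sum(1 for v in values if v < x)  # cf: points below the class
       within = values.count(x)                 # f: points inside the class
       return lower + interval * (n / 2 - below) / within

For ``[1, 2, 2, 3, 4, 4, 4, 4, 4, 5]`` the median class is 3.5-4.5, with
``n = 10``, ``cf = 4`` and ``f = 5``, so the estimate is
``3.5 + 1 * (5 - 4) / 5``, which matches the 3.7 shown in the example above.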

:func:`mode`
~~~~~~~~~~~~

The :func:`mode` function calculates the mode, or most common element, of
discrete or nominal data. The mode (when it exists) is the most typical
value, and is a robust measure of central location.

.. function:: mode(data)

   Return the most common data point from discrete or nominal data.

   ``mode`` assumes discrete data, and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this module
   which also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   If *data* is empty, or if there is not exactly one most common value,
   :exc:`StatisticsError` is raised.
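
The "exactly one most common value" rule can be sketched with
:class:`collections.Counter`. The helper name ``mode_sketch`` is hypothetical,
and the code only mirrors the behaviour described in this section; it is not
the module's source.

.. code-block:: python

   from collections import Counter
   from statistics import StatisticsError


   def mode_sketch(data):
       """Illustrative mode: the single most common value, or an error."""
       # Ask for the two most frequent values so that ties can be detected.
       top = Counter(data).most_common(2)
       if not top:
           raise StatisticsError("no mode for empty data")
       if len(top) == 2 and top[0][1] == top[1][1]:
           raise StatisticsError("no unique mode; found equally common values")
       return top[0][0]

Because :exc:`StatisticsError` is a subclass of :exc:`ValueError`, a caller
that wants a fallback for tied or empty data can catch either exception.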

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================

:func:`pstdev`
~~~~~~~~~~~~~~

The :func:`pstdev` function calculates the standard deviation of a
population. The standard deviation is equivalent to the square root of
the variance.

.. function:: pstdev(data [, mu])

   Return the square root of the population variance. See :func:`pvariance`
   for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251
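
The relationship between the two population functions can be checked
directly. This short sketch reuses the data from the example above; the
variable names are chosen for illustration only.

.. code-block:: python

   import math
   from statistics import pstdev, pvariance

   data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

   sigma = pstdev(data)            # population standard deviation
   sigma_squared = pvariance(data) # population variance

   # Squaring the standard deviation recovers the variance, up to
   # floating-point rounding.
   matches = math.isclose(sigma ** 2, sigma_squared)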

:func:`pvariance`
~~~~~~~~~~~~~~~~~

The :func:`pvariance` function calculates the variance of a population.
Variance, or second moment about the mean, is a measure of the variability
(spread or dispersion) of data. A large variance indicates that the data is
spread out; a small variance indicates it is clustered closely around the
mean.

.. function:: pvariance(data [, mu])

   Return the population variance of *data*, a non-empty iterable of
   real-valued numbers.

   If the optional second argument *mu* is given, it should be the mean
   of *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function to calculate the variance from the entire population.
   To estimate the variance from a sample, the :func:`variance` function is
   usually a better choice.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass
   it as the optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   This function does not attempt to verify that you have passed the actual
   mean as *mu*. Using arbitrary values for *mu* may lead to invalid or
   impossible results.

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population
      variance σ². When called on a sample instead, this is the biased
      sample variance s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument. Provided the data points are
      representative (e.g. independent and identically distributed), the
      result will be an unbiased estimate of the population variance.

   Raises :exc:`StatisticsError` if *data* is empty.
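
As a minimal sketch of the definition, the population variance is the mean
of the squared deviations from *mu*, dividing by the number of data points N.
The helper name ``pvariance_sketch`` is hypothetical, and this naive formula
is an assumption made for illustration; it makes no claim about how the
module itself performs the computation.

.. code-block:: python

   from fractions import Fraction


   def pvariance_sketch(data, mu=None):
       """Illustrative population variance: squared deviations divided by N."""
       values = list(data)
       n = len(values)
       if n < 1:
           raise ValueError("pvariance_sketch requires at least one data point")
       if mu is None:
           mu = sum(values) / n  # compute the mean when it is not supplied
       return sum((x - mu) ** 2 for x in values) / n


   floats = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
   fractions_data = [Fraction(1, 4), Fraction(5, 4), Fraction(1, 2)]

   float_result = pvariance_sketch(floats)
   fraction_result = pvariance_sketch(fractions_data)

Applied to the data from the examples above, the sketch agrees with
:func:`pvariance`, and with :class:`~fractions.Fraction` input the arithmetic
stays exact.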

:func:`stdev`
~~~~~~~~~~~~~

The :func:`stdev` function calculates the standard deviation of a sample.
The standard deviation is equivalent to the square root of the variance.

.. function:: stdev(data [, xbar])

   Return the square root of the sample variance. See :func:`variance` for
   arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827

:func:`variance`
~~~~~~~~~~~~~~~~

The :func:`variance` function calculates the variance of a sample. Variance,
or second moment about the mean, is a measure of the variability (spread or
dispersion) of data. A large variance indicates that the data is spread out;
a small variance indicates it is clustered closely around the mean.

.. function:: variance(data [, xbar])

   Return the sample variance of *data*, an iterable of at least two
   real-valued numbers.

   If the optional second argument *xbar* is given, it should be the mean
   of *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To
   calculate the variance from the entire population, see :func:`pvariance`.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass
   it as the optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual
   mean as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known
      as variance with N-1 degrees of freedom. Provided that the data
      points are representative (e.g. independent and identically
      distributed), the result should be an unbiased estimate of the true
      population variance.

      If you somehow know the actual population mean μ, you should pass it
      to the :func:`pvariance` function as the *mu* parameter to get
      the variance of a sample.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.
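
Bessel's correction amounts to dividing the sum of squared deviations by
N-1 instead of N. The sketch below makes that explicit; ``variance_sketch``
is a hypothetical helper written for illustration, and the naive formula it
uses is an assumption, not a description of the module's internals.

.. code-block:: python

   from fractions import Fraction


   def variance_sketch(data, xbar=None):
       """Illustrative sample variance with Bessel's correction (N-1)."""
       values = list(data)
       n = len(values)
       if n < 2:
           raise ValueError("variance_sketch requires at least two data points")
       if xbar is None:
           xbar = sum(values) / n  # compute the sample mean when not supplied
       squared_deviations = sum((x - xbar) ** 2 for x in values)
       return squared_deviations / (n - 1)  # N-1 rather than N


   sample = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
   sample_result = variance_sketch(sample)
   exact_result = variance_sketch([Fraction(1, 6), Fraction(1, 2), Fraction(5, 3)])

Dividing by N-1 makes the result slightly larger than the population formula
would give for the same data, by a factor of N/(N-1).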

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.

..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;