blob: bc841fda72f887fb8cd40a7d9fe8d3bdac68f68f [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
5 :synopsis: mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
22numeric (:class:`Real`-valued) data.
23
Nick Coghlan73afe2a2014-02-08 19:58:04 +100024.. note::
25
26 Unless explicitly noted otherwise, these functions support :class:`int`,
27 :class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
28 Behaviour with other types (whether in the numeric tower or not) is
29 currently unsupported. Mixed types are also undefined and
30 implementation-dependent. If your input data consists of mixed types,
31 you may be able to use :func:`map` to ensure a consistent result, e.g.
32 ``map(float, input_data)``.
33
Larry Hastingsf5e987b2013-10-19 11:50:09 -070034Averages and measures of central location
35-----------------------------------------
36
37These functions calculate an average or typical value from a population
38or sample.
39
Raymond Hettingerfc06a192019-03-12 00:43:27 -070040======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070041:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080042:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger6463ba32019-04-07 09:20:03 -070043:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano22873182016-08-24 02:34:25 +100044:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070045:func:`median` Median (middle value) of data.
46:func:`median_low` Low median of data.
47:func:`median_high` High median of data.
48:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070049:func:`mode` Single mode (most common value) of discrete or nominal data.
50:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -070051:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070052======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070053
Georg Brandleb2aeec2013-10-21 08:57:26 +020054Measures of spread
55------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070056
Georg Brandleb2aeec2013-10-21 08:57:26 +020057These functions calculate a measure of how much the population or sample
58tends to deviate from the typical or average values.
59
60======================= =============================================
61:func:`pstdev` Population standard deviation of data.
62:func:`pvariance` Population variance of data.
63:func:`stdev` Sample standard deviation of data.
64:func:`variance` Sample variance of data.
65======================= =============================================
66
67
68Function details
69----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070070
Georg Brandle051b552013-11-04 07:30:50 +010071Note: The functions do not require the data given to them to be sorted.
72However, for reading convenience, most of the examples show sorted sequences.
73
Larry Hastingsf5e987b2013-10-19 11:50:09 -070074.. function:: mean(data)
75
Raymond Hettinger6da90782016-11-21 16:31:02 -080076 Return the sample arithmetic mean of *data* which can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070077
Georg Brandleb2aeec2013-10-21 08:57:26 +020078 The arithmetic mean is the sum of the data divided by the number of data
79 points. It is commonly called "the average", although it is only one of many
80 different mathematical averages. It is a measure of the central location of
81 the data.
82
83 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070084
85 Some examples of use:
86
87 .. doctest::
88
89 >>> mean([1, 2, 3, 4, 4])
90 2.8
91 >>> mean([-1.0, 2.5, 3.25, 5.75])
92 2.625
93
94 >>> from fractions import Fraction as F
95 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
96 Fraction(13, 21)
97
98 >>> from decimal import Decimal as D
99 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
100 Decimal('0.5625')
101
102 .. note::
103
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200104 The mean is strongly affected by outliers and is not a robust estimator
Georg Brandleb2aeec2013-10-21 08:57:26 +0200105 for central location: the mean is not necessarily a typical example of the
106 data points. For more robust, although less efficient, measures of
107 central location, see :func:`median` and :func:`mode`. (In this case,
108 "efficient" refers to statistical efficiency rather than computational
109 efficiency.)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700110
Georg Brandleb2aeec2013-10-21 08:57:26 +0200111 The sample mean gives an unbiased estimate of the true population mean,
112 which means that, taken on average over all the possible samples,
113 ``mean(sample)`` converges on the true mean of the entire population. If
114 *data* represents the entire population rather than a sample, then
115 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700116
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700117
Raymond Hettinger47d99872019-02-21 15:06:29 -0800118.. function:: fmean(data)
119
120 Convert *data* to floats and compute the arithmetic mean.
121
122 This runs faster than the :func:`mean` function and it always returns a
123 :class:`float`. The result is highly accurate but not as perfect as
124 :func:`mean`. If the input dataset is empty, raises a
125 :exc:`StatisticsError`.
126
127 .. doctest::
128
129 >>> fmean([3.5, 4.0, 5.25])
130 4.25
131
132 .. versionadded:: 3.8
133
134
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700135.. function:: geometric_mean(data)
136
137 Convert *data* to floats and compute the geometric mean.
138
139 Raises a :exc:`StatisticsError` if the input dataset is empty,
140 if it contains a zero, or if it contains a negative value.
141
142 No special efforts are made to achieve exact results.
143 (However, this may change in the future.)
144
145 .. doctest::
146
147 >>> round(geometric_mean([54, 24, 36]), 9)
148 36.0
149
150 .. versionadded:: 3.8
151
152
Steven D'Aprano22873182016-08-24 02:34:25 +1000153.. function:: harmonic_mean(data)
154
155 Return the harmonic mean of *data*, a sequence or iterator of
156 real-valued numbers.
157
158 The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Warec019bd32016-08-23 13:23:31 -0500159 reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano22873182016-08-24 02:34:25 +1000160 data. For example, the harmonic mean of three values *a*, *b* and *c*
161 will be equivalent to ``3/(1/a + 1/b + 1/c)``.
162
163 The harmonic mean is a type of average, a measure of the central
164 location of the data. It is often appropriate when averaging quantities
165 which are rates or ratios, for example speeds. For example:
166
167 Suppose an investor purchases an equal value of shares in each of
168 three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
169 What is the average P/E ratio for the investor's portfolio?
170
171 .. doctest::
172
173 >>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
174 3.6
175
176 Using the arithmetic mean would give an average of about 5.167, which
177 is too high.
178
Zachary Warec019bd32016-08-23 13:23:31 -0500179 :exc:`StatisticsError` is raised if *data* is empty, or any element
Steven D'Aprano22873182016-08-24 02:34:25 +1000180 is less than zero.
181
Zachary Warec019bd32016-08-23 13:23:31 -0500182 .. versionadded:: 3.6
183
Steven D'Aprano22873182016-08-24 02:34:25 +1000184
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700185.. function:: median(data)
186
Georg Brandleb2aeec2013-10-21 08:57:26 +0200187 Return the median (middle value) of numeric data, using the common "mean of
188 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger6da90782016-11-21 16:31:02 -0800189 *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700190
Georg Brandleb2aeec2013-10-21 08:57:26 +0200191 The median is a robust measure of central location, and is less affected by
192 the presence of outliers in your data. When the number of data points is
193 odd, the middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700194
195 .. doctest::
196
197 >>> median([1, 3, 5])
198 3
199
Georg Brandleb2aeec2013-10-21 08:57:26 +0200200 When the number of data points is even, the median is interpolated by taking
201 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700202
203 .. doctest::
204
205 >>> median([1, 3, 5, 7])
206 4.0
207
Georg Brandleb2aeec2013-10-21 08:57:26 +0200208 This is suited for when your data is discrete, and you don't mind that the
209 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700210
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300211 If your data is ordinal (supports order operations) but not numeric (doesn't
212 support addition), you should use :func:`median_low` or :func:`median_high`
213 instead.
214
Berker Peksag9c1dba22014-09-28 00:00:58 +0300215 .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700216
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700217
218.. function:: median_low(data)
219
Georg Brandleb2aeec2013-10-21 08:57:26 +0200220 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger6da90782016-11-21 16:31:02 -0800221 :exc:`StatisticsError` is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700222
Georg Brandleb2aeec2013-10-21 08:57:26 +0200223 The low median is always a member of the data set. When the number of data
224 points is odd, the middle value is returned. When it is even, the smaller of
225 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700226
227 .. doctest::
228
229 >>> median_low([1, 3, 5])
230 3
231 >>> median_low([1, 3, 5, 7])
232 3
233
Georg Brandleb2aeec2013-10-21 08:57:26 +0200234 Use the low median when your data are discrete and you prefer the median to
235 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700236
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700237
238.. function:: median_high(data)
239
Georg Brandleb2aeec2013-10-21 08:57:26 +0200240 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800241 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700242
Georg Brandleb2aeec2013-10-21 08:57:26 +0200243 The high median is always a member of the data set. When the number of data
244 points is odd, the middle value is returned. When it is even, the larger of
245 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700246
247 .. doctest::
248
249 >>> median_high([1, 3, 5])
250 3
251 >>> median_high([1, 3, 5, 7])
252 5
253
Georg Brandleb2aeec2013-10-21 08:57:26 +0200254 Use the high median when your data are discrete and you prefer the median to
255 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700256
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700257
Georg Brandleb2aeec2013-10-21 08:57:26 +0200258.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700259
Georg Brandleb2aeec2013-10-21 08:57:26 +0200260 Return the median of grouped continuous data, calculated as the 50th
261 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger6da90782016-11-21 16:31:02 -0800262 is raised. *data* can be a sequence or iterator.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700263
264 .. doctest::
265
266 >>> median_grouped([52, 52, 53, 54])
267 52.5
268
Georg Brandleb2aeec2013-10-21 08:57:26 +0200269 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200270 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
271 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
272 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200273 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700274
275 .. doctest::
276
277 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
278 3.7
279
Georg Brandleb2aeec2013-10-21 08:57:26 +0200280 Optional argument *interval* represents the class interval, and defaults
281 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700282
283 .. doctest::
284
285 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
286 3.25
287 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
288 3.5
289
290 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200291 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700292
293 .. impl-detail::
294
Georg Brandleb2aeec2013-10-21 08:57:26 +0200295 Under some circumstances, :func:`median_grouped` may coerce data points to
296 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700297
298 .. seealso::
299
Georg Brandleb2aeec2013-10-21 08:57:26 +0200300 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
301 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700302
Georg Brandleb2aeec2013-10-21 08:57:26 +0200303 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100304 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200305 function in the Gnome Gnumeric spreadsheet, including `this discussion
306 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700307
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700308
309.. function:: mode(data)
310
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700311 Return the single most common data point from discrete or nominal *data*.
312 The mode (when it exists) is the most typical value and serves as a
313 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200314
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700315 If there are multiple modes, returns the first one encountered in the *data*.
316 If *data* is empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700317
318 ``mode`` assumes discrete data, and returns a single value. This is the
319 standard treatment of the mode as commonly taught in schools:
320
321 .. doctest::
322
323 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
324 3
325
326 The mode is unique in that it is the only statistic which also applies
327 to nominal (non-numeric) data:
328
329 .. doctest::
330
331 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
332 'red'
333
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700334 .. versionchanged:: 3.8
335 Now handles multimodal datasets by returning the first mode encountered.
336 Formerly, it raised :exc:`StatisticsError` when more than one mode was
337 found.
338
339
340.. function:: multimode(data)
341
342 Return a list of the most frequently occurring values in the order they
343 were first encountered in the *data*. Will return more than one result if
344 there are multiple modes or an empty list if the *data* is empty:
345
346 .. doctest::
347
348 >>> multimode('aabbbbccddddeeffffgg')
349 ['b', 'd', 'f']
350 >>> multimode('')
351 []
352
353 .. versionadded:: 3.8
354
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700355
Georg Brandleb2aeec2013-10-21 08:57:26 +0200356.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700357
Georg Brandleb2aeec2013-10-21 08:57:26 +0200358 Return the population standard deviation (the square root of the population
359 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700360
361 .. doctest::
362
363 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
364 0.986893273527251
365
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700366
Georg Brandleb2aeec2013-10-21 08:57:26 +0200367.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700368
Georg Brandleb2aeec2013-10-21 08:57:26 +0200369 Return the population variance of *data*, a non-empty iterable of real-valued
370 numbers. Variance, or second moment about the mean, is a measure of the
371 variability (spread or dispersion) of data. A large variance indicates that
372 the data is spread out; a small variance indicates it is clustered closely
373 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700374
Georg Brandleb2aeec2013-10-21 08:57:26 +0200375 If the optional second argument *mu* is given, it should be the mean of
376 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700377 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700378
Georg Brandleb2aeec2013-10-21 08:57:26 +0200379 Use this function to calculate the variance from the entire population. To
380 estimate the variance from a sample, the :func:`variance` function is usually
381 a better choice.
382
383 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700384
385 Examples:
386
387 .. doctest::
388
389 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
390 >>> pvariance(data)
391 1.25
392
Georg Brandleb2aeec2013-10-21 08:57:26 +0200393 If you have already calculated the mean of your data, you can pass it as the
394 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700395
396 .. doctest::
397
398 >>> mu = mean(data)
399 >>> pvariance(data, mu)
400 1.25
401
Georg Brandleb2aeec2013-10-21 08:57:26 +0200402 This function does not attempt to verify that you have passed the actual mean
403 as *mu*. Using arbitrary values for *mu* may lead to invalid or impossible
404 results.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700405
406 Decimals and Fractions are supported:
407
408 .. doctest::
409
410 >>> from decimal import Decimal as D
411 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
412 Decimal('24.815')
413
414 >>> from fractions import Fraction as F
415 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
416 Fraction(13, 72)
417
418 .. note::
419
Georg Brandleb2aeec2013-10-21 08:57:26 +0200420 When called with the entire population, this gives the population variance
421 σ². When called on a sample instead, this is the biased sample variance
422 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700423
Georg Brandleb2aeec2013-10-21 08:57:26 +0200424 If you somehow know the true population mean μ, you may use this function
425 to calculate the variance of a sample, giving the known population mean as
426 the second argument. Provided the data points are representative
427 (e.g. independent and identically distributed), the result will be an
428 unbiased estimate of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700429
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700430
Georg Brandleb2aeec2013-10-21 08:57:26 +0200431.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700432
Georg Brandleb2aeec2013-10-21 08:57:26 +0200433 Return the sample standard deviation (the square root of the sample
434 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700435
436 .. doctest::
437
438 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
439 1.0810874155219827
440
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700441
Georg Brandleb2aeec2013-10-21 08:57:26 +0200442.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700443
Georg Brandleb2aeec2013-10-21 08:57:26 +0200444 Return the sample variance of *data*, an iterable of at least two real-valued
445 numbers. Variance, or second moment about the mean, is a measure of the
446 variability (spread or dispersion) of data. A large variance indicates that
447 the data is spread out; a small variance indicates it is clustered closely
448 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700449
Georg Brandleb2aeec2013-10-21 08:57:26 +0200450 If the optional second argument *xbar* is given, it should be the mean of
451 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700452 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700453
Georg Brandleb2aeec2013-10-21 08:57:26 +0200454 Use this function when your data is a sample from a population. To calculate
455 the variance from the entire population, see :func:`pvariance`.
456
457 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700458
459 Examples:
460
461 .. doctest::
462
463 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
464 >>> variance(data)
465 1.3720238095238095
466
Georg Brandleb2aeec2013-10-21 08:57:26 +0200467 If you have already calculated the mean of your data, you can pass it as the
468 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700469
470 .. doctest::
471
472 >>> m = mean(data)
473 >>> variance(data, m)
474 1.3720238095238095
475
Georg Brandleb2aeec2013-10-21 08:57:26 +0200476 This function does not attempt to verify that you have passed the actual mean
477 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700478 impossible results.
479
480 Decimal and Fraction values are supported:
481
482 .. doctest::
483
484 >>> from decimal import Decimal as D
485 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
486 Decimal('31.01875')
487
488 >>> from fractions import Fraction as F
489 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
490 Fraction(67, 108)
491
492 .. note::
493
Georg Brandleb2aeec2013-10-21 08:57:26 +0200494 This is the sample variance s² with Bessel's correction, also known as
495 variance with N-1 degrees of freedom. Provided that the data points are
496 representative (e.g. independent and identically distributed), the result
497 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700498
Georg Brandleb2aeec2013-10-21 08:57:26 +0200499 If you somehow know the actual population mean μ you should pass it to the
500 :func:`pvariance` function as the *mu* parameter to get the variance of a
501 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700502
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700503.. function:: quantiles(dist, *, n=4, method='exclusive')
504
505 Divide *dist* into *n* continuous intervals with equal probability.
506 Returns a list of ``n - 1`` cut points separating the intervals.
507
508 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
509 *n* to 100 for percentiles which gives the 99 cuts points that separate
510 *dist* in to 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
511 is not least 1.
512
513 The *dist* can be any iterable containing sample data or it can be an
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700514 instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
515 results, the number of data points in *dist* should be larger than *n*.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700516 Raises :exc:`StatisticsError` if there are not at least two data points.
517
518 For sample data, the cut points are linearly interpolated from the
519 two nearest data points. For example, if a cut point falls one-third
520 of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700521 cut-point will evaluate to ``104``.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700522
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700523 The *method* for computing quantiles can be varied depending on
524 whether the data in *dist* includes or excludes the lowest and
525 highest possible values from the population.
526
527 The default *method* is "exclusive" and is used for data sampled from
528 a population that can have more extreme values than found in the
529 samples. The portion of the population falling below the *i-th* of
530 *m* data points is computed as ``i / (m + 1)``.
531
532 Setting the *method* to "inclusive" is used for describing population
533 data or for samples that include the extreme points. The minimum
534 value in *dist* is treated as the 0th percentile and the maximum
535 value is treated as the 100th percentile. The portion of the
536 population falling below the *i-th* of *m* data points is computed as
537 ``(i - 1) / (m - 1)``.
538
539 If *dist* is an instance of a class that defines an
540 :meth:`~inv_cdf` method, setting *method* has no effect.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700541
542 .. doctest::
543
544 # Decile cut points for empirically sampled data
545 >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
546 ... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
547 ... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
548 ... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
549 ... 103, 107, 101, 81, 109, 104]
550 >>> [round(q, 1) for q in quantiles(data, n=10)]
551 [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
552
553 >>> # Quartile cut points for the standard normal distibution
554 >>> Z = NormalDist()
555 >>> [round(q, 4) for q in quantiles(Z, n=4)]
556 [-0.6745, 0.0, 0.6745]
557
558 .. versionadded:: 3.8
559
560
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700561Exceptions
562----------
563
564A single exception is defined:
565
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400566.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700567
Benjamin Peterson44c30652013-10-20 17:52:09 -0400568 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700569
Raymond Hettinger11c79532019-02-23 14:44:07 -0800570
571:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700572---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800573
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800574:class:`NormalDist` is a tool for creating and manipulating normal
575distributions of a `random variable
576<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
577composite class that treats the mean and standard deviation of data
578measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800579
580Normal distributions arise from the `Central Limit Theorem
581<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800582of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800583
584.. class:: NormalDist(mu=0.0, sigma=1.0)
585
586 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800587 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800588 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800589 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800590
591 If *sigma* is negative, raises :exc:`StatisticsError`.
592
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800593 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800594
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800595 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800596 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
597 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800598
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800599 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800600
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800601 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800602 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
603 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800604
605 .. attribute:: variance
606
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800607 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800608 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
609 distribution. Equal to the square of the standard deviation.
610
611 .. classmethod:: NormalDist.from_samples(data)
612
Raymond Hettingercc353a02019-03-10 23:43:33 -0700613 Makes a normal distribution instance computed from sample data. The
614 *data* can be any :term:`iterable` and should consist of values that
615 can be converted to type :class:`float`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800616
617 If *data* does not contain at least two elements, raises
618 :exc:`StatisticsError` because it takes at least one point to estimate
619 a central value and at least two points to estimate dispersion.
620
Raymond Hettingerfb8c7d52019-04-23 01:46:18 -0700621 .. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800622
623 Generates *n* random samples for a given mean and standard deviation.
624 Returns a :class:`list` of :class:`float` values.
625
626 If *seed* is given, creates a new instance of the underlying random
627 number generator. This is useful for creating reproducible results,
628 even in a multi-threading context.
629
630 .. method:: NormalDist.pdf(x)
631
632 Using a `probability density function (pdf)
633 <https://en.wikipedia.org/wiki/Probability_density_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800634 compute the relative likelihood that a random variable *X* will be near
Raymond Hettinger11c79532019-02-23 14:44:07 -0800635 the given value *x*. Mathematically, it is the ratio ``P(x <= X <
636 x+dx) / dx``.
637
Raymond Hettingercc353a02019-03-10 23:43:33 -0700638 The relative likelihood is computed as the probability of a sample
639 occurring in a narrow range divided by the width of the range (hence
640 the word "density"). Since the likelihood is relative to other points,
641 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800642
643 .. method:: NormalDist.cdf(x)
644
645 Using a `cumulative distribution function (cdf)
646 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800647 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800648 equal to *x*. Mathematically, it is written ``P(X <= x)``.
649
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700650 .. method:: NormalDist.inv_cdf(p)
651
652 Compute the inverse cumulative distribution function, also known as the
653 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
654 or the `percent-point
655 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
656 function. Mathematically, it is written ``x : P(X <= x) = p``.
657
658 Finds the value *x* of the random variable *X* such that the
659 probability of the variable being less than or equal to that value
660 equals the given probability *p*.
661
Raymond Hettinger318d5372019-03-06 22:59:40 -0800662 .. method:: NormalDist.overlap(other)
663
664 Compute the `overlapping coefficient (OVL)
665 <http://www.iceaaonline.com/ready/wp-content/uploads/2014/06/MM-9-Presentation-Meet-the-Overlapping-Coefficient-A-Measure-for-Elevator-Speeches.pdf>`_
Raymond Hettinger14bab7a2019-03-07 08:54:31 -0800666 between two normal distributions, giving a measure of agreement.
667 Returns a value between 0.0 and 1.0 giving `the overlapping area for
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700668 the two probability density functions
Raymond Hettinger14bab7a2019-03-07 08:54:31 -0800669 <https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800670
Raymond Hettinger11c79532019-02-23 14:44:07 -0800671 Instances of :class:`NormalDist` support addition, subtraction,
672 multiplication and division by a constant. These operations
673 are used for translation and scaling. For example:
674
675 .. doctest::
676
677 >>> temperature_february = NormalDist(5, 2.5) # Celsius
678 >>> temperature_february * (9/5) + 32 # Fahrenheit
679 NormalDist(mu=41.0, sigma=4.5)
680
Raymond Hettingercc353a02019-03-10 23:43:33 -0700681 Dividing a constant by an instance of :class:`NormalDist` is not supported
682 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800683
684 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800685 variables, it is possible to `add and subtract two independent normally
686 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800687 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
688 represented as instances of :class:`NormalDist`. For example:
689
690 .. doctest::
691
692 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
693 >>> drug_effects = NormalDist(0.4, 0.15)
694 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700695 >>> round(combined.mean, 1)
696 3.1
697 >>> round(combined.stdev, 1)
698 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800699
700 .. versionadded:: 3.8
701
702
703:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700704^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800705
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800706:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800707
708For example, given `historical data for SAT exams
709<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800710are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700711determine the percentage of students with test scores between 1100 and
7121200, after rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800713
714.. doctest::
715
716 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800717 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700718 >>> round(fraction * 100.0, 1)
719 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800720
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700721Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
722<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
723
724.. doctest::
725
726 >>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
727 [928, 1060, 1192]
728 >>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
729 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
730
Raymond Hettinger318d5372019-03-06 22:59:40 -0800731What percentage of men and women will have the same height in `two normally
732distributed populations with known means and standard deviations
733<http://www.usablestats.com/lessons/normal>`_?
734
735 >>> men = NormalDist(70, 4)
736 >>> women = NormalDist(65, 3.5)
737 >>> ovl = men.overlap(women)
738 >>> round(ovl * 100.0, 1)
739 50.3
740
Raymond Hettinger11c79532019-02-23 14:44:07 -0800741To estimate the distribution for a model than isn't easy to solve
742analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700743Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800744
745.. doctest::
746
Raymond Hettingercc353a02019-03-10 23:43:33 -0700747 >>> def model(x, y, z):
748 ... return (3*x + 7*x*y - 5*y) / (11 * z)
749 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800750 >>> n = 100_000
Raymond Hettingercc353a02019-03-10 23:43:33 -0700751 >>> X = NormalDist(10, 2.5).samples(n)
752 >>> Y = NormalDist(15, 1.75).samples(n)
753 >>> Z = NormalDist(5, 1.25).samples(n)
754 >>> NormalDist.from_samples(map(model, X, Y, Z)) # doctest: +SKIP
755 NormalDist(mu=19.640137307085507, sigma=47.03273142191088)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800756
757Normal distributions commonly arise in machine learning problems.
758
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800759Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800760<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
761The challenge is to predict a person's gender from measurements of normally
762distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800763
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800764We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800765measurements are assumed to be normally distributed, so we summarize the data
766with :class:`NormalDist`:
767
768.. doctest::
769
770 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
771 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
772 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
773 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
774 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
775 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
776
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800777Next, we encounter a new person whose feature measurements are known but whose
778gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800779
780.. doctest::
781
782 >>> ht = 6.0 # height
783 >>> wt = 130 # weight
784 >>> fs = 8 # foot size
785
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800786Starting with a 50% `prior probability
787<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
788we compute the posterior as the prior times the product of likelihoods for the
789feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800790
791.. doctest::
792
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800793 >>> prior_male = 0.5
794 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800795 >>> posterior_male = (prior_male * height_male.pdf(ht) *
796 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
797
798 >>> posterior_female = (prior_female * height_female.pdf(ht) *
799 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
800
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800801The final prediction goes to the largest posterior. This is known as the
802`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800803<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
804
805.. doctest::
806
807 >>> 'male' if posterior_male > posterior_female else 'female'
808 'female'
809
810
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700811..
812 # This modelines must appear within the last ten lines of the file.
813 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;