blob: 3c3f9d2df585852990f4e626021c8def8b6eaab2 [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
Sanchit Khuranaf8a63162019-11-26 03:47:59 +05305 :synopsis: Mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070022numeric (:class:`~numbers.Real`-valued) data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070023
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070024The module is not intended to be a competitor to third-party libraries such
25as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
26proprietary full-featured statistics packages aimed at professional
27statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
28graphing and scientific calculators.
Nick Coghlan73afe2a2014-02-08 19:58:04 +100029
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070030Unless explicitly noted, these functions support :class:`int`,
31:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
32Behaviour with other types (whether in the numeric tower or not) is
33currently unsupported. Collections with a mix of types are also undefined
34and implementation-dependent. If your input data consists of mixed types,
35you may be able to use :func:`map` to ensure a consistent result, for
36example: ``map(float, input_data)``.
Nick Coghlan73afe2a2014-02-08 19:58:04 +100037
Larry Hastingsf5e987b2013-10-19 11:50:09 -070038Averages and measures of central location
39-----------------------------------------
40
41These functions calculate an average or typical value from a population
42or sample.
43
Raymond Hettingerfc06a192019-03-12 00:43:27 -070044======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070045:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080046:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger6463ba32019-04-07 09:20:03 -070047:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano22873182016-08-24 02:34:25 +100048:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070049:func:`median` Median (middle value) of data.
50:func:`median_low` Low median of data.
51:func:`median_high` High median of data.
52:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070053:func:`mode` Single mode (most common value) of discrete or nominal data.
Zackery Spytzf2b45362021-03-13 18:00:28 -070054:func:`multimode` List of modes (most common values) of discrete or nominal data.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -070055:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070056======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070057
Georg Brandleb2aeec2013-10-21 08:57:26 +020058Measures of spread
59------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070060
Georg Brandleb2aeec2013-10-21 08:57:26 +020061These functions calculate a measure of how much the population or sample
62tends to deviate from the typical or average values.
63
64======================= =============================================
65:func:`pstdev` Population standard deviation of data.
66:func:`pvariance` Population variance of data.
67:func:`stdev` Sample standard deviation of data.
68:func:`variance` Sample variance of data.
69======================= =============================================
70
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +020071Statistics for relations between two inputs
72-------------------------------------------
73
74These functions calculate statistics regarding relations between two inputs.
75
76========================= =====================================================
77:func:`covariance` Sample covariance for two variables.
78:func:`correlation` Pearson's correlation coefficient for two variables.
Miss Islington (bot)86779872021-05-24 18:11:12 -070079:func:`linear_regression` Slope and intercept for simple linear regression.
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +020080========================= =====================================================
81
Georg Brandleb2aeec2013-10-21 08:57:26 +020082
83Function details
84----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070085
Georg Brandle051b552013-11-04 07:30:50 +010086Note: The functions do not require the data given to them to be sorted.
87However, for reading convenience, most of the examples show sorted sequences.
88
Larry Hastingsf5e987b2013-10-19 11:50:09 -070089.. function:: mean(data)
90
Raymond Hettinger733b9a32019-11-11 23:35:06 -080091 Return the sample arithmetic mean of *data* which can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070092
Georg Brandleb2aeec2013-10-21 08:57:26 +020093 The arithmetic mean is the sum of the data divided by the number of data
94 points. It is commonly called "the average", although it is only one of many
95 different mathematical averages. It is a measure of the central location of
96 the data.
97
98 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070099
100 Some examples of use:
101
102 .. doctest::
103
104 >>> mean([1, 2, 3, 4, 4])
105 2.8
106 >>> mean([-1.0, 2.5, 3.25, 5.75])
107 2.625
108
109 >>> from fractions import Fraction as F
110 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
111 Fraction(13, 21)
112
113 >>> from decimal import Decimal as D
114 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
115 Decimal('0.5625')
116
117 .. note::
118
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200119 The mean is strongly affected by outliers and is not a robust estimator
Raymond Hettingere4810b22019-09-05 00:18:47 -0700120 for central location: the mean is not necessarily a typical example of
121 the data points. For more robust measures of central location, see
122 :func:`median` and :func:`mode`.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700123
Georg Brandleb2aeec2013-10-21 08:57:26 +0200124 The sample mean gives an unbiased estimate of the true population mean,
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700125 so that when taken on average over all the possible samples,
Georg Brandleb2aeec2013-10-21 08:57:26 +0200126 ``mean(sample)`` converges on the true mean of the entire population. If
127 *data* represents the entire population rather than a sample, then
128 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700129
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700130
Raymond Hettinger47d99872019-02-21 15:06:29 -0800131.. function:: fmean(data)
132
133 Convert *data* to floats and compute the arithmetic mean.
134
135 This runs faster than the :func:`mean` function and it always returns a
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800136 :class:`float`. The *data* may be a sequence or iterable. If the input
Raymond Hettingere4810b22019-09-05 00:18:47 -0700137 dataset is empty, raises a :exc:`StatisticsError`.
Raymond Hettinger47d99872019-02-21 15:06:29 -0800138
139 .. doctest::
140
141 >>> fmean([3.5, 4.0, 5.25])
142 4.25
143
144 .. versionadded:: 3.8
145
146
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700147.. function:: geometric_mean(data)
148
149 Convert *data* to floats and compute the geometric mean.
150
Raymond Hettingere4810b22019-09-05 00:18:47 -0700151 The geometric mean indicates the central tendency or typical value of the
152 *data* using the product of the values (as opposed to the arithmetic mean
153 which uses their sum).
154
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700155 Raises a :exc:`StatisticsError` if the input dataset is empty,
156 if it contains a zero, or if it contains a negative value.
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800157 The *data* may be a sequence or iterable.
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700158
159 No special efforts are made to achieve exact results.
160 (However, this may change in the future.)
161
162 .. doctest::
163
Raymond Hettingere4810b22019-09-05 00:18:47 -0700164 >>> round(geometric_mean([54, 24, 36]), 1)
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700165 36.0
166
167 .. versionadded:: 3.8
168
169
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800170.. function:: harmonic_mean(data, weights=None)
Steven D'Aprano22873182016-08-24 02:34:25 +1000171
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800172 Return the harmonic mean of *data*, a sequence or iterable of
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800173 real-valued numbers. If *weights* is omitted or *None*, then
174 equal weighting is assumed.
Steven D'Aprano22873182016-08-24 02:34:25 +1000175
Raymond Hettinger30a8b282021-02-07 16:44:42 -0800176 The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
177 reciprocals of the data. For example, the harmonic mean of three values *a*,
178 *b* and *c* will be equivalent to ``3/(1/a + 1/b + 1/c)``. If one of the
179 values is zero, the result will be zero.
Steven D'Aprano22873182016-08-24 02:34:25 +1000180
181 The harmonic mean is a type of average, a measure of the central
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700182 location of the data. It is often appropriate when averaging
Raymond Hettinger30a8b282021-02-07 16:44:42 -0800183 ratios or rates, for example speeds.
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700184
185 Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
186 What is the average speed?
187
188 .. doctest::
189
190 >>> harmonic_mean([40, 60])
191 48.0
Steven D'Aprano22873182016-08-24 02:34:25 +1000192
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800193 Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
194 speeds-up to 60 km/hr for the remaining 30 km of the journey. What
195 is the average speed?
Steven D'Aprano22873182016-08-24 02:34:25 +1000196
197 .. doctest::
198
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800199 >>> harmonic_mean([40, 60], weights=[5, 30])
200 56.0
Steven D'Aprano22873182016-08-24 02:34:25 +1000201
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800202 :exc:`StatisticsError` is raised if *data* is empty, any element
203 is less than zero, or if the weighted sum isn't positive.
Steven D'Aprano22873182016-08-24 02:34:25 +1000204
Raymond Hettinger7f460492019-11-06 21:50:44 -0800205 The current algorithm has an early-out when it encounters a zero
206 in the input. This means that the subsequent inputs are not tested
207 for validity. (This behavior may change in the future.)
208
Zachary Warec019bd32016-08-23 13:23:31 -0500209 .. versionadded:: 3.6
210
Zackery Spytz66136762021-01-03 05:35:26 -0700211 .. versionchanged:: 3.10
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800212 Added support for *weights*.
Steven D'Aprano22873182016-08-24 02:34:25 +1000213
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700214.. function:: median(data)
215
Georg Brandleb2aeec2013-10-21 08:57:26 +0200216 Return the median (middle value) of numeric data, using the common "mean of
217 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800218 *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700219
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700220 The median is a robust measure of central location and is less affected by
221 the presence of outliers. When the number of data points is odd, the
222 middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700223
224 .. doctest::
225
226 >>> median([1, 3, 5])
227 3
228
Georg Brandleb2aeec2013-10-21 08:57:26 +0200229 When the number of data points is even, the median is interpolated by taking
230 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700231
232 .. doctest::
233
234 >>> median([1, 3, 5, 7])
235 4.0
236
Georg Brandleb2aeec2013-10-21 08:57:26 +0200237 This is suited for when your data is discrete, and you don't mind that the
238 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700239
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700240 If the data is ordinal (supports order operations) but not numeric (doesn't
241 support addition), consider using :func:`median_low` or :func:`median_high`
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300242 instead.
243
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700244.. function:: median_low(data)
245
Georg Brandleb2aeec2013-10-21 08:57:26 +0200246 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800247 :exc:`StatisticsError` is raised. *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700248
Georg Brandleb2aeec2013-10-21 08:57:26 +0200249 The low median is always a member of the data set. When the number of data
250 points is odd, the middle value is returned. When it is even, the smaller of
251 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700252
253 .. doctest::
254
255 >>> median_low([1, 3, 5])
256 3
257 >>> median_low([1, 3, 5, 7])
258 3
259
Georg Brandleb2aeec2013-10-21 08:57:26 +0200260 Use the low median when your data are discrete and you prefer the median to
261 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700262
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700263
264.. function:: median_high(data)
265
Georg Brandleb2aeec2013-10-21 08:57:26 +0200266 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800267 is raised. *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700268
Georg Brandleb2aeec2013-10-21 08:57:26 +0200269 The high median is always a member of the data set. When the number of data
270 points is odd, the middle value is returned. When it is even, the larger of
271 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700272
273 .. doctest::
274
275 >>> median_high([1, 3, 5])
276 3
277 >>> median_high([1, 3, 5, 7])
278 5
279
Georg Brandleb2aeec2013-10-21 08:57:26 +0200280 Use the high median when your data are discrete and you prefer the median to
281 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700282
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700283
Georg Brandleb2aeec2013-10-21 08:57:26 +0200284.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700285
Georg Brandleb2aeec2013-10-21 08:57:26 +0200286 Return the median of grouped continuous data, calculated as the 50th
287 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800288 is raised. *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700289
290 .. doctest::
291
292 >>> median_grouped([52, 52, 53, 54])
293 52.5
294
Georg Brandleb2aeec2013-10-21 08:57:26 +0200295 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200296 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
297 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
298 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200299 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700300
301 .. doctest::
302
303 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
304 3.7
305
Georg Brandleb2aeec2013-10-21 08:57:26 +0200306 Optional argument *interval* represents the class interval, and defaults
307 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700308
309 .. doctest::
310
311 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
312 3.25
313 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
314 3.5
315
316 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200317 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700318
319 .. impl-detail::
320
Georg Brandleb2aeec2013-10-21 08:57:26 +0200321 Under some circumstances, :func:`median_grouped` may coerce data points to
322 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700323
324 .. seealso::
325
Georg Brandleb2aeec2013-10-21 08:57:26 +0200326 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
327 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700328
Georg Brandleb2aeec2013-10-21 08:57:26 +0200329 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100330 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200331 function in the Gnome Gnumeric spreadsheet, including `this discussion
332 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700333
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700334
335.. function:: mode(data)
336
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700337 Return the single most common data point from discrete or nominal *data*.
338 The mode (when it exists) is the most typical value and serves as a
339 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200340
Raymond Hettingere4810b22019-09-05 00:18:47 -0700341 If there are multiple modes with the same frequency, returns the first one
342 encountered in the *data*. If the smallest or largest of those is
343 desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
344 If the input *data* is empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700345
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700346 ``mode`` assumes discrete data and returns a single value. This is the
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700347 standard treatment of the mode as commonly taught in schools:
348
349 .. doctest::
350
351 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
352 3
353
Raymond Hettingere4810b22019-09-05 00:18:47 -0700354 The mode is unique in that it is the only statistic in this package that
355 also applies to nominal (non-numeric) data:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700356
357 .. doctest::
358
359 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
360 'red'
361
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700362 .. versionchanged:: 3.8
363 Now handles multimodal datasets by returning the first mode encountered.
364 Formerly, it raised :exc:`StatisticsError` when more than one mode was
365 found.
366
367
368.. function:: multimode(data)
369
370 Return a list of the most frequently occurring values in the order they
371 were first encountered in the *data*. Will return more than one result if
372 there are multiple modes or an empty list if the *data* is empty:
373
374 .. doctest::
375
376 >>> multimode('aabbbbccddddeeffffgg')
377 ['b', 'd', 'f']
378 >>> multimode('')
379 []
380
381 .. versionadded:: 3.8
382
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700383
Georg Brandleb2aeec2013-10-21 08:57:26 +0200384.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700385
Georg Brandleb2aeec2013-10-21 08:57:26 +0200386 Return the population standard deviation (the square root of the population
387 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700388
389 .. doctest::
390
391 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
392 0.986893273527251
393
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700394
Georg Brandleb2aeec2013-10-21 08:57:26 +0200395.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700396
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800397 Return the population variance of *data*, a non-empty sequence or iterable
Raymond Hettingere4810b22019-09-05 00:18:47 -0700398 of real-valued numbers. Variance, or second moment about the mean, is a
399 measure of the variability (spread or dispersion) of data. A large
400 variance indicates that the data is spread out; a small variance indicates
401 it is clustered closely around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700402
Raymond Hettingere4810b22019-09-05 00:18:47 -0700403 If the optional second argument *mu* is given, it is typically the mean of
404 the *data*. It can also be used to compute the second moment around a
405 point that is not the mean. If it is missing or ``None`` (the default),
406 the arithmetic mean is automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700407
Georg Brandleb2aeec2013-10-21 08:57:26 +0200408 Use this function to calculate the variance from the entire population. To
409 estimate the variance from a sample, the :func:`variance` function is usually
410 a better choice.
411
412 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700413
414 Examples:
415
416 .. doctest::
417
418 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
419 >>> pvariance(data)
420 1.25
421
Georg Brandleb2aeec2013-10-21 08:57:26 +0200422 If you have already calculated the mean of your data, you can pass it as the
423 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700424
425 .. doctest::
426
427 >>> mu = mean(data)
428 >>> pvariance(data, mu)
429 1.25
430
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700431 Decimals and Fractions are supported:
432
433 .. doctest::
434
435 >>> from decimal import Decimal as D
436 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
437 Decimal('24.815')
438
439 >>> from fractions import Fraction as F
440 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
441 Fraction(13, 72)
442
443 .. note::
444
Georg Brandleb2aeec2013-10-21 08:57:26 +0200445 When called with the entire population, this gives the population variance
446 σ². When called on a sample instead, this is the biased sample variance
447 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700448
Raymond Hettingere4810b22019-09-05 00:18:47 -0700449 If you somehow know the true population mean μ, you may use this
450 function to calculate the variance of a sample, giving the known
451 population mean as the second argument. Provided the data points are a
452 random sample of the population, the result will be an unbiased estimate
453 of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700454
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700455
Georg Brandleb2aeec2013-10-21 08:57:26 +0200456.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700457
Georg Brandleb2aeec2013-10-21 08:57:26 +0200458 Return the sample standard deviation (the square root of the sample
459 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700460
461 .. doctest::
462
463 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
464 1.0810874155219827
465
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700466
Georg Brandleb2aeec2013-10-21 08:57:26 +0200467.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700468
Georg Brandleb2aeec2013-10-21 08:57:26 +0200469 Return the sample variance of *data*, an iterable of at least two real-valued
470 numbers. Variance, or second moment about the mean, is a measure of the
471 variability (spread or dispersion) of data. A large variance indicates that
472 the data is spread out; a small variance indicates it is clustered closely
473 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700474
Georg Brandleb2aeec2013-10-21 08:57:26 +0200475 If the optional second argument *xbar* is given, it should be the mean of
476 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700477 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700478
Georg Brandleb2aeec2013-10-21 08:57:26 +0200479 Use this function when your data is a sample from a population. To calculate
480 the variance from the entire population, see :func:`pvariance`.
481
482 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700483
484 Examples:
485
486 .. doctest::
487
488 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
489 >>> variance(data)
490 1.3720238095238095
491
Georg Brandleb2aeec2013-10-21 08:57:26 +0200492 If you have already calculated the mean of your data, you can pass it as the
493 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700494
495 .. doctest::
496
497 >>> m = mean(data)
498 >>> variance(data, m)
499 1.3720238095238095
500
Georg Brandleb2aeec2013-10-21 08:57:26 +0200501 This function does not attempt to verify that you have passed the actual mean
502 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700503 impossible results.
504
505 Decimal and Fraction values are supported:
506
507 .. doctest::
508
509 >>> from decimal import Decimal as D
510 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
511 Decimal('31.01875')
512
513 >>> from fractions import Fraction as F
514 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
515 Fraction(67, 108)
516
517 .. note::
518
Georg Brandleb2aeec2013-10-21 08:57:26 +0200519 This is the sample variance s² with Bessel's correction, also known as
520 variance with N-1 degrees of freedom. Provided that the data points are
521 representative (e.g. independent and identically distributed), the result
522 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700523
Georg Brandleb2aeec2013-10-21 08:57:26 +0200524 If you somehow know the actual population mean μ you should pass it to the
525 :func:`pvariance` function as the *mu* parameter to get the variance of a
526 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700527
Raymond Hettingere4810b22019-09-05 00:18:47 -0700528.. function:: quantiles(data, *, n=4, method='exclusive')
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700529
Raymond Hettingere4810b22019-09-05 00:18:47 -0700530 Divide *data* into *n* continuous intervals with equal probability.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700531 Returns a list of ``n - 1`` cut points separating the intervals.
532
533 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
534 *n* to 100 for percentiles which gives the 99 cuts points that separate
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700535 *data* into 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700536 is not least 1.
537
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700538 The *data* can be any iterable containing sample data. For meaningful
Raymond Hettingere4810b22019-09-05 00:18:47 -0700539 results, the number of data points in *data* should be larger than *n*.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700540 Raises :exc:`StatisticsError` if there are not at least two data points.
541
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700542 The cut points are linearly interpolated from the
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700543 two nearest data points. For example, if a cut point falls one-third
544 of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700545 cut-point will evaluate to ``104``.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700546
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700547 The *method* for computing quantiles can be varied depending on
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700548 whether the *data* includes or excludes the lowest and
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700549 highest possible values from the population.
550
551 The default *method* is "exclusive" and is used for data sampled from
552 a population that can have more extreme values than found in the
553 samples. The portion of the population falling below the *i-th* of
Raymond Hettingerb530a442019-07-21 16:32:00 -0700554 *m* sorted data points is computed as ``i / (m + 1)``. Given nine
555 sample values, the method sorts them and assigns the following
556 percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700557
558 Setting the *method* to "inclusive" is used for describing population
Raymond Hettingerb530a442019-07-21 16:32:00 -0700559 data or for samples that are known to include the most extreme values
Raymond Hettingere4810b22019-09-05 00:18:47 -0700560 from the population. The minimum value in *data* is treated as the 0th
Raymond Hettingerb530a442019-07-21 16:32:00 -0700561 percentile and the maximum value is treated as the 100th percentile.
562 The portion of the population falling below the *i-th* of *m* sorted
563 data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
564 values, the method sorts them and assigns the following percentiles:
565 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700566
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700567 .. doctest::
568
569 # Decile cut points for empirically sampled data
570 >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
571 ... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
572 ... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
573 ... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
574 ... 103, 107, 101, 81, 109, 104]
575 >>> [round(q, 1) for q in quantiles(data, n=10)]
576 [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
577
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700578 .. versionadded:: 3.8
579
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200580.. function:: covariance(x, y, /)
581
582 Return the sample covariance of two inputs *x* and *y*. Covariance
583 is a measure of the joint variability of two inputs.
584
585 Both inputs must be of the same length (no less than two), otherwise
586 :exc:`StatisticsError` is raised.
587
588 Examples:
589
590 .. doctest::
591
592 >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
593 >>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
594 >>> covariance(x, y)
595 0.75
596 >>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
597 >>> covariance(x, z)
598 -7.5
599 >>> covariance(z, x)
600 -7.5
601
602 .. versionadded:: 3.10
603
604.. function:: correlation(x, y, /)
605
606 Return the `Pearson's correlation coefficient
607 <https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
608 for two inputs. Pearson's correlation coefficient *r* takes values
609 between -1 and +1. It measures the strength and direction of the linear
610 relationship, where +1 means very strong, positive linear relationship,
611 -1 very strong, negative linear relationship, and 0 no linear relationship.
612
613 Both inputs must be of the same length (no less than two), and need
614 not to be constant, otherwise :exc:`StatisticsError` is raised.
615
616 Examples:
617
618 .. doctest::
619
620 >>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
621 >>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
622 >>> correlation(x, x)
623 1.0
624 >>> correlation(x, y)
625 -1.0
626
627 .. versionadded:: 3.10
628
Miss Islington (bot)a6825192021-05-24 23:23:10 -0700629.. function:: linear_regression(x, y, /)
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200630
Miss Islington (bot)86779872021-05-24 18:11:12 -0700631 Return the slope and intercept of `simple linear regression
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200632 <https://en.wikipedia.org/wiki/Simple_linear_regression>`_
633 parameters estimated using ordinary least squares. Simple linear
Miss Islington (bot)86779872021-05-24 18:11:12 -0700634 regression describes the relationship between an independent variable *x* and
635 a dependent variable *y* in terms of this linear function:
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200636
Miss Islington (bot)a6825192021-05-24 23:23:10 -0700637 *y = slope \* x + intercept + noise*
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200638
Miss Islington (bot)86779872021-05-24 18:11:12 -0700639 where ``slope`` and ``intercept`` are the regression parameters that are
Miss Islington (bot)a6825192021-05-24 23:23:10 -0700640 estimated, and ``noise`` represents the
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200641 variability of the data that was not explained by the linear regression
Miss Islington (bot)e6755ba2021-05-16 19:47:57 -0700642 (it is equal to the difference between predicted and actual values
Miss Islington (bot)a6825192021-05-24 23:23:10 -0700643 of the dependent variable).
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200644
Miss Islington (bot)86779872021-05-24 18:11:12 -0700645 Both inputs must be of the same length (no less than two), and
Miss Islington (bot)a6825192021-05-24 23:23:10 -0700646 the independent variable *x* cannot be constant;
647 otherwise a :exc:`StatisticsError` is raised.
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200648
Miss Islington (bot)e6755ba2021-05-16 19:47:57 -0700649 For example, we can use the `release dates of the Monty
Miss Islington (bot)a6825192021-05-24 23:23:10 -0700650 Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_
651 to predict the cumulative number of Monty Python films
Miss Islington (bot)e6755ba2021-05-16 19:47:57 -0700652 that would have been produced by 2019
Miss Islington (bot)a6825192021-05-24 23:23:10 -0700653 assuming that they had kept the pace.
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200654
655 .. doctest::
656
657 >>> year = [1971, 1975, 1979, 1982, 1983]
658 >>> films_total = [1, 2, 3, 4, 5]
Miss Islington (bot)86779872021-05-24 18:11:12 -0700659 >>> slope, intercept = linear_regression(year, films_total)
Miss Islington (bot)a6825192021-05-24 23:23:10 -0700660 >>> round(slope * 2019 + intercept)
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200661 16
662
Tymoteusz Wołodźko09aa6f92021-04-25 13:45:09 +0200663 .. versionadded:: 3.10
664
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700665
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700666Exceptions
667----------
668
669A single exception is defined:
670
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400671.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700672
Benjamin Peterson44c30652013-10-20 17:52:09 -0400673 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700674
Raymond Hettinger11c79532019-02-23 14:44:07 -0800675
676:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700677---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800678
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800679:class:`NormalDist` is a tool for creating and manipulating normal
680distributions of a `random variable
681<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
Raymond Hettingere4810b22019-09-05 00:18:47 -0700682class that treats the mean and standard deviation of data
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800683measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800684
685Normal distributions arise from the `Central Limit Theorem
686<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800687of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800688
689.. class:: NormalDist(mu=0.0, sigma=1.0)
690
691 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800692 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800693 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800694 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800695
696 If *sigma* is negative, raises :exc:`StatisticsError`.
697
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800698 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800699
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800700 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800701 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
702 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800703
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700704 .. attribute:: median
705
706 A read-only property for the `median
707 <https://en.wikipedia.org/wiki/Median>`_ of a normal
708 distribution.
709
710 .. attribute:: mode
711
712 A read-only property for the `mode
713 <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
714 distribution.
715
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800716 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800717
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800718 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800719 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
720 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800721
722 .. attribute:: variance
723
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800724 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800725 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
726 distribution. Equal to the square of the standard deviation.
727
728 .. classmethod:: NormalDist.from_samples(data)
729
Raymond Hettingere4810b22019-09-05 00:18:47 -0700730 Makes a normal distribution instance with *mu* and *sigma* parameters
731 estimated from the *data* using :func:`fmean` and :func:`stdev`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800732
Raymond Hettingere4810b22019-09-05 00:18:47 -0700733 The *data* can be any :term:`iterable` and should consist of values
734 that can be converted to type :class:`float`. If *data* does not
735 contain at least two elements, raises :exc:`StatisticsError` because it
736 takes at least one point to estimate a central value and at least two
737 points to estimate dispersion.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800738
Raymond Hettingerfb8c7d52019-04-23 01:46:18 -0700739 .. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800740
741 Generates *n* random samples for a given mean and standard deviation.
742 Returns a :class:`list` of :class:`float` values.
743
744 If *seed* is given, creates a new instance of the underlying random
745 number generator. This is useful for creating reproducible results,
746 even in a multi-threading context.
747
748 .. method:: NormalDist.pdf(x)
749
750 Using a `probability density function (pdf)
Raymond Hettingere4810b22019-09-05 00:18:47 -0700751 <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
752 the relative likelihood that a random variable *X* will be near the
753 given value *x*. Mathematically, it is the limit of the ratio ``P(x <=
754 X < x+dx) / dx`` as *dx* approaches zero.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800755
Raymond Hettingercc353a02019-03-10 23:43:33 -0700756 The relative likelihood is computed as the probability of a sample
757 occurring in a narrow range divided by the width of the range (hence
758 the word "density"). Since the likelihood is relative to other points,
759 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800760
761 .. method:: NormalDist.cdf(x)
762
763 Using a `cumulative distribution function (cdf)
764 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800765 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800766 equal to *x*. Mathematically, it is written ``P(X <= x)``.
767
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700768 .. method:: NormalDist.inv_cdf(p)
769
770 Compute the inverse cumulative distribution function, also known as the
771 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
772 or the `percent-point
773 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
774 function. Mathematically, it is written ``x : P(X <= x) = p``.
775
776 Finds the value *x* of the random variable *X* such that the
777 probability of the variable being less than or equal to that value
778 equals the given probability *p*.
779
Raymond Hettinger318d5372019-03-06 22:59:40 -0800780 .. method:: NormalDist.overlap(other)
781
Raymond Hettingere4810b22019-09-05 00:18:47 -0700782 Measures the agreement between two normal probability distributions.
783 Returns a value between 0.0 and 1.0 giving `the overlapping area for
784 the two probability density functions
785 <https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800786
Raymond Hettinger8a6cbf82019-10-13 19:53:30 -0700787 .. method:: NormalDist.quantiles(n=4)
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700788
789 Divide the normal distribution into *n* continuous intervals with
790 equal probability. Returns a list of (n - 1) cut points separating
791 the intervals.
792
793 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles.
794 Set *n* to 100 for percentiles which gives the 99 cuts points that
795 separate the normal distribution into 100 equal sized groups.
796
Raymond Hettinger70f027d2020-04-16 10:25:14 -0700797 .. method:: NormalDist.zscore(x)
798
799 Compute the
800 `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
801 describing *x* in terms of the number of standard deviations
802 above or below the mean of the normal distribution:
803 ``(x - mean) / stdev``.
804
805 .. versionadded:: 3.9
806
Raymond Hettinger11c79532019-02-23 14:44:07 -0800807 Instances of :class:`NormalDist` support addition, subtraction,
808 multiplication and division by a constant. These operations
809 are used for translation and scaling. For example:
810
811 .. doctest::
812
813 >>> temperature_february = NormalDist(5, 2.5) # Celsius
814 >>> temperature_february * (9/5) + 32 # Fahrenheit
815 NormalDist(mu=41.0, sigma=4.5)
816
Raymond Hettingercc353a02019-03-10 23:43:33 -0700817 Dividing a constant by an instance of :class:`NormalDist` is not supported
818 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800819
820 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800821 variables, it is possible to `add and subtract two independent normally
822 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800823 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
824 represented as instances of :class:`NormalDist`. For example:
825
826 .. doctest::
827
828 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
829 >>> drug_effects = NormalDist(0.4, 0.15)
830 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700831 >>> round(combined.mean, 1)
832 3.1
833 >>> round(combined.stdev, 1)
834 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800835
836 .. versionadded:: 3.8
837
838
839:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700840^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800841
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800842:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800843
844For example, given `historical data for SAT exams
Raymond Hettinger01bf2192020-01-27 18:31:46 -0800845<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
846that scores are normally distributed with a mean of 1060 and a standard
847deviation of 195, determine the percentage of students with test scores
848between 1100 and 1200, after rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800849
850.. doctest::
851
852 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800853 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700854 >>> round(fraction * 100.0, 1)
855 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800856
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700857Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
858<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
859
860.. doctest::
861
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700862 >>> list(map(round, sat.quantiles()))
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700863 [928, 1060, 1192]
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700864 >>> list(map(round, sat.quantiles(n=10)))
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700865 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
866
Raymond Hettinger11c79532019-02-23 14:44:07 -0800867To estimate the distribution for a model than isn't easy to solve
868analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700869Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800870
871.. doctest::
872
Raymond Hettingercc353a02019-03-10 23:43:33 -0700873 >>> def model(x, y, z):
874 ... return (3*x + 7*x*y - 5*y) / (11 * z)
875 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800876 >>> n = 100_000
Raymond Hettingere4810b22019-09-05 00:18:47 -0700877 >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
878 >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
879 >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
880 >>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
881 [1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Raymond Hettinger11c79532019-02-23 14:44:07 -0800882
Raymond Hettinger10355ed2020-01-25 20:21:17 -0800883Normal distributions can be used to approximate `Binomial
884distributions <http://mathworld.wolfram.com/BinomialDistribution.html>`_
885when the sample size is large and when the probability of a successful
886trial is near 50%.
887
888For example, an open source conference has 750 attendees and two rooms with a
889500 person capacity. There is a talk about Python and another about Ruby.
890In previous conferences, 65% of the attendees preferred to listen to Python
891talks. Assuming the population preferences haven't changed, what is the
Raymond Hettinger01bf2192020-01-27 18:31:46 -0800892probability that the Python room will stay within its capacity limits?
Raymond Hettinger10355ed2020-01-25 20:21:17 -0800893
894.. doctest::
895
896 >>> n = 750 # Sample size
897 >>> p = 0.65 # Preference for Python
898 >>> q = 1.0 - p # Preference for Ruby
899 >>> k = 500 # Room capacity
900
901 >>> # Approximation using the cumulative normal distribution
902 >>> from math import sqrt
903 >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
904 0.8402
905
906 >>> # Solution using the cumulative binomial distribution
907 >>> from math import comb, fsum
908 >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
909 0.8402
910
911 >>> # Approximation using a simulation
912 >>> from random import seed, choices
913 >>> seed(8675309)
914 >>> def trial():
915 ... return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
916 >>> mean(trial() <= k for i in range(10_000))
917 0.8398
918
Raymond Hettinger11c79532019-02-23 14:44:07 -0800919Normal distributions commonly arise in machine learning problems.
920
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800921Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800922<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
923The challenge is to predict a person's gender from measurements of normally
924distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800925
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800926We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800927measurements are assumed to be normally distributed, so we summarize the data
928with :class:`NormalDist`:
929
930.. doctest::
931
932 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
933 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
934 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
935 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
936 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
937 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
938
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800939Next, we encounter a new person whose feature measurements are known but whose
940gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800941
942.. doctest::
943
944 >>> ht = 6.0 # height
945 >>> wt = 130 # weight
946 >>> fs = 8 # foot size
947
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800948Starting with a 50% `prior probability
949<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
950we compute the posterior as the prior times the product of likelihoods for the
951feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800952
953.. doctest::
954
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800955 >>> prior_male = 0.5
956 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800957 >>> posterior_male = (prior_male * height_male.pdf(ht) *
958 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
959
960 >>> posterior_female = (prior_female * height_female.pdf(ht) *
961 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
962
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800963The final prediction goes to the largest posterior. This is known as the
964`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800965<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
966
967.. doctest::
968
969 >>> 'male' if posterior_male > posterior_female else 'female'
970 'female'
971
972
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700973..
974 # This modelines must appear within the last ten lines of the file.
975 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;