blob: 6467704006d905c44de0c1b4af8b52197870a237 [file] [log] [blame]
Larry Hastingsf5e987b2013-10-19 11:50:09 -07001:mod:`statistics` --- Mathematical statistics functions
2=======================================================
3
4.. module:: statistics
Sanchit Khuranaf8a63162019-11-26 03:47:59 +05305 :synopsis: Mathematical statistics functions
Terry Jan Reedyfa089b92016-06-11 15:02:54 -04006
Larry Hastingsf5e987b2013-10-19 11:50:09 -07007.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
8.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
9
10.. versionadded:: 3.4
11
Terry Jan Reedyfa089b92016-06-11 15:02:54 -040012**Source code:** :source:`Lib/statistics.py`
13
Larry Hastingsf5e987b2013-10-19 11:50:09 -070014.. testsetup:: *
15
16 from statistics import *
17 __name__ = '<doctest>'
18
Larry Hastingsf5e987b2013-10-19 11:50:09 -070019--------------
20
21This module provides functions for calculating mathematical statistics of
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070022numeric (:class:`~numbers.Real`-valued) data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070023
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070024The module is not intended to be a competitor to third-party libraries such
25as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
26proprietary full-featured statistics packages aimed at professional
27statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
28graphing and scientific calculators.
Nick Coghlan73afe2a2014-02-08 19:58:04 +100029
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -070030Unless explicitly noted, these functions support :class:`int`,
31:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
32Behaviour with other types (whether in the numeric tower or not) is
33currently unsupported. Collections with a mix of types are also undefined
34and implementation-dependent. If your input data consists of mixed types,
35you may be able to use :func:`map` to ensure a consistent result, for
36example: ``map(float, input_data)``.
Nick Coghlan73afe2a2014-02-08 19:58:04 +100037
Larry Hastingsf5e987b2013-10-19 11:50:09 -070038Averages and measures of central location
39-----------------------------------------
40
41These functions calculate an average or typical value from a population
42or sample.
43
Raymond Hettingerfc06a192019-03-12 00:43:27 -070044======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070045:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger47d99872019-02-21 15:06:29 -080046:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger6463ba32019-04-07 09:20:03 -070047:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano22873182016-08-24 02:34:25 +100048:func:`harmonic_mean` Harmonic mean of data.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070049:func:`median` Median (middle value) of data.
50:func:`median_low` Low median of data.
51:func:`median_high` High median of data.
52:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070053:func:`mode` Single mode (most common value) of discrete or nominal data.
54:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -070055:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettingerfc06a192019-03-12 00:43:27 -070056======================= ===============================================================
Larry Hastingsf5e987b2013-10-19 11:50:09 -070057
Georg Brandleb2aeec2013-10-21 08:57:26 +020058Measures of spread
59------------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070060
Georg Brandleb2aeec2013-10-21 08:57:26 +020061These functions calculate a measure of how much the population or sample
62tends to deviate from the typical or average values.
63
64======================= =============================================
65:func:`pstdev` Population standard deviation of data.
66:func:`pvariance` Population variance of data.
67:func:`stdev` Sample standard deviation of data.
68:func:`variance` Sample variance of data.
69======================= =============================================
70
71
72Function details
73----------------
Larry Hastingsf5e987b2013-10-19 11:50:09 -070074
Georg Brandle051b552013-11-04 07:30:50 +010075Note: The functions do not require the data given to them to be sorted.
76However, for reading convenience, most of the examples show sorted sequences.
77
Larry Hastingsf5e987b2013-10-19 11:50:09 -070078.. function:: mean(data)
79
Raymond Hettinger733b9a32019-11-11 23:35:06 -080080 Return the sample arithmetic mean of *data* which can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070081
Georg Brandleb2aeec2013-10-21 08:57:26 +020082 The arithmetic mean is the sum of the data divided by the number of data
83 points. It is commonly called "the average", although it is only one of many
84 different mathematical averages. It is a measure of the central location of
85 the data.
86
87 If *data* is empty, :exc:`StatisticsError` will be raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -070088
89 Some examples of use:
90
91 .. doctest::
92
93 >>> mean([1, 2, 3, 4, 4])
94 2.8
95 >>> mean([-1.0, 2.5, 3.25, 5.75])
96 2.625
97
98 >>> from fractions import Fraction as F
99 >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
100 Fraction(13, 21)
101
102 >>> from decimal import Decimal as D
103 >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
104 Decimal('0.5625')
105
106 .. note::
107
Georg Brandla3fdcaa2013-10-21 09:08:39 +0200108 The mean is strongly affected by outliers and is not a robust estimator
Raymond Hettingere4810b22019-09-05 00:18:47 -0700109 for central location: the mean is not necessarily a typical example of
110 the data points. For more robust measures of central location, see
111 :func:`median` and :func:`mode`.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700112
Georg Brandleb2aeec2013-10-21 08:57:26 +0200113 The sample mean gives an unbiased estimate of the true population mean,
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700114 so that when taken on average over all the possible samples,
Georg Brandleb2aeec2013-10-21 08:57:26 +0200115 ``mean(sample)`` converges on the true mean of the entire population. If
116 *data* represents the entire population rather than a sample, then
117 ``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700118
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700119
Raymond Hettinger47d99872019-02-21 15:06:29 -0800120.. function:: fmean(data)
121
122 Convert *data* to floats and compute the arithmetic mean.
123
124 This runs faster than the :func:`mean` function and it always returns a
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800125 :class:`float`. The *data* may be a sequence or iterable. If the input
Raymond Hettingere4810b22019-09-05 00:18:47 -0700126 dataset is empty, raises a :exc:`StatisticsError`.
Raymond Hettinger47d99872019-02-21 15:06:29 -0800127
128 .. doctest::
129
130 >>> fmean([3.5, 4.0, 5.25])
131 4.25
132
133 .. versionadded:: 3.8
134
135
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700136.. function:: geometric_mean(data)
137
138 Convert *data* to floats and compute the geometric mean.
139
Raymond Hettingere4810b22019-09-05 00:18:47 -0700140 The geometric mean indicates the central tendency or typical value of the
141 *data* using the product of the values (as opposed to the arithmetic mean
142 which uses their sum).
143
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700144 Raises a :exc:`StatisticsError` if the input dataset is empty,
145 if it contains a zero, or if it contains a negative value.
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800146 The *data* may be a sequence or iterable.
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700147
148 No special efforts are made to achieve exact results.
149 (However, this may change in the future.)
150
151 .. doctest::
152
Raymond Hettingere4810b22019-09-05 00:18:47 -0700153 >>> round(geometric_mean([54, 24, 36]), 1)
Raymond Hettinger6463ba32019-04-07 09:20:03 -0700154 36.0
155
156 .. versionadded:: 3.8
157
158
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800159.. function:: harmonic_mean(data, weights=None)
Steven D'Aprano22873182016-08-24 02:34:25 +1000160
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800161 Return the harmonic mean of *data*, a sequence or iterable of
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800162 real-valued numbers. If *weights* is omitted or *None*, then
163 equal weighting is assumed.
Steven D'Aprano22873182016-08-24 02:34:25 +1000164
165 The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Warec019bd32016-08-23 13:23:31 -0500166 reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano22873182016-08-24 02:34:25 +1000167 data. For example, the harmonic mean of three values *a*, *b* and *c*
Raymond Hettinger7f460492019-11-06 21:50:44 -0800168 will be equivalent to ``3/(1/a + 1/b + 1/c)``. If one of the values
169 is zero, the result will be zero.
Steven D'Aprano22873182016-08-24 02:34:25 +1000170
171 The harmonic mean is a type of average, a measure of the central
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700172 location of the data. It is often appropriate when averaging
173 rates or ratios, for example speeds.
174
175 Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
176 What is the average speed?
177
178 .. doctest::
179
180 >>> harmonic_mean([40, 60])
181 48.0
Steven D'Aprano22873182016-08-24 02:34:25 +1000182
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800183 Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
184 speeds-up to 60 km/hr for the remaining 30 km of the journey. What
185 is the average speed?
Steven D'Aprano22873182016-08-24 02:34:25 +1000186
187 .. doctest::
188
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800189 >>> harmonic_mean([40, 60], weights=[5, 30])
190 56.0
Steven D'Aprano22873182016-08-24 02:34:25 +1000191
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800192 :exc:`StatisticsError` is raised if *data* is empty, any element
193 is less than zero, or if the weighted sum isn't positive.
Steven D'Aprano22873182016-08-24 02:34:25 +1000194
Raymond Hettinger7f460492019-11-06 21:50:44 -0800195 The current algorithm has an early-out when it encounters a zero
196 in the input. This means that the subsequent inputs are not tested
197 for validity. (This behavior may change in the future.)
198
Zachary Warec019bd32016-08-23 13:23:31 -0500199 .. versionadded:: 3.6
200
Raymond Hettingercc3467a2020-12-23 19:52:09 -0800201 .. versionchanged:: 3.8
202 Added support for *weights*.
Steven D'Aprano22873182016-08-24 02:34:25 +1000203
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700204.. function:: median(data)
205
Georg Brandleb2aeec2013-10-21 08:57:26 +0200206 Return the median (middle value) of numeric data, using the common "mean of
207 middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800208 *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700209
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700210 The median is a robust measure of central location and is less affected by
211 the presence of outliers. When the number of data points is odd, the
212 middle data point is returned:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700213
214 .. doctest::
215
216 >>> median([1, 3, 5])
217 3
218
Georg Brandleb2aeec2013-10-21 08:57:26 +0200219 When the number of data points is even, the median is interpolated by taking
220 the average of the two middle values:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700221
222 .. doctest::
223
224 >>> median([1, 3, 5, 7])
225 4.0
226
Georg Brandleb2aeec2013-10-21 08:57:26 +0200227 This is suited for when your data is discrete, and you don't mind that the
228 median may not be an actual data point.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700229
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700230 If the data is ordinal (supports order operations) but not numeric (doesn't
231 support addition), consider using :func:`median_low` or :func:`median_high`
Tal Einatfdd6e0b2018-06-25 14:04:01 +0300232 instead.
233
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700234.. function:: median_low(data)
235
Georg Brandleb2aeec2013-10-21 08:57:26 +0200236 Return the low median of numeric data. If *data* is empty,
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800237 :exc:`StatisticsError` is raised. *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700238
Georg Brandleb2aeec2013-10-21 08:57:26 +0200239 The low median is always a member of the data set. When the number of data
240 points is odd, the middle value is returned. When it is even, the smaller of
241 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700242
243 .. doctest::
244
245 >>> median_low([1, 3, 5])
246 3
247 >>> median_low([1, 3, 5, 7])
248 3
249
Georg Brandleb2aeec2013-10-21 08:57:26 +0200250 Use the low median when your data are discrete and you prefer the median to
251 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700252
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700253
254.. function:: median_high(data)
255
Georg Brandleb2aeec2013-10-21 08:57:26 +0200256 Return the high median of data. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800257 is raised. *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700258
Georg Brandleb2aeec2013-10-21 08:57:26 +0200259 The high median is always a member of the data set. When the number of data
260 points is odd, the middle value is returned. When it is even, the larger of
261 the two middle values is returned.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700262
263 .. doctest::
264
265 >>> median_high([1, 3, 5])
266 3
267 >>> median_high([1, 3, 5, 7])
268 5
269
Georg Brandleb2aeec2013-10-21 08:57:26 +0200270 Use the high median when your data are discrete and you prefer the median to
271 be an actual data point rather than interpolated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700272
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700273
Georg Brandleb2aeec2013-10-21 08:57:26 +0200274.. function:: median_grouped(data, interval=1)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700275
Georg Brandleb2aeec2013-10-21 08:57:26 +0200276 Return the median of grouped continuous data, calculated as the 50th
277 percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800278 is raised. *data* can be a sequence or iterable.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700279
280 .. doctest::
281
282 >>> median_grouped([52, 52, 53, 54])
283 52.5
284
Georg Brandleb2aeec2013-10-21 08:57:26 +0200285 In the following example, the data are rounded, so that each value represents
Serhiy Storchakac7b1a0b2016-11-26 13:43:28 +0200286 the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
287 is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
288 given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandleb2aeec2013-10-21 08:57:26 +0200289 interpolation is used to estimate it:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700290
291 .. doctest::
292
293 >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
294 3.7
295
Georg Brandleb2aeec2013-10-21 08:57:26 +0200296 Optional argument *interval* represents the class interval, and defaults
297 to 1. Changing the class interval naturally will change the interpolation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700298
299 .. doctest::
300
301 >>> median_grouped([1, 3, 3, 5, 7], interval=1)
302 3.25
303 >>> median_grouped([1, 3, 3, 5, 7], interval=2)
304 3.5
305
306 This function does not check whether the data points are at least
Georg Brandleb2aeec2013-10-21 08:57:26 +0200307 *interval* apart.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700308
309 .. impl-detail::
310
Georg Brandleb2aeec2013-10-21 08:57:26 +0200311 Under some circumstances, :func:`median_grouped` may coerce data points to
312 floats. This behaviour is likely to change in the future.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700313
314 .. seealso::
315
Georg Brandleb2aeec2013-10-21 08:57:26 +0200316 * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
317 Larry B Wallnau (8th Edition).
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700318
Georg Brandleb2aeec2013-10-21 08:57:26 +0200319 * The `SSMEDIAN
Georg Brandl525d3552014-10-29 10:26:56 +0100320 <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandleb2aeec2013-10-21 08:57:26 +0200321 function in the Gnome Gnumeric spreadsheet, including `this discussion
322 <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700323
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700324
325.. function:: mode(data)
326
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700327 Return the single most common data point from discrete or nominal *data*.
328 The mode (when it exists) is the most typical value and serves as a
329 measure of central location.
Georg Brandleb2aeec2013-10-21 08:57:26 +0200330
Raymond Hettingere4810b22019-09-05 00:18:47 -0700331 If there are multiple modes with the same frequency, returns the first one
332 encountered in the *data*. If the smallest or largest of those is
333 desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
334 If the input *data* is empty, :exc:`StatisticsError` is raised.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700335
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700336 ``mode`` assumes discrete data and returns a single value. This is the
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700337 standard treatment of the mode as commonly taught in schools:
338
339 .. doctest::
340
341 >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
342 3
343
Raymond Hettingere4810b22019-09-05 00:18:47 -0700344 The mode is unique in that it is the only statistic in this package that
345 also applies to nominal (non-numeric) data:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700346
347 .. doctest::
348
349 >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
350 'red'
351
Raymond Hettingerfc06a192019-03-12 00:43:27 -0700352 .. versionchanged:: 3.8
353 Now handles multimodal datasets by returning the first mode encountered.
354 Formerly, it raised :exc:`StatisticsError` when more than one mode was
355 found.
356
357
358.. function:: multimode(data)
359
360 Return a list of the most frequently occurring values in the order they
361 were first encountered in the *data*. Will return more than one result if
362 there are multiple modes or an empty list if the *data* is empty:
363
364 .. doctest::
365
366 >>> multimode('aabbbbccddddeeffffgg')
367 ['b', 'd', 'f']
368 >>> multimode('')
369 []
370
371 .. versionadded:: 3.8
372
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700373
Georg Brandleb2aeec2013-10-21 08:57:26 +0200374.. function:: pstdev(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700375
Georg Brandleb2aeec2013-10-21 08:57:26 +0200376 Return the population standard deviation (the square root of the population
377 variance). See :func:`pvariance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700378
379 .. doctest::
380
381 >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
382 0.986893273527251
383
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700384
Georg Brandleb2aeec2013-10-21 08:57:26 +0200385.. function:: pvariance(data, mu=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700386
Raymond Hettinger733b9a32019-11-11 23:35:06 -0800387 Return the population variance of *data*, a non-empty sequence or iterable
Raymond Hettingere4810b22019-09-05 00:18:47 -0700388 of real-valued numbers. Variance, or second moment about the mean, is a
389 measure of the variability (spread or dispersion) of data. A large
390 variance indicates that the data is spread out; a small variance indicates
391 it is clustered closely around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700392
Raymond Hettingere4810b22019-09-05 00:18:47 -0700393 If the optional second argument *mu* is given, it is typically the mean of
394 the *data*. It can also be used to compute the second moment around a
395 point that is not the mean. If it is missing or ``None`` (the default),
396 the arithmetic mean is automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700397
Georg Brandleb2aeec2013-10-21 08:57:26 +0200398 Use this function to calculate the variance from the entire population. To
399 estimate the variance from a sample, the :func:`variance` function is usually
400 a better choice.
401
402 Raises :exc:`StatisticsError` if *data* is empty.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700403
404 Examples:
405
406 .. doctest::
407
408 >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
409 >>> pvariance(data)
410 1.25
411
Georg Brandleb2aeec2013-10-21 08:57:26 +0200412 If you have already calculated the mean of your data, you can pass it as the
413 optional second argument *mu* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700414
415 .. doctest::
416
417 >>> mu = mean(data)
418 >>> pvariance(data, mu)
419 1.25
420
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700421 Decimals and Fractions are supported:
422
423 .. doctest::
424
425 >>> from decimal import Decimal as D
426 >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
427 Decimal('24.815')
428
429 >>> from fractions import Fraction as F
430 >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
431 Fraction(13, 72)
432
433 .. note::
434
Georg Brandleb2aeec2013-10-21 08:57:26 +0200435 When called with the entire population, this gives the population variance
436 σ². When called on a sample instead, this is the biased sample variance
437 s², also known as variance with N degrees of freedom.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700438
Raymond Hettingere4810b22019-09-05 00:18:47 -0700439 If you somehow know the true population mean μ, you may use this
440 function to calculate the variance of a sample, giving the known
441 population mean as the second argument. Provided the data points are a
442 random sample of the population, the result will be an unbiased estimate
443 of the population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700444
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700445
Georg Brandleb2aeec2013-10-21 08:57:26 +0200446.. function:: stdev(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700447
Georg Brandleb2aeec2013-10-21 08:57:26 +0200448 Return the sample standard deviation (the square root of the sample
449 variance). See :func:`variance` for arguments and other details.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700450
451 .. doctest::
452
453 >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
454 1.0810874155219827
455
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700456
Georg Brandleb2aeec2013-10-21 08:57:26 +0200457.. function:: variance(data, xbar=None)
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700458
Georg Brandleb2aeec2013-10-21 08:57:26 +0200459 Return the sample variance of *data*, an iterable of at least two real-valued
460 numbers. Variance, or second moment about the mean, is a measure of the
461 variability (spread or dispersion) of data. A large variance indicates that
462 the data is spread out; a small variance indicates it is clustered closely
463 around the mean.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700464
Georg Brandleb2aeec2013-10-21 08:57:26 +0200465 If the optional second argument *xbar* is given, it should be the mean of
466 *data*. If it is missing or ``None`` (the default), the mean is
Ned Deily35866732013-10-19 12:10:01 -0700467 automatically calculated.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700468
Georg Brandleb2aeec2013-10-21 08:57:26 +0200469 Use this function when your data is a sample from a population. To calculate
470 the variance from the entire population, see :func:`pvariance`.
471
472 Raises :exc:`StatisticsError` if *data* has fewer than two values.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700473
474 Examples:
475
476 .. doctest::
477
478 >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
479 >>> variance(data)
480 1.3720238095238095
481
Georg Brandleb2aeec2013-10-21 08:57:26 +0200482 If you have already calculated the mean of your data, you can pass it as the
483 optional second argument *xbar* to avoid recalculation:
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700484
485 .. doctest::
486
487 >>> m = mean(data)
488 >>> variance(data, m)
489 1.3720238095238095
490
Georg Brandleb2aeec2013-10-21 08:57:26 +0200491 This function does not attempt to verify that you have passed the actual mean
492 as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700493 impossible results.
494
495 Decimal and Fraction values are supported:
496
497 .. doctest::
498
499 >>> from decimal import Decimal as D
500 >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
501 Decimal('31.01875')
502
503 >>> from fractions import Fraction as F
504 >>> variance([F(1, 6), F(1, 2), F(5, 3)])
505 Fraction(67, 108)
506
507 .. note::
508
Georg Brandleb2aeec2013-10-21 08:57:26 +0200509 This is the sample variance s² with Bessel's correction, also known as
510 variance with N-1 degrees of freedom. Provided that the data points are
511 representative (e.g. independent and identically distributed), the result
512 should be an unbiased estimate of the true population variance.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700513
Georg Brandleb2aeec2013-10-21 08:57:26 +0200514 If you somehow know the actual population mean μ you should pass it to the
515 :func:`pvariance` function as the *mu* parameter to get the variance of a
516 sample.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700517
Raymond Hettingere4810b22019-09-05 00:18:47 -0700518.. function:: quantiles(data, *, n=4, method='exclusive')
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700519
Raymond Hettingere4810b22019-09-05 00:18:47 -0700520 Divide *data* into *n* continuous intervals with equal probability.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700521 Returns a list of ``n - 1`` cut points separating the intervals.
522
523 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles. Set
524 *n* to 100 for percentiles which gives the 99 cuts points that separate
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700525 *data* into 100 equal sized groups. Raises :exc:`StatisticsError` if *n*
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700526 is not least 1.
527
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700528 The *data* can be any iterable containing sample data. For meaningful
Raymond Hettingere4810b22019-09-05 00:18:47 -0700529 results, the number of data points in *data* should be larger than *n*.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700530 Raises :exc:`StatisticsError` if there are not at least two data points.
531
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700532 The cut points are linearly interpolated from the
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700533 two nearest data points. For example, if a cut point falls one-third
534 of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700535 cut-point will evaluate to ``104``.
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700536
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700537 The *method* for computing quantiles can be varied depending on
Raymond Hettingerd8c93aa2019-09-05 23:02:27 -0700538 whether the *data* includes or excludes the lowest and
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700539 highest possible values from the population.
540
541 The default *method* is "exclusive" and is used for data sampled from
542 a population that can have more extreme values than found in the
543 samples. The portion of the population falling below the *i-th* of
Raymond Hettingerb530a442019-07-21 16:32:00 -0700544 *m* sorted data points is computed as ``i / (m + 1)``. Given nine
545 sample values, the method sorts them and assigns the following
546 percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700547
548 Setting the *method* to "inclusive" is used for describing population
Raymond Hettingerb530a442019-07-21 16:32:00 -0700549 data or for samples that are known to include the most extreme values
Raymond Hettingere4810b22019-09-05 00:18:47 -0700550 from the population. The minimum value in *data* is treated as the 0th
Raymond Hettingerb530a442019-07-21 16:32:00 -0700551 percentile and the maximum value is treated as the 100th percentile.
552 The portion of the population falling below the *i-th* of *m* sorted
553 data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
554 values, the method sorts them and assigns the following percentiles:
555 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettingere917f2e2019-05-18 10:18:29 -0700556
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700557 .. doctest::
558
559 # Decile cut points for empirically sampled data
560 >>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
561 ... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
562 ... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
563 ... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
564 ... 103, 107, 101, 81, 109, 104]
565 >>> [round(q, 1) for q in quantiles(data, n=10)]
566 [81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
567
Raymond Hettinger9013ccf2019-04-23 00:06:35 -0700568 .. versionadded:: 3.8
569
570
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700571Exceptions
572----------
573
574A single exception is defined:
575
Benjamin Peterson4ea16e52013-10-20 17:52:54 -0400576.. exception:: StatisticsError
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700577
Benjamin Peterson44c30652013-10-20 17:52:09 -0400578 Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700579
Raymond Hettinger11c79532019-02-23 14:44:07 -0800580
581:class:`NormalDist` objects
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700582---------------------------
Raymond Hettinger11c79532019-02-23 14:44:07 -0800583
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800584:class:`NormalDist` is a tool for creating and manipulating normal
585distributions of a `random variable
586<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
Raymond Hettingere4810b22019-09-05 00:18:47 -0700587class that treats the mean and standard deviation of data
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800588measurements as a single entity.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800589
590Normal distributions arise from the `Central Limit Theorem
591<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800592of applications in statistics.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800593
594.. class:: NormalDist(mu=0.0, sigma=1.0)
595
596 Returns a new *NormalDist* object where *mu* represents the `arithmetic
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800597 mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and *sigma*
Raymond Hettinger11c79532019-02-23 14:44:07 -0800598 represents the `standard deviation
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800599 <https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800600
601 If *sigma* is negative, raises :exc:`StatisticsError`.
602
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800603 .. attribute:: mean
Raymond Hettinger11c79532019-02-23 14:44:07 -0800604
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800605 A read-only property for the `arithmetic mean
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800606 <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
607 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800608
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700609 .. attribute:: median
610
611 A read-only property for the `median
612 <https://en.wikipedia.org/wiki/Median>`_ of a normal
613 distribution.
614
615 .. attribute:: mode
616
617 A read-only property for the `mode
618 <https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
619 distribution.
620
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800621 .. attribute:: stdev
Raymond Hettinger11c79532019-02-23 14:44:07 -0800622
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800623 A read-only property for the `standard deviation
Raymond Hettinger9e456bc2019-02-24 11:44:55 -0800624 <https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
625 distribution.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800626
627 .. attribute:: variance
628
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800629 A read-only property for the `variance
Raymond Hettinger11c79532019-02-23 14:44:07 -0800630 <https://en.wikipedia.org/wiki/Variance>`_ of a normal
631 distribution. Equal to the square of the standard deviation.
632
633 .. classmethod:: NormalDist.from_samples(data)
634
Raymond Hettingere4810b22019-09-05 00:18:47 -0700635 Makes a normal distribution instance with *mu* and *sigma* parameters
636 estimated from the *data* using :func:`fmean` and :func:`stdev`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800637
Raymond Hettingere4810b22019-09-05 00:18:47 -0700638 The *data* can be any :term:`iterable` and should consist of values
639 that can be converted to type :class:`float`. If *data* does not
640 contain at least two elements, raises :exc:`StatisticsError` because it
641 takes at least one point to estimate a central value and at least two
642 points to estimate dispersion.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800643
Raymond Hettingerfb8c7d52019-04-23 01:46:18 -0700644 .. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger11c79532019-02-23 14:44:07 -0800645
646 Generates *n* random samples for a given mean and standard deviation.
647 Returns a :class:`list` of :class:`float` values.
648
649 If *seed* is given, creates a new instance of the underlying random
650 number generator. This is useful for creating reproducible results,
651 even in a multi-threading context.
652
653 .. method:: NormalDist.pdf(x)
654
655 Using a `probability density function (pdf)
Raymond Hettingere4810b22019-09-05 00:18:47 -0700656 <https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
657 the relative likelihood that a random variable *X* will be near the
658 given value *x*. Mathematically, it is the limit of the ratio ``P(x <=
659 X < x+dx) / dx`` as *dx* approaches zero.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800660
Raymond Hettingercc353a02019-03-10 23:43:33 -0700661 The relative likelihood is computed as the probability of a sample
662 occurring in a narrow range divided by the width of the range (hence
663 the word "density"). Since the likelihood is relative to other points,
664 its value can be greater than `1.0`.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800665
666 .. method:: NormalDist.cdf(x)
667
668 Using a `cumulative distribution function (cdf)
669 <https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger9add4b32019-02-28 21:47:26 -0800670 compute the probability that a random variable *X* will be less than or
Raymond Hettinger11c79532019-02-23 14:44:07 -0800671 equal to *x*. Mathematically, it is written ``P(X <= x)``.
672
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700673 .. method:: NormalDist.inv_cdf(p)
674
675 Compute the inverse cumulative distribution function, also known as the
676 `quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
677 or the `percent-point
678 <https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
679 function. Mathematically, it is written ``x : P(X <= x) = p``.
680
681 Finds the value *x* of the random variable *X* such that the
682 probability of the variable being less than or equal to that value
683 equals the given probability *p*.
684
Raymond Hettinger318d5372019-03-06 22:59:40 -0800685 .. method:: NormalDist.overlap(other)
686
Raymond Hettingere4810b22019-09-05 00:18:47 -0700687 Measures the agreement between two normal probability distributions.
688 Returns a value between 0.0 and 1.0 giving `the overlapping area for
689 the two probability density functions
690 <https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger318d5372019-03-06 22:59:40 -0800691
Raymond Hettinger8a6cbf82019-10-13 19:53:30 -0700692 .. method:: NormalDist.quantiles(n=4)
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700693
694 Divide the normal distribution into *n* continuous intervals with
695 equal probability. Returns a list of (n - 1) cut points separating
696 the intervals.
697
698 Set *n* to 4 for quartiles (the default). Set *n* to 10 for deciles.
699 Set *n* to 100 for percentiles which gives the 99 cuts points that
700 separate the normal distribution into 100 equal sized groups.
701
Raymond Hettinger70f027d2020-04-16 10:25:14 -0700702 .. method:: NormalDist.zscore(x)
703
704 Compute the
705 `Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
706 describing *x* in terms of the number of standard deviations
707 above or below the mean of the normal distribution:
708 ``(x - mean) / stdev``.
709
710 .. versionadded:: 3.9
711
Raymond Hettinger11c79532019-02-23 14:44:07 -0800712 Instances of :class:`NormalDist` support addition, subtraction,
713 multiplication and division by a constant. These operations
714 are used for translation and scaling. For example:
715
716 .. doctest::
717
718 >>> temperature_february = NormalDist(5, 2.5) # Celsius
719 >>> temperature_february * (9/5) + 32 # Fahrenheit
720 NormalDist(mu=41.0, sigma=4.5)
721
Raymond Hettingercc353a02019-03-10 23:43:33 -0700722 Dividing a constant by an instance of :class:`NormalDist` is not supported
723 because the result wouldn't be normally distributed.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800724
725 Since normal distributions arise from additive effects of independent
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800726 variables, it is possible to `add and subtract two independent normally
727 distributed random variables
Raymond Hettinger11c79532019-02-23 14:44:07 -0800728 <https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
729 represented as instances of :class:`NormalDist`. For example:
730
731 .. doctest::
732
733 >>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
734 >>> drug_effects = NormalDist(0.4, 0.15)
735 >>> combined = birth_weights + drug_effects
Raymond Hettingercc353a02019-03-10 23:43:33 -0700736 >>> round(combined.mean, 1)
737 3.1
738 >>> round(combined.stdev, 1)
739 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800740
741 .. versionadded:: 3.8
742
743
744:class:`NormalDist` Examples and Recipes
Raymond Hettinger1c668d12019-03-14 21:46:31 -0700745^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger11c79532019-02-23 14:44:07 -0800746
Raymond Hettingeref17fdb2019-02-28 09:16:25 -0800747:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800748
749For example, given `historical data for SAT exams
Raymond Hettinger01bf2192020-01-27 18:31:46 -0800750<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
751that scores are normally distributed with a mean of 1060 and a standard
752deviation of 195, determine the percentage of students with test scores
753between 1100 and 1200, after rounding to the nearest whole number:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800754
755.. doctest::
756
757 >>> sat = NormalDist(1060, 195)
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800758 >>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettingercc353a02019-03-10 23:43:33 -0700759 >>> round(fraction * 100.0, 1)
760 18.4
Raymond Hettinger11c79532019-02-23 14:44:07 -0800761
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700762Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
763<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
764
765.. doctest::
766
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700767 >>> list(map(round, sat.quantiles()))
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700768 [928, 1060, 1192]
Raymond Hettinger4db25d52019-09-08 16:57:58 -0700769 >>> list(map(round, sat.quantiles(n=10)))
Raymond Hettinger714c60d2019-03-18 20:17:14 -0700770 [810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
771
Raymond Hettinger11c79532019-02-23 14:44:07 -0800772To estimate the distribution for a model than isn't easy to solve
773analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettingercc353a02019-03-10 23:43:33 -0700774Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800775
776.. doctest::
777
Raymond Hettingercc353a02019-03-10 23:43:33 -0700778 >>> def model(x, y, z):
779 ... return (3*x + 7*x*y - 5*y) / (11 * z)
780 ...
Raymond Hettinger11c79532019-02-23 14:44:07 -0800781 >>> n = 100_000
Raymond Hettingere4810b22019-09-05 00:18:47 -0700782 >>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
783 >>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
784 >>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
785 >>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
786 [1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Raymond Hettinger11c79532019-02-23 14:44:07 -0800787
Raymond Hettinger10355ed2020-01-25 20:21:17 -0800788Normal distributions can be used to approximate `Binomial
789distributions <http://mathworld.wolfram.com/BinomialDistribution.html>`_
790when the sample size is large and when the probability of a successful
791trial is near 50%.
792
793For example, an open source conference has 750 attendees and two rooms with a
794500 person capacity. There is a talk about Python and another about Ruby.
795In previous conferences, 65% of the attendees preferred to listen to Python
796talks. Assuming the population preferences haven't changed, what is the
Raymond Hettinger01bf2192020-01-27 18:31:46 -0800797probability that the Python room will stay within its capacity limits?
Raymond Hettinger10355ed2020-01-25 20:21:17 -0800798
799.. doctest::
800
801 >>> n = 750 # Sample size
802 >>> p = 0.65 # Preference for Python
803 >>> q = 1.0 - p # Preference for Ruby
804 >>> k = 500 # Room capacity
805
806 >>> # Approximation using the cumulative normal distribution
807 >>> from math import sqrt
808 >>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
809 0.8402
810
811 >>> # Solution using the cumulative binomial distribution
812 >>> from math import comb, fsum
813 >>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
814 0.8402
815
816 >>> # Approximation using a simulation
817 >>> from random import seed, choices
818 >>> seed(8675309)
819 >>> def trial():
820 ... return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
821 >>> mean(trial() <= k for i in range(10_000))
822 0.8398
823
Raymond Hettinger11c79532019-02-23 14:44:07 -0800824Normal distributions commonly arise in machine learning problems.
825
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800826Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettingerd70a3592019-03-09 00:42:23 -0800827<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
828The challenge is to predict a person's gender from measurements of normally
829distributed features including height, weight, and foot size.
Raymond Hettinger11c79532019-02-23 14:44:07 -0800830
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800831We're given a training dataset with measurements for eight people. The
Raymond Hettinger11c79532019-02-23 14:44:07 -0800832measurements are assumed to be normally distributed, so we summarize the data
833with :class:`NormalDist`:
834
835.. doctest::
836
837 >>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
838 >>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
839 >>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
840 >>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
841 >>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
842 >>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
843
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800844Next, we encounter a new person whose feature measurements are known but whose
845gender is unknown:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800846
847.. doctest::
848
849 >>> ht = 6.0 # height
850 >>> wt = 130 # weight
851 >>> fs = 8 # foot size
852
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800853Starting with a 50% `prior probability
854<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
855we compute the posterior as the prior times the product of likelihoods for the
856feature measurements given the gender:
Raymond Hettinger11c79532019-02-23 14:44:07 -0800857
858.. doctest::
859
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800860 >>> prior_male = 0.5
861 >>> prior_female = 0.5
Raymond Hettinger11c79532019-02-23 14:44:07 -0800862 >>> posterior_male = (prior_male * height_male.pdf(ht) *
863 ... weight_male.pdf(wt) * foot_size_male.pdf(fs))
864
865 >>> posterior_female = (prior_female * height_female.pdf(ht) *
866 ... weight_female.pdf(wt) * foot_size_female.pdf(fs))
867
Raymond Hettinger1f58f4f2019-03-06 23:23:55 -0800868The final prediction goes to the largest posterior. This is known as the
869`maximum a posteriori
Raymond Hettinger11c79532019-02-23 14:44:07 -0800870<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
871
872.. doctest::
873
874 >>> 'male' if posterior_male > posterior_female else 'female'
875 'female'
876
877
Larry Hastingsf5e987b2013-10-19 11:50:09 -0700878..
879 # This modelines must appear within the last ten lines of the file.
880 kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;