Blame - Doc/library/statistics.rst - platform/external/python/cpython3

blob: 62d856bf248c5ad512b5a09792280f417779b311 [file] [log] [blame]

Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	1	:mod:`statistics` --- Mathematical statistics functions
				2	=======================================================
				3
				4	.. module:: statistics
				5	:synopsis: mathematical statistics functions
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	7	.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
				8	.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
				9
				10	.. versionadded:: 3.4
				11
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	12	Source code: :source:`Lib/statistics.py`
				13
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	14	.. testsetup:: *
				15
				16	from statistics import *
				17	__name__ = '<doctest>'
				18
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	19	--------------
				20
				21	This module provides functions for calculating mathematical statistics of
				22	numeric (:class:`Real`-valued) data.
				23
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	24	.. note::
				25
				26	Unless explicitly noted otherwise, these functions support :class:`int`,
				27	:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
				28	Behaviour with other types (whether in the numeric tower or not) is
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	29	currently unsupported. Collections with a mix of types are also undefined
				30	and implementation-dependent. If your input data consists of mixed types,
				31	you may be able to use :func:`map` to ensure a consistent result, for
				32	example: ``map(float, input_data)``.
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	33
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	34	Averages and measures of central location
				35	-----------------------------------------
				36
				37	These functions calculate an average or typical value from a population
				38	or sample.
				39
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	40	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	41	:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	42	:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	43	:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	44	:func:`harmonic_mean` Harmonic mean of data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	45	:func:`median` Median (middle value) of data.
				46	:func:`median_low` Low median of data.
				47	:func:`median_high` High median of data.
				48	:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	49	:func:`mode` Single mode (most common value) of discrete or nominal data.
				50	:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	51	:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	52	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	53
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	54	Measures of spread
				55	------------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	56
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	57	These functions calculate a measure of how much the population or sample
				58	tends to deviate from the typical or average values.
				59
				60	======================= =============================================
				61	:func:`pstdev` Population standard deviation of data.
				62	:func:`pvariance` Population variance of data.
				63	:func:`stdev` Sample standard deviation of data.
				64	:func:`variance` Sample variance of data.
				65	======================= =============================================
				66
				67
				68	Function details
				69	----------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	70
Georg Brandl	e051b55	2013-11-04 07:30:50 +0100	[diff] [blame]	71	Note: The functions do not require the data given to them to be sorted.
				72	However, for reading convenience, most of the examples show sorted sequences.
				73
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	74	.. function:: mean(data)
				75
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	76	Return the sample arithmetic mean of data which can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	77
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	78	The arithmetic mean is the sum of the data divided by the number of data
				79	points. It is commonly called "the average", although it is only one of many
				80	different mathematical averages. It is a measure of the central location of
				81	the data.
				82
				83	If data is empty, :exc:`StatisticsError` will be raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	84
				85	Some examples of use:
				86
				87	.. doctest::
				88
				89	>>> mean([1, 2, 3, 4, 4])
				90	2.8
				91	>>> mean([-1.0, 2.5, 3.25, 5.75])
				92	2.625
				93
				94	>>> from fractions import Fraction as F
				95	>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
				96	Fraction(13, 21)
				97
				98	>>> from decimal import Decimal as D
				99	>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
				100	Decimal('0.5625')
				101
				102	.. note::
				103
Georg Brandl	a3fdcaa	2013-10-21 09:08:39 +0200	[diff] [blame]	104	The mean is strongly affected by outliers and is not a robust estimator
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	105	for central location: the mean is not necessarily a typical example of
				106	the data points. For more robust measures of central location, see
				107	:func:`median` and :func:`mode`.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	108
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	109	The sample mean gives an unbiased estimate of the true population mean,
				110	which means that, taken on average over all the possible samples,
				111	``mean(sample)`` converges on the true mean of the entire population. If
				112	data represents the entire population rather than a sample, then
				113	``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	114
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	115
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	116	.. function:: fmean(data)
				117
				118	Convert data to floats and compute the arithmetic mean.
				119
				120	This runs faster than the :func:`mean` function and it always returns a
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	121	:class:`float`. The data may be a sequence or iterator. If the input
				122	dataset is empty, raises a :exc:`StatisticsError`.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	123
				124	.. doctest::
				125
				126	>>> fmean([3.5, 4.0, 5.25])
				127	4.25
				128
				129	.. versionadded:: 3.8
				130
				131
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	132	.. function:: geometric_mean(data)
				133
				134	Convert data to floats and compute the geometric mean.
				135
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	136	The geometric mean indicates the central tendency or typical value of the
				137	data using the product of the values (as opposed to the arithmetic mean
				138	which uses their sum).
				139
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	140	Raises a :exc:`StatisticsError` if the input dataset is empty,
				141	if it contains a zero, or if it contains a negative value.
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	142	The data may be a sequence or iterator.
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	143
				144	No special efforts are made to achieve exact results.
				145	(However, this may change in the future.)
				146
				147	.. doctest::
				148
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	149	>>> round(geometric_mean([54, 24, 36]), 1)
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	150	36.0
				151
				152	.. versionadded:: 3.8
				153
				154
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	155	.. function:: harmonic_mean(data)
				156
				157	Return the harmonic mean of data, a sequence or iterator of
				158	real-valued numbers.
				159
				160	The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	161	reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	162	data. For example, the harmonic mean of three values a, b and c
				163	will be equivalent to ``3/(1/a + 1/b + 1/c)``.
				164
				165	The harmonic mean is a type of average, a measure of the central
				166	location of the data. It is often appropriate when averaging quantities
				167	which are rates or ratios, for example speeds. For example:
				168
				169	Suppose an investor purchases an equal value of shares in each of
				170	three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
				171	What is the average P/E ratio for the investor's portfolio?
				172
				173	.. doctest::
				174
				175	>>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
				176	3.6
				177
				178	Using the arithmetic mean would give an average of about 5.167, which
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	179	is well over the aggregate P/E ratio.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	180
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	181	:exc:`StatisticsError` is raised if data is empty, or any element
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	182	is less than zero.
				183
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	184	.. versionadded:: 3.6
				185
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	186
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	187	.. function:: median(data)
				188
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	189	Return the median (middle value) of numeric data, using the common "mean of
				190	middle two" method. If data is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	191	data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	192
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	193	The median is a robust measure of central location, and is less affected by
				194	the presence of outliers in your data. When the number of data points is
				195	odd, the middle data point is returned:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	196
				197	.. doctest::
				198
				199	>>> median([1, 3, 5])
				200	3
				201
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	202	When the number of data points is even, the median is interpolated by taking
				203	the average of the two middle values:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	204
				205	.. doctest::
				206
				207	>>> median([1, 3, 5, 7])
				208	4.0
				209
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	210	This is suited for when your data is discrete, and you don't mind that the
				211	median may not be an actual data point.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	212
Tal Einat	fdd6e0b	2018-06-25 14:04:01 +0300	[diff] [blame]	213	If your data is ordinal (supports order operations) but not numeric (doesn't
				214	support addition), you should use :func:`median_low` or :func:`median_high`
				215	instead.
				216
Berker Peksag	9c1dba2	2014-09-28 00:00:58 +0300	[diff] [blame]	217	.. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	218
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	219
				220	.. function:: median_low(data)
				221
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	222	Return the low median of numeric data. If data is empty,
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	223	:exc:`StatisticsError` is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	224
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	225	The low median is always a member of the data set. When the number of data
				226	points is odd, the middle value is returned. When it is even, the smaller of
				227	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	228
				229	.. doctest::
				230
				231	>>> median_low([1, 3, 5])
				232	3
				233	>>> median_low([1, 3, 5, 7])
				234	3
				235
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	236	Use the low median when your data are discrete and you prefer the median to
				237	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	238
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	239
				240	.. function:: median_high(data)
				241
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	242	Return the high median of data. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	243	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	244
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	245	The high median is always a member of the data set. When the number of data
				246	points is odd, the middle value is returned. When it is even, the larger of
				247	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	248
				249	.. doctest::
				250
				251	>>> median_high([1, 3, 5])
				252	3
				253	>>> median_high([1, 3, 5, 7])
				254	5
				255
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	256	Use the high median when your data are discrete and you prefer the median to
				257	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	258
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	259
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	260	.. function:: median_grouped(data, interval=1)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	261
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	262	Return the median of grouped continuous data, calculated as the 50th
				263	percentile, using interpolation. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	264	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	265
				266	.. doctest::
				267
				268	>>> median_grouped([52, 52, 53, 54])
				269	52.5
				270
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	271	In the following example, the data are rounded, so that each value represents
Serhiy Storchaka	c7b1a0b	2016-11-26 13:43:28 +0200	[diff] [blame]	272	the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
				273	is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
				274	given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	275	interpolation is used to estimate it:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	276
				277	.. doctest::
				278
				279	>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
				280	3.7
				281
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	282	Optional argument interval represents the class interval, and defaults
				283	to 1. Changing the class interval naturally will change the interpolation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	284
				285	.. doctest::
				286
				287	>>> median_grouped([1, 3, 3, 5, 7], interval=1)
				288	3.25
				289	>>> median_grouped([1, 3, 3, 5, 7], interval=2)
				290	3.5
				291
				292	This function does not check whether the data points are at least
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	293	interval apart.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	294
				295	.. impl-detail::
				296
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	297	Under some circumstances, :func:`median_grouped` may coerce data points to
				298	floats. This behaviour is likely to change in the future.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	299
				300	.. seealso::
				301
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	302	* "Statistics for the Behavioral Sciences", Frederick J Gravetter and
				303	Larry B Wallnau (8th Edition).
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	304
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	305	* The `SSMEDIAN
Georg Brandl	525d355	2014-10-29 10:26:56 +0100	[diff] [blame]	306	<https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	307	function in the Gnome Gnumeric spreadsheet, including `this discussion
				308	<https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	309
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	310
				311	.. function:: mode(data)
				312
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	313	Return the single most common data point from discrete or nominal data.
				314	The mode (when it exists) is the most typical value and serves as a
				315	measure of central location.
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	316
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	317	If there are multiple modes with the same frequency, returns the first one
				318	encountered in the data. If the smallest or largest of those is
				319	desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
				320	If the input data is empty, :exc:`StatisticsError` is raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	321
				322	``mode`` assumes discrete data, and returns a single value. This is the
				323	standard treatment of the mode as commonly taught in schools:
				324
				325	.. doctest::
				326
				327	>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
				328	3
				329
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	330	The mode is unique in that it is the only statistic in this package that
				331	also applies to nominal (non-numeric) data:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	332
				333	.. doctest::
				334
				335	>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
				336	'red'
				337
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	338	.. versionchanged:: 3.8
				339	Now handles multimodal datasets by returning the first mode encountered.
				340	Formerly, it raised :exc:`StatisticsError` when more than one mode was
				341	found.
				342
				343
				344	.. function:: multimode(data)
				345
				346	Return a list of the most frequently occurring values in the order they
				347	were first encountered in the data. Will return more than one result if
				348	there are multiple modes or an empty list if the data is empty:
				349
				350	.. doctest::
				351
				352	>>> multimode('aabbbbccddddeeffffgg')
				353	['b', 'd', 'f']
				354	>>> multimode('')
				355	[]
				356
				357	.. versionadded:: 3.8
				358
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	359
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	360	.. function:: pstdev(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	361
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	362	Return the population standard deviation (the square root of the population
				363	variance). See :func:`pvariance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	364
				365	.. doctest::
				366
				367	>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				368	0.986893273527251
				369
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	370
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	371	.. function:: pvariance(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	372
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	373	Return the population variance of data, a non-empty sequence or iterator
				374	of real-valued numbers. Variance, or second moment about the mean, is a
				375	measure of the variability (spread or dispersion) of data. A large
				376	variance indicates that the data is spread out; a small variance indicates
				377	it is clustered closely around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	378
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	379	If the optional second argument mu is given, it is typically the mean of
				380	the data. It can also be used to compute the second moment around a
				381	point that is not the mean. If it is missing or ``None`` (the default),
				382	the arithmetic mean is automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	383
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	384	Use this function to calculate the variance from the entire population. To
				385	estimate the variance from a sample, the :func:`variance` function is usually
				386	a better choice.
				387
				388	Raises :exc:`StatisticsError` if data is empty.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	389
				390	Examples:
				391
				392	.. doctest::
				393
				394	>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
				395	>>> pvariance(data)
				396	1.25
				397
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	398	If you have already calculated the mean of your data, you can pass it as the
				399	optional second argument mu to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	400
				401	.. doctest::
				402
				403	>>> mu = mean(data)
				404	>>> pvariance(data, mu)
				405	1.25
				406
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	407	Decimals and Fractions are supported:
				408
				409	.. doctest::
				410
				411	>>> from decimal import Decimal as D
				412	>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				413	Decimal('24.815')
				414
				415	>>> from fractions import Fraction as F
				416	>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
				417	Fraction(13, 72)
				418
				419	.. note::
				420
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	421	When called with the entire population, this gives the population variance
				422	σ². When called on a sample instead, this is the biased sample variance
				423	s², also known as variance with N degrees of freedom.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	424
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	425	If you somehow know the true population mean μ, you may use this
				426	function to calculate the variance of a sample, giving the known
				427	population mean as the second argument. Provided the data points are a
				428	random sample of the population, the result will be an unbiased estimate
				429	of the population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	430
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	431
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	432	.. function:: stdev(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	433
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	434	Return the sample standard deviation (the square root of the sample
				435	variance). See :func:`variance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	436
				437	.. doctest::
				438
				439	>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				440	1.0810874155219827
				441
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	442
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	443	.. function:: variance(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	444
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	445	Return the sample variance of data, an iterable of at least two real-valued
				446	numbers. Variance, or second moment about the mean, is a measure of the
				447	variability (spread or dispersion) of data. A large variance indicates that
				448	the data is spread out; a small variance indicates it is clustered closely
				449	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	450
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	451	If the optional second argument xbar is given, it should be the mean of
				452	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	453	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	454
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	455	Use this function when your data is a sample from a population. To calculate
				456	the variance from the entire population, see :func:`pvariance`.
				457
				458	Raises :exc:`StatisticsError` if data has fewer than two values.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	459
				460	Examples:
				461
				462	.. doctest::
				463
				464	>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
				465	>>> variance(data)
				466	1.3720238095238095
				467
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	468	If you have already calculated the mean of your data, you can pass it as the
				469	optional second argument xbar to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	470
				471	.. doctest::
				472
				473	>>> m = mean(data)
				474	>>> variance(data, m)
				475	1.3720238095238095
				476
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	477	This function does not attempt to verify that you have passed the actual mean
				478	as xbar. Using arbitrary values for xbar can lead to invalid or
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	479	impossible results.
				480
				481	Decimal and Fraction values are supported:
				482
				483	.. doctest::
				484
				485	>>> from decimal import Decimal as D
				486	>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				487	Decimal('31.01875')
				488
				489	>>> from fractions import Fraction as F
				490	>>> variance([F(1, 6), F(1, 2), F(5, 3)])
				491	Fraction(67, 108)
				492
				493	.. note::
				494
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	495	This is the sample variance s² with Bessel's correction, also known as
				496	variance with N-1 degrees of freedom. Provided that the data points are
				497	representative (e.g. independent and identically distributed), the result
				498	should be an unbiased estimate of the true population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	499
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	500	If you somehow know the actual population mean μ you should pass it to the
				501	:func:`pvariance` function as the mu parameter to get the variance of a
				502	sample.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	503
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	504	.. function:: quantiles(data, *, n=4, method='exclusive')
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	505
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	506	Divide data into n continuous intervals with equal probability.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	507	Returns a list of ``n - 1`` cut points separating the intervals.
				508
				509	Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set
				510	n to 100 for percentiles which gives the 99 cuts points that separate
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	511	data in to 100 equal sized groups. Raises :exc:`StatisticsError` if n
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	512	is not least 1.
				513
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	514	The data can be any iterable containing sample data or it can be an
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	515	instance of a class that defines an :meth:`~inv_cdf` method. For meaningful
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	516	results, the number of data points in data should be larger than n.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	517	Raises :exc:`StatisticsError` if there are not at least two data points.
				518
				519	For sample data, the cut points are linearly interpolated from the
				520	two nearest data points. For example, if a cut point falls one-third
				521	of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	522	cut-point will evaluate to ``104``.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	523
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	524	The method for computing quantiles can be varied depending on
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	525	whether the data in data includes or excludes the lowest and
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	526	highest possible values from the population.
				527
				528	The default method is "exclusive" and is used for data sampled from
				529	a population that can have more extreme values than found in the
				530	samples. The portion of the population falling below the i-th of
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	531	m sorted data points is computed as ``i / (m + 1)``. Given nine
				532	sample values, the method sorts them and assigns the following
				533	percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	534
				535	Setting the method to "inclusive" is used for describing population
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	536	data or for samples that are known to include the most extreme values
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	537	from the population. The minimum value in data is treated as the 0th
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	538	percentile and the maximum value is treated as the 100th percentile.
				539	The portion of the population falling below the i-th of m sorted
				540	data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
				541	values, the method sorts them and assigns the following percentiles:
				542	0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	543
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	544	If data is an instance of a class that defines an
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	545	:meth:`~inv_cdf` method, setting method has no effect.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	546
				547	.. doctest::
				548
				549	# Decile cut points for empirically sampled data
				550	>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
				551	... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
				552	... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
				553	... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
				554	... 103, 107, 101, 81, 109, 104]
				555	>>> [round(q, 1) for q in quantiles(data, n=10)]
				556	[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
				557
Min ho Kim	c4cacc8	2019-07-31 08:16:13 +1000	[diff] [blame]	558	>>> # Quartile cut points for the standard normal distribution
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	559	>>> Z = NormalDist()
				560	>>> [round(q, 4) for q in quantiles(Z, n=4)]
				561	[-0.6745, 0.0, 0.6745]
				562
				563	.. versionadded:: 3.8
				564
				565
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	566	Exceptions
				567	----------
				568
				569	A single exception is defined:
				570
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	571	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	572
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	573	Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	574
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	575
				576	:class:`NormalDist` objects
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	577	---------------------------
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	578
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	579	:class:`NormalDist` is a tool for creating and manipulating normal
				580	distributions of a `random variable
				581	<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	582	class that treats the mean and standard deviation of data
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	583	measurements as a single entity.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	584
				585	Normal distributions arise from the `Central Limit Theorem
				586	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	587	of applications in statistics.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	588
				589	.. class:: NormalDist(mu=0.0, sigma=1.0)
				590
				591	Returns a new NormalDist object where mu represents the `arithmetic
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	592	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and sigma
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	593	represents the `standard deviation
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	594	<https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	595
				596	If sigma is negative, raises :exc:`StatisticsError`.
				597
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	598	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	599
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	600	A read-only property for the `arithmetic mean
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	601	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				602	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	603
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	604	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	605
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	606	A read-only property for the `standard deviation
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	607	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				608	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	609
				610	.. attribute:: variance
				611
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	612	A read-only property for the `variance
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	613	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				614	distribution. Equal to the square of the standard deviation.
				615
				616	.. classmethod:: NormalDist.from_samples(data)
				617
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	618	Makes a normal distribution instance with mu and sigma parameters
				619	estimated from the data using :func:`fmean` and :func:`stdev`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	620
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	621	The data can be any :term:`iterable` and should consist of values
				622	that can be converted to type :class:`float`. If data does not
				623	contain at least two elements, raises :exc:`StatisticsError` because it
				624	takes at least one point to estimate a central value and at least two
				625	points to estimate dispersion.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	626
Raymond Hettinger	fb8c7d5	2019-04-23 01:46:18 -0700	[diff] [blame]	627	.. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	628
				629	Generates n random samples for a given mean and standard deviation.
				630	Returns a :class:`list` of :class:`float` values.
				631
				632	If seed is given, creates a new instance of the underlying random
				633	number generator. This is useful for creating reproducible results,
				634	even in a multi-threading context.
				635
				636	.. method:: NormalDist.pdf(x)
				637
				638	Using a `probability density function (pdf)
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	639	<https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
				640	the relative likelihood that a random variable X will be near the
				641	given value x. Mathematically, it is the limit of the ratio ``P(x <=
				642	X < x+dx) / dx`` as dx approaches zero.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	643
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	644	The relative likelihood is computed as the probability of a sample
				645	occurring in a narrow range divided by the width of the range (hence
				646	the word "density"). Since the likelihood is relative to other points,
				647	its value can be greater than `1.0`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	648
				649	.. method:: NormalDist.cdf(x)
				650
				651	Using a `cumulative distribution function (cdf)
				652	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	653	compute the probability that a random variable X will be less than or
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	654	equal to x. Mathematically, it is written ``P(X <= x)``.
				655
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	656	.. method:: NormalDist.inv_cdf(p)
				657
				658	Compute the inverse cumulative distribution function, also known as the
				659	`quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
				660	or the `percent-point
				661	<https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
				662	function. Mathematically, it is written ``x : P(X <= x) = p``.
				663
				664	Finds the value x of the random variable X such that the
				665	probability of the variable being less than or equal to that value
				666	equals the given probability p.
				667
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	668	.. method:: NormalDist.overlap(other)
				669
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	670	Measures the agreement between two normal probability distributions.
				671	Returns a value between 0.0 and 1.0 giving `the overlapping area for
				672	the two probability density functions
				673	<https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	674
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	675	Instances of :class:`NormalDist` support addition, subtraction,
				676	multiplication and division by a constant. These operations
				677	are used for translation and scaling. For example:
				678
				679	.. doctest::
				680
				681	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				682	>>> temperature_february * (9/5) + 32 # Fahrenheit
				683	NormalDist(mu=41.0, sigma=4.5)
				684
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	685	Dividing a constant by an instance of :class:`NormalDist` is not supported
				686	because the result wouldn't be normally distributed.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	687
				688	Since normal distributions arise from additive effects of independent
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	689	variables, it is possible to `add and subtract two independent normally
				690	distributed random variables
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	691	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				692	represented as instances of :class:`NormalDist`. For example:
				693
				694	.. doctest::
				695
				696	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				697	>>> drug_effects = NormalDist(0.4, 0.15)
				698	>>> combined = birth_weights + drug_effects
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	699	>>> round(combined.mean, 1)
				700	3.1
				701	>>> round(combined.stdev, 1)
				702	0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	703
				704	.. versionadded:: 3.8
				705
				706
				707	:class:`NormalDist` Examples and Recipes
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	708	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	709
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	710	:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	711
				712	For example, given `historical data for SAT exams
				713	<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	714	are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	715	determine the percentage of students with test scores between 1100 and
				716	1200, after rounding to the nearest whole number:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	717
				718	.. doctest::
				719
				720	>>> sat = NormalDist(1060, 195)
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	721	>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	722	>>> round(fraction * 100.0, 1)
				723	18.4
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	724
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	725	Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
				726	<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
				727
				728	.. doctest::
				729
				730	>>> [round(sat.inv_cdf(p)) for p in (0.25, 0.50, 0.75)]
				731	[928, 1060, 1192]
				732	>>> [round(sat.inv_cdf(p / 10)) for p in range(1, 10)]
				733	[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
				734
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	735	To estimate the distribution for a model than isn't easy to solve
				736	analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	737	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	738
				739	.. doctest::
				740
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	741	>>> def model(x, y, z):
				742	... return (3x + 7xy - 5y) / (11 * z)
				743	...
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	744	>>> n = 100_000
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame^]	745	>>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
				746	>>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
				747	>>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
				748	>>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
				749	[1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	750
				751	Normal distributions commonly arise in machine learning problems.
				752
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	753	Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettinger	d70a359	2019-03-09 00:42:23 -0800	[diff] [blame]	754	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
				755	The challenge is to predict a person's gender from measurements of normally
				756	distributed features including height, weight, and foot size.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	757
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	758	We're given a training dataset with measurements for eight people. The
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	759	measurements are assumed to be normally distributed, so we summarize the data
				760	with :class:`NormalDist`:
				761
				762	.. doctest::
				763
				764	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				765	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				766	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				767	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				768	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				769	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				770
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	771	Next, we encounter a new person whose feature measurements are known but whose
				772	gender is unknown:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	773
				774	.. doctest::
				775
				776	>>> ht = 6.0 # height
				777	>>> wt = 130 # weight
				778	>>> fs = 8 # foot size
				779
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	780	Starting with a 50% `prior probability
				781	<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
				782	we compute the posterior as the prior times the product of likelihoods for the
				783	feature measurements given the gender:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	784
				785	.. doctest::
				786
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	787	>>> prior_male = 0.5
				788	>>> prior_female = 0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	789	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				790	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				791
				792	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				793	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				794
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	795	The final prediction goes to the largest posterior. This is known as the
				796	`maximum a posteriori
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	797	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				798
				799	.. doctest::
				800
				801	>>> 'male' if posterior_male > posterior_female else 'female'
				802	'female'
				803
				804
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	805	..
				806	# This modelines must appear within the last ten lines of the file.
				807	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;