Blame - Doc/library/statistics.rst - platform/external/python/cpython3

blob: a702b2463c39b1678801723dbe2698d3d3105361 [file] [log] [blame]

Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	1	:mod:`statistics` --- Mathematical statistics functions
				2	=======================================================
				3
				4	.. module:: statistics
				5	:synopsis: mathematical statistics functions
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	7	.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
				8	.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
				9
				10	.. versionadded:: 3.4
				11
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	12	Source code: :source:`Lib/statistics.py`
				13
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	14	.. testsetup:: *
				15
				16	from statistics import *
				17	__name__ = '<doctest>'
				18
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	19	--------------
				20
				21	This module provides functions for calculating mathematical statistics of
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	22	numeric (:class:`~numbers.Real`-valued) data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	23
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	24	The module is not intended to be a competitor to third-party libraries such
				25	as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
				26	proprietary full-featured statistics packages aimed at professional
				27	statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
				28	graphing and scientific calculators.
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	29
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	30	Unless explicitly noted, these functions support :class:`int`,
				31	:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
				32	Behaviour with other types (whether in the numeric tower or not) is
				33	currently unsupported. Collections with a mix of types are also undefined
				34	and implementation-dependent. If your input data consists of mixed types,
				35	you may be able to use :func:`map` to ensure a consistent result, for
				36	example: ``map(float, input_data)``.
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	37
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	38	Averages and measures of central location
				39	-----------------------------------------
				40
				41	These functions calculate an average or typical value from a population
				42	or sample.
				43
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	44	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	45	:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	46	:func:`fmean` Fast, floating point arithmetic mean.
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	47	:func:`geometric_mean` Geometric mean of data.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	48	:func:`harmonic_mean` Harmonic mean of data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	49	:func:`median` Median (middle value) of data.
				50	:func:`median_low` Low median of data.
				51	:func:`median_high` High median of data.
				52	:func:`median_grouped` Median, or 50th percentile, of grouped data.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	53	:func:`mode` Single mode (most common value) of discrete or nominal data.
				54	:func:`multimode` List of modes (most common values) of discrete or nomimal data.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	55	:func:`quantiles` Divide data into intervals with equal probability.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	56	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	57
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	58	Measures of spread
				59	------------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	60
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	61	These functions calculate a measure of how much the population or sample
				62	tends to deviate from the typical or average values.
				63
				64	======================= =============================================
				65	:func:`pstdev` Population standard deviation of data.
				66	:func:`pvariance` Population variance of data.
				67	:func:`stdev` Sample standard deviation of data.
				68	:func:`variance` Sample variance of data.
				69	======================= =============================================
				70
				71
				72	Function details
				73	----------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	74
Georg Brandl	e051b55	2013-11-04 07:30:50 +0100	[diff] [blame]	75	Note: The functions do not require the data given to them to be sorted.
				76	However, for reading convenience, most of the examples show sorted sequences.
				77
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	78	.. function:: mean(data)
				79
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	80	Return the sample arithmetic mean of data which can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	81
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	82	The arithmetic mean is the sum of the data divided by the number of data
				83	points. It is commonly called "the average", although it is only one of many
				84	different mathematical averages. It is a measure of the central location of
				85	the data.
				86
				87	If data is empty, :exc:`StatisticsError` will be raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	88
				89	Some examples of use:
				90
				91	.. doctest::
				92
				93	>>> mean([1, 2, 3, 4, 4])
				94	2.8
				95	>>> mean([-1.0, 2.5, 3.25, 5.75])
				96	2.625
				97
				98	>>> from fractions import Fraction as F
				99	>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
				100	Fraction(13, 21)
				101
				102	>>> from decimal import Decimal as D
				103	>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
				104	Decimal('0.5625')
				105
				106	.. note::
				107
Georg Brandl	a3fdcaa	2013-10-21 09:08:39 +0200	[diff] [blame]	108	The mean is strongly affected by outliers and is not a robust estimator
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	109	for central location: the mean is not necessarily a typical example of
				110	the data points. For more robust measures of central location, see
				111	:func:`median` and :func:`mode`.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	112
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	113	The sample mean gives an unbiased estimate of the true population mean,
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	114	so that when taken on average over all the possible samples,
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	115	``mean(sample)`` converges on the true mean of the entire population. If
				116	data represents the entire population rather than a sample, then
				117	``mean(data)`` is equivalent to calculating the true population mean μ.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	118
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	119
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	120	.. function:: fmean(data)
				121
				122	Convert data to floats and compute the arithmetic mean.
				123
				124	This runs faster than the :func:`mean` function and it always returns a
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	125	:class:`float`. The data may be a sequence or iterator. If the input
				126	dataset is empty, raises a :exc:`StatisticsError`.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	127
				128	.. doctest::
				129
				130	>>> fmean([3.5, 4.0, 5.25])
				131	4.25
				132
				133	.. versionadded:: 3.8
				134
				135
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	136	.. function:: geometric_mean(data)
				137
				138	Convert data to floats and compute the geometric mean.
				139
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	140	The geometric mean indicates the central tendency or typical value of the
				141	data using the product of the values (as opposed to the arithmetic mean
				142	which uses their sum).
				143
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	144	Raises a :exc:`StatisticsError` if the input dataset is empty,
				145	if it contains a zero, or if it contains a negative value.
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	146	The data may be a sequence or iterator.
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	147
				148	No special efforts are made to achieve exact results.
				149	(However, this may change in the future.)
				150
				151	.. doctest::
				152
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	153	>>> round(geometric_mean([54, 24, 36]), 1)
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	154	36.0
				155
				156	.. versionadded:: 3.8
				157
				158
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	159	.. function:: harmonic_mean(data)
				160
				161	Return the harmonic mean of data, a sequence or iterator of
				162	real-valued numbers.
				163
				164	The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	165	reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	166	data. For example, the harmonic mean of three values a, b and c
				167	will be equivalent to ``3/(1/a + 1/b + 1/c)``.
				168
				169	The harmonic mean is a type of average, a measure of the central
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	170	location of the data. It is often appropriate when averaging
				171	rates or ratios, for example speeds.
				172
				173	Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
				174	What is the average speed?
				175
				176	.. doctest::
				177
				178	>>> harmonic_mean([40, 60])
				179	48.0
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	180
				181	Suppose an investor purchases an equal value of shares in each of
				182	three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
				183	What is the average P/E ratio for the investor's portfolio?
				184
				185	.. doctest::
				186
				187	>>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
				188	3.6
				189
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	190	:exc:`StatisticsError` is raised if data is empty, or any element
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	191	is less than zero.
				192
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	193	.. versionadded:: 3.6
				194
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	195
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	196	.. function:: median(data)
				197
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	198	Return the median (middle value) of numeric data, using the common "mean of
				199	middle two" method. If data is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	200	data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	201
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	202	The median is a robust measure of central location and is less affected by
				203	the presence of outliers. When the number of data points is odd, the
				204	middle data point is returned:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	205
				206	.. doctest::
				207
				208	>>> median([1, 3, 5])
				209	3
				210
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	211	When the number of data points is even, the median is interpolated by taking
				212	the average of the two middle values:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	213
				214	.. doctest::
				215
				216	>>> median([1, 3, 5, 7])
				217	4.0
				218
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	219	This is suited for when your data is discrete, and you don't mind that the
				220	median may not be an actual data point.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	221
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	222	If the data is ordinal (supports order operations) but not numeric (doesn't
				223	support addition), consider using :func:`median_low` or :func:`median_high`
Tal Einat	fdd6e0b	2018-06-25 14:04:01 +0300	[diff] [blame]	224	instead.
				225
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	226	.. function:: median_low(data)
				227
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	228	Return the low median of numeric data. If data is empty,
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	229	:exc:`StatisticsError` is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	230
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	231	The low median is always a member of the data set. When the number of data
				232	points is odd, the middle value is returned. When it is even, the smaller of
				233	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	234
				235	.. doctest::
				236
				237	>>> median_low([1, 3, 5])
				238	3
				239	>>> median_low([1, 3, 5, 7])
				240	3
				241
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	242	Use the low median when your data are discrete and you prefer the median to
				243	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	244
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	245
				246	.. function:: median_high(data)
				247
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	248	Return the high median of data. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	249	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	250
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	251	The high median is always a member of the data set. When the number of data
				252	points is odd, the middle value is returned. When it is even, the larger of
				253	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	254
				255	.. doctest::
				256
				257	>>> median_high([1, 3, 5])
				258	3
				259	>>> median_high([1, 3, 5, 7])
				260	5
				261
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	262	Use the high median when your data are discrete and you prefer the median to
				263	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	264
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	265
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	266	.. function:: median_grouped(data, interval=1)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	267
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	268	Return the median of grouped continuous data, calculated as the 50th
				269	percentile, using interpolation. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	270	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	271
				272	.. doctest::
				273
				274	>>> median_grouped([52, 52, 53, 54])
				275	52.5
				276
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	277	In the following example, the data are rounded, so that each value represents
Serhiy Storchaka	c7b1a0b	2016-11-26 13:43:28 +0200	[diff] [blame]	278	the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
				279	is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
				280	given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	281	interpolation is used to estimate it:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	282
				283	.. doctest::
				284
				285	>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
				286	3.7
				287
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	288	Optional argument interval represents the class interval, and defaults
				289	to 1. Changing the class interval naturally will change the interpolation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	290
				291	.. doctest::
				292
				293	>>> median_grouped([1, 3, 3, 5, 7], interval=1)
				294	3.25
				295	>>> median_grouped([1, 3, 3, 5, 7], interval=2)
				296	3.5
				297
				298	This function does not check whether the data points are at least
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	299	interval apart.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	300
				301	.. impl-detail::
				302
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	303	Under some circumstances, :func:`median_grouped` may coerce data points to
				304	floats. This behaviour is likely to change in the future.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	305
				306	.. seealso::
				307
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	308	* "Statistics for the Behavioral Sciences", Frederick J Gravetter and
				309	Larry B Wallnau (8th Edition).
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	310
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	311	* The `SSMEDIAN
Georg Brandl	525d355	2014-10-29 10:26:56 +0100	[diff] [blame]	312	<https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	313	function in the Gnome Gnumeric spreadsheet, including `this discussion
				314	<https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	315
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	316
				317	.. function:: mode(data)
				318
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	319	Return the single most common data point from discrete or nominal data.
				320	The mode (when it exists) is the most typical value and serves as a
				321	measure of central location.
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	322
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	323	If there are multiple modes with the same frequency, returns the first one
				324	encountered in the data. If the smallest or largest of those is
				325	desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
				326	If the input data is empty, :exc:`StatisticsError` is raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	327
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	328	``mode`` assumes discrete data and returns a single value. This is the
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	329	standard treatment of the mode as commonly taught in schools:
				330
				331	.. doctest::
				332
				333	>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
				334	3
				335
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	336	The mode is unique in that it is the only statistic in this package that
				337	also applies to nominal (non-numeric) data:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	338
				339	.. doctest::
				340
				341	>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
				342	'red'
				343
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	344	.. versionchanged:: 3.8
				345	Now handles multimodal datasets by returning the first mode encountered.
				346	Formerly, it raised :exc:`StatisticsError` when more than one mode was
				347	found.
				348
				349
				350	.. function:: multimode(data)
				351
				352	Return a list of the most frequently occurring values in the order they
				353	were first encountered in the data. Will return more than one result if
				354	there are multiple modes or an empty list if the data is empty:
				355
				356	.. doctest::
				357
				358	>>> multimode('aabbbbccddddeeffffgg')
				359	['b', 'd', 'f']
				360	>>> multimode('')
				361	[]
				362
				363	.. versionadded:: 3.8
				364
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	365
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	366	.. function:: pstdev(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	367
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	368	Return the population standard deviation (the square root of the population
				369	variance). See :func:`pvariance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	370
				371	.. doctest::
				372
				373	>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				374	0.986893273527251
				375
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	376
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	377	.. function:: pvariance(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	378
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	379	Return the population variance of data, a non-empty sequence or iterator
				380	of real-valued numbers. Variance, or second moment about the mean, is a
				381	measure of the variability (spread or dispersion) of data. A large
				382	variance indicates that the data is spread out; a small variance indicates
				383	it is clustered closely around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	384
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	385	If the optional second argument mu is given, it is typically the mean of
				386	the data. It can also be used to compute the second moment around a
				387	point that is not the mean. If it is missing or ``None`` (the default),
				388	the arithmetic mean is automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	389
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	390	Use this function to calculate the variance from the entire population. To
				391	estimate the variance from a sample, the :func:`variance` function is usually
				392	a better choice.
				393
				394	Raises :exc:`StatisticsError` if data is empty.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	395
				396	Examples:
				397
				398	.. doctest::
				399
				400	>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
				401	>>> pvariance(data)
				402	1.25
				403
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	404	If you have already calculated the mean of your data, you can pass it as the
				405	optional second argument mu to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	406
				407	.. doctest::
				408
				409	>>> mu = mean(data)
				410	>>> pvariance(data, mu)
				411	1.25
				412
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	413	Decimals and Fractions are supported:
				414
				415	.. doctest::
				416
				417	>>> from decimal import Decimal as D
				418	>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				419	Decimal('24.815')
				420
				421	>>> from fractions import Fraction as F
				422	>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
				423	Fraction(13, 72)
				424
				425	.. note::
				426
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	427	When called with the entire population, this gives the population variance
				428	σ². When called on a sample instead, this is the biased sample variance
				429	s², also known as variance with N degrees of freedom.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	430
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	431	If you somehow know the true population mean μ, you may use this
				432	function to calculate the variance of a sample, giving the known
				433	population mean as the second argument. Provided the data points are a
				434	random sample of the population, the result will be an unbiased estimate
				435	of the population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	436
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	437
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	438	.. function:: stdev(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	439
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	440	Return the sample standard deviation (the square root of the sample
				441	variance). See :func:`variance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	442
				443	.. doctest::
				444
				445	>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				446	1.0810874155219827
				447
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	448
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	449	.. function:: variance(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	450
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	451	Return the sample variance of data, an iterable of at least two real-valued
				452	numbers. Variance, or second moment about the mean, is a measure of the
				453	variability (spread or dispersion) of data. A large variance indicates that
				454	the data is spread out; a small variance indicates it is clustered closely
				455	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	456
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	457	If the optional second argument xbar is given, it should be the mean of
				458	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	459	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	460
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	461	Use this function when your data is a sample from a population. To calculate
				462	the variance from the entire population, see :func:`pvariance`.
				463
				464	Raises :exc:`StatisticsError` if data has fewer than two values.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	465
				466	Examples:
				467
				468	.. doctest::
				469
				470	>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
				471	>>> variance(data)
				472	1.3720238095238095
				473
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	474	If you have already calculated the mean of your data, you can pass it as the
				475	optional second argument xbar to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	476
				477	.. doctest::
				478
				479	>>> m = mean(data)
				480	>>> variance(data, m)
				481	1.3720238095238095
				482
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	483	This function does not attempt to verify that you have passed the actual mean
				484	as xbar. Using arbitrary values for xbar can lead to invalid or
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	485	impossible results.
				486
				487	Decimal and Fraction values are supported:
				488
				489	.. doctest::
				490
				491	>>> from decimal import Decimal as D
				492	>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				493	Decimal('31.01875')
				494
				495	>>> from fractions import Fraction as F
				496	>>> variance([F(1, 6), F(1, 2), F(5, 3)])
				497	Fraction(67, 108)
				498
				499	.. note::
				500
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	501	This is the sample variance s² with Bessel's correction, also known as
				502	variance with N-1 degrees of freedom. Provided that the data points are
				503	representative (e.g. independent and identically distributed), the result
				504	should be an unbiased estimate of the true population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	505
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	506	If you somehow know the actual population mean μ you should pass it to the
				507	:func:`pvariance` function as the mu parameter to get the variance of a
				508	sample.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	509
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	510	.. function:: quantiles(data, *, n=4, method='exclusive')
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	511
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	512	Divide data into n continuous intervals with equal probability.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	513	Returns a list of ``n - 1`` cut points separating the intervals.
				514
				515	Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set
				516	n to 100 for percentiles which gives the 99 cuts points that separate
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	517	data into 100 equal sized groups. Raises :exc:`StatisticsError` if n
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	518	is not least 1.
				519
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	520	The data can be any iterable containing sample data. For meaningful
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	521	results, the number of data points in data should be larger than n.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	522	Raises :exc:`StatisticsError` if there are not at least two data points.
				523
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	524	The cut points are linearly interpolated from the
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	525	two nearest data points. For example, if a cut point falls one-third
				526	of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	527	cut-point will evaluate to ``104``.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	528
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	529	The method for computing quantiles can be varied depending on
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	530	whether the data includes or excludes the lowest and
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	531	highest possible values from the population.
				532
				533	The default method is "exclusive" and is used for data sampled from
				534	a population that can have more extreme values than found in the
				535	samples. The portion of the population falling below the i-th of
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	536	m sorted data points is computed as ``i / (m + 1)``. Given nine
				537	sample values, the method sorts them and assigns the following
				538	percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	539
				540	Setting the method to "inclusive" is used for describing population
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	541	data or for samples that are known to include the most extreme values
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	542	from the population. The minimum value in data is treated as the 0th
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	543	percentile and the maximum value is treated as the 100th percentile.
				544	The portion of the population falling below the i-th of m sorted
				545	data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
				546	values, the method sorts them and assigns the following percentiles:
				547	0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	548
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	549	.. doctest::
				550
				551	# Decile cut points for empirically sampled data
				552	>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
				553	... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
				554	... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
				555	... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
				556	... 103, 107, 101, 81, 109, 104]
				557	>>> [round(q, 1) for q in quantiles(data, n=10)]
				558	[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
				559
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	560	.. versionadded:: 3.8
				561
				562
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	563	Exceptions
				564	----------
				565
				566	A single exception is defined:
				567
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	568	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	569
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	570	Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	571
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	572
				573	:class:`NormalDist` objects
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	574	---------------------------
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	575
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	576	:class:`NormalDist` is a tool for creating and manipulating normal
				577	distributions of a `random variable
				578	<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	579	class that treats the mean and standard deviation of data
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	580	measurements as a single entity.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	581
				582	Normal distributions arise from the `Central Limit Theorem
				583	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	584	of applications in statistics.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	585
				586	.. class:: NormalDist(mu=0.0, sigma=1.0)
				587
				588	Returns a new NormalDist object where mu represents the `arithmetic
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	589	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and sigma
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	590	represents the `standard deviation
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	591	<https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	592
				593	If sigma is negative, raises :exc:`StatisticsError`.
				594
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	595	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	596
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	597	A read-only property for the `arithmetic mean
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	598	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				599	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	600
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	601	.. attribute:: median
				602
				603	A read-only property for the `median
				604	<https://en.wikipedia.org/wiki/Median>`_ of a normal
				605	distribution.
				606
				607	.. attribute:: mode
				608
				609	A read-only property for the `mode
				610	<https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
				611	distribution.
				612
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	613	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	614
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	615	A read-only property for the `standard deviation
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	616	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				617	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	618
				619	.. attribute:: variance
				620
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	621	A read-only property for the `variance
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	622	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				623	distribution. Equal to the square of the standard deviation.
				624
				625	.. classmethod:: NormalDist.from_samples(data)
				626
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	627	Makes a normal distribution instance with mu and sigma parameters
				628	estimated from the data using :func:`fmean` and :func:`stdev`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	629
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	630	The data can be any :term:`iterable` and should consist of values
				631	that can be converted to type :class:`float`. If data does not
				632	contain at least two elements, raises :exc:`StatisticsError` because it
				633	takes at least one point to estimate a central value and at least two
				634	points to estimate dispersion.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	635
Raymond Hettinger	fb8c7d5	2019-04-23 01:46:18 -0700	[diff] [blame]	636	.. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	637
				638	Generates n random samples for a given mean and standard deviation.
				639	Returns a :class:`list` of :class:`float` values.
				640
				641	If seed is given, creates a new instance of the underlying random
				642	number generator. This is useful for creating reproducible results,
				643	even in a multi-threading context.
				644
				645	.. method:: NormalDist.pdf(x)
				646
				647	Using a `probability density function (pdf)
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	648	<https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
				649	the relative likelihood that a random variable X will be near the
				650	given value x. Mathematically, it is the limit of the ratio ``P(x <=
				651	X < x+dx) / dx`` as dx approaches zero.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	652
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	653	The relative likelihood is computed as the probability of a sample
				654	occurring in a narrow range divided by the width of the range (hence
				655	the word "density"). Since the likelihood is relative to other points,
				656	its value can be greater than `1.0`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	657
				658	.. method:: NormalDist.cdf(x)
				659
				660	Using a `cumulative distribution function (cdf)
				661	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	662	compute the probability that a random variable X will be less than or
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	663	equal to x. Mathematically, it is written ``P(X <= x)``.
				664
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	665	.. method:: NormalDist.inv_cdf(p)
				666
				667	Compute the inverse cumulative distribution function, also known as the
				668	`quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
				669	or the `percent-point
				670	<https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
				671	function. Mathematically, it is written ``x : P(X <= x) = p``.
				672
				673	Finds the value x of the random variable X such that the
				674	probability of the variable being less than or equal to that value
				675	equals the given probability p.
				676
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	677	.. method:: NormalDist.overlap(other)
				678
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	679	Measures the agreement between two normal probability distributions.
				680	Returns a value between 0.0 and 1.0 giving `the overlapping area for
				681	the two probability density functions
				682	<https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	683
Raymond Hettinger	8a6cbf8	2019-10-13 19:53:30 -0700	[diff] [blame^]	684	.. method:: NormalDist.quantiles(n=4)
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	685
				686	Divide the normal distribution into n continuous intervals with
				687	equal probability. Returns a list of (n - 1) cut points separating
				688	the intervals.
				689
				690	Set n to 4 for quartiles (the default). Set n to 10 for deciles.
				691	Set n to 100 for percentiles which gives the 99 cuts points that
				692	separate the normal distribution into 100 equal sized groups.
				693
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	694	Instances of :class:`NormalDist` support addition, subtraction,
				695	multiplication and division by a constant. These operations
				696	are used for translation and scaling. For example:
				697
				698	.. doctest::
				699
				700	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				701	>>> temperature_february * (9/5) + 32 # Fahrenheit
				702	NormalDist(mu=41.0, sigma=4.5)
				703
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	704	Dividing a constant by an instance of :class:`NormalDist` is not supported
				705	because the result wouldn't be normally distributed.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	706
				707	Since normal distributions arise from additive effects of independent
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	708	variables, it is possible to `add and subtract two independent normally
				709	distributed random variables
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	710	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				711	represented as instances of :class:`NormalDist`. For example:
				712
				713	.. doctest::
				714
				715	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				716	>>> drug_effects = NormalDist(0.4, 0.15)
				717	>>> combined = birth_weights + drug_effects
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	718	>>> round(combined.mean, 1)
				719	3.1
				720	>>> round(combined.stdev, 1)
				721	0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	722
				723	.. versionadded:: 3.8
				724
				725
				726	:class:`NormalDist` Examples and Recipes
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	727	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	728
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	729	:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	730
				731	For example, given `historical data for SAT exams
				732	<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	733	are normally distributed with a mean of 1060 and a standard deviation of 192,
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	734	determine the percentage of students with test scores between 1100 and
				735	1200, after rounding to the nearest whole number:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	736
				737	.. doctest::
				738
				739	>>> sat = NormalDist(1060, 195)
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	740	>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	741	>>> round(fraction * 100.0, 1)
				742	18.4
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	743
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	744	Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
				745	<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
				746
				747	.. doctest::
				748
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	749	>>> list(map(round, sat.quantiles()))
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	750	[928, 1060, 1192]
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	751	>>> list(map(round, sat.quantiles(n=10)))
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	752	[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
				753
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	754	To estimate the distribution for a model than isn't easy to solve
				755	analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	756	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	757
				758	.. doctest::
				759
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	760	>>> def model(x, y, z):
				761	... return (3x + 7xy - 5y) / (11 * z)
				762	...
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	763	>>> n = 100_000
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	764	>>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
				765	>>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
				766	>>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
				767	>>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
				768	[1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	769
				770	Normal distributions commonly arise in machine learning problems.
				771
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	772	Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettinger	d70a359	2019-03-09 00:42:23 -0800	[diff] [blame]	773	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
				774	The challenge is to predict a person's gender from measurements of normally
				775	distributed features including height, weight, and foot size.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	776
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	777	We're given a training dataset with measurements for eight people. The
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	778	measurements are assumed to be normally distributed, so we summarize the data
				779	with :class:`NormalDist`:
				780
				781	.. doctest::
				782
				783	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				784	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				785	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				786	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				787	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				788	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				789
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	790	Next, we encounter a new person whose feature measurements are known but whose
				791	gender is unknown:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	792
				793	.. doctest::
				794
				795	>>> ht = 6.0 # height
				796	>>> wt = 130 # weight
				797	>>> fs = 8 # foot size
				798
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	799	Starting with a 50% `prior probability
				800	<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
				801	we compute the posterior as the prior times the product of likelihoods for the
				802	feature measurements given the gender:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	803
				804	.. doctest::
				805
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	806	>>> prior_male = 0.5
				807	>>> prior_female = 0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	808	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				809	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				810
				811	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				812	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				813
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	814	The final prediction goes to the largest posterior. This is known as the
				815	`maximum a posteriori
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	816	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				817
				818	.. doctest::
				819
				820	>>> 'male' if posterior_male > posterior_female else 'female'
				821	'female'
				822
				823
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	824	..
				825	# This modelines must appear within the last ten lines of the file.
				826	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;