:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: Mathematical statistics functions

.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

Source code: :source:`Lib/statistics.py`

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`~numbers.Real`-valued) data.

The module is not intended to be a competitor to third-party libraries such
as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
proprietary full-featured statistics packages aimed at professional
statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
graphing and scientific calculators.

Unless explicitly noted, these functions support :class:`int`,
:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
Behaviour with other types (whether in the numeric tower or not) is
currently unsupported. Collections with a mix of types are also undefined
and implementation-dependent. If your input data consists of mixed types,
you may be able to use :func:`map` to ensure a consistent result, for
example: ``map(float, input_data)``.

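As a minimal sketch of the ``map(float, input_data)`` suggestion, the snippet below converts a mixed-type collection to floats before averaging. The ``mixed`` list is a made-up input for illustration only.

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# Hypothetical mixed-type input; behaviour on this list directly is undefined.
mixed = [1, Fraction(1, 2), Decimal("0.25"), 2.25]

# Converting every item to float first gives a single, predictable type.
as_floats = list(map(float, mixed))
result = mean(as_floats)
print(result)
```

The conversion trades exactness for consistency: any precision carried by the ``Decimal`` or ``Fraction`` inputs is reduced to float precision.
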
Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  ===============================================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`fmean`            Fast, floating point arithmetic mean.
:func:`geometric_mean`   Geometric mean of data.
:func:`harmonic_mean`    Harmonic mean of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Single mode (most common value) of discrete or nominal data.
:func:`multimode`        List of modes (most common values) of discrete or nominal data.
:func:`quantiles`        Divide data into intervals with equal probability.
=======================  ===============================================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

Note: The functions do not require the data given to them to be sorted.
However, for reading convenience, most of the examples show sorted sequences.

.. function:: mean(data)

   Return the sample arithmetic mean of *data*, which can be a sequence or
   iterable.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If *data* is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of
      the data points. For more robust measures of central location, see
      :func:`median` and :func:`mode`.

      The sample mean gives an unbiased estimate of the true population mean,
      so that when taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      *data* represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.


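To illustrate the note about outliers, the following sketch compares how far the mean and the median move when one extreme value is appended. The numbers are invented for illustration.

```python
from statistics import mean, median

typical = [10, 11, 12, 13, 14]
with_outlier = typical + [1140]  # one hypothetical extreme value

# The mean is dragged far from the bulk of the data...
mean_before, mean_after = mean(typical), mean(with_outlier)

# ...while the median barely moves.
median_before, median_after = median(typical), median(with_outlier)

print(mean_before, mean_after)      # 12 200
print(median_before, median_after)  # 12 12.5
```
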
.. function:: fmean(data)

   Convert *data* to floats and compute the arithmetic mean.

   This runs faster than the :func:`mean` function and it always returns a
   :class:`float`. The *data* may be a sequence or iterable. If the input
   dataset is empty, raises a :exc:`StatisticsError`.

   .. doctest::

      >>> fmean([3.5, 4.0, 5.25])
      4.25

   .. versionadded:: 3.8


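The difference in return type between :func:`mean` and :func:`fmean` can be seen with exact inputs. This is a small sketch with illustrative values; the variable names are not part of the module.

```python
from fractions import Fraction
from statistics import fmean, mean

data = [Fraction(1, 4), Fraction(3, 4), Fraction(1, 2)]

exact = mean(data)    # keeps the input type: a Fraction
approx = fmean(data)  # always a float

print(repr(exact), repr(approx))
```
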
.. function:: geometric_mean(data)

   Convert *data* to floats and compute the geometric mean.

   The geometric mean indicates the central tendency or typical value of the
   *data* using the product of the values (as opposed to the arithmetic mean
   which uses their sum).

   Raises a :exc:`StatisticsError` if the input dataset is empty,
   if it contains a zero, or if it contains a negative value.
   The *data* may be a sequence or iterable.

   No special efforts are made to achieve exact results.
   (However, this may change in the future.)

   .. doctest::

      >>> round(geometric_mean([54, 24, 36]), 1)
      36.0

   .. versionadded:: 3.8


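One common use of a product-based average is compounding. The sketch below, using hypothetical yearly growth factors, checks that applying the geometric mean once per period reproduces the overall product, up to floating point rounding.

```python
import math
from statistics import geometric_mean

# Hypothetical growth factors: +10%, +20%, then -10%.
growth_factors = [1.10, 1.20, 0.90]

average_factor = geometric_mean(growth_factors)
compounded = math.prod(growth_factors)

# Applying the average factor three times matches the compounded growth.
print(math.isclose(average_factor ** len(growth_factors), compounded))
```
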
.. function:: harmonic_mean(data, weights=None)

   Return the harmonic mean of *data*, a sequence or iterable of
   real-valued numbers. If *weights* is omitted or ``None``, then
   equal weighting is assumed.

   The harmonic mean, sometimes called the subcontrary mean, is the
   reciprocal of the arithmetic :func:`mean` of the reciprocals of the
   data. For example, the harmonic mean of three values *a*, *b* and *c*
   will be equivalent to ``3/(1/a + 1/b + 1/c)``. If one of the values
   is zero, the result will be zero.

   The harmonic mean is a type of average, a measure of the central
   location of the data. It is often appropriate when averaging
   rates or ratios, for example speeds.

   Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
   What is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60])
      48.0

   Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
   speeds up to 60 km/hr for the remaining 30 km of the journey. What
   is the average speed?

   .. doctest::

      >>> harmonic_mean([40, 60], weights=[5, 30])
      56.0

   :exc:`StatisticsError` is raised if *data* is empty, any element
   is less than zero, or if the weighted sum isn't positive.

   The current algorithm has an early-out when it encounters a zero
   in the input. This means that the subsequent inputs are not tested
   for validity. (This behavior may change in the future.)

   .. versionadded:: 3.6

   .. versionchanged:: 3.10
      Added support for *weights*.

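The weighted result can be checked by hand: divide the total weight (distance) by the sum of weight over value (time). The helper below is a hypothetical illustration of that arithmetic, not the module's implementation, and it does no validation of zeros or negative values.

```python
import math


def weighted_harmonic_mean(values, weights):
    """Total weight divided by the sum of weight/value terms."""
    total_weight = sum(weights)
    weighted_reciprocals = sum(w / x for x, w in zip(values, weights))
    return total_weight / weighted_reciprocals


# 5 km at 40 km/hr plus 30 km at 60 km/hr: 35 km in 0.625 hours.
average_speed = weighted_harmonic_mean([40, 60], [5, 30])

# Equal weights reduce to the ordinary harmonic mean.
equal_weights = weighted_harmonic_mean([40, 60], [1, 1])

print(average_speed, equal_weights)
```
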
.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If *data* is empty, :exc:`StatisticsError` is raised.
   *data* can be a sequence or iterable.

   The median is a robust measure of central location and is less affected by
   the presence of outliers. When the number of data points is odd, the
   middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This is suited for when your data is discrete, and you don't mind that the
   median may not be an actual data point.

   If the data is ordinal (supports order operations) but not numeric (doesn't
   support addition), consider using :func:`median_low` or :func:`median_high`
   instead.

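As a sketch of the ordinal case, the example below takes the low and high medians of string labels, which can be ordered but not added. The grade labels are hypothetical, and their alphabetical order happens to match the intended ranking.

```python
from statistics import median_high, median_low

# Hypothetical letter grades: orderable, but "B" + "C" has no numeric meaning.
grades = ["C", "A", "B", "F", "B", "D"]

low = median_low(grades)    # smaller of the two middle values
high = median_high(grades)  # larger of the two middle values

print(low, high)
```
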
.. function:: median_low(data)

   Return the low median of numeric data. If *data* is empty,
   :exc:`StatisticsError` is raised. *data* can be a sequence or iterable.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If *data* is empty, :exc:`StatisticsError`
   is raised. *data* can be a sequence or iterable.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation. If *data* is empty, :exc:`StatisticsError`
   is raised. *data* can be a sequence or iterable.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
   is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
   given, the middle value falls somewhere in the class 3.5--4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument *interval* represents the class interval, and defaults
   to 1. Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   *interval* apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * The `SSMEDIAN
        <https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.

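The interpolation can be followed by hand with the usual textbook formula for grouped data: take the lower limit of the class containing the median and move into the class in proportion to how many points are still needed to reach the halfway count. The function below is a hypothetical sketch of that formula, not necessarily how the module implements it; it reproduces the documented results above.

```python
import math


def grouped_median_sketch(data, interval=1):
    """Textbook grouped-median interpolation (illustrative only)."""
    values = sorted(data)
    n = len(values)
    x = values[n // 2]                       # midpoint of the median class
    lower = x - interval / 2                 # lower limit of that class
    below = sum(1 for v in values if v < x)  # cumulative frequency below it
    inside = values.count(x)                 # frequency within the class
    return lower + interval * (n / 2 - below) / inside


rounded = grouped_median_sketch([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
narrow = grouped_median_sketch([1, 3, 3, 5, 7], interval=1)
wide = grouped_median_sketch([1, 3, 3, 5, 7], interval=2)
print(rounded, narrow, wide)
```
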

.. function:: mode(data)

   Return the single most common data point from discrete or nominal *data*.
   The mode (when it exists) is the most typical value and serves as a
   measure of central location.

   If there are multiple modes with the same frequency, returns the first one
   encountered in the *data*. If the smallest or largest of those is
   desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
   If the input *data* is empty, :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this package that
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

   .. versionchanged:: 3.8
      Now handles multimodal datasets by returning the first mode encountered.
      Formerly, it raised :exc:`StatisticsError` when more than one mode was
      found.


.. function:: multimode(data)

   Return a list of the most frequently occurring values in the order they
   were first encountered in the *data*. Will return more than one result if
   there are multiple modes or an empty list if the *data* is empty:

   .. doctest::

      >>> multimode('aabbbbccddddeeffffgg')
      ['b', 'd', 'f']
      >>> multimode('')
      []

   .. versionadded:: 3.8


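A short sketch of the tie-breaking behaviour described for :func:`mode`, together with the ``min``/``max`` idiom over :func:`multimode`. The sample data is made up, and the first-encountered result assumes Python 3.8 or later.

```python
from statistics import mode, multimode

# Both 3 and 1 occur twice; 3 is encountered first.
data = [3, 1, 1, 3, 2]

first = mode(data)                # first mode encountered
all_modes = multimode(data)       # every mode, in order of first appearance
smallest = min(multimode(data))
largest = max(multimode(data))

print(first, all_modes, smallest, largest)
```
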
.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance). See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251


.. function:: pvariance(data, mu=None)

   Return the population variance of *data*, a non-empty sequence or iterable
   of real-valued numbers. Variance, or second moment about the mean, is a
   measure of the variability (spread or dispersion) of data. A large
   variance indicates that the data is spread out; a small variance indicates
   it is clustered closely around the mean.

   If the optional second argument *mu* is given, it is typically the mean of
   the *data*. It can also be used to compute the second moment around a
   point that is not the mean. If it is missing or ``None`` (the default),
   the arithmetic mean is automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if *data* is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *mu* to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ². When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this
      function to calculate the variance of a sample, giving the known
      population mean as the second argument. Provided the data points are a
      random sample of the population, the result will be an unbiased estimate
      of the population variance.


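The "second moment about a point" wording can be made concrete with a few lines of arithmetic. The helper ``second_moment`` below is a hypothetical illustration of the definition, not the module's code; taken about the mean, it agrees with :func:`pvariance` on the example data.

```python
import math
from statistics import mean, pvariance


def second_moment(data, center):
    """Average squared distance of the data from a chosen center."""
    return sum((x - center) ** 2 for x in data) / len(data)


data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
mu = mean(data)

about_mean = second_moment(data, mu)   # the population variance
about_zero = second_moment(data, 0.0)  # larger: variance plus mu squared

print(about_mean, about_zero)
```
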
.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of *data*, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument *xbar* is given, it should be the mean of
   *data*. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if *data* has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument *xbar* to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as *xbar*. Using arbitrary values for *xbar* can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ you should pass it to the
      :func:`pvariance` function as the *mu* parameter to get the variance of a
      sample.

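Bessel's correction amounts to dividing the sum of squared deviations by N-1 instead of N. As a quick check of that relationship on the example data, rescaling the population variance by N/(N-1) should match the sample variance up to floating point rounding.

```python
import math
from statistics import pvariance, variance

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
n = len(data)

# Same sum of squares, divided by n - 1 rather than n.
rescaled = pvariance(data) * n / (n - 1)

print(rescaled, variance(data))
```
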
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	518	.. function:: quantiles(data, *, n=4, method='exclusive')
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	519
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	520	Divide data into n continuous intervals with equal probability.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	521	Returns a list of ``n - 1`` cut points separating the intervals.
				522
				523	Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set
				524	n to 100 for percentiles which gives the 99 cuts points that separate
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	525	data into 100 equal sized groups. Raises :exc:`StatisticsError` if n
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	526	is not least 1.
				527
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	528	The data can be any iterable containing sample data. For meaningful
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	529	results, the number of data points in data should be larger than n.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	530	Raises :exc:`StatisticsError` if there are not at least two data points.
				531
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	532	The cut points are linearly interpolated from the
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	533	two nearest data points. For example, if a cut point falls one-third
				534	of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	535	cut point will evaluate to ``104``.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	536
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	537	The method for computing quantiles can be varied depending on
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	538	whether the data includes or excludes the lowest and
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	539	highest possible values from the population.
				540
				541	The default method is "exclusive" and is used for data sampled from
				542	a population that can have more extreme values than found in the
				543	samples. The portion of the population falling below the i-th of
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	544	m sorted data points is computed as ``i / (m + 1)``. Given nine
				545	sample values, the method sorts them and assigns the following
				546	percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	547
				548	The "inclusive" method is used for describing population
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	549	data or for samples that are known to include the most extreme values
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	550	from the population. The minimum value in data is treated as the 0th
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	551	percentile and the maximum value is treated as the 100th percentile.
				552	The portion of the population falling below the i-th of m sorted
				553	data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
				554	values, the method sorts them and assigns the following percentiles:
				555	0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	556
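As a rough illustration of the two methods described above, the following sketch feeds evenly spaced, made-up values to :func:`quantiles`. The data sets are assumptions chosen so that each cut point lands exactly on a sample value:

```python
from statistics import quantiles

# Nine evenly spaced sample values. With the default "exclusive" method the
# i-th of m sorted points is placed at i / (m + 1): 10%, 20%, ..., 90%.
nine = [10, 20, 30, 40, 50, 60, 70, 80, 90]
exclusive_deciles = quantiles(nine, n=10)

# Eleven values. With "inclusive" the i-th point is placed at
# (i - 1) / (m - 1): 0%, 10%, ..., 100%.
eleven = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
inclusive_deciles = quantiles(eleven, n=10, method='inclusive')

print(exclusive_deciles)
print(inclusive_deciles)
```

With these inputs both calls should give the same nine decile cut points, because the interior sample values sit exactly on the 10% through 90% marks under their respective methods.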
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	557	.. doctest::
				558
				559	# Decile cut points for empirically sampled data
				560	>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
				561	... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
				562	... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
				563	... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
				564	... 103, 107, 101, 81, 109, 104]
				565	>>> [round(q, 1) for q in quantiles(data, n=10)]
				566	[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
				567
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	568	.. versionadded:: 3.8
				569
				570
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	571	Exceptions
				572	----------
				573
				574	A single exception is defined:
				575
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	576	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	577
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	578	Subclass of :exc:`ValueError` for statistics-related exceptions.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	579
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	580
				581	:class:`NormalDist` objects
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	582	---------------------------
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	583
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	584	:class:`NormalDist` is a tool for creating and manipulating normal
				585	distributions of a `random variable
				586	<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	587	class that treats the mean and standard deviation of data
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	588	measurements as a single entity.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	589
				590	Normal distributions arise from the `Central Limit Theorem
				591	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	592	of applications in statistics.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	593
				594	.. class:: NormalDist(mu=0.0, sigma=1.0)
				595
				596	Returns a new NormalDist object where mu represents the `arithmetic
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	597	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and sigma
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	598	represents the `standard deviation
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	599	<https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	600
				601	If sigma is negative, raises :exc:`StatisticsError`.
				602
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	603	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	604
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	605	A read-only property for the `arithmetic mean
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	606	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				607	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	608
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	609	.. attribute:: median
				610
				611	A read-only property for the `median
				612	<https://en.wikipedia.org/wiki/Median>`_ of a normal
				613	distribution.
				614
				615	.. attribute:: mode
				616
				617	A read-only property for the `mode
				618	<https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
				619	distribution.
				620
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	621	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	622
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	623	A read-only property for the `standard deviation
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	624	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				625	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	626
				627	.. attribute:: variance
				628
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	629	A read-only property for the `variance
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	630	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				631	distribution. Equal to the square of the standard deviation.
				632
				633	.. classmethod:: NormalDist.from_samples(data)
				634
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	635	Makes a normal distribution instance with mu and sigma parameters
				636	estimated from the data using :func:`fmean` and :func:`stdev`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	637
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	638	The data can be any :term:`iterable` and should consist of values
				639	that can be converted to type :class:`float`. If data does not
				640	contain at least two elements, raises :exc:`StatisticsError` because it
				641	takes at least one point to estimate a central value and at least two
				642	points to estimate dispersion.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	643
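A minimal sketch of the two cases described above, using made-up measurements: a data set large enough to estimate both parameters, and a single point that should be rejected.

```python
from statistics import NormalDist, StatisticsError, fmean, stdev

# Illustrative data only; any iterable of float-convertible values works.
data = [2.5, 3.1, 2.1, 2.4, 2.7, 3.5]
dist = NormalDist.from_samples(data)
print(dist.mean, dist.stdev)

# One point gives a central value but no estimate of dispersion.
rejected = False
try:
    NormalDist.from_samples([2.5])
except StatisticsError as exc:
    rejected = True
    print("rejected:", exc)
```

The estimated parameters should agree with calling :func:`fmean` and :func:`stdev` on the same data directly.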
Raymond Hettinger	fb8c7d5	2019-04-23 01:46:18 -0700	[diff] [blame]	644	.. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	645
				646	Generates n random samples for a given mean and standard deviation.
				647	Returns a :class:`list` of :class:`float` values.
				648
				649	If seed is given, creates a new instance of the underlying random
				650	number generator. This is useful for creating reproducible results,
				651	even in a multi-threading context.
				652
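The effect of *seed* can be sketched as follows. The distribution parameters and the seed value are arbitrary choices for illustration:

```python
from statistics import NormalDist

dist = NormalDist(mu=10, sigma=2.5)

# The same seed should reproduce the same sequence of samples.
first = dist.samples(5, seed=12345)
second = dist.samples(5, seed=12345)

# Without a seed, the samples vary from one call to the next.
unseeded = dist.samples(5)

print(first == second)
print(len(unseeded))
```

Because a seeded call builds its own generator, the result does not depend on how other code has used the shared :mod:`random` state.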
				653	.. method:: NormalDist.pdf(x)
				654
				655	Using a `probability density function (pdf)
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	656	<https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
				657	the relative likelihood that a random variable X will be near the
				658	given value x. Mathematically, it is the limit of the ratio ``P(x <=
				659	X < x+dx) / dx`` as dx approaches zero.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	660
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	661	The relative likelihood is computed as the probability of a sample
				662	occurring in a narrow range divided by the width of the range (hence
				663	the word "density"). Since the likelihood is relative to other points,
				664	its value can be greater than ``1.0``.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	665
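To make the "density" idea concrete, this sketch compares ``pdf(x)`` with the ratio ``P(x <= X < x+dx) / dx`` for a small ``dx``. The narrow distribution and the step size are assumptions picked so that the density exceeds ``1.0``:

```python
from statistics import NormalDist

# A tightly concentrated distribution has a tall, narrow density curve.
narrow = NormalDist(mu=0.0, sigma=0.1)

x, dx = 0.0, 1e-6
density = narrow.pdf(x)

# Probability of landing in [x, x + dx) divided by the width of that range.
ratio = (narrow.cdf(x + dx) - narrow.cdf(x)) / dx

print(density)
print(ratio)
```

The two numbers should agree closely, and both are well above ``1.0`` even though no probability exceeds ``1.0``.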
				666	.. method:: NormalDist.cdf(x)
				667
				668	Using a `cumulative distribution function (cdf)
				669	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	670	compute the probability that a random variable X will be less than or
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	671	equal to x. Mathematically, it is written ``P(X <= x)``.
				672
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	673	.. method:: NormalDist.inv_cdf(p)
				674
				675	Compute the inverse cumulative distribution function, also known as the
				676	`quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
				677	or the `percent-point
				678	<https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
				679	function. Mathematically, it is written ``x : P(X <= x) = p``.
				680
				681	Finds the value x of the random variable X such that the
				682	probability of the variable being less than or equal to that value
				683	equals the given probability p.
				684
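Since ``inv_cdf`` undoes ``cdf``, a round trip should return the starting probability. The sketch below uses the SAT parameters from the examples section (mean 1060, standard deviation 195); the chosen probability is arbitrary:

```python
from statistics import NormalDist

sat = NormalDist(1060, 195)

# Find the score below which 90% of the distribution falls.
p = 0.9
cutoff = sat.inv_cdf(p)

# Feeding that score back into cdf() should recover p.
roundtrip = sat.cdf(cutoff)

# The 50% point of a normal distribution is its mean.
midpoint = sat.inv_cdf(0.5)

print(cutoff, roundtrip, midpoint)
```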
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	685	.. method:: NormalDist.overlap(other)
				686
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	687	Measures the agreement between two normal probability distributions.
				688	Returns a value between 0.0 and 1.0 giving `the overlapping area for
				689	the two probability density functions
				690	<https://www.rasch.org/rmt/rmt101r.htm>`_.
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	691
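A short sketch of how the overlap behaves at its extremes. All parameter values here are made up for illustration:

```python
from statistics import NormalDist

a = NormalDist(mu=2.4, sigma=1.6)
b = NormalDist(mu=3.2, sigma=2.0)

# Partially overlapping distributions give a value strictly between 0 and 1.
partial = a.overlap(b)

# A distribution agrees completely with itself.
identical = a.overlap(a)

# Distributions that are very far apart barely overlap at all.
far = a.overlap(NormalDist(mu=100.0, sigma=1.6))

print(partial, identical, far)
```

The measure is symmetric, so ``a.overlap(b)`` and ``b.overlap(a)`` should give the same result.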
Raymond Hettinger	8a6cbf8	2019-10-13 19:53:30 -0700	[diff] [blame]	692	.. method:: NormalDist.quantiles(n=4)
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	693
				694	Divide the normal distribution into n continuous intervals with
				695	equal probability. Returns a list of (n - 1) cut points separating
				696	the intervals.
				697
				698	Set n to 4 for quartiles (the default). Set n to 10 for deciles.
				699	Set n to 100 for percentiles, which gives the 99 cut points that
				700	separate the normal distribution into 100 equal sized groups.
				701
Raymond Hettinger	70f027d	2020-04-16 10:25:14 -0700	[diff] [blame]	702	.. method:: NormalDist.zscore(x)
				703
				704	Compute the
				705	`Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
				706	describing x in terms of the number of standard deviations
				707	above or below the mean of the normal distribution:
				708	``(x - mean) / stdev``.
				709
				710	.. versionadded:: 3.9
				711
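For instance, with a hypothetical score scale that has a mean of 100 and a standard deviation of 15 (values assumed purely for illustration), the standard score can be computed and then inverted like this. Note that ``zscore`` needs Python 3.9 or later:

```python
from statistics import NormalDist

# Hypothetical scale: mean 100, standard deviation 15.
scale = NormalDist(mu=100, sigma=15)

# (130 - 100) / 15 standard deviations above the mean.
z = scale.zscore(130)

# Multiplying back by stdev and adding the mean recovers the original value.
original = scale.mean + z * scale.stdev

print(z, original)
```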
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	712	Instances of :class:`NormalDist` support addition, subtraction,
				713	multiplication and division by a constant. These operations
				714	are used for translation and scaling. For example:
				715
				716	.. doctest::
				717
				718	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				719	>>> temperature_february * (9/5) + 32 # Fahrenheit
				720	NormalDist(mu=41.0, sigma=4.5)
				721
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	722	Dividing a constant by an instance of :class:`NormalDist` is not supported
				723	because the result wouldn't be normally distributed.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	724
				725	Since normal distributions arise from additive effects of independent
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	726	variables, it is possible to `add and subtract two independent normally
				727	distributed random variables
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	728	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				729	represented as instances of :class:`NormalDist`. For example:
				730
				731	.. doctest::
				732
				733	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				734	>>> drug_effects = NormalDist(0.4, 0.15)
				735	>>> combined = birth_weights + drug_effects
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	736	>>> round(combined.mean, 1)
				737	3.1
				738	>>> round(combined.stdev, 1)
				739	0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	740
				741	.. versionadded:: 3.8
				742
				743
				744	:class:`NormalDist` Examples and Recipes
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	745	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	746
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	747	:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	748
				749	For example, given `historical data for SAT exams
Raymond Hettinger	01bf219	2020-01-27 18:31:46 -0800	[diff] [blame]	750	<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
				751	that scores are normally distributed with a mean of 1060 and a standard
				752	deviation of 195, determine the percentage of students with test scores
				753	between 1100 and 1200, after rounding to the nearest whole number:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	754
				755	.. doctest::
				756
				757	>>> sat = NormalDist(1060, 195)
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	758	>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	759	>>> round(fraction * 100.0, 1)
				760	18.4
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	761
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	762	Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
				763	<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
				764
				765	.. doctest::
				766
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	767	>>> list(map(round, sat.quantiles()))
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	768	[928, 1060, 1192]
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	769	>>> list(map(round, sat.quantiles(n=10)))
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	770	[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
				771
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	772	To estimate the distribution for a model that isn't easy to solve
				773	analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	774	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	775
				776	.. doctest::
				777
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	778	>>> def model(x, y, z):
				779	...     return (3*x + 7*x*y - 5*y) / (11 * z)
				780	...
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	781	>>> n = 100_000
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	782	>>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
				783	>>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
				784	>>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
				785	>>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
				786	[1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	787
Raymond Hettinger	10355ed	2020-01-25 20:21:17 -0800	[diff] [blame]	788	Normal distributions can be used to approximate `Binomial
				789	distributions <http://mathworld.wolfram.com/BinomialDistribution.html>`_
				790	when the sample size is large and when the probability of a successful
				791	trial is near 50%.
				792
				793	For example, an open source conference has 750 attendees and two rooms with a
				794	500 person capacity. There is a talk about Python and another about Ruby.
				795	In previous conferences, 65% of the attendees preferred to listen to Python
				796	talks. Assuming the population preferences haven't changed, what is the
Raymond Hettinger	01bf219	2020-01-27 18:31:46 -0800	[diff] [blame]	797	probability that the Python room will stay within its capacity limits?
Raymond Hettinger	10355ed	2020-01-25 20:21:17 -0800	[diff] [blame]	798
				799	.. doctest::
				800
				801	>>> n = 750 # Sample size
				802	>>> p = 0.65 # Preference for Python
				803	>>> q = 1.0 - p # Preference for Ruby
				804	>>> k = 500 # Room capacity
				805
				806	>>> # Approximation using the cumulative normal distribution
				807	>>> from math import sqrt
				808	>>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
				809	0.8402
				810
				811	>>> # Solution using the cumulative binomial distribution
				812	>>> from math import comb, fsum
				813	>>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
				814	0.8402
				815
				816	>>> # Approximation using a simulation
				817	>>> from random import seed, choices
				818	>>> seed(8675309)
				819	>>> def trial():
				820	...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
				821	>>> mean(trial() <= k for i in range(10_000))
				822	0.8398
				823
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	824	Normal distributions commonly arise in machine learning problems.
				825
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	826	Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettinger	d70a359	2019-03-09 00:42:23 -0800	[diff] [blame]	827	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
				828	The challenge is to predict a person's gender from measurements of normally
				829	distributed features including height, weight, and foot size.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	830
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	831	We're given a training dataset with measurements for eight people. The
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	832	measurements are assumed to be normally distributed, so we summarize the data
				833	with :class:`NormalDist`:
				834
				835	.. doctest::
				836
				837	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				838	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				839	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				840	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				841	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				842	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				843
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	844	Next, we encounter a new person whose feature measurements are known but whose
				845	gender is unknown:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	846
				847	.. doctest::
				848
				849	>>> ht = 6.0 # height
				850	>>> wt = 130 # weight
				851	>>> fs = 8 # foot size
				852
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	853	Starting with a 50% `prior probability
				854	<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
				855	we compute the posterior as the prior times the product of likelihoods for the
				856	feature measurements given the gender:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	857
				858	.. doctest::
				859
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	860	>>> prior_male = 0.5
				861	>>> prior_female = 0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	862	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				863	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				864
				865	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				866	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				867
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	868	The final prediction goes to the largest posterior. This is known as the
				869	`maximum a posteriori
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	870	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				871
				872	.. doctest::
				873
				874	>>> 'male' if posterior_male > posterior_female else 'female'
				875	'female'
				876
				877
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	878	..
				879	# These modelines must appear within the last ten lines of the file.
				880	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;