Blame - Doc/library/statistics.rst - platform/external/python/cpython3

Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	1	:mod:`statistics` --- Mathematical statistics functions
				2	=======================================================
				3
				4	.. module:: statistics
Sanchit Khurana	f8a6316	2019-11-26 03:47:59 +0530	[diff] [blame]	5	:synopsis: Mathematical statistics functions
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	7	.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
				8	.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
				9
				10	.. versionadded:: 3.4
				11
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	12	Source code: :source:`Lib/statistics.py`
				13
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	14	.. testsetup:: *
				15
				16	from statistics import *
				17	__name__ = '<doctest>'
				18
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	19	--------------
				20
				21	This module provides functions for calculating mathematical statistics of
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	22	numeric (:class:`~numbers.Real`-valued) data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	23
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	24	The module is not intended to be a competitor to third-party libraries such
				25	as `NumPy <https://numpy.org>`_, `SciPy <https://www.scipy.org/>`_, or
				26	proprietary full-featured statistics packages aimed at professional
				27	statisticians such as Minitab, SAS and Matlab. It is aimed at the level of
				28	graphing and scientific calculators.
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	29
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	30	Unless explicitly noted, these functions support :class:`int`,
				31	:class:`float`, :class:`~decimal.Decimal` and :class:`~fractions.Fraction`.
				32	Behaviour with other types (whether in the numeric tower or not) is
				33	currently unsupported. Collections with a mix of types are also undefined
				34	and implementation-dependent. If your input data consists of mixed types,
				35	you may be able to use :func:`map` to ensure a consistent result, for
				36	example: ``map(float, input_data)``.
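As a minimal sketch of the coercion suggested above (the sample values are illustrative assumptions, not taken from the module), mapping mixed inputs to :class:`float` gives the function a single, consistent type to work with:

```python
from fractions import Fraction
from statistics import mean

# Illustrative mixed-type input: an int, a Fraction and a float.
input_data = [1, Fraction(1, 2), 1.5]

# Coerce every value to float before computing the statistic.
result = mean(map(float, input_data))
print(result)
```

The same pattern should work with any of the functions that accept an iterable.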
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	37
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	38	Averages and measures of central location
				39	-----------------------------------------
				40
				41	These functions calculate an average or typical value from a population
				42	or sample.
				43
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	44	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	45	:func:`mean`            Arithmetic mean ("average") of data.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	46	:func:`fmean`           Fast, floating point arithmetic mean.
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	47	:func:`geometric_mean`  Geometric mean of data.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	48	:func:`harmonic_mean`   Harmonic mean of data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	49	:func:`median`          Median (middle value) of data.
				50	:func:`median_low`      Low median of data.
				51	:func:`median_high`     High median of data.
				52	:func:`median_grouped`  Median, or 50th percentile, of grouped data.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	53	:func:`mode`            Single mode (most common value) of discrete or nominal data.
Zackery Spytz	f2b4536	2021-03-13 18:00:28 -0700	[diff] [blame]	54	:func:`multimode`       List of modes (most common values) of discrete or nominal data.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	55	:func:`quantiles`       Divide data into intervals with equal probability.
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	56	======================= ===============================================================
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	57
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	58	Measures of spread
				59	------------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	60
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	61	These functions calculate a measure of how much the population or sample
				62	tends to deviate from the typical or average values.
				63
				64	======================= =============================================
				65	:func:`pstdev`          Population standard deviation of data.
				66	:func:`pvariance`       Population variance of data.
				67	:func:`stdev`           Sample standard deviation of data.
				68	:func:`variance`        Sample variance of data.
				69	======================= =============================================
				70
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	71	Statistics for relations between two inputs
				72	-------------------------------------------
				73
				74	These functions calculate statistics regarding relations between two inputs.
				75
				76	========================= =====================================================
				77	:func:`covariance`        Sample covariance for two variables.
				78	:func:`correlation`       Pearson's correlation coefficient for two variables.
Miss Islington (bot)	8677987	2021-05-24 18:11:12 -0700	[diff] [blame]	79	:func:`linear_regression` Slope and intercept for simple linear regression.
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	80	========================= =====================================================
				81
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	82
				83	Function details
				84	----------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	85
Georg Brandl	e051b55	2013-11-04 07:30:50 +0100	[diff] [blame]	86	Note: The functions do not require the data given to them to be sorted.
				87	However, for reading convenience, most of the examples show sorted sequences.
				88
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	89	.. function:: mean(data)
				90
Raymond Hettinger	733b9a3	2019-11-11 23:35:06 -0800	[diff] [blame]	91	Return the sample arithmetic mean of data, which can be a sequence or iterable.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	92
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	93	The arithmetic mean is the sum of the data divided by the number of data
				94	points. It is commonly called "the average", although it is only one of many
				95	different mathematical averages. It is a measure of the central location of
				96	the data.
				97
				98	If data is empty, :exc:`StatisticsError` will be raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	99
				100	Some examples of use:
				101
				102	.. doctest::
				103
				104	>>> mean([1, 2, 3, 4, 4])
				105	2.8
				106	>>> mean([-1.0, 2.5, 3.25, 5.75])
				107	2.625
				108
				109	>>> from fractions import Fraction as F
				110	>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
				111	Fraction(13, 21)
				112
				113	>>> from decimal import Decimal as D
				114	>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
				115	Decimal('0.5625')
				116
				117	.. note::
				118
Georg Brandl	a3fdcaa	2013-10-21 09:08:39 +0200	[diff] [blame]	119	The mean is strongly affected by outliers and is not a robust estimator
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	120	for central location: the mean is not necessarily a typical example of
				121	the data points. For more robust measures of central location, see
				122	:func:`median` and :func:`mode`.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	123
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	124	The sample mean gives an unbiased estimate of the true population mean,
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	125	so that when taken on average over all the possible samples,
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	126	``mean(sample)`` converges on the true mean of the entire population. If
				127	data represents the entire population rather than a sample, then
				128	``mean(data)`` is equivalent to calculating the true population mean μ.
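To illustrate the note on outliers, here is a small sketch with made-up data (the two lists are assumptions chosen for the example). A single extreme value moves the mean substantially while the median stays put:

```python
from statistics import mean, median

# Illustrative data: the second list replaces 5 with an outlier of 100.
typical = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 100]

mean_typical = mean(typical)
mean_outlier = mean(with_outlier)
median_typical = median(typical)
median_outlier = median(with_outlier)

print(mean_typical, mean_outlier)      # the mean shifts with the outlier
print(median_typical, median_outlier)  # the median is unchanged
```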
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	129
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	130
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	131	.. function:: fmean(data)
				132
				133	Convert data to floats and compute the arithmetic mean.
				134
				135	This runs faster than the :func:`mean` function and it always returns a
Raymond Hettinger	733b9a3	2019-11-11 23:35:06 -0800	[diff] [blame]	136	:class:`float`. The data may be a sequence or iterable. If the input
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	137	dataset is empty, raises a :exc:`StatisticsError`.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	138
				139	.. doctest::
				140
				141	>>> fmean([3.5, 4.0, 5.25])
				142	4.25
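The difference in return type can be seen in a short sketch (the input values are illustrative). With :class:`~fractions.Fraction` input, :func:`mean` keeps the input type, whereas :func:`fmean` converts to :class:`float`:

```python
from fractions import Fraction
from statistics import fmean, mean

# Illustrative input made of exact rational numbers.
data = [Fraction(1, 2), Fraction(3, 2)]

exact = mean(data)   # stays a Fraction
fast = fmean(data)   # always a float

print(repr(exact), repr(fast))
```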
				143
				144	.. versionadded:: 3.8
				145
				146
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	147	.. function:: geometric_mean(data)
				148
				149	Convert data to floats and compute the geometric mean.
				150
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	151	The geometric mean indicates the central tendency or typical value of the
				152	data using the product of the values (as opposed to the arithmetic mean
				153	which uses their sum).
				154
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	155	Raises a :exc:`StatisticsError` if the input dataset is empty,
				156	if it contains a zero, or if it contains a negative value.
Raymond Hettinger	733b9a3	2019-11-11 23:35:06 -0800	[diff] [blame]	157	The data may be a sequence or iterable.
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	158
				159	No special efforts are made to achieve exact results.
				160	(However, this may change in the future.)
				161
				162	.. doctest::
				163
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	164	>>> round(geometric_mean([54, 24, 36]), 1)
Raymond Hettinger	6463ba3	2019-04-07 09:20:03 -0700	[diff] [blame]	165	36.0
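One way to see the "product of the values" definition is through logarithms: the geometric mean equals the exponential of the arithmetic mean of the logs. The sketch below checks that identity numerically. It illustrates the mathematics only and is not a description of how the module computes the result:

```python
from math import exp, isclose, log
from statistics import fmean, geometric_mean

data = [54, 24, 36]

# exp(mean(log(x))) is mathematically the n-th root of the product.
via_logs = exp(fmean(log(x) for x in data))
library = geometric_mean(data)

print(isclose(via_logs, library))
```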
				166
				167	.. versionadded:: 3.8
				168
				169
Raymond Hettinger	cc3467a	2020-12-23 19:52:09 -0800	[diff] [blame]	170	.. function:: harmonic_mean(data, weights=None)
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	171
Raymond Hettinger	733b9a3	2019-11-11 23:35:06 -0800	[diff] [blame]	172	Return the harmonic mean of data, a sequence or iterable of
Raymond Hettinger	cc3467a	2020-12-23 19:52:09 -0800	[diff] [blame]	173	real-valued numbers. If weights is omitted or None, then
				174	equal weighting is assumed.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	175
Raymond Hettinger	30a8b28	2021-02-07 16:44:42 -0800	[diff] [blame]	176	The harmonic mean is the reciprocal of the arithmetic :func:`mean` of the
				177	reciprocals of the data. For example, the harmonic mean of three values a,
				178	b and c will be equivalent to ``3/(1/a + 1/b + 1/c)``. If one of the
				179	values is zero, the result will be zero.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	180
				181	The harmonic mean is a type of average, a measure of the central
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	182	location of the data. It is often appropriate when averaging
Raymond Hettinger	30a8b28	2021-02-07 16:44:42 -0800	[diff] [blame]	183	ratios or rates, for example speeds.
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	184
				185	Suppose a car travels 10 km at 40 km/hr, then another 10 km at 60 km/hr.
				186	What is the average speed?
				187
				188	.. doctest::
				189
				190	>>> harmonic_mean([40, 60])
				191	48.0
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	192
Raymond Hettinger	cc3467a	2020-12-23 19:52:09 -0800	[diff] [blame]	193	Suppose a car travels 40 km/hr for 5 km, and when traffic clears,
				194	speeds up to 60 km/hr for the remaining 30 km of the journey. What
				195	is the average speed?
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	196
				197	.. doctest::
				198
Raymond Hettinger	cc3467a	2020-12-23 19:52:09 -0800	[diff] [blame]	199	>>> harmonic_mean([40, 60], weights=[5, 30])
				200	56.0
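Both answers can be checked by hand as total distance divided by total time. The helper ``average_speed`` below is a hypothetical name introduced for this sketch and is not part of the module:

```python
from math import isclose
from statistics import harmonic_mean

def average_speed(legs):
    """Hypothetical helper: legs is a list of (speed_km_per_hr, distance_km)."""
    total_distance = sum(distance for _, distance in legs)
    total_time = sum(distance / speed for speed, distance in legs)
    return total_distance / total_time

# Equal distances: matches the unweighted harmonic mean of the speeds.
equal_legs = average_speed([(40, 10), (60, 10)])
unweighted = harmonic_mean([40, 60])

# Unequal distances: 35 km in 0.625 hours, the weighted case shown above.
unequal_legs = average_speed([(40, 5), (60, 30)])

print(equal_legs, unweighted, unequal_legs)
```

The distances play the role of the weights, which is why the second journey corresponds to ``weights=[5, 30]``.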
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	201
Raymond Hettinger	cc3467a	2020-12-23 19:52:09 -0800	[diff] [blame]	202	:exc:`StatisticsError` is raised if data is empty, any element
				203	is less than zero, or if the weighted sum isn't positive.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	204
Raymond Hettinger	7f46049	2019-11-06 21:50:44 -0800	[diff] [blame]	205	The current algorithm has an early-out when it encounters a zero
				206	in the input. This means that the subsequent inputs are not tested
				207	for validity. (This behavior may change in the future.)
				208
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	209	.. versionadded:: 3.6
				210
Zackery Spytz	6613676	2021-01-03 05:35:26 -0700	[diff] [blame]	211	.. versionchanged:: 3.10
Raymond Hettinger	cc3467a	2020-12-23 19:52:09 -0800	[diff] [blame]	212	Added support for weights.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	213
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	214	.. function:: median(data)
				215
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	216	Return the median (middle value) of numeric data, using the common "mean of
				217	middle two" method. If data is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger	733b9a3	2019-11-11 23:35:06 -0800	[diff] [blame]	218	data can be a sequence or iterable.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	219
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	220	The median is a robust measure of central location and is less affected by
				221	the presence of outliers. When the number of data points is odd, the
				222	middle data point is returned:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	223
				224	.. doctest::
				225
				226	>>> median([1, 3, 5])
				227	3
				228
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	229	When the number of data points is even, the median is interpolated by taking
				230	the average of the two middle values:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	231
				232	.. doctest::
				233
				234	>>> median([1, 3, 5, 7])
				235	4.0
				236
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	237	This is suited for when your data is discrete, and you don't mind that the
				238	median may not be an actual data point.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	239
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	240	If the data is ordinal (supports order operations) but not numeric (doesn't
				241	support addition), consider using :func:`median_low` or :func:`median_high`
Tal Einat	fdd6e0b	2018-06-25 14:04:01 +0300	[diff] [blame]	242	instead.
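A brief sketch of the ordinal case, using letter grades as illustrative data. The strings can be ordered but not averaged, so :func:`median` fails on an even number of them while the low and high medians still work:

```python
from statistics import median, median_high, median_low

# Illustrative ordinal data: orderable, but addition and division are not meaningful.
grades = ["B", "A", "D", "C"]

low = median_low(grades)
high = median_high(grades)

# median() would try to average the two middle strings, which raises TypeError.
try:
    median(grades)
    median_failed = False
except TypeError:
    median_failed = True

print(low, high, median_failed)
```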
				243
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	244	.. function:: median_low(data)
				245
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	246	Return the low median of numeric data. If data is empty,
Raymond Hettinger	733b9a3	2019-11-11 23:35:06 -0800	[diff] [blame]	247	:exc:`StatisticsError` is raised. data can be a sequence or iterable.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	248
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	249	The low median is always a member of the data set. When the number of data
				250	points is odd, the middle value is returned. When it is even, the smaller of
				251	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	252
				253	.. doctest::
				254
				255	>>> median_low([1, 3, 5])
				256	3
				257	>>> median_low([1, 3, 5, 7])
				258	3
				259
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	260	Use the low median when your data are discrete and you prefer the median to
				261	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	262
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	263
				264	.. function:: median_high(data)
				265
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	266	Return the high median of data. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	733b9a3	2019-11-11 23:35:06 -0800	[diff] [blame]	267	is raised. data can be a sequence or iterable.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	268
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	269	The high median is always a member of the data set. When the number of data
				270	points is odd, the middle value is returned. When it is even, the larger of
				271	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	272
				273	.. doctest::
				274
				275	>>> median_high([1, 3, 5])
				276	3
				277	>>> median_high([1, 3, 5, 7])
				278	5
				279
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	280	Use the high median when your data are discrete and you prefer the median to
				281	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	282
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	283
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	284	.. function:: median_grouped(data, interval=1)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	285
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	286	Return the median of grouped continuous data, calculated as the 50th
				287	percentile, using interpolation. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	733b9a3	2019-11-11 23:35:06 -0800	[diff] [blame]	288	is raised. data can be a sequence or iterable.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	289
				290	.. doctest::
				291
				292	>>> median_grouped([52, 52, 53, 54])
				293	52.5
				294
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	295	In the following example, the data are rounded, so that each value represents
Serhiy Storchaka	c7b1a0b	2016-11-26 13:43:28 +0200	[diff] [blame]	296	the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
				297	is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
				298	given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	299	interpolation is used to estimate it:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	300
				301	.. doctest::
				302
				303	>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
				304	3.7
				305
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	306	Optional argument interval represents the class interval, and defaults
				307	to 1. Changing the class interval naturally will change the interpolation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	308
				309	.. doctest::
				310
				311	>>> median_grouped([1, 3, 3, 5, 7], interval=1)
				312	3.25
				313	>>> median_grouped([1, 3, 3, 5, 7], interval=2)
				314	3.5
				315
				316	This function does not check whether the data points are at least
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	317	interval apart.
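The interpolation in these examples is consistent with the textbook grouped-median formula ``L + interval * (n/2 - cf) / f``, where ``L`` is the lower limit of the median class, ``cf`` is the number of values below that class and ``f`` is the number of values in it. The helper below is a hypothetical sketch of that formula for comparison, not the module's implementation:

```python
from math import isclose
from statistics import median_grouped

def grouped_median_by_hand(data, interval=1):
    """Hypothetical helper applying the textbook grouped-median formula."""
    values = sorted(data)
    n = len(values)
    x = values[n // 2]                    # midpoint of the median class
    lower = x - interval / 2              # lower limit L of that class
    cf = sum(1 for v in values if v < x)  # cumulative frequency below the class
    f = values.count(x)                   # frequency within the class
    return lower + interval * (n / 2 - cf) / f

rounded = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
by_hand = grouped_median_by_hand(rounded)
wide = grouped_median_by_hand([1, 3, 3, 5, 7], interval=2)

print(by_hand, median_grouped(rounded))
```

For the ten-value data set, ``L`` is 3.5, ``cf`` is 4 and ``f`` is 5, giving ``3.5 + (5 - 4) / 5``.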
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	318
				319	.. impl-detail::
				320
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	321	Under some circumstances, :func:`median_grouped` may coerce data points to
				322	floats. This behaviour is likely to change in the future.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	323
				324	.. seealso::
				325
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	326	* "Statistics for the Behavioral Sciences", Frederick J Gravetter and
				327	Larry B Wallnau (8th Edition).
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	328
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	329	* The `SSMEDIAN
Georg Brandl	525d355	2014-10-29 10:26:56 +0100	[diff] [blame]	330	<https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	331	function in the Gnome Gnumeric spreadsheet, including `this discussion
				332	<https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	333
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	334
				335	.. function:: mode(data)
				336
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	337	Return the single most common data point from discrete or nominal data.
				338	The mode (when it exists) is the most typical value and serves as a
				339	measure of central location.
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	340
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	341	If there are multiple modes with the same frequency, returns the first one
				342	encountered in the data. If the smallest or largest of those is
				343	desired instead, use ``min(multimode(data))`` or ``max(multimode(data))``.
				344	If the input data is empty, :exc:`StatisticsError` is raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	345
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	346	``mode`` assumes discrete data and returns a single value. This is the
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	347	standard treatment of the mode as commonly taught in schools:
				348
				349	.. doctest::
				350
				351	>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
				352	3
				353
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	354	The mode is unique in that it is the only statistic in this package that
				355	also applies to nominal (non-numeric) data:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	356
				357	.. doctest::
				358
				359	>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
				360	'red'
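When several values tie for most common, the choice between the first, smallest and largest mode can be sketched as follows (the data are illustrative, and the behaviour of :func:`mode` shown here assumes Python 3.8 or later):

```python
from statistics import mode, multimode

# Illustrative data: 3 and 1 each occur twice, and 3 is seen first.
data = [3, 1, 1, 3, 2]

first = mode(data)               # first mode encountered
all_modes = multimode(data)      # every mode, in order of first appearance
smallest = min(multimode(data))
largest = max(multimode(data))

print(first, all_modes, smallest, largest)
```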
				361
Raymond Hettinger	fc06a19	2019-03-12 00:43:27 -0700	[diff] [blame]	362	.. versionchanged:: 3.8
				363	Now handles multimodal datasets by returning the first mode encountered.
				364	Formerly, it raised :exc:`StatisticsError` when more than one mode was
				365	found.
				366
				367
				368	.. function:: multimode(data)
				369
				370	Return a list of the most frequently occurring values in the order they
				371	were first encountered in the data. Will return more than one result if
				372	there are multiple modes or an empty list if the data is empty:
				373
				374	.. doctest::
				375
				376	>>> multimode('aabbbbccddddeeffffgg')
				377	['b', 'd', 'f']
				378	>>> multimode('')
				379	[]
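The behaviour described above can be approximated with :class:`collections.Counter`, which preserves first-seen order. The function name ``multimode_sketch`` is hypothetical and the code is only an illustration, not necessarily how the module is written:

```python
from collections import Counter
from statistics import multimode

def multimode_sketch(data):
    """Hypothetical re-implementation: all values sharing the highest count."""
    counts = Counter(data)
    if not counts:
        return []
    top = max(counts.values())
    # Counter iterates in insertion order, so ties keep first-seen order.
    return [value for value, count in counts.items() if count == top]

sketch_result = multimode_sketch("aabbbbccddddeeffffgg")
print(sketch_result, multimode("aabbbbccddddeeffffgg"))
```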
				380
				381	.. versionadded:: 3.8
				382
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	383
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	384	.. function:: pstdev(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	385
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	386	Return the population standard deviation (the square root of the population
				387	variance). See :func:`pvariance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	388
				389	.. doctest::
				390
				391	>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				392	0.986893273527251
				393
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	394
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	395	.. function:: pvariance(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	396
Raymond Hettinger	733b9a3	2019-11-11 23:35:06 -0800	[diff] [blame]	397	Return the population variance of data, a non-empty sequence or iterable
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	398	of real-valued numbers. Variance, or second moment about the mean, is a
				399	measure of the variability (spread or dispersion) of data. A large
				400	variance indicates that the data is spread out; a small variance indicates
				401	it is clustered closely around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	402
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	403	If the optional second argument mu is given, it is typically the mean of
				404	the data. It can also be used to compute the second moment around a
				405	point that is not the mean. If it is missing or ``None`` (the default),
				406	the arithmetic mean is automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	407
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	408	Use this function to calculate the variance from the entire population. To
				409	estimate the variance from a sample, the :func:`variance` function is usually
				410	a better choice.
				411
				412	Raises :exc:`StatisticsError` if data is empty.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	413
				414	Examples:
				415
				416	.. doctest::
				417
				418	>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
				419	>>> pvariance(data)
				420	1.25
				421
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	422	If you have already calculated the mean of your data, you can pass it as the
				423	optional second argument mu to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	424
				425	.. doctest::
				426
				427	>>> mu = mean(data)
				428	>>> pvariance(data, mu)
				429	1.25
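For reference, the population variance is the mean of the squared deviations from ``mu``. The helper below is a hypothetical, float-only sketch of that definition. The module itself takes more care over types and exactness:

```python
from math import isclose, sqrt
from statistics import fmean, pstdev, pvariance

def population_variance_by_hand(data, mu=None):
    """Hypothetical helper: mean squared deviation about mu."""
    values = list(data)
    if mu is None:
        mu = fmean(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
by_hand = population_variance_by_hand(data)

print(by_hand, pvariance(data))
print(isclose(pstdev(data), sqrt(pvariance(data))))
```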
				430
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	431	Decimals and Fractions are supported:
				432
				433	.. doctest::
				434
				435	>>> from decimal import Decimal as D
				436	>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				437	Decimal('24.815')
				438
				439	>>> from fractions import Fraction as F
				440	>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
				441	Fraction(13, 72)
				442
				443	.. note::
				444
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	445	When called with the entire population, this gives the population variance
				446	σ². When called on a sample instead, this is the biased sample variance
				447	s², also known as variance with N degrees of freedom.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	448
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	449	If you somehow know the true population mean μ, you may use this
				450	function to calculate the variance of a sample, giving the known
				451	population mean as the second argument. Provided the data points are a
				452	random sample of the population, the result will be an unbiased estimate
				453	of the population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	454
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	455
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	456	.. function:: stdev(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	457
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	458	Return the sample standard deviation (the square root of the sample
				459	variance). See :func:`variance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	460
				461	.. doctest::
				462
				463	>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				464	1.0810874155219827
				465
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	466
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	467	.. function:: variance(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	468
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	469	Return the sample variance of data, an iterable of at least two real-valued
				470	numbers. Variance, or second moment about the mean, is a measure of the
				471	variability (spread or dispersion) of data. A large variance indicates that
				472	the data is spread out; a small variance indicates it is clustered closely
				473	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	474
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	475	If the optional second argument xbar is given, it should be the mean of
				476	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	477	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	478
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	479	Use this function when your data is a sample from a population. To calculate
				480	the variance from the entire population, see :func:`pvariance`.
				481
				482	Raises :exc:`StatisticsError` if data has fewer than two values.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	483
				484	Examples:
				485
				486	.. doctest::
				487
				488	>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
				489	>>> variance(data)
				490	1.3720238095238095
				491
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	492	If you have already calculated the mean of your data, you can pass it as the
				493	optional second argument xbar to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	494
				495	.. doctest::
				496
				497	>>> m = mean(data)
				498	>>> variance(data, m)
				499	1.3720238095238095
				500
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	501	This function does not attempt to verify that you have passed the actual mean
				502	as xbar. Using arbitrary values for xbar can lead to invalid or
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	503	impossible results.
				504
				505	Decimal and Fraction values are supported:
				506
				507	.. doctest::
				508
				509	>>> from decimal import Decimal as D
				510	>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				511	Decimal('31.01875')
				512
				513	>>> from fractions import Fraction as F
				514	>>> variance([F(1, 6), F(1, 2), F(5, 3)])
				515	Fraction(67, 108)
				516
				517	.. note::
				518
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	519	This is the sample variance s² with Bessel's correction, also known as
				520	variance with N-1 degrees of freedom. Provided that the data points are
				521	representative (e.g. independent and identically distributed), the result
				522	should be an unbiased estimate of the true population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	523
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	524	If you somehow know the actual population mean μ you should pass it to the
				525	:func:`pvariance` function as the mu parameter to get the variance of a
				526	sample.
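Bessel's correction amounts to dividing the sum of squared deviations by ``N - 1`` instead of ``N``, so the sample variance equals the population variance scaled by ``N / (N - 1)``. A short numerical check of that relationship, reusing the example data:

```python
from math import isclose, sqrt
from statistics import pvariance, stdev, variance

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
n = len(data)

# Rescale the N-denominator result to the (N - 1)-denominator result.
rescaled = pvariance(data) * n / (n - 1)

print(isclose(rescaled, variance(data)))
print(isclose(stdev(data), sqrt(variance(data))))
```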
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	527
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	528	.. function:: quantiles(data, *, n=4, method='exclusive')
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	529
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	530	Divide data into n continuous intervals with equal probability.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	531	Returns a list of ``n - 1`` cut points separating the intervals.
				532
				533	Set n to 4 for quartiles (the default). Set n to 10 for deciles. Set
				534	n to 100 for percentiles which gives the 99 cut points that separate
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	535	data into 100 equal sized groups. Raises :exc:`StatisticsError` if n
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	536	is not at least 1.
				537
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	538	The data can be any iterable containing sample data. For meaningful
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	539	results, the number of data points in data should be larger than n.
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	540	Raises :exc:`StatisticsError` if there are not at least two data points.
				541
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	542	The cut points are linearly interpolated from the
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	543	two nearest data points. For example, if a cut point falls one-third
				544	of the distance between two sample values, ``100`` and ``112``, the
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	545	cut point will evaluate to ``104``.
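
The interpolation can be reproduced with a deliberately tiny data set. This is only a sketch: two data points are far fewer than recommended for meaningful quantiles, and ``method='inclusive'`` is chosen here so that the two values act as the 0th and 100th percentiles.

```python
from statistics import quantiles

# With n=3 the cut points fall one-third and two-thirds of the way
# between the two sample values 100 and 112.
cuts = quantiles([100, 112], n=3, method='inclusive')
print(cuts)   # [104.0, 108.0]
```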
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	546
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	547	The method for computing quantiles can be varied depending on
Raymond Hettinger	d8c93aa	2019-09-05 23:02:27 -0700	[diff] [blame]	548	whether the data includes or excludes the lowest and
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	549	highest possible values from the population.
				550
				551	The default method is "exclusive" and is used for data sampled from
				552	a population that can have more extreme values than found in the
				553	samples. The portion of the population falling below the i-th of
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	554	m sorted data points is computed as ``i / (m + 1)``. Given nine
				555	sample values, the method sorts them and assigns the following
				556	percentiles: 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	557
				558	The "inclusive" method is used for describing population
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	559	data or for samples that are known to include the most extreme values
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	560	from the population. The minimum value in data is treated as the 0th
Raymond Hettinger	b530a44	2019-07-21 16:32:00 -0700	[diff] [blame]	561	percentile and the maximum value is treated as the 100th percentile.
				562	The portion of the population falling below the i-th of m sorted
				563	data points is computed as ``(i - 1) / (m - 1)``. Given 11 sample
				564	values, the method sorts them and assigns the following percentiles:
				565	0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90%, 100%.
Raymond Hettinger	e917f2e	2019-05-18 10:18:29 -0700	[diff] [blame]	566
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	567	.. doctest::
				568
				569	# Decile cut points for empirically sampled data
				570	>>> data = [105, 129, 87, 86, 111, 111, 89, 81, 108, 92, 110,
				571	... 100, 75, 105, 103, 109, 76, 119, 99, 91, 103, 129,
				572	... 106, 101, 84, 111, 74, 87, 86, 103, 103, 106, 86,
				573	... 111, 75, 87, 102, 121, 111, 88, 89, 101, 106, 95,
				574	... 103, 107, 101, 81, 109, 104]
				575	>>> [round(q, 1) for q in quantiles(data, n=10)]
				576	[81.0, 86.2, 89.0, 99.4, 102.5, 103.6, 106.0, 109.8, 111.0]
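
The percentile assignments described for the two methods can be checked with evenly spaced values. The data below is an illustrative assumption chosen so that every decile lands exactly on a data point.

```python
from statistics import quantiles

# Exclusive (default): the i-th of m sorted points sits at i / (m + 1).
# With m = 9 the points sit at 10%, 20%, ..., 90%.
nine = [10, 20, 30, 40, 50, 60, 70, 80, 90]
exclusive_deciles = quantiles(nine, n=10)

# Inclusive: the i-th of m sorted points sits at (i - 1) / (m - 1).
# With m = 11 the points sit at 0%, 10%, ..., 100%.
eleven = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
inclusive_deciles = quantiles(eleven, n=10, method='inclusive')

print(exclusive_deciles)
print(inclusive_deciles)
```

Both calls return the same nine cut points, ``10.0`` through ``90.0``, even though the inputs differ in length.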
				577
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	578	.. versionadded:: 3.8
				579
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	580	.. function:: covariance(x, y, /)
				581
				582	Return the sample covariance of two inputs x and y. Covariance
				583	is a measure of the joint variability of two inputs.
				584
				585	Both inputs must be of the same length (no less than two), otherwise
				586	:exc:`StatisticsError` is raised.
				587
				588	Examples:
				589
				590	.. doctest::
				591
				592	>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
				593	>>> y = [1, 2, 3, 1, 2, 3, 1, 2, 3]
				594	>>> covariance(x, y)
				595	0.75
				596	>>> z = [9, 8, 7, 6, 5, 4, 3, 2, 1]
				597	>>> covariance(x, z)
				598	-7.5
				599	>>> covariance(z, x)
				600	-7.5
				601
				602	.. versionadded:: 3.10
				603
				604	.. function:: correlation(x, y, /)
				605
				606	Return the `Pearson's correlation coefficient
				607	<https://en.wikipedia.org/wiki/Pearson_correlation_coefficient>`_
				608	for two inputs. Pearson's correlation coefficient r takes values
				609	between -1 and +1. It measures the strength and direction of the linear
				610	relationship, where +1 means a very strong positive linear relationship,
				611	-1 a very strong negative linear relationship, and 0 no linear relationship.
				612
				613	Both inputs must be of the same length (no less than two), and neither
				614	may be constant; otherwise :exc:`StatisticsError` is raised.
				615
				616	Examples:
				617
				618	.. doctest::
				619
				620	>>> x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
				621	>>> y = [9, 8, 7, 6, 5, 4, 3, 2, 1]
				622	>>> correlation(x, x)
				623	1.0
				624	>>> correlation(x, y)
				625	-1.0
				626
				627	.. versionadded:: 3.10
				628
Miss Islington (bot)	a682519	2021-05-24 23:23:10 -0700	[diff] [blame]	629	.. function:: linear_regression(x, y, /)
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	630
Miss Islington (bot)	8677987	2021-05-24 18:11:12 -0700	[diff] [blame]	631	Return the slope and intercept of `simple linear regression
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	632	<https://en.wikipedia.org/wiki/Simple_linear_regression>`_
				633	parameters estimated using ordinary least squares. Simple linear
Miss Islington (bot)	8677987	2021-05-24 18:11:12 -0700	[diff] [blame]	634	regression describes the relationship between an independent variable x and
				635	a dependent variable y in terms of this linear function:
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	636
Miss Islington (bot)	a682519	2021-05-24 23:23:10 -0700	[diff] [blame]	637	*y = slope \* x + intercept + noise*
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	638
Miss Islington (bot)	8677987	2021-05-24 18:11:12 -0700	[diff] [blame]	639	where ``slope`` and ``intercept`` are the regression parameters that are
Miss Islington (bot)	a682519	2021-05-24 23:23:10 -0700	[diff] [blame]	640	estimated, and ``noise`` represents the
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	641	variability of the data that was not explained by the linear regression
Miss Islington (bot)	e6755ba	2021-05-16 19:47:57 -0700	[diff] [blame]	642	(it is equal to the difference between predicted and actual values
Miss Islington (bot)	a682519	2021-05-24 23:23:10 -0700	[diff] [blame]	643	of the dependent variable).
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	644
Miss Islington (bot)	8677987	2021-05-24 18:11:12 -0700	[diff] [blame]	645	Both inputs must be of the same length (no less than two), and
Miss Islington (bot)	a682519	2021-05-24 23:23:10 -0700	[diff] [blame]	646	the independent variable x cannot be constant;
				647	otherwise a :exc:`StatisticsError` is raised.
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	648
Miss Islington (bot)	e6755ba	2021-05-16 19:47:57 -0700	[diff] [blame]	649	For example, we can use the `release dates of the Monty
Miss Islington (bot)	a682519	2021-05-24 23:23:10 -0700	[diff] [blame]	650	Python films <https://en.wikipedia.org/wiki/Monty_Python#Films>`_
				651	to predict the cumulative number of Monty Python films
Miss Islington (bot)	e6755ba	2021-05-16 19:47:57 -0700	[diff] [blame]	652	that would have been produced by 2019
Miss Islington (bot)	a682519	2021-05-24 23:23:10 -0700	[diff] [blame]	653	assuming that they had kept the pace.
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	654
				655	.. doctest::
				656
				657	>>> year = [1971, 1975, 1979, 1982, 1983]
				658	>>> films_total = [1, 2, 3, 4, 5]
Miss Islington (bot)	8677987	2021-05-24 18:11:12 -0700	[diff] [blame]	659	>>> slope, intercept = linear_regression(year, films_total)
Miss Islington (bot)	a682519	2021-05-24 23:23:10 -0700	[diff] [blame]	660	>>> round(slope * 2019 + intercept)
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	661	16
				662
Tymoteusz Wołodźko	09aa6f9	2021-04-25 13:45:09 +0200	[diff] [blame]	663	.. versionadded:: 3.10
				664
Raymond Hettinger	9013ccf	2019-04-23 00:06:35 -0700	[diff] [blame]	665
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	666	Exceptions
				667	----------
				668
				669	A single exception is defined:
				670
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	671	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	672
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	673	Subclass of :exc:`ValueError` for statistics-related exceptions.
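
Because :exc:`StatisticsError` derives from :exc:`ValueError`, existing handlers for :exc:`ValueError` also catch it. A small sketch, using an empty data set to trigger the error:

```python
from statistics import StatisticsError, mean

try:
    mean([])                      # no data points: raises StatisticsError
except ValueError as exc:         # caught through the base class
    caught = exc

print(type(caught).__name__)      # StatisticsError
```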
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	674
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	675
				676	:class:`NormalDist` objects
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	677	---------------------------
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	678
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	679	:class:`NormalDist` is a tool for creating and manipulating normal
				680	distributions of a `random variable
				681	<http://www.stat.yale.edu/Courses/1997-98/101/ranvar.htm>`_. It is a
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	682	class that treats the mean and standard deviation of data
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	683	measurements as a single entity.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	684
				685	Normal distributions arise from the `Central Limit Theorem
				686	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	687	of applications in statistics.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	688
				689	.. class:: NormalDist(mu=0.0, sigma=1.0)
				690
				691	Returns a new NormalDist object where mu represents the `arithmetic
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	692	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ and sigma
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	693	represents the `standard deviation
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	694	<https://en.wikipedia.org/wiki/Standard_deviation>`_.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	695
				696	If sigma is negative, raises :exc:`StatisticsError`.
				697
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	698	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	699
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	700	A read-only property for the `arithmetic mean
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	701	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				702	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	703
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	704	.. attribute:: median
				705
				706	A read-only property for the `median
				707	<https://en.wikipedia.org/wiki/Median>`_ of a normal
				708	distribution.
				709
				710	.. attribute:: mode
				711
				712	A read-only property for the `mode
				713	<https://en.wikipedia.org/wiki/Mode_(statistics)>`_ of a normal
				714	distribution.
				715
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	716	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	717
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	718	A read-only property for the `standard deviation
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame]	719	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				720	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	721
				722	.. attribute:: variance
				723
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	724	A read-only property for the `variance
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	725	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				726	distribution. Equal to the square of the standard deviation.
				727
				728	.. classmethod:: NormalDist.from_samples(data)
				729
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	730	Makes a normal distribution instance with mu and sigma parameters
				731	estimated from the data using :func:`fmean` and :func:`stdev`.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	732
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	733	The data can be any :term:`iterable` and should consist of values
				734	that can be converted to type :class:`float`. If data does not
				735	contain at least two elements, raises :exc:`StatisticsError` because it
				736	takes at least one point to estimate a central value and at least two
				737	points to estimate dispersion.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	738
Raymond Hettinger	fb8c7d5	2019-04-23 01:46:18 -0700	[diff] [blame]	739	.. method:: NormalDist.samples(n, *, seed=None)
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	740
				741	Generates n random samples for a given mean and standard deviation.
				742	Returns a :class:`list` of :class:`float` values.
				743
				744	If seed is given, creates a new instance of the underlying random
				745	number generator. This is useful for creating reproducible results,
				746	even in a multi-threading context.
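
A short sketch of the reproducibility that seed provides. The seed value ``12345`` and the distribution parameters are arbitrary choices for illustration.

```python
from statistics import NormalDist

dist = NormalDist(mu=10, sigma=2.5)

# Each call with a seed uses its own freshly seeded generator,
# so the same seed yields the same list of samples.
first = dist.samples(5, seed=12345)
second = dist.samples(5, seed=12345)
print(first == second)   # True
```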
				747
				748	.. method:: NormalDist.pdf(x)
				749
				750	Using a `probability density function (pdf)
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	751	<https://en.wikipedia.org/wiki/Probability_density_function>`_, compute
				752	the relative likelihood that a random variable X will be near the
				753	given value x. Mathematically, it is the limit of the ratio ``P(x <=
				754	X < x+dx) / dx`` as dx approaches zero.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	755
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	756	The relative likelihood is computed as the probability of a sample
				757	occurring in a narrow range divided by the width of the range (hence
				758	the word "density"). Since the likelihood is relative to other points,
				759	its value can be greater than ``1.0``.
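
To illustrate that a density is not a probability, consider a narrow distribution. The parameters below are illustrative; the peak of a normal density is ``1 / (sigma * sqrt(2 * pi))``, which exceeds one whenever sigma is small enough.

```python
from math import pi, sqrt
from statistics import NormalDist

narrow = NormalDist(mu=0.0, sigma=0.1)
peak = narrow.pdf(0.0)                   # density at the mean
expected_peak = 1 / (0.1 * sqrt(2 * pi))  # roughly 3.99

print(peak > 1.0)   # True
```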
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	760
				761	.. method:: NormalDist.cdf(x)
				762
				763	Using a `cumulative distribution function (cdf)
				764	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
Raymond Hettinger	9add4b3	2019-02-28 21:47:26 -0800	[diff] [blame]	765	compute the probability that a random variable X will be less than or
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	766	equal to x. Mathematically, it is written ``P(X <= x)``.
				767
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	768	.. method:: NormalDist.inv_cdf(p)
				769
				770	Compute the inverse cumulative distribution function, also known as the
				771	`quantile function <https://en.wikipedia.org/wiki/Quantile_function>`_
				772	or the `percent-point
				773	<https://www.statisticshowto.datasciencecentral.com/inverse-distribution-function/>`_
				774	function. Mathematically, it is written ``x : P(X <= x) = p``.
				775
				776	Finds the value x of the random variable X such that the
				777	probability of the variable being less than or equal to that value
				778	equals the given probability p.
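
Since ``inv_cdf`` inverts ``cdf``, feeding its result back into ``cdf`` should recover the original probability, up to floating-point rounding. A sketch with illustrative parameters:

```python
from statistics import NormalDist

dist = NormalDist(mu=1060, sigma=195)

p = 0.9
x = dist.inv_cdf(p)        # value below which 90% of the distribution lies
round_trip = dist.cdf(x)   # should be very close to 0.9

print(x, round_trip)
```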
				779
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	780	.. method:: NormalDist.overlap(other)
				781
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	782	Measures the agreement between two normal probability distributions.
				783	Returns a value between 0.0 and 1.0 giving `the overlapping area for
				784	the two probability density functions
				785	<https://www.rasch.org/rmt/rmt101r.htm>`_.
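
A brief sketch of two properties one would expect from an overlap measure: it does not depend on which distribution is the receiver, and a distribution overlaps completely with itself. The parameters are illustrative.

```python
from statistics import NormalDist

a = NormalDist(2.4, 1.6)
b = NormalDist(3.2, 2.0)

shared = a.overlap(b)     # area common to both density curves
reverse = b.overlap(a)    # same area, computed from the other side
identical = a.overlap(a)  # a distribution fully overlaps itself

print(shared, reverse, identical)
```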
Raymond Hettinger	318d537	2019-03-06 22:59:40 -0800	[diff] [blame]	786
Raymond Hettinger	8a6cbf8	2019-10-13 19:53:30 -0700	[diff] [blame]	787	.. method:: NormalDist.quantiles(n=4)
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	788
				789	Divide the normal distribution into n continuous intervals with
				790	equal probability. Returns a list of (n - 1) cut points separating
				791	the intervals.
				792
				793	Set n to 4 for quartiles (the default). Set n to 10 for deciles.
				794	Set n to 100 for percentiles which gives the 99 cut points that
				795	separate the normal distribution into 100 equal sized groups.
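
Conceptually, the cut points are the values at which the cumulative probability reaches ``1/n``, ``2/n``, and so on. The sketch below compares the method against ``inv_cdf`` evaluated at those probabilities; the parameters are illustrative.

```python
from statistics import NormalDist

dist = NormalDist(mu=1060, sigma=195)

quartiles = dist.quantiles(n=4)
# The same three cut points, obtained from the quantile function directly.
via_inv_cdf = [dist.inv_cdf(i / 4) for i in range(1, 4)]

print(quartiles)
print(via_inv_cdf)
```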
				796
Raymond Hettinger	70f027d	2020-04-16 10:25:14 -0700	[diff] [blame]	797	.. method:: NormalDist.zscore(x)
				798
				799	Compute the
				800	`Standard Score <https://www.statisticshowto.com/probability-and-statistics/z-score/>`_
				801	describing x in terms of the number of standard deviations
				802	above or below the mean of the normal distribution:
				803	``(x - mean) / stdev``.
				804
				805	.. versionadded:: 3.9
				806
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	807	Instances of :class:`NormalDist` support addition, subtraction,
				808	multiplication and division by a constant. These operations
				809	are used for translation and scaling. For example:
				810
				811	.. doctest::
				812
				813	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				814	>>> temperature_february * (9/5) + 32 # Fahrenheit
				815	NormalDist(mu=41.0, sigma=4.5)
				816
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	817	Dividing a constant by an instance of :class:`NormalDist` is not supported
				818	because the result wouldn't be normally distributed.
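
The asymmetry can be seen in a short sketch: dividing an instance by a constant rescales it, while the reverse operation raises :exc:`TypeError`. The variable names are illustrative.

```python
from statistics import NormalDist

standard = NormalDist()     # mu=0.0, sigma=1.0
scaled = standard / 2       # supported: rescales mu and sigma

try:
    1 / standard            # not supported: constant divided by NormalDist
except TypeError:
    reciprocal_supported = False
else:
    reciprocal_supported = True

print(scaled, reciprocal_supported)
```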
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	819
				820	Since normal distributions arise from additive effects of independent
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	821	variables, it is possible to `add and subtract two independent normally
				822	distributed random variables
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	823	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				824	represented as instances of :class:`NormalDist`. For example:
				825
				826	.. doctest::
				827
				828	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				829	>>> drug_effects = NormalDist(0.4, 0.15)
				830	>>> combined = birth_weights + drug_effects
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	831	>>> round(combined.mean, 1)
				832	3.1
				833	>>> round(combined.stdev, 1)
				834	0.5
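
When two independent normal variables are added or subtracted, the means add or subtract, while the variances add in both cases. A sketch with illustrative parameters:

```python
from statistics import NormalDist

a = NormalDist(3.0, 0.4)
b = NormalDist(0.4, 0.15)

total = a + b
difference = a - b

# Means combine with the sign of the operation; variances always add.
print(total.mean, total.variance)
print(difference.mean, difference.variance)
```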
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	835
				836	.. versionadded:: 3.8
				837
				838
				839	:class:`NormalDist` Examples and Recipes
Raymond Hettinger	1c668d1	2019-03-14 21:46:31 -0700	[diff] [blame]	840	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	841
Raymond Hettinger	ef17fdb	2019-02-28 09:16:25 -0800	[diff] [blame]	842	:class:`NormalDist` readily solves classic probability problems.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	843
				844	For example, given `historical data for SAT exams
Raymond Hettinger	01bf219	2020-01-27 18:31:46 -0800	[diff] [blame]	845	<https://nces.ed.gov/programs/digest/d17/tables/dt17_226.40.asp>`_ showing
				846	that scores are normally distributed with a mean of 1060 and a standard
				847	deviation of 195, determine the percentage of students with test scores
				848	between 1100 and 1200, after rounding to the nearest whole number:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	849
				850	.. doctest::
				851
				852	>>> sat = NormalDist(1060, 195)
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	853	>>> fraction = sat.cdf(1200 + 0.5) - sat.cdf(1100 - 0.5)
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	854	>>> round(fraction * 100.0, 1)
				855	18.4
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	856
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	857	Find the `quartiles <https://en.wikipedia.org/wiki/Quartile>`_ and `deciles
				858	<https://en.wikipedia.org/wiki/Decile>`_ for the SAT scores:
				859
				860	.. doctest::
				861
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	862	>>> list(map(round, sat.quantiles()))
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	863	[928, 1060, 1192]
Raymond Hettinger	4db25d5	2019-09-08 16:57:58 -0700	[diff] [blame]	864	>>> list(map(round, sat.quantiles(n=10)))
Raymond Hettinger	714c60d	2019-03-18 20:17:14 -0700	[diff] [blame]	865	[810, 896, 958, 1011, 1060, 1109, 1162, 1224, 1310]
				866
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	867	To estimate the distribution for a model that isn't easy to solve
				868	analytically, :class:`NormalDist` can generate input samples for a `Monte
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	869	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	870
				871	.. doctest::
				872
Raymond Hettinger	cc353a0	2019-03-10 23:43:33 -0700	[diff] [blame]	873	>>> def model(x, y, z):
				874	...     return (3*x + 7*x*y - 5*y) / (11 * z)
				875	...
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	876	>>> n = 100_000
Raymond Hettinger	e4810b2	2019-09-05 00:18:47 -0700	[diff] [blame]	877	>>> X = NormalDist(10, 2.5).samples(n, seed=3652260728)
				878	>>> Y = NormalDist(15, 1.75).samples(n, seed=4582495471)
				879	>>> Z = NormalDist(50, 1.25).samples(n, seed=6582483453)
				880	>>> quantiles(map(model, X, Y, Z)) # doctest: +SKIP
				881	[1.4591308524824727, 1.8035946855390597, 2.175091447274739]
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	882
Raymond Hettinger	10355ed	2020-01-25 20:21:17 -0800	[diff] [blame]	883	Normal distributions can be used to approximate `Binomial
				884	distributions <http://mathworld.wolfram.com/BinomialDistribution.html>`_
				885	when the sample size is large and when the probability of a successful
				886	trial is near 50%.
				887
				888	For example, an open source conference has 750 attendees and two rooms with a
				889	500 person capacity. There is a talk about Python and another about Ruby.
				890	In previous conferences, 65% of the attendees preferred to listen to Python
				891	talks. Assuming the population preferences haven't changed, what is the
Raymond Hettinger	01bf219	2020-01-27 18:31:46 -0800	[diff] [blame]	892	probability that the Python room will stay within its capacity limits?
Raymond Hettinger	10355ed	2020-01-25 20:21:17 -0800	[diff] [blame]	893
				894	.. doctest::
				895
				896	>>> n = 750 # Sample size
				897	>>> p = 0.65 # Preference for Python
				898	>>> q = 1.0 - p # Preference for Ruby
				899	>>> k = 500 # Room capacity
				900
				901	>>> # Approximation using the cumulative normal distribution
				902	>>> from math import sqrt
				903	>>> round(NormalDist(mu=n*p, sigma=sqrt(n*p*q)).cdf(k + 0.5), 4)
				904	0.8402
				905
				906	>>> # Solution using the cumulative binomial distribution
				907	>>> from math import comb, fsum
				908	>>> round(fsum(comb(n, r) * p**r * q**(n-r) for r in range(k+1)), 4)
				909	0.8402
				910
				911	>>> # Approximation using a simulation
				912	>>> from random import seed, choices
				913	>>> seed(8675309)
				914	>>> def trial():
				915	...     return choices(('Python', 'Ruby'), (p, q), k=n).count('Python')
				916	>>> mean(trial() <= k for i in range(10_000))
				917	0.8398
				918
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	919	Normal distributions commonly arise in machine learning problems.
				920
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	921	Wikipedia has a `nice example of a Naive Bayesian Classifier
Raymond Hettinger	d70a359	2019-03-09 00:42:23 -0800	[diff] [blame]	922	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier#Sex_classification>`_.
				923	The challenge is to predict a person's gender from measurements of normally
				924	distributed features including height, weight, and foot size.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	925
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	926	We're given a training dataset with measurements for eight people. The
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	927	measurements are assumed to be normally distributed, so we summarize the data
				928	with :class:`NormalDist`:
				929
				930	.. doctest::
				931
				932	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				933	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				934	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				935	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				936	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				937	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				938
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	939	Next, we encounter a new person whose feature measurements are known but whose
				940	gender is unknown:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	941
				942	.. doctest::
				943
				944	>>> ht = 6.0 # height
				945	>>> wt = 130 # weight
				946	>>> fs = 8 # foot size
				947
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	948	Starting with a 50% `prior probability
				949	<https://en.wikipedia.org/wiki/Prior_probability>`_ of being male or female,
				950	we compute the posterior as the prior times the product of likelihoods for the
				951	feature measurements given the gender:
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	952
				953	.. doctest::
				954
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	955	>>> prior_male = 0.5
				956	>>> prior_female = 0.5
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	957	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				958	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				959
				960	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				961	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				962
Raymond Hettinger	1f58f4f	2019-03-06 23:23:55 -0800	[diff] [blame]	963	The final prediction goes to the largest posterior. This is known as the
				964	`maximum a posteriori
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	965	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				966
				967	.. doctest::
				968
				969	>>> 'male' if posterior_male > posterior_female else 'female'
				970	'female'
				971
				972
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	973	..
				974	# These modelines must appear within the last ten lines of the file.
				975	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;