Blame - Doc/library/statistics.rst - platform/external/python/cpython3


Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	1	:mod:`statistics` --- Mathematical statistics functions
				2	=======================================================
				3
				4	.. module:: statistics
				5	:synopsis: mathematical statistics functions
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	6
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	7	.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
				8	.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>
				9
				10	.. versionadded:: 3.4
				11
Terry Jan Reedy	fa089b9	2016-06-11 15:02:54 -0400	[diff] [blame]	12	Source code: :source:`Lib/statistics.py`
				13
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	14	.. testsetup:: *
				15
				16	from statistics import *
				17	__name__ = '<doctest>'
				18
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	19	--------------
				20
				21	This module provides functions for calculating mathematical statistics of
				22	numeric (:class:`Real`-valued) data.
				23
Nick Coghlan	73afe2a	2014-02-08 19:58:04 +1000	[diff] [blame]	24	.. note::
				25
				26	Unless explicitly noted otherwise, these functions support :class:`int`,
				27	:class:`float`, :class:`decimal.Decimal` and :class:`fractions.Fraction`.
				28	Behaviour with other types (whether in the numeric tower or not) is
				29	currently unsupported. Mixed types are also undefined and
				30	implementation-dependent. If your input data consists of mixed types,
				31	you may be able to use :func:`map` to ensure a consistent result, e.g.
				32	``map(float, input_data)``.
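As a minimal sketch of the coercion suggested in the note, the mixed-type list below (illustrative values, not taken from the module) is converted to floats before averaging, so the result type is predictable:

```python
from decimal import Decimal
from fractions import Fraction
from statistics import mean

# Illustrative mixed-type input: int, Fraction and Decimal together.
input_data = [1, Fraction(1, 2), Decimal("2.5")]

# Coerce every value to float first, as the note suggests.
result = mean(map(float, input_data))
print(result)  # a float close to 4/3
```

The trade-off is that any exactness offered by ``Fraction`` or ``Decimal`` is given up in exchange for a consistent ``float`` result.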
				33
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	34	Averages and measures of central location
				35	-----------------------------------------
				36
				37	These functions calculate an average or typical value from a population
				38	or sample.
				39
				40	======================= =============================================
				41	:func:`mean` Arithmetic mean ("average") of data.
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	42	:func:`fmean` Fast, floating point arithmetic mean.
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	43	:func:`harmonic_mean` Harmonic mean of data.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	44	:func:`median` Median (middle value) of data.
				45	:func:`median_low` Low median of data.
				46	:func:`median_high` High median of data.
				47	:func:`median_grouped` Median, or 50th percentile, of grouped data.
				48	:func:`mode` Mode (most common value) of discrete data.
				49	======================= =============================================
				50
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	51	Measures of spread
				52	------------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	53
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	54	These functions calculate a measure of how much the population or sample
				55	tends to deviate from the typical or average values.
				56
				57	======================= =============================================
				58	:func:`pstdev` Population standard deviation of data.
				59	:func:`pvariance` Population variance of data.
				60	:func:`stdev` Sample standard deviation of data.
				61	:func:`variance` Sample variance of data.
				62	======================= =============================================
				63
				64
				65	Function details
				66	----------------
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	67
Georg Brandl	e051b55	2013-11-04 07:30:50 +0100	[diff] [blame]	68	Note: The functions do not require the data given to them to be sorted.
				69	However, for reading convenience, most of the examples show sorted sequences.
				70
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	71	.. function:: mean(data)
				72
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	73	Return the sample arithmetic mean of data which can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	74
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	75	The arithmetic mean is the sum of the data divided by the number of data
				76	points. It is commonly called "the average", although it is only one of many
				77	different mathematical averages. It is a measure of the central location of
				78	the data.
				79
				80	If data is empty, :exc:`StatisticsError` will be raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	81
				82	Some examples of use:
				83
				84	.. doctest::
				85
				86	>>> mean([1, 2, 3, 4, 4])
				87	2.8
				88	>>> mean([-1.0, 2.5, 3.25, 5.75])
				89	2.625
				90
				91	>>> from fractions import Fraction as F
				92	>>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
				93	Fraction(13, 21)
				94
				95	>>> from decimal import Decimal as D
				96	>>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
				97	Decimal('0.5625')
				98
				99	.. note::
				100
Georg Brandl	a3fdcaa	2013-10-21 09:08:39 +0200	[diff] [blame]	101	The mean is strongly affected by outliers and is not a robust estimator
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	102	for central location: the mean is not necessarily a typical example of the
				103	data points. For more robust, although less efficient, measures of
				104	central location, see :func:`median` and :func:`mode`. (In this case,
				105	"efficient" refers to statistical efficiency rather than computational
				106	efficiency.)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	107
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	108	The sample mean gives an unbiased estimate of the true population mean,
				109	which means that, taken on average over all the possible samples,
				110	``mean(sample)`` converges on the true mean of the entire population. If
				111	data represents the entire population rather than a sample, then
				112	``mean(data)`` is equivalent to calculating the true population mean μ.
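The sensitivity to outliers can be seen in a small sketch. The data set is made up for illustration: four small values and one extreme one.

```python
from statistics import mean, median

# Illustrative data: four small values and one extreme outlier.
data = [1, 2, 3, 4, 100]

average = mean(data)    # pulled far above most of the data by the outlier
middle = median(data)   # stays at the central data point
print(average, middle)
```

Here the mean is 22 even though four of the five values are below 5, while the median remains 3.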
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	113
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	114
Raymond Hettinger	47d9987	2019-02-21 15:06:29 -0800	[diff] [blame]	115	.. function:: fmean(data)
				116
				117	Convert data to floats and compute the arithmetic mean.
				118
				119	This runs faster than the :func:`mean` function and it always returns a
				120	:class:`float`. The result is highly accurate but not as exact as
				121	:func:`mean`. If the input dataset is empty, raises a
				122	:exc:`StatisticsError`.
				123
				124	.. doctest::
				125
				126	>>> fmean([3.5, 4.0, 5.25])
				127	4.25
				128
				129	.. versionadded:: 3.8
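The difference in return types can be sketched with exact rational input. The values are illustrative; ``mean`` keeps the ``Fraction`` type while ``fmean`` converts to ``float``:

```python
from fractions import Fraction
from statistics import fmean, mean

data = [Fraction(1, 3), Fraction(2, 3), Fraction(1, 1)]

exact = mean(data)    # stays a Fraction
fast = fmean(data)    # always a float
print(type(exact).__name__, type(fast).__name__)
```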
				130
				131
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	132	.. function:: harmonic_mean(data)
				133
				134	Return the harmonic mean of data, a sequence or iterator of
				135	real-valued numbers.
				136
				137	The harmonic mean, sometimes called the subcontrary mean, is the
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	138	reciprocal of the arithmetic :func:`mean` of the reciprocals of the
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	139	data. For example, the harmonic mean of three values a, b and c
				140	will be equivalent to ``3/(1/a + 1/b + 1/c)``.
				141
				142	The harmonic mean is a type of average, a measure of the central
				143	location of the data. It is often appropriate when averaging quantities
				144	which are rates or ratios, for example speeds. For example:
				145
				146	Suppose an investor purchases an equal value of shares in each of
				147	three companies, with P/E (price/earning) ratios of 2.5, 3 and 10.
				148	What is the average P/E ratio for the investor's portfolio?
				149
				150	.. doctest::
				151
				152	>>> harmonic_mean([2.5, 3, 10]) # For an equal investment portfolio.
				153	3.6
				154
				155	Using the arithmetic mean would give an average of about 5.167, which
				156	is too high.
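The P/E figure can be checked by hand from the definition (the reciprocal of the arithmetic mean of the reciprocals). This is only a verification sketch, not how the module computes the result internally:

```python
from math import isclose
from statistics import harmonic_mean, mean

ratios = [2.5, 3, 10]

# Reciprocal of the arithmetic mean of the reciprocals.
by_hand = len(ratios) / sum(1 / r for r in ratios)

print(by_hand, harmonic_mean(ratios), mean(ratios))
```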
				157
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	158	:exc:`StatisticsError` is raised if data is empty, or any element
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	159	is less than zero.
				160
Zachary Ware	c019bd3	2016-08-23 13:23:31 -0500	[diff] [blame]	161	.. versionadded:: 3.6
				162
Steven D'Aprano	2287318	2016-08-24 02:34:25 +1000	[diff] [blame]	163
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	164	.. function:: median(data)
				165
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	166	Return the median (middle value) of numeric data, using the common "mean of
				167	middle two" method. If data is empty, :exc:`StatisticsError` is raised.
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	168	data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	169
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	170	The median is a robust measure of central location, and is less affected by
				171	the presence of outliers in your data. When the number of data points is
				172	odd, the middle data point is returned:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	173
				174	.. doctest::
				175
				176	>>> median([1, 3, 5])
				177	3
				178
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	179	When the number of data points is even, the median is interpolated by taking
				180	the average of the two middle values:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	181
				182	.. doctest::
				183
				184	>>> median([1, 3, 5, 7])
				185	4.0
				186
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	187	This is suited for when your data is discrete, and you don't mind that the
				188	median may not be an actual data point.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	189
Tal Einat	fdd6e0b	2018-06-25 14:04:01 +0300	[diff] [blame]	190	If your data is ordinal (supports order operations) but not numeric (doesn't
				191	support addition), you should use :func:`median_low` or :func:`median_high`
				192	instead.
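A short sketch with ordinal data, using made-up letter grades that can be ordered but not added:

```python
from statistics import median_high, median_low

# Illustrative ordinal data: letter grades can be ordered but not added.
grades = ["C", "A", "D", "B"]

low = median_low(grades)
high = median_high(grades)
print(low, high)  # → B C
```

Both results are actual members of the data, so no arithmetic on the values is needed.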
				193
Berker Peksag	9c1dba2	2014-09-28 00:00:58 +0300	[diff] [blame]	194	.. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	195
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	196
				197	.. function:: median_low(data)
				198
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	199	Return the low median of numeric data. If data is empty,
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	200	:exc:`StatisticsError` is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	201
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	202	The low median is always a member of the data set. When the number of data
				203	points is odd, the middle value is returned. When it is even, the smaller of
				204	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	205
				206	.. doctest::
				207
				208	>>> median_low([1, 3, 5])
				209	3
				210	>>> median_low([1, 3, 5, 7])
				211	3
				212
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	213	Use the low median when your data are discrete and you prefer the median to
				214	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	215
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	216
				217	.. function:: median_high(data)
				218
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	219	Return the high median of data. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	220	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	221
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	222	The high median is always a member of the data set. When the number of data
				223	points is odd, the middle value is returned. When it is even, the larger of
				224	the two middle values is returned.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	225
				226	.. doctest::
				227
				228	>>> median_high([1, 3, 5])
				229	3
				230	>>> median_high([1, 3, 5, 7])
				231	5
				232
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	233	Use the high median when your data are discrete and you prefer the median to
				234	be an actual data point rather than interpolated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	235
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	236
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	237	.. function:: median_grouped(data, interval=1)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	238
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	239	Return the median of grouped continuous data, calculated as the 50th
				240	percentile, using interpolation. If data is empty, :exc:`StatisticsError`
Raymond Hettinger	6da9078	2016-11-21 16:31:02 -0800	[diff] [blame]	241	is raised. data can be a sequence or iterator.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	242
				243	.. doctest::
				244
				245	>>> median_grouped([52, 52, 53, 54])
				246	52.5
				247
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	248	In the following example, the data are rounded, so that each value represents
Serhiy Storchaka	c7b1a0b	2016-11-26 13:43:28 +0200	[diff] [blame]	249	the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5--1.5, 2
				250	is the midpoint of 1.5--2.5, 3 is the midpoint of 2.5--3.5, etc. With the data
				251	given, the middle value falls somewhere in the class 3.5--4.5, and
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	252	interpolation is used to estimate it:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	253
				254	.. doctest::
				255
				256	>>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
				257	3.7
				258
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	259	Optional argument interval represents the class interval, and defaults
				260	to 1. Changing the class interval naturally will change the interpolation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	261
				262	.. doctest::
				263
				264	>>> median_grouped([1, 3, 3, 5, 7], interval=1)
				265	3.25
				266	>>> median_grouped([1, 3, 3, 5, 7], interval=2)
				267	3.5
				268
				269	This function does not check whether the data points are at least
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	270	interval apart.
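The interpolation can be sketched with the usual textbook formula ``L + interval * (n/2 - cf) / f``, where ``L`` is the lower limit of the median class, ``cf`` the number of values below that class and ``f`` the number of values in it. The helper ``grouped_median_sketch`` is hypothetical and written only for comparison; it is not claimed to be the module's actual implementation:

```python
from math import isclose
from statistics import median_grouped

def grouped_median_sketch(data, interval=1):
    """Hypothetical helper: textbook grouped-median interpolation."""
    values = sorted(data)
    n = len(values)
    x = values[n // 2]                          # midpoint of the median class
    lower = x - interval / 2                    # lower limit of that class
    below = sum(1 for v in values if v < x)     # values in lower classes
    inside = sum(1 for v in values if v == x)   # values in the median class
    return lower + interval * (n / 2 - below) / inside

data = [1, 2, 2, 3, 4, 4, 4, 4, 4, 5]
print(grouped_median_sketch(data), median_grouped(data))
```

For this data the median class is 3.5--4.5, four values lie below it and five lie inside it, so the estimate is ``3.5 + 1 * (5 - 4) / 5``.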
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	271
				272	.. impl-detail::
				273
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	274	Under some circumstances, :func:`median_grouped` may coerce data points to
				275	floats. This behaviour is likely to change in the future.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	276
				277	.. seealso::
				278
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	279	* "Statistics for the Behavioral Sciences", Frederick J Gravetter and
				280	Larry B Wallnau (8th Edition).
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	281
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	282	* The `SSMEDIAN
Georg Brandl	525d355	2014-10-29 10:26:56 +0100	[diff] [blame]	283	<https://help.gnome.org/users/gnumeric/stable/gnumeric.html#gnumeric-function-SSMEDIAN>`_
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	284	function in the Gnome Gnumeric spreadsheet, including `this discussion
				285	<https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	286
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	287
				288	.. function:: mode(data)
				289
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	290	Return the most common data point from discrete or nominal data. The mode
				291	(when it exists) is the most typical value, and is a robust measure of
				292	central location.
				293
				294	If data is empty, or if there is not exactly one most common value,
				295	:exc:`StatisticsError` is raised.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	296
				297	``mode`` assumes discrete data, and returns a single value. This is the
				298	standard treatment of the mode as commonly taught in schools:
				299
				300	.. doctest::
				301
				302	>>> mode([1, 1, 2, 3, 3, 3, 3, 4])
				303	3
				304
				305	The mode is unique in that it is the only statistic which also applies
				306	to nominal (non-numeric) data:
				307
				308	.. doctest::
				309
				310	>>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
				311	'red'
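One way to see why ``'red'`` is the mode is to tally the values. The sketch below uses :class:`collections.Counter` purely as an illustration; it makes no claim about how ``mode`` is implemented:

```python
from collections import Counter
from statistics import mode

colours = ["red", "blue", "blue", "red", "green", "red", "red"]

# Counter tallies each value; most_common(1) yields the top (value, count).
value, count = Counter(colours).most_common(1)[0]
print(value, count)  # → red 4
```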
				312
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	313
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	314	.. function:: pstdev(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	315
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	316	Return the population standard deviation (the square root of the population
				317	variance). See :func:`pvariance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	318
				319	.. doctest::
				320
				321	>>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				322	0.986893273527251
				323
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	324
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	325	.. function:: pvariance(data, mu=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	326
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	327	Return the population variance of data, a non-empty iterable of real-valued
				328	numbers. Variance, or second moment about the mean, is a measure of the
				329	variability (spread or dispersion) of data. A large variance indicates that
				330	the data is spread out; a small variance indicates it is clustered closely
				331	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	332
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	333	If the optional second argument mu is given, it should be the mean of
				334	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	335	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	336
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	337	Use this function to calculate the variance from the entire population. To
				338	estimate the variance from a sample, the :func:`variance` function is usually
				339	a better choice.
				340
				341	Raises :exc:`StatisticsError` if data is empty.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	342
				343	Examples:
				344
				345	.. doctest::
				346
				347	>>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
				348	>>> pvariance(data)
				349	1.25
				350
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	351	If you have already calculated the mean of your data, you can pass it as the
				352	optional second argument mu to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	353
				354	.. doctest::
				355
				356	>>> mu = mean(data)
				357	>>> pvariance(data, mu)
				358	1.25
				359
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	360	This function does not attempt to verify that you have passed the actual mean
				361	as mu. Using arbitrary values for mu may lead to invalid or impossible
				362	results.
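The definition can be checked by hand: the population variance is the mean squared deviation from ``mu``, dividing by the number of data points N. A verification sketch using the same data as above, which also confirms that ``pstdev`` is its square root:

```python
from math import isclose, sqrt
from statistics import mean, pstdev, pvariance

data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
mu = mean(data)

# Population variance: mean squared deviation from mu, dividing by N.
by_hand = sum((x - mu) ** 2 for x in data) / len(data)

print(by_hand, pvariance(data, mu), pstdev(data))
```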
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	363
				364	Decimals and Fractions are supported:
				365
				366	.. doctest::
				367
				368	>>> from decimal import Decimal as D
				369	>>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				370	Decimal('24.815')
				371
				372	>>> from fractions import Fraction as F
				373	>>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
				374	Fraction(13, 72)
				375
				376	.. note::
				377
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	378	When called with the entire population, this gives the population variance
				379	σ². When called on a sample instead, this is the biased sample variance
				380	s², also known as variance with N degrees of freedom.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	381
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	382	If you somehow know the true population mean μ, you may use this function
				383	to calculate the variance of a sample, giving the known population mean as
				384	the second argument. Provided the data points are representative
				385	(e.g. independent and identically distributed), the result will be an
				386	unbiased estimate of the population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	387
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	388
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	389	.. function:: stdev(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	390
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	391	Return the sample standard deviation (the square root of the sample
				392	variance). See :func:`variance` for arguments and other details.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	393
				394	.. doctest::
				395
				396	>>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
				397	1.0810874155219827
				398
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	399
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	400	.. function:: variance(data, xbar=None)
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	401
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	402	Return the sample variance of data, an iterable of at least two real-valued
				403	numbers. Variance, or second moment about the mean, is a measure of the
				404	variability (spread or dispersion) of data. A large variance indicates that
				405	the data is spread out; a small variance indicates it is clustered closely
				406	around the mean.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	407
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	408	If the optional second argument xbar is given, it should be the mean of
				409	data. If it is missing or ``None`` (the default), the mean is
Ned Deily	3586673	2013-10-19 12:10:01 -0700	[diff] [blame]	410	automatically calculated.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	411
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	412	Use this function when your data is a sample from a population. To calculate
				413	the variance from the entire population, see :func:`pvariance`.
				414
				415	Raises :exc:`StatisticsError` if data has fewer than two values.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	416
				417	Examples:
				418
				419	.. doctest::
				420
				421	>>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
				422	>>> variance(data)
				423	1.3720238095238095
				424
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	425	If you have already calculated the mean of your data, you can pass it as the
				426	optional second argument xbar to avoid recalculation:
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	427
				428	.. doctest::
				429
				430	>>> m = mean(data)
				431	>>> variance(data, m)
				432	1.3720238095238095
				433
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	434	This function does not attempt to verify that you have passed the actual mean
				435	as xbar. Using arbitrary values for xbar can lead to invalid or
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	436	impossible results.
				437
				438	Decimal and Fraction values are supported:
				439
				440	.. doctest::
				441
				442	>>> from decimal import Decimal as D
				443	>>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
				444	Decimal('31.01875')
				445
				446	>>> from fractions import Fraction as F
				447	>>> variance([F(1, 6), F(1, 2), F(5, 3)])
				448	Fraction(67, 108)
				449
				450	.. note::
				451
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	452	This is the sample variance s² with Bessel's correction, also known as
				453	variance with N-1 degrees of freedom. Provided that the data points are
				454	representative (e.g. independent and identically distributed), the result
				455	should be an unbiased estimate of the true population variance.
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	456
Georg Brandl	eb2aeec	2013-10-21 08:57:26 +0200	[diff] [blame]	457	If you somehow know the actual population mean μ you should pass it to the
				458	:func:`pvariance` function as the mu parameter to get the variance of a
				459	sample.
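Bessel's correction amounts to dividing the sum of squared deviations by N-1 rather than N. A small verification sketch with the same data as the example above:

```python
from math import isclose
from statistics import mean, pvariance, variance

data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
n = len(data)
xbar = mean(data)

# Sample variance: squared deviations divided by N-1 instead of N.
by_hand = sum((x - xbar) ** 2 for x in data) / (n - 1)

print(by_hand, variance(data), pvariance(data) * n / (n - 1))
```

All three printed values agree up to floating point rounding, which shows that the sample variance is the population variance scaled by ``N / (N - 1)``.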
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	460
				461	Exceptions
				462	----------
				463
				464	A single exception is defined:
				465
Benjamin Peterson	4ea16e5	2013-10-20 17:52:54 -0400	[diff] [blame]	466	.. exception:: StatisticsError
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	467
Benjamin Peterson	44c3065	2013-10-20 17:52:09 -0400	[diff] [blame]	468	Subclass of :exc:`ValueError` for statistics-related exceptions.
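Because the exception subclasses :exc:`ValueError`, existing handlers for ``ValueError`` also catch it. A minimal sketch:

```python
from statistics import StatisticsError, mean

try:
    mean([])                  # empty data is an error
except ValueError as exc:     # also catches StatisticsError
    caught = exc

print(type(caught).__name__)  # → StatisticsError
```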
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	469
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	470
				471	:class:`NormalDist` objects
				472	===========================
				473
				474	A :class:`NormalDist` is a composite class that treats the mean and standard
				475	deviation of data measurements as a single entity. It is a tool for creating
				476	and manipulating normal distributions of a random variable.
				477
				478	Normal distributions arise from the `Central Limit Theorem
				479	<https://en.wikipedia.org/wiki/Central_limit_theorem>`_ and have a wide range
				480	of applications in statistics, including simulations and hypothesis testing.
				481
				482	.. class:: NormalDist(mu=0.0, sigma=1.0)
				483
				484	Returns a new NormalDist object where mu represents the `arithmetic
				485	mean <https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of data and sigma
				486	represents the `standard deviation
				487	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of the data.
				488
				489	If sigma is negative, raises :exc:`StatisticsError`.
				490
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame^]	491	.. attribute:: mean
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	492
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame^]	493	A read-only property representing the `arithmetic mean
				494	<https://en.wikipedia.org/wiki/Arithmetic_mean>`_ of a normal
				495	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	496
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame^]	497	.. attribute:: stdev
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	498
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame^]	499	A read-only property representing the `standard deviation
				500	<https://en.wikipedia.org/wiki/Standard_deviation>`_ of a normal
				501	distribution.
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	502
				503	.. attribute:: variance
				504
				505	A read-only property representing the `variance
				506	<https://en.wikipedia.org/wiki/Variance>`_ of a normal
				507	distribution. Equal to the square of the standard deviation.
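A brief sketch of the three properties, with illustrative parameter values. Assigning to a read-only property is expected to fail with :exc:`AttributeError`:

```python
from statistics import NormalDist

dist = NormalDist(mu=10, sigma=2)
print(dist.mean, dist.stdev, dist.variance)

try:
    dist.mean = 0             # properties are read-only
except AttributeError:
    read_only = True
else:
    read_only = False
```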
				508
				509	.. classmethod:: NormalDist.from_samples(data)
				510
				511	Class method that makes a normal distribution instance
				512	from sample data. The data can be any :term:`iterable`
				513	and should consist of values that can be converted to type
				514	:class:`float`.
				515
				516	If data does not contain at least two elements, raises
				517	:exc:`StatisticsError` because it takes at least one point to estimate
				518	a central value and at least two points to estimate dispersion.
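A sketch of ``from_samples`` with illustrative data. The resulting parameters are compared against :func:`mean` and :func:`stdev`, and a single data point is rejected:

```python
from math import isclose
from statistics import NormalDist, StatisticsError, mean, stdev

samples = [2.5, 3.1, 2.1, 2.4, 2.7, 3.5]
dist = NormalDist.from_samples(samples)
print(dist.mean, dist.stdev)

try:
    NormalDist.from_samples([2.5])   # one point cannot estimate dispersion
except StatisticsError:
    too_few = True
else:
    too_few = False
```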
				519
				520	.. method:: NormalDist.samples(n, seed=None)
				521
				522	Generates n random samples for a given mean and standard deviation.
				523	Returns a :class:`list` of :class:`float` values.
				524
				525	If seed is given, creates a new instance of the underlying random
				526	number generator. This is useful for creating reproducible results,
				527	even in a multi-threading context.
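A minimal sketch of reproducible sampling. The seed value is arbitrary and chosen only for illustration:

```python
from statistics import NormalDist

dist = NormalDist(mu=350, sigma=15)

first = dist.samples(5, seed=12345)
second = dist.samples(5, seed=12345)   # same seed, same sequence
print(first == second)  # → True
```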
				528
				529	.. method:: NormalDist.pdf(x)
				530
				531	Using a `probability density function (pdf)
				532	<https://en.wikipedia.org/wiki/Probability_density_function>`_,
				533	compute the relative likelihood that a random sample X will be near
				534	the given value x. Mathematically, it is the ratio ``P(x <= X <
				535	x+dx) / dx``.
				536
				537	Note that the relative likelihood of x can be greater than ``1.0``. The
				538	probability of any specific point on a continuous distribution is ``0.0``,
				539	so the :meth:`pdf` is used instead. It takes the probability of a
				540	sample occurring in a narrow range around x and divides that
				541	probability by the width of the range (hence the word "density").
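As a sketch, the density can be compared with the closed-form normal density ``exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))``. The helper ``normal_density`` is hypothetical and exists only for this comparison. A narrow distribution shows that a density can exceed ``1.0``:

```python
from math import exp, isclose, pi, sqrt
from statistics import NormalDist

def normal_density(x, mu, sigma):
    """Hypothetical helper: closed-form normal density, for comparison."""
    return exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * sqrt(2 * pi))

narrow = NormalDist(mu=0.0, sigma=0.1)
print(narrow.pdf(0.0), normal_density(0.0, 0.0, 0.1))  # both exceed 1.0
```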
				542
				543	.. method:: NormalDist.cdf(x)
				544
				545	Using a `cumulative distribution function (cdf)
				546	<https://en.wikipedia.org/wiki/Cumulative_distribution_function>`_,
				547	compute the probability that a random sample X will be less than or
				548	equal to x. Mathematically, it is written ``P(X <= x)``.
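For a normal distribution the cumulative probability can be written with the error function. The helper ``normal_cdf`` below is a hypothetical comparison function, not part of the module:

```python
from math import erf, isclose, sqrt
from statistics import NormalDist

def normal_cdf(x, mu, sigma):
    """Hypothetical helper: normal CDF written with the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

sat = NormalDist(1060, 195)
print(sat.cdf(1200), normal_cdf(1200, 1060, 195))
```

Evaluating the CDF at the mean gives 0.5, since half of a symmetric distribution lies on either side of its centre.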
				549
				550	Instances of :class:`NormalDist` support addition, subtraction,
				551	multiplication and division by a constant. These operations
				552	are used for translation and scaling. For example:
				553
				554	.. doctest::
				555
				556	>>> temperature_february = NormalDist(5, 2.5) # Celsius
				557	>>> temperature_february * (9/5) + 32 # Fahrenheit
				558	NormalDist(mu=41.0, sigma=4.5)
				559
				560	Dividing a constant by an instance of :class:`NormalDist` is not supported.
				561
				562	Since normal distributions arise from additive effects of independent
				563	variables, it is possible to `add and subtract two normally distributed
				564	random variables
				565	<https://en.wikipedia.org/wiki/Sum_of_normally_distributed_random_variables>`_
				566	represented as instances of :class:`NormalDist`. For example:
				567
				568	.. doctest::
				569
				570	>>> birth_weights = NormalDist.from_samples([2.5, 3.1, 2.1, 2.4, 2.7, 3.5])
				571	>>> drug_effects = NormalDist(0.4, 0.15)
				572	>>> combined = birth_weights + drug_effects
Raymond Hettinger	9e456bc	2019-02-24 11:44:55 -0800	[diff] [blame^]	573	>>> f'mean: {combined.mean :.1f} standard deviation: {combined.stdev :.1f}'
				574	'mean: 3.1 standard deviation: 0.5'
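When two independent normal variables are added, the means add and the variances add, so the standard deviations combine as the square root of the summed squares. A sketch with illustrative parameters:

```python
from math import hypot, isclose
from statistics import NormalDist

a = NormalDist(3.0, 0.4)
b = NormalDist(0.4, 0.15)
combined = a + b

# Means add; standard deviations combine as the root of summed variances.
print(combined.mean, combined.stdev)
```

Subtraction works the same way for the spread: the means are subtracted, but the variances are still added.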
Raymond Hettinger	11c7953	2019-02-23 14:44:07 -0800	[diff] [blame]	575
				576	.. versionadded:: 3.8
				577
				578
				579	:class:`NormalDist` Examples and Recipes
				580	----------------------------------------
				581
				582	A :class:`NormalDist` readily solves classic probability problems.
				583
				584	For example, given `historical data for SAT exams
				585	<https://blog.prepscholar.com/sat-standard-deviation>`_ showing that scores
				586	are normally distributed with a mean of 1060 and a standard deviation of 195,
				587	determine the percentage of students with scores between 1100 and 1200:
				588
				589	.. doctest::
				590
				591	>>> sat = NormalDist(1060, 195)
				592	>>> fraction = sat.cdf(1200) - sat.cdf(1100)
				593	>>> f'{fraction * 100 :.1f}% score between 1100 and 1200'
				594	'18.2% score between 1100 and 1200'
				595
				596	To estimate the distribution for a model that isn't easy to solve
				597	analytically, :class:`NormalDist` can generate input samples for a `Monte
				598	Carlo simulation <https://en.wikipedia.org/wiki/Monte_Carlo_method>`_ of the
				599	model:
				600
				601	.. doctest::
				602
				603	>>> n = 100_000
				604	>>> X = NormalDist(350, 15).samples(n)
				605	>>> Y = NormalDist(47, 17).samples(n)
				606	>>> Z = NormalDist(62, 6).samples(n)
				607	>>> model_simulation = [x * y / z for x, y, z in zip(X, Y, Z)]
				608	>>> NormalDist.from_samples(model_simulation) # doctest: +SKIP
				609	NormalDist(mu=267.6516398754636, sigma=101.357284306067)
				610
				611	Normal distributions commonly arise in machine learning problems.
				612
				613	Wikipedia has a `nice example with a Naive Bayesian Classifier
				614	<https://en.wikipedia.org/wiki/Naive_Bayes_classifier>`_. The challenge
				615	is to guess a person's gender from measurements of normally distributed
				616	features including height, weight, and foot size.
				617
				618	The `prior probability <https://en.wikipedia.org/wiki/Prior_probability>`_ of
				619	being male or female is 50%:
				620
				621	.. doctest::
				622
				623	>>> prior_male = 0.5
				624	>>> prior_female = 0.5
				625
				626	We also have a training dataset with measurements for eight people. These
				627	measurements are assumed to be normally distributed, so we summarize the data
				628	with :class:`NormalDist`:
				629
				630	.. doctest::
				631
				632	>>> height_male = NormalDist.from_samples([6, 5.92, 5.58, 5.92])
				633	>>> height_female = NormalDist.from_samples([5, 5.5, 5.42, 5.75])
				634	>>> weight_male = NormalDist.from_samples([180, 190, 170, 165])
				635	>>> weight_female = NormalDist.from_samples([100, 150, 130, 150])
				636	>>> foot_size_male = NormalDist.from_samples([12, 11, 12, 10])
				637	>>> foot_size_female = NormalDist.from_samples([6, 8, 7, 9])
				638
				639	We observe a new person whose feature measurements are known but whose gender
				640	is unknown:
				641
				642	.. doctest::
				643
				644	>>> ht = 6.0 # height
				645	>>> wt = 130 # weight
				646	>>> fs = 8 # foot size
				647
				648	The posterior is the product of the prior and the likelihood of each
				649	feature measurement given the gender:
				650
				651	.. doctest::
				652
				653	>>> posterior_male = (prior_male * height_male.pdf(ht) *
				654	... weight_male.pdf(wt) * foot_size_male.pdf(fs))
				655
				656	>>> posterior_female = (prior_female * height_female.pdf(ht) *
				657	... weight_female.pdf(wt) * foot_size_female.pdf(fs))
				658
				659	The final prediction is awarded to the largest posterior -- this is known as
				660	the `maximum a posteriori
				661	<https://en.wikipedia.org/wiki/Maximum_a_posteriori_estimation>`_ or MAP:
				662
				663	.. doctest::
				664
				665	>>> 'male' if posterior_male > posterior_female else 'female'
				666	'female'
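The two posterior values are unnormalized scores. As an optional extra step, not part of the recipe itself, dividing each score by their sum turns them into probabilities that add up to 1. The sketch below repeats the training data so that it runs on its own; the helper ``unnormalized_posterior`` is hypothetical:

```python
from statistics import NormalDist

# Training data copied from the example: height, weight, foot size.
male = [NormalDist.from_samples(s) for s in
        ([6, 5.92, 5.58, 5.92], [180, 190, 170, 165], [12, 11, 12, 10])]
female = [NormalDist.from_samples(s) for s in
          ([5, 5.5, 5.42, 5.75], [100, 150, 130, 150], [6, 8, 7, 9])]
observed = [6.0, 130, 8]   # height, weight, foot size of the new person

def unnormalized_posterior(prior, dists, values):
    """Hypothetical helper: prior times the product of feature densities."""
    result = prior
    for dist, value in zip(dists, values):
        result *= dist.pdf(value)
    return result

post_male = unnormalized_posterior(0.5, male, observed)
post_female = unnormalized_posterior(0.5, female, observed)

# Dividing by the total turns the two scores into probabilities.
evidence = post_male + post_female
prob_male = post_male / evidence
prob_female = post_female / evidence
print(f'male: {prob_male:.4f}  female: {prob_female:.4f}')
```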
				667
				668
Larry Hastings	f5e987b	2013-10-19 11:50:09 -0700	[diff] [blame]	669	..
				670	# This modelines must appear within the last ten lines of the file.
				671	kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;