:mod:`statistics` --- Mathematical statistics functions
=======================================================

.. module:: statistics
   :synopsis: mathematical statistics functions
.. moduleauthor:: Steven D'Aprano <steve+python@pearwood.info>
.. sectionauthor:: Steven D'Aprano <steve+python@pearwood.info>

.. versionadded:: 3.4

.. testsetup:: *

   from statistics import *
   __name__ = '<doctest>'

Source code: :source:`Lib/statistics.py`

--------------

This module provides functions for calculating mathematical statistics of
numeric (:class:`Real`-valued) data.

Averages and measures of central location
-----------------------------------------

These functions calculate an average or typical value from a population
or sample.

=======================  =============================================
:func:`mean`             Arithmetic mean ("average") of data.
:func:`median`           Median (middle value) of data.
:func:`median_low`       Low median of data.
:func:`median_high`      High median of data.
:func:`median_grouped`   Median, or 50th percentile, of grouped data.
:func:`mode`             Mode (most common value) of discrete data.
=======================  =============================================

Measures of spread
------------------

These functions calculate a measure of how much the population or sample
tends to deviate from the typical or average values.

=======================  =============================================
:func:`pstdev`           Population standard deviation of data.
:func:`pvariance`        Population variance of data.
:func:`stdev`            Sample standard deviation of data.
:func:`variance`         Sample variance of data.
=======================  =============================================


Function details
----------------

.. function:: mean(data)

   Return the sample arithmetic mean of data, a sequence or iterator of
   real-valued numbers.

   The arithmetic mean is the sum of the data divided by the number of data
   points. It is commonly called "the average", although it is only one of many
   different mathematical averages. It is a measure of the central location of
   the data.

   If data is empty, :exc:`StatisticsError` will be raised.

   Some examples of use:

   .. doctest::

      >>> mean([1, 2, 3, 4, 4])
      2.8
      >>> mean([-1.0, 2.5, 3.25, 5.75])
      2.625

      >>> from fractions import Fraction as F
      >>> mean([F(3, 7), F(1, 21), F(5, 3), F(1, 3)])
      Fraction(13, 21)

      >>> from decimal import Decimal as D
      >>> mean([D("0.5"), D("0.75"), D("0.625"), D("0.375")])
      Decimal('0.5625')

   .. note::

      The mean is strongly affected by outliers and is not a robust estimator
      for central location: the mean is not necessarily a typical example of the
      data points. For more robust, although less efficient, measures of
      central location, see :func:`median` and :func:`mode`. (In this case,
      "efficient" refers to statistical efficiency rather than computational
      efficiency.)

      The sample mean gives an unbiased estimate of the true population mean,
      which means that, taken on average over all the possible samples,
      ``mean(sample)`` converges on the true mean of the entire population. If
      data represents the entire population rather than a sample, then
      ``mean(data)`` is equivalent to calculating the true population mean μ.

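The note on outliers under ``mean`` can be illustrated with a small sketch. The
data set below is made up for illustration (an assumption, not taken from the
module's own examples): a single extreme value moves the mean a long way while
the median stays put.

```python
from statistics import mean, median

# Hypothetical data: four ordinary readings and one extreme outlier.
readings = [1, 2, 3, 4, 100]

average = mean(readings)    # pulled far towards the outlier
middle = median(readings)   # stays with the bulk of the data

print(average, middle)
```

Here the mean lands well above every value except the outlier, which is why
the median is usually preferred when such values are expected.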

.. function:: median(data)

   Return the median (middle value) of numeric data, using the common "mean of
   middle two" method. If data is empty, :exc:`StatisticsError` is raised.

   The median is a robust measure of central location, and is less affected by
   the presence of outliers in your data. When the number of data points is
   odd, the middle data point is returned:

   .. doctest::

      >>> median([1, 3, 5])
      3

   When the number of data points is even, the median is interpolated by taking
   the average of the two middle values:

   .. doctest::

      >>> median([1, 3, 5, 7])
      4.0

   This approach suits discrete data when you do not mind that the median may
   not be an actual data point.

   .. seealso:: :func:`median_low`, :func:`median_high`, :func:`median_grouped`

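The "mean of middle two" rule can be sketched in a few lines of plain Python.
This is an illustrative re-implementation under the assumption that the data
can be sorted; ``median_sketch`` is a hypothetical name and is not the module's
own code (it raises a plain :exc:`ValueError` rather than
:exc:`StatisticsError`).

```python
def median_sketch(data):
    """Illustrative 'mean of middle two' median; not the module's own code."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value is the median.
        return ordered[mid]
    # Even count: average the two values either side of the midpoint.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_sketch([1, 3, 5]))
print(median_sketch([1, 3, 5, 7]))
```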

.. function:: median_low(data)

   Return the low median of numeric data. If data is empty,
   :exc:`StatisticsError` is raised.

   The low median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the smaller of
   the two middle values is returned.

   .. doctest::

      >>> median_low([1, 3, 5])
      3
      >>> median_low([1, 3, 5, 7])
      3

   Use the low median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.


.. function:: median_high(data)

   Return the high median of data. If data is empty, :exc:`StatisticsError`
   is raised.

   The high median is always a member of the data set. When the number of data
   points is odd, the middle value is returned. When it is even, the larger of
   the two middle values is returned.

   .. doctest::

      >>> median_high([1, 3, 5])
      3
      >>> median_high([1, 3, 5, 7])
      5

   Use the high median when your data are discrete and you prefer the median to
   be an actual data point rather than interpolated.

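The low and high medians differ only in which index of the sorted data they
pick. The sketch below shows one way to express that; the function names are
hypothetical and the code is an illustration rather than the module's
implementation.

```python
def median_low_sketch(data):
    """Pick the middle value, or the smaller of the two middle values."""
    ordered = sorted(data)
    if not ordered:
        raise ValueError("no median for empty data")
    # For odd n this is the middle index; for even n, the lower middle index.
    return ordered[(len(ordered) - 1) // 2]


def median_high_sketch(data):
    """Pick the middle value, or the larger of the two middle values."""
    ordered = sorted(data)
    if not ordered:
        raise ValueError("no median for empty data")
    # For odd n this is the middle index; for even n, the upper middle index.
    return ordered[len(ordered) // 2]


print(median_low_sketch([1, 3, 5, 7]), median_high_sketch([1, 3, 5, 7]))
```

Both sketches always return a member of the data set, so no arithmetic is
performed on the values themselves.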

.. function:: median_grouped(data, interval=1)

   Return the median of grouped continuous data, calculated as the 50th
   percentile, using interpolation. If data is empty, :exc:`StatisticsError`
   is raised.

   .. doctest::

      >>> median_grouped([52, 52, 53, 54])
      52.5

   In the following example, the data are rounded, so that each value represents
   the midpoint of data classes, e.g. 1 is the midpoint of the class 0.5-1.5, 2
   is the midpoint of 1.5-2.5, 3 is the midpoint of 2.5-3.5, etc. With the data
   given, the middle value falls somewhere in the class 3.5-4.5, and
   interpolation is used to estimate it:

   .. doctest::

      >>> median_grouped([1, 2, 2, 3, 4, 4, 4, 4, 4, 5])
      3.7

   Optional argument interval represents the class interval, and defaults
   to 1. Changing the class interval naturally will change the interpolation:

   .. doctest::

      >>> median_grouped([1, 3, 3, 5, 7], interval=1)
      3.25
      >>> median_grouped([1, 3, 3, 5, 7], interval=2)
      3.5

   This function does not check whether the data points are at least
   interval apart.

   .. impl-detail::

      Under some circumstances, :func:`median_grouped` may coerce data points to
      floats. This behaviour is likely to change in the future.

   .. seealso::

      * "Statistics for the Behavioral Sciences", Frederick J Gravetter and
        Larry B Wallnau (8th Edition).

      * Calculating the `median <http://www.ualberta.ca/~opscan/median.html>`_.

      * The `SSMEDIAN
        <https://projects.gnome.org/gnumeric/doc/gnumeric-function-SSMEDIAN.shtml>`_
        function in the Gnome Gnumeric spreadsheet, including `this discussion
        <https://mail.gnome.org/archives/gnumeric-list/2011-April/msg00018.html>`_.

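The interpolation used for grouped data can be sketched with the textbook
formula ``L + interval * (n/2 - cf) / f``, where ``L`` is the lower limit of
the class containing the median, ``cf`` is the number of values below that
class and ``f`` is the number of values in it. The sketch below reproduces the
results shown for ``median_grouped``; that the module uses exactly this
arithmetic is an assumption, and ``median_grouped_sketch`` is a hypothetical
name.

```python
def median_grouped_sketch(data, interval=1):
    """Illustrative grouped median by linear interpolation within a class."""
    ordered = sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    x = ordered[n // 2]                         # midpoint of the median class
    lower = x - interval / 2                    # lower limit L of that class
    below = sum(1 for v in ordered if v < x)    # cumulative frequency cf
    freq = ordered.count(x)                     # class frequency f
    return lower + interval * (n / 2 - below) / freq


print(median_grouped_sketch([1, 2, 2, 3, 4, 4, 4, 4, 4, 5]))
print(median_grouped_sketch([1, 3, 3, 5, 7], interval=2))
```

For the ten-value example, the median class is 3.5-4.5, four values lie below
it and five lie inside it, so the estimate is 3.5 + (5 - 4) / 5.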

.. function:: mode(data)

   Return the most common data point from discrete or nominal data. The mode
   (when it exists) is the most typical value, and is a robust measure of
   central location.

   If data is empty, or if there is not exactly one most common value,
   :exc:`StatisticsError` is raised.

   ``mode`` assumes discrete data, and returns a single value. This is the
   standard treatment of the mode as commonly taught in schools:

   .. doctest::

      >>> mode([1, 1, 2, 3, 3, 3, 3, 4])
      3

   The mode is unique in that it is the only statistic in this module which
   also applies to nominal (non-numeric) data:

   .. doctest::

      >>> mode(["red", "blue", "blue", "red", "green", "red", "red"])
      'red'

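The "exactly one most common value" rule can be sketched with
:class:`collections.Counter`. This is an illustration of the rule as described
above, not the module's implementation; ``mode_sketch`` is a hypothetical name
and it raises a plain :exc:`ValueError` instead of :exc:`StatisticsError`.

```python
from collections import Counter


def mode_sketch(data):
    """Return the single most common value, or raise if there is a tie."""
    counts = Counter(data).most_common()
    if not counts:
        raise ValueError("no mode for empty data")
    top = counts[0][1]
    # Collect every value that shares the highest count.
    winners = [value for value, count in counts if count == top]
    if len(winners) > 1:
        raise ValueError(
            "no unique mode; found %d equally common values" % len(winners)
        )
    return winners[0]


print(mode_sketch(["red", "blue", "blue", "red", "green", "red", "red"]))
```

Because only equality and hashing are needed, the same sketch works for
numbers and for nominal data such as strings.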

.. function:: pstdev(data, mu=None)

   Return the population standard deviation (the square root of the population
   variance). See :func:`pvariance` for arguments and other details.

   .. doctest::

      >>> pstdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      0.986893273527251

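The relationship between ``pstdev`` and ``pvariance`` can be checked directly.
The sketch below reuses the data from the example above and compares the two
with :func:`math.isclose`, since the last floating-point digit is not
guaranteed to match.

```python
import math
from statistics import pstdev, pvariance

data = [1.5, 2.5, 2.5, 2.75, 3.25, 4.75]

sigma = pstdev(data)                            # population standard deviation
root_of_variance = math.sqrt(pvariance(data))   # square root of the variance

print(math.isclose(sigma, root_of_variance))
```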

.. function:: pvariance(data, mu=None)

   Return the population variance of data, a non-empty iterable of real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument mu is given, it should be the mean of
   data. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function to calculate the variance from the entire population. To
   estimate the variance from a sample, the :func:`variance` function is usually
   a better choice.

   Raises :exc:`StatisticsError` if data is empty.

   Examples:

   .. doctest::

      >>> data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
      >>> pvariance(data)
      1.25

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument mu to avoid recalculation:

   .. doctest::

      >>> mu = mean(data)
      >>> pvariance(data, mu)
      1.25

   This function does not attempt to verify that you have passed the actual mean
   as mu. Using arbitrary values for mu may lead to invalid or impossible
   results.

   Decimals and Fractions are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> pvariance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('24.815')

      >>> from fractions import Fraction as F
      >>> pvariance([F(1, 4), F(5, 4), F(1, 2)])
      Fraction(13, 72)

   .. note::

      When called with the entire population, this gives the population variance
      σ². When called on a sample instead, this is the biased sample variance
      s², also known as variance with N degrees of freedom.

      If you somehow know the true population mean μ, you may use this function
      to calculate the variance of a sample, giving the known population mean as
      the second argument. Provided the data points are representative
      (e.g. independent and identically distributed), the result will be an
      unbiased estimate of the population variance.

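The definition of population variance, the mean of the squared deviations from
the mean, can be written out as a short sketch. ``pvariance_sketch`` is a
hypothetical name; it uses a naive sum and so may be less accurate than the
module for awkward floating-point inputs.

```python
from fractions import Fraction


def pvariance_sketch(data, mu=None):
    """Illustrative population variance: mean squared deviation from mu."""
    values = list(data)
    n = len(values)
    if n == 0:
        raise ValueError("variance requires at least one data point")
    if mu is None:
        # No mean supplied, so compute it from the data.
        mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


data = [0.0, 0.25, 0.25, 1.25, 1.5, 1.75, 2.75, 3.25]
print(pvariance_sketch(data))
print(pvariance_sketch([Fraction(1, 4), Fraction(5, 4), Fraction(1, 2)]))
```

As with ``pvariance`` itself, passing a value for ``mu`` that is not the real
mean is not detected and gives a meaningless result.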

.. function:: stdev(data, xbar=None)

   Return the sample standard deviation (the square root of the sample
   variance). See :func:`variance` for arguments and other details.

   .. doctest::

      >>> stdev([1.5, 2.5, 2.5, 2.75, 3.25, 4.75])
      1.0810874155219827


.. function:: variance(data, xbar=None)

   Return the sample variance of data, an iterable of at least two real-valued
   numbers. Variance, or second moment about the mean, is a measure of the
   variability (spread or dispersion) of data. A large variance indicates that
   the data is spread out; a small variance indicates it is clustered closely
   around the mean.

   If the optional second argument xbar is given, it should be the mean of
   data. If it is missing or ``None`` (the default), the mean is
   automatically calculated.

   Use this function when your data is a sample from a population. To calculate
   the variance from the entire population, see :func:`pvariance`.

   Raises :exc:`StatisticsError` if data has fewer than two values.

   Examples:

   .. doctest::

      >>> data = [2.75, 1.75, 1.25, 0.25, 0.5, 1.25, 3.5]
      >>> variance(data)
      1.3720238095238095

   If you have already calculated the mean of your data, you can pass it as the
   optional second argument xbar to avoid recalculation:

   .. doctest::

      >>> m = mean(data)
      >>> variance(data, m)
      1.3720238095238095

   This function does not attempt to verify that you have passed the actual mean
   as xbar. Using arbitrary values for xbar can lead to invalid or
   impossible results.

   Decimal and Fraction values are supported:

   .. doctest::

      >>> from decimal import Decimal as D
      >>> variance([D("27.5"), D("30.25"), D("30.25"), D("34.5"), D("41.75")])
      Decimal('31.01875')

      >>> from fractions import Fraction as F
      >>> variance([F(1, 6), F(1, 2), F(5, 3)])
      Fraction(67, 108)

   .. note::

      This is the sample variance s² with Bessel's correction, also known as
      variance with N-1 degrees of freedom. Provided that the data points are
      representative (e.g. independent and identically distributed), the result
      should be an unbiased estimate of the true population variance.

      If you somehow know the actual population mean μ, you should pass it to
      the :func:`pvariance` function as the mu parameter to get the variance of
      a sample.

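Bessel's correction amounts to dividing the sum of squared deviations by
``n - 1`` instead of ``n``. The sketch below checks that relationship between
``variance`` and ``pvariance`` using the Fraction data from the example above,
so the comparison is exact.

```python
from fractions import Fraction
from statistics import pvariance, variance

sample = [Fraction(1, 6), Fraction(1, 2), Fraction(5, 3)]
n = len(sample)

biased = pvariance(sample)    # squared deviations divided by n
unbiased = variance(sample)   # squared deviations divided by n - 1

# Rescaling the biased figure by n / (n - 1) recovers the sample variance.
print(biased * n / (n - 1) == unbiased)
```

The correction matters most for small samples; as ``n`` grows, the factor
``n / (n - 1)`` approaches 1 and the two results converge.
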

Exceptions
----------

A single exception is defined:

.. exception:: StatisticsError

   Subclass of :exc:`ValueError` for statistics-related exceptions.

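Because :exc:`StatisticsError` subclasses :exc:`ValueError`, callers can catch
either. The helper below is a hypothetical convenience wrapper, shown only to
illustrate handling the empty-data case.

```python
import statistics


def safe_mean(values, default=None):
    """Hypothetical helper: return a default instead of raising on empty data."""
    try:
        return statistics.mean(values)
    except statistics.StatisticsError:
        # Raised by mean() when there are no data points.
        return default


print(safe_mean([1, 2, 3]))
print(safe_mean([]))
```

Catching the specific exception, rather than a bare :exc:`ValueError`, avoids
hiding unrelated errors raised elsewhere in the calculation.
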

..
   # This modeline must appear within the last ten lines of the file.
   kate: indent-width 3; remove-trailing-space on; replace-tabs on; encoding utf-8;