EDA PART 1, Summary Statistics
Exploratory Data Analysis (EDA) is a systematic and unbiased approach to analyzing and understanding a dataset by employing statistical techniques and visualizations. It involves thoroughly examining the dataset from various perspectives, summarizing its characteristics, and identifying key patterns, trends, or anomalies without making any prior assumptions about the data's nature or underlying relationships. EDA aims to provide meaningful insights and uncover relevant features of the data that can guide further analysis and decision-making processes.
There are two types of data: categorical and numerical.
Categorical data, also known as qualitative data, represents characteristics or attributes that belong to a specific category or group.
Numerical data, also known as quantitative data, consists of numeric values representing measurable quantities or variables.
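In pandas terms, the distinction shows up directly in a DataFrame's column dtypes. A minimal sketch with made-up values:
import pandas as pd
df = pd.DataFrame({'sector': ['Tech', 'Energy', 'Tech'],   # categorical (qualitative)
                   'price_to_sales': [8.1, 1.3, 6.7]})     # numerical (quantitative)
print(df.dtypes)  # sector is stored as object, price_to_sales as float64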
Univariate Analysis: Univariate analysis is a type of exploratory data analysis that focuses on analyzing or dealing with only one variable at a time. It involves examining and describing the data of a single variable, aiming to identify patterns and characteristics within that variable. Univariate analysis does not consider causes or relationships and primarily serves the purpose of providing insights and summarizing data.
Bivariate Analysis: Bivariate analysis is a type of exploratory data analysis that involves the analysis of two different variables simultaneously. It explores the relationship between these two variables, aiming to understand how changes in one variable may relate to changes in the other. Bivariate analysis seeks to identify associations and dependencies between the two variables under investigation, which can hint at (but not prove) causal relationships.
Multivariate Analysis: Multivariate analysis is a type of exploratory data analysis that deals with datasets containing three or more variables. It examines the relationships and patterns between multiple variables, allowing for a more comprehensive analysis of the data. Multivariate analysis employs various statistical techniques and graphical representations to uncover complex relationships and interactions among the variables, facilitating a deeper understanding of the dataset as a whole.
EDA consists of two parts: Non-graphical Analysis and Graphical Analysis.
Non-graphical Analysis: Non-graphical analysis in exploratory data analysis involves examining and analyzing data using statistical tools and measures, such as summary statistics that quantitatively describe or summarize features of a dataset. It focuses on understanding the characteristics and patterns of mostly one variable, but can also be applied to two or more variables.
Graphical Analysis: Graphical analysis is the most common part of exploratory data analysis that utilizes visualizations and charts to analyze and interpret data. It involves representing data in graphical forms to visually identify trends, patterns, distributions, relationships between variables or even compare different variables. Graphical analysis provides a comprehensive view of the data, allowing for a better understanding of the underlying structure and facilitating the exploration of multivariate relationships.
We'll start off by performing the non-graphical part, then finish with graphical analysis in the second part.
Mean:
The arithmetic mean is the sum of the observations divided by the number of observations. The arithmetic mean is by far the most frequently used measure of the middle or center of data. The mean is also referred to as the average. The population mean, μ, is the arithmetic mean value of a population. For a finite population, the population mean is:
$$\mu = \dfrac{\sum_{i=1}^N X_i}{N} $$
where $N$ is the number of observations in the entire population and $X_i$ is the $i$th observation.
The sample mean is the arithmetic mean computed for a sample. A sample is a subset of the total population. You can use the data from a sample to make inferences about the population as a whole. The concept of the mean can be applied to the observations in a sample with a slight change in notation.
$$\bar{x} = \dfrac{\sum_{i=1}^n X_i}{n} $$
where $n$ is the number of observations in the sample.
import pandas as pd
import numpy as np
import yfinance as yf
import scipy.stats as stats
import statistics
# download daily adjusted close prices for Microsoft
msft_daily = yf.download('MSFT', '2016-01-01', '2021-12-31', interval='1d')['Adj Close']
# convert prices into daily percentage returns
msft_returns = msft_daily.pct_change(1).dropna()
average_return = str(round(np.mean(msft_returns) * 100, 2)) + '%'
print(f'The average daily return for Microsoft stock is: {average_return}')
Weighted mean:
The ordinary arithmetic mean is where all sample observations are equally weighted by the factor 1/n (each of the data points contributes equally to the final average).
But with the weighted mean, some data points contribute more than others based on their weighting: the higher the weighting, the more a data point influences the final average. The weighted mean is also referred to as the weighted average.
$$\bar{X}_w = \sum_{i=1}^n w_i X_i $$
where the sum of the weights equals 1; that is, $\sum_{i} w_i = 1$.
aapl = yf.download('AAPL', '2021-10-01', '2021-12-31', interval='1d')['Adj Close']
nvda = yf.download('NVDA', '2021-10-01', '2021-12-31', interval='1d')['Adj Close']
msft = yf.download('MSFT', '2021-10-01', '2021-12-31', interval='1d')['Adj Close']
# total return over the period for each stock
msft_ret = (msft.iloc[-1] - msft.iloc[0]) / msft.iloc[0]
aapl_ret = (aapl.iloc[-1] - aapl.iloc[0]) / aapl.iloc[0]
nvda_ret = (nvda.iloc[-1] - nvda.iloc[0]) / nvda.iloc[0]
# portfolio return if 50% of capital was deployed in Microsoft, 30% in Apple and 20% in Nvidia
Wavg = msft_ret * .50 + aapl_ret * .30 + nvda_ret * .20  # the weights already sum to 1, so no further division
avg = (msft_ret + aapl_ret + nvda_ret) / 3               # equal-weighted arithmetic mean
weighted = str(round(Wavg * 100, 2)) + '%'
arith = str(round(avg * 100, 2)) + '%'
print(f"The Weighted mean return of the portfolio assuming a 50/30/20 split is: {weighted}")
print(f"The Arithmetic mean return of the portfolio assuming no split is: {arith}")
The weighted mean is also very useful when calculating a theoretically expected outcome where each outcome has a different probability of occurring (more on this in probability concepts).
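As a minimal sketch with made-up scenario probabilities, the probabilities simply act as the weights:
# hypothetical scenarios: the probabilities are the weights and sum to 1
probabilities = [0.25, 0.50, 0.25]       # bear, base and bull case
scenario_returns = [-0.10, 0.05, 0.20]   # assumed return in each scenario
expected_return = sum(p * r for p, r in zip(probabilities, scenario_returns))
print(f'Expected return across the three scenarios: {round(expected_return * 100, 2)}%')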
Harmonic mean:
The harmonic mean is a type of numerical average. It is calculated by dividing the number of observations by the sum of the reciprocals of each number in the series. Thus, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.
$$\bar{X}_h = \dfrac{n}{\sum_{i=1}^n \dfrac{1}{X_i}} $$
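A quick sketch with made-up numbers confirms the definition, the reciprocal of the arithmetic mean of the reciprocals:
x = np.array([1.0, 2.0, 4.0])   # made-up values
print(len(x) / np.sum(1 / x))   # 3 / (1 + 0.5 + 0.25) = 1.714...
print(stats.hmean(x))           # scipy's built-in harmonic mean agrees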
Geometric mean:
The geometric mean is most frequently used to average rates of change over time or to compute the growth rate of a variable.
For volatile numbers like stock returns, the geometric average provides a far more accurate measurement of the true return by taking into account year-over-year compounding that smooths the average.
$$ G = \sqrt[n]{X_1 X_2 X_3 \cdots X_n} $$
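To make the compounding point concrete before moving to real data, here is a minimal sketch with made-up yearly returns: the geometric mean is taken over the growth factors (1 + r) and then converted back to a rate, giving the compound annual growth rate.
yearly_returns = np.array([0.10, -0.05, 0.20])  # made-up yearly returns
growth_factors = 1 + yearly_returns
geo_return = stats.gmean(growth_factors) - 1    # average compounded growth rate
print(f'Geometric (compounded) mean return: {round(geo_return * 100, 2)}%')
print(f'Arithmetic mean return: {round(np.mean(yearly_returns) * 100, 2)}%')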
SnP_500 = pd.read_csv('/home/mj22/data/financials.csv')  # S&P 500 constituents with fundamental ratios
SnP_500
print("The Harmonic mean of Price to Sales Ratio for companies in the S&P 500 is:", round(stats.hmean(SnP_500['Price/Sales']),2))
print("The Geometric mean of Price to Sales Ratio for companies in the S&P 500 is:", round(stats.gmean(SnP_500['Price/Sales']),2))
print("The Arithmetic mean of Price to Sales Ratio for companies in the S&P 500 is:", round(np.mean(SnP_500['Price/Sales']),2))
“A mathematical fact concerning the harmonic, geometric, and arithmetic means is that unless all the observations in a data set have the same value, the harmonic mean is less than the geometric mean, which in turn is less than the arithmetic mean” – Quantitative Investment Analysis, by DeFusco, McLeavey, Pinto, and Runkle
Trimmed mean:
A trimmed mean is a method of averaging that removes a small designated percentage of the largest and smallest values before calculating the mean. After removing the specified outlier observations, the trimmed mean is found using a standard arithmetic averaging formula. The use of a trimmed mean helps eliminate the influence of outliers or data points on the tails that may unfairly affect the traditional or arithmetic mean.
To trim the mean by a total of 40%, we remove the lowest 20% and the highest 20% of values, eliminating the scores of 2 and 9.
data = [8, 2, 3, 4, 9]
mean = np.mean(data)                  # arithmetic mean of all five values: 5.2
trimmed_data = [3, 4, 8]              # lowest (2) and highest (9) values removed
trimmed_mean = np.mean(trimmed_data)  # mean of the remaining values: 5.0
print(f"Hence, a mean trimmed at 40% would equal {trimmed_mean} versus {mean}")
Median:
The median is the middle value of a dataset when the observations are sorted; half the values lie below it and half above, which makes it robust to outliers.
print('The Median Price to Sales Ratio for Companies in the S&P 500 is:', round(np.median(SnP_500['Price/Sales']), 2))
Mode:
The mode is the value that appears most frequently in a data set. A distribution can have more than one mode or even no mode. When a distribution has one most frequently occurring value, the distribution is said to be unimodal. If a distribution has two most frequently occurring values, then it has two modes and we say it is bimodal. If it has three most frequently occurring values, then it is trimodal. When all the values in a data set are different, the distribution has no mode because no value occurs more frequently than any other.
mode = round(statistics.mode(SnP_500['Price/Sales']),2)
print(f'The Mode of the Price to Sales ratio for Companies in the S&P 500 is {mode}, indicating that it is the most commonly occurring value among the dataset')
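Since statistics.mode returns only a single value (and raises an error on ties in Python versions before 3.8), statistics.multimode (Python 3.8+) is a handy way to see the unimodal, bimodal, and no-mode cases described above, sketched here with made-up lists:
print(statistics.multimode([1, 2, 2, 3]))     # unimodal: [2]
print(statistics.multimode([1, 1, 2, 2, 3]))  # bimodal: [1, 2]
print(statistics.multimode([1, 2, 3]))        # every value occurs once, so all are returned: no mode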
Range:
The range is the simplest measure of dispersion: the difference between the maximum and minimum values in a dataset.
dividends = yf.Ticker('MSFT').dividends  # Microsoft's per-share dividend history
maX = np.max(dividends)
miN = np.min(dividends)
Range = np.ptp(dividends)  # peak-to-peak: max minus min
print(f'The maximum dividend per share paid by Microsoft is ${maX}')
print(f'The minimum dividend per share paid by Microsoft is ${miN}')
print(f'Hence, the range is ${Range}')
Variance sample and population:
The variance and standard deviation are the two most widely used measures of dispersion. Variance is defined as the average of the squared deviations around the mean. Population variance measures the spread of population data: it is the average of the squared distances from each data point to the population mean, and it indicates how data points are spread out in the population. It is denoted by the symbol σ².
Population formula:
$$\sigma^2 = \dfrac{\sum_{i=1}^N(x_i - \mu)^2}N $$
While sample formula is:
$$s^2 = \dfrac{\sum_{i=1}^n(x_i - \bar x)^2}{n-1} $$
Standard deviation sample and population:
Standard Deviation (SD) is the positive square root of the variance. It is represented by the Greek letter σ and is used to measure the amount of variation or dispersion of a set of data values relative to its mean (average), and thus to judge the reliability of the data. If it is smaller, the data points lie close to the mean value, showing reliability; if it is larger, the data points spread far from the mean.
Population formula:
$$\sigma = \sqrt{\dfrac{\sum_{i=1}^N(x_i - \mu)^2}{N}} $$
While sample formula is:
$$s = \sqrt{\dfrac{\sum_{i=1}^n(x_i - \bar x)^2}{n-1}} $$
Hence, the variance measures dispersion in squared units of the data, while the standard deviation expresses that same spread around the mean in the original units of the data!
print("The variance of Microsoft's daily stock returns:", round(np.var(msft_returns),5))
print("The standard deviation of Microsoft's daily stock returns:", round(np.std(msft_returns),5))
Relationship between variables
A relationship between variables means that, in a dataset, the values of one variable correspond to the values of another variable.
By conducting a non-graphical analysis of the relationship between variables, you can quantitatively assess their associations, dependencies, and impacts, providing valuable insights for further analysis and decision-making.
Moreover, non-graphical analysis of the relationship between variables involves examining the numerical values in a matrix to understand the connections between variables. A covariance matrix and a correlation matrix are square matrices that display the pairwise relationships between different variables in a dataset. They provide valuable insights into the strength and direction of the relationships between variables.
Covariance
Covariance provides insight into how two variables are related to one another. More precisely, covariance refers to the measure of how two random variables in a data set will change together. A positive covariance means that the two variables at hand are positively related, and they move in the same direction. A negative covariance means that the variables are inversely related, or that they move in opposite directions. Both variance and covariance measure how data points are distributed around a calculated mean. However, variance measures the spread of data along a single axis, while covariance examines the directional relationship between two variables.
Population formula:
$$\sigma_{xy} = \dfrac{\sum_{i=1}^N(x_i - \mu_x)(y_i - \mu_y)}{N} $$
While sample formula is:
$$s_{xy} = \dfrac{\sum_{i=1}^n(x_i - \bar x)(y_i - \bar y)}{n-1} $$
assets = ['META','AMZN','NFLX','GOOG','MSFT','NVDA','TSLA']
pf_data = pd.DataFrame()
for a in assets:
    # download daily adjusted close prices for each ticker into one DataFrame
    pf_data[a] = yf.download(a, start="2021-10-01", end="2021-12-31")['Adj Close']
returns = pf_data.pct_change(1).dropna()  # daily returns for each ticker
cov = returns.cov()    # pairwise covariance matrix
corr = returns.corr()  # pairwise Pearson correlation matrix
cov
Correlation coefficient
Correlation shows the strength of a relationship between two variables and is expressed numerically by the correlation coefficient. While covariance measures the direction of a relationship between two variables, correlation measures the strength of that relationship. There are many different measures of correlation, but the most common one, and the one I use, is the Pearson correlation coefficient.
Output values of the Pearson correlation coefficient range between +1 and -1, or 100% and -100%, where +1 represents perfect positive correlation and -1 perfect negative correlation. A value of 0 would suggest the two variables are perfectly uncorrelated, with no linear relationship between them. However, that doesn't necessarily mean the variables are independent, as they might have a relationship that is not linear. Scatter plots are a good way of visualizing various values for correlation.
Population formula:
$$\rho = \dfrac{\sigma_{xy}}{\sigma_x\sigma_y} $$
While sample formula is:
$$r = \dfrac{s_{xy}}{s_x s_y} $$
corr
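To tie the formula back to the matrices above, a minimal sketch recovers a single Pearson coefficient from the covariance matrix and the standard deviations, using the MSFT/NVDA pair from the data downloaded earlier (pandas uses the sample n - 1 formulas throughout, so the two values match):
manual_corr = cov.loc['MSFT', 'NVDA'] / (returns['MSFT'].std() * returns['NVDA'].std())
print(round(manual_corr, 4), round(corr.loc['MSFT', 'NVDA'], 4))  # both numbers should be identical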
One of the most common pitfalls of correlation analysis is that correlation is not causation!
Just because two variables have shown a historic correlation doesn't mean that one of the variables causes the other to move. The cause of the two variables moving with a positive or negative correlation could be a third, completely unconsidered variable, or a combination of many factors. In theory, we want to understand the causes of relationships between variables so we have a more accurate idea of when, and whether, those relationships might change. In reality this is very hard to achieve, so in practice correlation analysis is often used to summarize relationships and treat them as forward-looking predictors, under the caveat that many factors are likely at play in the causation of the relationship.