While non-graphical analysis provides valuable insights, graphical techniques add a visual representation that deepens our understanding of the data, which makes graphical analysis a vital tool for exploring a dataset.

Graphical analysis allows us to examine the distribution of data, providing insights into the patterns, shape, and spread of values within a dataset. Histograms, for example, display the frequency or proportion of data points within specific intervals, enabling us to identify peaks, gaps, or skewed distributions. Box plots provide a visual representation of the minimum, maximum, median, and quartiles, helping us understand the central tendencies and variability in the data. By exploring the distribution of data graphically, we can gain a deeper understanding of its characteristics and identify any outliers or anomalies.

Furthermore, graphical analysis enables us to explore the relationships between different variables in a dataset. Scatter plots, for instance, plot data points as dots on a graph, allowing us to observe the correlation or association between two variables. This helps us identify trends, patterns, or potential dependencies. Additionally, line plots or time series plots show how variables change over time, highlighting any trends or seasonal patterns. By examining these graphical representations, we can uncover valuable insights into the relationships and dependencies between variables, enabling us to make informed decisions or predictions.

Lastly, graphical analysis facilitates the comparison of different datasets or categories, allowing us to identify similarities, differences, or trends. Bar charts or column charts, for example, provide a visual representation of categorical data, making it easy to compare the frequency or proportions of different categories. Grouped bar charts or stacked bar charts can be used to compare multiple categories simultaneously. By visually comparing data, we can identify variations, spot outliers, or detect patterns across different groups or time periods. This helps us make data-driven decisions and identify areas for improvement.

By leveraging visual representations, we gain valuable insights into the underlying patterns, trends, and characteristics of the data, empowering us to make informed decisions and draw meaningful conclusions.

Below, we will explore some of the visualization tools I use most often when performing graphical analysis.

import numpy as np
import pandas as pd
import yfinance as yf
from yahoofinancials import YahooFinancials as yfin
import plotly.express as px
import plotly.graph_objs as go
import plotly.figure_factory as ff
from IPython.display import HTML
from plotly.offline import init_notebook_mode, iplot
init_notebook_mode(connected=True)  
import warnings
warnings.simplefilter("ignore")

HEATMAP

The correlation heatmap allows for a quick visual assessment of the relationships between variables in a dataset. It helps identify strong positive or negative correlations, patterns, and clusters among variables. This graphical representation provides a more intuitive and comprehensive understanding of the interdependencies among multiple variables compared to a tabular format.
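
To make the idea concrete before turning to the real portfolio, here is a minimal sketch of the matrix that a correlation heatmap colours in. The tickers AAA, BBB, and CCC and their returns are synthetic assumptions, not market data: two series share a common factor and one is independent.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns for three hypothetical tickers (not real market data).
rng = np.random.default_rng(0)
market = rng.normal(0.0, 0.01, size=250)           # shared "market" factor
demo_returns = pd.DataFrame({
    "AAA": market + rng.normal(0.0, 0.005, 250),   # tracks the shared factor
    "BBB": market + rng.normal(0.0, 0.005, 250),   # tracks the shared factor
    "CCC": rng.normal(0.0, 0.01, 250),             # independent of the factor
})

# Pearson correlation of every column with every other column.
demo_corr = demo_returns.corr()
print(demo_corr.round(2))
```

The matrix is symmetric with ones on the diagonal, so a heatmap of it always mirrors across that diagonal. When plotting such a matrix with px.imshow, passing zmin=-1 and zmax=1 pins the colour scale to the full correlation range, which can make different heatmaps easier to compare.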

Using the same portfolio of stocks from the previous part, we can examine the return correlations and quickly see which stocks move together. As expected, no pair has a negative correlation, since all the stocks share common traits: they are all part of the S&P 500, and they all belong to the technology sector, except for Tesla. This explains why Tesla is the only stock that exhibits a weaker positive correlation with every other stock. However, since Tesla uses technology heavily in its electric vehicles, it is still considered a tech company by some despite its primary focus on automotive production. It therefore shares some characteristics with the other tech stocks, which may explain why its correlations are not negative but are still weaker than those among the purely technology-focused stocks.

assets = ['META','AMZN','NFLX','GOOG','MSFT','NVDA','TSLA']
pf_data = pd.DataFrame()
for a in assets:
    pf_data[a] = yf.download(a, start="2021-10-01", end="2021-12-31")['Adj Close']

returns = pf_data.pct_change(1).dropna()
cov = returns.cov()
corr = returns.corr()

(NOTE: REFRESH PAGE IF GRAPHS ARE NOT LOADING)

fig = px.imshow(corr)
fig.update_layout(width=1000, height=800)
fig.update_layout(template="plotly_dark", title='A Heatmap of Stock Return Correlations in a Portfolio')
fig.show()

SCATTER MATRIX & SCATTER PLOT

Aside from a heatmap, we can also use a scatter matrix to check the relationships between the stocks.

A scatter matrix, also known as a scatterplot matrix or pair plot, is a graphical tool used to explore the relationships between multiple variables in a dataset. It consists of a grid of scatterplots, where each scatterplot represents the relationship between two variables.

A scatter plot is a type of graph that displays the relationship between two variables. It consists of a horizontal x-axis and a vertical y-axis, where each data point is represented by a dot or marker on the plot. The position of each dot on the scatter plot corresponds to the values of the two variables being analyzed.

Scatter plots are particularly useful for visualizing the correlation or relationship between two continuous variables. By plotting the data points on the scatter plot, it becomes easier to observe any patterns, trends, or associations between the variables. The general shape or direction of the points on the scatter plot can provide insights into the strength and direction of the relationship.

fig = px.scatter_matrix(returns, title='A Scatter Matrix Of Stock Returns In A Portfolio', color_discrete_sequence=['firebrick'])
fig.update_layout(width=1200, height=800)
fig.update_layout(template = "plotly_dark")
fig.show()

When we examine the relationships in both the heatmap and the scatter matrix, the pair that exhibits the highest correlation is Microsoft and Google. This can be explained by the fact that these two companies are often mentioned in the same context, as they share similarities in their business models. For instance, Microsoft has Bing while Google dominates the search engine market, both companies have cloud platforms (Azure and Google Cloud Platform), and they offer competing productivity tools such as Excel and Google Sheets, Word and Google Docs, and Outlook and Gmail, among others. To gain a deeper understanding of the relationship between these two stocks, we can visualize their relationship separately and analyze it more closely.

fig = px.scatter(returns, x='MSFT', y='GOOG', title='Scatter Plot Of MSFT Return and GOOG Return', color="MSFT")
fig.update_layout(width=1000, height=800)
fig.update_layout(template = "plotly_dark")
fig.show()

We can clearly see a positive linear relationship here: when one stock does well, the other tends to do well, and vice versa. This should be taken into consideration whenever both are included in the same portfolio.

We can also plot a trendline, here a line fitted by Ordinary Least Squares (OLS). It is often used in scatter plots to depict the overall trend or relationship between two variables.

By using a trendline or OLS line in a scatter plot, we can visually observe the direction and strength of the relationship between the variables. The line is positioned to best represent the general pattern of the data points, whether it be a positive or negative correlation, or even no apparent correlation.

fig = px.scatter(returns, x='MSFT', y='GOOG', title='Scatter Plot and Trendline Of MSFT Return and GOOG Return',color="MSFT",trendline='ols', trendline_color_override='firebrick')
fig.update_layout(width=1000, height=800)
fig.update_layout(template = "plotly_dark")
fig.show()

In addition to representing the overall trend, a trendline or OLS line in a scatter plot can also be utilized for prediction purposes using linear regression. By fitting a linear regression model to the data, we can establish a mathematical relationship between the variables, enabling us to make predictions or estimates based on this model.

Linear regression allows us to determine the equation of the line that best fits the data points, providing us with a predictive model. This model can then be used to forecast or estimate the value of one variable based on the value of the other variable. By leveraging the linear regression analysis, we can make informed predictions and gain insights into how changes in one variable may impact the other.
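
As a minimal sketch of what fitting and using such a line involves, the example below fits an OLS line with NumPy and uses it for a single estimate. The two return series are synthetic assumptions rather than the MSFT and GOOG data, and the +1% input is an arbitrary illustration.

```python
import numpy as np

# Synthetic daily returns for two hypothetical stocks (not real MSFT/GOOG data).
rng = np.random.default_rng(1)
x = rng.normal(0.0, 0.015, size=60)
y = 0.9 * x + rng.normal(0.0, 0.005, size=60)

# Fit y = slope * x + intercept by ordinary least squares.
slope, intercept = np.polyfit(x, y, deg=1)

# Estimate y for a hypothetical +1% move in x.
x_new = 0.01
y_hat = slope * x_new + intercept

# R-squared: share of the variance in y explained by the fitted line.
residuals = y - (slope * x + intercept)
r_squared = 1 - residuals.var() / y.var()
print(f"slope={slope:.3f}, intercept={intercept:.5f}, R^2={r_squared:.3f}")
```

The slope says how much the second series tends to move for a one-unit move in the first, and R-squared indicates how much of the scatter the line accounts for. A low R-squared is a sign that estimates from the line should be treated loosely.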

But again, caution should be exercised when interpreting the relationship between variables based on a scatter plot and linear regression analysis. Although changes in one variable may be associated with changes in the other variable, it does not necessarily imply causation!

BAR CHART

A bar chart, also known as a bar graph, is a popular visualization tool that presents data using rectangular bars of varying heights. Each bar represents a category or group, and the height of the bar corresponds to the value or frequency of that category. Bar charts are effective in comparing different categories or groups and visually displaying patterns or trends in the data.

For this particular case, I wanted to examine the trading volume for each stock in the 4th quarter of 2021. As anticipated, Tesla led the pack, with approximately 5 billion shares traded, and Amazon followed closely with 4 billion. Surprisingly, Netflix did not see the level of trading activity that I would have expected, considering the company's size and prominence.

pf_data2 = pd.DataFrame()
for b in assets:
    pf_data2[b] = yf.download(b, start="2021-01-01", end="2021-12-31", interval='3mo')['Volume']

fig = px.bar(pf_data2.loc['2021-10-01'],color_discrete_sequence=['firebrick'])
fig.update_layout(width=1000, height=800, template = "plotly_dark")
fig.show()
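
The quarterly figures above come pre-aggregated from the data provider. If only daily data were available, one possible way to build the same kind of table is to sum volume by calendar quarter with pandas. This is a sketch on synthetic volumes for two hypothetical tickers, not the portfolio data.

```python
import numpy as np
import pandas as pd

# Synthetic daily volumes for two hypothetical tickers over 2021 weekdays.
days = pd.bdate_range("2021-01-01", "2021-12-31")
rng = np.random.default_rng(2)
daily_volume = pd.DataFrame(
    rng.integers(1_000_000, 5_000_000, size=(len(days), 2)),
    index=days,
    columns=["AAA", "BBB"],
)

# Sum daily volume into calendar quarters, labelled by each quarter's start date.
quarterly_volume = daily_volume.resample("QS").sum()
print(quarterly_volume)
```

Each row is labelled with the first day of its quarter, so the fourth quarter can be selected with quarterly_volume.loc['2021-10-01'], the same way the bar chart above selects it.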

STACKED BAR CHART

A stacked bar chart is a visualization tool that represents different categories or groups of data as stacked bars, where each bar segment corresponds to a subcategory or a portion of the whole.

Using the same example, instead of focusing on the volume traded in a single quarter, I now analyze the volume traded in each of the four quarters of 2021. The verdict remains the same, with Tesla consistently trading the highest volume. Another noteworthy observation is that volume for the tech stocks in the portfolio decreased in the 3rd quarter, continuing a quarter-over-quarter (QoQ) downward trend that began in early 2021.

fig = px.bar(pf_data2)
fig.update_layout(width=1000, height=800, template = "plotly_dark")
fig.show()

Alternatively, we can plot a side-by-side bar graph for each stock. While a stacked bar chart provides information on the total volume, a side-by-side bar graph presents a clearer depiction of how the volume for each stock has evolved throughout the year.

fig = px.bar(pf_data2,barmode='group')
fig.update_layout(width=1000, height=800, template = "plotly_dark")
fig.show()
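
A quarter-over-quarter trend can also be checked numerically rather than read off the bars. The sketch below uses made-up quarterly volumes for two hypothetical tickers; the figures are illustrative assumptions only.

```python
import pandas as pd

# Hypothetical quarterly share volumes (illustrative numbers, not real data).
quarterly = pd.DataFrame(
    {"AAA": [8.0e9, 6.0e9, 4.5e9, 5.0e9], "BBB": [4.0e9, 3.6e9, 3.0e9, 3.3e9]},
    index=pd.to_datetime(["2021-01-01", "2021-04-01", "2021-07-01", "2021-10-01"]),
)

# Percentage change from the previous quarter; the first quarter has no predecessor.
qoq_change = quarterly.pct_change().dropna() * 100
print(qoq_change.round(1))
```

Negative values mark quarters where volume fell relative to the one before, so a run of negative rows corresponds to a sustained downward trend.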

LINE CHART

Line charts, also known as line graphs or time series plots, are effective visual representations for displaying trends and patterns over time. They are particularly useful when analyzing data that is continuous or sequential in nature, such as stock prices, temperature fluctuations, or population growth.

If we plot the stock prices, we can observe the progression of each stock over the period covered by the price data, here the fourth quarter of 2021.

fig = px.line(pf_data)
fig.update_layout(width=1200, height=800, template = "plotly_dark")
fig.show()

However, I typically examine how each stock has performed relative to one another. To achieve this, I normalize the data using cumulative returns, making it much easier to identify outperforming and underperforming stocks. By doing so, we can see that Nvidia is leading the pack, followed by Tesla, while Netflix and Meta have significantly underperformed in comparison.

cum_returns = (1 + returns).cumprod() - 1
fig = px.line(cum_returns)
fig.update_layout(width=1200, height=800, template = "plotly_dark")
fig.show()
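
To see why cumulative returns put stocks with very different price levels on a common scale, here is a small worked sketch. The two price series are hypothetical and chosen so the arithmetic is easy to follow.

```python
import pandas as pd

# Hypothetical prices for two stocks with very different price levels.
prices = pd.DataFrame({"AAA": [100.0, 110.0, 99.0], "BBB": [10.0, 10.5, 12.6]})

period_returns = prices.pct_change().dropna()      # simple period-over-period returns
cumulative = (1 + period_returns).cumprod() - 1    # compounded growth since the start

# Compounding the period returns matches price / first price - 1.
direct = prices.iloc[1:] / prices.iloc[0] - 1
print(cumulative)
```

Both series start from zero, so the stock trading near 100 and the one trading near 10 can be compared directly: the first ends 1% below its starting point and the second 26% above it.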

PIE CHART

A pie chart is a circular graphical representation that is divided into slices to illustrate the proportional composition of different categories or parts of a whole. Each slice represents a specific category, and its size is determined by the proportion or percentage it contributes to the total. Pie charts are effective in visualizing categorical data and showing the relative sizes or distributions of different categories. They are particularly useful for displaying data that can be grouped into distinct categories and highlighting the relative importance or contribution of each category. By examining a pie chart, we can quickly grasp the overall composition and relative significance of different components within the data.

If we want to determine which sectors are most represented in the S&P 500, we can use a pie chart for this analysis. From the pie chart, we can observe that there are five dominant sectors, with four of them being cyclical. This observation may explain why the S&P 500 tends to underperform during economic downturns. Additionally, the telecommunication sector has the fewest companies in the index, making it the least represented sector.

Data Source: https://www.kaggle.com/datasets/paytonfisher/sp-500-companies-with-financial-information?resource=download

SnP_500 = pd.read_csv('/home/mj22/data/financials.csv')

fig = px.pie(SnP_500, names='Sector')
fig.update_layout(width=1200, height=800, template = "plotly_dark")
fig.show()
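
When only a names column is supplied, the slices reflect how many rows fall into each category. The same numbers can be obtained directly with pandas, which is handy for quoting exact shares. The sketch below uses a tiny hypothetical table in place of the CSV; only the 'Sector' column name is taken from the code above.

```python
import pandas as pd

# Hypothetical constituents table; the real CSV has one row per company.
demo = pd.DataFrame({
    "Symbol": ["A1", "A2", "A3", "B1", "B2", "C1"],
    "Sector": ["Tech", "Tech", "Tech", "Energy", "Energy", "Utilities"],
})

# Number of companies per sector, and each sector's share of the total in percent.
sector_counts = demo["Sector"].value_counts()
sector_share = (sector_counts / sector_counts.sum() * 100).round(1)
print(sector_share)
```

Note that these shares count companies, not market capitalization, so a sector with many small constituents can look larger here than its weight in the index.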

BOX PLOT

A box plot is a graphical representation that provides a visual summary of the distribution of a dataset. It is particularly useful for comparing the distribution of multiple variables or groups. The plot consists of a box that represents the interquartile range (IQR), which contains the middle 50% of the data, with a line inside the box indicating the median. The whiskers extend from the box to the minimum and maximum values, excluding any outliers, which are represented as individual points. By examining the box plot, we can identify the central tendency, spread, skewness, and potential outliers in the data, making it a powerful tool for exploratory data analysis.
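
The pieces of a box plot can be computed by hand, which helps when reading one. The sketch below uses ten made-up values and assumes the common 1.5 × IQR rule for flagging outliers. Plotting libraries may compute quartiles with a slightly different method, so their numbers can differ a little from NumPy's.

```python
import numpy as np

# Hypothetical ratios with one extreme value (not taken from the dataset).
values = np.array([0.5, 1.2, 1.8, 2.4, 2.9, 3.3, 4.1, 4.7, 5.2, 20.0])

q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Common 1.5 * IQR rule: points beyond the fences are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Whiskers reach the most extreme points that are still inside the fences.
inside = values[(values >= lower_fence) & (values <= upper_fence)]
whisker_low, whisker_high = inside.min(), inside.max()
print(q1, median, q3, outliers, whisker_low, whisker_high)
```

Here the value 20.0 lies beyond the upper fence and would be drawn as an individual point, while the upper whisker stops at 5.2, the largest value inside the fence.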

When we examine the box plot for Price/Sales for all the stocks in the index, we quickly notice the presence of outliers. The maximum value is 20, the minimum value is 0.15, and the median is 2.89. The first quartile is 1.62, and the third quartile is 4.71.

fig = px.box(SnP_500, y='Price/Sales', color_discrete_sequence=['firebrick'])
fig.update_layout(width=1000, height=800, template = "plotly_dark")
fig.show()

We can also analyze the Price/Sales ratio for each sector in the index, which gives a more insightful view. From this analysis, we can observe that the real estate sector has a significantly higher P/S ratio than other sectors, while the telecommunications sector appears to have the lowest. However, it's important to note that this is an inference from a limited sample: the index only includes 500 of the largest companies, which may not present the complete picture.

fig = px.box(SnP_500, y='Price/Sales', x='Sector', color_discrete_sequence=['firebrick'])
fig.update_layout(width=1200, height=800, template = "plotly_dark")
fig.show()

VIOLIN PLOT

Alternatively, we can use a violin plot, which combines a box plot with a kernel density plot. It displays the same summary statistics as a box plot, but also provides a more detailed view of the distribution. The plot is symmetrical and resembles a violin or a mirrored density plot. The width of the violin at each point represents the density of data points, with wider areas indicating higher density.

fig = px.violin(SnP_500, y='Earnings/Share', color_discrete_sequence=['firebrick'])
fig.update_layout(width=1000, height=800, template = "plotly_dark")
fig.show()

Compared to box plots, violin plots offer additional insights into the shape and multimodality of the distribution. They provide a more comprehensive visualization of the data, allowing for a better understanding of its characteristics. However, box plots are more compact and straightforward, making them useful for quick comparisons between multiple groups or variables.
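
The outline of a violin is a kernel density estimate. As a rough sketch of how such a curve can be computed, the example below builds a Gaussian kernel density estimate with NumPy on a synthetic two-cluster sample. The bandwidth uses Silverman's rule of thumb, which is one common choice and not necessarily what a given plotting library uses.

```python
import numpy as np

# Hypothetical sample: two clusters, so the density should show two bumps.
rng = np.random.default_rng(3)
sample = np.concatenate([rng.normal(1.0, 0.5, 200), rng.normal(5.0, 0.8, 100)])

# Silverman's rule of thumb for the Gaussian kernel bandwidth (one common choice).
n = sample.size
iqr = np.subtract(*np.percentile(sample, [75, 25]))
bandwidth = 0.9 * min(sample.std(ddof=1), iqr / 1.34) * n ** (-1 / 5)

# Evaluate the density on a grid: the average of Gaussian bumps centred on each point.
grid = np.linspace(sample.min() - 3 * bandwidth, sample.max() + 3 * bandwidth, 400)
z = (grid[:, None] - sample[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).sum(axis=1) / (n * bandwidth * np.sqrt(2 * np.pi))

# The violin outline is this curve mirrored about the axis.
area = (density * (grid[1] - grid[0])).sum()   # should be close to 1
print(round(area, 3))
```

A box plot of this sample would show a single box, whereas the density curve keeps the two clusters visible, which is the multimodality that violin plots reveal.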

Additionally, with a violin plot, we can plot scatter dots alongside to better visualize the concentration of data.

We can see that most of the companies in the index have earnings per share between -3 and 12, with the data most concentrated in the range of 1 to 3. Additionally, the majority of companies have positive earnings per share.

fig = px.violin(SnP_500, y='Earnings/Share', color_discrete_sequence=['firebrick'], points="all")
fig.update_layout(width=1000, height=800, template = "plotly_dark")
fig.show()

Again, we could also observe the earnings per share (EPS) for all the sectors. It appears that almost all of the sectors have a higher number of companies with positive earnings, except for the energy sector. This discrepancy raises further questions and warrants investigation to understand why the energy sector has more companies with negative earnings compared to the rest.

fig = px.violin(SnP_500, y='Earnings/Share', x='Sector', color_discrete_sequence=['firebrick'],points="all")
fig.update_layout(width=1200, height=800, template = "plotly_dark")
fig.show()

Both violin plots and box plots serve as valuable tools in exploratory data analysis, helping to identify central tendencies, dispersion, skewness, and potential outliers in a dataset. The choice between the two depends on the specific requirements and the level of detail desired in visualizing the data distribution.

HISTOGRAM

A histogram is a graphical representation that displays the distribution of a dataset. It consists of a series of bars, where each bar represents a range of values and the height of the bar corresponds to the frequency or count of observations falling within that range. Histograms provide a visual depiction of the data's frequency distribution, allowing us to identify patterns, skewness, and central tendencies. They are particularly useful for understanding the shape of the data and detecting any outliers or unusual patterns. By examining the histogram, we can gain insights into the underlying characteristics and distribution of the variable being analyzed.
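
The binning behind a histogram can be reproduced with NumPy, which is a quick way to get the exact counts behind the bars. The returns below are synthetic, and the choice of 20 bins is arbitrary.

```python
import numpy as np

# Hypothetical weekly returns (synthetic, not index data).
rng = np.random.default_rng(4)
weekly = rng.normal(0.002, 0.02, size=500)

# Split the range into 20 equal-width bins and count observations per bin.
counts, edges = np.histogram(weekly, bins=20)

# Convert counts to percentages of all observations.
percent = counts / counts.sum() * 100
busiest = counts.argmax()
print(f"Busiest bin: {edges[busiest]:.3f} to {edges[busiest + 1]:.3f} "
      f"({percent[busiest]:.1f}% of weeks)")
```

There is always one more edge than there are bins, and the bin count affects how smooth or jagged the histogram looks, so it is worth trying a few values.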

When we plot the distribution of weekly returns for IXUS, an iShares international equity ETF used here as a stand-in for the MSCI World Index, we can observe that most of the data points fall between roughly -1.5% and 2.5%. We can also identify outliers scattered along the tails of the distribution, indicating extreme or unusual returns compared to the majority of data points.

world = yf.download('IXUS', start="2013-01-01", end="2022-06-01", interval="1wk")['Adj Close']
world_returns = world.pct_change(1).dropna()

fig = px.histogram(world_returns, nbins=80, color_discrete_sequence=['firebrick'])

# Customize the layout if needed
fig.update_layout(title="Distribution of Returns", xaxis_title="Returns", yaxis_title="Count",width=1500, height=900,template = "plotly_dark")
fig.show()

The way I typically use the distribution of returns is to look at the percentage of occurrences in each bin, which gives a sense of what to expect from the asset going forward. Interpreting the data, we can draw the following conclusions:

Approximately 66% of the time, the index returns anywhere from -1.5% to +2.5% in a given week.
If you were looking to go long on the index with a target return of 3%, the probability of achieving that would be 5.4%.
Conversely, if you were aiming to go short and expecting at least a 2% return, the probability of achieving that return is approximately 9%. However, both probabilities are low.

From this analysis, we can conclude that expecting a 3% return or a 2% return is unrealistic, and it may be necessary to adjust your target returns accordingly. Let's consider targeting a 1% return instead:

If you were to go long with a minimum target of a 1% return, the probability of that happening would be 33%.
On the other hand, if you were to go short with a minimum target of a 1% return, the probability of achieving that desired return would be 23%.

Although the odds look better once we adjust our expectations, they are still too low to rely on. Compare it to skydiving with a parachute that works only 33% of the time: taking such a risk would not be advisable.
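
Percentages like these can also be computed directly from the return series instead of being summed from the histogram bars. The sketch below shows one way to do that. The series is synthetic, and share_of_weeks is a hypothetical helper introduced only for illustration, so the printed numbers will not match the figures quoted above.

```python
import numpy as np
import pandas as pd

# Synthetic weekly returns standing in for the index series (not real data).
rng = np.random.default_rng(5)
weekly_returns = pd.Series(rng.normal(0.001, 0.02, size=480))

def share_of_weeks(condition: pd.Series) -> float:
    """Percentage of observations for which the boolean condition holds."""
    return float(condition.mean() * 100)

p_core = share_of_weeks(weekly_returns.between(-0.015, 0.025))   # inside the core range
p_long_3 = share_of_weeks(weekly_returns >= 0.03)                # long, 3% target
p_short_2 = share_of_weeks(weekly_returns <= -0.02)              # short, 2% target
print(f"{p_core:.1f}% | {p_long_3:.1f}% | {p_short_2:.1f}%")
```

These are historical frequencies, so they describe how often each outcome occurred in the sample rather than guaranteeing how often it will occur in the future.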

fig = px.histogram(world_returns, nbins=80, color_discrete_sequence=['firebrick'], text_auto=True, histnorm='percent')

# Customize the layout; the y-axis shows percentages because histnorm='percent'
fig.update_layout(title="Distribution of Returns", xaxis_title="Returns", yaxis_title="Percent", width=1500, height=900, template="plotly_dark")
fig.show()

Histograms are a great tool for exploring data. However, there is still more to uncover. As you may have noticed, we have begun delving into the realm of probability. In the next post, we will continue this exploration to gain a deeper understanding of distributions and concepts such as skewness, fat tails, and the probability of events. Incorporating these ideas will further strengthen our data analysis.
