Captivating Methods for Visualizing Numerical Data in Python
Chapter 1: The Importance of Data Visualization
Effectively displaying data in a digestible format is crucial for accurate interpretation. Data visualization techniques help structure information so that it can be easily understood visually. This approach aligns with our innate preference for visual data, as we tend to recognize patterns, trends, and anomalies more effortlessly when they are presented graphically.
Identifying the best method to visualize your data can be daunting. There are numerous ways to depict information, yet some methods are more informative than others, depending on the context. Having a specific question to guide your visualization choices is a productive first step. From there, you can choose the graph that best highlights the information you wish to explore.
This article will concentrate on numerical data, examining three common types of graphs suitable for this purpose. We will discuss their applications, how to interpret them, and how to implement them in Python.
Section 1.1: Preparing the Dataset
To illustrate the graphs discussed here, we will create a dataset using the make_regression function from scikit-learn. The dataset consists of 100 samples with four features, all of which are informative. We also add Gaussian noise with a standard deviation of 10 to the target and set the random state to 25 so the data is reproducible; the specific values are arbitrary.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
# Create a regression dataset
X, y = make_regression(
    n_samples=100,
    n_features=4,
    n_informative=4,
    noise=10,
    random_state=25,
)
# Display the first 5 observations
print(X[:5])
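Before plotting, it can help to confirm what make_regression returned. The following is a minimal sketch that rebuilds the same dataset and checks its shape and per-feature summary statistics; the variable names feature_means and feature_stds are my own and are not used elsewhere in the article.

```python
import numpy as np
from sklearn.datasets import make_regression

# Rebuild the dataset with the same parameters used above
X, y = make_regression(
    n_samples=100,
    n_features=4,
    n_informative=4,
    noise=10,
    random_state=25,
)

# X holds one row per sample and one column per feature; y is the target
print(X.shape)  # (100, 4)
print(y.shape)  # (100,)

# Per-feature mean and standard deviation (illustrative names)
feature_means = X.mean(axis=0)
feature_stds = X.std(axis=0)
print(feature_means.round(2))
print(feature_stds.round(2))
```

Indexing a column such as X[:, 2] selects a single feature across all samples, which is how the plots below pick out individual features.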
Section 1.2: Scatter Plots
Scatter plots are excellent for visualizing the relationship between two numerical variables. They show how one variable changes as the other changes, although an association on a plot does not by itself establish that one variable causes the other. Each point in the scatter plot represents an individual observation.
# Fit a line of best fit to the feature at index 2
x = X[:, 2]
m, b = np.polyfit(x, y, 1)
best_fit = m * x + b
# Plot the observations and the fitted line
plt.subplots(figsize=(8, 5))
plt.scatter(x, y, color="blue")
plt.plot(x, best_fit, color="red")
plt.title("Impact of Feature at Index 2 on Target Label")
plt.xlabel("Feature at Index 2")
plt.ylabel("Target Label")
plt.show()
In this case, we are particularly interested in whether there is a relationship between the two variables. Just as in our personal relationships, not all connections are identical. If a relationship exists, we can further analyze its characteristics for deeper insights.
Key aspects to consider include:
- The strength of the relationship
- The nature of the relationship (positive or negative)
- Whether it follows a linear or non-linear pattern
- Presence of any outliers
The visual representation above indicates a positive, fairly strong, linear relationship, with no apparent outliers.
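The strength and direction read off a scatter plot can also be checked numerically. As a rough sketch, the Pearson correlation coefficient from np.corrcoef summarizes both for a linear relationship: its sign gives the direction and its magnitude (between 0 and 1) gives the strength. The names x, r, and slope are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same dataset as in the article
X, y = make_regression(
    n_samples=100, n_features=4, n_informative=4, noise=10, random_state=25
)

x = X[:, 2]

# Pearson correlation between the feature at index 2 and the target
r = np.corrcoef(x, y)[0, 1]

# Slope and intercept of the least-squares line, as used for the plot
slope, intercept = np.polyfit(x, y, 1)

print(f"Pearson r = {r:.2f}")
print(f"slope = {slope:.2f}, intercept = {intercept:.2f}")
```

The correlation coefficient only captures linear association, so it complements the plot rather than replacing it; a curved pattern or a single extreme outlier can make r misleading.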
Section 1.3: Histograms
Histograms are a staple in statistical data representation. They effectively display the general distribution of a numerical feature within a dataset by grouping data into intervals. Each observation is assigned to an appropriate interval, reflected in the height of the corresponding bar.
plt.subplots(figsize=(8, 5))
plt.hist(X[:, 0], bins=10) # Default of 10 bins
plt.xlabel("Feature at Index 0")
plt.title("Distribution of Feature at Index 0")
plt.ylabel("Number of Occurrences")
plt.show()
When examining histograms, pay attention to skewness and modality:
- Skewness indicates the asymmetry of the probability distribution.
- Modality refers to the number of peaks in the distribution.
Our histogram suggests a roughly symmetrical, unimodal distribution. Note that the choice of 10 bins is arbitrary; changing the number of bins changes the bin width and can change the story the histogram tells.
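To see how much the bin count matters, one option is to compute the bar heights directly with np.histogram for a few bin counts. The sketch below also asks NumPy for its "auto" bin edges and computes a simple moment-based skewness as a numeric check on symmetry. The bin counts 5, 10, and 30 and the variable names are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same dataset as in the article
X, y = make_regression(
    n_samples=100, n_features=4, n_informative=4, noise=10, random_state=25
)

x = X[:, 0]

# Bar heights for several bin counts; each set of counts sums to 100
counts_by_bins = {}
for n_bins in (5, 10, 30):
    counts, edges = np.histogram(x, bins=n_bins)
    counts_by_bins[n_bins] = counts
    print(n_bins, counts)

# Let NumPy pick the bin edges with its "auto" rule
auto_edges = np.histogram_bin_edges(x, bins="auto")
print("auto bins:", len(auto_edges) - 1)

# Moment-based skewness: values near 0 suggest a symmetric distribution
z = (x - x.mean()) / x.std()
skewness = np.mean(z ** 3)
print(f"skewness = {skewness:.2f}")
```

The same bins argument can be passed to plt.hist, so any bin count or rule tried here can be plotted directly.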
Section 1.4: Box Plots
Box plots offer another effective way to understand data spread. They highlight outliers, the interquartile range, and the median, all while utilizing minimal space, making it easier to compare distributions across different groups.
plt.subplots(figsize=(8, 5))
plt.boxplot(X, vert=False)
plt.ylabel("Features")
plt.show()
Our box plot allows us to assess the skewness of each feature, identify outliers, and read off the smallest and largest values that are not flagged as outliers (the ends of the whiskers). In our case, all features appear symmetrical, although outliers are present in the first and fourth features, warranting further investigation in real-world scenarios.
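The quantities a box plot draws can be computed by hand. The following sketch assumes Matplotlib's default whisker setting (whis=1.5), under which a point is drawn as an outlier when it lies more than 1.5 times the interquartile range below the first quartile or above the third quartile. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression

# Same dataset as in the article
X, y = make_regression(
    n_samples=100, n_features=4, n_informative=4, noise=10, random_state=25
)

# Quartiles of each feature (one value per column)
q1, median, q3 = np.percentile(X, [25, 50, 75], axis=0)
iqr = q3 - q1

# Fences beyond which a point counts as an outlier (assumes whis=1.5)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Boolean mask of outliers, then a count per feature
outlier_mask = (X < lower_fence) | (X > upper_fence)
outliers_per_feature = outlier_mask.sum(axis=0)

for i in range(X.shape[1]):
    print(
        f"feature {i}: median={median[i]:.2f}, IQR={iqr[i]:.2f}, "
        f"outliers={outliers_per_feature[i]}"
    )
```

Comparing the median's position between q1 and q3 gives a quick numeric read on skewness: a median close to the midpoint of the box is consistent with a symmetric feature.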
Data visualization is crucial for effectively conveying insights. It is vital that our visualizations are easily comprehensible to the intended audience. A key strategy for achieving this is to ensure your graphs answer questions that matter to your audience. While this article covers only a few types of graphs, it provides a solid foundation for creating high-quality visual insights.
Thanks for reading!
Chapter 2: Video Insights on Data Visualization
In the following sections, we'll explore two informative videos that delve deeper into data visualization techniques using Python.
The first video, "Creating Beautiful Geospatial Data Visualizations with Python" by Adam Symington at SciPy 2022, provides a comprehensive look at crafting aesthetically pleasing geospatial visualizations.
The second video, "Data Visualization with Python I: Plotting Fundamentals," focuses on the foundational aspects of plotting in Python, essential for any data scientist.