Thread starter: 数据洞见

[Original blog post] A summary of the improvements from Python 3.4 to Python 3.9

#11 玩于股涨之中 posted on 2022-1-2 15:35:33
seaborn: statistical data visualization

#12 玩于股涨之中 posted on 2022-1-2 15:35:52
Seaborn is a Python data visualization library based on matplotlib. It provides a high-level interface for drawing attractive and informative statistical graphics.

For a brief introduction to the ideas behind the library, you can read the introductory notes or the paper. Visit the installation page to see how you can download the package and get started with it. You can browse the example gallery to see some of the things that you can do with seaborn, and then check out the tutorial or API reference to find out how.

To see the code or report a bug, please visit the GitHub repository. General support questions are most at home on stackoverflow or discourse, which have dedicated channels for seaborn.

#13 数据洞见 posted on 2022-1-16 08:45:07
Building structured multi-plot grids
When exploring multi-dimensional data, a useful approach is to draw multiple instances of the same plot on different subsets of your dataset. This technique is sometimes called either “lattice” or “trellis” plotting, and it is related to the idea of “small multiples”. It allows a viewer to quickly extract a large amount of information about a complex dataset. Matplotlib offers good support for making figures with multiple axes; seaborn builds on top of this to directly link the structure of the plot to the structure of your dataset.

The figure-level functions are built on top of the objects discussed in this chapter of the tutorial. In most cases, you will want to work with those functions. They take care of some important bookkeeping that synchronizes the multiple plots in each grid. This chapter explains how the underlying objects work, which may be useful for advanced applications.

Conditional small multiples
The FacetGrid class is useful when you want to visualize the distribution of a variable or the relationship between multiple variables separately within subsets of your dataset. A FacetGrid can be drawn with up to three dimensions: row, col, and hue. The first two have obvious correspondence with the resulting array of axes; think of the hue variable as a third dimension along a depth axis, where different levels are plotted with different colors.

Each of relplot(), displot(), catplot(), and lmplot() uses this object internally, and they return the object when they are finished so that it can be used for further tweaking.

The class is used by initializing a FacetGrid object with a dataframe and the names of the variables that will form the row, column, or hue dimensions of the grid. These variables should be categorical or discrete, and then the data at each level of the variable will be used for a facet along that axis. For example, say we wanted to examine differences between lunch and dinner in the tips dataset:

import seaborn as sns

tips = sns.load_dataset("tips")
g = sns.FacetGrid(tips, col="time")

#14 数据洞见 posted on 2022-1-16 08:45:35
Visualizing distributions of data
An early step in any effort to analyze or model data should be to understand how the variables are distributed. Techniques for distribution visualization can provide quick answers to many important questions. What range do the observations cover? What is their central tendency? Are they heavily skewed in one direction? Is there evidence for bimodality? Are there significant outliers? Do the answers to these questions vary across subsets defined by other variables?

#15 数据洞见 posted on 2022-1-16 08:46:09
import seaborn as sns

penguins = sns.load_dataset("penguins")
sns.displot(penguins, x="flipper_length_mm")
This plot immediately affords a few insights about the flipper_length_mm variable. For instance, we can see that the most common flipper length is about 195 mm, but the distribution appears bimodal, so this one number does not represent the data well.

Choosing the bin size
The size of the bins is an important parameter, and using the wrong bin size can mislead by obscuring important features of the data or by creating apparent features out of random variability. By default, displot()/histplot() choose a default bin size based on the variance of the data and the number of observations. But you should not be over-reliant on such automatic approaches, because they depend on particular assumptions about the structure of your data. It is always advisable to check that your impressions of the distribution are consistent across different bin sizes. To choose the size directly, set the binwidth parameter:

sns.displot(penguins, x="flipper_length_mm", binwidth=3)
In other circumstances, it may make more sense to specify the number of bins, rather than their size:

sns.displot(penguins, x="flipper_length_mm", bins=20)
One example of a situation where defaults fail is when the variable takes a relatively small number of integer values. In that case, the default bin width may be too small, creating awkward gaps in the distribution:

tips = sns.load_dataset("tips")
sns.displot(tips, x="size")
One approach would be to specify the precise bin breaks by passing an array to bins:

sns.displot(tips, x="size", bins=[1, 2, 3, 4, 5, 6, 7])
This can also be accomplished by setting discrete=True, which chooses bin breaks that represent the unique values in a dataset with bars that are centered on their corresponding value.

sns.displot(tips, x="size", discrete=True)
It’s also possible to visualize the distribution of a categorical variable using the logic of a histogram. Discrete bins are automatically set for categorical variables, but it may also be helpful to “shrink” the bars slightly to emphasize the categorical nature of the axis:

sns.displot(tips, x="day", shrink=.8)
Conditioning on other variables
Once you understand the distribution of a variable, the next step is often to ask whether features of that distribution differ across other variables in the dataset. For example, what accounts for the bimodal distribution of flipper lengths that we saw above? displot() and histplot() provide support for conditional subsetting via the hue semantic. Assigning a variable to hue will draw a separate histogram for each of its unique values and distinguish them by color:

sns.displot(penguins, x="flipper_length_mm", hue="species")
By default, the different histograms are “layered” on top of each other and, in some cases, they may be difficult to distinguish. One option is to change the visual representation of the histogram from a bar plot to a “step” plot:

sns.displot(penguins, x="flipper_length_mm", hue="species", element="step")
Alternatively, instead of layering each bar, they can be “stacked”, or moved vertically. In this plot, the outline of the full histogram will match the plot with only a single variable:

sns.displot(penguins, x="flipper_length_mm", hue="species", multiple="stack")
The stacked histogram emphasizes the part-whole relationship between the variables, but it can obscure other features (for example, it is difficult to determine the mode of the Adelie distribution). Another option is to "dodge" the bars, which moves them horizontally and reduces their width. This ensures that there are no overlaps and that the bars remain comparable in terms of height. But it only works well when the categorical variable has a small number of levels:

sns.displot(penguins, x="flipper_length_mm", hue="sex", multiple="dodge")
Because displot() is a figure-level function and is drawn onto a FacetGrid, it is also possible to draw each individual distribution in a separate subplot by assigning the second variable to col or row rather than (or in addition to) hue. This represents the distribution of each subset well, but it makes it more difficult to draw direct comparisons:

sns.displot(penguins, x="flipper_length_mm", col="sex")

#16 数据洞见 posted on 2022-1-16 08:46:40
Normalized histogram statistics
Another point to note is that, when the subsets have unequal numbers of observations, comparing their distributions in terms of counts may not be ideal. One solution is to normalize the counts using the stat parameter:

sns.displot(penguins, x="flipper_length_mm", hue="species", stat="density")
By default, however, the normalization is applied to the entire distribution, so this simply rescales the height of the bars. By setting common_norm=False, each subset will be normalized independently:

sns.displot(penguins, x="flipper_length_mm", hue="species", stat="density", common_norm=False)
Density normalization scales the bars so that their areas sum to 1. As a result, the density axis is not directly interpretable. Another option is to normalize the bars so that their heights sum to 1. This makes most sense when the variable is discrete, but it is an option for all histograms:

sns.displot(penguins, x="flipper_length_mm", hue="species", stat="probability")
Kernel density estimation
A histogram aims to approximate the underlying probability density function that generated the data by binning and counting observations. Kernel density estimation (KDE) presents a different solution to the same problem. Rather than using discrete bins, a KDE plot smooths the observations with a Gaussian kernel, producing a continuous density estimate:

sns.displot(penguins, x="flipper_length_mm", kind="kde")
Choosing the smoothing bandwidth
Much like with the bin size in the histogram, the ability of the KDE to accurately represent the data depends on the choice of smoothing bandwidth. An over-smoothed estimate might erase meaningful features, but an under-smoothed estimate can obscure the true shape within random noise. The easiest way to check the robustness of the estimate is to adjust the default bandwidth:

sns.displot(penguins, x="flipper_length_mm", kind="kde", bw_adjust=.25)
Note how the narrow bandwidth makes the bimodality much more apparent, but the curve is much less smooth. In contrast, a larger bandwidth obscures the bimodality almost completely:

sns.displot(penguins, x="flipper_length_mm", kind="kde", bw_adjust=2)
Conditioning on other variables
As with histograms, if you assign a hue variable, a separate density estimate will be computed for each level of that variable:

sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde")
In many cases, the layered KDE is easier to interpret than the layered histogram, so it is often a good choice for the task of comparison. Many of the same options for resolving multiple distributions apply to the KDE as well, however:

sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde", multiple="stack")
Note how the stacked plot filled in the area between each curve by default. It is also possible to fill in the curves for single or layered densities, although the default alpha value (opacity) will be different, so that the individual densities are easier to resolve.

sns.displot(penguins, x="flipper_length_mm", hue="species", kind="kde", fill=True)
Kernel density estimation pitfalls
KDE plots have many advantages. Important features of the data are easy to discern (central tendency, bimodality, skew), and they afford easy comparisons between subsets. But there are also situations where KDE poorly represents the underlying data. This is because the logic of KDE assumes that the underlying distribution is smooth and unbounded. One way this assumption can fail is when a variable reflects a quantity that is naturally bounded. If there are observations lying close to the bound (for example, small values of a variable that cannot be negative), the KDE curve may extend to unrealistic values:

sns.displot(tips, x="total_bill", kind="kde")
This can be partially avoided with the cut parameter, which specifies how far the curve should extend beyond the extreme datapoints. But this influences only where the curve is drawn; the density estimate will still smooth over the range where no data can exist, causing it to be artificially low at the extremes of the distribution:

#17 数据洞见 posted on 2022-1-16 08:47:11
sns.displot(tips, x="total_bill", kind="kde", cut=0)
The KDE approach also fails for discrete data or when data are naturally continuous but specific values are over-represented. The important thing to keep in mind is that the KDE will always show you a smooth curve, even when the data themselves are not smooth. For example, consider this distribution of diamond weights:

diamonds = sns.load_dataset("diamonds")
sns.displot(diamonds, x="carat", kind="kde")
While the KDE suggests that there are peaks around specific values, the histogram reveals a much more jagged distribution:

sns.displot(diamonds, x="carat")
As a compromise, it is possible to combine these two approaches. While in histogram mode, displot() (as with histplot()) has the option of including the smoothed KDE curve (note kde=True, not kind="kde"):

sns.displot(diamonds, x="carat", kde=True)
Empirical cumulative distributions
A third option for visualizing distributions computes the “empirical cumulative distribution function” (ECDF). This plot draws a monotonically-increasing curve through each datapoint such that the height of the curve reflects the proportion of observations with a smaller value:

sns.displot(penguins, x="flipper_length_mm", kind="ecdf")
The ECDF plot has two key advantages. Unlike the histogram or KDE, it directly represents each datapoint. That means there is no bin size or smoothing parameter to consider. Additionally, because the curve is monotonically increasing, it is well-suited for comparing multiple distributions:

sns.displot(penguins, x="flipper_length_mm", hue="species", kind="ecdf")
The major downside to the ECDF plot is that it represents the shape of the distribution less intuitively than a histogram or density curve. Consider how the bimodality of flipper lengths is immediately apparent in the histogram, but to see it in the ECDF plot, you must look for varying slopes. Nevertheless, with practice, you can learn to answer all of the important questions about a distribution by examining the ECDF, and doing so can be a powerful approach.
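The definition above is easy to check by hand: sorting the observations and stepping the curve up by 1/n at each one reproduces what the ECDF plot draws. A numpy-only sketch (the helper name and sample values are illustrative, not part of seaborn):

```python
import numpy as np

def ecdf(values):
    """Return x (sorted values) and y (proportion of observations <= each x)."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, len(x) + 1) / len(x)
    return x, y

# Five made-up flipper lengths; the curve steps up by 1/5 at each observation
x, y = ecdf([190, 195, 181, 210, 200])
```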

Visualizing bivariate distributions
All of the examples so far have considered univariate distributions: distributions of a single variable, perhaps conditional on a second variable assigned to hue. Assigning a second variable to y, however, will plot a bivariate distribution:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm")
A bivariate histogram bins the data within rectangles that tile the plot and then shows the count of observations within each rectangle with the fill color (analogous to a heatmap()). Similarly, a bivariate KDE plot smooths the (x, y) observations with a 2D Gaussian. The default representation then shows the contours of the 2D density:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde")
Assigning a hue variable will plot multiple heatmaps or contour sets using different colors. For bivariate histograms, this will only work well if there is minimal overlap between the conditional distributions:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")
The contour approach of the bivariate KDE plot lends itself better to evaluating overlap, although a plot with too many contours can get busy:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", hue="species", kind="kde")
Just as with univariate plots, the choice of bin size or smoothing bandwidth will determine how well the plot represents the underlying bivariate distribution. The same parameters apply, but they can be tuned for each variable by passing a pair of values:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5))
To aid interpretation of the heatmap, add a colorbar to show the mapping between counts and color intensity:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", binwidth=(2, .5), cbar=True)
The meaning of the bivariate density contours is less straightforward. Because the density is not directly interpretable, the contours are drawn at iso-proportions of the density, meaning that each curve shows a level set such that some proportion p of the density lies below it. The p values are evenly spaced, with the lowest level controlled by the thresh parameter and the number controlled by levels:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde", thresh=.2, levels=4)
The levels parameter also accepts a list of values, for more control:

sns.displot(penguins, x="bill_length_mm", y="bill_depth_mm", kind="kde", levels=[.01, .05, .1, .8])

#18 数据洞见 posted on 2022-1-16 08:47:59
Plotting joint and marginal distributions
The jointplot() function augments a bivariate relational or distribution plot with the marginal distributions of the two variables. By default, jointplot() represents the bivariate distribution using scatterplot() and the marginal distributions using histplot():

sns.jointplot(data=penguins, x="bill_length_mm", y="bill_depth_mm")
Similar to displot(), setting kind="kde" in jointplot() will change both the joint and marginal plots to use kdeplot():

sns.jointplot(
    data=penguins,
    x="bill_length_mm", y="bill_depth_mm", hue="species",
    kind="kde"
)
jointplot() is a convenient interface to the JointGrid class, which offers more flexibility when used directly:

g = sns.JointGrid(data=penguins, x="bill_length_mm", y="bill_depth_mm")
g.plot_joint(sns.histplot)
g.plot_marginals(sns.boxplot)
A less-obtrusive way to show marginal distributions uses a “rug” plot, which adds a small tick on the edge of the plot to represent each individual observation. This is built into displot():

sns.displot(
    penguins, x="bill_length_mm", y="bill_depth_mm",
    kind="kde", rug=True
)
And the axes-level rugplot() function can be used to add rugs on the side of any other kind of plot:

sns.relplot(data=penguins, x="bill_length_mm", y="bill_depth_mm")
sns.rugplot(data=penguins, x="bill_length_mm", y="bill_depth_mm")
Plotting many distributions
The pairplot() function offers a similar blend of joint and marginal distributions. Rather than focusing on a single relationship, however, pairplot() uses a “small-multiple” approach to visualize the univariate distribution of all variables in a dataset along with all of their pairwise relationships:

sns.pairplot(penguins)
As with jointplot()/JointGrid, using the underlying PairGrid directly will afford more flexibility with only a bit more typing:

g = sns.PairGrid(penguins)
g.map_upper(sns.histplot)
g.map_lower(sns.kdeplot, fill=True)
g.map_diag(sns.histplot, kde=True)

#19 数据洞见 posted on 2022-1-16 08:48:19
Visualizing regression models
Many datasets contain multiple quantitative variables, and the goal of an analysis is often to relate those variables to each other. We previously discussed functions that can accomplish this by showing the joint distribution of two variables. It can be very helpful, though, to use statistical models to estimate a simple relationship between two noisy sets of observations. The functions discussed in this chapter will do so through the common framework of linear regression.

In the spirit of Tukey, the regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analyses. That is to say that seaborn is not itself a package for statistical analysis. To obtain quantitative measures related to the fit of regression models, you should use statsmodels. The goal of seaborn, however, is to make exploring a dataset through visualization quick and easy, as doing so is just as (if not more) important than exploring a dataset through tables of statistics.
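As the text notes, the numbers behind the fit come from statsmodels rather than seaborn. A minimal sketch of what that might look like, using synthetic data in place of the tips dataset so it runs standalone (the formula API is standard statsmodels; the data and coefficients here are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for a tips-like relationship: tip = 0.15 * total_bill + noise
rng = np.random.default_rng(0)
total_bill = rng.uniform(5, 50, size=200)
tip = 0.15 * total_bill + rng.normal(0, 0.5, size=200)
df = pd.DataFrame({"total_bill": total_bill, "tip": tip})

# Fit the same model that regplot() visualizes, y ~ x, and inspect the numbers
fit = smf.ols("tip ~ total_bill", data=df).fit()
print(fit.params["total_bill"])  # estimated slope
print(fit.rsquared)              # goodness of fit
```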

#20 数据洞见 posted on 2022-1-16 08:48:52
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_theme(color_codes=True)
tips = sns.load_dataset("tips")
Functions to draw linear regression models
Two main functions in seaborn are used to visualize a linear relationship as determined through regression. These functions, regplot() and lmplot(), are closely related and share much of their core functionality. It is important to understand the ways they differ, however, so that you can quickly choose the correct tool for a particular job.

In the simplest invocation, both functions draw a scatterplot of two variables, x and y, and then fit the regression model y ~ x and plot the resulting regression line and a 95% confidence interval for that regression:

sns.regplot(x="total_bill", y="tip", data=tips);
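One concrete difference: regplot() is an axes-level function that draws onto a single matplotlib axes, while lmplot() is figure-level and can facet the regression across subsets of the data, much like the FacetGrid-based functions discussed earlier. A sketch, using a small hand-made frame in place of the tips dataset so it runs standalone (the values are illustrative only):

```python
import pandas as pd
import seaborn as sns

# Tiny stand-in for the tips dataset (illustrative values only)
df = pd.DataFrame({
    "total_bill": [10, 20, 30, 15, 25, 35, 12, 22],
    "tip":        [1.5, 3.0, 4.5, 2.0, 3.5, 5.0, 1.8, 3.2],
    "smoker":     ["Yes", "Yes", "Yes", "Yes", "No", "No", "No", "No"],
})
# Fit and draw one regression per level of "smoker", in separate columns
g = sns.lmplot(data=df, x="total_bill", y="tip", col="smoker")
```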
