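Powerful Tools for Data Analysis

In today's lesson we formally meet three powerful data-analysis tools: pandas, numpy, and matplotlib. In later data-analysis and modeling work we will rely heavily on pandas to import and export data and to create, read, update, and delete records. We will use numpy's fast, efficient array and matrix operations for modeling, and matplotlib to visualize our work and draw charts that enrich our reports. Over several stages of study we will become fluent with these basic tools and use them for complex data analysis.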

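1 A Basic Introduction to numpy

NumPy is short for Numerical Python. It is one of the most important foundational packages for numerical computing in Python.

# Import numpy
# The import statement makes the numpy package available in this session
# "as np" gives numpy a short alias, so we can write np.function instead of numpy.function
import numpy as np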

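1.1 NumPy ndarray

NumPy ndarray: the multidimensional array object

numpy.array(object, dtype=None, copy=True, order='K', subok=False, ndmin=0)

object: an array or a nested sequence

dtype: the data type of the array elements, optional

copy: whether the object should be copied, optional

order: the memory layout of the array. 'C' is row-major, 'F' is column-major, 'A' uses column-major only when the input is already column-major, and 'K' (the default) keeps the layout of the input as closely as possible

subok: whether subclasses are passed through; by default the result is forced to be a base-class array

ndmin: the minimum number of dimensions the resulting array should have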
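
The parameters above are easier to remember once you see them change a result. The following is a minimal sketch, not an exhaustive tour; the variable names are made up for illustration.

```python
import numpy as np

# dtype: force the element type instead of letting NumPy infer it
as_float = np.array([1, 2, 3], dtype=np.float64)

# ndmin: prepend axes until the array has at least 2 dimensions
at_least_2d = np.array([1, 2, 3], ndmin=2)

# copy=True (the default) produces an independent array,
# so changing the copy leaves the source untouched
source = np.array([1, 2, 3])
copied = np.array(source, copy=True)
copied[0] = 99

print(as_float.dtype)     # float64
print(at_least_2d.shape)  # (1, 3)
print(source[0])          # 1
```
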

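# Define a one-dimensional array
a = np.array([1,2,3])
print('The type of a is {}'.format(type(a)))
The type of a is <class 'numpy.ndarray'>
# Define a two-dimensional array
b = np.array([[1,1],[2,2]])
b
array([[1, 1],
       [2, 2]])
c = np.array([[1,2,3],[2,3,4],[3,4,5]])
c
array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])
# Check the number of dimensions
c.ndim
2
# Check the shape of the array
c.shape
(3, 3)
# Check the element data type (platform dependent: int32 here, int64 on most Linux and macOS systems)
c.dtype
dtype('int32')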
# Create an array of zeros
np.zeros(10)
array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
np.zeros((3,4,3))
array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])
# np.empty allocates the array without initializing it: the values below happen to be zeros,
# but they are arbitrary, so use np.zeros when you actually need zeros
np.empty((3,4,3))
array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])
# Create an array of ones
np.ones(10)
array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
np.ones((3,4,3))
array([[[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]])

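1.2 numpy Data Types

bool_: Boolean type (True or False)

int_: the default integer type (similar to long in C; int32 or int64 depending on the platform)

intc: the same as int in C, usually int32 or int64

intp: the integer type used for indexing (in most cases still int32 or int64)

int8: byte (-128 to 127)

int16: integer (-32768 to 32767)

int32: integer (-2147483648 to 2147483647)

int64: integer (-9223372036854775808 to 9223372036854775807)

uint8: unsigned integer (0 to 255)

uint16: unsigned integer (0 to 65535)

uint32: unsigned integer (0 to 4294967295)

uint64: unsigned integer (0 to 18446744073709551615)

float16: half-precision float: 1 sign bit, 5 exponent bits, 10 mantissa bits

float32: single-precision float: 1 sign bit, 8 exponent bits, 23 mantissa bits

float64: double-precision float: 1 sign bit, 11 exponent bits, 52 mantissa bits

complex_: shorthand for complex128, a 128-bit complex number (this alias was removed in NumPy 2.0; write complex128 instead)

complex64: complex number made of two 32-bit floats (real and imaginary parts)

complex128: complex number made of two 64-bit floats (real and imaginary parts)

Each of these numeric types is represented in numpy by an instance of the dtype class, for example np.dtype(np.int32).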
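
To connect the type list to code, here is a short sketch that inspects a type's range, converts between types, and checks how many bytes each element occupies. The sample values are hypothetical.

```python
import numpy as np

# np.iinfo reports the representable range of an integer type
int8_info = np.iinfo(np.int8)
print(int8_info.min, int8_info.max)    # -128 127

# astype returns a new array converted to the requested type;
# converting floats to integers truncates toward zero
values = np.array([1.7, 2.2, 3.9])
as_int = values.astype(np.int32)
print(as_int)                          # [1 2 3]

# itemsize is the number of bytes used by one element
float32_bytes = np.dtype(np.float32).itemsize
complex128_bytes = np.dtype(np.complex128).itemsize
print(float32_bytes, complex128_bytes)  # 4 16
```
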

1.3 Simple Array Operations

# Create a two-dimensional array of random numbers: np.random.randn(rows, columns)
data = np.random.randn(2,3)
data
array([[-0.14248726,  1.25007783, -0.46832963],
       [-0.06727693,  0.4056693 ,  0.92559399]])
# Addition (element by element)
data + data
array([[-0.28497451,  2.50015567, -0.93665925],
       [-0.13455386,  0.81133859,  1.85118798]])
# Subtraction
data - data
array([[0., 0., 0.],
       [0., 0., 0.]])
# Multiplication by a scalar
data * 10
array([[-1.42487256, 12.50077835, -4.68329626],
       [-0.67276931,  4.05669295,  9.25593989]])
# Division
1/data
array([[ -7.01817152,   0.79995019,  -2.13524822],
       [-14.86393597,   2.46506209,   1.08038731]])
# Power (the square root of a negative real number is undefined, so those entries become nan)
data ** 0.5
<ipython-input-16-47d2e170e061>:2: RuntimeWarning: invalid value encountered in sqrt
data ** 0.5
array([[       nan, 1.1180688 ,        nan],
       [       nan, 0.63692173, 0.96207795]])
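
The nan entries and the RuntimeWarning above come from taking a fractional power of negative numbers. One possible way to handle this, sketched here with fixed sample values instead of random ones, is to compute the root only where the input is non-negative, or to locate the nan entries afterwards.

```python
import numpy as np

data = np.array([[-4.0, 9.0], [16.0, -1.0]])  # fixed sample values, not random

# Compute the square root only where the input is non-negative;
# the other positions keep the fill value 0.0 taken from `out`
safe_root = np.sqrt(data, out=np.zeros_like(data), where=data >= 0)

# Alternatively, compute everything and find the nan results afterwards
with np.errstate(invalid="ignore"):   # silence the RuntimeWarning locally
    raw_root = data ** 0.5
nan_mask = np.isnan(raw_root)

print(safe_root)
print(nan_mask)
```

Whether 0.0 is a sensible fill value depends on the analysis; it is only a placeholder here.
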

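2 A Basic Introduction to pandas

pandas is a powerful toolkit for analyzing structured data. It is built on NumPy (which provides high-performance array operations) and is used for data mining and data analysis; it also provides data-cleaning features.

image1.png

image2.png

pandas defines two basic data structures:

2.1 Series

A Series is a one-dimensional labeled array that can hold any data type (integers, strings, floats, Python objects, and so on). Its axis labels are collectively called the index. A Series resembles a Python list, except that each element also carries a label.

image3.png

2.1.1 Creating a Series

(1) We can create a Series directly with the pandas Series constructor.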

import pandas as pd
my_series=pd.Series([1,2,3,4],index=['a','b','c','d'])
print('The type of my_series is {}'.format(type(my_series)))
print(my_series)
The type of my_series is <class 'pandas.core.series.Series'>
a    1
b    2
c    3
d    4
dtype: int64
my_series=pd.Series(['a','b','c','d'])
print(my_series)
0    a
1    b
2    c
3    d
dtype: object

(2) We can also build a Series by converting an existing list.

my_list=['a','b','c','d']
my_series_2=pd.Series(my_list)
my_series_2
0    a
1    b
2    c
3    d
dtype: object
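
A dictionary can be converted in the same way, in which case its keys become the index labels. This is a small sketch with made-up subject names and scores.

```python
import pandas as pd

scores = {"math": 90, "physics": 85, "chemistry": 78}  # hypothetical sample data
score_series = pd.Series(scores)

# The dictionary keys become the index, the values become the data
labels = score_series.index.tolist()
physics_score = score_series["physics"]

print(labels)         # ['math', 'physics', 'chemistry']
print(physics_score)  # 85
```
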

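2.1.2 Indexing and Slicing a Series

A Series can be understood as a dictionary-like data structure in which the index plays the role of the dictionary keys. It can also be indexed like an array.

my_series=pd.Series(['a','b','c','d','e'],index=[0,1,2,3,'a'])
my_series
0    a
1    b
2    c
3    d
a    e
dtype: object
my_series['a']=6 # we can modify a value in the Series through its label
print(my_series)
0    a
1    b
2    c
3    d
a    6
dtype: object
print(my_series['a']) # dictionary-style access
6
print(my_series[1]) # array-style access (1 is matched as a label here; my_series.iloc[1] is the explicit positional form)
print(my_series[1:3]) # slicing
b
1    b
2    c
dtype: object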
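
Because square brackets can mean either "label" or "position", it is often clearer to say which one you want. The sketch below uses .loc for labels and .iloc for positions on a small hypothetical Series whose labels are deliberately different from its positions.

```python
import pandas as pd

s = pd.Series(["a", "b", "c", "d"], index=[10, 20, 30, 40])

by_label = s.loc[20]        # look up the label 20
by_position = s.iloc[1]     # look up the second element

label_slice = s.loc[20:30]  # label slices include both endpoints
pos_slice = s.iloc[0:2]     # positional slices exclude the end position

print(by_label, by_position)   # b b
print(label_slice.tolist())    # ['b', 'c']
print(pos_slice.tolist())      # ['a', 'b']
```
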

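2.1.3 Concatenating Series

Since a Series behaves much like an array, we can also join Series together, for example by appending Series_2 to the end of Series_1. Older pandas versions offered Series.append for this; that method was removed in pandas 2.0, so we use pd.concat here.

s1=pd.Series([1,2,3])
s2=pd.Series([4,5,6])
pd.concat([s1,s2])
0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64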
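
Note that the joined result keeps each Series' original labels, so 0, 1, and 2 each appear twice. If a fresh 0..n-1 index is preferred, one option is pd.concat with ignore_index=True, sketched below.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])

# Keep the original labels: the index contains duplicates
kept = pd.concat([s1, s2])

# Discard the original labels and renumber from 0
renumbered = pd.concat([s1, s2], ignore_index=True)

print(kept.index.tolist())        # [0, 1, 2, 0, 1, 2]
print(renumbered.index.tolist())  # [0, 1, 2, 3, 4, 5]
```
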

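NOTE: There are many ways to learn. Some people like to study systematically: whenever they meet a module, they work through everything in it. The benefit is a systematic picture, but the drawback is just as obvious: a module contains many things that are never or rarely used, and those are easily forgotten over time. For that reason, we strongly recommend that in the early stage you learn what you need when you need it, build up a feel for learning quickly, and put new knowledge to productive use right away. Then, in your spare time, organize and summarize what you have learned into a system of your own.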

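2.2 DataFrame

A DataFrame is a two-dimensional labeled data structure whose columns may hold different data types, much like a spreadsheet table. It is the most common data format in pandas.

image4.png

2.2.1 Creating a DataFrame

A DataFrame can be created in the following main ways:

  1. Use the pandas DataFrame constructor.
  2. Convert a dictionary into a DataFrame.
  3. Import data through pandas (this is the way we create DataFrames most often).

(1) Use the pandas DataFrame constructor.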

my_df=pd.DataFrame([[1,2,3,4],[5,6,7,8]],columns=['a','b','c','d'],index=['first','second'])
my_df
        a  b  c  d
first   1  2  3  4
second  5  6  7  8

(2) Convert a dictionary into a DataFrame

my_dict={'a':[1,2],'b':[3,4],'c':[5,6]}
my_df_2=pd.DataFrame(my_dict)
my_df_2
   a  b  c
0  1  3  5
1  2  4  6

(3) Read a file and load it as a DataFrame

data=pd.read_csv('http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.data.small',header=None)
data.head()  # head() shows the first 5 rows by default
    0                 1       2          3   4                   5                  6              7      8       9    10  11  12             13     14
0  39         State-gov   77516  Bachelors  13       Never-married       Adm-clerical  Not-in-family  White    Male  2174   0  40  United-States  <=50K
1  50  Self-emp-not-inc   83311  Bachelors  13  Married-civ-spouse    Exec-managerial        Husband  White    Male     0   0  13  United-States  <=50K
2  38           Private  215646    HS-grad   9            Divorced  Handlers-cleaners  Not-in-family  White    Male     0   0  40  United-States  <=50K
3  53           Private  234721       11th   7  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male     0   0  40  United-States  <=50K
4  28           Private  338409  Bachelors  13  Married-civ-spouse     Prof-specialty           Wife  Black  Female     0   0  40           Cuba  <=50K
len(data.columns)  # number of columns
15
index=['age', 'workclass', 'fnlwgt', 'education',
       'education-num', 'marital-status', 'occupation', 'relationship', 'race',
       'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
       'native-country', 'salary']
len(index)
15
data.columns=index
data.head()
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary
0      39         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174             0              40   United-States   <=50K
1      50  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              13   United-States   <=50K
2      38           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0             0              40   United-States   <=50K
3      53           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0             0              40   United-States   <=50K
4      28           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba   <=50K
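
The column names can also be supplied while reading, through the names parameter of read_csv, instead of being assigned afterwards. The sketch below embeds two shortened rows (only the first four fields, an assumption made to keep the example small) so that it runs without network access.

```python
import io
import pandas as pd

# Two truncated rows in the same style as the file above, embedded as text
raw_text = (
    "39,State-gov,77516,Bachelors\n"
    "50,Self-emp-not-inc,83311,Bachelors\n"
)
column_names = ["age", "workclass", "fnlwgt", "education"]

# When names is given, the first line of the file is treated as data, not as a header
sample = pd.read_csv(io.StringIO(raw_text), names=column_names)

print(sample.columns.tolist())  # ['age', 'workclass', 'fnlwgt', 'education']
print(sample.shape)             # (2, 4)
```
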

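3 Using pandas

3.1 Understanding the Data

We will work through a real example to learn how to use these tools in a concrete scenario to create, read, update, and delete data.

First we need to import our data. One common file type is CSV (Comma-Separated Values). A CSV file stores tabular data (numbers and text) as plain text.

import pandas as pd
# Common problem 1: the path is wrong; people write only the folder and leave out the file name and extension.
# Common problem 2: avoid spaces in file names. If you need a separator, use an underscore, e.g. Chongqing_Yubei.
# Common problem 3: avoid Chinese characters in the path.
# Common problem 4: on some computers the backslashes in a path must be doubled (or the path written as a raw string r"...").
data=pd.read_csv('http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.data.small',header=None)
# The line above is what imports the file.
# To read an Excel file, use read_excel; the argument is again the path including the file name.
# data=pd.read_excel(path_and_file_name)
# data=pd.read_csv(r"C:\Users\sh\Desktop\mydata.csv")
data.head()
    0                 1       2          3   4                   5                  6              7      8       9    10  11  12             13     14
0  39         State-gov   77516  Bachelors  13       Never-married       Adm-clerical  Not-in-family  White    Male  2174   0  40  United-States  <=50K
1  50  Self-emp-not-inc   83311  Bachelors  13  Married-civ-spouse    Exec-managerial        Husband  White    Male     0   0  13  United-States  <=50K
2  38           Private  215646    HS-grad   9            Divorced  Handlers-cleaners  Not-in-family  White    Male     0   0  40  United-States  <=50K
3  53           Private  234721       11th   7  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male     0   0  40  United-States  <=50K
4  28           Private  338409  Bachelors  13  Married-civ-spouse     Prof-specialty           Wife  Black  Female     0   0  40           Cuba  <=50K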

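After importing the data, the first thing to do is look at it as a whole. There are two points to pay attention to:

(1) The amount of data: the sample size
(2) The features of the data: the kinds of features and how many there are

In data analysis, the number of samples, and especially the number of valid samples, affects which analysis tools we choose. This is easy to understand: in daily life we pick a fruit knife for fruit and a kitchen knife for meat. Let us first see what our table looks like.

data.head(2)
# This is our data table. head() shows the first 5 rows by default; head(2) shows only the first 2.
# To see the first n rows, try data.head(n) and look at the result.
    0                 1      2          3   4                   5                6              7      8     9    10  11  12             13     14
0  39         State-gov  77516  Bachelors  13       Never-married     Adm-clerical  Not-in-family  White  Male  2174   0  40  United-States  <=50K
1  50  Self-emp-not-inc  83311  Bachelors  13  Married-civ-spouse  Exec-managerial        Husband  White  Male     0   0  13  United-States  <=50K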

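We can use data.columns to view our features (these are simply the column names; many companies call columns "fields").

data.columns
Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype='int64')

Note: the columns here are plain numbers, which say nothing about what the data means, so we need to replace the header. Remember that Python is an object-oriented language: data is an object of type DataFrame, and columns is one of its attributes, which is why data.columns retrieves it.

The result shows that columns is a list-like object. Because the field names are all numbers, they are hard to tell apart. We build our own list of feature names (all strings) and assign it in place of the old one.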

index=['age', 'workclass', 'fnlwgt', 'education',
       'education-num', 'marital-status', 'occupation', 'relationship', 'race',
       'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
       'native-country', 'salary']
data.columns=index
data.head()
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary
0      39         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174             0              40   United-States   <=50K
1      50  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              13   United-States   <=50K
2      38           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0             0              40   United-States   <=50K
3      53           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0             0              40   United-States   <=50K
4      28           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba   <=50K

That looks much better! Now we can describe our sample count and features:

data.columns
Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')
print('This dataset has {} samples, and each sample has {} features\n'.format(len(data),len(data.columns)))
print('The features are: {}'.format(' '.join(data.columns)))
This dataset has 9000 samples, and each sample has 15 features

The features are: age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
If we want to know more about the numeric features, we can try the describe() method:

data.describe()
               age        fnlwgt  education-num  capital-gain  capital-loss  hours-per-week
count  9000.000000  9.000000e+03    9000.000000   9000.000000   9000.000000     9000.000000
mean     38.446111  1.902018e+05      10.073889   1068.183000     89.060556       40.554000
std      13.580625  1.061110e+05       2.545548   7327.946403    403.778772       12.320487
min      17.000000  1.930200e+04       1.000000      0.000000      0.000000        1.000000
25%      28.000000  1.178770e+05       9.000000      0.000000      0.000000       40.000000
50%      37.000000  1.786520e+05      10.000000      0.000000      0.000000       40.000000
75%      47.000000  2.383530e+05      12.000000      0.000000      0.000000       45.000000
max      90.000000  1.226583e+06      16.000000  99999.000000   4356.000000       99.000000

The describe command summarizes the numeric features of data. The rows are:

count: number of non-missing values

mean: arithmetic mean

std: standard deviation

min: minimum

25%, 50%, 75%: the first quartile, the median, and the third quartile

max: maximum
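
describe() on the whole table reports only the numeric columns. For a text column, two common alternatives are describing that column on its own and counting how often each value occurs. The sketch below uses a five-row hypothetical table rather than the full dataset.

```python
import pandas as pd

people = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private", "Private"],
})

numeric_summary = people.describe()                 # numeric columns only
text_summary = people["workclass"].describe()       # count, unique, top, freq
class_counts = people["workclass"].value_counts()   # how often each category occurs

print(numeric_summary.loc["mean", "age"])
print(text_summary["top"], text_summary["freq"])
print(class_counts["Private"])
```
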

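3.2 Working with the Data

The basic operations on data come down to create, read, update, and delete. We will set up a few scenarios to make each operation easier to understand.

3.2.1 Querying Data

(1) Query by field (column)

A DataFrame also supports dictionary-style lookup of a particular field. It differs from a Series, though: in a Series the index plays the role of the dictionary keys, whereas in a DataFrame the columns (the fields) do. We can also take an object-oriented view (remember: Python makes heavy use of object-oriented ideas, a DataFrame is an object, and each column can be treated as one of its attributes). Attribute access only works when the column name is a valid Python identifier, so a column such as education-num has to be written as data['education-num'].

To extract the data in the "age" field, we can use either of the following:

  1. Dictionary-style lookup: data['age']
  2. Attribute access: data.age

Passing a list of names, as in data[['age']], returns a DataFrame instead of a Series:

data[['age']]
      age
0      39
1      50
2      38
3      53
4      28
...   ...
8995   20
8996   23
8997   42
8998   53
8999   20

9000 rows × 1 columns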

print("The type of data['age'] is {}\n".format(type(data['age'])))
data['age'].head()
The type of data['age'] is <class 'pandas.core.series.Series'>

0    39
1    50
2    38
3    53
4    28
Name: age, dtype: int64
data.age.head()
0    39
1    50
2    38
3    53
4    28
Name: age, dtype: int64

The examples above all query a single field. To query several fields, pass a list of column names:

print("The type of data[['age','workclass']] is {}\n".format(type(data[['age','workclass']])))
data[['age','workclass']].head()
The type of data[['age','workclass']] is <class 'pandas.core.frame.DataFrame'>

   age         workclass
0   39         State-gov
1   50  Self-emp-not-inc
2   38           Private
3   53           Private
4   28           Private

(2) Query by row

Everything above selects by column (field). To select rows we can use iloc, which stands for integer location: it treats the DataFrame like a list and looks rows up by their integer position.

data.head(1)
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary
0      39         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174             0              40   United-States   <=50K
data.iloc[0]
age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education-num                13
marital-status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
sex                        Male
capital-gain               2174
capital-loss                  0
hours-per-week               40
native-country    United-States
salary                    <=50K
Name: 0, dtype: object
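
iloc also accepts slices and a second, column position. The following sketch contrasts it with label-based loc on a tiny hypothetical table whose row labels are letters, so that positions and labels cannot be confused.

```python
import pandas as pd

people = pd.DataFrame(
    {"age": [39, 50, 38], "workclass": ["State-gov", "Self-emp-not-inc", "Private"]},
    index=["x", "y", "z"],   # hypothetical non-numeric row labels
)

first_row = people.iloc[0]          # a Series holding the first row
first_two = people.iloc[0:2]        # a DataFrame with the first two rows
cell = people.iloc[1, 0]            # second row, first column
same_cell = people.loc["y", "age"]  # the same cell, addressed by labels

print(first_row["workclass"])   # State-gov
print(first_two.index.tolist()) # ['x', 'y']
print(cell, same_cell)          # 50 50
```
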

(3) Conditional queries

The simple queries we have learned so far are not very flexible. What if we want to query by some condition?

The loc indexer lets us run conditional queries. Three scenarios follow to help us understand how to use them.

Problem 1: The boss says, "Find all the information for everyone younger than 37." How do we do it?

data.loc[data['age']<37].head()
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary
4      28           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba   <=50K
8      31           Private   45781       Masters             14          Never-married     Prof-specialty  Not-in-family               White  Female         14084             0              50   United-States    >50K
11     30         State-gov  141297     Bachelors             13     Married-civ-spouse     Prof-specialty        Husband  Asian-Pac-Islander    Male             0             0              40           India    >50K
12     23           Private  122272     Bachelors             13          Never-married       Adm-clerical      Own-child               White  Female             0             0              30   United-States   <=50K
13     32           Private  205019    Assoc-acdm             12          Never-married              Sales  Not-in-family               Black    Male             0             0              50   United-States   <=50K

Problem 2: The boss says, "Find all the information for everyone who is younger than 37 and has an education-num of 13." How do we do it?

data.loc[(data.age<37) & (data['education-num']==13)]
# Here the & operator means logical AND: a row is kept only when both conditions are true.
# Each condition needs its own parentheses because & binds more tightly than the comparison operators.
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary
4      28           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba   <=50K
11     30         State-gov  141297     Bachelors             13     Married-civ-spouse     Prof-specialty        Husband  Asian-Pac-Islander    Male             0             0              40           India    >50K
12     23           Private  122272     Bachelors             13          Never-married       Adm-clerical      Own-child               White  Female             0             0              30   United-States   <=50K
42     24           Private  172987     Bachelors             13     Married-civ-spouse       Tech-support        Husband               White    Male             0             0              50   United-States   <=50K
60     30           Private   59496     Bachelors             13     Married-civ-spouse              Sales        Husband               White    Male          2407             0              40   United-States   <=50K
...
8922   36           Private   66173     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White  Female             0             0              50   United-States   <=50K
8923   21           Private  182823     Bachelors             13          Never-married              Sales      Own-child               White    Male             0             0              30   United-States   <=50K
8928   28           Private   96020     Bachelors             13     Married-civ-spouse    Exec-managerial           Wife               White  Female             0             0              50   United-States    >50K
8950   27           Private  216481     Bachelors             13          Never-married     Prof-specialty  Not-in-family               White  Female             0             0              40   United-States   <=50K
8964   22           Private  199266     Bachelors             13          Never-married    Exec-managerial  Not-in-family               White  Female             0             0              30   United-States   <=50K

713 rows × 15 columns

Problem 3: The boss says, "Find the workclass of everyone who is younger than 37 and has an education-num of 13." How do we do it?

(data.loc[(data.age<37) & (data['education-num']==13)])['workclass']
4         Private
11      State-gov
12        Private
42        Private
60        Private
          ...
8922      Private
8923      Private
8928      Private
8950      Private
8964      Private
Name: workclass, Length: 713, dtype: object
# loc also accepts the columns to return as a second argument
data.loc[(data.age<37) & (data['education-num']==13),['workclass','age']]
      workclass  age
4       Private   28
11    State-gov   30
12      Private   23
42      Private   24
60      Private   30
...         ...  ...
8922    Private   36
8923    Private   21
8928    Private   28
8950    Private   27
8964    Private   22

713 rows × 2 columns
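
Besides & (AND), conditions can be combined with | (OR), negated with ~ (NOT), and tested against a list of values with isin. The sketch below uses a four-row hypothetical table with column names borrowed from the dataset.

```python
import pandas as pd

people = pd.DataFrame({
    "age": [39, 50, 28, 23],
    "education-num": [13, 13, 13, 9],
    "workclass": ["State-gov", "Self-emp-not-inc", "Private", "Private"],
})

# OR: rows where at least one condition holds
young_or_private = people.loc[(people["age"] < 30) | (people["workclass"] == "Private")]

# NOT: rows where the condition does not hold
not_young = people.loc[~(people["age"] < 30)]

# isin: rows whose value is one of several candidates
gov_or_self = people.loc[people["workclass"].isin(["State-gov", "Self-emp-not-inc"])]

print(young_or_private.index.tolist())  # [2, 3]
print(not_young.index.tolist())         # [0, 1]
print(gov_or_self.index.tolist())       # [0, 1]
```
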

3.2.2 Modifying Data

Combined with conditional queries, we can make targeted modifications.

Problem 4: The boss says, "Set salary to 1 where it is <=50K and to 0 where it is >50K." How do we do it?

data.loc[data.salary=='<=50K','salary']=1
data.loc[data.salary=='>50K','salary']=0
data.head(20)
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary
0      39         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174             0              40   United-States       1
1      50  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              13   United-States       1
2      38           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0             0              40   United-States       1
3      53           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0             0              40   United-States       1
4      28           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba       1
5      37           Private  284582       Masters             14     Married-civ-spouse    Exec-managerial           Wife               White  Female             0             0              40   United-States       1
6      49           Private  160187           9th              5  Married-spouse-absent      Other-service  Not-in-family               Black  Female             0             0              16         Jamaica       1
7      52  Self-emp-not-inc  209642       HS-grad              9     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              45   United-States       0
8      31           Private   45781       Masters             14          Never-married     Prof-specialty  Not-in-family               White  Female         14084             0              50   United-States       0
9      42           Private  159449     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male          5178             0              40   United-States       0
10     37           Private  280464  Some-college             10     Married-civ-spouse    Exec-managerial        Husband               Black    Male             0             0              80   United-States       0
11     30         State-gov  141297     Bachelors             13     Married-civ-spouse     Prof-specialty        Husband  Asian-Pac-Islander    Male             0             0              40           India       0
12     23           Private  122272     Bachelors             13          Never-married       Adm-clerical      Own-child               White  Female             0             0              30   United-States       1
13     32           Private  205019    Assoc-acdm             12          Never-married              Sales  Not-in-family               Black    Male             0             0              50   United-States       1
14     40           Private  121772     Assoc-voc             11     Married-civ-spouse       Craft-repair        Husband  Asian-Pac-Islander    Male             0             0              40               ?       0
15     34           Private  245487       7th-8th              4     Married-civ-spouse   Transport-moving        Husband  Amer-Indian-Eskimo    Male             0             0              45          Mexico       1
16     25  Self-emp-not-inc  176756       HS-grad              9          Never-married    Farming-fishing      Own-child               White    Male             0             0              35   United-States       1
17     32           Private  186824       HS-grad              9          Never-married  Machine-op-inspct      Unmarried               White    Male             0             0              40   United-States       1
18     38           Private   28887          11th              7     Married-civ-spouse              Sales        Husband               White    Male             0             0              50   United-States       1
19     43  Self-emp-not-inc  292175       Masters             14               Divorced    Exec-managerial      Unmarried               White  Female             0             0              45   United-States       0

We can also assign a value directly to an entire column. Every row receives the same value:

data.age=20
data.head()
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary
0      20         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174             0              40   United-States       1
1      20  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              13   United-States       1
2      20           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0             0              40   United-States       1
3      20           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0             0              40   United-States       1
4      20           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba       1
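
Recoding a text column with two separate loc assignments leaves it in a mixed state between the two steps. A possible alternative is Series.map with a dictionary, which translates every label in one pass. The sketch below uses three hypothetical values and the same encoding as the salary code above ("<=50K" becomes 1, ">50K" becomes 0).

```python
import pandas as pd

salary = pd.Series(["<=50K", ">50K", "<=50K"])   # hypothetical sample values
encoding = {"<=50K": 1, ">50K": 0}               # same encoding as the loc assignments

# map looks every value up in the dictionary; labels missing from it would become NaN
encoded = salary.map(encoding)

print(encoded.tolist())  # [1, 0, 1]
```
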

3.2.3 Adding Data

Problem 5: The boss says he will assign someone to verify every sample, so add a column named check to the data and set its initial state to False for all rows.

[This is for the convenience of the person doing the verification: after checking a row, they will manually change its state to true.]

data['check']=False
data.head()
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary  check
0      20         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174             0              40   United-States       1  False
1      20  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              13   United-States       1  False
2      20           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0             0              40   United-States       1  False
3      20           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0             0              40   United-States       1  False
4      20           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba       1  False

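Problem 6: Suppose the person doing the verification is also a Python expert who reads ten lines at a glance and needs to change the check value of the first 10 rows to true. How would they do it?

# .loc slices by label and includes both endpoints, so 0:9 covers exactly the first 10 rows (0:10 would cover 11)
# Note: this stores the string 'true'; assigning the boolean True instead would keep the column's bool type
data.loc[0:9,'check']='true'
data.head()
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary  check
0      20         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174             0              40   United-States       1   true
1      20  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              13   United-States       1   true
2      20           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0             0              40   United-States       1   true
3      20           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0             0              40   United-States       1   true
4      20           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba       1   true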
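
Label slices with loc include both endpoints, while positional slices with iloc exclude the end, which makes "the first 10 rows" easy to get wrong by one. The sketch below demonstrates the difference on a hypothetical 12-row table with a single check column.

```python
import pandas as pd

table = pd.DataFrame({"check": [False] * 12})

# loc: labels 0 through 9, both included -> 10 rows
table.loc[0:9, "check"] = True
n_checked = int(table["check"].sum())

# iloc: positions 0 up to (but not including) 10 -> also 10 rows
by_position = pd.DataFrame({"check": [False] * 12})
by_position.iloc[0:10, 0] = True
n_checked_pos = int(by_position["check"].sum())

# loc with 0:10 would select 11 rows
n_rows_0_to_10 = len(table.loc[0:10])

print(n_checked, n_checked_pos, n_rows_0_to_10)  # 10 10 11
```
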

3.2.4 Deleting Data

Problem 7: The boss says the capital-loss column is meaningless and he wants to drop it.

data.head()
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary  check
0      20         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174             0              40   United-States       1   true
1      20  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              13   United-States       1   true
2      20           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0             0              40   United-States       1   true
3      20           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0             0              40   United-States       1   true
4      20           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba       1   true
data.drop('capital-loss',axis=1)
# axis=1 means the label refers to a column, so a column is dropped; axis=0 means the labels refer to rows.
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  hours-per-week  native-country  salary  check
0      20         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174              40   United-States       1   true
1      20  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0              13   United-States       1   true
2      20           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0              40   United-States       1   true
3      20           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0              40   United-States       1   true
4      20           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0              40            Cuba       1   true
...
8995   20           Private  390817       5th-6th              3          Never-married  Handlers-cleaners  Not-in-family               White    Male             0              25          Mexico       1  False
8996   20                 ?  145964  Some-college             10          Never-married                  ?  Not-in-family               White    Male             0              40   United-States       1  False
8997   20           Private   30424          11th              7              Separated      Other-service      Unmarried               White  Female             0              38   United-States       1  False
8998   20           Private  548361       HS-grad              9     Married-civ-spouse       Craft-repair        Husband               White    Male             0              40   United-States       1  False
8999   20           Private  189148       HS-grad              9     Married-civ-spouse  Handlers-cleaners        Husband               White    Male             0              48   United-States       1  False

9000 rows × 15 columns

data.drop([0,1,2],axis=0)
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary  check
3      20           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0             0              40   United-States       1   true
4      20           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba       1   true
5      20           Private  284582       Masters             14     Married-civ-spouse    Exec-managerial           Wife               White  Female             0             0              40   United-States       1   true
6      20           Private  160187           9th              5  Married-spouse-absent      Other-service  Not-in-family               Black  Female             0             0              16         Jamaica       1   true
7      20  Self-emp-not-inc  209642       HS-grad              9     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              45   United-States       0   true
...
8995   20           Private  390817       5th-6th              3          Never-married  Handlers-cleaners  Not-in-family               White    Male             0             0              25          Mexico       1  False
8996   20                 ?  145964  Some-college             10          Never-married                  ?  Not-in-family               White    Male             0             0              40   United-States       1  False
8997   20           Private   30424          11th              7              Separated      Other-service      Unmarried               White  Female             0             0              38   United-States       1  False
8998   20           Private  548361       HS-grad              9     Married-civ-spouse       Craft-repair        Husband               White    Male             0             0              40   United-States       1  False
8999   20           Private  189148       HS-grad              9     Married-civ-spouse  Handlers-cleaners        Husband               White    Male             0             0              48   United-States       1  False

8997 rows × 16 columns

data.head()
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  capital-loss  hours-per-week  native-country  salary  check
0      20         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174             0              40   United-States       1   true
1      20  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0             0              13   United-States       1   true
2      20           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0             0              40   United-States       1   true
3      20           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0             0              40   United-States       1   true
4      20           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0             0              40            Cuba       1   true

Remember: drop returns a new DataFrame and leaves data itself unchanged (which is why data.head() above still shows every row and column). To make the change stick, assign the result back to data.

data=data.drop('capital-loss',axis=1)  # assign the result back to data
# Export the processed data to a file
data.to_csv(r"C:\Users\sh\Desktop\2021_01_05.csv")
data.head()
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  hours-per-week  native-country  salary  check
0      20         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174              40   United-States       1   true
1      20  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0              13   United-States       1   true
2      20           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0              40   United-States       1   true
3      20           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0              40   United-States       1   true
4      20           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0              40            Cuba       1   true
data  # evaluating data on its own displays the table as-is
      age         workclass  fnlwgt     education  education-num         marital-status         occupation   relationship                race     sex  capital-gain  hours-per-week  native-country  salary  check
0      20         State-gov   77516     Bachelors             13          Never-married       Adm-clerical  Not-in-family               White    Male          2174              40   United-States       1   true
1      20  Self-emp-not-inc   83311     Bachelors             13     Married-civ-spouse    Exec-managerial        Husband               White    Male             0              13   United-States       1   true
2      20           Private  215646       HS-grad              9               Divorced  Handlers-cleaners  Not-in-family               White    Male             0              40   United-States       1   true
3      20           Private  234721          11th              7     Married-civ-spouse  Handlers-cleaners        Husband               Black    Male             0              40   United-States       1   true
4      20           Private  338409     Bachelors             13     Married-civ-spouse     Prof-specialty           Wife               Black  Female             0              40            Cuba       1   true
...
8995   20           Private  390817       5th-6th              3          Never-married  Handlers-cleaners  Not-in-family               White    Male             0              25          Mexico       1  False
8996   20                 ?  145964  Some-college             10          Never-married                  ?  Not-in-family               White    Male             0              40   United-States       1  False
8997   20           Private   30424          11th              7              Separated      Other-service      Unmarried               White  Female             0              38   United-States       1  False
8998   20           Private  548361       HS-grad              9     Married-civ-spouse       Craft-repair        Husband               White    Male             0              40   United-States       1  False
8999   20           Private  189148       HS-grad              9     Married-civ-spouse  Handlers-cleaners        Husband               White    Male             0              48   United-States       1  False

9000 rows × 15 columns

print(data)
age workclass fnlwgt education education-num \
0 20 State-gov 77516 Bachelors 13
1 20 Self-emp-not-inc 83311 Bachelors 13
2 20 Private 215646 HS-grad 9
3 20 Private 234721 11th 7
4 20 Private 338409 Bachelors 13
... ... ... ... ... ...
8995 20 Private 390817 5th-6th 3
8996 20 ? 145964 Some-college 10
8997 20 Private 30424 11th 7
8998 20 Private 548361 HS-grad 9
8999 20 Private 189148 HS-grad 9
marital-status occupation relationship race \
0 Never-married Adm-clerical Not-in-family White
1 Married-civ-spouse Exec-managerial Husband White
2 Divorced Handlers-cleaners Not-in-family White
3 Married-civ-spouse Handlers-cleaners Husband Black
4 Married-civ-spouse Prof-specialty Wife Black
... ... ... ... ...
8995 Never-married Handlers-cleaners Not-in-family White
8996 Never-married ? Not-in-family White
8997 Separated Other-service Unmarried White
8998 Married-civ-spouse Craft-repair Husband White
8999 Married-civ-spouse Handlers-cleaners Husband White
sex capital-gain hours-per-week native-country salary check
0 Male 2174 40 United-States 1 true
1 Male 0 13 United-States 1 true
2 Male 0 40 United-States 1 true
3 Male 0 40 United-States 1 true
4 Female 0 40 Cuba 1 true
... ... ... ... ... ... ...
8995 Male 0 25 Mexico 1 False
8996 Male 0 40 United-States 1 False
8997 Female 0 38 United-States 1 False
8998 Male 0 40 United-States 1 False
8999 Male 0 48 United-States 1 False
[9000 rows x 15 columns]
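
As a compact recap of dropping and exporting, the sketch below works on a tiny hypothetical table. It uses the columns= keyword of drop, which is an alternative spelling of axis=1, and calls to_csv without a path so the CSV text is returned as a string instead of being written to disk.

```python
import pandas as pd

table = pd.DataFrame({
    "age": [39, 50],
    "capital-loss": [0, 0],
    "salary": ["<=50K", "<=50K"],
})

# drop returns a new DataFrame; the original is untouched until we reassign
dropped = table.drop(columns=["capital-loss"])
still_in_original = "capital-loss" in table.columns

table = table.drop(columns=["capital-loss"])   # reassign to keep the change

# index=False leaves the row labels out of the exported text
csv_text = table.to_csv(index=False)

print(still_in_original)          # True
print(table.columns.tolist())     # ['age', 'salary']
print(csv_text.splitlines()[0])   # age,salary
```
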

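4 Using matplotlib

Import the modules: import matplotlib.pyplot as plt / import numpy as np

Create a figure window: plt.figure()

Draw a line plot: plt.plot(x,y)

Set the axis ranges: plt.xlim()/plt.ylim()

Set the axis labels: plt.xlabel()/plt.ylabel()

Set the axis ticks and their labels: plt.xticks()/plt.yticks()

Set the color of the plot border: ax = plt.gca() / ax.spines[].set_color()

Adjust the tick positions: ax.xaxis.set_ticks_position()/ax.yaxis.set_ticks_position()

Adjust the position of the border (axis): ax.spines[].set_position()

Add a legend: plt.legend()

Draw points: plt.scatter()

Add an annotation: plt.annotate()

Add text: plt.text()

Add a title: plt.title()

Formatting settings:

Set a Chinese font: plt.rcParams['font.family']='Microsoft YaHei'

Display the minus sign correctly: plt.rcParams['axes.unicode_minus'] = False

Several plotting areas in one figure: plt.subplots()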

import matplotlib.pyplot as plt
import numpy as np
x1=list(range(100))
y1=np.sin(x1)
x2=list(range(100))
y2=[3*i for i in x2]
plt.plot(x1,y1)
plt.xlabel('x')  # label for the x axis
Text(0.5, 0, 'x')

output_137_1.png

plt.figure(1)
ax1=plt.subplot(211)  # a grid of 2 rows and 1 column; select the first plotting area
ax1.set_title('a')
plt.xlabel("a")
plt.plot(x1,y1)
ax2=plt.subplot(212)  # the same 2-row, 1-column grid; select the second plotting area
plt.plot(x2,y2)
[<matplotlib.lines.Line2D at 0x2393be53b00>]

output_138_1.png

fig,axes=plt.subplots(2,2)  # a 2x2 grid of plotting areas
print(axes)
axes[0][0].plot(x1,y1)
# Set the labels through the Axes object; assigning to plt.xlabel would overwrite the function itself
axes[0][0].set_xlabel('a')
axes[0][0].set_ylabel('b')
axes[0][1].plot(x2,y2)
[[<matplotlib.axes._subplots.AxesSubplot object at 0x000002393BDC4DA0>
  <matplotlib.axes._subplots.AxesSubplot object at 0x000002393BED16A0>]
 [<matplotlib.axes._subplots.AxesSubplot object at 0x000002393BEFFC18>
  <matplotlib.axes._subplots.AxesSubplot object at 0x000002393BF3D1D0>]]
[<matplotlib.lines.Line2D at 0x2393be9c278>]

output_139_2.png
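
To tie several of the calls listed in this section together, here is a minimal sketch that draws a line plot and a scatter plot side by side using the Axes methods (set_title, set_xlabel, and so on). The Agg backend and the output file name demo_plot.png are assumptions chosen so the example runs without a display; in a notebook you would normally omit the backend line.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 2 * np.pi, 100)

# One figure with two plotting areas in a single row
fig, axes = plt.subplots(1, 2, figsize=(8, 3))

# Left: a line plot with title, axis labels, and a legend
axes[0].plot(x, np.sin(x), label="sin(x)")
axes[0].set_title("sine")
axes[0].set_xlabel("x")
axes[0].set_ylabel("y")
axes[0].legend()

# Right: a scatter plot of every 10th sample with a fixed x range
axes[1].scatter(x[::10], np.cos(x[::10]))
axes[1].set_title("cosine samples")
axes[1].set_xlim(0, 2 * np.pi)

fig.tight_layout()
fig.savefig("demo_plot.png")  # hypothetical output file name
```

Setting labels on each Axes object, rather than through plt.xlabel, makes it explicit which plotting area is being changed when a figure has more than one.
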