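Powerful Tools for Data Analysis

In today's lesson we formally meet three powerful data-analysis tools: pandas, numpy and matplotlib. In later data-analysis and modelling work we will rely heavily on pandas to import and export data and to add, delete, modify and query records. We will use numpy's fast matrix operations for modelling, and matplotlib to visualise our work and draw charts that enrich our reports. Over several stages of study we will become fluent with these basic tools and able to carry out complex data analysis.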

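1 A basic introduction to numpy

NumPy is short for Numerical Python and is one of the most important foundational packages for numerical computing in Python.

# Import numpy
# The import statement opens the door and lets numpy in
# "as" gives numpy a nickname; otherwise every call would need numpy.function, now it is np.function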
import numpy as np

1.1 NumPy ndarray

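NumPy ndarray: the multidimensional array object

numpy.array(object, dtype = None, copy = True, order = None, subok = False, ndmin = 0)

object: an array or a nested sequence

dtype: data type of the array elements, optional

copy: whether the object should be copied, optional

order: memory layout of the created array; C is row-major, F is column-major, and A or K keep the layout of the input where possible (current NumPy releases document K as the default)

subok: if True, subclasses are passed through; by default the returned array is forced to the base-class type

ndmin: minimum number of dimensions of the resulting array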
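
The parameters listed above can be tried out on small inputs. The following is a minimal sketch; the variable names are illustrative and not part of the course material.

```python
import numpy as np

# dtype forces the element type: the integers are stored as 64-bit floats.
as_float = np.array([1, 2, 3], dtype=np.float64)

# ndmin adds leading axes until the array has at least 2 dimensions.
at_least_2d = np.array([1, 2, 3], ndmin=2)

# order='F' requests column-major (Fortran-style) memory layout.
col_major = np.array([[1, 2], [3, 4]], order='F')

print(as_float.dtype)                    # float64
print(at_least_2d.shape)                 # (1, 3)
print(col_major.flags['F_CONTIGUOUS'])   # True
```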

# Define a one-dimensional array
a = np.array([1,2,3])
print('The type of a is {}'.format(type(a)))

The type of a is <class 'numpy.ndarray'>

# Define a two-dimensional array
b = np.array([[1,1],[2,2]])
b

array([[1, 1],
       [2, 2]])

c = np.array([[1,2,3],[2,3,4],[3,4,5]])
c

array([[1, 2, 3],
       [2, 3, 4],
       [3, 4, 5]])

# Number of dimensions of the array
c.ndim

# Shape of the array
c.shape

(3, 3)

# Data type of the array elements
c.dtype

dtype('int32')

# Create an array of zeros
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

np.zeros((3,4,3))  # np.empty((3,4,3)) gives the same shape but leaves the values uninitialized

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

np.empty((3,4,3))  # the contents are arbitrary; zeros, as shown here, are not guaranteed

array([[[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]]])

# Create an array filled with ones
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

np.ones((3,4,3))

array([[[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.],
        [1., 1., 1.]]])

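1.2 numpy data types

bool_: Boolean type (True or False)

int_: default integer type (similar to long in C; int32 or int64)

intc: same as the C int type, usually int32 or int64

intp: integer type used for indexing (usually int32 or int64)

int8: byte (-128 to 127)

int16: integer (-32768 to 32767)

int32: integer (-2147483648 to 2147483647)

int64: integer (-9223372036854775808 to 9223372036854775807)

uint8: unsigned integer (0 to 255)

uint16: unsigned integer (0 to 65535)

uint32: unsigned integer (0 to 4294967295)

uint64: unsigned integer (0 to 18446744073709551615)

float16: half-precision float: 1 sign bit, 5 exponent bits, 10 mantissa bits

float32: single-precision float: 1 sign bit, 8 exponent bits, 23 mantissa bits

float64: double-precision float: 1 sign bit, 11 exponent bits, 52 mantissa bits

complex_: shorthand for complex128, a 128-bit complex number (this alias was removed in NumPy 2.0; use complex128 there)

complex64: complex number made of two 32-bit floats (real and imaginary parts)

complex128: complex number made of two 64-bit floats (real and imaginary parts)

The data type of an array is described by an instance of the dtype class; names such as np.int32 can be passed wherever a dtype is expected.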
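
To make the type list more concrete, here is a small sketch of how the ranges and sizes can be inspected and how an array is converted from one type to another. The sample values are made up for illustration.

```python
import numpy as np

# np.iinfo reports the representable range of an integer type.
info = np.iinfo(np.int8)
print(info.min, info.max)              # -128 127

# astype returns a converted copy; the original array is left unchanged.
values = np.array([1.7, -2.2, 3.9])
as_int = values.astype(np.int32)       # the fractional part is truncated toward zero
print(as_int)                          # [ 1 -2  3]

# itemsize is the number of bytes used per element.
print(np.dtype(np.float64).itemsize)   # 8
```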

1.3 Simple array arithmetic

# Create a 2-D array of random values drawn from the standard normal distribution: np.random.randn(rows, columns)
data = np.random.randn(2,3)
data

array([[-0.14248726,  1.25007783, -0.46832963],
       [-0.06727693,  0.4056693 ,  0.92559399]])

# Addition
data + data

array([[-0.28497451,  2.50015567, -0.93665925],
       [-0.13455386,  0.81133859,  1.85118798]])

# Subtraction
data - data

array([[0., 0., 0.],
       [0., 0., 0.]])

# Multiplication by a scalar
data * 10

array([[-1.42487256, 12.50077835, -4.68329626],
       [-0.67276931,  4.05669295,  9.25593989]])

# Division (element-wise reciprocal)
1/data

array([[ -7.01817152,   0.79995019,  -2.13524822],
       [-14.86393597,   2.46506209,   1.08038731]])

# Power (the square root of a negative number is nan, hence the warning below)
data ** 0.5

<ipython-input-16-47d2e170e061>:2: RuntimeWarning: invalid value encountered in sqrt
  data ** 0.5

array([[       nan, 1.1180688 ,        nan],
       [       nan, 0.63692173, 0.96207795]])
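
The nan entries above come from taking the square root of negative numbers. One possible way to avoid them, sketched here on a small fixed array (the values are illustrative, not the random data above), is to compute the root only where the input is non-negative.

```python
import numpy as np

# A fixed array stands in for the random data (illustrative values).
sample = np.array([[-4.0, 9.0], [0.25, -1.0]])

# Arithmetic between arrays of the same shape is applied element by element.
doubled = sample + sample

# Take the square root only where the input is non-negative;
# the remaining positions keep the 0.0 supplied through `out`.
safe_root = np.sqrt(sample, out=np.zeros_like(sample), where=sample >= 0)
print(safe_root)
```

Whether 0.0 is a sensible fill value depends on the analysis; np.nan could be used as the fill instead if missing results should stay visible.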

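2 A basic introduction to pandas

pandas is a powerful toolkit for analysing structured data. It is built on NumPy (which provides high-performance array operations) and is used for data mining and data analysis; it also offers data-cleaning functionality.

pandas defines two basic data structures: Series and DataFrame.

2.1 Series

A Series is a one-dimensional labelled array that can hold any type of data (integers, strings, floats, Python objects and so on). Its axis labels are collectively called the index. It resembles a Python list, except that every element of a Series also carries a label.

2.1.1 Creating a Series

(1) We can create a Series directly with the pandas Series constructor.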

import pandas as pd
my_series=pd.Series([1,2,3,4],index=['a','b','c','d'])
print('The type of my_series is {}'.format(type(my_series)))
print(my_series)

The type of my_series is <class 'pandas.core.series.Series'>
a    1
b    2
c    3
d    4
dtype: int64

my_series=pd.Series(['a','b','c','d'])
print(my_series)

0    a
1    b
2    c
3    d
dtype: object

(2) We can also build a Series by converting an existing list.

my_list=['a','b','c','d']
my_series_2=pd.Series(my_list)
my_series_2

0    a
1    b
2    c
3    d
dtype: object

2.1.2 Indexing and slicing a Series

A Series can be thought of as a dictionary-like structure in which the index plays the role of the keys. It can also be indexed like an array.

my_series=pd.Series(['a','b','c','d','e'],index=[0,1,2,3,'a'])
my_series

0    a
1    b
2    c
3    d
a    e
dtype: object

my_series['a']=6        # we can index an element and then change its value
print(my_series)

0    a
1    b
2    c
3    d
a    6
dtype: object

print(my_series['a'])   # dictionary-style access by label

print(my_series[1])     # array-style indexing
print(my_series[1:3])   # slicing

b
1    b
2    c
dtype: object
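
Plain square brackets on a Series may be interpreted by label or by position depending on the index, which can be confusing when the index mixes integers and strings as it does here. A minimal sketch of the explicit alternatives, .loc for labels and .iloc for positions, follows; the variable names are illustrative.

```python
import pandas as pd

letters = pd.Series(['a', 'b', 'c', 'd', 'e'], index=[0, 1, 2, 3, 'a'])

by_label = letters.loc['a']          # label lookup, like a dictionary key
by_position = letters.iloc[1]        # second element, whatever its label is
position_slice = letters.iloc[1:3]   # positions 1 and 2 (the end is excluded)

print(by_label)        # e
print(by_position)     # b
```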

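2.1.3 Concatenating Series

Because a Series behaves much like an array, we can also join Series together, placing Series_2 after Series_1. Older pandas versions offered Series.append for this; it was removed in pandas 2.0, so we use pd.concat instead.

s1=pd.Series([1,2,3])
s2=pd.Series([4,5,6])
pd.concat([s1,s2])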

0    1
1    2
2    3
0    4
1    5
2    6
dtype: int64
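
The result above keeps the labels of both inputs, so 0, 1 and 2 each appear twice. If a fresh 0..n-1 index is wanted instead, a sketch using the ignore_index option of pd.concat looks like this:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3])
s2 = pd.Series([4, 5, 6])

# ignore_index=True discards the original labels and renumbers from 0.
combined = pd.concat([s1, s2], ignore_index=True)
print(combined.index.tolist())   # [0, 1, 2, 3, 4, 5]
```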

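NOTE: there are many ways to learn. Some people like to learn systematically: when they meet a module, they go through everything in it at once. The benefit is a systematic picture; the drawback is equally clear: much of a module is never or rarely used, and rarely used methods are easily forgotten over time. So in the early stages it is strongly recommended to learn what you need when you need it, build a feel for learning quickly, and turn new knowledge into productivity right away. Then, in your spare time, organise and summarise what you have learned into a system of your own.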

2.2 DataFrame

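A DataFrame is a two-dimensional labelled data structure whose columns may hold different data types, much like a spreadsheet table. It is the most common data format in pandas.

2.2.1 Creating a DataFrame

A DataFrame can be created in the following main ways:

Using the pandas DataFrame constructor.
Converting a dictionary into a DataFrame.
Importing data with pandas (the way we create DataFrames most often).

(1) Using the pandas DataFrame constructor.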

my_df=pd.DataFrame([[1,2,3,4],[5,6,7,8]],columns=['a','b','c','d'],index=['first','second'])
my_df

	a	b	c	d
first	1	2	3	4
second	5	6	7	8

(2) Converting a dictionary into a DataFrame

my_dict={'a':[1,2],'b':[3,4],'c':[5,6]}
my_df_2=pd.DataFrame(my_dict)
my_df_2

	a	b	c
0	1	3	5
1	2	4	6

(3) Reading a file and converting it into a DataFrame

data=pd.read_csv('http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.data.small',header=None)

data.head()  # head() shows the first 5 rows by default

	0	1	2	3	4	5	6	7	8	9	10	12	13	14
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

len(data.columns)  # number of column labels

index=['age', 'workclass', 'fnlwgt', 'education',
       'education-num', 'marital-status', 'occupation', 'relationship', 'race',
       'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
       'native-country', 'salary']
len(index)

data.columns=index
data.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K
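
Column names can also be supplied while reading, through the names parameter of read_csv, instead of being assigned afterwards. The sketch below uses a hypothetical two-row, three-column text held in memory rather than the course file, so it runs without network access; skipinitialspace is included on the assumption that the values may be preceded by a blank after each comma.

```python
import io
import pandas as pd

# Hypothetical in-memory sample; not the real course file.
raw = "39, State-gov, 77516\n50, Self-emp-not-inc, 83311\n"

frame = pd.read_csv(
    io.StringIO(raw),
    header=None,                           # the text has no header row
    names=['age', 'workclass', 'fnlwgt'],  # column names supplied up front
    skipinitialspace=True,                 # drop the blank that follows each comma
)
print(frame.columns.tolist())   # ['age', 'workclass', 'fnlwgt']
```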

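3 Using pandas

3.1 Getting to know the data

We will work through a practical example to learn how to use these tools to add, delete, modify and query data in a concrete scenario.

First we need to import our data. A CSV (Comma-Separated Values) file stores tabular data (numbers and text) as plain text.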

import pandas as pd

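# Problem 1: wrong path; many people give only the folder and leave out the file name and extension
# Problem 2: avoid spaces in file names; if you need a separator, join the parts with an underscore, e.g. Chongqing_Yubei
# Problem 3: avoid Chinese characters in the path
# Problem 4: on Windows, backslashes in a path must be doubled (\\) or the path written as a raw string (r"...")

data=pd.read_csv('http://www.cs.nott.ac.uk/~pszgss/teaching/ML/Prac3/adult.data.small',header=None)
# This is the import step

# To read an Excel file, use read_excel; its argument is likewise the path and file name
# data=pd.read_excel(path_and_file_name)
# data=pd.read_csv(r"C:\Users\sh\Desktop\mydata.csv")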

data.head()

	0	1	2	3	4	5	6	7	8	9	10	12	13	14
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

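After importing the data, the first thing to do is look at it as a whole. There are two things to note:

(1) The amount of data: the sample size. (2) The features of the data: their types and their number.

In data analysis, the number of samples, and especially the number of valid samples, influences our choice of analysis tools. This is easy to understand: in everyday life we pick a fruit knife to cut fruit and a kitchen knife to cut meat. Let us first see what our data table looks like.

data.head(2)
# This is our data table. head() shows the first 5 rows by default; here head(2) shows the first 2.
# We can view the first n rows with data.head(n); try it and see what happens.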

	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	13	United-States	<=50K

We can use data.columns to view our features (that is, the name of each column; many companies call a column name a field).

data.columns

Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14], dtype='int64')

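Note: the columns here are numbers, which say nothing about what the data mean, so we first need to change the header. Start by looking at the existing columns. Remember that Python is an object-oriented language: our data is an object of type DataFrame, and columns is one of its attributes, which we read with data.columns.

The result shows that columns is a list-like object. The field names are all numbers, which makes them hard to tell apart. We build our own list of feature names (all strings) and simply assign it in their place.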

index=['age', 'workclass', 'fnlwgt', 'education',
       'education-num', 'marital-status', 'occupation', 'relationship', 'race',
       'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
       'native-country', 'salary']

data.columns=index
data.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

That looks much better! We can now describe our sample count and features:

data.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

print('This dataset has {} samples, each with {} features\n'.format(len(data),len(data.columns)))
print('The features are: {}'.format(' '.join(data.columns)))

This dataset has 9000 samples, each with 15 features

The features are: age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary

data.describe()

	age	fnlwgt	education-num	capital-gain	capital-loss	hours-per-week
count	9000.000000	9.000000e+03	9000.000000	9000.000000	9000.000000	9000.000000
mean	38.446111	1.902018e+05	10.073889	1068.183000	89.060556	40.554000
std	13.580625	1.061110e+05	2.545548	7327.946403	403.778772	12.320487
min	17.000000	1.930200e+04	1.000000	0.000000	0.000000	1.000000
25%	28.000000	1.178770e+05	9.000000	0.000000	0.000000	40.000000
50%	37.000000	1.786520e+05	10.000000	0.000000	0.000000	40.000000
75%	47.000000	2.383530e+05	12.000000	0.000000	0.000000	45.000000
max	90.000000	1.226583e+06	16.000000	99999.000000	4356.000000	99.000000

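To learn more about the numeric features, we can use the describe() method, as in the call above.

describe() reports information about the numeric features of data, including:

count: number of non-missing values

mean: arithmetic mean

std: standard deviation

min: minimum

25%, 50%, 75%: quartiles (50% is the median)

max: maximum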
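
describe() summarises only the numeric columns by default. For a text column, counting how often each value occurs is often a useful complement. The following sketch uses a small made-up frame rather than the course data:

```python
import pandas as pd

# Made-up sample with one numeric and one text column.
people = pd.DataFrame({
    'age': [39, 50, 38, 53],
    'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private'],
})

numeric_summary = people.describe()           # covers only the numeric column
counts = people['workclass'].value_counts()   # frequency of each category

print(numeric_summary.loc['mean', 'age'])     # 45.0
print(counts['Private'])                      # 2
```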

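3.2 Working with the data

The basic data operations are adding, deleting, modifying and querying. We will set up a few scenarios to make each one easier to understand.

3.2.1 Querying data

(1) Querying by field (column)

A DataFrame can also be queried for a particular field in a dictionary-like way, but it differs from a Series: in a Series the index acts as the dictionary keys, whereas in a DataFrame the columns (fields) do. We can also take an object-oriented view (remember: Python makes full use of object-oriented ideas; a DataFrame is an object, and each column can be accessed as one of its attributes). Attribute access only works when the column name is a valid Python identifier, so a name such as education-num must be looked up with brackets.

To extract the data in the age field we can do either of the following:

Dictionary-style lookup: data['age']
Attribute access: data.age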

data[['age']]

	age
0	39
1	50
2	38
3	53
4	28
...	...
8995	20
8996	23
8997	42
8998	53
8999	20

9000 rows × 1 columns

print("The type of data['age'] is {}\n".format(type(data['age'])))
data['age'].head()

The type of data['age'] is <class 'pandas.core.series.Series'>

0    39
1    50
2    38
3    53
4    28
Name: age, dtype: int64

data.age.head()

0    39
1    50
2    38
3    53
4    28
Name: age, dtype: int64

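The examples above each query a single field. To query several fields, pass a list of column names:

print("The type of data[['age','workclass']] is {}\n".format(type(data[['age','workclass']])))
data[['age','workclass']].head()

The type of data[['age','workclass']] is <class 'pandas.core.frame.DataFrame'>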

	age	workclass
0	39	State-gov
1	50	Self-emp-not-inc
2	38	Private
3	53	Private
4	28	Private

All of the above index by column (field). We can also index by row with iloc. iloc stands for integer location: it indexes the DataFrame by position, as if it were a list.

(2) Querying by row

data.head(1)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	0	40	United-States	<=50K

data.iloc[0]

age                           39
workclass              State-gov
fnlwgt                     77516
education              Bachelors
education-num                 13
marital-status     Never-married
occupation          Adm-clerical
relationship       Not-in-family
race                       White
sex                         Male
capital-gain                2174
capital-loss                   0
hours-per-week                40
native-country     United-States
salary                     <=50K
Name: 0, dtype: object
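
iloc always works by position, while loc works by label. The two coincide for the default 0, 1, 2, ... index used above, but differ when the index holds other labels. A small sketch with made-up row labels:

```python
import pandas as pd

# Made-up frame with string row labels to separate position from label.
frame = pd.DataFrame(
    {'age': [39, 50, 38], 'sex': ['Male', 'Male', 'Female']},
    index=['r1', 'r2', 'r3'],
)

first_row = frame.iloc[0]       # first row by position, returned as a Series
same_row = frame.loc['r1']      # the same row, selected by its label
block = frame.iloc[0:2, 0:1]    # first two rows, first column (ends excluded)
cell = frame.iloc[2, 1]         # third row, second column

print(cell)   # Female
```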

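(3) Conditional queries

The simple query methods above are not very flexible. What if we want to query by some condition?

The loc indexer supports conditional queries. Three scenarios follow to show how they are used.

Problem 1: the boss says, find me all the information on people younger than 37. How do we do it?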

data.loc[data['age']<37].head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K
8	31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084	50	United-States	>50K
11	30	State-gov	141297	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	40	India	>50K
12	23	Private	122272	Bachelors	13	Never-married	Adm-clerical	Own-child	White	Female	0	30	United-States	<=50K
13	32	Private	205019	Assoc-acdm	12	Never-married	Sales	Not-in-family	Black	Male	0	50	United-States	<=50K

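Problem 2: the boss says, find me all the information on people younger than 37 whose education-num is 13. How do we do it?

data.loc[(data.age<37) & (data['education-num']==13)]
# Here '&' means "and": a row is kept only if both conditions hold. Each condition needs its own parentheses because & binds more tightly than the comparisons.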

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	<=50K
11	30	State-gov	141297	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	0	40	India	>50K
12	23	Private	122272	Bachelors	13	Never-married	Adm-clerical	Own-child	White	Female	0	0	30	United-States	<=50K
42	24	Private	172987	Bachelors	13	Married-civ-spouse	Tech-support	Husband	White	Male	0	0	50	United-States	<=50K
60	30	Private	59496	Bachelors	13	Married-civ-spouse	Sales	Husband	White	Male	2407	0	40	United-States	<=50K
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8922	36	Private	66173	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Female	0	0	50	United-States	<=50K
8923	21	Private	182823	Bachelors	13	Never-married	Sales	Own-child	White	Male	0	0	30	United-States	<=50K
8928	28	Private	96020	Bachelors	13	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	50	United-States	>50K
8950	27	Private	216481	Bachelors	13	Never-married	Prof-specialty	Not-in-family	White	Female	0	0	40	United-States	<=50K
8964	22	Private	199266	Bachelors	13	Never-married	Exec-managerial	Not-in-family	White	Female	0	0	30	United-States	<=50K

713 rows × 15 columns

Problem 3: the boss says, find me the workclass of people younger than 37 whose education-num is 13. How do we do it?

(data.loc[(data.age<37) & (data['education-num']==13)])['workclass']

4          Private
11       State-gov
12         Private
42         Private
60         Private
           ...
8922       Private
8923       Private
8928       Private
8950       Private
8964       Private
Name: workclass, Length: 713, dtype: object

data.loc[(data.age<37) & (data['education-num']==13),['workclass','age']]

	workclass	age
4	Private	28
11	State-gov	30
12	Private	23
42	Private	24
60	Private	30
...	...	...
8922	Private	36
8923	Private	21
8928	Private	28
8950	Private	27
8964	Private	22

713 rows × 2 columns
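
Besides & for "and", boolean conditions can be combined with | for "or" and inverted with ~, and isin tests membership in a set of values. A minimal sketch on a made-up four-row frame:

```python
import pandas as pd

# Made-up sample; the column names mirror the course data.
people = pd.DataFrame({
    'age': [39, 50, 28, 31],
    'education-num': [13, 13, 13, 14],
    'workclass': ['State-gov', 'Self-emp-not-inc', 'Private', 'Private'],
})

young = people['age'] < 37
bachelors = people['education-num'] == 13

either = people.loc[young | bachelors]   # at least one condition holds
not_young = people.loc[~young]           # negation of a condition
in_set = people.loc[people['workclass'].isin(['Private', 'State-gov']), 'age']

print(len(either), len(not_young), in_set.tolist())   # 4 2 [39, 28, 31]
```

Storing each condition in a named variable, as done here, tends to keep longer filters readable.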

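3.2.2 Modifying data

Combined with conditional queries, we can make targeted modifications.

Problem 4: the boss says, set salary to 1 where it is <=50K and to 0 where it is >50K. How do we do it?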

data.loc[data.salary=='<=50K','salary']=1
data.loc[data.salary=='>50K','salary']=0
data.head(20)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	1
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	1
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	1
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	1
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	1
5	37	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	40	United-States	1
6	49	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	0	16	Jamaica	1
7	52	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	45	United-States	0
8	31	Private	45781	Masters	14	Never-married	Prof-specialty	Not-in-family	White	Female	14084	50	United-States	0
9	42	Private	159449	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	5178	40	United-States	0
10	37	Private	280464	Some-college	10	Married-civ-spouse	Exec-managerial	Husband	Black	Male	0	80	United-States	0
11	30	State-gov	141297	Bachelors	13	Married-civ-spouse	Prof-specialty	Husband	Asian-Pac-Islander	Male	0	40	India	0
12	23	Private	122272	Bachelors	13	Never-married	Adm-clerical	Own-child	White	Female	0	30	United-States	1
13	32	Private	205019	Assoc-acdm	12	Never-married	Sales	Not-in-family	Black	Male	0	50	United-States	1
14	40	Private	121772	Assoc-voc	11	Married-civ-spouse	Craft-repair	Husband	Asian-Pac-Islander	Male	0	40	?	0
15	34	Private	245487	7th-8th	4	Married-civ-spouse	Transport-moving	Husband	Amer-Indian-Eskimo	Male	0	45	Mexico	1
16	25	Self-emp-not-inc	176756	HS-grad	9	Never-married	Farming-fishing	Own-child	White	Male	0	35	United-States	1
17	32	Private	186824	HS-grad	9	Never-married	Machine-op-inspct	Unmarried	White	Male	0	40	United-States	1
18	38	Private	28887	11th	7	Married-civ-spouse	Sales	Husband	White	Male	0	50	United-States	1
19	43	Self-emp-not-inc	292175	Masters	14	Divorced	Exec-managerial	Unmarried	White	Female	0	45	United-States	0
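
An alternative to two separate assignments is to translate all labels in one pass with Series.map. The sketch below follows the same encoding as the code above (<=50K becomes 1, >50K becomes 0) on a made-up Series; the stray blanks in the sample are an assumption, added to show why stripping whitespace first can matter.

```python
import pandas as pd

# Made-up labels, some with stray whitespace (assumption for illustration).
salary = pd.Series(['<=50K', '>50K', ' <=50K', '>50K '])

# Strip whitespace so ' <=50K' and '<=50K' are treated alike, then map to numbers.
encoded = salary.str.strip().map({'<=50K': 1, '>50K': 0})
print(encoded.tolist())   # [1, 0, 1, 0]
```

Any label missing from the dictionary would become NaN, which makes unexpected values easy to spot.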

We can of course also assign a value to an entire column directly.

data.age=20
data.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	20	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	1
1	20	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	1
2	20	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	1
3	20	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	1
4	20	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	1

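3.2.3 Adding data

Problem 5: the boss says he will assign someone to verify every sample, so add a column named check and set its initial state to False for all rows.

[This is for the convenience of the person doing the verification, who will manually change the state to true after checking a row.]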

data['check']=False

data.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	check
0	20	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	1	False
1	20	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	1	False
2	20	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	1	False
3	20	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	1	False
4	20	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	1	False

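Problem 6: suppose the verifier is also a Python expert who reads ten lines at a glance and needs to set check to true for the first 10 rows. How does he do it?

# loc slices by label and includes both ends, so 0:9 covers exactly the first 10 rows
# Note: 'true' is a string; assigning the boolean True instead would keep the column boolean
data.loc[0:9,'check']='true'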
data.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	check
0	20	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	1	true
1	20	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	1	true
2	20	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	1	true
3	20	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	1	true
4	20	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	1	true
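
Label slices with loc include both end points, whereas position slices with iloc exclude the end. The difference is easy to check on a small made-up frame:

```python
import pandas as pd

# Made-up frame with 12 rows and the default 0..11 index.
frame = pd.DataFrame({'check': [False] * 12})

by_label = frame.loc[0:9, 'check']     # labels 0 to 9, both ends included
by_position = frame.iloc[0:10]         # positions 0 to 9, end excluded
inclusive = frame.loc[0:10, 'check']   # labels 0 to 10, which is 11 rows

frame.loc[0:9, 'check'] = True         # a boolean keeps the column dtype as bool
print(len(by_label), len(by_position), len(inclusive))   # 10 10 11
```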

3.2.4 Deleting data

Problem 7: the boss thinks the capital-loss column is meaningless and wants to drop it.

data.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	check
0	20	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	1	true
1	20	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	1	true
2	20	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	1	true
3	20	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	1	true
4	20	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	1	true

data.drop('capital-loss',axis=1)
# axis=1 drops columns; axis=0 drops rows (by index label).

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	check
0	20	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	1	true
1	20	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	1	true
2	20	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	1	true
3	20	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	1	true
4	20	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	1	true
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8995	20	Private	390817	5th-6th	3	Never-married	Handlers-cleaners	Not-in-family	White	Male	0	25	Mexico	1	False
8996	20	?	145964	Some-college	10	Never-married	?	Not-in-family	White	Male	0	40	United-States	1	False
8997	20	Private	30424	11th	7	Separated	Other-service	Unmarried	White	Female	0	38	United-States	1	False
8998	20	Private	548361	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	40	United-States	1	False
8999	20	Private	189148	HS-grad	9	Married-civ-spouse	Handlers-cleaners	Husband	White	Male	0	48	United-States	1	False

9000 rows × 15 columns

data.drop([0,1,2],axis=0)

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	capital-loss	hours-per-week	native-country	salary	check
3	20	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	0	40	United-States	1	true
4	20	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	0	40	Cuba	1	true
5	20	Private	284582	Masters	14	Married-civ-spouse	Exec-managerial	Wife	White	Female	0	0	40	United-States	1	true
6	20	Private	160187	9th	5	Married-spouse-absent	Other-service	Not-in-family	Black	Female	0	0	16	Jamaica	1	true
7	20	Self-emp-not-inc	209642	HS-grad	9	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	0	45	United-States	0	true
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8995	20	Private	390817	5th-6th	3	Never-married	Handlers-cleaners	Not-in-family	White	Male	0	0	25	Mexico	1	False
8996	20	?	145964	Some-college	10	Never-married	?	Not-in-family	White	Male	0	0	40	United-States	1	False
8997	20	Private	30424	11th	7	Separated	Other-service	Unmarried	White	Female	0	0	38	United-States	1	False
8998	20	Private	548361	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	0	40	United-States	1	False
8999	20	Private	189148	HS-grad	9	Married-civ-spouse	Handlers-cleaners	Husband	White	Male	0	0	48	United-States	1	False

8997 rows × 16 columns

data.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	check
0	20	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	1	true
1	20	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	1	true
2	20	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	1	true
3	20	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	1	true
4	20	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	1	true

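Remember: drop returns a new DataFrame and leaves the original unchanged, so after dropping we must assign the result back to data to update it.

data=data.drop('capital-loss',axis=1)  # reassign

# Export the processed data to a file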
data.to_csv(r"C:\Users\sh\Desktop\2021_01_05.csv")
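
By default to_csv also writes the row index as an extra first column. If that column is not wanted in the exported file, index=False leaves it out. The sketch below writes to an in-memory buffer instead of a file path so that it runs anywhere; the two-row frame is made up.

```python
import io
import pandas as pd

# Made-up frame standing in for the course data.
frame = pd.DataFrame({'age': [39, 50], 'capital-loss': [0, 0]})

# drop returns a new frame; `frame` itself still has both columns.
trimmed = frame.drop(columns=['capital-loss'])

buffer = io.StringIO()
trimmed.to_csv(buffer, index=False)   # omit the row-index column
print(buffer.getvalue())
```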

data.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	check
0	20	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	1	true
1	20	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	1	true
2	20	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	1	true
3	20	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	1	true
4	20	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	1	true

data # in a notebook, evaluating data on its own shows the rendered table as-is

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	check
0	20	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	1	true
1	20	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	1	true
2	20	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	1	true
3	20	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	1	true
4	20	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	1	true
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
8995	20	Private	390817	5th-6th	3	Never-married	Handlers-cleaners	Not-in-family	White	Male	0	25	Mexico	1	False
8996	20	?	145964	Some-college	10	Never-married	?	Not-in-family	White	Male	0	40	United-States	1	False
8997	20	Private	30424	11th	7	Separated	Other-service	Unmarried	White	Female	0	38	United-States	1	False
8998	20	Private	548361	HS-grad	9	Married-civ-spouse	Craft-repair	Husband	White	Male	0	40	United-States	1	False
8999	20	Private	189148	HS-grad	9	Married-civ-spouse	Handlers-cleaners	Husband	White	Male	0	48	United-States	1	False

9000 rows × 15 columns

print(data)

age          workclass  fnlwgt      education  education-num  \
0      20          State-gov   77516      Bachelors             13
1      20   Self-emp-not-inc   83311      Bachelors             13
2      20            Private  215646        HS-grad              9
3      20            Private  234721           11th              7
4      20            Private  338409      Bachelors             13
...   ...                ...     ...            ...            ...
8995   20            Private  390817        5th-6th              3
8996   20                  ?  145964   Some-college             10
8997   20            Private   30424           11th              7
8998   20            Private  548361        HS-grad              9
8999   20            Private  189148        HS-grad              9

           marital-status          occupation    relationship    race  \
0           Never-married        Adm-clerical   Not-in-family   White
1      Married-civ-spouse     Exec-managerial         Husband   White
2                Divorced   Handlers-cleaners   Not-in-family   White
3      Married-civ-spouse   Handlers-cleaners         Husband   Black
4      Married-civ-spouse      Prof-specialty            Wife   Black
...                   ...                 ...             ...     ...
8995        Never-married   Handlers-cleaners   Not-in-family   White
8996        Never-married                   ?   Not-in-family   White
8997            Separated       Other-service       Unmarried   White
8998   Married-civ-spouse        Craft-repair         Husband   White
8999   Married-civ-spouse   Handlers-cleaners         Husband   White

          sex  capital-gain  hours-per-week  native-country salary  check
0        Male          2174              40   United-States      1   true
1        Male             0              13   United-States      1   true
2        Male             0              40   United-States      1   true
3        Male             0              40   United-States      1   true
4      Female             0              40            Cuba      1   true
...       ...           ...             ...             ...    ...    ...
8995     Male             0              25          Mexico      1  False
8996     Male             0              40   United-States      1  False
8997   Female             0              38   United-States      1  False
8998     Male             0              40   United-States      1  False
8999     Male             0              48   United-States      1  False

[9000 rows x 15 columns]

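4 Using matplotlib

Import the modules: import matplotlib.pyplot as plt / import numpy as np

Create a figure window: plt.figure()

Draw a line plot: plt.plot(x,y)

Set the axis ranges: plt.xlim()/plt.ylim()

Set the axis labels: plt.xlabel()/plt.ylabel()

Set the tick positions and tick labels: plt.xticks()/plt.yticks()

Set the colour of the plot border: ax = plt.gca() / ax.spines[].set_color()

Adjust the tick positions: ax.xaxis.set_ticks_position()/ax.yaxis.set_ticks_position()

Move the border (axis) position: ax.spines[].set_position()

Add a legend: plt.legend()

Draw points: plt.scatter()

Add an annotation: plt.annotate()

Add free text: plt.text()

Add a title: plt.title()

Formatting:

Set a font that can display Chinese: plt.rcParams['font.family']='Microsoft YaHei'

Display the minus sign correctly: plt.rcParams['axes.unicode_minus'] = False

Several plotting areas in one figure: plt.subplots()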

import matplotlib
import matplotlib.pyplot as plt
import numpy as np

x1=[i for i in range(100)]
y1=np.sin(x1)
x2=[i for i in range(100)]
y2=[3* i for i in x2]

plt.plot(x1,y1)
plt.xlabel('avsf')

Text(0.5, 0, 'avsf')

plt.figure(1)
ax1=plt.subplot(211)  # 2 rows x 1 column of plotting areas; select the first
ax1.set_title('a')
plt.xlabel("a")
plt.plot(x1,y1)
ax2=plt.subplot(212)  # 2 rows x 1 column of plotting areas; select the second
plt.plot(x2,y2)

[<matplotlib.lines.Line2D at 0x2393be53b00>]

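fig,axes=plt.subplots(2,2)  # 2 x 2 grid of plotting areas
print(axes)
axes[0][0].plot(x1,y1)
# Set labels through the Axes object; assigning to plt.xlabel would overwrite the function
axes[0][0].set_xlabel('a')
axes[0][0].set_ylabel('b')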
axes[0][1].plot(x2,y2)

[[<matplotlib.axes._subplots.AxesSubplot object at 0x000002393BDC4DA0>
  <matplotlib.axes._subplots.AxesSubplot object at 0x000002393BED16A0>]
 [<matplotlib.axes._subplots.AxesSubplot object at 0x000002393BEFFC18>
  <matplotlib.axes._subplots.AxesSubplot object at 0x000002393BF3D1D0>]]

[<matplotlib.lines.Line2D at 0x2393be9c278>]
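
Several of the calls listed in this section can be combined through the Axes objects returned by plt.subplots. The following sketch draws the same two curves in a 2 x 1 layout with titles, axis labels and legends. The Agg backend and the in-memory PNG are assumptions made so the example runs without a display; in a notebook those two parts can be left out.

```python
import io
import matplotlib
matplotlib.use('Agg')   # assumption: render off-screen so no display is needed
import matplotlib.pyplot as plt
import numpy as np

x = np.arange(100)

fig, axes = plt.subplots(2, 1, figsize=(6, 6))   # 2 rows, 1 column

axes[0].plot(x, np.sin(x), label='sin(x)')
axes[0].set_title('sine')
axes[0].set_xlabel('x')
axes[0].set_ylabel('y')
axes[0].legend()

axes[1].plot(x, 3 * x, label='3x')
axes[1].set_title('line')
axes[1].set_xlabel('x')
axes[1].set_ylabel('y')
axes[1].legend()

fig.tight_layout()   # keep titles and labels from overlapping

png = io.BytesIO()
fig.savefig(png, format='png')   # write the figure to memory as a PNG
plt.close(fig)
```

Working through the Axes methods (set_title, set_xlabel and so on) makes it explicit which plotting area each setting applies to.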
