import pandas as pd

file = "file.csv"
df = pd.read_csv(file)
print(df)
####### output ##########
   col1  col2 col3
0     1     2    A
1     3     4    B
2. Writing a CSV file: df.to_csv
Export a DataFrame to a CSV file. The analogous method for Excel files is df.to_excel. Usage:
df.to_csv("file.csv", sep="|", index=False)
Inspect file.csv:
!cat file.csv
col1|col2|col3
1|2|A
3|4|B
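Because the file was written with sep="|", reading it back needs the same separator. A minimal round-trip sketch; the temporary directory is an assumption chosen for illustration, not part of the original example:

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [3, 4, "B"]], columns=["col1", "col2", "col3"])

# Write to a temporary file (location chosen for illustration only).
path = os.path.join(tempfile.mkdtemp(), "file.csv")
df.to_csv(path, sep="|", index=False)

# Pass the same separator when reading; with the default "," each row
# would be parsed as a single column.
restored = pd.read_csv(path, sep="|")
print(restored)
```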
3. Creating a DataFrame: pd.DataFrame
This constructor creates a pandas DataFrame:
data = [[1, 2, "A"], [3, 4, "B"]]
df = pd.DataFrame(data,
                  columns=["col1", "col2", "col3"])
print(df)
####### output ##########
   col1  col2 col3
0     1     2    A
1     3     4    B
With the same constructor we can also convert a dictionary into a DataFrame:
data = {"col1": [1, 2], "col2": [3, 4], "col3": ["A", "B"]}
df = pd.DataFrame(data=data)
print(df)
####### output ##########
   col1  col2 col3
0     1     3    A
1     2     4    B
4. Getting the shape of a DataFrame: df.shape
The df.shape attribute returns the shape of the DataFrame as a (rows, columns) tuple:
print(df)
print("Shape:", df.shape)
####### output ##########
   col1  col2 col3
0     1     3    A
1     2     4    B
Shape: (2, 3)
5. Viewing the first n rows: df.head(n)
A DataFrame can have many rows, and often we only want to look at the first n of them. The df.head(n) method returns those rows:
print(df.head(5))
####### output ##########
   col1  col2 col3
0     1     2    A
1     3     4    B
2     5     6    C
3     7     8    D
4     9    10    E
9. Descriptive statistics: df.describe
print(df.describe())
####### output ##########
        col1   col2
count  10.00  10.00
mean   10.00  11.00
std     6.06   6.06
min     1.00   2.00
25%     5.50   6.50
50%    10.00  11.00
75%    14.50  15.50
max    19.00  20.00
10. Filling NaN values: df.fillna
Suppose we have a DataFrame like this:
import numpy as np

df = pd.DataFrame([[1, 2, "A"], [np.nan, 4, "B"]],
                  columns=["col1", "col2", "col3"])
print(df)
####### output ##########
   col1  col2 col3
0   1.0     2    A
1   NaN     4    B
It contains a NaN. To fill it, do this:
df.fillna(0, inplace=True)
print(df)
######## output ##########
   col1  col2 col3
0   1.0     2    A
1   0.0     4    B
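The call above modifies df in place. As a hedged sketch of two common variations (the fill value -1 and the variable names are assumptions for illustration), fillna can also return a new DataFrame, and a dict limits the fill to specific columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [np.nan, 4, "B"]],
                  columns=["col1", "col2", "col3"])

# Without inplace=True, fillna returns a new DataFrame and leaves df unchanged.
filled = df.fillna(0)

# A dict maps column names to fill values, so only col1 is touched here.
filled_col1 = df.fillna({"col1": -1})

print(filled)
print(filled_col1)
```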
11. Joining DataFrames: pd.merge
To combine two DataFrames on a join key, use the pd.merge() function (or the equivalent df.merge() method).
Before the merge:
df1 = ...
df2 = ...
print(df1)
print(df2)
######## output ##########
   col1  col2 col3
0     1     2    A
1     3     4    A
2     5     6    B
  col3 col4
0    A    X
1    B    Y
The merge produces a new DataFrame:
pd.merge(df1, df2, on="col3")
######## output ##########
   col1  col2 col3 col4
0     1     2    A    X
1     3     4    A    X
2     5     6    B    Y
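The definitions of df1 and df2 are elided in the example. The sketch below assumes they match the printed output, which makes the merge runnable, and also shows the method form with how="left" as one possible variation:

```python
import pandas as pd

# Assumed definitions, reconstructed from the printed output;
# the original definitions are not shown.
df1 = pd.DataFrame([[1, 2, "A"], [3, 4, "A"], [5, 6, "B"]],
                   columns=["col1", "col2", "col3"])
df2 = pd.DataFrame([["A", "X"], ["B", "Y"]], columns=["col3", "col4"])

# Inner join (the default) on the shared key column.
merged = pd.merge(df1, df2, on="col3")

# Method form; how="left" keeps every row of df1 even when df2 has no match.
merged_left = df1.merge(df2, on="col3", how="left")

print(merged)
```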
12. Sorting a DataFrame: df.sort_values
df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])
print(df.sort_values("col1"))
######## output ##########
   col1  col2 col3
0     1     2    A
2     3    10    B
1     5     8    B
13. Grouping a DataFrame: df.groupby
df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])
df.groupby("col3").agg({"col1": "sum", "col2": "max"})
######## output ##########
      col1  col2
col3
A        1     2
B        8    10
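If the aggregated columns should get descriptive names, pandas also supports named aggregation. A small sketch on the same data; the output names col1_sum and col2_max are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])

# Named aggregation: output_column=(input_column, aggregation_name).
summary = df.groupby("col3").agg(
    col1_sum=("col1", "sum"),
    col2_max=("col2", "max"),
)
print(summary)
```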
14. Renaming columns: df.rename
To rename column headers, use the df.rename() method, which returns a new DataFrame:
df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])
df.rename(columns={"col1": "col_A"})
######## output ##########
   col_A  col2 col3
0      1     2    A
1      5     8    B
2      3    10    B
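One detail that is easy to miss: df.rename() does not change df itself unless the result is assigned. A short sketch (the variable name renamed is an illustrative choice):

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])

# rename returns a new DataFrame; df keeps its original column names.
renamed = df.rename(columns={"col1": "col_A"})

print(list(df.columns))
print(list(renamed.columns))
```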
15. Dropping columns: df.drop
To drop a column from a DataFrame, do this:
df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])
print(df.drop(columns=["col1"]))
######## output ##########
   col2 col3
0     2    A
1     8    B
2    10    B
16. Adding columns
Method 1: add a new column with the assignment operator
df = pd.DataFrame([[1, 2], [3, 4]],
                  columns=["col1", "col2"])
df["col3"] = df["col1"] + df["col2"]
print(df)
######## output ##########
   col1  col2  col3
0     1     2     3
1     3     4     7
Method 2: df.assign()
df = pd.DataFrame([[1, 2], [3, 4]],
                  columns=["col1", "col2"])
df = df.assign(col3=df["col1"] + df["col2"])
print(df)
######## output ##########
   col1  col2  col3
0     1     2     3
1     3     4     7
17. Filtering a DataFrame: boolean filtering
A row is selected when the condition evaluates to True for that row:
df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])
print(df[df["col2"] > 5])
######## output ##########
   col1  col2 col3
1     5     8    B
2     3    10    B
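Several conditions can be combined in the same way. A minimal sketch on the same data; the specific conditions are chosen only for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "B"]],
                  columns=["col1", "col2", "col3"])

# Wrap each condition in parentheses and combine with & (and), | (or), ~ (not);
# Python's plain `and` / `or` do not work element-wise on Series.
both = df[(df["col2"] > 5) & (df["col3"] == "B")]
either = df[(df["col1"] < 2) | (df["col2"] > 9)]

print(both)
print(either)
```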
18. Filtering a DataFrame: selecting a column
df["col1"]  # or df.col1
######## output ##########
0    1
1    5
2    3
Name: col1, dtype: int64
19. Filtering a DataFrame: selecting by label with df.loc
df = pd.DataFrame([[6, 5, 10], [5, 8, 6], [3, 10, 4]],
                  columns=["Maths", "Science", "English"],
                  index=["John", "Mark", "Peter"])
print(df)
######## output ##########
       Maths  Science  English
John       6        5       10
Mark       5        8        6
Peter      3       10        4
We use the df.loc indexer for label-based selection:
df.loc["John"]
######## output ##########
Maths       6
Science     5
English    10
Name: John, dtype: int64
df.loc["Mark", ["Maths", "English"]]
######## output ##########
Maths      5
English    6
Name: Mark, dtype: int64
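However, df.loc[] does not accept integer positions for this DataFrame: its index consists of name labels, so a call such as df.loc[0] raises a KeyError.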
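The failure can be reproduced with a short sketch. The try/except wrapper and the lookup_failed flag are added here only so the snippet runs to completion:

```python
import pandas as pd

df = pd.DataFrame([[6, 5, 10], [5, 8, 6], [3, 10, 4]],
                  columns=["Maths", "Science", "English"],
                  index=["John", "Mark", "Peter"])

try:
    df.loc[0]  # 0 is not one of the index labels "John", "Mark", "Peter"
    lookup_failed = False
except KeyError:
    lookup_failed = True

print("df.loc[0] raised KeyError:", lookup_failed)
```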
20. Filtering a DataFrame: selecting by position with df.iloc
Using the DataFrame from section 19, df.iloc selects by integer position:
df.iloc[0]
######## output ##########
Maths       6
Science     5
English    10
Name: John, dtype: int64
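df.iloc also accepts slices and a (row, column) pair of positions. A brief sketch on the same DataFrame; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame([[6, 5, 10], [5, 8, 6], [3, 10, 4]],
                  columns=["Maths", "Science", "English"],
                  index=["John", "Mark", "Peter"])

first_two = df.iloc[:2]   # rows at positions 0 and 1 (the end is exclusive)
cell = df.iloc[1, 2]      # second row, third column
last_row = df.iloc[-1]    # negative positions count from the end

print(first_two)
print(cell)
print(last_row)
```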
21. Getting the unique values of a column
df = pd.DataFrame([[1, 2, "A"], [5, 8, "B"], [3, 10, "A"]],
                  columns=["col1", "col2", "col3"])
df["col3"].unique()
######## output ##########
array(['A', 'B'], dtype=object)
22. Counting the unique values of a column
df["col3"].nunique()
######## output ##########
2
23. Applying a function to a column: apply
def square_col(num):
    return num**2

df = pd.DataFrame([[1, 2], [5, 8], [3, 9]],
                  columns=["col1", "col2"])
df["col3"] = df.col1.apply(square_col)
print(df)
######## output ##########
   col1  col2  col3
0     1     2     1
1     5     8    25
2     3     9     9
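For simple arithmetic like squaring, the same result can usually be obtained without apply by operating on the column directly. A sketch for comparison; the column name col3_vectorized is made up for illustration:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [5, 8], [3, 9]], columns=["col1", "col2"])

# apply calls the function once per element.
df["col3"] = df["col1"].apply(lambda num: num ** 2)

# The arithmetic operator works on the whole column at once.
df["col3_vectorized"] = df["col1"] ** 2

print(df)
```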
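24. Marking duplicate rows: df.duplicated
The df.duplicated() method marks duplicate rows. With keep=False, every row that has a duplicate is marked:
df = pd.DataFrame([[1, "A"], [2, "B"], [1, "A"]],
                  columns=["col1", "col2"])
df.duplicated(keep=False)
######## output ##########
0     True
1    False
2     True
dtype: bool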
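The keep parameter controls which occurrence counts as the original. A small sketch comparing the three settings on the same data:

```python
import pandas as pd

df = pd.DataFrame([[1, "A"], [2, "B"], [1, "A"]], columns=["col1", "col2"])

# keep="first" (the default): the first occurrence is not marked.
first = df.duplicated()

# keep="last": the last occurrence is not marked.
last = df.duplicated(keep="last")

# keep=False: every occurrence of a duplicated row is marked.
every = df.duplicated(keep=False)

print(first.tolist(), last.tolist(), every.tolist())
```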
25. Dropping duplicate rows: df.drop_duplicates
Use the df.drop_duplicates() method to remove duplicate rows, as shown below:
df = pd.DataFrame([[1, "A"], [2, "B"], [1, "A"]],
                  columns=["col1", "col2"])
print(df.drop_duplicates())
######## output ##########
   col1 col2
0     1    A
1     2    B
26. Finding the distribution of values: value_counts
To find the frequency of each unique value in a column, use the df.value_counts() method:
df = pd.DataFrame([[1, "A"], [2, "B"], [1, "A"]],
                  columns=["col1", "col2"])
print(df.value_counts("col2"))
######## output ##########
col2
A    2
B    1
dtype: int64
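The same counts are available on a single column, and normalize=True turns them into proportions. A sketch under those assumptions; the variable names are illustrative:

```python
import pandas as pd

df = pd.DataFrame([[1, "A"], [2, "B"], [1, "A"]], columns=["col1", "col2"])

# Series.value_counts counts each unique value of one column.
counts = df["col2"].value_counts()

# normalize=True returns relative frequencies that sum to 1.
shares = df["col2"].value_counts(normalize=True)

print(counts)
print(shares)
```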
27. Resetting the index of a DataFrame: df.reset_index
To reset the index of a DataFrame, use the df.reset_index() method:
df = pd.DataFrame([[6, 5, 10], [5, 8, 6], [3, 10, 4]],
                  columns=["col1", "col2", "col3"],
                  index=[2, 3, 1])
print(df.reset_index())
######## output ##########
   index  col1  col2  col3
0      2     6     5    10
1      3     5     8     6
2      1     3    10     4
To discard the old index instead of keeping it as a column, pass drop=True to the method:
df.reset_index(drop=True)
######## output ##########
   col1  col2  col3
0     6     5    10
1     5     8     6
2     3    10     4
28. Cross-tabulation: pd.crosstab
To get the frequency of each combination of values across two columns, use the pd.crosstab() function:
df = pd.DataFrame([["A", "X"], ["B", "Y"], ["C", "X"], ["A", "X"]],
                  columns=["col1", "col2"])
print(pd.crosstab(df.col1, df.col2))
######## output ##########
col2  X  Y
col1
A     2  0
B     0  1
C     1  0
29. Pivot tables: pd.pivot_table
df = ...
print(df)
######## output ##########
    Name  Subject  Marks
0   John    Maths      6
1   Mark    Maths      5
2  Peter    Maths      3
3   John  Science      5
4   Mark  Science      8
5  Peter  Science     10
6   John  English     10
7   Mark  English      6
8  Peter  English      4
With the pd.pivot_table() function, the entries of a column can be turned into column headers:
pd.pivot_table(df,
               index=["Name"],
               columns=["Subject"],
               values="Marks",
               fill_value=0)
######## output ##########
Subject  English  Maths  Science
Name
John          10      6        5
Mark           6      5        8
Peter          4      3       10
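The definition of df is elided in the example. The sketch below assumes a long-format table matching the printed rows, so that the pivot can be run end to end:

```python
import pandas as pd

# Assumed definition, reconstructed from the printed rows;
# the original definition is not shown.
df = pd.DataFrame({
    "Name": ["John", "Mark", "Peter"] * 3,
    "Subject": ["Maths"] * 3 + ["Science"] * 3 + ["English"] * 3,
    "Marks": [6, 5, 3, 5, 8, 10, 10, 6, 4],
})

# Each distinct Subject becomes a column; cells hold the mean of Marks
# (the default aggregation), and missing combinations are filled with 0.
wide = pd.pivot_table(df, index="Name", columns="Subject",
                      values="Marks", fill_value=0)
print(wide)
```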