Data Progressing
Data Progressing
Introduction
This is the note related to prepare data progresssing interview.
there are many common used technical stacks notes below.
Framework
Data operation
Data viewing
Data Merge
Data Operations and Statistics
Data filtering and filtering
Data sorting
Data visualization
Basic drawing methods
Line diagram:
The default ‘df. plot()’ is used to display the value changes after sorting by a certain column。
Bar Chart:
Draw a vertical or horizontal bar chart using ‘kind=’bar’ or ‘kind=’barh’.
Histogram:
Draw a histogram of a column using ‘df [‘column’]. plot (kind=’hist ‘)’.
Pie chart:
For ‘Series’, you can use’ s. plot (kind=’pie ‘)’ to draw a pie chart.
Box diagram:
Use ‘df. plot (kind=’box’) ‘to display numerical distribution and outliers.
Additional features:
Using ‘secondary’_ Draw a chart with two y-axes using the ‘y’ parameter.
Data Operation
Basic operation:
Read CSV file
pd.read_csv(filepath)
Eg:
1 | path = "/Users/zhouzhenzhou/Desktop/data/phone_data.csv" |
1 | df = pd.read_csv(path) |
DataFrame
Create a DataFrame and specify the number of rows
1 | df2 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],columns=['col1', 'col2', 'col3', 'col4']) |
Dictionary creation
1 | data = { |
1 | df = pd.DataFrame(data) |
Specify the columns title for data reading
1 | df = pd.read_csv(path, header=1) |
index | date | duration | item | month | network | network_type | |
---|---|---|---|---|---|---|---|
0 | 0 | 15/10/14 06:58 | 34.429 | data | 2014-11 | data | data |
1 | 1 | 15/10/14 06:58 | 13.000 | call | 2014-11 | Vodafone | mobile |
2 | 2 | 15/10/14 14:46 | 23.000 | call | 2014-11 | Meteor | mobile |
3 | 3 | 15/10/14 14:48 | 4.000 | call | 2014-11 | Tesco | mobile |
4 | 4 | 15/10/14 17:27 | 4.000 | call | 2014-11 | Tesco | mobile |
… | … | … | … | … | … | … | … |
825 | 825 | 13/03/15 00:38 | 1.000 | sms | 2015-03 | world | world |
826 | 826 | 13/03/15 00:39 | 1.000 | sms | 2015-03 | Vodafone | mobile |
827 | 827 | 13/03/15 06:58 | 34.429 | data | 2015-03 | data | data |
828 | 828 | 14/03/15 00:13 | 1.000 | sms | 2015-03 | world | world |
829 | 829 | 14/03/15 00:16 | 1.000 | sms | 2015-03 | world | world |
Get the elements of specified colomns
You can call df [‘date ‘] to directly obtain the corresponding number of columns
If not specified, the original data will be
1 | df = pd.read_csv(path) |
Note that when reading in this way, there is only one column named phone_ Data
phone_data | ||||||
---|---|---|---|---|---|---|
index | date | duration | item | month | network | network_type |
0 | 15/10/14 06:58 | 34.429 | data | 2014-11 | data | data |
1 | 15/10/14 06:58 | 13 | call | 2014-11 | Vodafone | mobile |
2 | 15/10/14 14:46 | 23 | call | 2014-11 | Meteor | mobile |
3 | 15/10/14 14:48 | 4 | call | 2014-11 | Tesco | mobile |
… | … | … | … | … | … | … |
825 | 13/03/15 00:38 | 1 | sms | 2015-03 | world | world |
826 | 13/03/15 00:39 | 1 | sms | 2015-03 | Vodafone | mobile |
827 | 13/03/15 06:58 | 34.429 | data | 2015-03 | data | data |
828 | 14/03/15 00:13 | 1 | sms | 2015-03 | world | world |
829 | 14/03/15 00:16 | 1 | sms | 2015-03 | world | world |
831 rows × 1 columns
Cannot call ‘date’ to get the corresponding columns
1 | df['date'] |
Export CSV
Write dataframe into CSV and output :df.to_csv(filepath, index=False)
1 | df.to_csv('/path/to/my_folder/my_data.csv', index=False) |
Eg:
1 | df2.to_csv('Desktop/output2.csv', index=False) |
combine with os
1 | import os |
Data Viewing
View the first few lines
1 | df = pd.DataFrame(xxxx) or df = pd.read_scv(path) |
index | date | duration | item | month | network | network_type | |
---|---|---|---|---|---|---|---|
0 | 0 | 15/10/14 06:58 | 34.429 | data | 2014-11 | data | data |
1 | 1 | 15/10/14 06:58 | 13.000 | call | 2014-11 | Vodafone | mobile |
2 | 2 | 15/10/14 14:46 | 23.000 | call | 2014-11 | Meteor | mobile |
3 | 3 | 15/10/14 14:48 | 4.000 | call | 2014-11 | Tesco | mobile |
4 | 4 | 15/10/14 17:27 | 4.000 | call | 2014-11 | Tesco | mobile |
View Statistics
1 | df.describe() |
View basic information
1 | df.info() |
1 | <class 'pandas.core.frame.DataFrame'> |
Check the Missing Values
1 | df.isnull() |
index | date | duration | item | month | network | network_type | |
---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False |
… | … | … | … | … | … | … | … |
825 | False | False | False | False | False | False | False |
826 | False | False | False | False | False | False | False |
827 | False | False | False | False | False | False | False |
828 | False | False | False | False | False | False | False |
829 | False | False | False | False | False | False | False |
830 rows × 7 columns
1 | df.isnull().sum() |
1 | index 0 |
View how many unique values each column has
1 | df.nunique() |
1 | index 830 |
Eg:
Select the column and call unque()
1 | df2['item'].unique() |
1 | array(['data', 'call', 'sms', nan], dtype=object) |
打印出所有的
1 | for c in columnsChecks: |
1 | item : ['data' 'call' 'sms' nan] |
Data Filter
1 | df[df['Column Name'] compare condition] |
Eg:
1 | df2[df2['duration'] > 10] |
index | date | duration | item | month | network | network_type | |
---|---|---|---|---|---|---|---|
0 | 0 | 15/10/14 06:58 | 34.429 | data | 2014-11 | data | data |
1 | 1 | 15/10/14 06:58 | 13.000 | call | 2014-11 | Vodafone | mobile |
2 | 2 | 15/10/14 14:46 | 23.000 | call | 2014-11 | Meteor | mobile |
6 | 6 | 16/10/14 06:58 | 34.429 | data | 2014-11 | NaN | data |
7 | 7 | 16/10/14 15:01 | 602.000 | call | 2014-11 | Three | mobile |
… | … | … | … | … | … | … | … |
821 | 821 | 9/3/2015 6:58 | 34.429 | NaN | NaN | NaN | NaN |
822 | 822 | 10/3/2015 6:58 | 34.429 | NaN | NaN | NaN | NaN |
823 | 823 | 11/3/2015 6:58 | 34.429 | NaN | NaN | NaN | NaN |
824 | 824 | 12/3/2015 6:58 | 34.429 | data | 2015-03 | data | data |
827 | 827 | 13/03/15 06:58 | 34.429 | data | 2015-03 | data | data |
420 rows × 7 columns
View the specified number of rows
1 | df.iloc[row] |
Eg:
View First Line
1 | df2.iloc[0] |
View Last Line
1 | df2.iloc[-1] |
View row n and column m
1 | df[row, col] |
Eg: View second row, second column
1 | df2.iloc[1,1] |
Data Selection
Select Column
single column
1 | df[column_name] |
Eg:
1 | df2['index'] |
Multicolumns
1 | df[['col1','col2']] |
Eg:
1 | df2[['date', 'item']] |
date | item | |
---|---|---|
0 | 15/10/14 06:58 | data |
1 | 15/10/14 06:58 | call |
2 | 15/10/14 14:46 | call |
3 | 15/10/14 14:48 | call |
4 | 15/10/14 17:27 | call |
… | … | … |
825 | 13/03/15 00:38 | sms |
826 | 13/03/15 00:39 | sms |
827 | 13/03/15 06:58 | data |
828 | 14/03/15 00:13 | sms |
829 | 14/03/15 00:16 | sms |
830 rows × 2 columns
Using. loc and. loc to select columns and rows
.iloc[xx]
Selecting based on integer index position
Left Closed Right Open
Select the third row, with index starting from 0
1 | row_3 = df.iloc[2] # 选择第三行,索引从0开始 |
Select the second to fourth rows, excluding the fifth row (index starts from 0)
1 | rows_2_to_4 = df.iloc[1:4] # 选择第二行到第四行,不包括第五行(索引从0开始) |
Select values for the third and fourth columns (index starting from 0)
1 | cell_value = df.iloc[2, 3] # 选择第三行、第四列的值(索引从0开始) |
Select the second to fourth columns, excluding the fifth column (index starts from 0)
1 | selected_columns = df.iloc[:, 1:4] # 选择第二列到第四列,不包括第五列(索引从0开始) |
.iloc[‘col1’]
Eg:
The data in the duration column
1 | df2.loc[:, 'duration'] |
Specific duration data to the first line
1 | df2.loc[0, 'duration'] |
1 | 34.429 |
Data Cleaning
Detect missing values
1 | df.isnull()` 或 `df.notnull() |
Eg:
1 | df2.isnull() |
index | date | duration | item | month | network | network_type | |
---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False |
… | … | … | … | … | … | … | … |
825 | False | False | False | False | False | False | False |
826 | False | False | False | False | False | False | False |
827 | False | False | False | False | False | False | False |
828 | False | False | False | False | False | False | False |
829 | False | False | False | False | False | False | False |
830 rows × 7 columns
1 | df2.isnull().sum() |
1 | index 0 |
1 | df2.notnull().sum() |
1 | index 830 |
Fill in missing values
1 | df.fillna(value) |
Filling method
Use specific values
1
df.fillna(0)
Using column mean
Only suitable for numbers
1 | df.fillna(df.mean()) |
Eg:
1 | meanDur = df2['duration'].mean() |
1 | 116.44196221662472 |
1 | df2['duration'].fillna(meanDur, inplace=True) |
1 | df2['duration'].isnull().sum() |
1 | 0 |
- Use previous value
1 | df.fillna(method='ffill') |
1 | df_filled = df.fillna(method='ffill') # 使用前一个非缺失值填充缺失值 |
- Use the last value
1 | df_filled = df.fillna(method='bfill') # 使用后一个非缺失值填充缺失值 |
- Specify specific filling values for specific columns
1 | values = {'date': '23/01/01 00:00:00', 'item': data} |
1 | df.fillna(value = values) |
index | date | duration | item | month | network | network_type | |
---|---|---|---|---|---|---|---|
0 | 0 | 15/10/14 06:58 | 34.429 | data | 2014-11 | data | data |
1 | 1 | 15/10/14 06:58 | 13.000 | call | 2014-11 | Vodafone | mobile |
2 | 2 | 15/10/14 14:46 | 23.000 | call | 2014-11 | Meteor | mobile |
3 | 3 | 15/10/14 14:48 | 4.000 | call | 2014-11 | Tesco | mobile |
4 | 4 | 15/10/14 17:27 | 4.000 | call | 2014-11 | Tesco | mobile |
… | … | … | … | … | … | … | … |
825 | 825 | 13/03/15 00:38 | 1.000 | sms | 2015-03 | world | world |
826 | 826 | 13/03/15 00:39 | 1.000 | sms | 2015-03 | Vodafone | mobile |
827 | 827 | 13/03/15 06:58 | 34.429 | data | 2015-03 | data | data |
828 | 828 | 14/03/15 00:13 | 1.000 | sms | 2015-03 | world | world |
829 | 829 | 14/03/15 00:16 | 1.000 | sms | 2015-03 | world | world |
830 rows × 7 columns
Delete rows Containing Missing Values
1 | df.dropna() |
Eg:
1 | df2.isnull().sum() |
1 | index 0 |
1 | df3 = df2.dropna() |
1 | df3.isnull().sum() |
1 | index 0 |
Data deformation
Using pivot, melt, stack, unstack to reshape data
pivot
Reshape data based on column values, changing data from “long format” to “wide format”
Eg:
1 | df = pd.DataFrame({ |
date | variable | value | |
---|---|---|---|
0 | 2021-01-01 | A | 1 |
1 | 2021-01-01 | B | 2 |
2 | 2021-01-02 | A | 3 |
3 | 2021-01-02 | B | 4 |
1 | df.pivot(index='date', columns='variable', values='value') |
variable | A | B |
---|---|---|
date | ||
2021-01-01 | 1 | 2 |
2021-01-02 | 3 | 4 |
Using ‘groupby’ for data grouping and aggregation
1 | import pandas as pd |
Product | Sales | |
---|---|---|
0 | Apple | 10 |
1 | Banana | 20 |
2 | Apple | 15 |
3 | Banana | 25 |
4 | Cherry | 30 |
5 | Apple | 10 |
1 | df.groupby('Product').sum() |
Sales | |
---|---|
Product | |
Apple | 35 |
Banana | 45 |
Cherry | 30 |
Data merger
数据拼接 pd.concat([df1, df2], axis=0)
数据合并 pd.merge(df1, df2, on=”key”)
时间序列数据
将字符串转化为时间格式:pd.to_datetime()
时间序列数据的索引和选择
resample
方法进行重新采样
数据运算与统计
- 使用apply
和applymap
对数据应用函数
- 使用agg
进行统计运算
数据筛选与过滤
- 使用条件选择数据:
df[df["column"] > value]
- 使用
query
方法进行筛选
数据排序
- 按列值排序:
df.sort_values(by="column_name")
1 | DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, |
Eg:
1 | df2.sort_values(by=['index','date', 'duration'],ascending=[True,True, False]) |
数据类型转换
通常是string 转 float 或者 interger
1 | df['column_name'] = df['column_name'].astype(float) |
如果包含非数字的字符串
1 | df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce') |
Eg:
1 | df2['duration'] = df2['duration'].astype(float) |
1 | type(df2['duration'][0]) |
1 | numpy.float64 |
按索引排序
1 | df.sort_index() |
Data Visualization
1 | import matplotlib.pyplot as plt |
Basic drawing methods
plot()
1 | df = pd.DataFrame({ |
The Kind parameter selections: line, bar, hist, box, scanner, etc.
1 | df.plot(kind='scatter', x='x', y='y') |
When you encounter a scatter chart, it is scattered and dense, and you can’t see the rules.
Adjusting the y-axis display range
1 | ax = df2.plot(kind='scatter', x='date', y='duration') |
Or add a fixed value to the y of all values.
1 | df2['adj_duration'] = df2['duration'] + 5000 |
线图:
默认的df.plot()
,用于显示按某一列排序后的值变化。
柱状图/条形图
使用kind='bar'
或kind='barh'
进行垂直或水平的柱状图绘制。
直方图:
用df['column'].plot(kind='hist')
绘制某一列的直方图。
1 | df2['duration'].plot(kind='hist', edgecolor='black', bins=20) |
饼图:
对于Series
,可以使用s.plot(kind='pie')
来绘制饼图。
1 | duration_counts = df2['duration'].value_counts() |
value_counts()
来统计每个唯一值的数量
1 | duration |
箱线图/盒须图:
使用df.plot(kind='box')
来展示数值分布和异常值。
额外功能
使用secondary_y
参数绘制具有两个y轴的图表。
Plotly Graphics Library
Eg:
1 | import plotly |
`The np. random. randn (N) ‘function generates an array with a shape of’ (N,) ‘, where the elements come from a standard normal distribution (also known as a Z distribution or Gaussian distribution). A standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1.
Interview Example of Data Processing
Read data
1 | df = pd.read_csv(path) |
Get the specified columns
1 | df[['col1','col2']] |
Draw a scatter plot for observation
1 | import matplotlib.pyplot as plt |
1 | df.plot(kind='scatter', x = 'specified x', y ='specified y') |
Operation and put the result in new column
1 | df['newCol'] = df['old1'] operator df['old2'] |
Export Data
1 | pd.to_csv(path, index=False) |
Draw a normal distribution
what is norma /gaussian distribution?
Mean: The sum of all data points divided by the number of data points.
Variance: the average square value of the deviation between a data point and its mean
Standard Deviation: The standard deviation is the square root of variance
These three terms: (Mean, Variance and Standard Deviation) are used to describe the central trend and degree of dispersion of a set of data
1 | mean |
1 | x = np.linspace(mean - 3 * std, mean + std, numOfDot) |
1 | plt.plot(x, stats.norm.pdf(x, mean, std)) |
Generate cetain amount of data according to the Normal Describution
1 | from scipy import stats |
1 | b=stats.norm.rvs(mean,std,size= the amount of the data) |
Probability of existence greater than a certain x in statistics
1 | import scipy.stats as stats |
1 | mean = 2 |
Adding Two Normal Distribution Graphs
1 | import numpy as np |
长风破浪会有时,直挂云帆济沧海。