Data Progressing

Data Progressing

Introduction

This is the note related to prepare data progresssing interview.

there are many common used technical stacks notes below.

Framework

Data operation

Data viewing

Data Merge

Data Operations and Statistics

Data filtering and filtering

Data sorting

Data visualization

Basic drawing methods

Line diagram:

The default ‘df. plot()’ is used to display the value changes after sorting by a certain column。

Bar Chart:

Draw a vertical or horizontal bar chart using ‘kind=’bar’ or ‘kind=’barh’.

Histogram:

​ Draw a histogram of a column using ‘df [‘column’]. plot (kind=’hist ‘)’.

Pie chart:

​ For ‘Series’, you can use’ s. plot (kind=’pie ‘)’ to draw a pie chart.

Box diagram:

​ Use ‘df. plot (kind=’box’) ‘to display numerical distribution and outliers.

Additional features:

Using ‘secondary’_ Draw a chart with two y-axes using the ‘y’ parameter.

Data Operation

Basic operation:

Read CSV file

pd.read_csv(filepath)

Eg:

1
path = "/Users/zhouzhenzhou/Desktop/data/phone_data.csv"
1
df = pd.read_csv(path)

DataFrame

Create a DataFrame and specify the number of rows

1
df2 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],columns=['col1', 'col2', 'col3', 'col4'])

Dictionary creation

1
2
3
4
5
data = {
'Name': ['John', 'Doe', 'Anna'],
'Age': [25, 28, 22]
}

1
df = pd.DataFrame(data)

Specify the columns title for data reading

1
df = pd.read_csv(path, header=1)
index date duration item month network network_type
0 0 15/10/14 06:58 34.429 data 2014-11 data data
1 1 15/10/14 06:58 13.000 call 2014-11 Vodafone mobile
2 2 15/10/14 14:46 23.000 call 2014-11 Meteor mobile
3 3 15/10/14 14:48 4.000 call 2014-11 Tesco mobile
4 4 15/10/14 17:27 4.000 call 2014-11 Tesco mobile
825 825 13/03/15 00:38 1.000 sms 2015-03 world world
826 826 13/03/15 00:39 1.000 sms 2015-03 Vodafone mobile
827 827 13/03/15 06:58 34.429 data 2015-03 data data
828 828 14/03/15 00:13 1.000 sms 2015-03 world world
829 829 14/03/15 00:16 1.000 sms 2015-03 world world

Get the elements of specified colomns

You can call df [‘date ‘] to directly obtain the corresponding number of columns

If not specified, the original data will be

1
df = pd.read_csv(path)

Note that when reading in this way, there is only one column named phone_ Data

phone_data
index date duration item month network network_type
0 15/10/14 06:58 34.429 data 2014-11 data data
1 15/10/14 06:58 13 call 2014-11 Vodafone mobile
2 15/10/14 14:46 23 call 2014-11 Meteor mobile
3 15/10/14 14:48 4 call 2014-11 Tesco mobile
825 13/03/15 00:38 1 sms 2015-03 world world
826 13/03/15 00:39 1 sms 2015-03 Vodafone mobile
827 13/03/15 06:58 34.429 data 2015-03 data data
828 14/03/15 00:13 1 sms 2015-03 world world
829 14/03/15 00:16 1 sms 2015-03 world world

831 rows × 1 columns

Cannot call ‘date’ to get the corresponding columns

1
df['date']

Export CSV

Write dataframe into CSV and output :df.to_csv(filepath, index=False)

1
df.to_csv('/path/to/my_folder/my_data.csv', index=False)

Eg:

1
df2.to_csv('Desktop/output2.csv', index=False)

combine with os

1
2
3
4
5
6
7
import os

output_directory = "/path/to/my_folder"
if not os.path.exists(output_directory):
os.makedirs(output_directory)

df.to_csv(os.path.join(output_directory, 'my_data.csv'), index=False)

Data Viewing

View the first few lines

1
2
df = pd.DataFrame(xxxx) or df = pd.read_scv(path)
df.head()
index date duration item month network network_type
0 0 15/10/14 06:58 34.429 data 2014-11 data data
1 1 15/10/14 06:58 13.000 call 2014-11 Vodafone mobile
2 2 15/10/14 14:46 23.000 call 2014-11 Meteor mobile
3 3 15/10/14 14:48 4.000 call 2014-11 Tesco mobile
4 4 15/10/14 17:27 4.000 call 2014-11 Tesco mobile

View Statistics

1
df.describe()

image-20231026215202225

View basic information

1
df.info()
1
2
3
4
5
6
7
8
9
10
11
12
13
14
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 index 830 non-null int64
1 date 798 non-null object
2 duration 794 non-null float64
3 item 806 non-null object
4 month 800 non-null object
5 network 773 non-null object
6 network_type 799 non-null object
dtypes: float64(1), int64(1), object(5)
memory usage: 45.5+ KB

Check the Missing Values

1
df.isnull()
index date duration item month network network_type
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
825 False False False False False False False
826 False False False False False False False
827 False False False False False False False
828 False False False False False False False
829 False False False False False False False

830 rows × 7 columns

1
df.isnull().sum()
1
2
3
4
5
6
7
8
index            0
date 32
duration 36
item 24
month 30
network 57
network_type 31
dtype: int64

View how many unique values each column has

1
df.nunique()
1
2
3
4
5
6
7
8
index           830
date 721
duration 205
item 3
month 6
network 10
network_type 7
dtype: int64

Eg:

Select the column and call unque()

1
df2['item'].unique()
1
array(['data', 'call', 'sms', nan], dtype=object)

打印出所有的

1
2
for c in columnsChecks:
print(f"{c} : {df2[c].unique()}")
1
2
3
4
item : ['data' 'call' 'sms' nan]
network : ['data' 'Vodafone' 'Meteor' 'Tesco' nan 'Three' 'voicemail' 'landline'
'special' 'world' 'hree']
network_type : ['data' 'mobile' 'voicemail' 'landline' 'special' 'world' nan 'dat']

Data Filter

1
df[df['Column Name'] compare condition] 

Eg:

1
df2[df2['duration'] > 10]
index date duration item month network network_type
0 0 15/10/14 06:58 34.429 data 2014-11 data data
1 1 15/10/14 06:58 13.000 call 2014-11 Vodafone mobile
2 2 15/10/14 14:46 23.000 call 2014-11 Meteor mobile
6 6 16/10/14 06:58 34.429 data 2014-11 NaN data
7 7 16/10/14 15:01 602.000 call 2014-11 Three mobile
821 821 9/3/2015 6:58 34.429 NaN NaN NaN NaN
822 822 10/3/2015 6:58 34.429 NaN NaN NaN NaN
823 823 11/3/2015 6:58 34.429 NaN NaN NaN NaN
824 824 12/3/2015 6:58 34.429 data 2015-03 data data
827 827 13/03/15 06:58 34.429 data 2015-03 data data

420 rows × 7 columns

View the specified number of rows

1
df.iloc[row]

Eg:

View First Line

1
df2.iloc[0]

View Last Line

1
df2.iloc[-1]

View row n and column m

1
df[row, col]

Eg: View second row, second column

1
df2.iloc[1,1]

Data Selection

Select Column

single column
1
df[column_name]

Eg:

1
df2['index']
Multicolumns
1
df[['col1','col2']]

Eg:

1
df2[['date', 'item']]
date item
0 15/10/14 06:58 data
1 15/10/14 06:58 call
2 15/10/14 14:46 call
3 15/10/14 14:48 call
4 15/10/14 17:27 call
825 13/03/15 00:38 sms
826 13/03/15 00:39 sms
827 13/03/15 06:58 data
828 14/03/15 00:13 sms
829 14/03/15 00:16 sms

830 rows × 2 columns

Using. loc and. loc to select columns and rows

.iloc[xx]

Selecting based on integer index position

Left Closed Right Open

Select the third row, with index starting from 0

1
row_3 = df.iloc[2]  # 选择第三行,索引从0开始

Select the second to fourth rows, excluding the fifth row (index starts from 0)

1
rows_2_to_4 = df.iloc[1:4]  # 选择第二行到第四行,不包括第五行(索引从0开始)

Select values for the third and fourth columns (index starting from 0)

1
cell_value = df.iloc[2, 3]  # 选择第三行、第四列的值(索引从0开始)

Select the second to fourth columns, excluding the fifth column (index starts from 0)

1
selected_columns = df.iloc[:, 1:4]  # 选择第二列到第四列,不包括第五列(索引从0开始)
.iloc[‘col1’]

Eg:

The data in the duration column

1
df2.loc[:, 'duration']

Specific duration data to the first line

1
df2.loc[0, 'duration']
1
34.429

Data Cleaning

Detect missing values

1
df.isnull()` 或 `df.notnull()

Eg:

1
df2.isnull()
index date duration item month network network_type
0 False False False False False False False
1 False False False False False False False
2 False False False False False False False
3 False False False False False False False
4 False False False False False False False
825 False False False False False False False
826 False False False False False False False
827 False False False False False False False
828 False False False False False False False
829 False False False False False False False

830 rows × 7 columns

1
df2.isnull().sum()
1
2
3
4
5
6
7
8
index            0
date 32
duration 36
item 24
month 30
network 57
network_type 31
dtype: int64
1
df2.notnull().sum()
1
2
3
4
5
6
7
8
index           830
date 798
duration 794
item 806
month 800
network 773
network_type 799
dtype: int64

Fill in missing values

1
df.fillna(value)

Filling method

  1. Use specific values

    1
    df.fillna(0)
  2. Using column mean

    Only suitable for numbers

1
df.fillna(df.mean())

Eg:

1
meanDur = df2['duration'].mean()
1
116.44196221662472
1
df2['duration'].fillna(meanDur, inplace=True)
1
df2['duration'].isnull().sum()
1
0
  1. Use previous value
1
df.fillna(method='ffill')
1
df_filled = df.fillna(method='ffill')  # 使用前一个非缺失值填充缺失值
  1. Use the last value
1
df_filled = df.fillna(method='bfill')  # 使用后一个非缺失值填充缺失值
  1. Specify specific filling values for specific columns
1
values = {'date': '23/01/01 00:00:00', 'item': data}
1
df.fillna(value = values)
index date duration item month network network_type
0 0 15/10/14 06:58 34.429 data 2014-11 data data
1 1 15/10/14 06:58 13.000 call 2014-11 Vodafone mobile
2 2 15/10/14 14:46 23.000 call 2014-11 Meteor mobile
3 3 15/10/14 14:48 4.000 call 2014-11 Tesco mobile
4 4 15/10/14 17:27 4.000 call 2014-11 Tesco mobile
825 825 13/03/15 00:38 1.000 sms 2015-03 world world
826 826 13/03/15 00:39 1.000 sms 2015-03 Vodafone mobile
827 827 13/03/15 06:58 34.429 data 2015-03 data data
828 828 14/03/15 00:13 1.000 sms 2015-03 world world
829 829 14/03/15 00:16 1.000 sms 2015-03 world world

830 rows × 7 columns

Delete rows Containing Missing Values

1
df.dropna()

Eg:

1
df2.isnull().sum()
1
2
3
4
5
6
7
8
index            0
date 32
duration 0
item 24
month 30
network 57
network_type 31
dtype: int64
1
df3 = df2.dropna()
1
df3.isnull().sum()
1
2
3
4
5
6
7
8
index           0
date 0
duration 0
item 0
month 0
network 0
network_type 0
dtype: int64

Data deformation

Using pivot, melt, stack, unstack to reshape data
pivot

Reshape data based on column values, changing data from “long format” to “wide format”

Eg:

1
2
3
4
5
df = pd.DataFrame({
'date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
'variable': ['A', 'B', 'A', 'B'],
'value': [1, 2, 3, 4]
})
date variable value
0 2021-01-01 A 1
1 2021-01-01 B 2
2 2021-01-02 A 3
3 2021-01-02 B 4
1
df.pivot(index='date', columns='variable', values='value')
variable A B
date
2021-01-01 1 2
2021-01-02 3 4
Using ‘groupby’ for data grouping and aggregation
1
2
3
4
5
6
7
8
import pandas as pd

data = {
'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Cherry', 'Apple'],
'Sales': [10, 20, 15, 25, 30, 10]
}

df = pd.DataFrame(data)
Product Sales
0 Apple 10
1 Banana 20
2 Apple 15
3 Banana 25
4 Cherry 30
5 Apple 10
1
df.groupby('Product').sum()
Sales
Product
Apple 35
Banana 45
Cherry 30

Data merger

​ 数据拼接 pd.concat([df1, df2], axis=0)

​ 数据合并 pd.merge(df1, df2, on=”key”)

时间序列数据

​ 将字符串转化为时间格式:pd.to_datetime()

​ 时间序列数据的索引和选择

resample方法进行重新采样

数据运算与统计

​ - 使用applyapplymap对数据应用函数

​ - 使用agg进行统计运算

数据筛选与过滤

  • 使用条件选择数据:df[df["column"] > value]
  • 使用query方法进行筛选

数据排序

  • 按列值排序:df.sort_values(by="column_name")
1
2
DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, 
kind='quicksort', na_position='last', ignore_index=False, key=None)

Eg:

1
df2.sort_values(by=['index','date', 'duration'],ascending=[True,True, False])

数据类型转换

通常是string 转 float 或者 interger

1
df['column_name'] = df['column_name'].astype(float)

如果包含非数字的字符串

1
df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')

Eg:

1
df2['duration'] = df2['duration'].astype(float)
1
type(df2['duration'][0])
1
numpy.float64

按索引排序

1
df.sort_index()

Data Visualization

1
import matplotlib.pyplot as plt

Basic drawing methods

​ plot()

1
2
3
4
5
6
df = pd.DataFrame({
'x': [1, 2, 3, 4, 5],
'y': [1, 4, 9, 16, 25]
})
df.plot(kind='line', x='x', y='y')
plt.show()

img

The Kind parameter selections: line, bar, hist, box, scanner, etc.

1
df.plot(kind='scatter', x='x', y='y')

When you encounter a scatter chart, it is scattered and dense, and you can’t see the rules.

Adjusting the y-axis display range

1
2
ax = df2.plot(kind='scatter', x='date', y='duration')
ax.set_ylim([0, 40]) # 设置y轴的范围

img

Or add a fixed value to the y of all values.

1
2
df2['adj_duration'] = df2['duration'] + 5000
df2.plot(kind='scatter',x='date', y ='adj_duration')

img

线图:

​ 默认的df.plot(),用于显示按某一列排序后的值变化。

柱状图/条形图

​ 使用kind='bar'kind='barh'进行垂直或水平的柱状图绘制。

直方图:

​ 用df['column'].plot(kind='hist')绘制某一列的直方图。

1
2
3
4
5
6
df2['duration'].plot(kind='hist', edgecolor='black', bins=20)
plt.title('Duration Histogram')
plt.xlabel('Duration')
plt.ylabel('Frequency')
plt.show()

img

饼图:

​ 对于Series,可以使用s.plot(kind='pie')来绘制饼图。

1
2
3
duration_counts = df2['duration'].value_counts()
duration_counts.plot(kind='pie', autopct='%1.1f%%')
plt.ylabel('')

value_counts()来统计每个唯一值的数量

1
2
3
4
5
6
7
8
9
10
11
12
13
duration
1.000 250
34.429 122
4.000 25
3.000 19
8.000 8
...
21.000 1
114.000 1
81.000 1
174.000 1
768.000 1
Name: count, Length: 196, dtype: int64

箱线图/盒须图:

​ 使用df.plot(kind='box')来展示数值分布和异常值。

额外功能

​ 使用secondary_y参数绘制具有两个y轴的图表。

Plotly Graphics Library

Eg:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
 import plotly

import plotly.graph_objs as go

import numpy as np

N = 1000
random_x = np.random.randn(N)
random_y = np.random.randn(N)

#Create a trace
trace = go.Scatter(x=random_x, y=random_y, mode='markers')

data = [trace]
plotly.offline.plot(data, filename='basic-scatter.html')

`The np. random. randn (N) ‘function generates an array with a shape of’ (N,) ‘, where the elements come from a standard normal distribution (also known as a Z distribution or Gaussian distribution). A standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1.

Interview Example of Data Processing

Read data

1
df = pd.read_csv(path)

Get the specified columns

1
df[['col1','col2']]

Draw a scatter plot for observation

1
import matplotlib.pyplot as plt
1
df.plot(kind='scatter', x = 'specified x', y ='specified y')

Operation and put the result in new column

1
df['newCol'] = df['old1'] operator df['old2']

Export Data

1
pd.to_csv(path, index=False)

Draw a normal distribution

what is norma /gaussian distribution?

image-20231028133753616

Mean: The sum of all data points divided by the number of data points.

image-20231028133733522

Variance: the average square value of the deviation between a data point and its mean

image-20231028145935990

Standard Deviation: The standard deviation is the square root of variance

image-20231028145923421

These three terms: (Mean, Variance and Standard Deviation) are used to describe the central trend and degree of dispersion of a set of data

1
2
mean
std
1
x = np.linspace(mean - 3 * std, mean + std, numOfDot)
1
plt.plot(x, stats.norm.pdf(x, mean, std))

Generate cetain amount of data according to the Normal Describution

1
from scipy import stats
1
b=stats.norm.rvs(mean,std,size= the amount of the data)

Probability of existence greater than a certain x in statistics

1
import scipy.stats as stats
1
2
3
mean = 2
std = 3
pro = 1 - stats.norm.cdf(3, loc=mean, scale=std) # Using the Cumulative Distribution Function (CDF)

Adding Two Normal Distribution Graphs

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Define means and standard deviations for two distributions
mu1, sigma1 = 0, 0.1
mu2, sigma2 = 0, 0.2

# Create a range of x values
x = np.linspace(-1, 1, 1000)

# Get the PDF values for each x for the two distributions
pdf1 = norm.pdf(x, mu1, sigma1)
pdf2 = norm.pdf(x, mu2, sigma2)

# The mean and standard deviation of the sum of two normal distributions
mu3 = mu1 + mu2
sigma3 = np.sqrt(sigma1**2 + sigma2**2)
pdf3 = norm.pdf(x, mu3, sigma3)

# Plotting
plt.plot(x, pdf1, label='N(0, 0.1)', color='blue')
plt.plot(x, pdf2, label='N(0, 0.2)', color='green')
plt.plot(x, pdf3, label='Sum', color='red', linestyle='dashed')

plt.legend()
plt.title("Adding Two Normal Distributions")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.grid(True)
plt.show()

img

长风破浪会有时,直挂云帆济沧海。