Data Progressing

Posted on 2023-10-26 Edited on 2023-11-24 In Technical stack , Python Views: Views: Word count in article: 3k Reading time ≈ 11 mins.

Data Progressing

Introduction

This is the note related to prepare data progresssing interview.

there are many common used technical stacks notes below.

Framework

Data operation

Data viewing

Data Merge

Data Operations and Statistics

Data filtering and filtering

Data sorting

Data visualization

Basic drawing methods

Line diagram:

The default ‘df. plot()’ is used to display the value changes after sorting by a certain column。

Bar Chart：

Draw a vertical or horizontal bar chart using ‘kind=’bar’ or ‘kind=’barh’.

Histogram:

Draw a histogram of a column using ‘df [‘column’]. plot (kind=’hist ‘)’.

Pie chart:

For ‘Series’, you can use’ s. plot (kind=’pie ‘)’ to draw a pie chart.

Box diagram:

Use ‘df. plot (kind=’box’) ‘to display numerical distribution and outliers.

Additional features:

Using ‘secondary’_ Draw a chart with two y-axes using the ‘y’ parameter.

Data Operation

Basic operation:

Read CSV file

pd.read_csv(filepath)

Eg:

1	path = "/Users/zhouzhenzhou/Desktop/data/phone_data.csv"

1	df = pd.read_csv(path)

DataFrame

Create a DataFrame and specify the number of rows

1	df2 = pd.DataFrame([[1,2,3,4],[5,6,7,8],[9,10,11,12],[13,14,15,16]],columns=['col1', 'col2', 'col3', 'col4'])

Dictionary creation

data = {
    'Name': ['John', 'Doe', 'Anna'],
    'Age': [25, 28, 22]
}

1	df = pd.DataFrame(data)

Specify the columns title for data reading

1	df = pd.read_csv(path, header=1)

index	date	duration	item	month	network	network_type
0	0	15/10/14 06:58	34.429	data	2014-11	data	data
1	1	15/10/14 06:58	13.000	call	2014-11	Vodafone	mobile
2	2	15/10/14 14:46	23.000	call	2014-11	Meteor	mobile
3	3	15/10/14 14:48	4.000	call	2014-11	Tesco	mobile
4	4	15/10/14 17:27	4.000	call	2014-11	Tesco	mobile
…	…	…	…	…	…	…	…
825	825	13/03/15 00:38	1.000	sms	2015-03	world	world
826	826	13/03/15 00:39	1.000	sms	2015-03	Vodafone	mobile
827	827	13/03/15 06:58	34.429	data	2015-03	data	data
828	828	14/03/15 00:13	1.000	sms	2015-03	world	world
829	829	14/03/15 00:16	1.000	sms	2015-03	world	world

Get the elements of specified colomns

You can call df [‘date ‘] to directly obtain the corresponding number of columns

If not specified, the original data will be

1	df = pd.read_csv(path)

Note that when reading in this way, there is only one column named phone_ Data

				phone_data
index	date	duration	item	month	network	network_type
0	15/10/14 06:58	34.429	data	2014-11	data	data
1	15/10/14 06:58	13	call	2014-11	Vodafone	mobile
2	15/10/14 14:46	23	call	2014-11	Meteor	mobile
3	15/10/14 14:48	4	call	2014-11	Tesco	mobile
…	…	…	…	…	…	…
825	13/03/15 00:38	1	sms	2015-03	world	world
826	13/03/15 00:39	1	sms	2015-03	Vodafone	mobile
827	13/03/15 06:58	34.429	data	2015-03	data	data
828	14/03/15 00:13	1	sms	2015-03	world	world
829	14/03/15 00:16	1	sms	2015-03	world	world

831 rows × 1 columns

Cannot call ‘date’ to get the corresponding columns

1	df['date']

Export CSV

Write dataframe into CSV and output ：df.to_csv(filepath, index=False)

1	df.to_csv('/path/to/my_folder/my_data.csv', index=False)

Eg:

1	df2.to_csv('Desktop/output2.csv', index=False)

combine with os

import os

output_directory = "/path/to/my_folder"
if not os.path.exists(output_directory):
    os.makedirs(output_directory)

df.to_csv(os.path.join(output_directory, 'my_data.csv'), index=False)

Data Viewing

View the first few lines

1 2	df = pd.DataFrame(xxxx) or df = pd.read_scv(path) df.head()

index	date	duration	item	month	network	network_type
0	0	15/10/14 06:58	34.429	data	2014-11	data	data
1	1	15/10/14 06:58	13.000	call	2014-11	Vodafone	mobile
2	2	15/10/14 14:46	23.000	call	2014-11	Meteor	mobile
3	3	15/10/14 14:48	4.000	call	2014-11	Tesco	mobile
4	4	15/10/14 17:27	4.000	call	2014-11	Tesco	mobile

View Statistics

1	df.describe()

View basic information

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 830 entries, 0 to 829
Data columns (total 7 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   index         830 non-null    int64  
 1   date          798 non-null    object 
 2   duration      794 non-null    float64
 3   item          806 non-null    object 
 4   month         800 non-null    object 
 5   network       773 non-null    object 
 6   network_type  799 non-null    object 
dtypes: float64(1), int64(1), object(5)
memory usage: 45.5+ KB

Check the Missing Values

1	df.isnull()

index	date	duration	item	month	network	network_type
0	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False
…	…	…	…	…	…	…	…
825	False	False	False	False	False	False	False
826	False	False	False	False	False	False	False
827	False	False	False	False	False	False	False
828	False	False	False	False	False	False	False
829	False	False	False	False	False	False	False

830 rows × 7 columns

1	df.isnull().sum()

index            0
date            32
duration        36
item            24
month           30
network         57
network_type    31
dtype: int64

View how many unique values each column has

1	df.nunique()

index           830
date            721
duration        205
item              3
month             6
network          10
network_type      7
dtype: int64

Eg:

Select the column and call unque()

1	df2['item'].unique()

1	array(['data', 'call', 'sms', nan], dtype=object)

打印出所有的

1 2	for c in columnsChecks: print(f"{c} : {df2[c].unique()}")

item : ['data' 'call' 'sms' nan]
network : ['data' 'Vodafone' 'Meteor' 'Tesco' nan 'Three' 'voicemail' 'landline'
 'special' 'world' 'hree']
network_type : ['data' 'mobile' 'voicemail' 'landline' 'special' 'world' nan 'dat']

Data Filter

1	df[df['Column Name'] compare condition]

Eg:

1	df2[df2['duration'] > 10]

index	date	duration	item	month	network	network_type
0	0	15/10/14 06:58	34.429	data	2014-11	data	data
1	1	15/10/14 06:58	13.000	call	2014-11	Vodafone	mobile
2	2	15/10/14 14:46	23.000	call	2014-11	Meteor	mobile
6	6	16/10/14 06:58	34.429	data	2014-11	NaN	data
7	7	16/10/14 15:01	602.000	call	2014-11	Three	mobile
…	…	…	…	…	…	…	…
821	821	9/3/2015 6:58	34.429	NaN	NaN	NaN	NaN
822	822	10/3/2015 6:58	34.429	NaN	NaN	NaN	NaN
823	823	11/3/2015 6:58	34.429	NaN	NaN	NaN	NaN
824	824	12/3/2015 6:58	34.429	data	2015-03	data	data
827	827	13/03/15 06:58	34.429	data	2015-03	data	data

420 rows × 7 columns

View the specified number of rows

1	df.iloc[row]

Eg:

View First Line

1	df2.iloc[0]

View Last Line

1	df2.iloc[-1]

View row n and column m

1	df[row, col]

Eg: View second row, second column

1	df2.iloc[1,1]

Data Selection

Select Column

single column

1	df[column_name]

Eg:

1	df2['index']

Multicolumns

1	df[['col1','col2']]

Eg:

1	df2[['date', 'item']]

date	item
0	15/10/14 06:58	data
1	15/10/14 06:58	call
2	15/10/14 14:46	call
3	15/10/14 14:48	call
4	15/10/14 17:27	call
…	…	…
825	13/03/15 00:38	sms
826	13/03/15 00:39	sms
827	13/03/15 06:58	data
828	14/03/15 00:13	sms
829	14/03/15 00:16	sms

830 rows × 2 columns

Using. loc and. loc to select columns and rows

.iloc[xx]

Selecting based on integer index position

Left Closed Right Open

Select the third row, with index starting from 0

1	row_3 = df.iloc[2] # 选择第三行，索引从0开始

Select the second to fourth rows, excluding the fifth row (index starts from 0)

1	rows_2_to_4 = df.iloc[1:4] # 选择第二行到第四行，不包括第五行（索引从0开始）

Select values for the third and fourth columns (index starting from 0)

1	cell_value = df.iloc[2, 3] # 选择第三行、第四列的值（索引从0开始）

Select the second to fourth columns, excluding the fifth column (index starts from 0)

1	selected_columns = df.iloc[:, 1:4] # 选择第二列到第四列，不包括第五列（索引从0开始）

.iloc[‘col1’]

Eg:

The data in the duration column

1	df2.loc[:, 'duration']

Specific duration data to the first line

1	df2.loc[0, 'duration']

34.429

Data Cleaning

Detect missing values

1	df.isnull()` 或 `df.notnull()

Eg:

1	df2.isnull()

index	date	duration	item	month	network	network_type
0	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False
…	…	…	…	…	…	…	…
825	False	False	False	False	False	False	False
826	False	False	False	False	False	False	False
827	False	False	False	False	False	False	False
828	False	False	False	False	False	False	False
829	False	False	False	False	False	False	False

830 rows × 7 columns

1	df2.isnull().sum()

index            0
date            32
duration        36
item            24
month           30
network         57
network_type    31
dtype: int64

1	df2.notnull().sum()

index           830
date            798
duration        794
item            806
month           800
network         773
network_type    799
dtype: int64

Fill in missing values

1	df.fillna(value)

Filling method

Use specific values
1
df.fillna(0)
Using column mean

Only suitable for numbers

1	df.fillna(df.mean())

Eg:

1	meanDur = df2['duration'].mean()

1	116.44196221662472

1	df2['duration'].fillna(meanDur, inplace=True)

1	df2['duration'].isnull().sum()

Use previous value

1	df.fillna(method='ffill')

1	df_filled = df.fillna(method='ffill') # 使用前一个非缺失值填充缺失值

Use the last value

1	df_filled = df.fillna(method='bfill') # 使用后一个非缺失值填充缺失值

Specify specific filling values for specific columns

1	values = {'date': '23/01/01 00:00:00', 'item': data}

1	df.fillna(value = values)

index	date	duration	item	month	network	network_type
0	0	15/10/14 06:58	34.429	data	2014-11	data	data
1	1	15/10/14 06:58	13.000	call	2014-11	Vodafone	mobile
2	2	15/10/14 14:46	23.000	call	2014-11	Meteor	mobile
3	3	15/10/14 14:48	4.000	call	2014-11	Tesco	mobile
4	4	15/10/14 17:27	4.000	call	2014-11	Tesco	mobile
…	…	…	…	…	…	…	…
825	825	13/03/15 00:38	1.000	sms	2015-03	world	world
826	826	13/03/15 00:39	1.000	sms	2015-03	Vodafone	mobile
827	827	13/03/15 06:58	34.429	data	2015-03	data	data
828	828	14/03/15 00:13	1.000	sms	2015-03	world	world
829	829	14/03/15 00:16	1.000	sms	2015-03	world	world

830 rows × 7 columns

Delete rows Containing Missing Values

1	df.dropna()

Eg:

1	df2.isnull().sum()

index            0
date            32
duration         0
item            24
month           30
network         57
network_type    31
dtype: int64

1	df3 = df2.dropna()

1	df3.isnull().sum()

index           0
date            0
duration        0
item            0
month           0
network         0
network_type    0
dtype: int64

Data deformation

Using pivot, melt, stack, unstack to reshape data

pivot

Reshape data based on column values, changing data from “long format” to “wide format”

Eg:

df = pd.DataFrame({
    'date': ['2021-01-01', '2021-01-01', '2021-01-02', '2021-01-02'],
    'variable': ['A', 'B', 'A', 'B'],
    'value': [1, 2, 3, 4]
})

	date	variable	value
0	2021-01-01	A	1
1	2021-01-01	B	2
2	2021-01-02	A	3
3	2021-01-02	B	4

1	df.pivot(index='date', columns='variable', values='value')

variable	A	B
date
2021-01-01	1	2
2021-01-02	3	4

Using ‘groupby’ for data grouping and aggregation

import pandas as pd

data = {
    'Product': ['Apple', 'Banana', 'Apple', 'Banana', 'Cherry', 'Apple'],
    'Sales': [10, 20, 15, 25, 30, 10]
}

df = pd.DataFrame(data)

Product	Sales
0	Apple	10
1	Banana	20
2	Apple	15
3	Banana	25
4	Cherry	30
5	Apple	10

1	df.groupby('Product').sum()

	Sales
Product
Apple	35
Banana	45
Cherry	30

Data merger

数据拼接 pd.concat([df1, df2], axis=0)

数据合并 pd.merge(df1, df2, on=”key”)

时间序列数据

将字符串转化为时间格式：pd.to_datetime()

时间序列数据的索引和选择

resample方法进行重新采样

数据运算与统计

- 使用apply和applymap对数据应用函数

- 使用agg进行统计运算

数据筛选与过滤

使用条件选择数据：df[df["column"] > value]
使用query方法进行筛选

数据排序

按列值排序：df.sort_values(by="column_name")

1 2	DataFrame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)

Eg:

1	df2.sort_values(by=['index','date', 'duration'],ascending=[True,True, False])

数据类型转换

通常是string 转 float 或者 interger

1	df['column_name'] = df['column_name'].astype(float)

如果包含非数字的字符串

1	df['column_name'] = pd.to_numeric(df['column_name'], errors='coerce')

Eg:

1	df2['duration'] = df2['duration'].astype(float)

1	type(df2['duration'][0])

1	numpy.float64

按索引排序

1	df.sort_index()

Data Visualization

1	import matplotlib.pyplot as plt

Basic drawing methods

plot()

df = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [1, 4, 9, 16, 25]
})
df.plot(kind='line', x='x', y='y')
plt.show()

The Kind parameter selections: line, bar, hist, box, scanner, etc.

1	df.plot(kind='scatter', x='x', y='y')

When you encounter a scatter chart, it is scattered and dense, and you can’t see the rules.

Adjusting the y-axis display range

1 2	ax = df2.plot(kind='scatter', x='date', y='duration') ax.set_ylim([0, 40]) # 设置y轴的范围

Or add a fixed value to the y of all values.

1 2	df2['adj_duration'] = df2['duration'] + 5000 df2.plot(kind='scatter',x='date', y ='adj_duration')

线图：

默认的df.plot()，用于显示按某一列排序后的值变化。

柱状图/条形图

使用kind='bar'或kind='barh'进行垂直或水平的柱状图绘制。

直方图：

用df['column'].plot(kind='hist')绘制某一列的直方图。

df2['duration'].plot(kind='hist', edgecolor='black', bins=20)
plt.title('Duration Histogram')
plt.xlabel('Duration')
plt.ylabel('Frequency')
plt.show()

饼图：

对于Series，可以使用s.plot(kind='pie')来绘制饼图。

1
2
3

duration_counts = df2['duration'].value_counts()
duration_counts.plot(kind='pie', autopct='%1.1f%%')
plt.ylabel('')

value_counts()来统计每个唯一值的数量

duration
1.000      250
34.429     122
4.000       25
3.000       19
8.000        8
          ... 
21.000       1
114.000      1
81.000       1
174.000      1
768.000      1
Name: count, Length: 196, dtype: int64

箱线图/盒须图:

使用df.plot(kind='box')来展示数值分布和异常值。

额外功能

使用secondary_y参数绘制具有两个y轴的图表。

Plotly Graphics Library

Eg:

 import plotly

import plotly.graph_objs as go
 
import numpy as np
 
N = 1000
random_x = np.random.randn(N)
random_y = np.random.randn(N)
 
#Create a trace
trace = go.Scatter(x=random_x, y=random_y, mode='markers')

data = [trace]
plotly.offline.plot(data, filename='basic-scatter.html')

`The np. random. randn (N) ‘function generates an array with a shape of’ (N,) ‘, where the elements come from a standard normal distribution (also known as a Z distribution or Gaussian distribution). A standard normal distribution is a normal distribution with a mean of 0 and a standard deviation of 1.

Interview Example of Data Processing

Read data

1	df = pd.read_csv(path)

Get the specified columns

1	df[['col1','col2']]

Draw a scatter plot for observation

1	import matplotlib.pyplot as plt

1	df.plot(kind='scatter', x = 'specified x', y ='specified y')

Operation and put the result in new column

1	df['newCol'] = df['old1'] operator df['old2']

Export Data

1	pd.to_csv(path, index=False)

Draw a normal distribution

what is norma /gaussian distribution?

Mean： The sum of all data points divided by the number of data points.

Variance： the average square value of the deviation between a data point and its mean

Standard Deviation： The standard deviation is the square root of variance

These three terms: (Mean, Variance and Standard Deviation) are used to describe the central trend and degree of dispersion of a set of data

1
2

mean
std

1	x = np.linspace(mean - 3 * std, mean + std, numOfDot)

1	plt.plot(x, stats.norm.pdf(x, mean, std))

Generate cetain amount of data according to the Normal Describution

1	from scipy import stats

1	b=stats.norm.rvs(mean,std,size= the amount of the data)

Probability of existence greater than a certain x in statistics

1	import scipy.stats as stats

1
2
3

mean = 2
std = 3
pro = 1 - stats.norm.cdf(3, loc=mean, scale=std) # Using the Cumulative Distribution Function (CDF)

Adding Two Normal Distribution Graphs

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Define means and standard deviations for two distributions
mu1, sigma1 = 0, 0.1
mu2, sigma2 = 0, 0.2

# Create a range of x values
x = np.linspace(-1, 1, 1000)

# Get the PDF values for each x for the two distributions
pdf1 = norm.pdf(x, mu1, sigma1)
pdf2 = norm.pdf(x, mu2, sigma2)

# The mean and standard deviation of the sum of two normal distributions
mu3 = mu1 + mu2
sigma3 = np.sqrt(sigma1**2 + sigma2**2)
pdf3 = norm.pdf(x, mu3, sigma3)

# Plotting
plt.plot(x, pdf1, label='N(0, 0.1)', color='blue')
plt.plot(x, pdf2, label='N(0, 0.2)', color='green')
plt.plot(x, pdf3, label='Sum', color='red', linestyle='dashed')

plt.legend()
plt.title("Adding Two Normal Distributions")
plt.xlabel("x")
plt.ylabel("Probability Density")
plt.grid(True)
plt.show()

长风破浪会有时，直挂云帆济沧海。