DDVO

Posted on 2024-06-06 Edited on 2024-06-12 In Interview

Databricks Interview Review

2024-06-05: notes from a failed interview

Topics tested: Elasticsearch and the inverted index

What is Elasticsearch?

Elasticsearch is an open-source search engine used for fast, scalable full-text search.

Elasticsearch uses an inverted index (sometimes loosely called a reverse index) to make searches fast.

When a document is indexed, its text is first split into words.

The words are then used to build the inverted index.

Index construction

For each word, record which documents it appears in and where in each document it appears.

Example

Three documents:

1: “Apple is looking at buying U.K. startup for $1 billion”

2: “Apple launches new iPhone in September”

3: “Microsoft to buy U.K. startup for $1.5 billion”

Construct the inverted index

apple -> [doc 1, doc 2]

is -> [doc 1]

looking -> [doc 1]

at -> [doc 1]

buying -> [doc 1]

u.k. -> [doc 1, doc 3]

startup -> [doc 1, doc 3]

for -> [doc 1, doc 3]

$1 billion -> [doc 1]

launches -> [doc 2]

new -> [doc 2]

iphone -> [doc 2]

in -> [doc 2]

september -> [doc 2]

microsoft -> [doc 3]

to -> [doc 3]

buy -> [doc 3]

$1.5 billion -> [doc 3]

When a user searches Elasticsearch for “Apple startup” (assuming every query word must match):

Look up each word in the inverted index.

“apple” is associated with [doc 1, doc 2].

“startup” is associated with [doc 1, doc 3].

Find the documents common to the two lists.

In this case,

doc 1 is the only common document,

so we return doc 1.
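
The example above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Elasticsearch is implemented: the names `tokenize`, `build_index`, and `search_all` are made up for this sketch, and the tokenizer is deliberately naive (lowercase, split on whitespace), so "$1" and "billion" become separate words instead of the single "$1 billion" entry in the table above.

```python
from collections import defaultdict

# The three example documents from the notes.
DOCS = {
    1: "Apple is looking at buying U.K. startup for $1 billion",
    2: "Apple launches new iPhone in September",
    3: "Microsoft to buy U.K. startup for $1.5 billion",
}

def tokenize(text):
    """Naive analyzer (an assumption): lowercase, then split on whitespace."""
    return text.lower().split()

def build_index(docs):
    """Map each word to {doc_id: [positions where the word appears]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(tokenize(text)):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def search_all(index, query):
    """Return the sorted ids of documents that contain every query word."""
    postings = [set(index.get(word, {})) for word in tokenize(query)]
    if not postings:
        return []
    return sorted(set.intersection(*postings))

index = build_index(DOCS)
print(sorted(index["apple"]))               # documents containing "apple"
print(search_all(index, "Apple startup"))   # documents containing both words
```

Storing positions as well as document ids matches the "where it appears" part of index construction; the search itself only needs the document ids.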

Time Complexity Analysis

Index construction

Elasticsearch needs to tokenize and normalize the input documents, and then update the inverted index.

Assume each document contains n words on average.

The time complexity of indexing one document is then O(n),

because each word needs to be added to the inverted index.

We can choose a hash map as the underlying structure of the inverted index, so each insertion is O(1) on average.

Searching

The cost depends on how many words the query contains.

Assume the search input contains M words in total.

For each word, we look it up in the inverted index (a hash map): O(1) on average.

We then need to find the documents common to all the lists (the values stored under the search keys in the inverted index).

Assume the average length of a list in the inverted index (the average length of the hash map's values) is K.

We need to intersect M lists: O(K * M).
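
A minimal sketch of that intersection step, assuming each list is a plain Python list of document ids. The function name `intersect_postings` is hypothetical. Each of the M lists is turned into a set once (O(K) on average), which gives the O(K * M) total.

```python
def intersect_postings(postings):
    """Intersect M lists of document ids in O(K * M) average time."""
    if not postings:
        return set()
    common = set(postings[0])
    for plist in postings[1:]:
        common &= set(plist)      # O(len(plist)) on average
        if not common:
            break                 # nothing can match any more, stop early
    return common

print(intersect_postings([[1, 2], [1, 3]]))
```

The early exit does not change the worst case, but it avoids scanning the remaining lists once the result is already empty.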

Real Question 1: Design a Data Structure for a Book Appendix

What is a book appendix?

https://books.forbes.com/blog/what-is-appendix-book/

We should design a table whose key is the id and whose value holds the course, state, date, and score.

We can use a hash map so that, given the key, we find the book's information directly.

Next, suppose we want to look up information by other fields, such as state, date, or score.

We need to build an inverted index to support that.

Take state as an example: it can take one of 50 values.

We use CA as the example.

We build a hash map whose key is the state.

E.g. Key: CA

Value: the names of the books in that state

Key: CA

Value: The Links at Spanish Bay, Pebble Beach, Spyglass Hill, La Quinta Mountain, etc.

This narrows down the search space first.

We also build another hash map (a second inverted index):

Key: CA + course name

Value: book id

First we find the names that belong to CA.

Then we use CA + name as the key to find the book's id.

Once we have the id, we can get all the information about the book from the first table.

The same logic can serve the other fields (date, score, book type).
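
A rough sketch of this design, under the assumption that each entry has exactly the fields listed above. The class name `AppendixIndex`, its method names, and the dates and scores in the sample data are all made up for illustration; only the course names come from the notes.

```python
from collections import defaultdict

class AppendixIndex:
    def __init__(self):
        self.records = {}                        # id -> full record
        self.names_by_state = defaultdict(set)   # state -> course names
        self.id_by_state_name = {}               # (state, course) -> id

    def add(self, record_id, course, state, date, score):
        self.records[record_id] = {
            "course": course, "state": state, "date": date, "score": score,
        }
        self.names_by_state[state].add(course)
        self.id_by_state_name[(state, course)] = record_id

    def get(self, record_id):
        """Direct lookup by id."""
        return self.records.get(record_id)

    def find_by_state(self, state):
        """State -> names -> (state, name) -> id -> full record."""
        names = sorted(self.names_by_state.get(state, ()))
        return [self.records[self.id_by_state_name[(state, n)]] for n in names]

appendix = AppendixIndex()
# Placeholder dates and scores, not real data.
appendix.add(1, "Pebble Beach", "CA", "2024-01-01", 90)
appendix.add(2, "Spyglass Hill", "CA", "2024-01-02", 85)
print([r["course"] for r in appendix.find_by_state("CA")])
```

An index for date or score would follow the same pattern: one more hash map from the field value to the matching ids.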

Real Question 2: Find the peers in an org structure

What is an org structure?

https://www.lucidchart.com/blog/types-of-organizational-structures

What does "peers" mean?

The people who are at the same level as you.

What information do we have?

employee name: xxx

manager name: aaa

manager id: 12

My understanding:


First, within the same department:
My peers are my manager's direct reports.
Find my manager, and then find the employees under my manager who are in the same department.

To find peers in other departments:
If my manager's manager is the top boss,

find the manager's manager, then find the employees of that person's direct reports; these include all my peers in other departments.

If the hierarchy has multiple layers,

we may need to keep walking up until we reach the root node,

counting how many steps up we take. Say we take n steps.

Then we run BFS from the root node; the elements in the nth layer below the root are the peers.
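
A minimal sketch of the walk-up-then-BFS idea. It assumes the input is a dict from each employee to their manager (with the root mapped to `None`), that names are unique, and that the hierarchy has no cycles. The function name `find_peers` and the sample org are hypothetical.

```python
from collections import defaultdict

def find_peers(manager_of, employee):
    """Return everyone at the same depth as `employee`, excluding them."""
    # Step 1: walk up to the root, counting the steps (n).
    depth = 0
    node = employee
    while manager_of.get(node) is not None:
        node = manager_of[node]
        depth += 1
    root = node

    # Step 2: invert the mapping to get manager -> direct reports.
    reports = defaultdict(list)
    for emp, mgr in manager_of.items():
        if mgr is not None:
            reports[mgr].append(emp)

    # Step 3: BFS from the root, one layer at a time, down to layer n.
    level = [root]
    for _ in range(depth):
        level = [child for parent in level for child in reports.get(parent, [])]
    return sorted(p for p in level if p != employee)

# Made-up org: ceo -> (vp_a, vp_b); vp_a -> (me, ann); vp_b -> (bob)
manager_of = {
    "ceo": None, "vp_a": "ceo", "vp_b": "ceo",
    "me": "vp_a", "ann": "vp_a", "bob": "vp_b",
}
print(find_peers(manager_of, "me"))
```

Restricting the result to the same department would only need one extra filter on the final layer, for example keeping the entries whose manager equals the employee's manager.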

Other takeaways

Why Databricks?

Data warehouse
Data lake

Databricks combines the two into a lakehouse,
which provides the functionality of both.

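There are many third-party tools on the market for data collection and processing. Why not use them?

Because we are a small company, and third-party tools are relatively expensive.

In your first job, how was user data collected on the business side?

I answered this badly. I should have explained the whole flow from start to finish, but my explanation was disorganized.

We collect user data, and based on that data we provide users with a visual wealth-analysis report.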

There are two main ways to obtain data:
1. Get it directly from the user.
2. Collect user behavior data based on the user's activities on our website.

1.1 Data obtained directly from the user helps with cold start.
It lets us quickly place each user into a broad user group, and the group gives an initial direction for building the user profile.

2.1 We have a news platform.
Users can search the platform for the news they want.
We collect the user's click frequency and the time spent on each page,
and also the user's own transactions, if the user is willing to grant access.

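Users interact with the website in the front end ---> this generates requests, which are sent to the back-end service.
For every request it handles, the back-end service writes a log entry to the log files.

A dedicated Flume agent periodically collects the log files.
Flume's role: data integration (Flume also collects data from other websites) and data pipelines.

Flume stores the collected data in Kafka.

Kafka hands the data to Spark Streaming for data cleansing, data computation, and data modeling.
The results we need are stored in the database.

Our back-end server reads the data from the database, generates the visual report, and returns it to the front end.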
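
To make the cleansing and computation stage more concrete, here is a small pure-Python stand-in. It is not Flume, Kafka, or Spark code; it only illustrates the kind of work that stage does. The log format (`user_id,page,stay_seconds`), the sample lines, and the function names are all assumptions made for this sketch.

```python
from collections import defaultdict

# Hypothetical log lines, one per page view: "user_id,page,stay_seconds".
RAW_LOGS = [
    "u1,/news/markets,30",
    "u1,/news/tech,12",
    "u2,/news/markets,45",
    "malformed line",
    "u2,/news/markets,abc",
]

def parse(line):
    """Return (user, page, seconds), or None if the line is malformed."""
    parts = line.split(",")
    if len(parts) != 3 or not parts[2].isdigit():
        return None
    return parts[0], parts[1], int(parts[2])

def aggregate(lines):
    """Per user: number of page views and total stay time in seconds."""
    stats = defaultdict(lambda: {"clicks": 0, "stay_seconds": 0})
    for line in lines:
        event = parse(line)
        if event is None:
            continue              # cleansing: drop lines we cannot parse
        user, _page, seconds = event
        stats[user]["clicks"] += 1
        stats[user]["stay_seconds"] += seconds
    return dict(stats)

stats = aggregate(RAW_LOGS)
print(stats)
```

In the real pipeline this aggregation would run continuously over the stream rather than over a fixed list, and the result would be written to the database for the report.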