
Welcome to the TapNews wiki! Video summaries

Week 2 Lecture

Node.js (npm package management)

All communication boils down to operations on resources: create (POST), delete (DELETE), update (PUT), and read (GET) -- see the sketch below.
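
As a quick, hedged illustration of that mapping (the URL and payload are hypothetical, not part of the project):

```python
# CRUD operations expressed as HTTP verbs with the `requests` library.
import requests

base = 'http://localhost:3000/news'
requests.post(base, json={'title': 'hello'})        # create
requests.get(base + '/42')                          # read
requests.put(base + '/42', json={'title': 'hi'})    # update
requests.delete(base + '/42')                       # delete
```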

Three main uses of I/O:

  • File reads and writes
  • Database interaction
  • Network communication

Node.js:

  • event emitter
  • Async non-blocking I/O: advantages & disadvantages (choose Node.js for I/O-bound workloads and multi-threading for CPU-bound workloads)
  • RESTful API (Representational State Transfer Application Programming Interface): stateless

Week 2 CodeLab 1

Git knowledge:

  • git init
  • git add .
  • git commit -m ""
  • git push
  • git checkout -b <branch name>
  • git branch // check which branch you are on
  • git log // check the commit log
  • New pull request: how to assign the review to other people.

AWS Service

  • EC2
  • code change -> git add . -> git commit -> pull request -> merge -> confirm deploy (target, script, percentage, alarm, rollback)

Week 2 CodeLab 2

  • findHref
  • process v.s. thread
  • new features in HTML5
    • simplified DOCTYPE and encoding
    • new APIs
    • CORS
    • Canvas
  • rateLimiter

Week 3 CodeLab 1

  • Cookie Based Authentication
  • Token Based Authentication (see the sketch below)
  • Login: authentication
  • Admin: authorization
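
The course's web server implements token-based auth with Node's jsonwebtoken; purely as a hedged illustration of the same idea, here is a Python sketch using PyJWT (the secret and claims are placeholders):

```python
# Token-based authentication in a nutshell: the server signs a token at login,
# the client sends it back on every request, and the server verifies it
# without keeping any session state.
import datetime
import jwt  # pip install PyJWT

SECRET = 'replace-with-a-real-secret'   # placeholder

def issue_token(user_id):
    payload = {'sub': user_id,
               'exp': datetime.datetime.utcnow() + datetime.timedelta(hours=1)}
    return jwt.encode(payload, SECRET, algorithm='HS256')

def verify_token(token):
    # raises jwt.InvalidTokenError (expired or tampered token) on failure
    return jwt.decode(token, SECRET, algorithms=['HS256'])
```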

Week 3 Hands-on Session

  • add ace editor
  • collaboration with socket.io
  • client changes cursor and text, and emit to everyone else
  • cursor with different color

Week 3 CodeLab 2

  • Interview questions
  • 1:18

Week 4 CodeLab 1

  • redis: install service
  • Observable v.s. Promise
    • promise:
      • returns a single value
      • not cancellable
    • observable
      • works with multiple values over time
      • cancellable
      • supports map, filter, reduce and similar operators
      • proposed feature for ES 2016
      • use Reactive Extensions (RxJS)
      • an array whose items arrive asynchronously over time
  • Angular life cycle hooks
  • Subject
  • Behavior subject

Week 4 Lecture

  • docker

Week 4 Hands-on Session

  • output
  • npm install --save node-rest-client in oj-server
  • sudo apt install python-pip
  • manually create requirements.txt (similar to package.json), then sudo pip install -r requirements.txt
  • install docker
  • sudo pip install docker // lets the Python server talk to the Docker daemon (see the sketch below)
  • sudo sh launcher.sh
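
For reference, a hedged sketch of talking to Docker from Python with the docker SDK installed above; the image and command are made-up examples, not necessarily what the executor server runs:

```python
import docker

client = docker.from_env()                     # connect to the local Docker daemon
output = client.containers.run(
    'python:2.7',                              # hypothetical sandbox image
    ['python', '-c', 'print("hello")'],        # command executed inside the container
    remove=True)                               # remove the container when it exits
print(output)                                  # captured stdout (bytes)
```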

Week 5 CodeLab 1

  • nginx
  • loadbalancer
  • sudo vim [filename]
  • i // insert
  • sudo service nginx restart
  • sudo ln -s [path1] [path2]

Add a load balancer in front of the frontend

  • long polling: long polling involves making an HTTP request to a server and then holding the connection open so the server can respond at a later time (see the sketch below).
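
A minimal long-polling sketch (Flask is assumed here purely for illustration; the project's own servers are Express and custom Python services):

```python
import time
from flask import Flask, jsonify

app = Flask(__name__)
pending_events = []                 # hypothetical in-memory event store

@app.route('/poll')
def poll():
    deadline = time.time() + 25     # hold the connection open for up to ~25 seconds
    while time.time() < deadline:
        if pending_events:
            return jsonify(pending_events.pop(0))   # respond as soon as data arrives
        time.sleep(0.5)             # otherwise keep the request open and wait
    return '', 204                  # timed out with nothing new; the client re-polls
```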

Week 5 Lecture

  • Static pages: every user who visits the page sees the same content
  • Angular: emphasis on business logic
  • React: emphasis on load speed for images, content, etc.

Week 5 Hands-on Session

New terminology: Babel, ESLint, webpack, materialize, enclave? "Envifying" is the process of substituting Node-specific environment variables such as process.env.NODE_ENV with actual values such as 'production'.

steps

  1. start server: npm start
  2. build for production: npm run build
  3. npm install --save materialize-css (be careful about the version)
  4. Create App -> App.js & APP -> App.css
    • className
    • export default APP (if there is no default export, add {}, like import { APP } from '')
    • {} add javascript variable
    • index.js --entry point of the program
  5. Create NewsPanel -> NewsPanel.js
    • super()
    • componentDidMount() -- executes immediately after the component has finished loading (mounted)
    • digest : news MD5 hashcode
    • map --for loop
    • key: required for list
  6. Create NewsCard -> NewsCard.js
  7. Continuous loading
    • create server : express generator
    • ~ sudo npm install -g express-generator
    • ~ express server
    • ~ npm install
    • app.js
    • set __dirname
    • create index.js in routes folder
    • package.json change to nodemon
    • npm start
    • create news.js router -- remember to export
    • app.all('*', function(req, res, next) { // Access-control-allow-origin res.header () })
    • NewsPanel.js - handleScroll
    • debounce: https://css-tricks.com/debouncing-throttling-explained-example

structure

Week 6 CodeLab 1

  • Babel
  • brew install htop
  • ps aux
  • express req.query: URL parameters, e.g. ?property=value

Week 6 CodeLab 2

structure

  • debug tools
  • cookie
    • bind to domain
    • Browser sends an HTTP GET request; the server responds with Set-Cookie; the browser then includes the cookie in subsequent requests
    • use cookie to implement session
    • client-sessions : implement sessions in encrypted cookies
    • localStorage will replace cookie to do authentication
    • Never store passwords directly in a table; use a hash function. SHA-1 / MD5 (MD5 has been broken, do not use it for passwords). Plain SHA-1 password hashes can be looked up in reverse (rainbow table attack) and are subject to collisions, so use SHA-1 with a SALT (see the sketch after this list).
  • destructuring assignment
  • $ = jQuery
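
A hedged sketch of the salted-hash idea above (concept only; the project's Node server uses bcrypt, which handles salting for you):

```python
import binascii
import hashlib
import os

def hash_password(password):
    salt = binascii.hexlify(os.urandom(16))       # random per-user salt
    digest = hashlib.sha1(salt + password.encode('utf-8')).hexdigest()
    return salt, digest                           # store both columns, never the plain password

def verify_password(password, salt, stored_digest):
    candidate = hashlib.sha1(salt + password.encode('utf-8')).hexdigest()
    return candidate == stored_digest             # same salt + same password -> same digest
```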

steps

  • cp -r week5 week6-codelab2
  • Create the LoginPage.js component (separate the UI from the JS logic); the processForm and changeUser function definitions are passed to LoginForm and saved in props
  • Create LoginForm.js; PropTypes do typechecking; the component can be a plain function if it has no state
  • Add Materialize JS
  • import jQuery in index.html to avoid conflicts with the virtual DOM
  • Create SignUp -> SignUpPage.js & SignUpForm.js; this.setState refreshes the UI; PropTypes are required

Week 6 Hands-on Session

  • week6 folder: implement jsonrpc, pymongo, and AMQPs
  • mkdir backend_server, vim service.py
  • install and import python-jsonrpc: sudo pip install python-jsonrpc
  • RequestHandler, @pyjsonrpc.rpcmethod, def function
  • """ or # as comment
  • ThreadingHttpServer
  • sudo killall python
  • npm install --save jayson : send rpc request
  • rpc_client
  • rpc_client_test.js
  • mongoexport
  • mongoimport --db test --collection news --drop --file week6_demo_news.json; then show dbs to verify
  • sudo pip install -r requirements.txt
  • utils > mongodb_client.py
  • default mongo port 27017
  • getDB (class or instance name convention), get_db static or singleton
  • mongodb_client_test.py
  • Register for a CloudAMQP account
  • install pika
  • create cloudAMQP_client.py (see the sketch after this list)
  • PEP8, autoPEP8
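
A hedged sketch of what cloudAMQP_client.py might look like with pika; the AMQP URL and queue name are placeholders, and the real client in the repo may differ:

```python
import json
import pika

class CloudAMQPClient(object):
    def __init__(self, cloud_amqp_url, queue_name):
        self.connection = pika.BlockingConnection(pika.URLParameters(cloud_amqp_url))
        self.channel = self.connection.channel()
        self.queue_name = queue_name
        self.channel.queue_declare(queue=queue_name)

    def send_message(self, message):
        self.channel.basic_publish(exchange='',
                                   routing_key=self.queue_name,
                                   body=json.dumps(message))

    def get_message(self):
        method_frame, header_frame, body = self.channel.basic_get(self.queue_name)
        if method_frame:
            self.channel.basic_ack(method_frame.delivery_tag)
            return json.loads(body)
        return None

    def sleep(self, seconds):
        # process data events so the broker does not drop the idle connection
        self.connection.sleep(seconds)
```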

Week 7 CodeLab 1

  • package-lock.json
  • React router
  • arrow function: this binding
  • pass value between jsx ...

Week 7 CodeLab 2

  • implement Authentication
  • web_server -> client -> src
  • Auth
  • Base > children ???
  • Use Link to (sign up, login page)
  • React Router Context (can be replaced by Redux)
  • npm install --save react-router@"<4.0.0"
  • Set up react route
  • web_server -> server
  • install cors
  • config.json (javascript can read json directly)
  • mongoose model : user.js and main.js, update app.js
  • bcrypt -> hash password and save to db
  • passport folder -> passport strategy -> npm install --save passport, passport-local, jsonwebtoken
  • login_passport.js and signup_passport.js
  • trim: remove the white spaces
  • auth_checker
  • npm install body-parser, validator, routes/auth.js
  • add auth to header

Week 7 Hands-on Session

  • mv backend_server/utils/* common/ -- move shared utilities to an outer common folder
  • add requests to requirements.txt, sudo pip install -r requirements.txt
    • Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.
    • There's no need to manually add query strings to your URLs, or to form-encode your POST data.
  • Create common -> news_api_client.py, news_api_client_test.py
    • getNewsFromSource(sources, sortBy): use NewsAPI to fetch news metadata
    • NewsAPI: get the latest news metadata
    • Structure of a single news item: source (added by the code), url, title, description, publishedAt, urlToImage, digest (MD5 hash)
  • Create news_pipeline -> news_monitor.py: monitors whether new news has appeared
    • sudo pip install redis : Use Redis to check if the news has been processed before or not.
    • while True: keep fetching news from the news sources in a loop
    • news_digest = MD5 hash of the news title (a string); check whether it is already in Redis (saves space and speeds up processing)
    • if it is not in Redis, store news_digest in Redis with an expiration time and send the task to the scraper queue
    • set a sleep interval and call CloudAMQPClient's sleep so the connection does not get dropped
  • Create news_pipeline -> queue_helper.py : tools to help clear the message queues
  • Create news_pipeline -> scrapers -> cnn_news_scraper.py, cnn_news_scraper_test.py
    • each source needs to have a separate scraper (we can avoid this by using newspaper)
    • getHeader() : add user agents to avoid being blocked
    • extract_news(news_url): use the scraper and XPath to extract the full article text
    • xpath helper
  • Create news_pipeline -> news_fetcher.py
    • while True: keep pulling news metadata in a loop
    • scrape_news_queue_client.getMessage(): for each fetched news item, task = msg
    • handle_message(msg): extract the article body from the URL and add it back into the news metadata object: task['content'] = extractedContent
    • dedupe_news_queue_client.sendMessage(task): send to the dedupe queue (different from the scrape queue), ready to be written to the DB
  • Update queue_helper.py : Add a helper to clear the dedupe queue
  • Create news_pipeline -> news_deduper.py: remove duplicate news based on similarity
    • sudo pip install python-dateutil (or add it to requirements.txt): used to find news published on the same day
    • sudo pip install sklearn: used to compute similarity
    • tf_idf_deduper_test_1.py, tf_idf_deduper_test_2.py: test the sklearn functionality for judging the similarity of two sentences
    • while True: keep pulling news from the dedupe queue
    • find the already-stored news from the same day and insert the news being checked as row 0
    • use TfidfVectorizer to build the similarity matrix and check whether row 0 contains any similarity > 0.9 (skipping entry 0, which is the news compared with itself)
  • Install and use newspaper 0.0.9.8 for Python 2
    • replace the previous scraper
    • no need to check the news source or extract news with XPath; the Article class does the work
    • add more news sources
  • Edit and run sh news_pipline_launcher.sh
    • when it will not shut down, you can use killall python
    • Alternative: make it executable with chmod +x news_pipeline_launcher.sh, run sudo ./news_pipline_launcher.sh, and press Enter to cancel

Note:

Week 8 CodeLab 2

  • Pagination

  • Preference Model

  • Log processor

  • Preparation

    • cp -r week7 week8-codelab2

Adding and handling pagination requests

  • backend_server -> service.py: add the getNewsSummariesForUser method
  • add operations.py to handle pagination requests (see the sketch after this list)
    • getNewsSummariesForUser(user_id, page_num): add page_num and let the backend find and return the corresponding page
      • convert page_num to an int
      • begin_index is inclusive, end_index is exclusive
    • check whether Redis already has the requested data; if it does:
      • Redis can only store strings, so use pickle.loads to convert back into a JSON-like dictionary
      • slice the corresponding news_digests with begin_index and end_index
      • fetch the matching list from MongoDB with 'digest': {'$in': sliced_news_digests}; remember to wrap the cursor in list()
    • if Redis has no data for this user, e.g. the first time a user logs in:
      • sort by publishedAt in reverse order with .sort([('publishedAt', -1)]) and take 100 items into total_news with .limit(NEWS_LIMIT)
      • take total_news_digests from total_news: map(lambda x: x['digest'], total_news)
      • store it in Redis and set an expiration time
      • return total_news[begin_index:end_index]; total_news is stored as a pickled object, so remember to use pickle.dumps
    • if publishedAt is today, add a 'today' time label
    • operations_test.py
  • NewsPanel.js: add page-number handling
    • this.state = { news: null, pageNum: 1, loadedAll: false };
    • loadMoreNews(): update the URL (add userId and pageNum); use encodeURIComponent so a '/' does not break the path
    • update setState
  • routes -> news.js
    • 'add :userId and :pageNum'
    • rpc_client : add getNewsSummariesForUser() and update rpc_client_test.py
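
A hedged sketch of the pagination flow described above; the constants, the Mongo/Redis connections, and the helper names are assumptions for illustration and may differ from the real operations.py:

```python
import pickle

import redis
from pymongo import MongoClient

NEWS_TABLE_NAME = 'news'
NEWS_LIST_BATCH_SIZE = 10                 # items per page
NEWS_LIMIT = 100                          # preload this many recent news per user
USER_NEWS_TIME_OUT_IN_SECONDS = 600

redis_client = redis.StrictRedis('localhost', 6379)
db = MongoClient('localhost', 27017)['tap-news']

def getNewsSummariesForUser(user_id, page_num):
    page_num = int(page_num)
    begin_index = (page_num - 1) * NEWS_LIST_BATCH_SIZE   # inclusive
    end_index = page_num * NEWS_LIST_BATCH_SIZE           # exclusive

    if redis_client.get(user_id) is not None:
        # Redis only stores strings, so unpickle back into a Python list.
        total_news_digests = pickle.loads(redis_client.get(user_id))
        sliced_news_digests = total_news_digests[begin_index:end_index]
        sliced_news = list(db[NEWS_TABLE_NAME].find(
            {'digest': {'$in': sliced_news_digests}}))
    else:
        # First request from this user: load the latest news and cache the digests.
        total_news = list(db[NEWS_TABLE_NAME].find()
                          .sort([('publishedAt', -1)]).limit(NEWS_LIMIT))
        total_news_digests = list(map(lambda x: x['digest'], total_news))
        redis_client.set(user_id, pickle.dumps(total_news_digests))
        redis_client.expire(user_id, USER_NEWS_TIME_OUT_IN_SECONDS)
        sliced_news = total_news[begin_index:end_index]
    return sliced_news
```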

Logging user clicks to implement news recommendation

  • preference model
    • Time Decay Model: the more recent an event, the larger its weight and the more it affects the model (favor the new over the old)
    • Click Log Processor: records user click behavior for analysis and prediction
    • SOA architecture: the backend server must not access the database directly; it must go through a service
  • NewsCard.js: send a click log from redirectToURL
    • sendClickLog(): send a POST request with userId + news_digest
  • rpc_client.js : logNewsClickForUsers(user_id, news_id)
    • export
    • test
  • routes -> news.js
    • add router.post to handle the click URL
  • backend_server -> service.py
    • call operations.logNewsClickForUser(user_id, news_id); send the click to the queue and also save a copy to the DB
  • Create news_recommendation_service -> click_log_processor.py
    • put # -*- coding: utf-8 -*- on the first line, because the comments contain non-ASCII characters (an alpha symbol); without it the file fails to run
    • fetch clicks from the queue, determine the news class, and update the class probability according to the time decay model (see the sketch after this list)
    • create news_classes.py listing all the class names
    • create the test click_log_processor_test.py
  • Create recommendation_service.py: rank the news classes by weight and return them sorted by weight
    • stored floating-point values may carry rounding error, so use isclose to check whether two floats are equal
    • it can be tested via rpc_client_test.py
  • Create common -> news_recommendation_service_client.py
    • called from operations.py: after the requested news is fetched, get the user's preference, take the top-ranked class, and mark news of that class as recommended
  • Run the app: npm run build, npm start, python service.py, python recommendation_service.py
  • Run the click log processor to listen for clicks: python click_log_processor.py
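
A hedged sketch of one possible time-decay update applied when a click is processed; the decay rate and the exact rule are assumptions for illustration, not necessarily the course's formula:

```python
ALPHA = 0.1   # decay rate: how strongly one new click shifts the preference model

def update_preference(preference, clicked_class):
    """preference maps class name -> probability; the values sum to 1."""
    for news_class, prob in preference.items():
        if news_class == clicked_class:
            preference[news_class] = (1 - ALPHA) * prob + ALPHA   # clicked class gains weight
        else:
            preference[news_class] = (1 - ALPHA) * prob           # everything else decays
    return preference   # the probabilities still sum to 1

# Example: start from a uniform model over three classes and click 'Sports'.
model = {c: 1.0 / 3 for c in ['Sports', 'Politics', 'Technology']}
print(update_preference(model, 'Sports'))
# When comparing the stored floats later, use a tolerance (isclose) rather than ==.
```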

Note:

  • Redis is used together with pickle
  • A service-oriented architecture is essentially a collection of services. These services communicate with each other.
  • MongoDB tables: news, click_logs, user_preference_model

Week 8 Lecture

  • Machine learning basics
    • Linear regression: we need a line that generalizes well to new, previously unseen data points drawn from the same distribution
    • Loss: Loss is the penalty for an incorrect prediction
      • Foreshadowing: we can also incorporate penalty for model complexity into loss term
    • L2 Loss (also called squared error): square of the difference between prediction and label
    • Gradient descent: using gradient to get closer to the minimum loss
    • Gradient step: step taken to the minimum loss
    • Step size: how big a gradient step to take
    • Learning rate: the "length" of each gradient step, is the parameter that controls step size
    • Step size = learning rate * gradient
    • Initialization of weight w (one minimum v.s. more than one minimum)
    • Gradient descent, Mini-batch descent, Stochastic gradient descent
  • Generalization
    • Generalization refers to a Machine Learning model's ability to perform well on new unseen data rather than just the data that it was trained on
    • Training Set: used for training
    • Test Set: used for double-checking your eval after you think you've found your best model
    • Validation Set: prevent from potentially overfitting to test data
  • Regularization
    • Regularization is a technique used in an attempt to solve the overfitting problem in statistical models
    • minimize: Loss(Data|Model) + complexity(Model)
  • Representation
    • Representation is how to map characteristics of data into features.
    • Real valued features can be copied over directly, string features can be handled with one-hot encoding
    • Qualities of good features
      • Feature values should appear with non-zero value more than a small handful of times in the dataset
      • Features should have a clear, obvious meaning
      • Features shouldn't take on "magic" values
      • The definition of a feature shouldn't change over time
    • Long tail
    • Binning
  • Neural Networks
    • Non-Linear Transformation: activation function
  • TensorFlow
    • TensorFlow is deep learning library open-sourced by Google, it provides primitives for defining functions on tensors and automatically computing their derivative
      • A tensor is a typed multi-dimensional array
      • eval(): value evaluated only after calling eval()
      • A session object encapsulates the environment in which Tensor objects are evaluated
      • construction phase, execution phase
      • use tf.Variable() to declare a variable; use tf.global_variables_initializer() to initialize variables
      • use tf.placeholder and feed_dict to feed external data into the computation graph (see the sketch after this list)
  • TensorFlow Serving
    • tf.train.Saver() : save the current value of all variables in the computation graph
    • tf.contrib.session_bundle.exporter.Exporter : saves a "snapshot" of the trained model
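
A minimal TensorFlow 1.x sketch tying several of the ideas above together (L2 loss, gradient descent, tf.Variable, placeholders with feed_dict, global_variables_initializer, and a Session); the data and learning rate are made up:

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x API assumed

x = tf.placeholder(tf.float32, shape=[None])
y = tf.placeholder(tf.float32, shape=[None])
w = tf.Variable(0.0)
b = tf.Variable(0.0)

prediction = w * x + b
loss = tf.reduce_mean(tf.square(prediction - y))                      # L2 loss
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(loss)   # step = learning rate * gradient

with tf.Session() as sess:                       # execution phase
    sess.run(tf.global_variables_initializer())
    xs = np.array([1.0, 2.0, 3.0, 4.0])
    ys = 2.0 * xs + 1.0                          # the "true" line we want to recover
    for _ in range(500):
        sess.run(train_step, feed_dict={x: xs, y: ys})
    print(sess.run([w, b]))                      # should approach [2.0, 1.0]
```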

Week 8 Hands-on Session

NLP

  • get features and tag
  • process vocabulary as vectors
  • build model: CNN layer 1 -> apply n filters on the input sequence; CNN layer 2 -> max across each filter to get useful features for classification (see the sketch below)
  • train and predict
  • evaluate the model
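
A hedged sketch of the convolution-then-max idea above using TensorFlow 1.x layers; the vocabulary size, dimensions, and class count are invented, and this is not the course's exact model:

```python
import tensorflow as tf   # TensorFlow 1.x API assumed

VOCAB_SIZE, EMBED_DIM = 10000, 50
N_FILTERS, WINDOW_SIZE = 10, 3
MAX_DOCUMENT_LENGTH, N_CLASSES = 100, 17

word_ids = tf.placeholder(tf.int32, shape=[None, MAX_DOCUMENT_LENGTH])
labels = tf.placeholder(tf.int32, shape=[None])

embedded = tf.contrib.layers.embed_sequence(word_ids, VOCAB_SIZE, EMBED_DIM)
conv = tf.layers.conv1d(embedded, filters=N_FILTERS, kernel_size=WINDOW_SIZE,
                        activation=tf.nn.relu)     # n filters over the word sequence
pooled = tf.reduce_max(conv, axis=1)               # max across time for each filter
logits = tf.layers.dense(pooled, N_CLASSES)        # classification head
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
```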

Week7 Notes and Thinking

Week 7 Hands-on Session

Let's get started with week 7 by watching the video, reading the PDF, and coding it ourselves. This part is about setting up the pipeline, getting news URIs from NewsAPI, scraping news text from BBC, CNN, etc., and dumping it into MongoDB for week 8's machine learning. This is the architecture of the news pipeline.

Contents

  • Overview
  • Prerequisites
  • Step 1: News Monitor
  • Step 2: News Scrapers
  • Step 3: News Deduper
  • Next Steps
  • Thinking

Overview

This covers the week 7 operational steps for building the pipeline, including prerequisites and core steps.

Prerequisites

NewsAPI

  • use Postman to test NewsAPI [url + article]

  • time class source digest

  • the structure of single news

title:string - news title
description:string - news description
text:string - news text
url:string - news page url
author:string - news author
source:string - news source
publishedAt:date - published date
urlToImage:string - news image url
class:string - news category
digest:string - news MD5 digest

Refactor file dir

  • mv backend_server/utils/* common/ (the common/ folder lives under week7/)
  • /week7/common contains the commonly used components and their tests (+ marks the files newly added this week):

    component [under /week7/common]     test
    cloud_amqp_client.py                cloud_amqp_client_test.py
    mongodb_client.py                   mongodb_client_test.py
    + news_api_client.py                + news_api_client_test.py

Request package

  • add requests to requirements.txt,

sudo pip install -r requirements.txt

  • Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.
  • No need to manually add query strings to your URLs, or to form-encode your POST data.

Create common -> news_api_client.py, news_api_client_test.py

  • getNewsFromSource(sources, sortBy): use NewsAPI to fetch news metadata (see the sketch below)
  • NewsAPI: get the latest news metadata
  • Structure of a single news item: source (added by the code), url, title, description, publishedAt, urlToImage, digest (MD5 hash)
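
A hedged sketch of getNewsFromSource with requests; the endpoint and parameters follow the 2017-era NewsAPI v1, and the API key is a placeholder:

```python
import hashlib
import requests

NEWS_API_ENDPOINT = 'https://newsapi.org/v1/articles'
NEWS_API_KEY = 'YOUR_NEWS_API_KEY'   # placeholder

def getNewsFromSource(sources=('cnn',), sortBy='top'):
    articles = []
    for source in sources:
        payload = {'apiKey': NEWS_API_KEY, 'source': source, 'sortBy': sortBy}
        response = requests.get(NEWS_API_ENDPOINT, params=payload)
        res_json = response.json()
        if res_json is not None and res_json.get('status') == 'ok':
            for news in res_json.get('articles', []):
                news['source'] = source                          # added by the code
                news['digest'] = hashlib.md5(
                    news['title'].encode('utf-8')).hexdigest()   # MD5 digest of the title
                articles.append(news)
    return articles
```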

Core Steps:

Location: week7/news_pipeline

Step1: News Monitor

  • Create news_pipeline -> news_monitor.py: monitors whether new news has appeared

  • Send URI to queue, maintain Redis as hashset to dedupe same URI, Sleep

  • ? sys os path

  • sudo pip install redis : Use Redis to check if the news has been processed before or not.

  • while True: keep fetching news from the news sources in a loop

  • news_digest = MD5 hash of the news title (a string); check whether it is already in Redis (saves space and speeds up processing) -- see the sketch at the end of this step

  • if it is not in Redis, store news_digest in Redis with an expiration time and send the task to the scraper queue

  • set a sleep interval and call CloudAMQPClient's sleep so the connection does not get dropped

  • Create news_pipeline -> queue_helper.py

  • tools to help clear MQ
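
A hedged sketch of the monitor loop; news_api_client and scrape_news_queue_client are the project's own helpers from common/ and are assumed here rather than re-implemented:

```python
import hashlib

import redis

SLEEP_TIME_IN_SECONDS = 10
NEWS_TIME_OUT_IN_SECONDS = 3600 * 24 * 3      # remember digests for three days

redis_client = redis.StrictRedis('localhost', 6379)

while True:
    news_list = news_api_client.getNewsFromSource()        # assumed helper from common/
    for news in news_list:
        news_digest = hashlib.md5(news['title'].encode('utf-8')).hexdigest()
        if redis_client.get(news_digest) is None:          # never seen before
            news['digest'] = news_digest
            redis_client.set(news_digest, news['title'])
            redis_client.expire(news_digest, NEWS_TIME_OUT_IN_SECONDS)
            scrape_news_queue_client.sendMessage(news)     # assumed CloudAMQP client
    scrape_news_queue_client.sleep(SLEEP_TIME_IN_SECONDS)  # keeps the connection alive
```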

Step2: News Scraper

  • Create news_pipeline -> scrapers -> cnn_news_scraper.py, cnn_news_scraper_test.py

  • each source needs to have a separate scraper (we can avoid this by using newspaper)

  • getHeader() : add user agents to avoid blocking

  • extract_news(news_url): use the scraper and XPath to extract the full article text

  • xpath helper

  • use a session to imitate a browser

  • Golden test: assert against an expected result, e.g. whether the extracted text contains a certain string

  • Create news_pipeline -> news_fetcher.py

  • while True: keep pulling news metadata in a loop

  • scrape_news_queue_client.getMessage(): for each fetched news item, task = msg

  • handle_message(msg): extract the article body from the URL and add it back into the news metadata object: task['content'] = extractedContent (see the sketch at the end of this step)

  • dedupe_news_queue_client.sendMessage(task): send to the dedupe queue (different from the scrape queue), ready to be written to the DB

  • Update queue_helper.py : Add a helper to clear the dedupe queue
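
A hedged sketch of the fetcher loop; the scraper module and the two queue clients are the project's own helpers and are assumed here:

```python
SLEEP_TIME_IN_SECONDS = 5

def handle_message(msg):
    if msg is None or not isinstance(msg, dict):
        return
    task = msg
    text = cnn_news_scraper.extract_news(task['url'])   # full article text for the URL
    task['content'] = text                              # add the body back onto the metadata
    dedupe_news_queue_client.sendMessage(task)          # hand off to the dedupe queue

while True:
    msg = scrape_news_queue_client.getMessage()          # one task per scraped news item
    if msg is not None:
        handle_message(msg)
    scrape_news_queue_client.sleep(SLEEP_TIME_IN_SECONDS)
```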

Step3: News Deduper

  • Create news_pipeline -> news_deduper.py: remove duplicate news based on similarity

  • sudo pip install python-dateutil (or add it to requirements.txt): used to find news published on the same day

  • sudo pip install sklearn: used to compute similarity

  • tf_idf_deduper_test_1.py, tf_idf_deduper_test_2.py: test the sklearn functionality for judging the similarity of two sentences

  • while True: keep pulling news from the dedupe queue

  • find the already-stored news from the same day and insert the news being checked as row 0

  • use TfidfVectorizer to build the similarity matrix and check whether row 0 contains any similarity > 0.9 (skipping entry 0, which is the news compared with itself) -- see the sketch below

  • tf is based on sklearn
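
A hedged sketch of the similarity check with scikit-learn; the threshold and texts are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

SIMILARITY_THRESHOLD = 0.9

def is_duplicate(candidate_text, same_day_texts):
    documents = [candidate_text] + same_day_texts    # candidate goes in row 0
    tfidf = TfidfVectorizer().fit_transform(documents)
    pairwise_sim = tfidf * tfidf.T                   # cosine similarity (rows are L2-normalized)
    _, cols = pairwise_sim.shape
    for col in range(1, cols):                       # skip column 0 (self-similarity)
        if pairwise_sim[0, col] > SIMILARITY_THRESHOLD:
            return True
    return False

print(is_duplicate('storm hits the coast overnight',
                   ['storm hits the coast overnight, officials say',
                    'sports results from last night']))
```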

Next Steps

Requirement and Run

  • Install and use newspaper 0.0.9.8 for Python 2

  • replace the previous scraper

  • don't need to check on news source, don't need to extract news using xpath, article will do the work

  • add more news source

  • Edit and run sh news_pipline_launcher.sh

  • when it will not shut down, you can use killall python

  • Alternative: make it executable with chmod +x news_pipeline_launcher.sh, run sudo ./news_pipline_launcher.sh, and press Enter to cancel

  • If it runs on a server, schedule a recurring job (e.g. cron) to run the shell script every day

Note:

Common Sense

Project

  • Why does the queue client need to sleep? ---> to keep the connection active
  • local DB, fully under our control
  • 4 queue client instances, 2 queues, 3 components

Commands

  • to clear redis cache: redis-cli flushall, exit
  • sudo service redis_6379 start; sudo service mongod start
  • kill $(jobs -p)

MongoDB command

mongo
show dbs
use tap-news
db["news-test"].find().count()

Tools: XPath Helper (Chrome extension)

Thinking

Objective

Latest News <- Scraper <- Seed URI

**How to turn the idea of a pipeline into a product?**

**Def. **

URI: find an easy way to get the latest news URIs via NewsAPI <- old driver

News: the latest news, with no duplicates in either URI or content <- common sense

Get seed URI:

Options: BFS to get seeds, DFS to get seeds, or use an API; since news is real-time, we use NewsAPI