
Welcome to the TapNews wiki! Video summaries

Week 2 Lecture

Node.js (npm package management)

All communication boils down to operations on resources: create (POST), delete (DELETE), update (PUT), and read (GET) -- see the sketch below.
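
As a quick, hedged illustration of that mapping (the URL and payload are hypothetical, not part of the project):

```python
# CRUD operations expressed as HTTP verbs with the `requests` library.
import requests

base = 'http://localhost:3000/news'
requests.post(base, json={'title': 'hello'})        # create
requests.get(base + '/42')                          # read
requests.put(base + '/42', json={'title': 'hi'})    # update
requests.delete(base + '/42')                       # delete
```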

Three main uses of I/O:

  • File reads and writes
  • Database interaction
  • Network communication

Node.js:

  • event emitter
  • Async non-blocking I/O: advantages & disadvantages (choose Node.js for I/O-bound workloads and multi-threading for CPU-bound workloads)
  • RESTful API (Representational State Transfer Application Programming Interface): stateless

Week 2 CodeLab 1

Git knowledge:

  • git init
  • git add .
  • git commit -m ""
  • git push
  • git checkout -b <branch name>
  • git branch // check which branch you are on
  • git log // check the commit log
  • New pull request: how to assign the review to other people.

AWS Service

  • EC2
  • code change -> git add . -> git commit -> pull request -> merge -> confirm deploy (target, script, percentage, alarm, rollback)

Week 2 CodeLab 2

  • findHref
  • process v.s. thread
  • new features in HTML5
    • simplified DOCTYPE and encoding
    • new APIs
    • CORS
    • Canvas
  • rateLimiter

Week 3 CodeLab 1

  • Cookie Based Authentication
  • Token Based Authentication (see the sketch below)
  • Login: authentication
  • Admin: authorization
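
The course's web server implements token-based auth with Node's jsonwebtoken; purely as a hedged illustration of the same idea, here is a Python sketch using PyJWT (the secret and claims are placeholders):

```python
# Token-based authentication in a nutshell: the server signs a token at login,
# the client sends it back on every request, and the server verifies it
# without keeping any session state.
import datetime
import jwt  # pip install PyJWT

SECRET = 'replace-with-a-real-secret'   # placeholder

def issue_token(user_id):
    payload = {'sub': user_id,
               'exp': datetime.datetime.utcnow() + datetime.timedelta(hours=1)}
    return jwt.encode(payload, SECRET, algorithm='HS256')

def verify_token(token):
    # raises jwt.InvalidTokenError (expired or tampered token) on failure
    return jwt.decode(token, SECRET, algorithms=['HS256'])
```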

Week 3 Hands-on Session

  • add ace editor
  • collaboration with socket.io
  • client changes cursor and text, and emit to everyone else
  • cursor with different color

Week 3 CodeLab 2

  • Interview questions
  • 1:18

Week 4 CodeLab 1

  • redis: install service
  • Observable v.s. Promise
    • promise:
      • returns a single value
      • not cancellable
    • observable
      • works with multiple values over time
      • cancellable
      • supports map, filter, reduce and similar operators
      • proposed feature for ES 2016
      • use Reactive Extensions (RxJS)
      • an array whose items arrive asynchronously over time
  • Angular life cycle hooks
  • Subject
  • Behavior subject

Week 4 Lecture

  • docker

Week 4 Hands-on Session

  • output
  • npm install --save node-rest-client in oj-server
  • sudo apt install python-pip
  • manually create requirements.txt (similar to package.json), then sudo pip install -r requirements.txt
  • install docker
  • sudo pip install docker // lets the Python server talk to the Docker daemon (see the sketch below)
  • sudo sh launcher.sh
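
For reference, a hedged sketch of talking to Docker from Python with the docker SDK installed above; the image and command are made-up examples, not necessarily what the executor server runs:

```python
import docker

client = docker.from_env()                     # connect to the local Docker daemon
output = client.containers.run(
    'python:2.7',                              # hypothetical sandbox image
    ['python', '-c', 'print("hello")'],        # command executed inside the container
    remove=True)                               # remove the container when it exits
print(output)                                  # captured stdout (bytes)
```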

Week 5 CodeLab 1

  • nginx
  • loadbalancer
  • sudo vim [filename]
  • i // insert
  • sudo service nginx restart
  • sudo ln -s [path1] [path2]

Add a load balancer in front of the frontend

  • long polling: long polling involves making an HTTP request to a server and then holding the connection open so the server can respond at a later time (see the sketch below).
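
A minimal long-polling sketch (Flask is assumed here purely for illustration; the project's own servers are Express and custom Python services):

```python
import time
from flask import Flask, jsonify

app = Flask(__name__)
pending_events = []                 # hypothetical in-memory event store

@app.route('/poll')
def poll():
    deadline = time.time() + 25     # hold the connection open for up to ~25 seconds
    while time.time() < deadline:
        if pending_events:
            return jsonify(pending_events.pop(0))   # respond as soon as data arrives
        time.sleep(0.5)             # otherwise keep the request open and wait
    return '', 204                  # timed out with nothing new; the client re-polls
```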

Week 5 Lecture

  • Static pages: every user who visits the page sees the same content
  • Angular: emphasis on business logic
  • React: emphasis on load speed for images, content, etc.

Week 5 Hands-on Session

New terminology: Babel, ESLint, webpack, materialize, enclave? "Envifying" is the process of substituting Node-specific environment variables such as process.env.NODE_ENV with actual values such as 'production'.

steps

  1. start server: npm start
  2. build for production: npm run build
  3. npm install --save materialize-css (be careful about the version)
  4. Create App -> App.js & APP -> App.css
    • className
    • export default APP (if there is no default export, add {}, like import { APP } from '')
    • {} add javascript variable
    • index.js --entry point of the program
  5. Create NewsPanel -> NewsPanel.js
    • super()
    • componentDidMount() -- executes immediately after the component has finished loading (mounted)
    • digest : news MD5 hashcode
    • map --for loop
    • key: required for list
  6. Create NewsCard -> NewsCard.js
  7. Continuous loading
    • create server : express generator
    • ~ sudo npm install -g express-generator
    • ~ express server
    • ~ npm install
    • app.js
    • set __dirname
    • create index.js in routes folder
    • package.json change to nodemon
    • npm start
    • create news.js router -- remember to export
    • app.all('*', function(req, res, next) { // Access-control-allow-origin res.header () })
    • NewsPanel.js - handleScroll
    • debounce: https://css-tricks.com/debouncing-throttling-explained-example

structure

Week 6 CodeLab 1

  • Babel
  • brew install htop
  • ps aux
  • express req.query: URL parameters, e.g. ?property=value

Week 6 CodeLab 2

structure

  • debug tools
  • cookie
    • bind to domain
    • Browser sends an HTTP GET request; the server responds with Set-Cookie; the browser then includes the cookie in subsequent requests
    • use cookie to implement session
    • client-sessions : implement sessions in encrypted cookies
    • localStorage will replace cookie to do authentication
    • Never store passwords directly in a table; use a hash function. SHA-1 / MD5 (MD5 has been broken, do not use it for passwords). Plain SHA-1 password hashes can be looked up in reverse (rainbow table attack) and are subject to collisions, so use SHA-1 with a SALT (see the sketch after this list).
  • destructuring assignment
  • $ = jQuery
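
A hedged sketch of the salted-hash idea above (concept only; the project's Node server uses bcrypt, which handles salting for you):

```python
import binascii
import hashlib
import os

def hash_password(password):
    salt = binascii.hexlify(os.urandom(16))       # random per-user salt
    digest = hashlib.sha1(salt + password.encode('utf-8')).hexdigest()
    return salt, digest                           # store both columns, never the plain password

def verify_password(password, salt, stored_digest):
    candidate = hashlib.sha1(salt + password.encode('utf-8')).hexdigest()
    return candidate == stored_digest             # same salt + same password -> same digest
```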

steps

  • cp -r week5 week6-codelab2
  • Create the LoginPage.js component (separate the UI from the JS logic); the processForm and changeUser function definitions are passed to LoginForm and saved in props
  • Create LoginForm.js; PropTypes do typechecking; the component can be a plain function if it has no state
  • Add Materialize JS
  • import jQuery in index.html to avoid conflicts with the virtual DOM
  • Create SignUp -> SignUpPage.js & SignUpForm.js; this.setState refreshes the UI; PropTypes are required

Week 6 Hands-on Session

  • week6 folder: implement jsonrpc, pymongo, and AMQPs
  • mkdir backend_server, vim service.py
  • install and import python-jsonrpc: sudo pip install python-jsonrpc
  • RequestHandler, @pyjsonrpc.rpcmethod, def function
  • """ or # as comment
  • ThreadingHttpServer
  • sudo killall python
  • npm install --save jayson : send rpc request
  • rpc_client
  • rpc_client_test.js
  • mongoexport
  • mongoimport --db test --collection news --drop --file week6_demo_news.json; then show dbs to verify
  • sudo pip install -r requirements.txt
  • utils > mongodb_client.py
  • default mongo port 27017
  • getDB (class or instance name convention), get_db static or singleton
  • mongodb_client_test.py
  • Register for a CloudAMQP account
  • install pika
  • create cloudAMQP_client.py (see the sketch after this list)
  • PEP8, autoPEP8
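
A hedged sketch of what cloudAMQP_client.py might look like with pika; the AMQP URL and queue name are placeholders, and the real client in the repo may differ:

```python
import json
import pika

class CloudAMQPClient(object):
    def __init__(self, cloud_amqp_url, queue_name):
        self.connection = pika.BlockingConnection(pika.URLParameters(cloud_amqp_url))
        self.channel = self.connection.channel()
        self.queue_name = queue_name
        self.channel.queue_declare(queue=queue_name)

    def send_message(self, message):
        self.channel.basic_publish(exchange='',
                                   routing_key=self.queue_name,
                                   body=json.dumps(message))

    def get_message(self):
        method_frame, header_frame, body = self.channel.basic_get(self.queue_name)
        if method_frame:
            self.channel.basic_ack(method_frame.delivery_tag)
            return json.loads(body)
        return None

    def sleep(self, seconds):
        # process data events so the broker does not drop the idle connection
        self.connection.sleep(seconds)
```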

Week 7 CodeLab 1

  • package-lock.json
  • React router
  • arrow function: this binding
  • pass value between jsx ...

Week 7 CodeLab 2

  • implement Authentication
  • web_server -> client -> src
  • Auth
  • Base > children ???
  • Use Link to (sign up, login page)
  • React Router Context (can be replaced by Redux)
  • npm install --save react-router@"<4.0.0"
  • Set up react route
  • web_server -> server
  • install cors
  • config.json (javascript can read json directly)
  • mongoose model : user.js and main.js, update app.js
  • bcrypt -> hash password and save to db
  • passport folder -> passport strategy -> npm install --save passport, passport-local, jsonwebtoken
  • login_passport.js and signup_passport.js
  • trim: remove the white spaces
  • auth_checker
  • npm install body-parser, validator, routes/auth.js
  • add auth to header

Week 7 Hands-on Session

  • mv backend_server/utils/* common/ -- move shared utilities to an outer common folder
  • add requests to requirements.txt, sudo pip install -r requirements.txt
    • Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.
    • There's no need to manually add query strings to your URLs, or to form-encode your POST data.
  • Create common -> news_api_client.py, news_api_client_test.py
    • getNewsFromSource(sources, sortBy): use NewsAPI to fetch news metadata
    • NewsAPI: get the latest news metadata
    • Structure of a single news item: source (added by the code), url, title, description, publishedAt, urlToImage, digest (MD5 hash)
  • Create news_pipeline -> news_monitor.py: monitors whether new news has appeared
    • sudo pip install redis : Use Redis to check if the news has been processed before or not.
    • while True: keep fetching news from the news sources in a loop
    • news_digest = MD5 hash of the news title (a string); check whether it is already in Redis (saves space and speeds up processing)
    • if it is not in Redis, store news_digest in Redis with an expiration time and send the task to the scraper queue
    • set a sleep interval and call CloudAMQPClient's sleep so the connection does not get dropped
  • Create news_pipeline -> queue_helper.py : tools to help clear the message queues
  • Create news_pipeline -> scrapers -> cnn_news_scraper.py, cnn_news_scraper_test.py
    • each source needs to have a separate scraper (we can avoid this by using newspaper)
    • getHeader() : add user agents to avoid being blocked
    • extract_news(news_url): use the scraper and XPath to extract the full article text
    • xpath helper
  • Create news_pipeline -> news_fetcher.py
    • while True: keep pulling news metadata in a loop
    • scrape_news_queue_client.getMessage(): for each fetched news item, task = msg
    • handle_message(msg): extract the article body from the URL and add it back into the news metadata object: task['content'] = extractedContent
    • dedupe_news_queue_client.sendMessage(task): send to the dedupe queue (different from the scrape queue), ready to be written to the DB
  • Update queue_helper.py : Add a helper to clear the dedupe queue
  • Create news_pipeline -> news_deduper.py: remove duplicate news based on similarity
    • sudo pip install python-dateutil (or add it to requirements.txt): used to find news published on the same day
    • sudo pip install sklearn: used to compute similarity
    • tf_idf_deduper_test_1.py, tf_idf_deduper_test_2.py: test the sklearn functionality for judging the similarity of two sentences
    • while True: keep pulling news from the dedupe queue
    • find the already-stored news from the same day and insert the news being checked as row 0
    • use TfidfVectorizer to build the similarity matrix and check whether row 0 contains any similarity > 0.9 (skipping entry 0, which is the news compared with itself)
  • Install and use newspaper 0.0.9.8 for Python 2
    • replace the previous scraper
    • no need to check the news source or extract news with XPath; the Article class does the work
    • add more news sources
  • Edit and run sh news_pipline_launcher.sh
    • when it will not shut down, you can use killall python
    • Alternative: make it executable with chmod +x news_pipeline_launcher.sh, run sudo ./news_pipline_launcher.sh, and press Enter to cancel

Note:

Week 8 CodeLab 2

  • Pagination

  • Preference Model

  • Log processor

  • Preparation

    • cp -r week7 week8-codelab2

Adding and handling pagination requests

  • backend_server -> service.py: add the getNewsSummariesForUser method
  • add operations.py to handle pagination requests (see the sketch after this list)
    • getNewsSummariesForUser(user_id, page_num): add page_num and let the backend find and return the corresponding page
      • convert page_num to an int
      • begin_index is inclusive, end_index is exclusive
    • check whether Redis already has the requested data; if it does:
      • Redis can only store strings, so use pickle.loads to convert back into a JSON-like dictionary
      • slice the corresponding news_digests with begin_index and end_index
      • fetch the matching list from MongoDB with 'digest': {'$in': sliced_news_digests}; remember to wrap the cursor in list()
    • if Redis has no data for this user, e.g. the first time a user logs in:
      • sort by publishedAt in reverse order with .sort([('publishedAt', -1)]) and take 100 items into total_news with .limit(NEWS_LIMIT)
      • take total_news_digests from total_news: map(lambda x: x['digest'], total_news)
      • store it in Redis and set an expiration time
      • return total_news[begin_index:end_index]; total_news is stored as a pickled object, so remember to use pickle.dumps
    • if publishedAt is today, add a 'today' time label
    • operations_test.py
  • NewsPanel.js: add page-number handling
    • this.state = { news: null, pageNum: 1, loadedAll: false };
    • loadMoreNews(): update the URL (add userId and pageNum); use encodeURIComponent so a '/' does not break the path
    • update setState
  • routes -> news.js
    • 'add :userId and :pageNum'
    • rpc_client : add getNewsSummariesForUser() and update rpc_client_test.py
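
A hedged sketch of the pagination flow described above; the constants, the Mongo/Redis connections, and the helper names are assumptions for illustration and may differ from the real operations.py:

```python
import pickle

import redis
from pymongo import MongoClient

NEWS_TABLE_NAME = 'news'
NEWS_LIST_BATCH_SIZE = 10                 # items per page
NEWS_LIMIT = 100                          # preload this many recent news per user
USER_NEWS_TIME_OUT_IN_SECONDS = 600

redis_client = redis.StrictRedis('localhost', 6379)
db = MongoClient('localhost', 27017)['tap-news']

def getNewsSummariesForUser(user_id, page_num):
    page_num = int(page_num)
    begin_index = (page_num - 1) * NEWS_LIST_BATCH_SIZE   # inclusive
    end_index = page_num * NEWS_LIST_BATCH_SIZE           # exclusive

    if redis_client.get(user_id) is not None:
        # Redis only stores strings, so unpickle back into a Python list.
        total_news_digests = pickle.loads(redis_client.get(user_id))
        sliced_news_digests = total_news_digests[begin_index:end_index]
        sliced_news = list(db[NEWS_TABLE_NAME].find(
            {'digest': {'$in': sliced_news_digests}}))
    else:
        # First request from this user: load the latest news and cache the digests.
        total_news = list(db[NEWS_TABLE_NAME].find()
                          .sort([('publishedAt', -1)]).limit(NEWS_LIMIT))
        total_news_digests = list(map(lambda x: x['digest'], total_news))
        redis_client.set(user_id, pickle.dumps(total_news_digests))
        redis_client.expire(user_id, USER_NEWS_TIME_OUT_IN_SECONDS)
        sliced_news = total_news[begin_index:end_index]
    return sliced_news
```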

Logging user clicks to implement news recommendation

  • preference model
    • Time Decay Model: the more recent an event, the larger its weight and the more it affects the model (favor the new over the old)
    • Click Log Processor: records user click behavior for analysis and prediction
    • SOA architecture: the backend server must not access the database directly; it must go through a service
  • NewsCard.js: send a click log from redirectToURL
    • sendClickLog(): send a POST request with userId + news_digest
  • rpc_client.js : logNewsClickForUsers(user_id, news_id)
    • export
    • test
  • routes -> news.js
    • add router.post to handle the click URL
  • backend_server -> service.py
    • call operations.logNewsClickForUser(user_id, news_id); send the click to the queue and also save a copy to the DB
  • Create news_recommendation_service -> click_log_processor.py
    • put # -*- coding: utf-8 -*- on the first line, because the comments contain non-ASCII characters (an alpha symbol); without it the file fails to run
    • fetch clicks from the queue, determine the news class, and update the class probability according to the time decay model (see the sketch after this list)
    • create news_classes.py listing all the class names
    • create the test click_log_processor_test.py
  • Create recommendation_service.py: rank the news classes by weight and return them sorted by weight
    • stored floating-point values may carry rounding error, so use isclose to check whether two floats are equal
    • it can be tested via rpc_client_test.py
  • Create common -> news_recommendation_service_client.py
    • called from operations.py: after the requested news is fetched, get the user's preference, take the top-ranked class, and mark news of that class as recommended
  • Run the app: npm run build, npm start, python service.py, python recommendation_service.py
  • Run the click log processor to listen for clicks: python click_log_processor.py
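
A hedged sketch of one possible time-decay update applied when a click is processed; the decay rate and the exact rule are assumptions for illustration, not necessarily the course's formula:

```python
ALPHA = 0.1   # decay rate: how strongly one new click shifts the preference model

def update_preference(preference, clicked_class):
    """preference maps class name -> probability; the values sum to 1."""
    for news_class, prob in preference.items():
        if news_class == clicked_class:
            preference[news_class] = (1 - ALPHA) * prob + ALPHA   # clicked class gains weight
        else:
            preference[news_class] = (1 - ALPHA) * prob           # everything else decays
    return preference   # the probabilities still sum to 1

# Example: start from a uniform model over three classes and click 'Sports'.
model = {c: 1.0 / 3 for c in ['Sports', 'Politics', 'Technology']}
print(update_preference(model, 'Sports'))
# When comparing the stored floats later, use a tolerance (isclose) rather than ==.
```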

Note:

  • Redis is used together with pickle
  • A service-oriented architecture is essentially a collection of services. These services communicate with each other.
  • MongoDB tables: news, click_logs, user_preference_model

Week 8 Lecture

  • Machine learning basics
    • Linear regression: we need a line that generalizes well to new, previously unseen data points drawn from the same distribution
    • Loss: Loss is the penalty for an incorrect prediction
      • Foreshadowing: we can also incorporate penalty for model complexity into loss term
    • L2 Loss (also called squared error): square of the difference between prediction and label
    • Gradient descent: using gradient to get closer to the minimum loss
    • Gradient step: step taken to the minimum loss
    • Step size: how big a gradient step to take
    • Learning rate: the "length" of each gradient step, is the parameter that controls step size
    • Step size = learning rate * gradient
    • Initialization of weight w (one minimum v.s. more than one minimum)
    • Gradient descent, Mini-batch descent, Stochastic gradient descent
  • Generalization
    • Generalization refers to a Machine Learning model's ability to perform well on new unseen data rather than just the data that it was trained on
    • Training Set: used for training
    • Test Set: used for double-checking your eval after you think you've found your best model
    • Validation Set: prevent from potentially overfitting to test data
  • Regularization
    • Regularization is a technique used in an attempt to solve the overfitting problem in statistical models
    • minimize: Loss(Data|Model) + complexity(Model)
  • Representation
    • Representation is how to map characteristics of data into features.
    • Real valued features can be copied over directly, string features can be handled with one-hot encoding
    • Qualities of good features
      • Feature values should appear with non-zero value more than a small handful of times in the dataset
      • Features should have a clear, obvious meaning
      • Features shouldn't take on "magic" values
      • The definition of a feature shouldn't change over time
    • Long tail
    • Binning
  • Neural Networks
    • Non-Linear Transformation: activation function
  • TensorFlow
    • TensorFlow is deep learning library open-sourced by Google, it provides primitives for defining functions on tensors and automatically computing their derivative
      • A tensor is a typed multi-dimensional array
      • eval(): value evaluated only after calling eval()
      • A session object encapsulates the environment in which Tensor objects are evaluated
      • construction phase, execution phase
      • use tf.Variable() to declare a variable; use tf.global_variables_initializer() to initialize variables
      • use tf.placeholder and feed_dict to feed external data into the computation graph (see the sketch after this list)
  • TensorFlow Serving
    • tf.train.Saver() : save the current value of all variables in the computation graph
    • tf.contrib.session_bundle.exporter.Exporter : saves a "snapshot" of the trained model
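
A minimal TensorFlow 1.x sketch tying several of the ideas above together (L2 loss, gradient descent, tf.Variable, placeholders with feed_dict, global_variables_initializer, and a Session); the data and learning rate are made up:

```python
import numpy as np
import tensorflow as tf   # TensorFlow 1.x API assumed

x = tf.placeholder(tf.float32, shape=[None])
y = tf.placeholder(tf.float32, shape=[None])
w = tf.Variable(0.0)
b = tf.Variable(0.0)

prediction = w * x + b
loss = tf.reduce_mean(tf.square(prediction - y))                      # L2 loss
train_step = tf.train.GradientDescentOptimizer(0.05).minimize(loss)   # step = learning rate * gradient

with tf.Session() as sess:                       # execution phase
    sess.run(tf.global_variables_initializer())
    xs = np.array([1.0, 2.0, 3.0, 4.0])
    ys = 2.0 * xs + 1.0                          # the "true" line we want to recover
    for _ in range(500):
        sess.run(train_step, feed_dict={x: xs, y: ys})
    print(sess.run([w, b]))                      # should approach [2.0, 1.0]
```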

Week 8 Hands-on Session

NLP

  • get features and tag
  • process vocabulary as vectors
  • build model: CNN layer 1 -> apply n filters on the input sequence; CNN layer 2 -> max across each filter to get useful features for classification (see the sketch below)
  • train and predict
  • evaluate the model
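
A hedged sketch of the convolution-then-max idea above using TensorFlow 1.x layers; the vocabulary size, dimensions, and class count are invented, and this is not the course's exact model:

```python
import tensorflow as tf   # TensorFlow 1.x API assumed

VOCAB_SIZE, EMBED_DIM = 10000, 50
N_FILTERS, WINDOW_SIZE = 10, 3
MAX_DOCUMENT_LENGTH, N_CLASSES = 100, 17

word_ids = tf.placeholder(tf.int32, shape=[None, MAX_DOCUMENT_LENGTH])
labels = tf.placeholder(tf.int32, shape=[None])

embedded = tf.contrib.layers.embed_sequence(word_ids, VOCAB_SIZE, EMBED_DIM)
conv = tf.layers.conv1d(embedded, filters=N_FILTERS, kernel_size=WINDOW_SIZE,
                        activation=tf.nn.relu)     # n filters over the word sequence
pooled = tf.reduce_max(conv, axis=1)               # max across time for each filter
logits = tf.layers.dense(pooled, N_CLASSES)        # classification head
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
train_op = tf.train.AdamOptimizer(0.01).minimize(loss)
```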

Week7 Notes and Thinking

Week 7 Hands-on Session

Let's get started with week 7 by watching the video, reading the PDF, and coding it ourselves. This part is about setting up the pipeline, getting news URIs from NewsAPI, scraping news text from BBC, CNN, etc., and dumping it into MongoDB for week 8's machine learning. This is the architecture of the news pipeline.

Contents

  • Overview
  • Prerequisites
  • Step 1: News Monitor
  • Step 2: News Scrapers
  • Step 3: News Deduper
  • Next Steps
  • Thinking

Overview

This covers the week 7 operational steps for building the pipeline, including prerequisites and core steps.

Prerequisites

NewsAPI

  • use Postman to test NewsAPI [url + article]

  • time class source digest

  • the structure of single news

title:string - news title
description:string - news description
text:string - news text
url:string - news page url
author:string - news author
source:string - news source
publishedAt:date - published date
urlToImage:string - news image url
class:string - news category
digest:string - news MD5 digest

Refactor file dir

  • mv backend_server/utils/* common/ (the common/ folder lives under week7/)
  • /week7/common contains the commonly used components and their tests (+ marks the files newly added this week):

    component [under /week7/common]     test
    cloud_amqp_client.py                cloud_amqp_client_test.py
    mongodb_client.py                   mongodb_client_test.py
    + news_api_client.py                + news_api_client_test.py

Request package

  • add requests to requirements.txt,

sudo pip install -r requirements.txt

  • Requests allows you to send organic, grass-fed HTTP/1.1 requests, without the need for manual labor.
  • No need to manually add query strings to your URLs, or to form-encode your POST data.

Create common -> news_api_client.py, news_api_client_test.py

  • getNewsFromSource(sources, sortBy): use NewsAPI to fetch news metadata (see the sketch below)
  • NewsAPI: get the latest news metadata
  • Structure of a single news item: source (added by the code), url, title, description, publishedAt, urlToImage, digest (MD5 hash)
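
A hedged sketch of getNewsFromSource with requests; the endpoint and parameters follow the 2017-era NewsAPI v1, and the API key is a placeholder:

```python
import hashlib
import requests

NEWS_API_ENDPOINT = 'https://newsapi.org/v1/articles'
NEWS_API_KEY = 'YOUR_NEWS_API_KEY'   # placeholder

def getNewsFromSource(sources=('cnn',), sortBy='top'):
    articles = []
    for source in sources:
        payload = {'apiKey': NEWS_API_KEY, 'source': source, 'sortBy': sortBy}
        response = requests.get(NEWS_API_ENDPOINT, params=payload)
        res_json = response.json()
        if res_json is not None and res_json.get('status') == 'ok':
            for news in res_json.get('articles', []):
                news['source'] = source                          # added by the code
                news['digest'] = hashlib.md5(
                    news['title'].encode('utf-8')).hexdigest()   # MD5 digest of the title
                articles.append(news)
    return articles
```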

Core Steps:

Location: week7/news_pipeline

Step1: News Monitor

  • Create news_pipeline -> news_monitor.py: monitors whether new news has appeared

  • Send URI to queue, maintain Redis as hashset to dedupe same URI, Sleep

  • ? sys os path

  • sudo pip install redis : Use Redis to check if the news has been processed before or not.

  • while True: keep fetching news from the news sources in a loop

  • news_digest = MD5 hash of the news title (a string); check whether it is already in Redis (saves space and speeds up processing) -- see the sketch at the end of this step

  • if it is not in Redis, store news_digest in Redis with an expiration time and send the task to the scraper queue

  • set a sleep interval and call CloudAMQPClient's sleep so the connection does not get dropped

  • Create news_pipeline -> queue_helper.py

  • tools to help clear MQ
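
A hedged sketch of the monitor loop; news_api_client and scrape_news_queue_client are the project's own helpers from common/ and are assumed here rather than re-implemented:

```python
import hashlib

import redis

SLEEP_TIME_IN_SECONDS = 10
NEWS_TIME_OUT_IN_SECONDS = 3600 * 24 * 3      # remember digests for three days

redis_client = redis.StrictRedis('localhost', 6379)

while True:
    news_list = news_api_client.getNewsFromSource()        # assumed helper from common/
    for news in news_list:
        news_digest = hashlib.md5(news['title'].encode('utf-8')).hexdigest()
        if redis_client.get(news_digest) is None:          # never seen before
            news['digest'] = news_digest
            redis_client.set(news_digest, news['title'])
            redis_client.expire(news_digest, NEWS_TIME_OUT_IN_SECONDS)
            scrape_news_queue_client.sendMessage(news)     # assumed CloudAMQP client
    scrape_news_queue_client.sleep(SLEEP_TIME_IN_SECONDS)  # keeps the connection alive
```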

Step2: News Scraper

  • Create news_pipeline -> scrapers -> cnn_news_scraper.py, cnn_news_scraper_test.py

  • each source needs to have a separate scraper (we can avoid this by using newspaper)

  • getHeader() : add user agents to avoid blocking

  • extract_news(news_url): use the scraper and XPath to extract the full article text

  • xpath helper

  • use a session to imitate a browser

  • Golden test: assert against an expected result, e.g. whether the extracted text contains a certain string

  • Create news_pipeline -> news_fetcher.py

  • while True: keep pulling news metadata in a loop

  • scrape_news_queue_client.getMessage(): for each fetched news item, task = msg

  • handle_message(msg): extract the article body from the URL and add it back into the news metadata object: task['content'] = extractedContent (see the sketch at the end of this step)

  • dedupe_news_queue_client.sendMessage(task): send to the dedupe queue (different from the scrape queue), ready to be written to the DB

  • Update queue_helper.py : Add a helper to clear the dedupe queue
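
A hedged sketch of the fetcher loop; the scraper module and the two queue clients are the project's own helpers and are assumed here:

```python
SLEEP_TIME_IN_SECONDS = 5

def handle_message(msg):
    if msg is None or not isinstance(msg, dict):
        return
    task = msg
    text = cnn_news_scraper.extract_news(task['url'])   # full article text for the URL
    task['content'] = text                              # add the body back onto the metadata
    dedupe_news_queue_client.sendMessage(task)          # hand off to the dedupe queue

while True:
    msg = scrape_news_queue_client.getMessage()          # one task per scraped news item
    if msg is not None:
        handle_message(msg)
    scrape_news_queue_client.sleep(SLEEP_TIME_IN_SECONDS)
```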

Step3: News Deduper

  • Create news_pipeline -> news_deduper.py: remove duplicate news based on similarity

  • sudo pip install python-dateutil (or add it to requirements.txt): used to find news published on the same day

  • sudo pip install sklearn: used to compute similarity

  • tf_idf_deduper_test_1.py, tf_idf_deduper_test_2.py: test the sklearn functionality for judging the similarity of two sentences

  • while True: keep pulling news from the dedupe queue

  • find the already-stored news from the same day and insert the news being checked as row 0

  • use TfidfVectorizer to build the similarity matrix and check whether row 0 contains any similarity > 0.9 (skipping entry 0, which is the news compared with itself) -- see the sketch below

  • tf is based on sklearn
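
A hedged sketch of the similarity check with scikit-learn; the threshold and texts are illustrative only:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

SIMILARITY_THRESHOLD = 0.9

def is_duplicate(candidate_text, same_day_texts):
    documents = [candidate_text] + same_day_texts    # candidate goes in row 0
    tfidf = TfidfVectorizer().fit_transform(documents)
    pairwise_sim = tfidf * tfidf.T                   # cosine similarity (rows are L2-normalized)
    _, cols = pairwise_sim.shape
    for col in range(1, cols):                       # skip column 0 (self-similarity)
        if pairwise_sim[0, col] > SIMILARITY_THRESHOLD:
            return True
    return False

print(is_duplicate('storm hits the coast overnight',
                   ['storm hits the coast overnight, officials say',
                    'sports results from last night']))
```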

Next Steps

Requirement and Run

  • Install and use newspaper 0.0.9.8 for Python 2

  • replace the previous scraper

  • don't need to check on news source, don't need to extract news using xpath, article will do the work

  • add more news source

  • Edit and run sh news_pipline_launcher.sh

  • when it will not shut down, you can use killall python

  • Alternative: make it executable with chmod +x news_pipeline_launcher.sh, run sudo ./news_pipline_launcher.sh, and press Enter to cancel

  • If it runs on a server, schedule a recurring job (e.g. cron) to run the shell script every day

Note:

Common Sense

Project

  • Why does the queue client need to sleep? ---> to keep the connection active
  • local DB, fully under our control
  • 4 queue client instances, 2 queues, 3 components

Commands

  • to clear redis cache: redis-cli flushall, exit
  • sudo service redis_6379 start; sudo service mongod start
  • kill $(jobs -p)

MongoDB command

mongo
show dbs
use tap-news
db["news-test"].find().count()

Tools: XPath Helper (Chrome extension)

Thinking

Objective

Latest News <- Scraper <- Seed URI

**How to turn the idea of a pipeline into a product?**

**Def. **

URI: find an easy way to get the latest news URIs via NewsAPI <- old driver

News: the latest news, with no duplicates in either URI or content <- common sense

Get seed URI:

Options: BFS to get seeds, DFS to get seeds, or use an API; since news is real-time, we use NewsAPI