<!DOCTYPE HTML>
<html>
<head>
<meta charset="utf-8">
<title>littleji</title>
<meta name="author" content="littleji">
<meta name="description" content="littleji's blog">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<meta property="og:site_name" content="littleji"/>
<meta property="og:image" content="undefined"/>
<link href="/favicon.png" rel="icon">
<link rel="alternate" href="/atom.xml" title="littleji" type="application/atom+xml">
<link rel="stylesheet" href="/css/style.css" media="screen" type="text/css">
<!--[if lt IE 9]><script src="//html5shiv.googlecode.com/svn/trunk/html5.js"></script><![endif]-->
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-77530561-1', 'auto');
ga('send', 'pageview');
</script>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-127734909-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-127734909-1');
</script>
<meta name="generator" content="Hexo 5.4.0"></head>
<body>
<header id="header" class="inner"><div class="alignleft">
<link href="//cdn.bootcss.com/font-awesome/4.7.0/css/font-awesome.css" rel="stylesheet">
<h1><a href="/">littleji</a></h1>
<h2><a href="/">a blog</a></h2>
<h2><a href="https://github.com/littleji" title="GithubID:littleji" target="_blank"><i class="fa fa-3x fa-github"></i></a> </h2>
</div>
<nav id="main-nav" class="alignright">
<ul>
<li><a href="/">Home</a></li>
<li><a href="/archives">Archives</a></li>
<li><a href="/about">About</a></li>
<li><a href="/booklist">Booklist</a></li>
</ul>
<div class="clearfix"></div>
</nav>
<div class="clearfix"></div>
</header>
<div id="content" class="inner">
<div id="main-col" class="alignleft"><div id="wrapper">
<article class="post">
<div id="toc" class="toc-article">
<strong class="toc-title">Contents</strong>
<a class="js-toggle-toc" href="javascript:void(0)"></a>
<div class="toc-content">
<ol class="toc"><li class="toc-item toc-level-2"><a class="toc-link" href="#NLP%E6%B5%81%E6%B0%B4%E7%BA%BF%E6%80%BB%E8%A7%88"><span class="toc-number">1.</span> <span class="toc-text">NLP Pipeline Overview</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E6%95%B0%E6%8D%AE%E6%9C%AC%E8%BA%AB%E5%A4%84%E7%90%86"><span class="toc-number">2.</span> <span class="toc-text">Processing the Data Itself</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E6%95%B0%E6%8D%AE%E5%A2%9E%E5%BC%BA"><span class="toc-number">3.</span> <span class="toc-text">Data Augmentation</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E5%AE%9E%E4%BD%93%E8%AF%86%E5%88%AB%E7%9A%84%E5%88%86%E7%B1%BB"><span class="toc-number">4.</span> <span class="toc-text">Entity Recognition Categories</span></a></li></ol></li><li class="toc-item toc-level-1"><a class="toc-link" href="#%E5%8F%82%E8%80%83"><span class="toc-number"></span> <span class="toc-text">References</span></a>
</div>
</div>
<div class="post-content">
<header>
<div class="icon"></div>
<time datetime="2019-04-26T16:00:00.000Z"><a href="/2019/04/27/20190427SomethingBeforeTheNlpPipline/">2019-04-27</a></time>
<h1 class="title"><a href="/2019/04/27/20190427SomethingBeforeTheNlpPipline/">What to Do Before Running an NLP Pipeline</a></h1>
</header>
<div class="entry">
<h2 id="NLP流水线总览"><a href="#NLP流水线总览" class="headerlink" title="NLP流水线总览"></a>NLP Pipeline Overview</h2><p>A typical NLP processing workflow looks much like the one in the figure below.</p>
<p><img src="https://stanfordnlp.github.io/CoreNLP/images/AnnotationPipeline.png" alt></p>
<p>This post fills in the tasks that come before that pipeline.</p>
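<h2 id="数据本身处理"><a href="#数据本身处理" class="headerlink" title="数据本身处理"></a>Processing the Data Itself</h2><ul>
<li>Simplified/Traditional Chinese conversion, or other synonym conversions (for Chinese)</li>
<li>Full-width/half-width conversion</li>
<li>Replace words that are not in the dictionary (the regex-and-replacement lookup dictionary) with UNK</li>
<li>Optionally prepend &lt;bos&gt; to each sentence and append &lt;eos&gt;</li>
<li>Replace URLs, @-mentions, emoticons and the like with uniform placeholders</li>
<li>Replace rare words (frequency below some threshold) with &lt;unk&gt;</li>
<li>Encoding conversion</li>
<li>Lowercasing</li>
<li>Remove punctuation (or replace it, depending on the task)</li>
<li>Remove stop words</li>
<li>Remove very frequent words</li>
<li>Remove sparse words</li>
<li>Expand abbreviations</li>
<li>Spelling correction (replace a word with the nearest dictionary word, or with &lt;unk&gt;)</li>
<li>Unit normalization (rewrite units in a uniform format, e.g. turn both 4kgs and 4kg into 4 kg, turn 4k into 4000, and turn the different ways of writing a currency amount into a single form such as 100 dollar)</li>
<li>Lemmatization</li>
<li>Translate or normalize other languages (e.g. normalize English words inside Chinese text to &lt;_e_&gt;)</li>
<li>Number normalization (e.g. map numbers below 10 to &lt;1:NUM&gt; and numbers above 10 to &lt;2:NUM&gt;); skip this step if later entity recognition needs the original numbers</li>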
</ul>
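<p>A few of the cleaning steps listed above can be sketched in Python. This is only a minimal illustration, not the pipeline used for this post: the function names <code>normalize</code> and <code>replace_rare</code>, the placeholder tokens <code>&lt;url&gt;</code> and <code>&lt;at&gt;</code>, and the <code>min_count</code> threshold are all assumptions made for the example.</p>

```python
import re
import unicodedata
from collections import Counter

# Hypothetical placeholder patterns; real data may need stricter ones.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")


def normalize(text):
    """Apply full-width folding, URL/@ replacement and lowercasing to one string."""
    # NFKC folds full-width letters, digits and punctuation to their half-width forms.
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub("<url>", text)
    text = MENTION_RE.sub("<at>", text)
    return text.lower()


def replace_rare(tokenized_docs, min_count=2):
    """Replace tokens seen fewer than min_count times with <unk> and add <bos>/<eos>."""
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    return [
        ["<bos>"] + [tok if counts[tok] >= min_count else "<unk>" for tok in doc] + ["<eos>"]
        for doc in tokenized_docs
    ]
```

<p>The order matters in this sketch: URLs are replaced before @-mentions so that an <code>@</code> inside a URL is not treated as a mention, and rare-word replacement runs on already tokenized text.</p>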
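<h2 id="数据增强"><a href="#数据增强" class="headerlink" title="数据增强"></a>Data Augmentation</h2><ul>
<li>Long-sentence truncation</li>
<li>dropout</li>
<li>shuffle<br><img src="https://pic3.zhimg.com/80/v2-d3aaee7f330d475a0643abd5199a1f16_hd.png" alt></li>
<li>Document cropping (this gives me more data. At first I tried pulling a few sentences out of a document to create 10 new documents. The sentences in those new documents had no logical connection to one another, so a classifier trained on them performed poorly. The second time, I split each article into segments of five consecutive sentences. That worked very well and improved the classifier considerably)</li>
<li>Text alignment</li>
<li>Synonym replacement</li>
<li>Back-translation</li>
<li>Transfer learning</li>
<li>GAN</li>
<li>BERT</li>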
</ul>
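<p>Two of the augmentation ideas above, document cropping into runs of five consecutive sentences and token dropout, might look roughly like this. The helper names, the punctuation-based sentence splitter and the default dropout probability are assumptions for illustration; the quoted experiment does not say how sentences were split.</p>

```python
import random
import re


def split_sentences(doc):
    """Split after Chinese or Western sentence-ending marks (a rough heuristic)."""
    parts = re.split(r"(?<=[。！？!?])\s*", doc)
    return [p for p in parts if p]


def crop_document(doc, window=5):
    """Cut a document into chunks of `window` consecutive sentences."""
    sents = split_sentences(doc)
    return ["".join(sents[i:i + window]) for i in range(0, len(sents), window)]


def word_dropout(tokens, p=0.1, rng=random):
    """Drop each token independently with probability p, keeping at least one token."""
    kept = [t for t in tokens if rng.random() >= p]
    return kept or tokens[:1]
```

<p>Keeping sentences consecutive is the point of <code>crop_document</code>: each chunk stays locally coherent, which is what the experience quoted above found to matter.</p>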
<h2 id="实体识别的分类"><a href="#实体识别的分类" class="headerlink" title="实体识别的分类"></a>Entity Recognition Categories</h2><p>PERSON People, including fictional.<br>NORP Nationalities or religious or political groups.<br>FAC Buildings, airports, highways, bridges, etc.<br>ORG Companies, agencies, institutions, etc.<br>GPE Countries, cities, states.<br>LOC Non-GPE locations, mountain ranges, bodies of water.<br>PRODUCT Objects, vehicles, foods, etc. (Not services.)<br>EVENT Named hurricanes, battles, wars, sports events, etc.<br>WORK_OF_ART Titles of books, songs, etc.<br>LAW Named documents made into laws.<br>LANGUAGE Any named language.<br>DATE Absolute or relative dates or periods.<br>TIME Times smaller than a day.<br>PERCENT Percentage, including “%”.<br>MONEY Monetary values, including unit.<br>QUANTITY Measurements, as of weight or distance.<br>ORDINAL “first”, “second”, etc.<br>CARDINAL Numerals that do not fall under another type.</p>
<h1 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h1><p><a href="https://stanfordnlp.github.io/CoreNLP/pipelines.html" target="_blank" rel="noopener">Introduction to pipelines</a><br><a href="https://juejin.im/entry/5aa34b10f265da2381553b87" target="_blank" rel="noopener">The Ultimate Guide to Text Data Processing: NLP for Beginners</a><br><a href="https://blog.csdn.net/wmq104/article/details/82931352" target="_blank" rel="noopener">Text Cleaning with re Regular Expressions</a><br><a href="https://www.zhihu.com/question/268849350" target="_blank" rel="noopener">What is the usual text cleaning workflow in natural language processing?</a><br><a href="https://www.zhihu.com/question/295519283" target="_blank" rel="noopener">How should English words and numbers be handled in Chinese NLP?</a><br><a href="https://www.jiqizhixin.com/articles/2018-11-19-20" target="_blank" rel="noopener">Machine Learning with Only a Few Thousand Texts: A Guide to Training on Small NLP Datasets</a><br><a href="https://zhuanlan.zhihu.com/p/28923961" target="_blank" rel="noopener">Winning the Zhihu Kanshan Cup</a></p>
<hr>
<p>Copyright notice: this article was written and published by littleji.com. Please credit the author and the source when reposting. You are welcome to follow the WeChat official account: littleji_com<br><a href="https://creativecommons.org/licenses/by-sa/4.0/" target="_blank" rel="noopener">This article is licensed under CC BY-SA 4.0</a><br>If you have any questions, please leave a comment or open an <a href="https://github.com/littleji/littleji.github.io/issues" target="_blank" rel="noopener">issue</a></p>
<p>Permalink: <a href="https://blog.littleji.com/2019/04/27/20190427SomethingBeforeTheNlpPipline/">https://blog.littleji.com/2019/04/27/20190427SomethingBeforeTheNlpPipline/</a></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>
</div>
<footer>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div id="toc" class="toc-article">
<strong class="toc-title">Contents</strong>
<a class="js-toggle-toc" href="javascript:void(0)"></a>
<div class="toc-content">
<ol class="toc"><li class="toc-item toc-level-1"><a class="toc-link" href="#GitLab-%E5%85%A5%E9%97%A8"><span class="toc-number">1.</span> <span class="toc-text">Getting Started with GitLab</span></a><ol class="toc-child"><li class="toc-item toc-level-2"><a class="toc-link" href="#%E6%88%91%E4%B8%8D%E6%83%B3%E5%86%8D%E5%B0%B4%E5%B0%AC%E4%B8%8B%E5%8E%BB%E4%BA%86"><span class="toc-number">1.1.</span> <span class="toc-text">I Don't Want to Be Embarrassed Anymore</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E4%B8%8A%E9%9D%A2%E5%88%B0%E5%BA%95%E6%98%AF%E4%BB%80%E4%B9%88%E9%97%AE%E9%A2%98%EF%BC%9F"><span class="toc-number">1.2.</span> <span class="toc-text">What Exactly Is Going Wrong Above?</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E4%BB%8E%E6%B8%B8%E5%87%BB%E9%98%9F%E8%BD%AC%E5%90%91%E6%AD%A3%E8%A7%84%E5%86%9B"><span class="toc-number">1.3.</span> <span class="toc-text">From Guerrilla Band to Regular Army</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E4%B8%BA%E4%BB%80%E4%B9%88%E4%BD%BF%E7%94%A8GitLab%E6%90%9E%E8%BF%99%E4%B8%80%E5%A5%97%E4%B8%9C%E8%A5%BF"><span class="toc-number">1.4.</span> <span class="toc-text">Why Use GitLab for All This</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#GitLab%E4%B8%AD%E7%9A%84%E7%BB%84%E7%BB%87%E7%BB%93%E6%9E%84"><span class="toc-number">1.5.</span> <span class="toc-text">Organizational Structure in GitLab</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#team"><span class="toc-number">1.6.</span> <span class="toc-text">team</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E8%BF%AD%E4%BB%A3%E6%97%B6%E9%97%B4%E7%9A%84%E9%95%BF%E5%BA%A6"><span class="toc-number">1.7.</span> <span class="toc-text">Iteration Length</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E4%BB%8E%E7%82%B9%E5%AD%90%E5%88%B0%E4%BA%A7%E5%93%81"><span class="toc-number">1.8.</span> <span class="toc-text">From Idea to Product</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" 
href="#%E4%B8%BA%E4%BB%80%E4%B9%88%E8%AF%B4%E8%A6%81%E4%BD%BF%E7%94%A8%E4%B8%80%E4%B8%AAissue%E5%BC%80%E5%A7%8B%E4%B8%80%E5%88%87"><span class="toc-number">1.9.</span> <span class="toc-text">Why Everything Should Start with an Issue</span></a><ol class="toc-child"><li class="toc-item toc-level-3"><a class="toc-link" href="#issue%E6%9C%AC%E8%BA%AB%E9%9C%80%E8%A6%81%E5%8C%85%E5%90%AB%E5%93%AA%E4%BA%9B%E4%B8%9C%E8%A5%BF"><span class="toc-number">1.9.1.</span> <span class="toc-text">What an Issue Should Contain</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#issue%E7%9A%84type"><span class="toc-number">1.9.2.</span> <span class="toc-text">Issue Types</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#%E5%AF%B9issues%E8%BF%9B%E8%A1%8Cplan"><span class="toc-number">1.9.3.</span> <span class="toc-text">Planning Issues</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#%E4%BC%98%E5%85%88%E7%BA%A7"><span class="toc-number">1.9.4.</span> <span class="toc-text">Priority</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#%E4%B8%A5%E9%87%8D%E6%80%A7"><span class="toc-number">1.9.5.</span> <span class="toc-text">Severity</span></a></li></ol></li><li class="toc-item toc-level-2"><a class="toc-link" href="#issue-board%EF%BC%8C%E5%B7%A5%E4%BD%9C%E7%9C%8B%E6%9D%BF"><span class="toc-number">1.10.</span> <span class="toc-text">Issue Board: the Work Kanban</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#FQA"><span class="toc-number">1.11.</span> <span class="toc-text">FAQ</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E5%8F%82%E8%80%83"><span class="toc-number">1.12.</span> <span class="toc-text">References</span></a></li></ol></li></ol>
</div>
</div>
<div class="post-content">
<header>
<div class="icon"></div>
<time datetime="2019-03-14T16:00:00.000Z"><a href="/2019/03/15/20190318HeadFirstForGitLab/">2019-03-15</a></time>
<h1 class="title"><a href="/2019/03/15/20190318HeadFirstForGitLab/">Getting Started with GitLab and Its Workflow</a></h1>
</header>
<div class="entry">
<h1 id="GitLab-入门"><a href="#GitLab-入门" class="headerlink" title="GitLab 入门"></a>Getting Started with GitLab</h1><h2 id="我不想再尴尬下去了"><a href="#我不想再尴尬下去了" class="headerlink" title="我不想再尴尬下去了"></a>I Don't Want to Be Embarrassed Anymore</h2><p>When you look in the mirror</p>
<p><img src="http://ww1.sinaimg.cn/large/007yzqYtly1g12df7ykf1j308c06ymy2.jpg" alt></p>
<p>When you think you are building what the user asked for, as usual</p>
<p><img src="http://ww1.sinaimg.cn/large/007yzqYtly1g12dcp8yrdj30hs0e27am.jpg" alt></p>
<p>When you, as a newbie, are about to take over someone else's code</p>
<p><img src="http://ww1.sinaimg.cn/large/007yzqYtly1g12dk9hvifj312l0qrb29.jpg" alt></p>
<p>When you are about to demo the product you built to the users</p>
<p><img src="https://media.giphy.com/media/t507tv4kA94B2/giphy.gif" alt></p>
<p>When you are roaring about who wrote this disgusting code</p>
<p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g12laay4fpg308c0544qp.gif" alt></p>
<p>When you are about to have a nice chat with the client</p>
<p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g12l7ofivag30dw05qhdu.gif" alt></p>
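<h2 id="上面到底是什么问题?"><a href="#上面到底是什么问题?" class="headerlink" title="上面到底是什么问题?"></a>What Exactly Is Going Wrong Above?</h2><ol>
<li>Development progress is hard to predict</li>
<li>The product's features rarely satisfy users</li>
<li>Product quality cannot be guaranteed</li>
<li>The product is hard to maintain</li>
<li>The software lacks adequate documentation</li>
</ol>
<p>At bottom, all of this comes from developing software like a small workshop or a guerrilla band.</p>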
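<h2 id="从游击队转向正规军"><a href="#从游击队转向正规军" class="headerlink" title="从游击队转向正规军"></a>From Guerrilla Band to Regular Army</h2><p>Back in the guerrilla days, after countless projects ran aground through chaotic management (the money was gone and nothing had been built), people gradually worked out some “ways of doing things”. These processes proved effective in practice and were widely adopted to raise project success rates and to <strong>slow the rate of hair loss</strong>.</p>
<p>Technical thinking is a must, and engineering thinking is just as necessary.</p>
<p>Engineering thinking starts from process. Behind process is science: value is created through defined steps and staged inputs and outputs, and process control ensures that the final result is satisfactory.</p>
<h2 id="为什么使用GitLab搞这一套东西"><a href="#为什么使用GitLab搞这一套东西" class="headerlink" title="为什么使用GitLab搞这一套东西"></a>Why Use GitLab for All This</h2><p>GitLab itself is a tool, one that helps everyone reach a shared understanding and supports engineering-style software development and management. The ultimate goal is to <strong>save programmers' ever-thinning hair</strong>.</p>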
<p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g12rtguxtxj30ku0kuqhj.jpg" alt></p>
<h2 id="GitLab中的组织结构"><a href="#GitLab中的组织结构" class="headerlink" title="GitLab中的组织结构"></a>Organizational Structure in GitLab</h2><p>Work is organized mainly through groups, subgroups, and projects. A typical structure looks like this:</p>
<ul>
<li>Organization Group - GitLab<ul>
<li>Category Subgroup - Marketing<ul>
<li>(project) Design</li>
<li>(project) General</li>
</ul>
</li>
<li>Category Subgroup - Software<ul>
<li>(project) GitLab CE</li>
<li>(project) GitLab EE</li>
<li>(project) Omnibus GitLab</li>
<li>(project) GitLab Runner</li>
<li>(project) GitLab Pages daemon</li>
</ul>
</li>
</ul>
</li>
</ul>
<p>A group lets team members be granted access to multiple projects in a single step.</p>
<p>Another example is a demo group I have set up:<br><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g12rgllc7kj323o175qer.jpg" alt></p>
<h2 id="team"><a href="#team" class="headerlink" title="team"></a>team</h2><p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g12s0x6k8oj319v0la0zn.jpg" alt></p>
<ul>
<li>Product / user research: user_research/UR</li>
<li>User experience: user_experience/UE</li>
<li>User interface: user_interface/UI</li>
<li>Development: develop/DEV</li>
<li>Quality: quality_assurance/QA</li>
<li>Operations: operation_maintenance/OP</li>
<li>Documentation: documentation/DOC</li>
<li>Development manager: project_manager/PM</li>
</ul>
<h2 id="迭代时间的长度"><a href="#迭代时间的长度" class="headerlink" title="迭代时间的长度"></a>Iteration Length</h2><p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g12s98kdwej30lo0drmyy.jpg" alt></p>
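<p>sprint: literally a short dash; one iteration, usually covering a few features and bug fixes, with a development cycle of 2-4 weeks</p>
<p>milestone: a checkpoint that lets everyone review whether the project has drifted off course or fallen behind; typically three sprints, that is, roughly the coming quarter</p>
<p>release: usually a product's deadline</p>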
<h2 id="从点子到产品"><a href="#从点子到产品" class="headerlink" title="从点子到产品"></a>From Idea to Product</h2><p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g12se1elqsj30mr0ap44h.jpg" alt></p>
<ul>
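<li>IDEA: Every project starts from an idea, usually born in a casual chat. At this stage everyone on the project team offers their own thoughts on the requirement.</li>
<li>ISSUE: The most effective way to discuss an idea is to create an issue for it. Your team and your collaborators can help you refine the idea in the issue tracker.</li>
<li>PLAN: Once the discussion reaches agreement, it is time to code. First, though, we need to prioritize and organize our workflow. For this we can use the Issue Board.</li>
<li>CODE: Now that everything is ready, we can start writing code.</li>
<li>COMMIT: Once we are happy with our first results, we can commit the code to a feature branch under version control.</li>
<li>TEST: With GitLab CI, we can run scripts to build and test our application.</li>
<li>REVIEW: Once the scripts run successfully and the build and tests pass, the code is ready for code review and approval.</li>
<li>STAGING: Now it is time to deploy the code to a staging environment and check whether everything works as expected, or whether changes are still needed.</li>
<li>PRODUCTION: When everything works as expected, it is time to deploy to production!</li>
<li>FEEDBACK: Now it is time to look back at which parts of the project need improvement. We use Cycle Analytics to get feedback on the key stages of the current project.</li>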
</ul>
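<h2 id="为什么说要使用一个issue开始一切"><a href="#为什么说要使用一个issue开始一切" class="headerlink" title="为什么说要使用一个issue开始一切"></a>Why Everything Should Start with an Issue</h2><ul>
<li>With a standardized template, anyone on the project can find every detail, the discussion, and the history of a requirement or bug in its issue</li>
<li>Through the documentation in the issue, developers and the project manager can discuss the latest design thinking for a feature in real time; the issue works as a live, shared, editable document</li>
<li>Issue labels make more complex features possible, such as workflows and time estimates</li>
</ul>
<h3 id="issue本身需要包含哪些东西"><a href="#issue本身需要包含哪些东西" class="headerlink" title="issue本身需要包含哪些东西"></a>What an Issue Should Contain</h3><p>A feature-type issue looks roughly like this:</p>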
<figure class="highlight markdown"><table><tbody><tr><td class="code"><pre><span class="line"><span class="section">## Overview</span></span><br><span class="line">&lt;!--- Describe what this change mainly accomplishes --&gt;</span><br><span class="line"></span><br><span class="line"><span class="section">## feature ID</span></span><br><span class="line">&lt;!--- The requirement ID from the project's requirements document, e.g. AD-100 --&gt;</span><br><span class="line"></span><br><span class="line"><span class="section">## Links to related features and bugs</span></span><br><span class="line">&lt;!--- The list of modules that need to change --&gt;</span><br><span class="line">&lt;!--- Don't forget the leading # --&gt;</span><br><span class="line"><span class="bullet">* </span>#XX1</span><br><span class="line"><span class="bullet">* </span>#XX2</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="section">## Estimated and actual time spent</span></span><br><span class="line"></span><br><span class="line">/estimate 1mo 1w 1d 1h 1m</span><br><span class="line"></span><br><span class="line">/spend 1mo 1w 1d 1h 1m</span><br><span class="line"> </span><br><span class="line"><span class="section">## Type of update</span></span><br><span class="line">&lt;!--- Put an 'x' in the matching brackets --&gt;</span><br><span class="line"><span class="bullet">* </span>[ ] Update that does not affect the output of any interface (purely adds functionality, changes nothing that exists)</span><br><span class="line"><span class="bullet">* </span>[ ] Update that affects interfaces (some existing functionality will stop working)</span><br><span class="line"></span><br><span class="line"><span class="section">## Checklist</span></span><br><span class="line"><span class="bullet">* </span>[ ] ~"team-DEV" Code follows this project's code style</span><br><span class="line"><span class="bullet">* </span>[ ] ~"team-DEV" Unit tests pass</span><br><span class="line"><span class="bullet">* </span>[ ] ~"team-DEV" Outline and detailed design documents for this feature are updated</span><br><span class="line"><span class="bullet">* </span>[ ] ~"team-QA" Test cases pass</span><br><span class="line"><span class="bullet">* </span>[ ] ~"team-QA" Integration tests pass</span><br><span class="line"><span class="bullet">* </span>[ ] ~"team-QA" Test documents for this feature are updated</span><br><span class="line"></span><br><span class="line"></span><br><span class="line"><span class="section">## Modified functionality</span></span><br><span class="line">&lt;!--- Describe how it was modified --&gt;</span><br><span class="line">&lt;!--- The feature list can double as a subtask list --&gt;</span><br><span class="line">&lt;!--- When a listed feature is done, put an 'x' in its brackets --&gt;</span><br><span class="line"><span class="bullet">* </span>[ ] Feature 1</span><br><span class="line"><span class="bullet">* </span>[ ] Feature 2</span><br><span class="line"></span><br><span class="line"><span class="section">## Added functionality</span></span><br><span class="line">&lt;!--- Describe how it was added --&gt;</span><br><span class="line"><span class="bullet">* </span>[ ] Feature 1</span><br><span class="line"><span class="bullet">* </span>[ ] Feature 2</span><br><span class="line"></span><br><span class="line"><span class="section">## Removed functionality</span></span><br><span class="line">&lt;!--- Describe how it was removed --&gt;</span><br><span class="line"><span class="bullet">* </span>[ ] Feature 1</span><br><span class="line"><span class="bullet">* </span>[ ] Feature 2</span><br><span class="line"></span><br><span class="line"><span class="section">## How to verify this feature</span></span><br><span class="line">&lt;!--- Include the test environment and how to run the tests --&gt;</span><br></pre></td></tr></tbody></table></figure>
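<h3 id="issue的type"><a href="#issue的type" class="headerlink" title="issue的type"></a>Issue Types</h3><ul>
<li>feature: a piece of functionality to develop</li>
<li>bug: a software defect</li>
<li>tech debt: improving product performance without changing functionality</li>
<li>ux debt: the user experience needs improvement</li>
</ul>
<h3 id="对issues进行plan"><a href="#对issues进行plan" class="headerlink" title="对issues进行plan"></a>Planning Issues</h3><p>Planning sounds like scheduling, but it is really about communication.</p>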
<p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g136hkdncdj30cg04saa1.jpg" alt></p>
<p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g1360hafquj30yt04s74k.jpg" alt></p>
<ul>
<li>process-needs confirmed</li>
<li>process-reject (rejected)</li>
<li>process-doing</li>
</ul>
<h3 id="优先级"><a href="#优先级" class="headerlink" title="优先级"></a>Priority</h3><p>Priorities exist so that there is a plan when many requirements are being handled at the same time.</p>
<p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g135z5zvv8j30z50613yy.jpg" alt></p>
<ul>
<li>P1 Urgent: finish in the current sprint</li>
<li>P2 High: finish in the next sprint</li>
<li>P3 Medium: finish within the current milestone</li>
<li>P4 Low: finish in the next milestone</li>
</ul>
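<h3 id="严重性"><a href="#严重性" class="headerlink" title="严重性"></a>Severity</h3><ul>
<li>S1 Critical: system crash, data loss, data leak, or coredump; no data availability</li>
<li>S2 Severe: data cannot be queried from the front end but the database is intact; data availability is partly lost</li>
<li>S3 Medium: the front-end layout is misaligned; no data is lost, availability is impaired</li>
<li>S4 Minor: the front end shows wrong colors; no data is lost and availability is unaffected</li>
</ul>
<h2 id="issue-board,工作看板"><a href="#issue-board,工作看板" class="headerlink" title="issue board,工作看板"></a>Issue Board: the Work Kanban</h2><p>All of the preparation above serves this: letting every member of the project team see the current plan and progress clearly, and speeding up the flow of information.</p>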
<p><img src="https://ws1.sinaimg.cn/large/007yzqYtly1g136mkcyttj30t10glk0v.jpg" alt></p>
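<p>A proper issue should have three labels, an assignee, and a due date. The three labels are:</p>
<ul>
<li>process-*</li>
<li>Priority</li>
<li>Team: on a board built around teams, dragging an issue into a team's list adds the team label automatically</li>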
</ul>
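<h2 id="FQA"><a href="#FQA" class="headerlink" title="FQA"></a>FAQ</h2><ul>
<li><p>Q: Can permissions be set on individual folders, as in SVN?</p>
<ul>
<li>A: No. Every folder in a repo has the same permissions. If access has to differ per person, use two repos.</li>
</ul>
</li>
<li><p>Q: What can branch permissions do?</p>
<ul>
<li>A: Allow only certain people to merge, and allow only certain people to push.</li>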
</li>
</ul>
<h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><p><a href>Things About Programmers</a></p>
<p><a href="https://zh.wikipedia.org/zh/%E5%A5%B6%E5%A4%B4%E4%B9%90" target="_blank" rel="noopener">Tittytainment (Chinese Wikipedia)</a></p>
<p><a href="https://www.cnhzz.com/gitlab-%E5%B7%A5%E4%BD%9C%E6%B5%81%E6%A6%82%E8%A7%88/" target="_blank" rel="noopener">An Overview of the GitLab Workflow</a></p>
<p><a href="https://about.gitlab.com/2016/10/25/gitlab-workflow-an-overview/#gitlab-workflow-use-case-scenario" target="_blank" rel="noopener">GitLab Workflow: An Overview</a></p>
<p><a href="https://about.gitlab.com/2016/03/03/start-with-an-issue/" target="_blank" rel="noopener">Always start with an issue </a></p>
<p><a href="https://pm.stackexchange.com/questions/22510/scrum-sprint-vs-milestone-vs-release" target="_blank" rel="noopener">scrum-sprint-vs-milestone-vs-release</a></p>
<p><a href="https://www.cnblogs.com/baiyanhuang/archive/2010/11/29/1890728.html" target="_blank" rel="noopener">How We Do Scrum</a></p>
<p><a href="https://medium.com/@shaky.girl/%E5%88%A5%E5%86%8D%E5%82%BB%E5%82%BB%E5%88%86%E4%B8%8D%E6%B8%85-%E7%A9%B6%E7%AB%9Fpm-ux-ui-web-designer-front-end-developer%E7%9A%84%E5%B0%88%E6%A5%AD%E5%B7%AE%E5%9C%A8%E5%93%AA-b8206dd49d32" target="_blank" rel="noopener">Stop Mixing Them Up: How Do the PM, UX, UI, Web Designer, and Front-End Developer Roles Actually Differ?</a></p>
<hr>
<p>Copyright notice: this article was written and published by littleji.com. Please credit the author and the source when reposting. You are welcome to follow the WeChat official account: littleji_com<br><a href="https://creativecommons.org/licenses/by-sa/4.0/" target="_blank" rel="noopener">This article is licensed under CC BY-SA 4.0</a><br>If you have any questions, please leave a comment or open an <a href="https://github.com/littleji/littleji.github.io/issues" target="_blank" rel="noopener">issue</a></p>
<p>Permalink: <a href="https://blog.littleji.com/2019/03/15/20190318HeadFirstForGitLab/">https://blog.littleji.com/2019/03/15/20190318HeadFirstForGitLab/</a></p>
</div>
<footer>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div id="toc" class="toc-article">
<strong class="toc-title">Contents</strong>
<a class="js-toggle-toc" href="javascript:void(0)"></a>
<div class="toc-content">
<ol class="toc"><li class="toc-item toc-level-2"><a class="toc-link" href="#%E6%A6%82%E8%BF%B0"><span class="toc-number">1.</span> <span class="toc-text">Overview</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#1-%E8%AF%8D%E5%85%B8%E5%88%86%E8%AF%8D%E7%AE%97%E6%B3%95"><span class="toc-number">2.</span> <span class="toc-text">1 Dictionary-Based Segmentation Algorithms</span></a><ol class="toc-child"><li class="toc-item toc-level-3"><a class="toc-link" href="#1-1-%E5%89%8D%E5%90%91%E6%9C%80%E5%A4%A7%E5%8C%B9%E9%85%8D%E7%AE%97%E6%B3%95%E3%80%81%E5%90%8E%E5%90%91%E6%9C%80%E5%A4%A7%E5%8C%B9%E9%85%8D%E3%80%81%E5%8F%8C%E5%90%91%E5%8C%B9%E9%85%8D%E3%80%81%E6%9C%80%E5%B0%8F%E5%88%87%E5%88%86"><span class="toc-number">2.1.</span> <span class="toc-text">1.1 Forward Maximum Matching, Backward Maximum Matching, Bidirectional Matching, Minimum Segmentation</span></a></li></ol></li><li class="toc-item toc-level-2"><a class="toc-link" href="#2-%E5%9F%BA%E4%BA%8E%E7%BB%9F%E8%AE%A1%E7%9A%84%E5%88%86%E8%AF%8D%E7%AE%97%E6%B3%95"><span class="toc-number">3.</span> <span class="toc-text">2 Statistical Segmentation Algorithms</span></a><ol class="toc-child"><li class="toc-item toc-level-3"><a class="toc-link" href="#2-1-NGram"><span class="toc-number">3.1.</span> <span class="toc-text">2.1 NGram</span></a></li></ol></li><li class="toc-item toc-level-2"><a class="toc-link" href="#3-%E8%AF%8D%E8%A2%8B%E6%A8%A1%E5%9E%8B%E4%B8%8E%E6%96%87%E6%9C%AC%E5%90%91%E9%87%8F%E5%8C%96"><span class="toc-number">4.</span> <span class="toc-text">3 Bag-of-Words Model and Text Vectorization</span></a><ol class="toc-child"><li class="toc-item toc-level-3"><a class="toc-link" href="#3-1-%E8%AF%8D%E8%A2%8B%E6%A8%A1%E5%9E%8B"><span class="toc-number">4.1.</span> <span class="toc-text">3.1 Bag-of-Words Model</span></a></li></ol></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E5%8F%82%E8%80%83"><span class="toc-number">5.</span> <span class="toc-text">References</span></a></li></ol>
</div>
</div>
<div class="post-content">
<header>
<div class="icon"></div>
<time datetime="2019-03-04T16:00:00.000Z"><a href="/2019/03/05/20190305DataWhaleNLPTask2/">2019-03-05</a></time>
<h1 class="title"><a href="/2019/03/05/20190305DataWhaleNLPTask2/">Common Word Segmentation Methods and Text Vectorization</a></h1>
</header>
<div class="entry">
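<h2 id="概述"><a href="#概述" class="headerlink" title="概述"></a>Overview</h2><p>Word segmentation is one of the basic tasks of natural language processing (NLP) and underlies many higher-level tasks (named entity recognition, sentiment analysis, automatic summarization, and so on), so practitioners need a solid grasp of the concepts involved. This post briefly covers the mainstream segmentation algorithms and the main difficulties of segmentation.</p>
<p>Segmentation theory has three main parts: segmentation algorithms, Chinese segmentation disambiguation, and out-of-vocabulary word recognition. Segmentation algorithms in turn fall into several broad classes: dictionary-based, understanding-based, statistical, and hybrid. The discussion below focuses on segmentation algorithms.</p>
<h2 id="1-词典分词算法"><a href="#1-词典分词算法" class="headerlink" title="1 词典分词算法"></a>1 Dictionary-Based Segmentation Algorithms</h2><p>Dictionary-based segmentation comes down to two things: the matching algorithm and the structure of the dictionary. The main dictionary-based methods are forward maximum matching, backward maximum matching, bidirectional maximum matching, and minimum segmentation.</p>
<h3 id="1-1-前向最大匹配算法、后向最大匹配、双向匹配、最小切分"><a href="#1-1-前向最大匹配算法、后向最大匹配、双向匹配、最小切分" class="headerlink" title="1.1 前向最大匹配算法、后向最大匹配、双向匹配、最小切分"></a>1.1 Forward Maximum Matching, Backward Maximum Matching, Bidirectional Matching, Minimum Segmentation</h3><p>Forward maximum matching reads words from the text in order through a variable-size sliding window. If the word in the window is in the dictionary, it is cut off as one segment; otherwise the window shrinks and the dictionary lookup repeats, until the window length is 1. Backward matching works on the same principle, except that the window slides from the end of the text.</p>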
<figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">def forward_max_match(ustring, word_set, word_max_len):</span><br><span class="line">    """Segment ustring by forward maximum matching against word_set."""</span><br><span class="line">    word_list = []</span><br><span class="line">    while ustring:</span><br><span class="line">        # The window can never be longer than the remaining text.</span><br><span class="line">        window = min(word_max_len, len(ustring))</span><br><span class="line">        for i in range(window, 0, -1):</span><br><span class="line">            # Accept a dictionary word, or fall back to a single character.</span><br><span class="line">            if ustring[:i] in word_set or i == 1:</span><br><span class="line">                word_list.append(ustring[:i])</span><br><span class="line">                ustring = ustring[i:]</span><br><span class="line">                break</span><br><span class="line">    return word_list</span></pre></td></tr></tbody></table></figure>
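<p>Backward matching is the same as forward matching except that it scans from the end of the text toward the beginning. Bidirectional matching runs both directions and keeps the segmentation that contains fewer words.</p>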
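<p>The two variants just described can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names, the toy dictionary, the <code>max_len</code> default, and the tie-break (prefer the backward result when both directions yield the same number of words) are all assumptions made for the example. The forward matcher is repeated so that the block runs on its own.</p>

```python
def forward_max_match(text, word_set, max_len=4):
    """Greedy left-to-right maximum matching."""
    words = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            # Accept a dictionary word, or fall back to one character.
            if text[:size] in word_set or size == 1:
                words.append(text[:size])
                text = text[size:]
                break
    return words


def backward_max_match(text, word_set, max_len=4):
    """Greedy right-to-left maximum matching."""
    words = []
    while text:
        for size in range(min(max_len, len(text)), 0, -1):
            if text[-size:] in word_set or size == 1:
                words.insert(0, text[-size:])
                text = text[:-size]
                break
    return words


def bidirectional_match(text, word_set, max_len=4):
    """Run both directions and keep the result with fewer words."""
    forward = forward_max_match(text, word_set, max_len)
    backward = backward_max_match(text, word_set, max_len)
    # min() returns the first minimal item, so ties go to the backward
    # result; that tie-break is an assumption of this sketch.
    return min((backward, forward), key=len)


# Hypothetical toy dictionary, for illustration only.
toy_dict = {"研究", "研究生", "生命", "命", "的", "起源"}
sentence = "研究生命的起源"
print(forward_max_match(sentence, toy_dict))    # ['研究生', '命', '的', '起源']
print(backward_max_match(sentence, toy_dict))   # ['研究', '生命', '的', '起源']
print(bidirectional_match(sentence, toy_dict))  # ['研究', '生命', '的', '起源']
```

<p>Both directions produce four words here, so the word count alone cannot separate them; this is the kind of ambiguity that pure dictionary matching cannot resolve.</p>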
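<p>Minimum segmentation follows a single rule: cut each sentence into as few words as possible.</p>
<p>Matching-based methods have these advantages:</p>
<ol>
<li>The program is simple to write, so development time is short;</li>
<li>No complex computation is involved, so segmentation is fast.</li>
</ol>
<p>The drawbacks are:</p>
<ol>
<li>They cannot resolve ambiguity;</li>
<li>They cannot recognize new words;</li>
<li>Segmentation accuracy is low and does not meet practical needs.</li>
</ol>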
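<h2 id="2-基于统计的分词算法"><a href="#2-基于统计的分词算法" class="headerlink" title="2 基于统计的分词算法"></a>2 Statistical Segmentation</h2><h3 id="2-1-NGram"><a href="#2-1-NGram" class="headerlink" title="2.1 NGram"></a>2.1 N-gram</h3><p>Segmentation based on an N-gram language model is a typical generative approach. Many early statistical segmenters used it as the base model and extended it with separate modules for out-of-vocabulary word recognition. The basic idea: first match the sentence against a dictionary to find every possible dictionary word; then take those words, together with every single character, as the nodes of an n-gram segmentation graph. Each node is a candidate word, each edge is a step along a path, and the n-gram probability on an edge is its cost. Finally, a search algorithm finds the lowest-cost path, which is the segmentation result.</p>
<p>Let the random variable S be a sequence of Chinese characters and let W range over all possible segmentation paths of S. Segmentation then means finding the path that maximizes the conditional probability P(W|S):<script type="math/tex; mode=display">
\hat{W}=\arg\max_{W}P(W\mid S)</script></p>
<p>By Bayes' rule:<script type="math/tex; mode=display">
\hat{W}=\arg\max_{W}\frac{P(W)\,P(S\mid W)}{P(S)}</script></p>
<p>P(S) is a normalization factor that does not depend on W, and P(S|W) is always 1 because a segmentation path fully determines the character sequence, so only P(W) needs to be maximized. P(W) is modeled with an N-gram language model; for a bigram model it is defined as:<script type="math/tex; mode=display">
P(W)=P(w_1 w_2 \ldots w_T)=P(w_1)\,P(w_2\mid w_1)\cdots P(w_T\mid w_{T-1})</script></p>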
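<p>To make the bigram factorization concrete, the sketch below scores two candidate segmentations of the same sentence. The probability table is invented for illustration, the <code>BOS</code> start marker stands in for the unconditional first factor P(w1), and the <code>floor</code> value for unseen pairs is an assumption; a real model would estimate these numbers from a corpus and apply proper smoothing.</p>

```python
# Hypothetical bigram probabilities; "BOS" marks the start of the sentence.
bigram_prob = {
    ("BOS", "研究"): 0.2, ("研究", "生命"): 0.1,
    ("生命", "的"): 0.3, ("的", "起源"): 0.05,
    ("BOS", "研究生"): 0.05, ("研究生", "命"): 0.001, ("命", "的"): 0.2,
}


def path_probability(words, floor=1e-6):
    """Multiply P(w_t | w_{t-1}) along one segmentation path."""
    prob = 1.0
    for prev, cur in zip(["BOS"] + words, words):
        # Unseen pairs get a small floor probability (an assumption).
        prob *= bigram_prob.get((prev, cur), floor)
    return prob


path_a = ["研究", "生命", "的", "起源"]
path_b = ["研究生", "命", "的", "起源"]
print(path_probability(path_a))
print(path_probability(path_b))
```

<p>With these made-up numbers the first path scores far higher than the second, so it would be chosen as the segmentation.</p>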
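<p>With this, the quality of every segmentation path (its conditional probability P(W|S)) can be computed. The simplest approach enumerates every path in the DAG and picks the best one by brute force; dynamic programming finds the same path more efficiently. jieba, when run without HMM-based new-word discovery, segments in exactly this way: DAG + unigram language model + backward dynamic programming.</p>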
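<p>The DAG + unigram + backward dynamic programming scheme can be sketched in a few lines. This is a toy illustration of the idea, not jieba's actual code: the word counts are invented, and the function names and the fallback count of 1 for unknown characters are assumptions.</p>

```python
import math

# Hypothetical toy word counts; a real system loads a large dictionary.
word_freq = {"研究": 50, "研究生": 20, "生命": 40, "命": 5, "的": 200,
             "起源": 30, "研": 1, "究": 1, "生": 5, "起": 3, "源": 2}
total = sum(word_freq.values())


def build_dag(sentence):
    """Map each start index to the end indices of dictionary words there."""
    dag = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1)
                if sentence[i:j] in word_freq]
        dag[i] = ends or [i + 1]  # unknown character: one-character word
    return dag


def segment(sentence):
    """Pick the path with the highest unigram log-probability."""
    dag = build_dag(sentence)
    n = len(sentence)
    # best[i] = (best log-probability of sentence[i:], end of its first word)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):  # scan from the end of the sentence
        best[i] = max(
            (math.log(word_freq.get(sentence[i:j], 1) / total) + best[j][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i != n:  # follow the stored choices from left to right
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


print(segment("研究生命的起源"))  # ['研究', '生命', '的', '起源']
```

<p>Scanning from the end means each position is solved once, using the already-known best scores of every later position, so the cost grows with the number of DAG edges rather than with the number of paths.</p>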
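<h2 id="3-词袋模型与文本向量化"><a href="#3-词袋模型与文本向量化" class="headerlink" title="3 词袋模型与文本向量化"></a>3 Bag-of-Words Model and Text Vectorization</h2><h3 id="3-1-词袋模型"><a href="#3-1-词袋模型" class="headerlink" title="3.1 词袋模型"></a>3.1 Bag-of-Words Model</h3><p>The bag-of-words model (BoW model) first appeared in natural language processing (NLP) and information retrieval (IR). It ignores grammar and word order and treats a text simply as a collection of words, with each word occurrence in a document assumed to be independent of the others. BoW represents a passage or a document as an unordered set of words. In recent years the BoW model has also been widely applied in computer vision.</p>
<p>A simple example of a text-based BoW model follows.</p>
<p>Start with two simple text documents:</p>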
<pre><code> John likes to watch movies. Mary likes too.
John also likes to watch football games.
</code></pre><p>From the words that appear in these two documents, build the following dictionary:</p>
<pre><code> {"John": 1, "likes": 2,"to": 3, "watch": 4, "movies": 5,"also": 6, "football": 7, "games": 8,"Mary": 9, "too": 10}
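</code></pre><p>The dictionary contains 10 words, each with a unique index, so each text can be represented as a 10-dimensional vector:<br> [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]<br> [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]</p>
<p>The vector has nothing to do with the order of the words in the original text; each component is the number of times the corresponding dictionary word occurs in that text.</p>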
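<p>The two vectors above can be reproduced with a short script. This is a minimal sketch: the function name <code>bow_vector</code> and the regular-expression tokenizer are assumptions chosen for the example, while the documents and the 1-based dictionary are the ones given in the text.</p>

```python
import re
from collections import Counter

docs = [
    "John likes to watch movies. Mary likes too.",
    "John also likes to watch football games.",
]
# The same 1-based dictionary as in the text.
vocab = {"John": 1, "likes": 2, "to": 3, "watch": 4, "movies": 5,
         "also": 6, "football": 7, "games": 8, "Mary": 9, "too": 10}


def bow_vector(text, vocab):
    """Count how often each dictionary word occurs in the text."""
    counts = Counter(re.findall(r"\w+", text))  # naive tokenizer
    vector = [0] * len(vocab)
    for word, index in vocab.items():
        vector[index - 1] = counts[word]  # dictionary indices start at 1
    return vector


for doc in docs:
    print(bow_vector(doc, vocab))
# [1, 2, 1, 1, 1, 0, 0, 0, 1, 1]
# [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```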
<p>A Python example that generates n-grams and CBOW-style (context, target) pairs:<br></p><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">#!/usr/bin/env python3</span><br><span class="line"></span><br><span class="line">from itertools import islice</span><br><span class="line">from itertools import tee</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">def consume(iterator, n):</span><br><span class="line">    """Advance the iterator n steps ahead and return it."""</span><br><span class="line">    next(islice(iterator, n, n), None)</span><br><span class="line">    return iterator</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">def ngram(iterable, n=2):</span><br><span class="line">    """s -&gt; (s0,s1), (s1,s2), (s2, s3), ..."""</span><br><span class="line">    assert n &gt; 0, 'n must be a positive integer.'</span><br><span class="line">    l = tee(iterable, n)</span><br><span class="line">    for i, s in enumerate(l):</span><br><span class="line">        consume(s, i)  # shift the i-th copy by i items</span><br><span class="line">    return zip(*l)</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">def ngram_generator(words, n=2):</span><br><span class="line">    "s -&gt; (s0,s1), (s1,s2), (s2, s3), ..."</span><br><span class="line">    assert n &gt; 0, "n is not in (0,inf)"</span><br><span class="line">    for i in range(len(words)-n+1):</span><br><span class="line">        yield tuple(words[i:i+n])</span><br><span class="line"></span><br><span class="line"></span><br><span class="line">def cbow(iterable, window=1):</span><br><span class="line">    "s -&gt; ((s0,s2), s1), ((s1,s3), s2), ((s2, s4), s3), ..."</span><br><span class="line">    context = [consume(s, i) for i, s in enumerate(tee(iterable, 2*window+1))]</span><br><span class="line">    target = context[window]</span><br><span class="line">    del context[window]</span><br><span class="line">    return zip(zip(*context), target)</span></pre></td></tr></tbody></table></figure><p></p>
<h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><p><a href="http://www.lis.ac.cn/CN/article/downloadArticleFile.do?attachType=PDF&id=11361" target="_blank" rel="noopener">奉国和, 郑伟. 国内中文自动分词技术研究综述. 图书情报工作, 2011, 54(02): 41-45.</a><br><a href>自然语言处理综述</a><br><a href="https://github.com/fxsjy/jieba" target="_blank" rel="noopener">结巴分词</a><br><a href="http://www.cnblogs.com/xlturing/p/8467021.html" target="_blank" rel="noopener">谈分词算法(2)基于词典的分词方法</a><br><a href="https://zhuanlan.zhihu.com/p/33261835" target="_blank" rel="noopener">中文分词算法简介</a><br><a href="https://blog.csdn.net/u010213393/article/details/40987945" target="_blank" rel="noopener">BoW(词袋)模型详细介绍</a><br><a href="https://ilewseu.github.io/2018/06/16/%E4%B8%AD%E6%96%87%E5%88%86%E8%AF%8D/" target="_blank" rel="noopener">自然语言处理基础-中文分词</a></p>
<hr>
<p>Copyright notice: this article was written and published by littleji.com. Please credit the author and the source when reposting. You are welcome to follow the official account: littleji_com<br><a href="https://creativecommons.org/licenses/by-sa/4.0/" target="_blank" rel="noopener">This article is licensed under CC BY-SA 4.0</a><br>If you have any questions, please leave a comment below or open an <a href="https://github.com/littleji/littleji.github.io/issues" target="_blank" rel="noopener">issue</a></p>
<p>Permalink: <a href="https://blog.littleji.com/2019/03/05/20190305DataWhaleNLPTask2/">https://blog.littleji.com/2019/03/05/20190305DataWhaleNLPTask2/</a></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>
</div>
<footer>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div id="toc" class="toc-article">
<strong class="toc-title">Contents</strong>
<a class="js-toggle-toc" href="javascript:void(0)"></a>
<div class="toc-content">
<ol class="toc"><li class="toc-item toc-level-2"><a class="toc-link" href="#windows%E4%B8%8B%E8%BF%9B%E8%A1%8CTensorFlow-GPU%E7%89%88%E6%9C%AC%E5%AE%89%E8%A3%85-%E4%B8%8D%E9%80%82%E7%94%A8%E4%BA%8ECPU%E7%89%88%E6%9C%AC"><span class="toc-number">1.</span> <span class="toc-text">Installing the TensorFlow GPU Version on Windows (Does Not Apply to the CPU Version)</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E4%BD%BF%E7%94%A8-imdb%E6%95%B0%E6%8D%AE%E9%9B%86%E8%BF%9B%E8%A1%8C%E6%96%87%E6%9C%AC%E5%88%86%E7%B1%BB"><span class="toc-number">2.</span> <span class="toc-text">Text Classification with the IMDB Dataset</span></a><ol class="toc-child"><li class="toc-item toc-level-3"><a class="toc-link" href="#%E6%9E%84%E5%BB%BA%E6%A8%A1%E5%9E%8B"><span class="toc-number">2.1.</span> <span class="toc-text">Build the Model</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#%E5%88%9B%E5%BB%BA%E4%BC%98%E5%8C%96%E5%99%A8"><span class="toc-number">2.2.</span> <span class="toc-text">Create the Optimizer</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#%E5%BC%80%E5%A7%8B%E8%BF%9B%E8%A1%8C%E8%AE%AD%E7%BB%83"><span class="toc-number">2.3.</span> <span class="toc-text">Train the Model</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#%E8%AF%84%E4%BC%B0%E5%AF%B9%E5%BA%94%E7%9A%84%E7%BB%93%E6%9E%9C"><span class="toc-number">2.4.</span> <span class="toc-text">Evaluate the Results</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#%E4%BD%BF%E7%94%A8%E5%88%B0%E7%9A%84%E7%9B%B8%E5%85%B3%E7%9A%84%E6%9C%AF%E8%AF%AD"><span class="toc-number">2.5.</span> <span class="toc-text">Terminology Used</span></a></li></ol></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E5%8F%82%E8%80%83"><span class="toc-number">3.</span> <span class="toc-text">References</span></a></li></ol>
</div>
</div>
<div class="post-content">
<header>
<div class="icon"></div>
<time datetime="2019-03-02T16:00:00.000Z"><a href="/2019/03/03/20190303DataWhaleNLPTask1/">2019-03-03</a></time>
<h1 class="title"><a href="/2019/03/03/20190303DataWhaleNLPTask1/">Installing TensorFlow on Windows and IMDB Text Classification</a></h1>
</header>
<div class="entry">
<h2 id="windows下进行TensorFlow-GPU版本安装-不适用于CPU版本"><a href="#windows下进行TensorFlow-GPU版本安装-不适用于CPU版本" class="headerlink" title="windows下进行TensorFlow GPU版本安装(不适用于CPU版本)"></a>Installing the TensorFlow GPU Version on Windows (Does Not Apply to the CPU Version)</h2><p>For various reasons, this experiment runs on Windows.<br>The machine is configured as follows:<br></p><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">Hardware:</span><br><span class="line"> - CPU: Intel Xeon E3-1505M</span><br><span class="line"> - RAM: 64GB</span><br><span class="line"> - GPU: NVIDIA Quadro M2000M</span><br><span class="line">Software:</span><br><span class="line"> - OS: Windows 10 Pro for Workstations 1809 x64</span></pre></td></tr></tbody></table></figure><p></p>
<p>The steps that worked are listed below:<br></p><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">1. Make sure the installed `Python` version is `3.6.X`; at the time of writing, `TensorFlow` does not yet fully support `3.7.X`</span><br><span class="line">2. Install CUDA 10.0. Do not use CUDA 10.1 for now, because it is not supported yet</span><br><span class="line">3. Install cuDNN for CUDA 10.0</span><br><span class="line">4. Install tensorflow-gpu</span><br><span class="line">5. Add the corresponding environment variables</span><br><span class="line">SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\bin;%PATH%</span><br><span class="line">SET PATH=C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v10.0\extras\CUPTI\libx64;%PATH%</span><br><span class="line">SET PATH=C:\tools\cuda\bin;%PATH%</span><br><span class="line">6. Verify (after running the command below there is a wait of about 3 minutes; the cause is currently unknown)</span><br><span class="line">python -c "import tensorflow as tf; tf.enable_eager_execution(); print(tf.reduce_sum(tf.random_normal([1000, 1000])))"</span><br><span class="line">7. Result</span><br><span class="line">2019-03-03 17:28:53.391788: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2</span><br><span class="line">2019-03-03 17:28:53.569006: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:</span><br><span class="line">name: Quadro M2000M major: 5 minor: 0 memoryClockRate(GHz): 1.137</span><br><span class="line">pciBusID: 0000:01:00.0</span><br><span class="line">totalMemory: 4.00GiB freeMemory: 3.34GiB</span><br><span class="line">2019-03-03 17:28:53.600653: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0</span><br><span class="line">2019-03-03 17:28:54.903988: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:</span><br><span class="line">2019-03-03 17:28:54.921578: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990] 0</span><br><span class="line">2019-03-03 17:28:54.932974: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0: N</span><br><span class="line">2019-03-03 17:28:54.946509: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 3048 MB memory) -&gt; physical GPU (device: 0, name: Quadro M2000M, pci bus id: 0000:01:00.0, compute capability: 5.0)</span><br><span class="line">tf.Tensor(1187.2697, shape=(), dtype=float32)</span></pre></td></tr></tbody></table></figure><p></p>
<h2 id="使用-imdb数据集进行文本分类"><a href="#使用-imdb数据集进行文本分类" class="headerlink" title="使用 imdb数据集进行文本分类"></a>Text Classification with the IMDB Dataset</h2><p>First, download the dataset with the following code:<br></p><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">import tensorflow as tf</span><br><span class="line">from tensorflow import keras</span><br><span class="line">imdb = keras.datasets.imdb</span><br><span class="line">(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)</span></pre></td></tr></tbody></table></figure><p></p>
<p>Once the download finishes, we can see that the dataset contains 25,000 training examples and 25,000 test examples:<br></p><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">print("Training entries: {}, test entries: {}".format(len(train_data), len(test_data)))</span></pre></td></tr></tbody></table></figure><p></p>
<p>Let's look at the format of one specific IMDB review:<br></p><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">print(train_data[0])</span><br><span class="line">----</span><br><span class="line">[1, 14, 22, 16, 43, 530, 973, 1622, 1385, 65, 458, 4468, 66, 3941, 4, 173, 36, 256, 5, 25, 100, 43, 838, 112, 50, 670, 2, 9, 35, 480, 284, 5, 150, 4, 172, 112, 167, 2, 336, 385, 39, 4, 172, 4536, 1111, 17, 546, 38, 13, 447, 4, 192, 50, 16, 6, 147, 2025, 19, 14, 22, 4, 1920, 4613, 469, 4, 22, 71, 87, 12, 16, 43, 530, 38, 76, 15, 13, 1247, 4, 22, 17, 515, 17, 12, 16, 626, 18, 2, 5, 62, 386, 12, 8, 316, 8, 106, 5, 4, 2223, 5244, 16, 480, 66, 3785, 33, 4, 130, 12, 16, 38, 619, 5, 25, 124, 51, 36, 135, 48, 25, 1415, 33, 6, 22, 12, 215, 28, 77, 52, 5, 14, 407, 16, 82, 2, 8, 4, 107, 117, 5952, 15, 256, 4, 2, 7, 3766, 5, 723, 36, 71, 43, 530, 476, 26, 400, 317, 46, 7, 4, 2, 1029, 13, 104, 88, 4, 381, 15, 297, 98, 32, 2071, 56, 26, 141, 6, 194, 7486, 18, 4, 226, 22, 21, 134, 476, 26, 480, 5, 144, 30, 5535, 18, 51, 36, 28, 224, 92, 25, 104, 4, 226, 65, 16, 38, 1334, 88, 12, 16, 283, 5, 16, 4472, 113, 103, 32, 15, 16, 5345, 19, 178, 32]</span></pre></td></tr></tbody></table></figure><p></p>
<p>This shows that every review has already been converted to integers according to a dictionary. Let's map the integers back to the actual words:<br></p><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line"># Build a dictionary that maps words to integer indices</span><br><span class="line">word_index = imdb.get_word_index()</span><br><span class="line"># Shift every index by 3 to reserve the first indices for special tokens</span><br><span class="line">word_index = {k:(v+3) for k,v in word_index.items()}</span><br><span class="line">word_index["&lt;PAD&gt;"] = 0</span><br><span class="line">word_index["&lt;START&gt;"] = 1</span><br><span class="line">word_index["&lt;UNK&gt;"] = 2 # unknown</span><br><span class="line">word_index["&lt;UNUSED&gt;"] = 3</span><br><span class="line"></span><br><span class="line"># Build the reverse dictionary that maps indices back to words</span><br><span class="line">index_word = dict([(value, key) for (key, value) in word_index.items()])</span><br><span class="line"></span><br><span class="line">def decode_review(text):</span><br><span class="line">    return ' '.join([index_word.get(i, '?') for i in text])</span></pre></td></tr></tbody></table></figure><p></p>
<p>Now <code>decode_review(train_data[0])</code> gives back the original review:<br></p><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">"&lt;START&gt; this film was just brilliant casting location scenery story direction everyone's really suited the part they played and you could just imagine being there robert &lt;UNK&gt; is an amazing actor and now the same being director &lt;UNK&gt; father came from the same scottish island as myself so i loved the fact there was a real connection with this film the witty remarks throughout the film were great it was just brilliant so much that i bought the film as soon as it was released for &lt;UNK&gt; and would recommend it to everyone to watch and the fly fishing was amazing really cried at the end it was so sad and you know what they say if you cry at a film it must have been good and this definitely was also &lt;UNK&gt; to the two little boy's that played the &lt;UNK&gt; of norman and paul they were just brilliant children are often left out of the &lt;UNK&gt; list i think because the stars that play them all grown up are such a big profile for the whole film but these children are amazing and should be praised for what they have done don't you think the whole story was so lovely because it was true and was someone's life after all that was shared with us all"</span></pre></td></tr></tbody></table></figure><p></p>
<p>To prepare the data for the neural network, we pad (or truncate) every review to a fixed length of 256:<br></p><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">train_data = keras.preprocessing.sequence.pad_sequences(train_data,</span><br><span class="line">                                                        value=word_index["&lt;PAD&gt;"],</span><br><span class="line">                                                        padding='post',</span><br><span class="line">                                                        maxlen=256)</span><br><span class="line"></span><br><span class="line">test_data = keras.preprocessing.sequence.pad_sequences(test_data,</span><br><span class="line">                                                       value=word_index["&lt;PAD&gt;"],</span><br><span class="line">                                                       padding='post',</span><br><span class="line">                                                       maxlen=256)</span></pre></td></tr></tbody></table></figure><p></p>
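<p>What the padding step does to each sequence can be shown without TensorFlow. The helper below is a hypothetical pure-Python stand-in, not the Keras implementation. It assumes the behavior of <code>pad_sequences</code> with <code>padding='post'</code> and the default <code>truncating='pre'</code>: short sequences get the padding value appended, and sequences that are too long lose tokens from the front.</p>

```python
def pad_post(sequences, maxlen, value=0):
    """Pad at the end with `value`; drop tokens from the front if too long."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last maxlen tokens
        padded.append(seq + [value] * (maxlen - len(seq)))
    return padded


print(pad_post([[1, 14, 22], [1, 2, 3, 4, 5, 6]], maxlen=4))
# [[1, 14, 22, 0], [3, 4, 5, 6]]
```

<p>After this step every review has the same length, so the whole dataset can be stacked into a single integer tensor of shape (number of reviews, 256).</p>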
<h3 id="构建模型"><a href="#构建模型" class="headerlink" title="构建模型"></a>Build the Model</h3><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line"># Vocabulary size: the 10,000 words kept by load_data</span><br><span class="line">vocab_size = 10000</span><br><span class="line"></span><br><span class="line">model = keras.Sequential()</span><br><span class="line">model.add(keras.layers.Embedding(vocab_size, 16))  # each word is embedded as a 16-dimensional vector</span><br><span class="line">model.add(keras.layers.GlobalAveragePooling1D())</span><br><span class="line">model.add(keras.layers.Dense(16, activation=tf.nn.relu))</span><br><span class="line">model.add(keras.layers.Dense(1, activation=tf.nn.sigmoid))</span><br><span class="line"></span><br><span class="line">model.summary()</span></pre></td></tr></tbody></table></figure>
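<p>The parameter counts that <code>model.summary()</code> reports for this architecture can be checked by hand. The arithmetic below follows from the layer shapes in the model code; the variable names are illustrative only.</p>

```python
vocab_size, embed_dim, hidden_units = 10000, 16, 16

embedding_params = vocab_size * embed_dim                # one 16-d vector per word
pooling_params = 0                                       # averaging has no weights
hidden_params = embed_dim * hidden_units + hidden_units  # weights + biases
output_params = hidden_units * 1 + 1                     # weights + one bias

total_params = (embedding_params + pooling_params
                + hidden_params + output_params)
print(embedding_params, hidden_params, output_params, total_params)
# 160000 272 17 160289
```

<p>Almost all of the parameters sit in the embedding table, which is why the vocabulary size and the embedding dimension dominate the model size.</p>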
<h3 id="创建优化器"><a href="#创建优化器" class="headerlink" title="创建优化器"></a>Create the Optimizer</h3><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line"># Use the Adam optimizer</span><br><span class="line">model.compile(optimizer='adam',</span><br><span class="line">              loss='binary_crossentropy',</span><br><span class="line">              metrics=['acc'])</span><br><span class="line"># Hold out the first 10,000 training examples as a validation set</span><br><span class="line">x_val = train_data[:10000]</span><br><span class="line">partial_x_train = train_data[10000:]</span><br><span class="line"></span><br><span class="line">y_val = train_labels[:10000]</span><br><span class="line">partial_y_train = train_labels[10000:]</span></pre></td></tr></tbody></table></figure>
<h3 id="开始进行训练"><a href="#开始进行训练" class="headerlink" title="开始进行训练"></a>Train the Model</h3><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">history = model.fit(partial_x_train,</span><br><span class="line">                    partial_y_train,</span><br><span class="line">                    epochs=40,</span><br><span class="line">                    batch_size=512,</span><br><span class="line">                    validation_data=(x_val, y_val),</span><br><span class="line">                    verbose=1)</span></pre></td></tr></tbody></table></figure>
<h3 id="评估对应的结果"><a href="#评估对应的结果" class="headerlink" title="评估对应的结果"></a>Evaluate the Results</h3><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">results = model.evaluate(test_data, test_labels)</span><br><span class="line">print(results)</span></pre></td></tr></tbody></table></figure>
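<h3 id="使用到的相关的术语"><a href="#使用到的相关的术语" class="headerlink" title="使用到的相关的术语"></a>Terminology Used</h3><ol>
<li>The ROC (Receiver Operating Characteristic) curve is drawn in much the same way as the precision-recall (P-R) curve. It introduces two quantities, the true positive rate (TPR) and the false positive rate (FPR):<script type="math/tex; mode=display">
TPR=TP/(TP+FN)</script><script type="math/tex; mode=display">
FPR=FP/(FP+TN)</script></li>
<li>Accuracy (ACC): the fraction of all examples that are classified correctly, (TP+TN)/(TP+TN+FP+FN).</li>
<li>Recall: the fraction of the truly positive examples that are predicted as positive, TP/(TP+FN); this is the same quantity as TPR.</li>
</ol>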
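<p>The three definitions above translate directly into code. The sketch below computes them from binary labels and predictions; the function name and the sample data are made up for illustration.</p>

```python
def binary_metrics(y_true, y_pred):
    """Compute TPR (recall), FPR, and accuracy for 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return {
        "tpr": tp / (tp + fn),                   # identical to recall
        "fpr": fp / (fp + tn),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
    }


# Hypothetical labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
# {'tpr': 0.75, 'fpr': 0.25, 'accuracy': 0.75}
```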
<h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><p><a href="https://tensorflow.google.cn/tutorials/keras/basic_text_classification" target="_blank" rel="noopener">使用keras进行文本分类</a></p>
<hr>
<p>Permalink: <a href="https://blog.littleji.com/2019/03/03/20190303DataWhaleNLPTask1/">https://blog.littleji.com/2019/03/03/20190303DataWhaleNLPTask1/</a></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>
</div>
<footer>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div id="toc" class="toc-article">
<strong class="toc-title">Contents</strong>
<a class="js-toggle-toc" href="javascript:void(0)"></a>
<div class="toc-content">
<ol class="toc"><li class="toc-item toc-level-2"><a class="toc-link" href="#1-%E9%A1%B9%E7%9B%AE%E4%BA%A4%E6%8E%A5%E7%9A%84%E5%9C%BA%E6%99%AF"><span class="toc-number">1.</span> <span class="toc-text">1 Handover Scenarios</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#2-%E6%96%87%E6%A1%A3%E7%9B%B8%E5%85%B3%E6%B3%A8%E6%84%8F%E7%82%B9"><span class="toc-number">2.</span> <span class="toc-text">2 Notes on Documentation</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#3-%E9%A1%B9%E7%9B%AE%E6%BA%90%E7%A0%81%E7%9B%B8%E5%85%B3%E6%B3%A8%E6%84%8F%E7%82%B9"><span class="toc-number">3.</span> <span class="toc-text">3 Notes on Project Source Code</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#4-%E6%95%B0%E6%8D%AE%E5%BA%93%E7%9B%B8%E5%85%B3%E6%B3%A8%E6%84%8F%E7%82%B9"><span class="toc-number">4.</span> <span class="toc-text">4 Notes on Databases</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#5-%E7%8E%AF%E5%A2%83%E7%9B%B8%E5%85%B3%E6%B3%A8%E6%84%8F%E7%82%B9"><span class="toc-number">5.</span> <span class="toc-text">5 Notes on Environments</span></a><ol class="toc-child"><li class="toc-item toc-level-3"><a class="toc-link" href="#5-1-%E6%9C%AC%E5%9C%B0%E5%BC%80%E5%8F%91%E7%8E%AF%E5%A2%83%E9%85%8D%E7%BD%AE"><span class="toc-number">5.1.</span> <span class="toc-text">5.1 Local Development Environment Setup</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#5-2-%E6%B5%8B%E8%AF%95%E7%8E%AF%E5%A2%83"><span class="toc-number">5.2.</span> <span class="toc-text">5.2 Test Environment</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#5-3-%E9%A2%84%E4%B8%8A%E7%BA%BF%E7%8E%AF%E5%A2%83"><span class="toc-number">5.3.</span> <span class="toc-text">5.3 Staging Environment</span></a></li><li class="toc-item toc-level-3"><a class="toc-link" href="#5-4-%E6%AD%A3%E5%BC%8F%E7%8E%AF%E5%A2%83"><span class="toc-number">5.4.</span> <span class="toc-text">5.4 Production Environment</span></a></li></ol></li><li class="toc-item toc-level-2"><a class="toc-link" href="#6-%E5%AF%B9%E6%8E%A5%E4%BA%BA%E7%9B%B8%E5%85%B3%E6%B3%A8%E6%84%8F%E7%82%B9"><span class="toc-number">6.</span> <span class="toc-text">6 Notes on Points of Contact</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#7-%E4%BA%A4%E6%8E%A5%E7%9A%84%E5%BD%A2%E5%BC%8F"><span class="toc-number">7.</span> <span class="toc-text">7 Format of the Handover</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E5%A4%87%E6%B3%A8"><span class="toc-number">8.</span> <span class="toc-text">Notes</span></a></li><li class="toc-item toc-level-2"><a class="toc-link" href="#%E5%8F%82%E8%80%83"><span class="toc-number">9.</span> <span class="toc-text">References</span></a></li></ol>
</div>
</div>
<div class="post-content">
<header>
<div class="icon"></div>
<time datetime="2019-02-18T16:00:00.000Z"><a href="/2019/02/19/20190219HandoverMemoList/">2019-02-19</a></time>
<h1 class="title"><a href="/2019/02/19/20190219HandoverMemoList/">Software Project Handover Checklist</a></h1>
</header>
<div class="entry">
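<h2 id="1-项目交接的场景"><a href="#1-项目交接的场景" class="headerlink" title="1 项目交接的场景"></a>1 Handover Scenarios</h2><ul>
<li>A colleague is leaving and hands their work over to you.</li>
<li>You are leaving and hand your work over to a colleague.</li>
<li>The project is being reorganized and the work moves to another project team.</li>
</ul>
<h2 id="2-文档相关注意点"><a href="#2-文档相关注意点" class="headerlink" title="2 文档相关注意点"></a>2 Notes on Documentation</h2><ul>
<li>Identify, for each release, the product requirements document (PRD), the high-level design document, the detailed design document (flowcharts, architecture diagrams, upstream/downstream component interaction diagrams), and the interface design document.</li>
<li>Identify the PSD files or other visual designs and image slices that may be needed during development.</li>
<li>Identify, for each release, the test coverage document, the unit test document, and the outstanding bugs.</li>
<li>Identify the development progress document for the current release, covering what is not started, in progress, and finished.</li>
<li>Identify the user guide for each release.</li>
<li>Identify the FAQ document.</li>
<li>Identify the product, development, and test staff for each release.</li>
<li>Identify, for each release, the development start date, the date it was handed to testing, and the launch date.</li>
<li>Identify any patent, software copyright, or similar documents written so far.</li>
<li>Understand the known risks in the software, the outstanding issues, and so on.</li>
<li>Note any other conventions, such as the software development methodology in use (Scrum, XP, waterfall, etc.).</li>
</ul>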
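<h2 id="3-项目源码相关注意点"><a href="#3-项目源码相关注意点" class="headerlink" title="3 项目源码相关注意点"></a>3 Notes on Project Source Code</h2><ul>
<li>Make sure the developers' latest changes are committed.</li>
<li>Identify the project source code, which should carry detailed, consistently formatted comments.</li>
<li>Get a description of the source tree that explains what each part is for.</li>
<li>Identify the project's public APIs.</li>
<li>Identify the project's scheduled scripts.</li>
<li>Identify the platform used to view the project's logs.</li>
<li>Get access to SVN.</li>
<li>Get access to Git.</li>
<li>Get access to every system involved in the handover.</li>
<li>Understand the deployment scripts and the release process.</li>
<li>Understand the coding conventions.</li>
</ul>
<h2 id="4-数据库相关注意点"><a href="#4-数据库相关注意点" class="headerlink" title="4 数据库相关注意点"></a>4 Notes on Databases</h2><ul>
<li>Identify the relevant databases and table schemas.</li>
<li>Check for database, table, or column names that lack comments, and pin down what they mean.</li>
<li>Ideally, understand the meaning of every database, table, and column, and record it in the documentation.</li>
<li>Ideally, confirm which module each table belongs to, and record it in the documentation.</li>
</ul>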
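<h2 id="5-环境相关注意点"><a href="#5-环境相关注意点" class="headerlink" title="5 环境相关注意点"></a>5 Notes on Environments</h2><h3 id="5-1-本地开发环境配置"><a href="#5-1-本地开发环境配置" class="headerlink" title="5.1 本地开发环境配置"></a>5.1 Local Development Environment Setup</h3><ul>
<li>Set up the environment on your own computer and get the project running there.</li>
<li>Check whether any other extensions are required; if they need accounts or ports, remember to record them.</li>
<li>Check whether there are configuration files; if so, get an explanation of every configuration item.</li>
</ul>
<h3 id="5-2-测试环境"><a href="#5-2-测试环境" class="headerlink" title="5.2 测试环境"></a>5.2 Test Environment</h3><p>Get an account for the test environment.</p>
<h3 id="5-3-预上线环境"><a href="#5-3-预上线环境" class="headerlink" title="5.3 预上线环境"></a>5.3 Staging Environment</h3><ul>
<li>Get an account for the staging environment.</li>
<li>Confirm the domain name of the staging environment, whether specific hosts entries are needed, and so on.</li>
</ul>
<h3 id="5-4-正式环境"><a href="#5-4-正式环境" class="headerlink" title="5.4 正式环境"></a>5.4 Production Environment</h3><p>Get an account for the production environment.</p>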
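<h2 id="6-对接人相关注意点"><a href="#6-对接人相关注意点" class="headerlink" title="6 对接人相关注意点"></a>6 Notes on Points of Contact</h2><ul>
<li>Identify the contact person for testing.</li>
<li>Identify the contact person for product.</li>
<li>Identify the contact people in other departments involved in the project.</li>
<li>Identify the contact people for operations and for the DBAs.</li>
</ul>
<h2 id="7-交接的形式"><a href="#7-交接的形式" class="headerlink" title="7 交接的形式"></a>7 Format of the Handover</h2><ul>
<li>Ideally, hold a handover meeting with the relevant product, development, and test staff, and answer each question from the person taking over carefully, one by one.</li>
<li>Walk through the code together, with both sides of the handover taking part.</li>
</ul>
<h2 id="备注"><a href="#备注" class="headerlink" title="备注"></a>Notes</h2><ul>
<li>Ideally, start the handover when the employee gives notice, not on their last day.</li>
<li>If you are the one leaving, talk to your manager in advance so that they have time to find someone to take over.</li>
<li>Whenever something is unclear during the handover, make sure to clarify it and write it down.</li>
</ul>
<h2 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h2><p><a href="https://zhuanlan.zhihu.com/p/29297794" target="_blank" rel="noopener">程序员如何做好交接</a><br><a href="http://www.kuqin.com/projectmanage/20110824/263959.html" target="_blank" rel="noopener">项目交接小总结</a><br><a href="https://blog.csdn.net/shehun1/article/details/7788875" target="_blank" rel="noopener">文档交接说明书</a><br><a href="https://www.jianshu.com/p/00490a9ae109" target="_blank" rel="noopener">前端项目交接文档</a></p>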
<hr>
<p>Copyright: this article was written and published by littleji.com. Please credit the author and source when reposting. WeChat official account: littleji_com<br><a href="https://creativecommons.org/licenses/by-sa/4.0/" target="_blank" rel="noopener">This article is licensed under CC BY-SA 4.0</a><br>If you have any questions, please leave a comment or open an <a href="https://github.com/littleji/littleji.github.io/issues" target="_blank" rel="noopener">issue</a></p>
<p>Permalink: <a href="https://blog.littleji.com/2019/02/19/20190219HandoverMemoList/">https://blog.littleji.com/2019/02/19/20190219HandoverMemoList/</a></p>
<script>
document.querySelectorAll('.github-emoji')
.forEach(el => {
if (!el.dataset.src) { return; }
const img = document.createElement('img');
img.style = 'display:none !important;';
img.src = el.dataset.src;
img.addEventListener('error', () => {
img.remove();
el.style.color = 'inherit';
el.style.backgroundImage = 'none';
el.style.background = 'none';
});
img.addEventListener('load', () => {
img.remove();
});
document.body.appendChild(img);
});
</script>
</div>
<footer>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content">
<header>
<div class="icon"></div>
<time datetime="2019-02-11T16:00:00.000Z"><a href="/2019/02/12/20190212OptimizeTheDevelopmentProcess/">2019-02-12</a></time>
<h1 class="title"><a href="/2019/02/12/20190212OptimizeTheDevelopmentProcess/">Using GitLab to optimize our software development process</a></h1>
</header>
<div class="entry">
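<h1 id="1-向领导汇报(内在驱动)"><a href="#1-向领导汇报(内在驱动)" class="headerlink" title="1 向领导汇报(内在驱动)"></a>1 Reporting to management (internal drivers)</h1><h2 id="员工为什么不愿意做优化"><a href="#员工为什么不愿意做优化" class="headerlink" title="员工为什么不愿意做优化"></a>Why employees are reluctant to optimize</h2><ol>
<li>People are lazy by nature, and employees are no exception. They do not set out to do things well; they do as much work as they are paid for, and often not even that. Most people are just coasting.</li>
<li>People dislike change by nature and are used to routine. The routine may be unpleasant, but it still gets something done. Force people to change and you are pressuring them, so they will naturally push back.</li>
<li>The people who really understand the business will never join the project team. None of them wants to work overtime alongside overworked engineers. They would rather wine and dine with our users than deal with your odd artifacts (such as the backlog) or with introverted, eccentric engineers, let alone with the technology itself.</li>
<li>Sales will do whatever it takes. You are put on a project because you are cheap labor, and they will keep adding requirements: when the price in the software contract was agreed, there were no requirements at all. They appear only once you start work, and they are vague, uncertain, or simply wrong; then they keep growing and changing. By the time you are exhausted, you realize that sales sold you out long ago.</li>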
</ol>
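<h1 id="2-向开发人员讲解(外在驱动)"><a href="#2-向开发人员讲解(外在驱动)" class="headerlink" title="2 向开发人员讲解(外在驱动)"></a>2 Explaining to developers (external drivers)</h1><h2 id="代码审查的好处"><a href="#代码审查的好处" class="headerlink" title="代码审查的好处"></a>Benefits of code review</h2><ol>
<li>It pushes code to be clear and readable: code that cannot be read cannot be reviewed.</li>
<li>You quickly learn from other people's approaches and improve your own skills.</li>
<li>It answers two questions: is the feature really finished, and is it the feature that was actually wanted?</li>
<li>After a department of more than 200 people at AT&amp;T introduced code review, productivity rose by 14% and defects fell by about 90%.</li>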
</ol>
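<h2 id="建立看板(scrum)开发流程的好处"><a href="#建立看板(scrum)开发流程的好处" class="headerlink" title="建立看板(scrum)开发流程的好处"></a>Benefits of a board-based (scrum) development process</h2><ol>
<li>Suppose you are given three large features (redesign the site home page, add user login, add audit logging) and asked roughly when they will be done. A board helps you estimate the project's milestones.</li>
<li>A project has hundreds of items, some bugs and some requirements. What is the most efficient order to work in? Sort the selected requirements by priority.</li>
<li>Requirements are added endlessly: refuse and you lose the deal, accept and the work turns out to be useless. So do you do it or not? Let the team review requests, and do not do useless work.<br>And so on.</li>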
</ol>
<h2 id="建立wiki(知识管理)的好处"><a href="#建立wiki(知识管理)的好处" class="headerlink" title="建立wiki(知识管理)的好处"></a>Benefits of a wiki (knowledge management)</h2><ol>
<li>It works like a notebook of past mistakes, so the team learns faster and improves together.</li>
</ol>
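<h2 id="git的好处"><a href="#git的好处" class="headerlink" title="git的好处"></a>Benefits of git</h2><h2 id="持续集成的好处"><a href="#持续集成的好处" class="headerlink" title="持续集成的好处"></a>Benefits of continuous integration</h2><ol>
<li>After a large refactoring, would you dare to skip the smoke and regression tests? Continuous integration automates building and releasing the project.</li>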
</ol>
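<h2 id="使用gitlab的好处"><a href="#使用gitlab的好处" class="headerlink" title="使用gitlab的好处"></a>Benefits of using GitLab</h2><p>It covers all of the features above.</p>
<h1 id="3-实施计划"><a href="#3-实施计划" class="headerlink" title="3 实施计划"></a>3 Implementation plan</h1><h2 id="思考"><a href="#思考" class="headerlink" title="思考"></a>Thoughts</h2><p>Starting straight away with continuous integration and automated testing is clearly unrealistic (think of a team where 20 people and their manager have no idea what they are building).</p>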
<ol>
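<li>The more realistic first step is to find out which features and tasks are currently in development, and to set up a framework for iterative development. Otherwise there is no way to tell whether everyone's work can even be integrated.</li>
<li>Doing only this, however, easily leads to trouble: under schedule pressure, people gradually lower the bar for what an iteration delivers and count on the testing phase to save the day.</li>
<li>At this point the easier move is to define some iteration delivery criteria and use them as a quality gate.</li>
<li>After several iterations, some release will inevitably go wrong. Seize that opportunity to raise the delivery criteria and introduce continuous integration, so that problems no longer surface only at release time.</li>
<li>Continuous integration naturally brings automated testing, because integrating by hand is not feasible.</li>
<li>Once continuous integration and automated testing are in place, people are used to getting Build and Release versions through this toolchain, and no amount of pressure will easily make the team cut corners.<br>Of course, if the boss realizes early on that they should help us rather than wait to be persuaded, we can move faster and introduce continuous integration and automated testing at an early stage.<br>Even so, the three principles remain guidelines that must be followed. In other words, even with a reform-minded boss, do not try to reach utopia in a single step. Change things gradually in an agile spirit, show the returns, strengthen management's confidence, and the change will eventually succeed.</li>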
</ol>
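<h2 id="阶段1"><a href="#阶段1" class="headerlink" title="阶段1"></a>Phase 1</h2><ul>
<li>The git workflow</li>
<li>Basic GitLab usage (members, code management, common issue labels, issue templates)</li>
<li>Running the git workflow on GitLab</li>
<li>Building a wiki on GitLab for knowledge sharing</li>
<li>An introduction to TDD</li>
<li>Deployment scripts</li>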
</ul>
<h2 id="阶段2"><a href="#阶段2" class="headerlink" title="阶段2"></a>Phase 2</h2><ul>
<li>Establishing iteration criteria (code coverage, feature coverage, smoke tests)</li>
<li>Code review on GitLab</li>
<li>Continuous integration and release with GitLab + Jenkins</li>
</ul>
<h2 id="阶段3"><a href="#阶段3" class="headerlink" title="阶段3"></a>Phase 3</h2><ul>
<li>A scrum-based software development process</li>
<li>Agile development practices on GitLab</li>
<li>Setting up boards on GitLab</li>
<li>Setting up milestones (minor versions) on GitLab</li>
<li>Organizing groups on GitLab</li>
</ul>
<h1 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h1><p><a href="https://blog.csdn.net/cheny_com/article/details/5528892" target="_blank" rel="noopener">无烟会议室:CMMI vs. Scrum vs. XP(QCon 2010 感受)</a><br><a href="https://waylau.com/why-we-need-continuous-integration/" target="_blank" rel="noopener">为什么我们迫切需要持续集成</a></p>
<hr>
<p>Permalink: <a href="https://blog.littleji.com/2019/02/12/20190212OptimizeTheDevelopmentProcess/">https://blog.littleji.com/2019/02/12/20190212OptimizeTheDevelopmentProcess/</a></p>
</div>
<footer>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content">
<header>
<div class="icon"></div>
<time datetime="2018-12-24T16:00:00.000Z"><a href="/2018/12/25/20181225MLReview3/">2018-12-25</a></time>
<h1 class="title"><a href="/2018/12/25/20181225MLReview3/">Seven-day algorithm review: decision trees</a></h1>
</header>
<div class="entry">
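<h1 id="信息论基础"><a href="#信息论基础" class="headerlink" title="信息论基础"></a>Information theory basics</h1><p>Dr. Shannon laid the foundations of information theory in 1948. The basic concepts are described below.</p>
<h2 id="熵"><a href="#熵" class="headerlink" title="熵"></a>Entropy</h2><p>Entropy measures the uncertainty of a random variable: the larger the entropy, the greater the uncertainty.<br>If $X$ is a discrete random variable taking values in $\mathbb R$ with probability distribution $p(x)=P(X=x),x\in \mathbb R$, then the entropy $H(X)$ of $X$ is defined as:</p>
<script type="math/tex; mode=display">
H(X)=-\sum_{x \in \mathbb R}p(x)\log_2 p(x)</script><h2 id="联合熵"><a href="#联合熵" class="headerlink" title="联合熵"></a>Joint entropy</h2><p>Joint entropy is the average amount of information needed to describe a pair of random variables.<br>If $X,Y$ is a pair of discrete random variables with $X,Y \sim p(x,y)$, their joint entropy $H(X,Y)$ is:</p>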
<script type="math/tex; mode=display">
H(X,Y)=-\sum_{x \in X}\sum_{y \in Y}p(x,y)\log_2 p(x,y)</script><h2 id="条件熵"><a href="#条件熵" class="headerlink" title="条件熵"></a>Conditional entropy</h2><p>The conditional entropy $H(Y|X)$ is the uncertainty that remains in $Y$ once $X$ is known:</p>
<script type="math/tex; mode=display">
H(Y|X)=-\sum_{x \in X}\sum_{y \in Y}p(x, y)\log_2 p(y | x)</script><p>Expanding the joint probability as $p(x,y)=p(x)p(y|x)$ gives:</p>
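<script type="math/tex; mode=display">
H(X, Y)=-\sum_{x \in X}p(x)\log_2 p(x)-\sum_{x \in X}\sum_{y \in Y}p(x, y)\log_2 p(y | x) = H(X)+H(Y|X)</script><h2 id="信息增益"><a href="#信息增益" class="headerlink" title="信息增益"></a>Information gain</h2><p>Suppose attribute $a$ has $V$ possible values. Splitting the sample set $D$ on $a$ produces $V$ nodes; write $D^v$ for the samples whose value of $a$ is $a^v$. Weighting each node by its share $|D^v|/|D|$ gives the information gain obtained by splitting $D$ on $a$: the entropy of the whole sample set minus the weighted sum of the entropies of the subsets produced by the split.</p>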
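<p>Let the entropy of the sample set $D$ be</p>
<script type="math/tex; mode=display">
Ent(D)=-\sum_{k=1}^{|\mathcal{Y}|}p_k\log_2 p_k</script><p>where $p_k$ is the proportion of samples in $D$ that belong to class $k$ and $|\mathcal{Y}|$ is the number of classes. The information gain is then:</p>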
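<script type="math/tex; mode=display">
Gain(D,a)=Ent(D)-\sum_{v=1}^{V}\frac{|D^v|}{|D|}Ent(D^v)</script><h2 id="基尼不纯度"><a href="#基尼不纯度" class="headerlink" title="基尼不纯度"></a>Gini impurity</h2><p>Gini impurity is the measure CART uses to choose the splitting attribute. Intuitively, it is the probability that two samples drawn at random from a dataset $D$ belong to different classes. The formula is:</p>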
<script type="math/tex; mode=display">
Gini(D)=\sum_{k=1}^{|\mathcal{Y}|}\sum_{k'\neq k}p_kp_{k'}=1-\sum_{k=1}^{|\mathcal{Y}|}p_k^2</script><h1 id="决策树的不同分类算法"><a href="#决策树的不同分类算法" class="headerlink" title="决策树的不同分类算法"></a>Decision tree classification algorithms</h1><h2 id="ID3算法"><a href="#ID3算法" class="headerlink" title="ID3算法"></a>The ID3 algorithm</h2><p>The procedure is as follows:</p>
<ol>
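<li>First handle the base cases: all samples belong to one class, or no attributes remain.</li>
<li>Compute the information gain of each attribute.</li>
<li>Split the node on the attribute with the largest information gain, creating one branch per value.</li>
<li>For each child node, return to step 2 and repeat on that node's samples.</li>
<li>Stop when an exit condition is met: no features remain, or the information gain is small.<br>ID3 only grows the tree and has no pruning step, so it may overfit.</li>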
</ol>
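<p>Step 2 and step 3 above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the helper names <code>entropy</code>, <code>information_gain</code>, and <code>best_attribute</code> and the toy weather data are made up for this example and do not come from any library.</p>

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels, i.e. Ent(D)."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Ent(D) minus the weighted entropy of the subsets D^v induced by column `attr`."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr], []).append(label)
    weighted = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - weighted


def best_attribute(rows, labels, attrs):
    """ID3 step 3: pick the attribute index with the largest information gain."""
    return max(attrs, key=lambda a: information_gain(rows, labels, a))


# Hypothetical toy data: columns are (outlook, windy), the label is "play?"
rows = [("sunny", "no"), ("sunny", "yes"), ("rain", "no"), ("rain", "yes")]
labels = ["yes", "yes", "no", "no"]

chosen = best_attribute(rows, labels, [0, 1])
```

<p>In this toy set the outlook column separates the classes perfectly, so it carries all of the information gain and is the attribute chosen for the first split.</p>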
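<h2 id="C4-5"><a href="#C4-5" class="headerlink" title="C4.5"></a>C4.5</h2><p>The information gain ratio is the information gain $Gain(D,a)$ divided by the entropy $H_a(D)$ of the dataset $D$ with respect to the values of attribute $a$:</p>
<script type="math/tex; mode=display">
g_R(D,a)=\frac{Gain(D,a)}{H_a(D)},\qquad H_a(D)=-\sum_{v=1}^{V}\frac{|D^v|}{|D|}\log_2\frac{|D^v|}{|D|}</script><p>C4.5 improves on ID3 by selecting features with the information gain ratio while growing the tree, rather than with information gain alone.<br>The algorithm is as follows:<br>Given a dataset D, a feature set A, and a threshold ε</p>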
<ol>
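<li>If all samples in D belong to the same class, return a single-node tree with that class and stop.</li>
<li>If $A=\varnothing$, return a single-node tree labeled with the class that has the most instances and stop.</li>
<li>Otherwise, select the feature with the largest information gain ratio.</li>
<li>Split on that feature, then visit each child node in turn, compute the information gain ratios within that node, and iterate.</li>
<li>Stop when an exit condition is met: the information gain ratio is too low, or no features remain.</li>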
</ol>
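<p>The selection criterion used in step 3 can be sketched as follows. This assumes the usual C4.5 definition, in which the denominator is the entropy of the feature's own value distribution; the function names and the toy columns are hypothetical.</p>

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of a feature divided by the entropy of the feature's values."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    gain = entropy(labels) - sum(len(g) / total * entropy(g) for g in groups.values())
    split_info = entropy(feature_values)
    # A feature with a single value cannot split the data; report a ratio of 0.
    return gain / split_info if split_info > 0 else 0.0


labels = ["yes", "yes", "no", "no"]
sample_id = ["a", "b", "c", "d"]               # unique per sample, like an ID column
outlook = ["sunny", "sunny", "rain", "rain"]   # two values, each pure

ratio_id = gain_ratio(sample_id, labels)
ratio_outlook = gain_ratio(outlook, labels)
```

<p>Both columns have the same information gain here, but the ID-like column is penalized for having many distinct values, which is the bias of plain information gain that the ratio is meant to correct.</p>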
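<p>The trees built above are all classification trees; they differ only in how nodes are split. So what is a regression tree?</p>
<h1 id="回归树原理"><a href="#回归树原理" class="headerlink" title="回归树原理"></a>How regression trees work</h1><p>A regression tree partitions the samples by scanning all input variables for the best splitting variable j and split point s: it picks the j-th feature $x^j$ and a value s of that feature, divides the input space into two regions, and then repeats. This works well for continuous target values.<br>The algorithm is as follows:</p>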
<ol>
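<li>Choose the best splitting variable j and split point s by solving <script type="math/tex; mode=display">
\min_{j,s}\left[\min_{c_{1}}\sum_{x_{i}\in R_{1}(j,s)}(y_{i}-c_{1})^2+\min_{c_{2}}\sum_{x_{i}\in R_{2}(j,s)}(y_{i}-c_{2})^2\right]</script></li>
<li>To do so, iterate over all features and, for each fixed feature, scan all of its values to find the pair (j,s) that minimizes the expression above.</li>
<li>Split the region with the chosen pair (j,s) and set the predicted value of each resulting region (the mean of the $y_i$ that fall in it).</li>
<li>Apply the steps above to the two subregions, until the stopping condition is met.</li>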
</ol>
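<p>The search over (j,s) in steps 1 and 2 can be sketched with a brute-force scan. This is a minimal illustration, not an efficient implementation; the names <code>best_split</code> and <code>sum_sq</code> and the toy data are assumptions made for the example.</p>

```python
def sum_sq(values):
    """Squared error around the mean, i.e. the inner minimum over c in the objective."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(X, y):
    """Return (j, s, loss) minimizing the summed squared error of the two regions."""
    best = None
    for j in range(len(X[0])):                      # every feature
        for s in sorted({row[j] for row in X}):     # every observed value as a split point
            left = [yi for row, yi in zip(X, y) if row[j] <= s]
            right = [yi for row, yi in zip(X, y) if row[j] > s]
            if not left or not right:
                continue                            # a split must produce two regions
            loss = sum_sq(left) + sum_sq(right)
            if best is None or loss < best[2]:
                best = (j, s, loss)
    return best


# Hypothetical one-feature data with an obvious jump between x = 3 and x = 10
X = [[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]]
y = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8]

j, s, loss = best_split(X, y)
```

<p>The scan places the split point at the last value before the jump, and each region would then predict the mean of its own targets.</p>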
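<h2 id="CART分类树"><a href="#CART分类树" class="headerlink" title="CART分类树"></a>CART classification trees</h2><p>CART stands for classification and regression tree. Its main idea is that every internal node tests a feature with a yes/no answer, with the left branch for "yes" and the right branch for "no", so the decision tree represents a conditional probability distribution over the whole sample space.<br>The algorithm consists of feature selection, tree generation, and pruning, which the two preceding algorithms lack. It has two main parts: tree generation and pruning.</p>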
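<h3 id="CART的生成"><a href="#CART的生成" class="headerlink" title="CART的生成"></a>Generating a CART tree</h3><p>Starting from the root, compute the Gini index of the available features at the node. For each feature $A$ and each of its possible values $a$, split the samples into two parts according to whether the test $A=a$ answers "yes" or "no", and compute</p>
<script type="math/tex; mode=display">
Gini(D,A=a)=\frac{|D_{1}|}{|D|}Gini(D_{1})+\frac{|D_{2}|}{|D|}Gini(D_{2})</script><p>Among all features $A$ and all their possible values $a$, choose the feature and value with the smallest Gini index as the best feature and best split point. Split the node's dataset in two accordingly, producing two child nodes.<br>Apply these steps recursively to the two child nodes, stopping when the number of samples in a node falls below a threshold, the Gini index of the sample set falls below a threshold, or no features remain.</p>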
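<p>The binary split score above can be computed directly. The sketch below uses the equivalent form of the Gini impurity, one minus the sum of squared class proportions; the function names and the toy colour data are hypothetical.</p>

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels; an empty list counts as pure."""
    if not labels:
        return 0.0
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def gini_split(feature_values, labels, a):
    """Gini(D, A=a): weighted impurity of the "A == a" part and the "A != a" part."""
    d1 = [y for v, y in zip(feature_values, labels) if v == a]
    d2 = [y for v, y in zip(feature_values, labels) if v != a]
    total = len(labels)
    return len(d1) / total * gini(d1) + len(d2) / total * gini(d2)


# Hypothetical single feature with three possible values
colour = ["red", "red", "green", "blue"]
labels = ["yes", "yes", "no", "yes"]

# The best split point is the value with the smallest Gini index
best_value = min(set(colour), key=lambda a: gini_split(colour, labels, a))
```

<p>Testing for "green" isolates the only "no" sample, so both parts are pure and that value is selected as the split point.</p>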
<h3 id="CART的剪枝"><a href="#CART的剪枝" class="headerlink" title="CART的剪枝"></a>Pruning a CART tree</h3><p>Pruning trims and simplifies the generated tree, usually by minimizing the overall loss (or cost) function of the tree.<br>CART pruning has two steps:</p>
<ol>
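<li>Starting from the bottom of the tree, prune repeatedly until only the root remains, producing a sequence of subtrees.</li>
<li>Evaluate the subtrees in the sequence by cross-validation and pick the best one.</li>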
</ol>
<h1 id="决策树防止过拟合手段"><a href="#决策树防止过拟合手段" class="headerlink" title="决策树防止过拟合手段"></a>Preventing overfitting in decision trees</h1><p>There are two main ways to prevent a decision tree from overfitting: early stopping and pruning.</p>
<ol>
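<li>Early stopping: limit the total number of split nodes, the depth of the tree, the number of samples in a node, thresholds, and so on.</li>
<li>Pruning: if splitting the current node does not improve the tree's generalization performance, remove the corresponding node.</li>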
</ol>
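<p>Both approaches are available in scikit-learn. The sketch below assumes scikit-learn 0.22 or later (for <code>ccp_alpha</code> and <code>cost_complexity_pruning_path</code>) and uses the iris dataset purely as a stand-in; the specific limits chosen are arbitrary.</p>

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Early stopping: cap the depth and require a minimum number of samples per leaf.
early = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0).fit(X, y)

# Pruning: compute the candidate alphas of the cost-complexity pruning path,
# then keep the alpha whose pruned tree has the best cross-validated accuracy.
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)
alphas = sorted({max(float(a), 0.0) for a in path.ccp_alphas})  # guard against tiny negatives

scores = {
    alpha: cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=alpha), X, y, cv=5
    ).mean()
    for alpha in alphas
}
best_alpha = max(scores, key=scores.get)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)

print(early.get_depth(), full.get_n_leaves(), pruned.get_n_leaves())
```

<p>Each alpha corresponds to one subtree in the pruning sequence, so choosing alpha by cross-validation mirrors the two-step CART pruning procedure.</p>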
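<h1 id="模型评估"><a href="#模型评估" class="headerlink" title="模型评估"></a>Model evaluation</h1><p>The methods covered earlier in this series, such as AUC, ROC, cross-validation, and random sampling, can be used here and are not repeated.</p>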
<h1 id="python可视化决策树与对应的函数实现"><a href="#python可视化决策树与对应的函数实现" class="headerlink" title="python可视化决策树与对应的函数实现"></a>Visualizing a decision tree in Python and the functions involved</h1><figure class="highlight plain"><table><tbody><tr><td class="code"><pre><span class="line">import collections</span><br><span class="line"></span><br><span class="line">import pydotplus</span><br><span class="line">from sklearn import tree</span><br><span class="line"></span><br><span class="line"># Training data: [height, hair length, voice pitch]</span><br><span class="line">X = [[180, 15, 0],</span><br><span class="line">     [177, 42, 0],</span><br><span class="line">     [136, 35, 1],</span><br><span class="line">     [174, 65, 0],</span><br><span class="line">     [141, 28, 1]]</span><br><span class="line">Y = ['man', 'woman', 'woman', 'man', 'woman']</span><br><span class="line"></span><br><span class="line">data_feature_names = ['height', 'hair length', 'voice pitch']</span><br><span class="line"></span><br><span class="line"># Train the classifier</span><br><span class="line">clf = tree.DecisionTreeClassifier()</span><br><span class="line">clf = clf.fit(X, Y)</span><br><span class="line"></span><br><span class="line"># Export the fitted tree to Graphviz DOT format</span><br><span class="line">dot_data = tree.export_graphviz(</span><br><span class="line">    clf,</span><br><span class="line">    feature_names=data_feature_names,</span><br><span class="line">    out_file=None,</span><br><span class="line">    filled=True,</span><br><span class="line">    rounded=True,</span><br><span class="line">)</span><br><span class="line">graph = pydotplus.graph_from_dot_data(dot_data)</span><br><span class="line"></span><br><span class="line"># Group child node ids by parent so the two branches can be coloured differently</span><br><span class="line">colors = ('turquoise', 'orange')</span><br><span class="line">edges = collections.defaultdict(list)</span><br><span class="line"></span><br><span class="line">for edge in graph.get_edge_list():</span><br><span class="line">    edges[edge.get_source()].append(int(edge.get_destination()))</span><br><span class="line"></span><br><span class="line">for edge in edges:</span><br><span class="line">    edges[edge].sort()</span><br><span class="line">    for i in range(2):</span><br><span class="line">        dest = graph.get_node(str(edges[edge][i]))[0]</span><br><span class="line">        dest.set_fillcolor(colors[i])</span><br><span class="line"></span><br><span class="line"># Writing the PNG requires the Graphviz binaries to be installed</span><br><span class="line">graph.write_png('tree.png')</span></pre></td></tr></tbody></table></figure>
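<p>The main functions used are <code>tree.DecisionTreeClassifier</code> (training), <code>tree.export_graphviz</code> (export to DOT), and <code>pydotplus.graph_from_dot_data</code> (building the graph).</p>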
<h1 id="参考"><a href="#参考" class="headerlink" title="参考"></a>References</h1><p><a href>统计自然语言处理-宗成庆</a><br><a href>机器学习-周志华</a><br><a href>统计学习方法-李航</a><br><a href="https://blog.csdn.net/weixin_36586536/article/details/80468426" target="_blank" rel="noopener">决策树(分类树、回归树)</a><br><a href="https://pythonprogramminglanguage.com/decision-tree-visual-example/" target="_blank" rel="noopener">Decision tree visual example</a></p>
<hr>
<p>Permalink: <a href="https://blog.littleji.com/2018/12/25/20181225MLReview3/">https://blog.littleji.com/2018/12/25/20181225MLReview3/</a></p>
</div>
<footer>
<div class="clearfix"></div>
</footer>
</div>
</article>
<article class="post">
<div class="post-content">
<header>
<div class="icon"></div>
<time datetime="2018-12-21T16:00:00.000Z"><a href="/2018/12/22/20181222MLReview2/">2018-12-22</a></time>
<h1 class="title"><a href="/2018/12/22/20181222MLReview2/">Seven-day algorithm review: logistic regression</a></h1>
</header>
<div class="entry">
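<h1 id="逻辑回归与线性回归的联系与区别"><a href="#逻辑回归与线性回归的联系与区别" class="headerlink" title="逻辑回归与线性回归的联系与区别"></a>How logistic regression relates to and differs from linear regression</h1><p>Logistic regression passes the output of linear regression through a nonlinear function. That mapping is:</p>
<script type="math/tex; mode=display">
y=\frac{1}{1+e^{-(w^{T}x+b)}}</script><h1 id="逻辑回归的原理"><a href="#逻辑回归的原理" class="headerlink" title="逻辑回归的原理"></a>How logistic regression works</h1><p>Logistic regression takes the linear output and passes it through a nonlinear function so that the result lies between 0 and 1, mostly close to one or the other, which makes it possible to classify the data.</p>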
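<p>The mapping can be written out in a few lines. This is a minimal sketch; <code>predict_proba</code> and the example weights are hypothetical, and the two-branch form of the sigmoid is only there to avoid overflow for large negative inputs.</p>

```python
import math


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + exp(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def predict_proba(w, b, x):
    """Apply the sigmoid to the linear score w^T x + b."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


# Hypothetical weights: the linear score is 0 here, so the output sits on the boundary
p_boundary = predict_proba([1.0, -1.0], 0.0, [2.0, 2.0])
# Large positive or negative scores are squashed toward 1 or 0
p_high = sigmoid(10.0)
p_low = sigmoid(-10.0)
```

<p>A linear score of zero maps to a probability of one half, which is why 0.5 is the default decision threshold.</p>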
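<p>3. Deriving and optimizing the logistic regression loss function<br>Assume</p>
<script type="math/tex; mode=display">
P(y=1|x,\theta)=h_\theta(x),\qquad P(y=0|x,\theta)=1-h_\theta(x)</script><p>Then</p>
<script type="math/tex; mode=display">
P(y|x,\theta)=h_\theta(x)^{y}(1-h_\theta(x))^{1-y}</script><p>from which the likelihood function follows directly:</p>
<script type="math/tex; mode=display">
L(\theta)=\prod_{i=1}^{m}(h_\theta(x^{(i)}))^{y^{(i)}}(1-h_\theta(x^{(i)}))^{1-y^{(i)}}</script><p>Taking the negative logarithm gives:</p>
<script type="math/tex; mode=display">
J(\theta)=-\ln L(\theta)=-\sum_{i=1}^{m}\left(y^{(i)}\ln(h_\theta(x^{(i)}))+(1-y^{(i)})\ln(1-h_\theta(x^{(i)}))\right)</script>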
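<p>The loss J(θ) can be minimized with plain batch gradient descent, whose gradient is the sum over samples of (h_θ(x) − y)·x. The sketch below is one possible illustration, not the optimizer any particular library uses; the learning rate, step count, helper names, and toy data (first column is a constant 1 acting as the bias) are all assumptions.</p>

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def h(theta, x):
    """Model output h_theta(x) = sigmoid(theta^T x)."""
    return sigmoid(sum(t * xj for t, xj in zip(theta, x)))


def loss(theta, X, y, eps=1e-12):
    """Negative log-likelihood J(theta); eps keeps the logarithm finite."""
    return -sum(
        yi * math.log(h(theta, xi) + eps) + (1 - yi) * math.log(1 - h(theta, xi) + eps)
        for xi, yi in zip(X, y)
    )


def gradient(theta, X, y):
    """Gradient of J: sum over samples of (h_theta(x) - y) * x."""
    grad = [0.0] * len(theta)
    for xi, yi in zip(X, y):
        error = h(theta, xi) - yi
        for j, xij in enumerate(xi):
            grad[j] += error * xij
    return grad


def fit(X, y, lr=0.1, steps=500):
    """Plain batch gradient descent starting from theta = 0."""
    theta = [0.0] * len(X[0])
    for _ in range(steps):
        theta = [t - lr * g for t, g in zip(theta, gradient(theta, X, y))]
    return theta


X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [0, 0, 1, 1]
theta = fit(X, y)
```

<p>After training, the loss is lower than at the starting point and the fitted model assigns a low probability to the smallest input and a high one to the largest.</p>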
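<p>4. Regularization and model evaluation metrics<br>Logistic regression also has to deal with overfitting, and regularization is a good way to do so.<br>The L1-regularized loss adds the L1 norm as a penalty to the ordinary logistic regression loss; the hyperparameter α is the penalty coefficient and controls the size of the penalty term.</p>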
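<p>The L1-regularized loss for binary logistic regression is:</p>
<script type="math/tex; mode=display">
J(\theta)=-Y\odot\ln h_\theta(X)-(E-Y)\odot\ln(E-h_\theta(X))+\alpha||\theta||_1</script><p>where $||\theta||_1$ is the L1 norm of $\theta$. Here $\odot$ is read as the inner product (the sum of element-wise products), so that $J(\theta)$ is a scalar, and $E$ is the all-ones vector.</p>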
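<p>The L2-regularized loss for binary logistic regression is:</p>
<script type="math/tex; mode=display">
J(\theta)=-Y\odot\ln h_\theta(X)-(E-Y)\odot\ln(E-h_\theta(X))+\frac{1}{2}\alpha||\theta||_2^2</script><p>where $||\theta||_2$ is the L2 norm of $\theta$.</p>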
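<p>The two penalized losses differ only in the last term, which a short sketch makes concrete. The function name <code>penalised_loss</code> and the sample values are hypothetical, and for simplicity every component of θ is penalized, as in the formulas above.</p>

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def penalised_loss(theta, X, y, alpha, penalty="l2"):
    """Logistic loss plus alpha * ||theta||_1 or 0.5 * alpha * ||theta||_2^2."""
    data_term = 0.0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(t * xj for t, xj in zip(theta, xi)))
        data_term -= yi * math.log(p) + (1 - yi) * math.log(1 - p)
    if penalty == "l1":
        return data_term + alpha * sum(abs(t) for t in theta)
    if penalty == "l2":
        return data_term + 0.5 * alpha * sum(t * t for t in theta)
    raise ValueError("penalty must be 'l1' or 'l2'")


# One sample at the origin: the data term is ln 2 whatever theta is,
# so the difference between the two results comes from the penalty alone.
theta = [1.0, -2.0]
X = [[0.0, 0.0]]
y = [1]

l1_value = penalised_loss(theta, X, y, alpha=0.1, penalty="l1")
l2_value = penalised_loss(theta, X, y, alpha=0.1, penalty="l2")
```

<p>With these values the L1 penalty contributes 0.1 × 3 and the L2 penalty contributes 0.5 × 0.1 × 5 on top of the same data term.</p>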
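<p>5. Strengths and weaknesses of logistic regression<br>Strengths: it outputs probabilities and is easy to interpret.<br>Weaknesses: it tends to underfit, and nonlinear features need further transformation.</p>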
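<p>6. Handling imbalanced samples<br>Class imbalance refers to the problems caused when the numbers of positive and negative examples differ too much.<br>A basic strategy for class imbalance is rescaling.<br>There are three main approaches:</p>
<ol>
<li>Undersample the negative class in the training set by removing some negative examples.</li>
<li>Oversample the positive examples in the training set.</li>
<li>Train on the data as is, but change the decision threshold at prediction time.</li>
</ol>
<p>In class-imbalanced learning, misclassifying the smaller class usually carries the higher cost.</p>
<p>7. sklearn parameters<br>Logistic regression is located at:<br>from sklearn.linear_model import LogisticRegression<br>The main parameters are C, penalty, tol, and solver.<br>C: the inverse of the regularization strength; the default is 1.<br>penalty: specifies which regularization to use.<br>tol: the tolerance used as the stopping criterion.<br>solver: selects the optimization algorithm.</p>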
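<p>The parameters listed above, together with two of the imbalance strategies, can be tried out as follows. This is a sketch under assumptions: the synthetic dataset from <code>make_classification</code>, the 90/10 class split, and the 0.3 threshold are arbitrary choices for illustration, and recent scikit-learn releases may emit a deprecation warning for <code>penalty</code>.</p>

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, deliberately imbalanced data: about 90% negatives and 10% positives
X, y = make_classification(n_samples=500, n_features=5, weights=[0.9, 0.1], random_state=0)

clf = LogisticRegression(
    C=1.0,                    # inverse of the regularization strength
    penalty="l2",             # which norm is used as the penalty
    tol=1e-4,                 # stopping tolerance of the solver
    solver="lbfgs",           # optimization algorithm
    class_weight="balanced",  # reweight classes inversely to their frequency
)
clf.fit(X, y)

# Threshold moving: predict positive when P(y=1|x) >= 0.3 instead of the default 0.5
proba = clf.predict_proba(X)[:, 1]
custom = (proba >= 0.3).astype(int)
default = clf.predict(X)

print(default.sum(), custom.sum())
```

<p>Lowering the threshold can only add positive predictions, which trades precision for recall on the minority class.</p>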
<hr>
<p>Permalink: <a href="https://blog.littleji.com/2018/12/22/20181222MLReview2/">https://blog.littleji.com/2018/12/22/20181222MLReview2/</a></p>
</div>
<footer>