TopicDisc is an unsupervised framework for jointly modeling topic content and discourse behavior in microblog conversations.
For more details, see:
Jichuan Zeng, Jing Li, Yulan He, Cuiyun Gao, Michael Lyu, Irwin King. What You Say and How You Say it: Joint Modeling of Topics and Discourse in Microblog Conversations. TACL 2019.
The TREC-11 and TWT-16 datasets are in data/twitter-conv/.
We removed the text of each tweet in accordance with Twitter's policy. You need to download the tweets for the given tweet IDs through the Twitter API. Put the text of every tweet in the text_lst field of twitter_m.json, so that each entry matches the corresponding tweet ID in meta_lst.
For example:
"meta_lst": [
{
"id": "816147858076241920"
},
{
"id": "816175837737123841"
},
{
"id": "816190727323459584"
},
{
"id": "816192396945924096"
}
],
"text_lst": [
"<hash> When you realize TransCult thinks Radical Feminism is more \"degenerate\" than the KKK & NeoNazis.\u2026 <url>",
"<men> Kind of interesting trans cult is using KKK/Nazi rhetoric now -- worried that their bullying is only getting them so far? \ud83d\ude02",
"<men> <men> LOL. That's so funny. This must really tap into some primitive fear. Witches zooming off to cavort with satan\ud83d\ude02",
"<men> <men> <cackles> \ud83d\ude02"
]
After downloading all the tweets, save the data file as 'twitter.json'.
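The fill-in step described above can be sketched as follows. This is a minimal illustration, not part of the repository: the helper name fill_text_lst, the id_to_text mapping (tweet ID to downloaded text), and the empty-string placeholder for tweets that could not be retrieved are all assumptions. It handles a single conversation record with the meta_lst and text_lst fields shown in the example. It does not call the Twitter API or assume anything about the overall layout of the JSON file.

```python
import json


def fill_text_lst(conv, id_to_text, missing=""):
    """Return a copy of one conversation record with text_lst aligned to meta_lst.

    conv       -- dict with a "meta_lst" list of {"id": ...} entries (as in the example)
    id_to_text -- hypothetical dict mapping tweet ID -> downloaded tweet text
    missing    -- placeholder for tweets that could not be downloaded (an assumption)
    """
    filled = dict(conv)  # shallow copy so the input record is left unchanged
    filled["text_lst"] = [id_to_text.get(meta["id"], missing) for meta in conv["meta_lst"]]
    return filled


# Toy record using two IDs from the example; the texts are made up.
conv = {
    "meta_lst": [{"id": "816147858076241920"}, {"id": "816175837737123841"}],
    "text_lst": [],
}
id_to_text = {"816147858076241920": "first tweet text"}

filled = fill_text_lst(conv, id_to_text)
print(json.dumps(filled, indent=2))
```

Keeping text_lst the same length and order as meta_lst preserves the position of each turn in the conversation. How unavailable tweets should be handled (placeholder or dropping the conversation) is left open here.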
You can run the main code as:
$ python topic_disc.py
You can visualize the topic and discourse distributions over words by enabling --output_vis True.
$ python topic_disc.py --output_vis True
$ cd vis/
$ python vis_atten.py  # put gen_sample.txt in this folder first
The visualization looks like this: