ldatopicnumber

Hi Vikas --

the optimum number of topics (K in LDA) depends on at least two factors.
Firstly, your data set may have an intrinsic number of topics, i.e., may derive
from some natural clusters that your data have. In the best case, this number
will make your perplexity (ppx) minimal. A non-parametric approach like HDP
would ideally result in the same K as the one that minimises ppx for LDA. The
second type of influence is that of the hyperparameters. If you fix the
Dirichlet parameters alpha and beta (for LDA's Dirichlet-multinomial "levels"
(theta | alpha) and (phi | beta)), you bias the optimum K. For instance, a
smaller alpha will force more "decisive" choices of z for each token, leading
to a concentration of theta on fewer weights, which influences K.
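
The link between alpha and the concentration of theta can be checked by sampling. The following is a minimal sketch, not part of any LDA toolkit: it draws theta from a symmetric Dirichlet for a few values of alpha and reports the average size of the largest weight, so the direction of the effect can be read off directly. The function names, the dimension k=20 and the alpha values are assumptions chosen for illustration.

```python
import random


def sample_symmetric_dirichlet(alpha, k, rng):
    """Draw one theta from a symmetric Dirichlet(alpha, ..., alpha) of dimension k."""
    # Independent Gamma(alpha, 1) draws, normalised to sum to 1, are Dirichlet distributed.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_largest_weight(alpha, k=20, n_samples=2000, seed=0):
    """Average of max(theta) over many draws; values near 1 mean theta sits on one topic."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += max(sample_symmetric_dirichlet(alpha, k, rng))
    return total / n_samples


if __name__ == "__main__":
    # Illustrative alpha values only; they are not taken from the discussion above.
    for alpha in (0.1, 1.0, 10.0):
        print(f"alpha={alpha}: mean largest weight = {mean_largest_weight(alpha):.3f}")
```

A mean largest weight close to 1/k indicates a nearly uniform theta, while a value close to 1 indicates that most of the mass sits on a single topic.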

Trouble minimizing perplexity in LDA

I am running LDA from Mark Steyvers' MATLAB Topic Modelling toolkit on a few Apache Java open source projects. I have taken care of stop word removal (e.g., words such as "Apache" and Java keywords are marked as stop words) and tokenization. I find that perplexity on test data always decreases as the number of topics increases. I tried different values of ALPHA, but it made no difference.

I need to find the optimal number of topics, and for that the perplexity plot should reach a minimum. Please suggest what may be wrong.

The definition of perplexity for a topic model, and the details of how it is calculated, are explained in this post.
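
Since the linked post is not reproduced here, the following is a minimal sketch of the usual held-out perplexity: the exponential of the negative average log-likelihood per token. It assumes that the document-topic weights `theta` for the test documents and the topic-word weights `phi` have already been estimated; how the toolkit estimates them, and whether it computes perplexity in exactly this way, is not stated in the question. All names are hypothetical.

```python
import math


def perplexity(docs, theta, phi):
    """Held-out perplexity: exp(-sum of log p(w) / number of tokens).

    docs  : list of documents, each a list of integer word ids
    theta : theta[d][k], weight of topic k in document d (assumed already estimated)
    phi   : phi[k][w], probability of word w under topic k
    """
    log_likelihood = 0.0
    n_tokens = 0
    num_topics = len(phi)
    for d, tokens in enumerate(docs):
        for w in tokens:
            # Marginalise over topics: p(w | d) = sum_k theta[d][k] * phi[k][w]
            p = sum(theta[d][k] * phi[k][w] for k in range(num_topics))
            # math.log raises ValueError if p is 0, i.e. if phi is not smoothed.
            log_likelihood += math.log(p)
            n_tokens += 1
    if n_tokens == 0:
        raise ValueError("no tokens in test documents")
    return math.exp(-log_likelihood / n_tokens)
```

As a sanity check, a model whose topics are all uniform over a vocabulary of V words has perplexity V, whatever theta is.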

Edit: I played with the hyperparameters alpha and beta, and now perplexity seems to reach a minimum. It is not clear to me how these hyperparameters affect perplexity. Initially I was plotting results up to 200 topics without any success. Now, over the same range, the minimum is reached at around 50-60 topics (which was my intuition) after modifying the hyperparameters. Also, as this post notes, you bias the optimal number of topics according to the specific values of the hyperparameters.
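
The difference between a curve that "always decreases" and one that "reaches a minimum" can be made explicit when scanning K. The sketch below assumes that held-out perplexity has already been measured for each candidate K and stored in a dictionary; the function name is hypothetical, and the numbers in the usage lines are made-up placeholders, not results from this question.

```python
def best_num_topics(perplexity_by_k):
    """Return (best_k, is_interior) for a mapping {K: held-out perplexity}.

    is_interior is False when the lowest perplexity lies at the smallest or
    largest K tried, i.e. the curve has not yet turned within the scanned range.
    """
    if not perplexity_by_k:
        raise ValueError("need at least one (K, perplexity) pair")
    ks = sorted(perplexity_by_k)
    best_k = min(ks, key=lambda k: perplexity_by_k[k])
    is_interior = ks[0] < best_k < ks[-1]
    return best_k, is_interior


if __name__ == "__main__":
    # Placeholder values for illustration only.
    measured = {10: 900.0, 50: 700.0, 100: 750.0, 200: 800.0}
    k, interior = best_num_topics(measured)
    print(k, interior)
```

If `is_interior` is False, either the range of K is too narrow or, as the edit above suggests, the fixed hyperparameters are pushing the optimum outside it.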

machine-learning topic-models hyperparameter
asked Sep 14 '12 at 5:22 by abhinavkulkarni, edited Sep 15 '12 at 2:13
Many of us probably don't know what perplexity means and what a perplexity plot shows. I know I don't. Could you enlighten me (us)? – Michael Chernick Sep 14 '12 at 15:54
@MichaelChernick: I edited the post to include a link detailing the perplexity of a topic model. – abhinavkulkarni Sep 14 '12 at 22:27
Thanks for doing that. – Michael Chernick Sep 14 '12 at 22:52
How many topics have you tried so far (on what size corpus)? Maybe you just haven't yet hit the right number of topics? Also, for inferring the number of topics from data you may want to look into the Hierarchical Dirichlet Process (HDP) with code on David Blei's site: cs.princeton.edu/~blei/topicmodeling.html – Nick Sep 14 '12 at 23:22
@Nick: Indeed, HDP, a nonparametric topic modelling algorithm, is an alternative to LDA in which you don't have to tune hyperparameters. However, at this point I would like to stick to LDA and understand how and why the behaviour of perplexity changes so drastically with small adjustments to the hyperparameters. Also, my corpus is quite large. For example, I have tokenized the Apache Lucene source code, with ~1800 Java files and 367K lines of source code. So that's a pretty big corpus, I guess. – abhinavkulkarni Sep 15 '12 at 2:21

1 Answer


You might want to have a look at the implementation of LDA in Mallet, which can do hyperparameter optimization as part of the training. Mallet also uses asymmetric priors by default, which, according to this paper, makes the model much more robust against setting the number of topics too high. In practice this means you don't have to specify the hyperparameters, and can set the number of topics pretty high without negatively affecting results.

In my experience, hyperparameter optimization and asymmetric priors gave significantly better topics than going without them, but I haven't tried the MATLAB Topic Modelling toolkit.
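
For reference, Mallet's hyperparameter optimization is switched on from its command line through the `--optimize-interval` and `--optimize-burn-in` options of `train-topics`. The sketch below only assembles such a command as an argument list and does not run it. The helper name, the `bin/mallet` path, the `corpus.mallet` file name and the numeric values are placeholders, not recommendations from the answer.

```python
import shlex


def mallet_train_command(input_path, num_topics, optimize_interval=10,
                         optimize_burn_in=20, mallet_bin="bin/mallet"):
    """Build the argument list for a Mallet train-topics run with hyperparameter optimization."""
    return [
        mallet_bin, "train-topics",
        "--input", input_path,             # corpus previously imported into Mallet's format
        "--num-topics", str(num_topics),
        # Re-estimate the Dirichlet hyperparameters every N Gibbs iterations ...
        "--optimize-interval", str(optimize_interval),
        # ... starting only after this many burn-in iterations.
        "--optimize-burn-in", str(optimize_burn_in),
    ]


if __name__ == "__main__":
    # Placeholder file name and topic count; pass the list to subprocess.run to execute it.
    print(shlex.join(mallet_train_command("corpus.mallet", 200)))
```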
