Machine Learning Example by Python - Building a Topic Model System (Document Classification)

# Data download URL: https://archive.ics.uci.edu/ml/machine-learning-databases/00228/


## 1. Extracting topics from text messages with LDA

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

spam_header = 'spam\t'
no_spam_header = 'ham\t'
documents = []

with open('F:/1_Study/1_BigData/7_FirstML/smsspamcollection/SMSSpamCollection', 'rt', encoding='UTF8') as file_handle:
    for line in file_handle:
        if line.startswith(spam_header):
            documents.append(line[len(spam_header):])
        elif line.startswith(no_spam_header):
            documents.append(line[len(no_spam_header):])
            
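# LDA works better with raw count features than with term-frequency
# features, so we use CountVectorizer. It also automatically removes
# words that do not help the topic model (stop_words).
vectorizer = CountVectorizer(stop_words='english', max_features=2000)
term_counts = vectorizer.fit_transform(documents)
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() (available since 1.0) replaces it.
vocabulary = vectorizer.get_feature_names_out()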

# Train the topic model.
topic_model = LatentDirichletAllocation(n_components=10)
topic_model.fit(term_counts)

# Print the learned topics one by one (top 10 terms per topic).
topics = topic_model.components_
for topic_id, weights in enumerate(topics):
    print('topic %d' % topic_id, end=': ')
    pairs = []
    for term_id, value in enumerate(weights):
        pairs.append( (abs(value), vocabulary[term_id]) )
    pairs.sort(key=lambda x: x[0], reverse=True)
    for pair in pairs[:10]:
        print(pair[1], end=',')
    print()

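Source: 처음 배우는 머신러닝: 기초부터 모델링, 실전 예제, 문제 해결까지 (roughly, "Machine Learning for Beginners: From the Basics to Modeling, Practical Examples, and Problem Solving")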

Meongtae's IT Blog