[현장에서 바로 써먹는...] Text Mining - Word Cloud

by 버섯도리 2022. 5. 2.

> ### Chapter 8. Text Mining

> ## Chapter 8-1. How can we find the key points in customer reviews? (Word cloud)

> library(KoNLP)

> # Load the Sejong dictionary needed for Korean text processing; on the first run, enter 1 to install it
> useSejongDic()
Backup was just finished!
370957 words dictionary was built.

> txt <- readLines("ch8.txt", encoding = "UTF-8")
> head(txt)
[1] "닭이 너무 맛있어요 최고!! 육질이 살아있음"           "배송도 빠르고 상품도 좋습니다. ^^"                  
[3] "기가막히게 맛있습니다. 사장님 감사합니다."           "닭이 너무 작아요! 양이 작은 편인데도 부족하네요. ><"
[5] "완전 만족합니다. 재구매 각이네요."                   "삼계탕에 넣었는데 양이 기대 이하네요..."            

> # Extract only the nouns from txt and store them in n
> n <- extractNoun(txt)
> head(n)  # check the data
[[1]]
[1] "닭"     "최고"   "육"     "질"     "살아있" "음"    

[[2]]
[1] "배송"   "상품"   "좋습니"

[[3]]
[1] "기가막히게" "사장님"     "감사"      

[[4]]
[1] "닭"   "양"   "편"   "부족"

[[5]]
[1] "완전" "만족" "재구" "매"   "각이"

[[6]]
[1] "삼계탕" "양"     "기대"   "이하"  


> # To make the text easier to edit, unlist n and store the result in c
> c <- unlist(n)
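
To see what `unlist` does here, the following is a minimal sketch that uses a hand-made list in place of the `extractNoun` result. The name `nouns` and its two elements are stand-ins copied from the output above, not produced by KoNLP. The sketch also uses the name `flat` rather than `c`: naming a variable `c` works, because R still resolves `c(...)` calls to the base function, but a different name is easier to read.

```r
# Hypothetical stand-in for the first two elements of extractNoun(txt)
nouns <- list(c("닭", "최고"), c("배송", "상품", "좋습니"))

# unlist() flattens the list of character vectors into a single character vector
flat <- unlist(nouns)

print(flat)
print(length(flat))
```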

> # Clean up the text with gsub
> c2 <- gsub("육","육질", c)  # change "육" to "육질"
> c2 <- gsub("재구","재구매", c2)  # change "재구" to "재구매"
> c2 <- gsub("에서","", c2)  # remove "에서"
> head(c2,30)
 [1] "닭"         "최고"       "육질"       "질"         "살아있"     "음"         "배송"       "상품"       "좋습니"    
[10] "기가막히게" "사장님"     "감사"       "닭"         "양"         "편"         "부족"       "완전"       "만족"      
[19] "재구매"     "매"         "각이"       "삼계탕"     "양"         "기대"       "이하"       "배송"       "아이스"    
[28] "팩"         "고생"       "포장"      
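
One thing to keep in mind with the `gsub` calls above: the pattern matches anywhere inside a token, so a token that already reads "육질" or "재구매" would also be rewritten. The sketch below contrasts that with an anchored pattern that replaces only tokens that are exactly "육". The `tokens` vector is a made-up example, not the review data.

```r
# Made-up tokens for illustration (not the actual review data)
tokens <- c("육", "질", "육질", "재구", "매", "재구매")

# Substring replacement, as in the transcript: longer tokens containing "육" change too
sub_all <- gsub("육", "육질", tokens)

# Anchored pattern: only tokens that are exactly "육" are replaced
exact <- gsub("^육$", "육질", tokens)

print(sub_all)
print(exact)
```

Whether the anchored form is preferable depends on the data; for the sample shown in this post the unanchored calls give the intended result.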

> # Keep only the nouns in c2 that are at least two characters long
> c3 <- Filter(function(x) {nchar(x) >=2}, c2)

> head(c3,30)
 [1] "최고"       "육질"       "살아있"     "배송"       "상품"       "좋습니"     "기가막히게" "사장님"     "감사"      
[10] "부족"       "완전"       "만족"       "재구매"     "각이"       "삼계탕"     "기대"       "이하"       "배송"      
[19] "아이스"     "고생"       "포장"       "냄비"       "냄비"       "아이스"     "박스"       "비닐"       "벗겨지고.."
[28] "구매"       "다행"       "기름제거"  
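
Because `nchar` is vectorized, the same filter can also be written with logical indexing. The sketch below shows that form on a small made-up vector (`words` is illustrative only) and, as an optional extra step that is not part of the original transcript, strips punctuation so that tokens such as "벗겨지고.." are counted without the trailing dots.

```r
# Made-up tokens for illustration
words <- c("닭", "최고", "벗겨지고..", "양", "포장")

# Logical indexing: keep tokens with two or more characters
kept <- words[nchar(words) >= 2]

# Optional cleanup (an addition, not in the original): drop punctuation characters
cleaned <- gsub("[[:punct:]]", "", kept)

print(kept)
print(cleaned)
```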

> # Use table on c3 to count how often each word occurs, and store the result in wordcnt
> wordcnt <- table(c3)
> # Sort in descending order to see which words appear most often
> sort(wordcnt, decreasing = TRUE)
c3
        만족         가격         마리         배송       아이스         감사         구매         냄비         마트 
           6            4            3            3            3            2            2            2            2 
        박스       삼계탕         신선         싱싱         요리         육질       재구매         저렴         가족 
           2            2            2            2            2            2            2            2            1 
        각이     같긴한데       겉모습         고생   기가막히게         기대     기름제거       나머지         남편 
           1            1            1            1            1            1            1            1            1 
        다행     박을게요         밥솥   배송하면서         백숙         번창   벗겨지고..         부족         불구 
           1            1            1            1            1            1            1            1            1 
        비닐       비지떡       사장님       살아있         상품         생각         생닭         소금         실망 
           1            1            1            1            1            1            1            1            1 
        실패         아이       안씻긴         안해       어머니         에어         예정       오랜만         완전 
           1            1            1            1            1            1            1            1            1 
        요청         이하         자체       저한텐         적당         적합       좋습니     쫄깃쫄깃         차이 
           1            1            1            1            1            1            1            1            1 
        처음         최고         추천     퍽퍽해요         포장       프라이   해먹었네요 해먹었습니다         환불 
           1            1            1            1            1            1            1            1            1 
      ㅡㅡ.. 
           1 
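
The full table is long, so it can help to pull out just the most frequent words. The following is a rough sketch on a small made-up vector (`sample_words`, `top_n`, and `top` are hypothetical names) that sorts the counts and keeps the top entries as a data frame.

```r
# Made-up word vector for illustration
sample_words <- c("만족", "가격", "만족", "배송", "가격", "만족")

# Count occurrences and sort in descending order, as in the transcript
counts <- sort(table(sample_words), decreasing = TRUE)

# Keep the top_n most frequent words in a plain data frame
top_n <- 2
top <- data.frame(
  word = names(counts)[seq_len(top_n)],
  freq = as.vector(counts)[seq_len(top_n)],
  stringsAsFactors = FALSE
)

print(top)
```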

> # Load the RColorBrewer library to apply a range of colors
> library(RColorBrewer)

> # View the available palettes
> display.brewer.all()


> # Choose a palette
> Dark2 <- brewer.pal(8, "Dark2")

> # Load the wordcloud package (install it first with install.packages("wordcloud") if it is not installed yet)
> library(wordcloud)

> # Draw the word cloud
> wordcloud(names(wordcnt), freq=wordcnt, scale=c(4, 0.5), 
+           rot.per=0.25, min.freq=1, random.order=FALSE,
+           random.color=TRUE, colors=Dark2)
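
Since `random.color=TRUE` and the word placement both rely on R's random number generator, the picture changes on every run. The sketch below fixes the seed first so the same cloud is drawn each time. The seed value 1234 is arbitrary, and `words` and `freq` hold only the five most frequent entries from the table above. The `requireNamespace` guard is there only so the snippet does nothing when the packages are missing. Korean labels may need a font that supports Hangul on the active graphics device.

```r
# Top five words and counts, taken from the frequency table above
words <- c("만족", "가격", "마리", "배송", "아이스")
freq  <- c(6, 4, 3, 3, 3)

if (requireNamespace("wordcloud", quietly = TRUE) &&
    requireNamespace("RColorBrewer", quietly = TRUE)) {
  # Fix the seed so colors and layout are the same on every run (arbitrary value)
  set.seed(1234)
  wordcloud::wordcloud(words, freq = freq, scale = c(4, 0.5),
                       rot.per = 0.25, min.freq = 1, random.order = FALSE,
                       random.color = TRUE,
                       colors = RColorBrewer::brewer.pal(8, "Dark2"))
}
```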

Source: 현장에서 바로 써먹는 데이터 분석 with R