
[ADP] Sequential Pattern Analysis

버섯도리 2022. 1. 16. 16:27

> # 18. Sequential Pattern Analysis


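> # Sequential pattern discovery measures associations between products while taking purchase order into account, and finds useful association rules.
> # To discover such rules, the data must therefore record when each customer made each purchase.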

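> # We use a dataset from GitHub. Download http://github.com/datameister66/data/blob/master/sequential.csv and save it in the C:\data folder.
> # Cust_Segment - categorical variable giving each customer's segment (shown as chr in the str() output)
> # Purchase1 through Purchase8 - eight separate purchase events
> # Sequential pattern analysis is used here to find out which products customers buy and in what order.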
> df <- read.csv("c:/data/sequential.csv")
> str(df)
'data.frame': 5000 obs. of  9 variables:
 $ Cust_Segment: chr  "Segment1" "Segment1" "Segment1" "Segment1" ...
 $ Purchase1   : chr  "Product_A" "Product_B" "Product_G" "Product_C" ...
 $ Purchase2   : chr  "Product_A" "" "Product_B" "" ...
 $ Purchase3   : chr  "" "" "Product_B" "" ...
 $ Purchase4   : chr  "" "" "Product_C" "" ...
 $ Purchase5   : chr  "" "" "Product_B" "" ...
 $ Purchase6   : chr  "" "" "Product_B" "" ...
 $ Purchase7   : chr  "" "" "Product_B" "" ...
 $ Purchase8   : chr  "" "" "Product_G" "" ...

> table(df$Cust_Segment)

Segment1 Segment2 Segment3 Segment4 
    2900      572      554      974 
> table(df$Purchase1)

Product_A Product_B Product_C Product_D Product_E Product_F Product_G 
     1451       765       659      1060       364       372       329 
> table(unlist(df[,-1]))

          Product_A Product_B Product_C Product_D Product_E Product_F Product_G 
    22390      3855      3193      3564      3122      1688      1273       915 

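> # Prints the number of purchases of each product across all purchase columns.

> # Product_A is purchased most often. 22390 is the number of empty strings (""), i.e. slots with no purchase, not NULL values.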
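
Because the empty strings dominate the table, it can help to drop them before counting. The following is a minimal base R sketch on a small made-up data frame (`toy` and its values are hypothetical and stand in for `df[,-1]`):

```r
# Toy data standing in for df[,-1]; the values are made up for illustration.
toy <- data.frame(
  Purchase1 = c("Product_A", "Product_B", "Product_A"),
  Purchase2 = c("Product_A", "", "Product_C"),
  Purchase3 = c("", "", "Product_A"),
  stringsAsFactors = FALSE
)

purchases <- unlist(toy)                   # flatten all purchase columns into one vector
purchases <- purchases[purchases != ""]    # drop empty strings (no purchase made)
product_counts <- sort(table(purchases), decreasing = TRUE)
print(product_counts)
```

On the real data the same three lines could be applied to `df[,-1]`, assuming the blank purchases are stored as `""` as in the str() output.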

> # Use count() and arrange() from the dplyr package to check the frequency of each (first purchase, second purchase) pair.
> library(dplyr)

> dfCount <- count(df, Purchase1, Purchase2)
> dfCount <- arrange(dfCount, desc(n))

> dim(dfCount)
[1] 56  3
> head(dfCount)
  Purchase1 Purchase2   n
1 Product_A Product_A 548
2 Product_D           548
3 Product_B           346
4 Product_C Product_C 345
5 Product_B Product_B 291
6 Product_D Product_D 281

> # The most common patterns, tied at 548 each, are buying Product_A twice in a row and buying Product_D once with no second purchase.
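
For readers without dplyr, the same pair counting can be sketched in base R. This is only an illustration on made-up data (`toy` and `pair_counts` are hypothetical names), not the method used in the post:

```r
# Toy first/second purchases; the values are made up for illustration.
toy <- data.frame(
  Purchase1 = c("Product_A", "Product_A", "Product_D", "Product_D", "Product_B"),
  Purchase2 = c("Product_A", "Product_A", "", "", "Product_B"),
  stringsAsFactors = FALSE
)

# Cross-tabulate the two columns, then reshape to one row per pair.
pair_counts <- as.data.frame(table(toy$Purchase1, toy$Purchase2),
                             stringsAsFactors = FALSE)
names(pair_counts) <- c("Purchase1", "Purchase2", "n")

pair_counts <- pair_counts[pair_counts$n > 0, ]         # keep only pairs that occur
pair_counts <- pair_counts[order(-pair_counts$n), ]     # most frequent pairs first
print(pair_counts)
```

An empty `Purchase2` means the customer stopped after the first purchase, which is how the Product_D row in the output above should be read.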


> # Create the sequence data with seqdef() from the TraMineR package.
> library(TraMineR)

> seq <- seqdef(df[,-1], xtstep = 1) # xtstep: spacing between tick marks on the time axis in plots
 [>] 8 distinct states appear in the data: 
     1 = 
     2 = Product_A
     3 = Product_B
     4 = Product_C
     5 = Product_D
     6 = Product_E
     7 = Product_F
     8 = Product_G
 [>] state coding:
       [alphabet]  [label]   [long label] 
     1                        
     2  Product_A   Product_A Product_A
     3  Product_B   Product_B Product_B
     4  Product_C   Product_C Product_C
     5  Product_D   Product_D Product_D
     6  Product_E   Product_E Product_E
     7  Product_F   Product_F Product_F
     8  Product_G   Product_G Product_G
 [>] 5000 sequences in the data set
 [>] min/max sequence length: 8/8
> seqdplot(seq)


> seqE <- seqecreate(seq)
> subSeq <- seqefsub(seqE, pmin.support = 0.05)
> # Create an event-sequence object and extract the subsequences with at least 5% support.
> plot(subSeq[1:10], col='dodgerblue')
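
To make the idea of a minimum support threshold concrete, here is a small base R sketch of support for an ordered pattern: the share of sequences in which the items appear in the given order. The data and the helper `contains_in_order()` are hypothetical, and this shows the general notion of support only; it does not reproduce how TraMineR converts state sequences into events.

```r
# Toy purchase sequences, one vector per customer (made up for illustration).
toy <- list(
  c("A", "B", "C"),
  c("A", "C"),
  c("B", "A", "B"),
  c("C", "C")
)

# TRUE if the items in `pattern` occur in `s` in the same order (gaps allowed).
contains_in_order <- function(s, pattern) {
  pos <- 0
  for (item in pattern) {
    hits <- which(s == item)
    hits <- hits[hits > pos]          # only matches after the previous item
    if (length(hits) == 0) return(FALSE)
    pos <- hits[1]
  }
  TRUE
}

# Support = fraction of sequences that contain "A" followed later by "B".
support <- mean(vapply(toy, contains_in_order, logical(1), pattern = c("A", "B")))
print(support)
```

With a 5% threshold, a pattern like this would be kept only if at least 5% of all sequences contain it.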


> seqMat <- seqtrate(seq) # computing transition probabilities
 [>] computing transition probabilities for states /Product_A/Product_B/Product_C/Product_D/Product_E/Product_F/Product_G ...
> options(digits = 2)
> seqMat[2:4,1:3]
               [-> ] [-> Product_A] [-> Product_B]
[Product_A ->]  0.19          0.417          0.166
[Product_B ->]  0.26          0.113          0.475
[Product_C ->]  0.19          0.058          0.041
> # The matrix shows that after buying Product_A, the probability of buying Product_B next is 16.6% and the probability of buying nothing is 19%, while the probability of buying Product_A again is 41.7%.
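
The transition probabilities can be understood as row-normalized counts of "state at step t, state at step t + 1" pairs. The base R sketch below computes them by hand on made-up data (`toy`, `counts`, and `trans` are hypothetical names); it is meant to illustrate the calculation and is assumed, not verified here, to match what seqtrate() does with its default settings.

```r
# Toy sequences: one row per customer, one column per purchase step (made up).
toy <- data.frame(
  Purchase1 = c("A", "A", "B", "A"),
  Purchase2 = c("A", "B", "B", "A"),
  Purchase3 = c("B", "B", "A", "A"),
  stringsAsFactors = FALSE
)

states <- sort(unique(unlist(toy)))
from <- unlist(toy[, -ncol(toy)])   # state at step t (all columns but the last)
to   <- unlist(toy[, -1])           # state at step t + 1 (all columns but the first)

# Count each from -> to pair, then divide every row by its total.
counts <- table(factor(from, levels = states), factor(to, levels = states))
trans  <- prop.table(counts, margin = 1)   # each row sums to 1
print(round(trans, 2))
```

Each row is read the same way as the matrix above: the row label is the current purchase and the columns give the probabilities of the next one.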

Source: 2020 데이터 분석 전문가 ADP 필기 한 권으로 끝내기