[Learning Through Practical Projects...] Classification Models

> ### 3.9 Classification models for used-car grade analysis
>
> #### 3.9.1 Logistic regression
>
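> # Label cars priced above the third quartile (the top 25%) as 1 and all others as 0.
> Q3 = quantile(Audi$price, probs = c(0.75))
> Audi$price_G = ifelse(Audi$price > Q3, 1, 0)
> # SL (the training-row index) is not defined in this section; it is assumed to carry over from an earlier one.
> Sample = Audi[SL,]
> Test = Audi[-SL,]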
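The index vector `SL` is not shown in this section. The sample sizes reported later (7,467 training rows in the `glm()` output, 3,201 test rows in the confusion matrices) are consistent with a 70/30 random split, so the sketch below assumes one. The data frame `cars` is a simulated, hypothetical stand-in for `Audi`, not the real data set.

```r
# Hypothetical stand-in for the Audi data: simulated values only.
set.seed(123)
n <- 10668
cars <- data.frame(
  mileage = round(runif(n, 0, 150000)),
  price   = round(rlnorm(n, meanlog = 10, sdlog = 0.4))
)

# Label the top 25% of prices as the group of interest (1), the rest as 0.
Q3 <- quantile(cars$price, probs = 0.75)
cars$price_G <- ifelse(cars$price > Q3, 1, 0)

# Assumed 70/30 split: SL holds the row numbers of the training sample.
SL <- sample(seq_len(n), size = floor(0.7 * n), replace = FALSE)
Sample <- cars[SL, ]
Test   <- cars[-SL, ]

c(train = nrow(Sample), test = nrow(Test))
mean(cars$price_G)  # share of rows labelled 1, close to 0.25 by construction
```

Because the label is defined by the third quartile, roughly a quarter of the rows fall in class 1, which is why the prevalence in the confusion matrices further down is about 0.25.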
>
> # The idea behind logistic regression
>
> ggplot(Sample) +
+   geom_point(aes(x = mileage, y = price_G, col = as.factor(price_G))) +
+   geom_abline(mapping=aes(slope = 1/150000,intercept = 1),
+               linetype = "dashed", size = 1.2) +
+   scale_x_reverse(limits = c(150000,0)) +
+   guides(col = FALSE) +
+   theme_bw()
Warning messages:
1: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
2: Removed 1 rows containing missing values (geom_point).

> # Because price_G is a binary (discrete) variable, the points in the scatter plot line up on two horizontal lines, y = 0 and y = 1.
>
> # Logistic regression with glm()
>
> GLM = glm(price_G ~ mileage, data = Sample,
+           family = binomial(link = "logit"))
> Predicted = GLM$fitted.values
> Sample$GLM_Predicted = Predicted
>
> ggplot(Sample) +
+   geom_point(aes(x = mileage, y = price_G, col = as.factor(price_G))) +
+   geom_line(aes(x = mileage, y = GLM_Predicted), size = 1.2, linetype = 'dashed') +
+   geom_hline(yintercept = 0.5, linetype = 'dashed') +
+   scale_x_reverse(limits = c(150000,0)) +
+   guides(col = FALSE) +
+   theme_bw()
Warning messages:
1: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
2: Removed 1 rows containing missing values (geom_point).
3: Removed 1 row(s) containing missing values (geom_path).

>
> summary(GLM)

Call:
glm(formula = price_G ~ mileage, family = binomial(link = "logit"),
    data = Sample)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3389  -0.7433  -0.3330  -0.0038   3.6674

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.723e-01  4.445e-02   8.376   <2e-16 ***
mileage     -9.119e-05  2.935e-06 -31.071   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8383.0  on 7466  degrees of freedom
Residual deviance: 6500.4  on 7465  degrees of freedom
AIC: 6504.4

Number of Fisher Scoring iterations: 6
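To read the coefficient table, it can help to push the two estimates back through the inverse logit. The sketch below only uses the numbers printed by `summary(GLM)` above; the mileage values in the grid are arbitrary illustration points.

```r
# Coefficients copied from the summary(GLM) output above.
b0 <- 3.723e-01   # intercept (log-odds at mileage = 0)
b1 <- -9.119e-05  # change in log-odds per unit of mileage

# Predicted probability of price_G = 1 at a few illustrative mileage values.
mileage  <- c(0, 10000, 50000, 100000)
log_odds <- b0 + b1 * mileage
prob     <- plogis(log_odds)   # inverse logit: 1 / (1 + exp(-log_odds))
round(data.frame(mileage, log_odds, prob), 4)

# Odds ratio for an extra 10,000 units of mileage.
exp(b1 * 10000)

# Mileage at which the predicted probability crosses 0.5 (log-odds = 0).
-b0 / b1
```

With these estimates the odds of being in the high-price group shrink by roughly 60% for every additional 10,000 units of mileage, and the predicted probability exceeds 0.5 only below a mileage of about 4,083. That narrow range is worth keeping in mind when a 0.5 cutoff is applied to this one-variable model.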

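> # Deviance: a measure of how far the fitted model is from the saturated model (the model that fits every observation perfectly).
> # The smaller the deviance residuals of the fitted logistic curve, the better the fit.
> # Null deviance is the gap between the intercept-only model and the saturated model.
> # Residual deviance is the gap between the model that includes mileage and the saturated model.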
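The deviance definitions above can be checked by hand. For a 0/1 response the saturated model has log-likelihood 0, so the deviance reduces to minus twice the Bernoulli log-likelihood of the fitted probabilities. The sketch below uses simulated data (the variables `x` and `y` are hypothetical, not the Audi data) and compares the hand computation with what `glm()` reports.

```r
# Simulated binary data; not the Audi data set.
set.seed(42)
n <- 1000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.8 * x))

fit  <- glm(y ~ x, family = binomial(link = "logit"))  # model with a predictor
null <- glm(y ~ 1, family = binomial(link = "logit"))  # intercept-only model

# Bernoulli log-likelihood of a vector of fitted probabilities.
loglik <- function(y, p) sum(y * log(p) + (1 - y) * log(1 - p))

resid_dev <- -2 * loglik(y, fitted(fit))
null_dev  <- -2 * loglik(y, fitted(null))

c(by_hand = resid_dev, glm = deviance(fit))
c(by_hand = null_dev,  glm = fit$null.deviance)

# The residual deviance is also the sum of squared deviance residuals.
sum(residuals(fit, type = "deviance")^2)
```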
>
> ggplot(Sample) +
+   geom_point(aes(x = mileage,
+                  y = predict(GLM, newdata = Sample)),
+              size = 1.2) +
+   scale_x_reverse(limits = c(150000,0)) +
+   ylab("log(p(x)/(1-p(x)))") +
+   guides(col = FALSE) +
+   theme_bw()
Warning messages:
1: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> = "none")` instead.
2: Removed 1 rows containing missing values (geom_point).

> # The y-axis is the log-odds of belonging to the group of interest (price_G = 1); predict() returns this scale by default.
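The relationship between the two scales returned by `predict()` can be verified directly. This is a minimal sketch on simulated data (`x`, `y`, and `new` are hypothetical names), showing that the default `type = "link"` output is the logit of the `type = "response"` output.

```r
# Simulated data; not the Audi data set.
set.seed(7)
x <- runif(200, 0, 10)
y <- rbinom(200, 1, plogis(2 - 0.5 * x))
fit <- glm(y ~ x, family = binomial(link = "logit"))

new <- data.frame(x = c(1, 4, 8))
log_odds <- predict(fit, newdata = new)                     # default: type = "link"
prob     <- predict(fit, newdata = new, type = "response")  # probabilities in (0, 1)

# The two scales are linked by the logit and its inverse.
all.equal(log(prob / (1 - prob)), log_odds)
all.equal(plogis(log_odds), prob)
```

On the link scale the fitted values are a straight line in the predictor, which is what the plot above shows.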
>
> # anova() gives a more detailed breakdown of the deviance.
> anova(GLM, test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: price_G

Terms added sequentially (first to last)

        Df Deviance Resid. Df Resid. Dev  Pr(>Chi)
NULL                     7466     8383.0
mileage  1   1882.6      7465     6500.4 < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # Adding mileage lowers the deviance by 1882.6 relative to the intercept-only (NULL) model (8383.0 - 6500.4 = 1882.6).
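The `Pr(>Chi)` column comes from comparing the drop in deviance with a chi-squared distribution whose degrees of freedom equal the number of added parameters. A short check using only the figures printed in the table above:

```r
# Figures copied from the anova() table above.
null_dev  <- 8383.0   # intercept-only model, 7466 residual df
resid_dev <- 6500.4   # model with mileage, 7465 residual df

drop_in_dev <- null_dev - resid_dev   # the "Deviance" column
df          <- 7466 - 7465            # one parameter added

# Upper-tail chi-squared probability, as used by anova(..., test = "Chisq").
p_value <- pchisq(drop_in_dev, df = df, lower.tail = FALSE)

c(drop = drop_in_dev, df = df)
p_value < 2.2e-16
```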
>

> #### 3.9.2 Evaluating a logistic regression model
>
> library(caret)
>
> log_odds = predict(GLM, newdata = Test)
> Predicted = predict(GLM, newdata = Test, type = 'response')
>
> Predicted_C = ifelse(Predicted > 0.5, 1, 0)
> confusionMatrix(factor(Predicted_C, levels = c(1,0)),
+                 factor(Test$price_G, levels = c(1,0)))
Confusion Matrix and Statistics

          Reference
Prediction    1    0
         1  294  240
         0  496 2171

               Accuracy : 0.7701
                 95% CI : (0.7551, 0.7846)
    No Information Rate : 0.7532
    P-Value [Acc > NIR] : 0.01366

                  Kappa : 0.3059

Mcnemar's Test P-Value : < 2e-16

            Sensitivity : 0.37215
            Specificity : 0.90046
         Pos Pred Value : 0.55056
         Neg Pred Value : 0.81402
             Prevalence : 0.24680
         Detection Rate : 0.09185
   Detection Prevalence : 0.16682
      Balanced Accuracy : 0.63630

       'Positive' Class : 1

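> # A sensitivity of 37.22% is too low for a usable classifier: only 294 of the 790 high-price cars are detected.
> # The classification above used a cutoff of 0.5, but a better cutoff may exist; the ROC curve is the tool used to look for it.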
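The statistics in the `confusionMatrix()` output follow directly from the four cell counts. The sketch below recomputes them from the table printed above, with class 1 as the positive class; the variable names are chosen here for illustration.

```r
# Cell counts copied from the confusion matrix above (positive class = 1).
TP <- 294    # predicted 1, actual 1
FP <- 240    # predicted 1, actual 0
FN <- 496    # predicted 0, actual 1
TN <- 2171   # predicted 0, actual 0
N  <- TP + FP + FN + TN

stats <- c(
  sensitivity          = TP / (TP + FN),   # share of actual 1s that are detected
  specificity          = TN / (TN + FP),   # share of actual 0s that are detected
  pos_pred_value       = TP / (TP + FP),
  neg_pred_value       = TN / (TN + FN),
  prevalence           = (TP + FN) / N,
  detection_rate       = TP / N,
  detection_prevalence = (TP + FP) / N,
  no_information_rate  = max(TP + FN, TN + FP) / N   # accuracy of always guessing the larger class
)
round(stats, 5)
```

The no-information rate is a useful yardstick here: always predicting class 0 would already be right about 75% of the time, so an accuracy of 0.7701 is only a small improvement.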
>
> install.packages("Epi")
> library(Epi)
> ROC(form = price_G ~ mileage, data = Test, plot="ROC")
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred

> # In general, the larger the AUC (area under the ROC curve), the better the model separates the two classes.
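What an ROC analysis computes can be sketched in base R without the `Epi` package. The example below uses simulated data (not the Audi data) and picks the cutoff that maximizes sensitivity + specificity, a common criterion known as Youden's J. Whether the cutoff of 0.263 used in the next step was chosen by this criterion is not stated in the source, so treat that link as an assumption.

```r
# Simulated data and a fitted logistic model; not the Audi data set.
set.seed(2022)
n <- 2000
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1.2 + 1.5 * x))
p <- fitted(glm(y ~ x, family = binomial(link = "logit")))

# Sensitivity and specificity at every candidate cutoff (predict 1 when p >= cutoff).
cutoffs <- sort(unique(p))
sens <- sapply(cutoffs, function(k) mean(p[y == 1] >= k))
spec <- sapply(cutoffs, function(k) mean(p[y == 0] <  k))

# AUC via the rank-sum (Mann-Whitney) identity: the probability that a randomly
# chosen positive case gets a higher score than a randomly chosen negative case.
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(rank(p)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)

# Cutoff that maximizes Youden's J = sensitivity + specificity - 1.
J <- sens + spec - 1
best <- which.max(J)
round(c(auc = auc, cutoff = cutoffs[best],
        sensitivity = sens[best], specificity = spec[best]), 3)
```

Plotting `sens` against `1 - spec` gives the ROC curve itself; the chosen cutoff is the point on that curve farthest above the diagonal.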

>
> Predicted_C = ifelse(Predicted < 0.263, 0, 1)
> confusionMatrix(factor(Predicted_C, levels = c(1,0)),
+                 factor(Test$price_G, levels = c(1,0)))
Confusion Matrix and Statistics

          Reference
Prediction    1    0
         1  677  791
         0  113 1620

               Accuracy : 0.7176
                 95% CI : (0.7016, 0.7331)
    No Information Rate : 0.7532
    P-Value [Acc > NIR] : 1

                  Kappa : 0.4105

Mcnemar's Test P-Value : <2e-16

            Sensitivity : 0.8570
            Specificity : 0.6719
         Pos Pred Value : 0.4612
         Neg Pred Value : 0.9348
             Prevalence : 0.2468
         Detection Rate : 0.2115
   Detection Prevalence : 0.4586
      Balanced Accuracy : 0.7644

       'Positive' Class : 1
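Lowering the cutoff trades specificity and raw accuracy for sensitivity. The sketch below puts the two confusion matrices above side by side and recomputes accuracy, balanced accuracy, and Cohen's kappa from the cell counts; `metrics()` is a helper defined here for illustration, not a caret function.

```r
# Helper (hypothetical name): summary measures from the four cells of a 2x2 table.
metrics <- function(TP, FP, FN, TN) {
  N    <- TP + FP + FN + TN
  po   <- (TP + TN) / N   # observed agreement, i.e. accuracy
  pe   <- ((TP + FP) * (TP + FN) + (FN + TN) * (FP + TN)) / N^2  # agreement expected by chance
  sens <- TP / (TP + FN)
  spec <- TN / (TN + FP)
  c(accuracy = po,
    sensitivity = sens,
    specificity = spec,
    balanced_accuracy = (sens + spec) / 2,
    kappa = (po - pe) / (1 - pe))
}

# Cell counts copied from the two confusionMatrix() outputs above.
round(rbind(
  cutoff_0.5   = metrics(TP = 294, FP = 240, FN = 496, TN = 2171),
  cutoff_0.263 = metrics(TP = 677, FP = 791, FN = 113, TN = 1620)
), 4)
```

Accuracy drops from 0.7701 to 0.7176, below the no-information rate of 0.7532, while sensitivity rises from 0.3722 to 0.8570 and kappa from 0.3059 to 0.4105. Which cutoff is preferable depends on how costly it is to miss a high-price car compared with flagging a low-price one.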

>
>
> #### 3.9.3 Multiple logistic regression
>
> GLM2 = glm(price_G ~ mileage + mpg + engineSize, data = Sample,
+            family = binomial(link = "logit"))
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> summary(GLM2)

Call:
glm(formula = price_G ~ mileage + mpg + engineSize, family = binomial(link = "logit"),
    data = Sample)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.1142  -0.2968  -0.0717   0.0000   5.6421

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.129e+00  3.811e-01  -2.963  0.00305 **
mileage     -1.291e-04  4.967e-06 -25.989  < 2e-16 ***
mpg         -1.013e-01  6.217e-03 -16.300  < 2e-16 ***
engineSize   3.231e+00  1.244e-01  25.979  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8383.0  on 7466  degrees of freedom
Residual deviance: 3586.9  on 7463  degrees of freedom
AIC: 3594.9

Number of Fisher Scoring iterations: 7

> # The residual deviance is 3586.9, far lower than the 6500.4 obtained when mileage was the only predictor.
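Because the one-variable model is nested in the three-variable model, the drop in residual deviance can be tested against a chi-squared distribution, and the reported AIC values can be reproduced from the deviances. The sketch below uses only figures printed in the two `summary()` outputs; on the fitted objects themselves, `anova(GLM, GLM2, test = "Chisq")` would carry out the same comparison.

```r
# Figures copied from summary(GLM) and summary(GLM2).
dev1 <- 6500.4; df1 <- 7465; k1 <- 2   # mileage only: intercept + 1 slope
dev2 <- 3586.9; df2 <- 7463; k2 <- 4   # mileage + mpg + engineSize: intercept + 3 slopes

# Likelihood-ratio (deviance difference) test for the two added predictors.
lr_stat <- dev1 - dev2
lr_df   <- df1 - df2
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)
c(statistic = lr_stat, df = lr_df)
p_value < 2.2e-16

# For a 0/1 response, AIC = residual deviance + 2 * number of parameters.
c(AIC_GLM = dev1 + 2 * k1, AIC_GLM2 = dev2 + 2 * k2)
```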
>
> Predicted2 = predict(GLM2, newdata = Test, type = 'response')
> Predicted_C2 = ifelse(Predicted2 > 0.5, 1, 0)
> confusionMatrix(factor(Predicted_C2, levels = c(1,0)),
+                 factor(Test$price_G, levels = c(1,0)))
Confusion Matrix and Statistics

          Reference
Prediction    1    0
         1  608  155
         0  182 2256

               Accuracy : 0.8947
                 95% CI : (0.8836, 0.9051)
    No Information Rate : 0.7532
    P-Value [Acc > NIR] : <2e-16

                  Kappa : 0.7135

Mcnemar's Test P-Value : 0.1567

            Sensitivity : 0.7696
            Specificity : 0.9357
         Pos Pred Value : 0.7969
         Neg Pred Value : 0.9253
             Prevalence : 0.2468
         Detection Rate : 0.1899
   Detection Prevalence : 0.2384
      Balanced Accuracy : 0.8527

       'Positive' Class : 1

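> # Performance is much better than with mileage alone: at the same 0.5 cutoff, accuracy rises from 0.7701 to 0.8947 and sensitivity from 0.3722 to 0.7696.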

Source: 실무 프로젝트로 배우는 데이터 분석 with R (Data Analysis with R Through Practical Projects)

Meongtae's IT Blog
