[실무 프로젝트로 배우는...] Mean Analysis

버섯도리 2022. 1. 26. 06:31

> ### 3.6 Mean analysis for used-car characteristics
>
> #### 3.6.1 One-sample t-test
> t.test(log(Sample$price), mu = 9.94)

One Sample t-test

data:  log(Sample$price)
t = -2.4822, df = 7466, p-value = 0.01308
alternative hypothesis: true mean is not equal to 9.94
95 percent confidence interval:
9.915643 9.937139
sample estimates:
mean of x
9.926391

> # p-value = 0.01308 < 0.05, so reject the null hypothesis that the mean log-transformed used-car price is 9.94.
>
>
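To make the `t.test()` output easier to read, here is a minimal sketch that rebuilds the one-sample statistic by hand, t = (x̄ − μ₀) / (s / √n), and checks it against `t.test()`. The vector `x` is simulated and only stands in for `log(Sample$price)`; its mean, spread, and size are assumptions, not the book's data.

```r
# Hypothetical stand-in for log(Sample$price); the real data are not reproduced here.
set.seed(1)
x   <- rnorm(200, mean = 9.9, sd = 0.5)
mu0 <- 9.94                                   # hypothesized mean under H0

n      <- length(x)
se     <- sd(x) / sqrt(n)                     # standard error of the mean
t_stat <- (mean(x) - mu0) / se                # one-sample t statistic
p_val  <- 2 * pt(-abs(t_stat), df = n - 1)    # two-sided p-value
ci     <- mean(x) + c(-1, 1) * qt(0.975, df = n - 1) * se  # 95% CI

res <- t.test(x, mu = mu0)

# The manual values should match t.test() up to floating-point error.
all.equal(unname(res$statistic), t_stat)
all.equal(res$p.value, p_val)
all.equal(as.numeric(res$conf.int), ci)
```

The same reading applies to the output above: the hypothesized value 9.94 lies outside the 95% confidence interval, which is equivalent to a two-sided p-value below 0.05.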
> #### 3.6.2 Independent two-sample t-test
>
> # Test for equal variances between the two groups
> library(car)
> Audi_NonHybrid$fuelType = factor(Audi_NonHybrid$fuelType,
+                                  levels = c("Petrol","Diesel"))
> leveneTest(log(Audi_NonHybrid$price) ~ Audi_NonHybrid$fuelType)
Levene's Test for Homogeneity of Variance (center = median)
         Df F value    Pr(>F)
group     1  145.65 < 2.2e-16 ***
      10638
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # Reject the null hypothesis that, in the Audi data, the log-transformed prices of Petrol and Diesel used cars have equal variance.
>
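As a rough illustration of what `leveneTest(..., center = median)` computes, the test can be reproduced in base R as a one-way ANOVA on the absolute deviations of each observation from its group median. The data below are simulated, and the group sizes and standard deviations are assumptions chosen only for the sketch.

```r
set.seed(2)
# Hypothetical two-group data standing in for log(price) by fuelType.
y <- c(rnorm(60, mean = 10, sd = 0.3), rnorm(80, mean = 10, sd = 0.6))
g <- factor(rep(c("Petrol", "Diesel"), times = c(60, 80)),
            levels = c("Petrol", "Diesel"))

# Absolute deviation of each observation from its own group median.
z <- abs(y - ave(y, g, FUN = median))

# One-way ANOVA on the deviations; its F test is the median-centered
# Levene statistic (also known as the Brown-Forsythe test).
lev <- anova(lm(z ~ g))
lev
```

A small p-value in this table points to unequal variances, which is the situation where `var.equal = FALSE` (Welch's test) is the safer choice.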
> # Independent two-sample t-test (Welch, because equal variances were rejected)
> t.test(log(Audi_NonHybrid$price) ~ Audi_NonHybrid$fuelType ,
+        var.equal = FALSE)

Welch Two Sample t-test

data:  log(Audi_NonHybrid$price) by Audi_NonHybrid$fuelType
t = 0.0078653, df = 10585, p-value = 0.9937
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.01768657  0.01782908
sample estimates:
mean in group Petrol mean in group Diesel
            9.927492             9.927421

> # p-value = 0.9937, so the null hypothesis cannot be rejected.
> # The 95% confidence interval, [-0.018, 0.018], contains 0, so there is no evidence that the two group means differ.
>
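The fractional degrees of freedom that Welch's test reports come from the Welch-Satterthwaite approximation. A minimal sketch on simulated data (the two samples `a` and `b` are hypothetical, not the Audi data) shows how the statistic, degrees of freedom, and p-value fit together:

```r
set.seed(3)
# Hypothetical samples with unequal sizes and variances.
a <- rnorm(50, mean = 10.0, sd = 0.4)
b <- rnorm(70, mean = 10.1, sd = 0.7)

va <- var(a) / length(a)   # squared standard error of mean(a)
vb <- var(b) / length(b)   # squared standard error of mean(b)

t_stat <- (mean(a) - mean(b)) / sqrt(va + vb)

# Welch-Satterthwaite approximation to the degrees of freedom.
df_w <- (va + vb)^2 / (va^2 / (length(a) - 1) + vb^2 / (length(b) - 1))

p_val <- 2 * pt(-abs(t_stat), df = df_w)

res <- t.test(a, b, var.equal = FALSE)

# Manual values should agree with t.test() up to floating-point error.
all.equal(unname(res$statistic), t_stat)
all.equal(unname(res$parameter), df_w)
all.equal(res$p.value, p_val)
```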

> #### 3.6.3 Analysis of variance (ANOVA)
>
> # One-way ANOVA
>
> ANOVA = aov(log(Audi$price) ~ Audi$fuelType)
> summary(ANOVA)
                 Df Sum Sq Mean Sq F value Pr(>F)
Audi$fuelType     2    1.5  0.7631   3.443  0.032 *
Residuals     10665 2363.8  0.2216
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
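> # fuelType: between-group variation; Residuals: within-group variation
> # The p-value (Pr(>F)) is 0.032, below the 0.05 significance level, so reject the null hypothesis ('the mean log-transformed used-car price is the same for all fuel types').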
>
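The F value in the table is the ratio of the two mean squares: 0.7631 / 0.2216 ≈ 3.44. The following sketch rebuilds that decomposition by hand on simulated data and compares it with `aov()`; the group sizes and means are assumptions made only for illustration.

```r
set.seed(4)
# Hypothetical three-group data standing in for log(price) by fuelType.
y <- c(rnorm(30, 10.0, 0.5), rnorm(40, 10.1, 0.5), rnorm(20, 10.4, 0.5))
g <- factor(rep(c("Petrol", "Diesel", "Hybrid"), times = c(30, 40, 20)))

grand      <- mean(y)
group_mean <- ave(y, g)                       # each observation's group mean

ss_between <- sum((group_mean - grand)^2)     # between-group sum of squares
ss_within  <- sum((y - group_mean)^2)         # within-group (residual) sum of squares

df_between <- nlevels(g) - 1
df_within  <- length(y) - nlevels(g)

f_stat <- (ss_between / df_between) / (ss_within / df_within)
p_val  <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

tab <- summary(aov(y ~ g))[[1]]

# Manual values should agree with the aov() table.
all.equal(tab[["F value"]][1], f_stat)
all.equal(tab[["Pr(>F)"]][1], p_val)
```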
> # Visualizing the group means
>
> install.packages("ggpubr")
> library(ggpubr)
> my_comparisons = list(c("Petrol","Hybrid"),
+                       c("Petrol","Diesel"),
+                       c("Diesel","Hybrid"))
>
> Audi %>%
+   mutate(log_price = log(price)) %>%
+   ggboxplot(x = "fuelType", y = "log_price",
+             bxp.errorbar =  TRUE, color = "fuelType", palette = "jco",
+             fill = "fuelType") +
+   stat_boxplot(geom = "errorbar",
+                aes(x = fuelType, y = log_price)) +
+   stat_compare_means(comparisons = my_comparisons)
Warning message(s):
Ignoring unknown aesthetics: fill

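> # The plot shows a difference between the Hybrid type and the Petrol and Diesel types.
> # Note: stat_compare_means() uses the Wilcoxon test by default, so the bracket p-values compare distributions rather than means directly.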
>
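If a base-R check of which pairs of groups differ in mean is wanted, one option (not used in the original post) is Tukey's honest significant difference on the fitted `aov` object. This is only a sketch on simulated data; the group means below are assumptions.

```r
set.seed(7)
# Hypothetical data: Hybrid is given a higher mean than Petrol and Diesel.
y <- c(rnorm(30, 10.0, 0.5), rnorm(40, 10.0, 0.5), rnorm(20, 10.4, 0.5))
g <- factor(rep(c("Petrol", "Diesel", "Hybrid"), times = c(30, 40, 20)))

fit <- aov(y ~ g)
tuk <- TukeyHSD(fit)   # all pairwise mean differences with adjusted p-values

# One row per pair; columns: diff, lwr, upr, p adj.
tuk$g
```

Pairs whose interval (`lwr`, `upr`) excludes 0 are the ones where the means differ at the family-wise 5% level.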
> # Two-way ANOVA
>
> ANOVA2 = aov(log(Audi$price) ~ Audi$fuelType * Audi$transmission)
> summary(ANOVA2)
                                   Df Sum Sq Mean Sq  F value   Pr(>F)
Audi$fuelType                       2    1.5     0.8    4.888  0.00756 **
Audi$transmission                   2  687.6   343.8 2202.012  < 2e-16 ***
Audi$fuelType:Audi$transmission     3   11.8     3.9   25.141 3.35e-16 ***
Residuals                       10660 1664.4     0.2
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # Both main effects (fuelType, transmission) and their interaction are significant.
>
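One detail worth knowing when reading this table: `aov()` reports sequential (Type I) sums of squares, so each term is adjusted only for the terms listed before it. That is why fuelType has the same Sum Sq (1.5) here as in the one-way table, while its F value changes because the residual mean square is smaller. With unbalanced data, the order of the terms can change the main-effect rows. A small sketch on simulated, deliberately unbalanced data (all names and effect sizes are hypothetical):

```r
set.seed(5)
# Hypothetical unbalanced two-factor data.
d <- data.frame(
  fuel  = factor(sample(c("Petrol", "Diesel", "Hybrid"), 300, replace = TRUE,
                        prob = c(0.5, 0.4, 0.1))),
  trans = factor(sample(c("Manual", "Automatic"), 300, replace = TRUE,
                        prob = c(0.7, 0.3)))
)
d$y <- 10 + 0.3 * (d$trans == "Automatic") + 0.2 * (d$fuel == "Hybrid") +
  rnorm(300, sd = 0.4)

one_way   <- summary(aov(y ~ fuel, data = d))[[1]]
fuel_1st  <- summary(aov(y ~ fuel * trans, data = d))[[1]]  # rows: fuel, trans, fuel:trans, Residuals
trans_1st <- summary(aov(y ~ trans * fuel, data = d))[[1]]  # rows: trans, fuel, trans:fuel, Residuals

# The first term's sum of squares equals the one-way between-group sum of squares.
all.equal(one_way[["Sum Sq"]][1], fuel_1st[["Sum Sq"]][1])

# Compare the fuel row when it is entered first vs. second.
c(fuel_first = fuel_1st[["Sum Sq"]][1], fuel_second = trans_1st[["Sum Sq"]][2])
```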
> # Visualization when an interaction effect is present
> Audi %>%
+   group_by(fuelType, transmission) %>%
+   summarise(Mean = mean(log(price))) %>%
+   ggplot() +
+   geom_point(aes(x = fuelType, y = Mean, col = transmission,
+                  shape = transmission, size = Mean), alpha = 0.4) +
+   geom_line(aes(x = fuelType, y = Mean, col = transmission, group = transmission),
+             size = 1.2) +
+   scale_size_area(max_size = 8) +
+   guides(size = "none") +
+   theme_bw() +
+   theme(legend.position = "bottom")
`summarise()` has grouped output by 'fuelType'. You can override using the `.groups` argument.

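> # When there is no interaction effect, the lines are generally close to parallel and do not cross.
> # In this plot the lines cross or overlap in many places, which is consistent with the significant interaction in the ANOVA table.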
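
The "parallel lines" reading can also be checked numerically from the cell means. The sketch below uses base R only and simulated data in which Automatic raises the response for Diesel but not for Petrol; the factor levels and effect size are assumptions for illustration.

```r
set.seed(6)
# Hypothetical balanced design: 50 observations per fuel x transmission cell.
d <- expand.grid(fuel  = c("Petrol", "Diesel"),
                 trans = c("Manual", "Automatic"),
                 rep   = 1:50)
# Assumed effect: Automatic adds 0.5 only for Diesel (a pure interaction).
d$y <- 10 + 0.5 * (d$fuel == "Diesel" & d$trans == "Automatic") +
  rnorm(nrow(d), sd = 0.2)

# Matrix of cell means: rows are fuel types, columns are transmissions.
cell_means <- tapply(d$y, list(d$fuel, d$trans), mean)

# Gap between transmissions within each fuel type.
# Similar gaps mean near-parallel lines; very different gaps suggest an interaction.
gap <- cell_means[, "Automatic"] - cell_means[, "Manual"]
gap

# Base-R interaction plot of the same cell means.
interaction.plot(x.factor = d$fuel, trace.factor = d$trans, response = d$y,
                 xlab = "fuel", ylab = "mean of y", trace.label = "trans")
```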

Source: 실무 프로젝트로 배우는 데이터 분석 with R