class: center, middle, inverse, title-slide # Batch loading data and list columns ### Daniel Anderson ### Week 4 --- layout: true <script> feather.replace() </script> <div class="slides-footer"> <span> <a class = "footer-icon-link" href = "https://github.com/uo-datasci-specialization/c3-fp-2022/raw/main/static/slides/w4.pdf"> <i class = "footer-icon" data-feather="download"></i> </a> <a class = "footer-icon-link" href = "https://fp-2022.netlify.app/slides/w4.html"> <i class = "footer-icon" data-feather="link"></i> </a> <a class = "footer-icon-link" href = "https://fp-2022.netlify.app/"> <i class = "footer-icon" data-feather="globe"></i> </a> <a class = "footer-icon-link" href = "https://github.com/uo-datasci-specialization/c3-fp-2022"> <i class = "footer-icon" data-feather="github"></i> </a> </span> </div> --- # Agenda * Discuss the midterm + Canvas quiz (10 points; please don't stress) + Take home (40 points) * Review Lab 2 * `map_dfr` and batch-loading data --- * Introduce list columns * Contrast: + `group_by() %>% nest() %>% mutate() %>% map()` with + `nest_by() %>% summarize()` * In-class midterm (last 30 minutes) --- class: inverse-red middle # Review Lab 2 --- # Learning objectives * Understand when `map_dfr` can and should be applied * Better understand file paths, and how `{fs}` can help * Be able to batch load data of a specific type within a mixed-type directory * Use filenames to pull data --- # Learning objectives (cont.) * Understand list columns and how they relate to `base::split` * Fluently nest/unnest data frames * Understand why `tidyr::nest` can be a powerful framework (data frames) and when `tidyr::unnest` can/should be used to move out of nested data frames and into a standard data frame. --- class: inverse-blue center middle # Midterm ### Questions? 
### Let's look at the [take-home portion](https://fp-2022.netlify.app/take-home-midterm/) --- # `map_dfr` * If each iteration returns a data frame, you can use `map_dfr` to automatically bind all the data frames together. --- # Example * Create a function that simulates data (please copy the code and follow along) ```r set.seed(8675309) simulate <- function(n, mean = 0, sd = 1) { tibble(sample_id = seq_len(n), sample = rnorm(n, mean, sd)) } simulate(10) ``` ``` ## # A tibble: 10 × 2 ## sample_id sample ## <int> <dbl> ## 1 1 -0.9965824 ## 2 2 0.7218241 ## 3 3 -0.6172088 ## 4 4 2.029392 ## 5 5 1.065416 ## 6 6 0.9872197 ## 7 7 0.02745393 ## 8 8 0.6728723 ## 9 9 0.5720665 ## 10 10 0.9036777 ``` --- # Two more quick examples ```r simulate(3, 100, 10) ``` ``` ## # A tibble: 3 × 2 ## sample_id sample ## <int> <dbl> ## 1 1 84.50448 ## 2 2 110.2264 ## 3 3 101.5008 ``` ```r simulate(5, -10, 1.5) ``` ``` ## # A tibble: 5 × 2 ## sample_id sample ## <int> <dbl> ## 1 1 -10.98995 ## 2 2 -11.49188 ## 3 3 -7.041312 ## 4 4 -10.66270 ## 5 5 -11.35096 ``` --- # Simulation * Assume we want to vary the sample size from 10 to 150 by increments of 5 * `mean` stays constant at 100, `sd` is constant at 10 -- ### Try with `purrr::map`
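Before iterating, it can help to look at the inputs themselves. The following is a small sketch (the object name `sample_sizes` is only an illustration, not part of the slides) that builds the vector of sample sizes described above and checks how many iterations to expect, and how many rows there would be in total if every simulated data frame were stacked.

```r
# Sample sizes from 10 to 150 in steps of 5 (hypothetical object name)
sample_sizes <- seq(10, 150, by = 5)

# Number of iterations the loop will run
n_iterations <- length(sample_sizes)

# Total number of simulated rows across all iterations
n_rows_total <- sum(sample_sizes)

n_iterations
n_rows_total
```

Each element of `sample_sizes` is passed as the first argument (`n`) of `simulate()`, so any additional arguments such as `mean` and `sd` stay constant across iterations.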
--- ```r library(tidyverse) sims <- map(seq(10, 150, 5), simulate, mean = 100, sd = 10) ``` .pull-left[ ```r sims[1] ``` ``` ## [[1]] ## # A tibble: 10 × 2 ## sample_id sample ## <int> <dbl> ## 1 1 103.7618 ## 2 2 111.5353 ## 3 3 115.7490 ## 4 4 105.8853 ## 5 5 93.84955 ## 6 6 97.71089 ## 7 7 100.6392 ## 8 8 96.86526 ## 9 9 97.51501 ## 10 10 98.46205 ``` ] .pull-right[ ```r sims[2] ``` ``` ## [[1]] ## # A tibble: 15 × 2 ## sample_id sample ## <int> <dbl> ## 1 1 93.64743 ## 2 2 99.96206 ## 3 3 100.4562 ## 4 4 106.8407 ## 5 5 97.47957 ## 6 6 98.48961 ## 7 7 91.25069 ## 8 8 80.23099 ## 9 9 102.3766 ## 10 10 100.3609 ## 11 11 101.3490 ## 12 12 101.1758 ## 13 13 91.74411 ## 14 14 78.64764 ## 15 15 102.1421 ``` ] --- # Swap for `map_dfr` ### Try it - what happens?
-- ```r sims_df <- map_dfr(seq(10, 150, 5), simulate, 100, 10) sims_df ``` ``` ## # A tibble: 2,320 × 2 ## sample_id sample ## <int> <dbl> ## 1 1 85.64361 ## 2 2 103.6789 ## 3 3 94.71782 ## 4 4 103.1350 ## 5 5 99.78701 ## 6 6 105.3462 ## 7 7 100.0653 ## 8 8 94.28314 ## 9 9 108.8872 ## 10 10 106.0850 ## # … with 2,310 more rows ``` --- # Notice a problem here ```r sims_df[1:15, ] ``` ``` ## # A tibble: 15 × 2 ## sample_id sample ## <int> <dbl> ## 1 1 85.64361 ## 2 2 103.6789 ## 3 3 94.71782 ## 4 4 103.1350 ## 5 5 99.78701 ## 6 6 105.3462 ## 7 7 100.0653 ## 8 8 94.28314 ## 9 9 108.8872 ## 10 10 106.0850 ## 11 1 89.49968 ## 12 2 86.99898 ## 13 3 85.38054 ## 14 4 99.10690 ## 15 5 105.0088 ``` --- # `.id` argument ```r sims_df2 <- map_dfr(seq(10, 150, 5), simulate, 100, 10, .id = "iteration") sims_df2[1:15, ] ``` ``` ## # A tibble: 15 × 3 ## iteration sample_id sample ## <chr> <int> <dbl> ## 1 1 1 112.1250 ## 2 1 2 88.07056 ## 3 1 3 108.3908 ## 4 1 4 100.8193 ## 5 1 5 102.1545 ## 6 1 6 113.5398 ## 7 1 7 101.4171 ## 8 1 8 99.33668 ## 9 1 9 100.2855 ## 10 1 10 90.22043 ## 11 2 1 91.08882 ## 12 2 2 107.3664 ## 13 2 3 101.1745 ## 14 2 4 96.82053 ## 15 2 5 105.5844 ``` --- class: middle inverse-orange > `.id`: Either a string or NULL. If a string, the output will contain a variable with that name, storing either the name (if .x is named) or the index (if .x is unnamed) of the input. If NULL, the default, no variable will be created. 
\- [{purrr} documentation](https://purrr.tidyverse.org/reference/map.html) --- # setNames ```r sample_size <- seq(10, 150, 5) sample_size ``` ``` ## [1] 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 90 95 100 105 ## [21] 110 115 120 125 130 135 140 145 150 ``` ```r sample_size <- setNames(sample_size, english::english(seq(10, 150, 5))) sample_size[1:15] ``` ``` ## ten fifteen twenty twenty-five thirty thirty-five ## 10 15 20 25 30 35 ## forty forty-five fifty fifty-five sixty sixty-five ## 40 45 50 55 60 65 ## seventy seventy-five eighty ## 70 75 80 ``` --- # Try again ```r sims_df3 <- map_dfr(sample_size, simulate, 100, 10, .id = "n") sims_df3[1:15, ] ``` ``` ## # A tibble: 15 × 3 ## n sample_id sample ## <chr> <int> <dbl> ## 1 ten 1 98.94914 ## 2 ten 2 101.6824 ## 3 ten 3 88.16447 ## 4 ten 4 90.13604 ## 5 ten 5 85.53591 ## 6 ten 6 90.69977 ## 7 ten 7 105.8858 ## 8 ten 8 89.12978 ## 9 ten 9 114.4982 ## 10 ten 10 111.6440 ## 11 fifteen 1 103.2732 ## 12 fifteen 2 106.8949 ## 13 fifteen 3 88.83591 ## 14 fifteen 4 105.5402 ## 15 fifteen 5 112.6581 ``` --- # Another quick example ### `broom::tidy` * The {broom} package helps us extract model output in a tidy format ```r lm(tvhours ~ age, gss_cat) %>% broom::tidy() ``` ``` ## # A tibble: 2 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 1.992391 0.06980966 28.54033 4.672161e-173 ## 2 age 0.02095679 0.001387361 15.10551 4.689310e- 51 ``` --- # Fit separate models by year ### Again - probs not best statistically ```r split(gss_cat, gss_cat$year) %>% map_dfr( ~lm(tvhours ~ age, .x) %>% broom::tidy() ) ``` ``` ## # A tibble: 16 × 5 ## term estimate std.error statistic p.value ## <chr> <dbl> <dbl> <dbl> <dbl> ## 1 (Intercept) 2.080163 0.1709061 12.17138 7.995632e-33 ## 2 age 0.01948584 0.003485199 5.591027 2.599011e- 8 ## 3 (Intercept) 2.078999 0.2176829 9.550583 1.191266e-20 ## 4 age 0.01963575 0.004400292 4.462375 9.137366e- 6 ## 5 (Intercept) 1.767990 0.2464509 7.173804 
1.531756e-12 ## 6 age 0.02386070 0.005031548 4.742218 2.459650e- 6 ## 7 (Intercept) 2.096054 0.1496431 14.00702 1.419772e-42 ## 8 age 0.01781388 0.002977289 5.983256 2.589482e- 9 ## 9 (Intercept) 1.855278 0.2156381 8.603668 2.167351e-17 ## 10 age 0.02390720 0.004314567 5.541043 3.628675e- 8 ## 11 (Intercept) 2.068914 0.2096397 9.868903 2.896085e-22 ## 12 age 0.01989505 0.004086638 4.868317 1.251234e- 6 ## 13 (Intercept) 1.878070 0.2258400 8.315932 2.280108e-16 ## 14 age 0.02547794 0.004449295 5.726287 1.274840e- 8 ## 15 (Intercept) 1.980095 0.1877544 10.54620 3.238043e-25 ## 16 age 0.02049066 0.003611900 5.673098 1.650822e- 8 ``` --- # .id In cases like the preceding, `.id` becomes invaluable ```r split(gss_cat, gss_cat$year) %>% map_dfr( ~lm(tvhours ~ age, .x) %>% broom::tidy(), .id = "year" ) ``` ``` ## # A tibble: 16 × 6 ## year term estimate std.error statistic p.value ## <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 2000 (Intercept) 2.080163 0.1709061 12.17138 7.995632e-33 ## 2 2000 age 0.01948584 0.003485199 5.591027 2.599011e- 8 ## 3 2002 (Intercept) 2.078999 0.2176829 9.550583 1.191266e-20 ## 4 2002 age 0.01963575 0.004400292 4.462375 9.137366e- 6 ## 5 2004 (Intercept) 1.767990 0.2464509 7.173804 1.531756e-12 ## 6 2004 age 0.02386070 0.005031548 4.742218 2.459650e- 6 ## 7 2006 (Intercept) 2.096054 0.1496431 14.00702 1.419772e-42 ## 8 2006 age 0.01781388 0.002977289 5.983256 2.589482e- 9 ## 9 2008 (Intercept) 1.855278 0.2156381 8.603668 2.167351e-17 ## 10 2008 age 0.02390720 0.004314567 5.541043 3.628675e- 8 ## 11 2010 (Intercept) 2.068914 0.2096397 9.868903 2.896085e-22 ## 12 2010 age 0.01989505 0.004086638 4.868317 1.251234e- 6 ## 13 2012 (Intercept) 1.878070 0.2258400 8.315932 2.280108e-16 ## 14 2012 age 0.02547794 0.004449295 5.726287 1.274840e- 8 ## 15 2014 (Intercept) 1.980095 0.1877544 10.54620 3.238043e-25 ## 16 2014 age 0.02049066 0.003611900 5.673098 1.650822e- 8 ``` --- class: inverse-green middle # Break
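---

The `.id` behavior shown before the break comes from row-binding a named list. As a minimal sketch (using the built-in `mtcars` data rather than `gss_cat`, and a hypothetical helper called `fit_coefs`), `map_dfr(..., .id = )` should give the same result as `map()` followed by `dplyr::bind_rows(.id = )`. Note also that `map_dfr()` is marked as superseded in {purrr} 1.0.0 and later, where `map()` followed by `list_rbind(names_to = )` is the suggested replacement.

```r
library(purrr)
library(dplyr)

# Split a built-in data set into a named list (names are "4", "6", "8")
by_cyl <- split(mtcars, mtcars$cyl)

# Hypothetical helper: fit a simple model, return a one-row data frame
fit_coefs <- function(d) {
  fit <- lm(mpg ~ wt, data = d)
  data.frame(intercept = coef(fit)[[1]], slope = coef(fit)[[2]])
}

# One step, as in the slides: the list names end up in the "cyl" column
res_dfr <- map_dfr(by_cyl, fit_coefs, .id = "cyl")

# Two steps: iterate first, then bind the rows
res_bind <- map(by_cyl, fit_coefs) %>%
  bind_rows(.id = "cyl")

# Newer interface, only available in purrr >= 1.0.0
if (packageVersion("purrr") >= "1.0.0") {
  res_rbind <- map(by_cyl, fit_coefs) %>%
    list_rbind(names_to = "cyl")
}

identical(res_dfr, res_bind)
```

The two-step version makes it easier to inspect the intermediate list when one of the iterations fails or returns something unexpected.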
--- class: inverse-blue middle # Batch-loading data ### Please follow along --- # {fs} ### Could we apply `map_dfr` here? ```r # install.packages("fs") library(fs) dir_ls(here::here("data")) ``` ``` ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/insurance_coverage.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim ``` ```r dir_ls(here::here("data", "pfiles_sim")) ``` ``` ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Sciencepfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5ELApfiles18_sim.csv ## 
/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Sciencepfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Sciencepfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Wripfiles18_sim.csv ``` --- # Limit files * We really only want the `.csv` + That happens to be the only thing that's in there but that's regularly not the case ```r dir_ls(here::here("data", 
"pfiles_sim"), glob = "*.csv") ``` ``` ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Sciencepfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Sciencepfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Wripfiles18_sim.csv ## 
/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Wripfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8ELApfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Rdgpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Sciencepfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Wripfiles18_sim.csv ``` --- # Batch load * Loop through the directories and `import` or `read_csv` ```r files <- dir_ls( here::here("data", "pfiles_sim"), glob = "*.csv" ) batch <- map_dfr(files, read_csv) batch ``` ``` ## # A tibble: 15,945 × 22 ## Entry Theta Status Count RawScore SE Infit Infit_Z Outfit Outfit_Z ## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 123 1.2687 1 36 23 0.3713 0.93 -0.34 0.82 -0.62 ## 2 88 1.5541 1 36 25 0.3852 0.95 -0.37 0.81 -0.56 ## 3 105 3.2773 1 36 33 0.6187 0.9 -0.04 1.63 1.03 ## 4 153 4.4752 1 36 35 1.0234 0.93 0.23 0.35 -0.16 ## 5 437 2.6655 1 36 31 0.5008 0.92 
-0.18 0.88 -0.12 ## 6 307 5.7137 0 36 36 1.8371 1 0 1 0 ## 7 305 3.7326 1 36 34 0.7408 1.06 0.31 0.86 0.17 ## 8 42 0.609 1 36 18 0.36 1.55 2.56 1.74 3.31 ## 9 59 -2.623 1 36 3 1.0344 0.85 0.06 0.17 -0.37 ## 10 304 5.7137 0 36 36 1.8371 1 0 1 0 ## # … with 15,935 more rows, and 12 more variables: Displacement <dbl>, ## # PointMeasureCorr <dbl>, Weight <dbl>, ObservMatch <dbl>, ExpectMatch <dbl>, ## # PointMeasureExpected <dbl>, RMSR <dbl>, WMLE <dbl>, testeventid <dbl>, ssid <dbl>, ## # asmtprmrydsbltycd <dbl>, asmtscndrydsbltycd <dbl> ``` --- # Problem * We've lost a lot of info - no way to identify which file is which ### Try to fix it!
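A hint that may help: `fs::dir_ls()` returns a *named* character vector, and the names are the paths themselves. The sketch below assumes a throwaway directory under `tempdir()` with two made-up files (`a.csv` and `b.csv`), so it runs without the course data.

```r
library(fs)

# Hypothetical scratch directory with two tiny CSV files
tmp <- file.path(tempdir(), "names_demo")
dir.create(tmp, showWarnings = FALSE)
write.csv(data.frame(x = 1:2), file.path(tmp, "a.csv"), row.names = FALSE)
write.csv(data.frame(x = 3:4), file.path(tmp, "b.csv"), row.names = FALSE)

# List the files, as in the slides
paths <- dir_ls(tmp, glob = "*.csv")

# The vector carries names, one per file
names(paths)
```

Functions that record "the name of the input" can use these names to keep track of where each piece of data came from.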
--- # Add id ```r batch2 <- map_dfr(files, read_csv, .id = "file") batch2 ``` ``` ## # A tibble: 15,945 × 23 ## # … with 15,935 more rows, and 23 more variables: file <chr>, Entry <dbl>, ## # Theta <dbl>, Status <dbl>, Count <dbl>, RawScore <dbl>, SE <dbl>, Infit <dbl>, ## # Infit_Z <dbl>, Outfit <dbl>, Outfit_Z <dbl>, Displacement <dbl>, ## # PointMeasureCorr <dbl>, Weight <dbl>, ObservMatch <dbl>, ExpectMatch <dbl>, ## # PointMeasureExpected <dbl>, RMSR <dbl>, WMLE <dbl>, testeventid <dbl>, ssid <dbl>, ## # asmtprmrydsbltycd <dbl>, asmtscndrydsbltycd <dbl> ``` Note - the `file` column contains the full path, which is so long that no rows are printed --- ```r batch2 %>% count(file) ``` ``` ## # A tibble: 31 × 2 ## # … with 21 more rows, and 2 more variables: file <chr>, n <int> ``` * Still not terrifically useful. What can we do? --- # Step 1 * Remove the `here::here` path from the string ```r batch2 <- batch2 %>% mutate( file = str_replace_all( file, here::here("data", "pfiles_sim"), "" ) ) batch2 %>% count(file) ``` ``` ## # A tibble: 31 × 2 ## file n ## <chr> <int> ## 1 /g11ELApfiles18_sim.csv 453 ## 2 /g11Mathpfiles18_sim.csv 460 ## 3 /g11Rdgpfiles18_sim.csv 453 ## 4 /g11Sciencepfiles18_sim.csv 438 ## 5 /g11Wripfiles18_sim.csv 453 ## 6 /g3ELApfiles18_sim.csv 540 ## 7 /g3Mathpfiles18_sim.csv 536 ## 8 /g3Rdgpfiles18_sim.csv 540 ## 9 /g3Wripfiles18_sim.csv 540 ## 10 /g4ELApfiles18_sim.csv 585 ## # … with 21 more rows ``` --- # Pull out pieces you need * Regular expressions are the most powerful tool here + We haven't talked about them much * Try [RegExplain](https://www.garrickadenbuie.com/project/regexplain/) --- # Pull grade * Note - I'm not expecting you to just suddenly be able to do this. This is more for illustration. 
There are also other ways you could extract the same info ```r batch2 %>% mutate( grade = str_replace_all( file, "/g(\\d?\\d).+", "\\1" ) ) %>% select(file, grade) ``` ``` ## # A tibble: 15,945 × 2 ## file grade ## <chr> <chr> ## 1 /g11ELApfiles18_sim.csv 11 ## 2 /g11ELApfiles18_sim.csv 11 ## 3 /g11ELApfiles18_sim.csv 11 ## 4 /g11ELApfiles18_sim.csv 11 ## 5 /g11ELApfiles18_sim.csv 11 ## 6 /g11ELApfiles18_sim.csv 11 ## 7 /g11ELApfiles18_sim.csv 11 ## 8 /g11ELApfiles18_sim.csv 11 ## 9 /g11ELApfiles18_sim.csv 11 ## 10 /g11ELApfiles18_sim.csv 11 ## # … with 15,935 more rows ``` --- # `parse_number` * In this case `parse_number` also works - but note that it would not work to extract the year, because it only returns the first number it finds in the string ```r batch2 %>% * mutate(grade = parse_number(file)) %>% select(file, grade) ``` ``` ## # A tibble: 15,945 × 2 ## file grade ## <chr> <dbl> ## 1 /g11ELApfiles18_sim.csv 11 ## 2 /g11ELApfiles18_sim.csv 11 ## 3 /g11ELApfiles18_sim.csv 11 ## 4 /g11ELApfiles18_sim.csv 11 ## 5 /g11ELApfiles18_sim.csv 11 ## 6 /g11ELApfiles18_sim.csv 11 ## 7 /g11ELApfiles18_sim.csv 11 ## 8 /g11ELApfiles18_sim.csv 11 ## 9 /g11ELApfiles18_sim.csv 11 ## 10 /g11ELApfiles18_sim.csv 11 ## # … with 15,935 more rows ``` --- # Extract year ```r batch2 %>% mutate( grade = str_replace_all( file, "/g(\\d?\\d).+", "\\1" ), * year = str_replace_all( * file, ".+files(\\d\\d)_sim.+", "\\1" * ) ) %>% select(file, grade, year) ``` ``` ## # A tibble: 15,945 × 3 ## file grade year ## <chr> <chr> <chr> ## 1 /g11ELApfiles18_sim.csv 11 18 ## 2 /g11ELApfiles18_sim.csv 11 18 ## 3 /g11ELApfiles18_sim.csv 11 18 ## 4 /g11ELApfiles18_sim.csv 11 18 ## 5 /g11ELApfiles18_sim.csv 11 18 ## 6 /g11ELApfiles18_sim.csv 11 18 ## 7 /g11ELApfiles18_sim.csv 11 18 ## 8 /g11ELApfiles18_sim.csv 11 18 ## 9 /g11ELApfiles18_sim.csv 11 18 ## 10 /g11ELApfiles18_sim.csv 11 18 ## # … with 15,935 more rows ``` --- # Extract Content Area ```r batch2 %>% mutate(grade = str_replace_all(file, "/g(\\d?\\d).+", "\\1"), year = str_replace_all(file, 
".+files(\\d\\d)_sim.+", "\\1"), * content = str_replace_all(file, * "/g\\d?\\d(.+)pfiles.+", * "\\1")) %>% select(file, grade, year, content) ``` ``` ## # A tibble: 15,945 × 4 ## file grade year content ## <chr> <chr> <chr> <chr> ## 1 /g11ELApfiles18_sim.csv 11 18 ELA ## 2 /g11ELApfiles18_sim.csv 11 18 ELA ## 3 /g11ELApfiles18_sim.csv 11 18 ELA ## 4 /g11ELApfiles18_sim.csv 11 18 ELA ## 5 /g11ELApfiles18_sim.csv 11 18 ELA ## 6 /g11ELApfiles18_sim.csv 11 18 ELA ## 7 /g11ELApfiles18_sim.csv 11 18 ELA ## 8 /g11ELApfiles18_sim.csv 11 18 ELA ## 9 /g11ELApfiles18_sim.csv 11 18 ELA ## 10 /g11ELApfiles18_sim.csv 11 18 ELA ## # … with 15,935 more rows ``` --- # Double checks: grade ```r batch2 %>% mutate(grade = str_replace_all(file, "/g(\\d?\\d).+", "\\1"), year = str_replace_all(file, ".+files(\\d\\d)_sim.+", "\\1"), content = str_replace_all(file, "/g\\d?\\d(.+)pfiles.+", "\\1")) %>% select(file, grade, year, content) %>% count(grade) ``` ``` ## # A tibble: 7 × 2 ## grade n ## <chr> <int> ## 1 11 2257 ## 2 3 2156 ## 3 4 2341 ## 4 5 2632 ## 5 6 2216 ## 6 7 1962 ## 7 8 2381 ``` --- # Double checks: year ```r batch2 %>% mutate(grade = str_replace_all(file, "/g(\\d?\\d).+", "\\1"), year = str_replace_all(file, ".+files(\\d\\d)_sim.+", "\\1"), content = str_replace_all(file, "/g\\d?\\d(.+)pfiles.+", "\\1")) %>% select(file, grade, year, content) %>% count(year) ``` ``` ## # A tibble: 1 × 2 ## year n ## <chr> <int> ## 1 18 15945 ``` --- # Double checks: content ```r batch2 %>% mutate(grade = str_replace_all(file, "/g(\\d?\\d).+", "\\1"), year = str_replace_all(file, ".+files(\\d\\d)_sim.+", "\\1"), content = str_replace_all(file, "/g\\d?\\d(.+)pfiles.+", "\\1")) %>% select(file, grade, year, content) %>% count(content) ``` ``` ## # A tibble: 5 × 2 ## content n ## <chr> <int> ## 1 ELA 3627 ## 2 Math 3629 ## 3 Rdg 3627 ## 4 Science 1435 ## 5 Wri 3627 ``` --- # Finalize ```r d <- batch2 %>% mutate( grade = str_replace_all( file, "/g(\\d?\\d).+", "\\1" ), grade = 
as.integer(grade), year = str_replace_all( file, ".+files(\\d\\d)_sim.+", "\\1" ), year = as.integer(year), content = str_replace_all( file, "/g\\d?\\d(.+)pfiles.+", "\\1" ) ) %>% select(-file) %>% select( ssid, grade, year, content, testeventid, asmtprmrydsbltycd, asmtscndrydsbltycd, Entry:WMLE ) ``` --- # Final product * In this case, we basically have a tidy data frame already! * We've reduced our problem from 31 files to a single data frame ```r d ``` ``` ## # A tibble: 15,945 × 25 ## ssid grade year content testeventid asmtprmrydsbltycd asmtscndrydsbltycd Entry ## <dbl> <int> <int> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 9466908 11 18 ELA 148933 0 0 123 ## 2 7683685 11 18 ELA 147875 10 0 88 ## 3 9025693 11 18 ELA 143699 40 20 105 ## 4 10099824 11 18 ELA 143962 82 0 153 ## 5 18886078 11 18 ELA 150680 10 0 437 ## 6 10606750 11 18 ELA 144583 80 80 307 ## 7 10541306 11 18 ELA 145204 50 0 305 ## 8 7632967 11 18 ELA 148926 10 0 42 ## 9 7661118 11 18 ELA 148893 50 0 59 ## 10 10547177 11 18 ELA 144583 82 0 304 ## # … with 15,935 more rows, and 17 more variables: Theta <dbl>, Status <dbl>, ## # Count <dbl>, RawScore <dbl>, SE <dbl>, Infit <dbl>, Infit_Z <dbl>, Outfit <dbl>, ## # Outfit_Z <dbl>, Displacement <dbl>, PointMeasureCorr <dbl>, Weight <dbl>, ## # ObservMatch <dbl>, ExpectMatch <dbl>, PointMeasureExpected <dbl>, RMSR <dbl>, ## # WMLE <dbl> ``` --- # Quick look at distributions ![](w4_files/figure-html/fig-1.png)<!-- --> --- # Summary stats ```r d %>% group_by(grade, content, asmtprmrydsbltycd) %>% summarize(mean = mean(Theta)) %>% pivot_wider(names_from = content, values_from = mean) ``` ``` ## # A tibble: 77 × 7 ## # Groups: grade [7] ## grade asmtprmrydsbltycd ELA Math Rdg Wri Science ## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 3 0 -0.07361 -1.21055 1.010455 1.612308 NA ## 2 3 10 0.3700416 -0.8182091 0.5184354 0.3206475 NA ## 3 3 20 -0.06335 -1.2514 1.52 -0.5775 NA ## 4 3 40 -1.877683 -3.56365 -1.761667 -0.7514286 NA ## 5 3 50 0.9462857 -0.09186957 0.9791176 
1.191481 NA ## 6 3 60 0.840775 1.040375 2.181111 1.067 NA ## 7 3 70 -1.104049 -1.517955 -0.8454839 -1.005625 NA ## 8 3 74 0.996 0.0208375 0.6 1.2925 NA ## 9 3 80 -0.144304 -0.5325596 0.6791667 0.2686301 NA ## 10 3 82 0.3708244 -1.080988 0.5676650 0.3440741 NA ## # … with 67 more rows ``` --- # Backing up a bit * What if we wanted only math files? ```r dir_ls(here::here("data", "pfiles_sim"), regexp = "Math") ``` ``` ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Mathpfiles18_sim.csv ## /Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Mathpfiles18_sim.csv ``` --- # Only Grade 5 ### You try
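The two filtering arguments of `dir_ls()` use different pattern languages: `glob` takes wildcard patterns, while `regexp` takes regular expressions (as does `pattern` in base R's `list.files()`). The sketch below assumes a throwaway directory with made-up file names, and uses a slightly stricter regular expression than the slides so that it cannot accidentally match the directory part of the path.

```r
library(fs)

# Hypothetical scratch directory; the file names are invented for illustration
tmp <- file.path(tempdir(), "filter_demo")
dir.create(tmp, showWarnings = FALSE)
file.create(file.path(tmp, c("g3Math.csv", "g3Rdg.csv", "g4Math.csv", "notes.txt")))

# Wildcard syntax: every path ending in .csv
csv_only <- dir_ls(tmp, glob = "*.csv")

# Regular expression syntax: paths ending in "Math.csv"
math_only <- dir_ls(tmp, regexp = "Math\\.csv$")

# Base R can translate a wildcard pattern into a regular expression
csv_regex <- utils::glob2rx("*.csv")

basename(csv_only)
basename(math_only)
csv_regex
```

A wildcard pattern such as `"*.csv"` is not the same thing as the regular expression that matches `.csv` files; when a function expects a regular expression, something like `"\\.csv$"` (or the output of `glob2rx()`) is the safer choice.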
-- ```r g5_paths <- dir_ls( here::here("data", "pfiles_sim"), * regexp = "g5" ) ``` --- # The rest is the same ```r g5 <- map_dfr(g5_paths, read_csv, .id = "file") %>% mutate( file = str_replace_all( file, here::here("data", "pfiles_sim"), "" ) ) g5 ``` ``` ## # A tibble: 2,632 × 23 ## file Entry Theta Status Count RawScore SE Infit Infit_Z ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 /g5ELApfiles18_sim.csv 375 3.154 1 36 32 0.551 1.22 0.91 ## 2 /g5ELApfiles18_sim.csv 305 0.3662 1 36 16 0.3894 0.94 -0.31 ## 3 /g5ELApfiles18_sim.csv 163 -4.9547 -1 36 0 1.8495 1 0 ## 4 /g5ELApfiles18_sim.csv 524 -4.9547 -1 36 0 1.8495 1 0 ## 5 /g5ELApfiles18_sim.csv 81 3.154 1 36 32 0.551 0.96 0 ## 6 /g5ELApfiles18_sim.csv 325 1.7156 1 36 25 0.3997 1.14 0.92 ## 7 /g5ELApfiles18_sim.csv 163 1.8786 1 36 26 0.4078 0.82 -1.08 ## 8 /g5ELApfiles18_sim.csv 116 5.9323 0 36 36 1.8373 1 0 ## 9 /g5ELApfiles18_sim.csv 273 1.4052 1 36 23 0.3891 1.08 0.44 ## 10 /g5ELApfiles18_sim.csv 202 1.8786 1 36 26 0.4078 0.79 -0.84 ## # … with 2,622 more rows, and 14 more variables: Outfit <dbl>, Outfit_Z <dbl>, ## # Displacement <dbl>, PointMeasureCorr <dbl>, Weight <dbl>, ObservMatch <dbl>, ## # ExpectMatch <dbl>, PointMeasureExpected <dbl>, RMSR <dbl>, WMLE <dbl>, ## # testeventid <dbl>, ssid <dbl>, asmtprmrydsbltycd <dbl>, asmtscndrydsbltycd <dbl> ``` --- # Base equivalents ```r list.files(here::here("data", "pfiles_sim")) ``` ``` ## [1] "g11ELApfiles18_sim.csv" "g11Mathpfiles18_sim.csv" ## [3] "g11Rdgpfiles18_sim.csv" "g11Sciencepfiles18_sim.csv" ## [5] "g11Wripfiles18_sim.csv" "g3ELApfiles18_sim.csv" ## [7] "g3Mathpfiles18_sim.csv" "g3Rdgpfiles18_sim.csv" ## [9] "g3Wripfiles18_sim.csv" "g4ELApfiles18_sim.csv" ## [11] "g4Mathpfiles18_sim.csv" "g4Rdgpfiles18_sim.csv" ## [13] "g4Wripfiles18_sim.csv" "g5ELApfiles18_sim.csv" ## [15] "g5Mathpfiles18_sim.csv" "g5Rdgpfiles18_sim.csv" ## [17] "g5Sciencepfiles18_sim.csv" "g5Wripfiles18_sim.csv" ## [19] "g6ELApfiles18_sim.csv" 
"g6Mathpfiles18_sim.csv" ## [21] "g6Rdgpfiles18_sim.csv" "g6Wripfiles18_sim.csv" ## [23] "g7ELApfiles18_sim.csv" "g7Mathpfiles18_sim.csv" ## [25] "g7Rdgpfiles18_sim.csv" "g7Wripfiles18_sim.csv" ## [27] "g8ELApfiles18_sim.csv" "g8Mathpfiles18_sim.csv" ## [29] "g8Rdgpfiles18_sim.csv" "g8Sciencepfiles18_sim.csv" ## [31] "g8Wripfiles18_sim.csv" ``` --- # Full path ```r list.files(here::here("data", "pfiles_sim"), full.names = TRUE) ``` ``` ## [1] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11ELApfiles18_sim.csv" ## [2] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Mathpfiles18_sim.csv" ## [3] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Rdgpfiles18_sim.csv" ## [4] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Sciencepfiles18_sim.csv" ## [5] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Wripfiles18_sim.csv" ## [6] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3ELApfiles18_sim.csv" ## [7] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Mathpfiles18_sim.csv" ## [8] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Rdgpfiles18_sim.csv" ## [9] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Wripfiles18_sim.csv" ## [10] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4ELApfiles18_sim.csv" ## [11] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Mathpfiles18_sim.csv" ## [12] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Rdgpfiles18_sim.csv" ## [13] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Wripfiles18_sim.csv" ## [14] 
"/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5ELApfiles18_sim.csv" ## [15] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Mathpfiles18_sim.csv" ## [16] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Rdgpfiles18_sim.csv" ## [17] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Sciencepfiles18_sim.csv" ## [18] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Wripfiles18_sim.csv" ## [19] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6ELApfiles18_sim.csv" ## [20] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Mathpfiles18_sim.csv" ## [21] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Rdgpfiles18_sim.csv" ## [22] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Wripfiles18_sim.csv" ## [23] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7ELApfiles18_sim.csv" ## [24] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Mathpfiles18_sim.csv" ## [25] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Rdgpfiles18_sim.csv" ## [26] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Wripfiles18_sim.csv" ## [27] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8ELApfiles18_sim.csv" ## [28] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Mathpfiles18_sim.csv" ## [29] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Rdgpfiles18_sim.csv" ## [30] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Sciencepfiles18_sim.csv" ## [31] 
"/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Wripfiles18_sim.csv" ``` --- # Only csvs ```r list.files(here::here("data", "pfiles_sim"), full.names = TRUE, pattern = "*.csv") ``` ``` ## [1] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11ELApfiles18_sim.csv" ## [2] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Mathpfiles18_sim.csv" ## [3] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Rdgpfiles18_sim.csv" ## [4] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Sciencepfiles18_sim.csv" ## [5] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g11Wripfiles18_sim.csv" ## [6] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3ELApfiles18_sim.csv" ## [7] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Mathpfiles18_sim.csv" ## [8] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Rdgpfiles18_sim.csv" ## [9] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g3Wripfiles18_sim.csv" ## [10] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4ELApfiles18_sim.csv" ## [11] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Mathpfiles18_sim.csv" ## [12] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Rdgpfiles18_sim.csv" ## [13] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g4Wripfiles18_sim.csv" ## [14] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5ELApfiles18_sim.csv" ## [15] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Mathpfiles18_sim.csv" ## [16] 
"/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Rdgpfiles18_sim.csv" ## [17] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Sciencepfiles18_sim.csv" ## [18] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g5Wripfiles18_sim.csv" ## [19] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6ELApfiles18_sim.csv" ## [20] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Mathpfiles18_sim.csv" ## [21] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Rdgpfiles18_sim.csv" ## [22] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g6Wripfiles18_sim.csv" ## [23] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7ELApfiles18_sim.csv" ## [24] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Mathpfiles18_sim.csv" ## [25] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Rdgpfiles18_sim.csv" ## [26] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g7Wripfiles18_sim.csv" ## [27] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8ELApfiles18_sim.csv" ## [28] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Mathpfiles18_sim.csv" ## [29] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Rdgpfiles18_sim.csv" ## [30] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Sciencepfiles18_sim.csv" ## [31] "/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/data/pfiles_sim/g8Wripfiles18_sim.csv" ``` --- # Why not use base? 
* We could, but `{fs}` plays a little nicer with `{purrr}` -- ```r files <- list.files( here::here("data", "pfiles_sim"), pattern = "*.csv" ) batch3 <- map_dfr(files, read_csv, .id = "file") ``` ``` ## Error: 'g11ELApfiles18_sim.csv' does not exist in current working directory ('/Users/daniel/Teaching/data_sci_specialization/2021-22/c3-fp-2022/static/slides'). ``` -- * Need to return full names ```r files ``` ``` ## [1] "g11ELApfiles18_sim.csv" "g11Mathpfiles18_sim.csv" ## [3] "g11Rdgpfiles18_sim.csv" "g11Sciencepfiles18_sim.csv" ## [5] "g11Wripfiles18_sim.csv" "g3ELApfiles18_sim.csv" ## [7] "g3Mathpfiles18_sim.csv" "g3Rdgpfiles18_sim.csv" ## [9] "g3Wripfiles18_sim.csv" "g4ELApfiles18_sim.csv" ## [11] "g4Mathpfiles18_sim.csv" "g4Rdgpfiles18_sim.csv" ## [13] "g4Wripfiles18_sim.csv" "g5ELApfiles18_sim.csv" ## [15] "g5Mathpfiles18_sim.csv" "g5Rdgpfiles18_sim.csv" ## [17] "g5Sciencepfiles18_sim.csv" "g5Wripfiles18_sim.csv" ## [19] "g6ELApfiles18_sim.csv" "g6Mathpfiles18_sim.csv" ## [21] "g6Rdgpfiles18_sim.csv" "g6Wripfiles18_sim.csv" ## [23] "g7ELApfiles18_sim.csv" "g7Mathpfiles18_sim.csv" ## [25] "g7Rdgpfiles18_sim.csv" "g7Wripfiles18_sim.csv" ## [27] "g8ELApfiles18_sim.csv" "g8Mathpfiles18_sim.csv" ## [29] "g8Rdgpfiles18_sim.csv" "g8Sciencepfiles18_sim.csv" ## [31] "g8Wripfiles18_sim.csv" ``` --- # Try again ```r files <- list.files(here::here("data", "pfiles_sim"), pattern = "*.csv", full.names = TRUE) batch3 <- map_dfr(files, read_csv, .id = "file") batch3 ``` ``` ## # A tibble: 15,945 × 23 ## file Entry Theta Status Count RawScore SE Infit Infit_Z Outfit Outfit_Z ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1 123 1.2687 1 36 23 0.3713 0.93 -0.34 0.82 -0.62 ## 2 1 88 1.5541 1 36 25 0.3852 0.95 -0.37 0.81 -0.56 ## 3 1 105 3.2773 1 36 33 0.6187 0.9 -0.04 1.63 1.03 ## 4 1 153 4.4752 1 36 35 1.0234 0.93 0.23 0.35 -0.16 ## 5 1 437 2.6655 1 36 31 0.5008 0.92 -0.18 0.88 -0.12 ## 6 1 307 5.7137 0 36 36 1.8371 1 0 1 0 ## 7 1 305 3.7326 1 36 34 
0.7408 1.06 0.31 0.86 0.17 ## 8 1 42 0.609 1 36 18 0.36 1.55 2.56 1.74 3.31 ## 9 1 59 -2.623 1 36 3 1.0344 0.85 0.06 0.17 -0.37 ## 10 1 304 5.7137 0 36 36 1.8371 1 0 1 0 ## # … with 15,935 more rows, and 12 more variables: Displacement <dbl>, ## # PointMeasureCorr <dbl>, Weight <dbl>, ObservMatch <dbl>, ExpectMatch <dbl>, ## # PointMeasureExpected <dbl>, RMSR <dbl>, WMLE <dbl>, testeventid <dbl>, ssid <dbl>, ## # asmtprmrydsbltycd <dbl>, asmtscndrydsbltycd <dbl> ``` --- # indexes * The prior example gave us indexes, rather than the file path. Why? -- ### No names ```r names(files) ``` ``` ## NULL ``` * We **need** the file path! An index isn't nearly as useful. --- # Base method that works ```r files <- list.files(here::here("data", "pfiles_sim"), pattern = "*.csv", full.names = TRUE) files <- setNames(files, files) batch4 <- map_dfr(files, read_csv, .id = "file") batch4 ``` ``` ## # A tibble: 15,945 × 23 ## # … with 15,935 more rows, and 23 more variables: file <chr>, Entry <dbl>, ## # Theta <dbl>, Status <dbl>, Count <dbl>, RawScore <dbl>, SE <dbl>, Infit <dbl>, ## # Infit_Z <dbl>, Outfit <dbl>, Outfit_Z <dbl>, Displacement <dbl>, ## # PointMeasureCorr <dbl>, Weight <dbl>, ObservMatch <dbl>, ExpectMatch <dbl>, ## # PointMeasureExpected <dbl>, RMSR <dbl>, WMLE <dbl>, testeventid <dbl>, ssid <dbl>, ## # asmtprmrydsbltycd <dbl>, asmtscndrydsbltycd <dbl> ``` --- # My recommendation * If you're working interactively, no reason not to use `{fs}` * If you are building **functions** that take paths, might be worth considering skipping the dependency -- ### Note I am **not** saying skip it, but rather that you should **consider** whether it is really needed or not. --- class: inverse-blue center middle # List columns --- # Comparing models Let's say we wanted to fit/compare a set of models for each content area 1. `lm(Theta ~ asmtprmrydsbltycd)` 1. `lm(Theta ~ asmtprmrydsbltycd + asmtscndrydsbltycd)` 1. 
`lm(Theta ~ asmtprmrydsbltycd * asmtscndrydsbltycd)` --- # Data pre-processing * The disability variables are stored as numbers, we need them as factors * We'll make the names easier in the process ```r d <- d %>% mutate(primary = as.factor(asmtprmrydsbltycd), secondary = as.factor(asmtscndrydsbltycd)) ``` If you're interested in what the specific codes refer to, see [here](https://www.newberg.k12.or.us/district/eligibility-codes-and-requirements). --- # Split the data The base method we've been using... ```r splt_content <- split(d, d$content) str(splt_content) ``` ``` ## List of 5 ## $ ELA : tibble [3,627 × 27] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:3627] 9466908 7683685 9025693 10099824 18886078 ... ## ..$ grade : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ content : chr [1:3627] "ELA" "ELA" "ELA" "ELA" ... ## ..$ testeventid : num [1:3627] 148933 147875 143699 143962 150680 ... ## ..$ asmtprmrydsbltycd : num [1:3627] 0 10 40 82 10 80 50 10 50 82 ... ## ..$ asmtscndrydsbltycd : num [1:3627] 0 0 20 0 0 80 0 0 0 0 ... ## ..$ Entry : num [1:3627] 123 88 105 153 437 307 305 42 59 304 ... ## ..$ Theta : num [1:3627] 1.27 1.55 3.28 4.48 2.67 ... ## ..$ Status : num [1:3627] 1 1 1 1 1 0 1 1 1 0 ... ## ..$ Count : num [1:3627] 36 36 36 36 36 36 36 36 36 36 ... ## ..$ RawScore : num [1:3627] 23 25 33 35 31 36 34 18 3 36 ... ## ..$ SE : num [1:3627] 0.371 0.385 0.619 1.023 0.501 ... ## ..$ Infit : num [1:3627] 0.93 0.95 0.9 0.93 0.92 1 1.06 1.55 0.85 1 ... ## ..$ Infit_Z : num [1:3627] -0.34 -0.37 -0.04 0.23 -0.18 0 0.31 2.56 0.06 0 ... ## ..$ Outfit : num [1:3627] 0.82 0.81 1.63 0.35 0.88 1 0.86 1.74 0.17 1 ... ## ..$ Outfit_Z : num [1:3627] -0.62 -0.56 1.03 -0.16 -0.12 0 0.17 3.31 -0.37 0 ... ## ..$ Displacement : num [1:3627] 0.0018 0.0019 0.0022 0.0023 0.0021 0.0024 0.0022 0.0017 0.0009 0.0024 ... ## ..$ PointMeasureCorr : num [1:3627] 0.42 0.42 0.3 0.27 0.31 0 0.14 -0.12 0.32 0 ... 
## ..$ Weight : num [1:3627] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:3627] 75 80.6 91.7 97.2 86.1 100 94.4 50 97.2 100 ... ## ..$ ExpectMatch : num [1:3627] 68.3 72 91.7 97.2 86.1 100 94.4 65 97.2 100 ... ## ..$ PointMeasureExpected: num [1:3627] 0.35 0.33 0.2 0.12 0.25 0 0.17 0.38 0.17 0 ... ## ..$ RMSR : num [1:3627] 0.42 0.39 0.26 0.16 0.31 0 0.23 0.51 0.14 0 ... ## ..$ WMLE : num [1:3627] 1.25 1.68 3.13 4 2.59 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 1 2 4 11 2 10 6 2 6 11 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 1 3 1 1 10 1 1 1 1 ... ## $ Math : tibble [3,629 × 27] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:3629] 10129634 10926496 10616063 10139443 8887381 ... ## ..$ grade : int [1:3629] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:3629] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ content : chr [1:3629] "Math" "Math" "Math" "Math" ... ## ..$ testeventid : num [1:3629] 147564 151249 146976 151229 147676 ... ## ..$ asmtprmrydsbltycd : num [1:3629] 10 80 10 10 80 82 10 10 40 82 ... ## ..$ asmtscndrydsbltycd : num [1:3629] 0 90 0 0 10 80 50 0 50 0 ... ## ..$ Entry : num [1:3629] 161 344 321 167 101 131 141 357 14 367 ... ## ..$ Theta : num [1:3629] 0.514 1.965 0.899 1.965 -4.724 ... ## ..$ Status : num [1:3629] 1 1 1 1 -1 1 1 1 1 1 ... ## ..$ Count : num [1:3629] 36 36 36 36 36 36 36 36 36 36 ... ## ..$ RawScore : num [1:3629] 19 29 22 29 0 14 18 25 19 22 ... ## ..$ SE : num [1:3629] 0.356 0.436 0.362 0.436 1.839 ... ## ..$ Infit : num [1:3629] 0.97 1.03 0.99 1 1 1.01 0.99 0.85 0.96 1.06 ... ## ..$ Infit_Z : num [1:3629] -0.18 0.26 -0.09 0.1 0 0.13 -0.06 -0.81 -0.39 0.5 ... ## ..$ Outfit : num [1:3629] 0.97 1.03 1.01 0.87 1 0.93 1 0.77 0.97 0.99 ... ## ..$ Outfit_Z : num [1:3629] -0.17 0.23 0.17 -0.46 0 -0.03 0.03 -0.87 -0.19 0.07 ... ## ..$ Displacement : num [1:3629] 5e-04 5e-04 5e-04 5e-04 -7e-04 4e-04 5e-04 5e-04 5e-04 5e-04 ... 
## ..$ PointMeasureCorr : num [1:3629] 0.38 0.28 0.34 0.29 0 0.34 0.38 0.45 0.38 0.27 ... ## ..$ Weight : num [1:3629] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:3629] 66.7 80.6 66.7 80.6 100 61.1 69.4 75 63.9 61.1 ... ## ..$ ExpectMatch : num [1:3629] 64.4 80.5 66 80.5 100 68.3 64.6 70.3 64.4 66 ... ## ..$ PointMeasureExpected: num [1:3629] 0.35 0.25 0.33 0.25 0 0.35 0.35 0.3 0.35 0.33 ... ## ..$ RMSR : num [1:3629] 0.46 0.39 0.45 0.39 0 0.46 0.46 0.4 0.46 0.48 ... ## ..$ WMLE : num [1:3629] 0.512 1.915 0.888 1.915 -4.724 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 2 10 2 2 10 11 2 2 4 11 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 12 1 1 2 10 6 1 6 1 ... ## $ Rdg : tibble [3,627 × 27] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:3627] 18631185 18342736 10897771 7663935 6709613 ... ## ..$ grade : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ content : chr [1:3627] "Rdg" "Rdg" "Rdg" "Rdg" ... ## ..$ testeventid : num [1:3627] 145641 147717 145196 149014 146545 ... ## ..$ asmtprmrydsbltycd : num [1:3627] 10 90 10 80 0 10 10 80 82 82 ... ## ..$ asmtscndrydsbltycd : num [1:3627] 0 0 0 10 0 82 0 70 0 0 ... ## ..$ Entry : num [1:3627] 444 444 343 74 3 77 181 261 297 182 ... ## ..$ Theta : num [1:3627] 2.24 3.48 2.72 -2.72 0.12 -1.39 -0.67 4.73 4.73 4.73 ... ## ..$ Status : num [1:3627] 1 1 1 1 1 1 1 0 0 0 ... ## ..$ Count : num [1:3627] 19 19 19 19 19 19 19 19 19 19 ... ## ..$ RawScore : num [1:3627] 16 18 17 1 8 3 5 19 19 19 ... ## ..$ SE : num [1:3627] 0.64 1.03 0.76 1.06 0.49 0.66 0.55 1.84 1.84 1.84 ... ## ..$ Infit : num [1:3627] 1.07 1.02 1.1 1 0.98 0.73 1.08 1 1 1 ... ## ..$ Infit_Z : num [1:3627] 0.29 0.31 0.53 0.29 -0.11 -0.54 0.37 0 0 0 ... ## ..$ Outfit : num [1:3627] 1.17 0.94 1.22 0.76 0.95 0.62 1.09 1 1 1 ... ## ..$ Outfit_Z : num [1:3627] 0.45 0.32 0.71 0 -0.22 -0.62 0.42 0 0 0 ... ## ..$ Displacement : num [1:3627] 0 0 0 0 0 0 0 0 0 0 ... 
## ..$ PointMeasureCorr : num [1:3627] 0.01 0.08 0.05 0 0.32 0.68 0.2 0 0 0 ... ## ..$ Weight : num [1:3627] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:3627] 84.2 94.7 89.5 94.7 63.2 84.2 78.9 100 100 100 ... ## ..$ ExpectMatch : num [1:3627] 84.2 94.7 89.5 94.7 64.2 85.1 76.3 100 100 100 ... ## ..$ PointMeasureExpected: num [1:3627] 0.17 0.1 0.14 0.24 0.3 0.31 0.32 0 0 0 ... ## ..$ RMSR : num [1:3627] 0.38 0.22 0.31 0.22 0.47 0.34 0.36 0 0 0 ... ## ..$ WMLE : num [1:3627] 2.11 3.02 2.5 -2.27 0.14 -0.9 -0.6 4.73 4.73 4.73 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 2 12 2 10 1 2 2 10 11 11 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 1 1 2 1 11 1 8 1 1 ... ## $ Science: tibble [1,435 × 27] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:1435] 7617607 7642717 10341706 9811494 10469745 ... ## ..$ grade : int [1:1435] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:1435] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ content : chr [1:1435] "Science" "Science" "Science" "Science" ... ## ..$ testeventid : num [1:1435] 148838 144146 149634 146456 144426 ... ## ..$ asmtprmrydsbltycd : num [1:1435] 10 82 80 10 90 10 10 80 80 10 ... ## ..$ asmtscndrydsbltycd : num [1:1435] 0 80 20 0 50 80 70 0 50 0 ... ## ..$ Entry : num [1:1435] 28 43 274 118 291 91 58 130 297 337 ... ## ..$ Theta : num [1:1435] 0.875 2.647 4.17 5.404 5.404 ... ## ..$ Status : num [1:1435] 1 1 1 0 0 1 1 1 1 1 ... ## ..$ Count : num [1:1435] 36 36 36 36 36 36 36 36 36 36 ... ## ..$ RawScore : num [1:1435] 22 32 35 36 36 26 2 34 34 28 ... ## ..$ SE : num [1:1435] 0.361 0.544 1.021 1.835 1.835 ... ## ..$ Infit : num [1:1435] 1.3 0.94 1.05 1.04 0.99 0.9 0.97 0.99 0.98 0.9 ... ## ..$ Infit_Z : num [1:1435] 2.35 0.02 0.37 0.35 0 -0.62 0.17 0.15 0.16 -0.52 ... ## ..$ Outfit : num [1:1435] 1.22 0.76 1.5 1.2 0.48 0.85 0.69 1.57 0.67 0.8 ... ## ..$ Outfit_Z : num [1:1435] 1.87 -0.27 0.91 0.6 -0.16 -0.87 -0.41 0.74 0.01 -0.76 ... 
## ..$ Displacement : num [1:1435] 0.0011 0.0016 0.0016 0.0027 0.0027 0.0013 0.0007 0.0016 0.0016 0.0014 ... ## ..$ PointMeasureCorr : num [1:1435] 0.07 0.3 -0.07 0 0.29 0.46 0.28 0.05 0.22 0.45 ... ## ..$ Weight : num [1:1435] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:1435] 55.6 88.9 97.2 100 100 75 94.4 94.4 94.4 80.6 ... ## ..$ ExpectMatch : num [1:1435] 66.1 88.9 97.2 100 100 73.2 94.4 94.4 94.4 77.9 ... ## ..$ PointMeasureExpected: num [1:1435] 0.32 0.22 0.12 0 0 0.3 0.15 0.16 0.16 0.28 ... ## ..$ RMSR : num [1:1435] 0.52 0.3 0.17 0 0 0.39 0.22 0.22 0.22 0.36 ... ## ..$ WMLE : num [1:1435] 0.863 2.543 3.692 5.404 5.404 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 2 11 10 2 12 2 2 10 10 2 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 10 3 1 6 10 8 1 6 1 ... ## $ Wri : tibble [3,627 × 27] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:3627] 7653093 7640498 7650957 9305807 18060455 ... ## ..$ grade : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ content : chr [1:3627] "Wri" "Wri" "Wri" "Wri" ... ## ..$ testeventid : num [1:3627] 148893 148868 144397 148988 144865 ... ## ..$ asmtprmrydsbltycd : num [1:3627] 10 10 90 0 82 10 80 0 10 82 ... ## ..$ asmtscndrydsbltycd : num [1:3627] 0 0 80 0 0 0 50 0 0 10 ... ## ..$ Entry : num [1:3627] 61 48 63 107 443 214 267 344 423 27 ... ## ..$ Theta : num [1:3627] 1.67 2.17 -1.56 5.01 2.17 0.38 0.79 5.01 -0.48 2.78 ... ## ..$ Status : num [1:3627] 1 1 1 0 1 1 1 0 1 1 ... ## ..$ Count : num [1:3627] 13 13 13 13 13 13 13 13 13 13 ... ## ..$ RawScore : num [1:3627] 9 10 2 13 10 6 7 13 4 11 ... ## ..$ SE : num [1:3627] 0.69 0.74 0.83 1.87 0.74 0.64 0.65 1.87 0.68 0.84 ... ## ..$ Infit : num [1:3627] 0.52 0.65 0.61 1 1.01 0.46 0.47 1 0.9 0.88 ... ## ..$ Infit_Z : num [1:3627] -2.07 -1.02 -0.79 0 0.24 -2.3 -2.2 0 -0.16 -0.21 ... ## ..$ Outfit : num [1:3627] 0.44 0.52 0.33 1 0.74 0.41 0.4 1 1.32 0.54 ... 
## ..$ Outfit_Z : num [1:3627] -1.65 -0.81 -0.47 0 -0.06 -1.88 -1.85 0 -0.14 -0.16 ... ## ..$ Displacement : num [1:3627] 0 0 0 0 0 0 0 0 0 0 ... ## ..$ PointMeasureCorr : num [1:3627] 0.86 0.73 0.62 0 0.51 0.87 0.87 0 0.56 0.56 ... ## ..$ Weight : num [1:3627] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:3627] 100 92.3 84.6 100 76.9 100 100 100 92.3 84.6 ... ## ..$ ExpectMatch : num [1:3627] 75.5 79.7 84.6 100 79.7 72.9 73 100 74.5 84.6 ... ## ..$ PointMeasureExpected: num [1:3627] 0.48 0.44 0.34 0 0.44 0.5 0.51 0 0.45 0.38 ... ## ..$ RMSR : num [1:3627] 0.31 0.34 0.26 0 0.38 0.29 0.29 0 0.37 0.31 ... ## ..$ WMLE : num [1:3627] 1.61 2.08 -1.38 5.01 2.08 0.38 0.78 5.01 -0.43 2.61 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 2 2 12 1 11 2 10 1 2 11 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 1 10 1 1 1 6 1 1 2 ... ``` --- # We could use this method ```r m1 <- map( splt_content, ~lm(Theta ~ asmtprmrydsbltycd, data = .x) ) m2 <- map( splt_content, ~lm(Theta ~ asmtprmrydsbltycd + asmtscndrydsbltycd, data = .x) ) m3 <- map( splt_content, ~lm(Theta ~ asmtprmrydsbltycd * asmtscndrydsbltycd, data = .x) ) ``` * Then conduct tests to see which model fit better, etc. --- # Alternative * Create a data frame with a list column ```r by_content <- d %>% group_by(content) %>% nest() by_content ``` ``` ## # A tibble: 5 × 2 ## # Groups: content [5] ## content data ## <chr> <list> ## 1 ELA <tibble [3,627 × 26]> ## 2 Math <tibble [3,629 × 26]> ## 3 Rdg <tibble [3,627 × 26]> ## 4 Science <tibble [1,435 × 26]> ## 5 Wri <tibble [3,627 × 26]> ``` --- # What's going on here? ```r str(by_content$data) ``` ``` ## List of 5 ## $ : tibble [3,627 × 26] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:3627] 9466908 7683685 9025693 10099824 18886078 ... ## ..$ grade : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ testeventid : num [1:3627] 148933 147875 143699 143962 150680 ... 
## ..$ asmtprmrydsbltycd : num [1:3627] 0 10 40 82 10 80 50 10 50 82 ... ## ..$ asmtscndrydsbltycd : num [1:3627] 0 0 20 0 0 80 0 0 0 0 ... ## ..$ Entry : num [1:3627] 123 88 105 153 437 307 305 42 59 304 ... ## ..$ Theta : num [1:3627] 1.27 1.55 3.28 4.48 2.67 ... ## ..$ Status : num [1:3627] 1 1 1 1 1 0 1 1 1 0 ... ## ..$ Count : num [1:3627] 36 36 36 36 36 36 36 36 36 36 ... ## ..$ RawScore : num [1:3627] 23 25 33 35 31 36 34 18 3 36 ... ## ..$ SE : num [1:3627] 0.371 0.385 0.619 1.023 0.501 ... ## ..$ Infit : num [1:3627] 0.93 0.95 0.9 0.93 0.92 1 1.06 1.55 0.85 1 ... ## ..$ Infit_Z : num [1:3627] -0.34 -0.37 -0.04 0.23 -0.18 0 0.31 2.56 0.06 0 ... ## ..$ Outfit : num [1:3627] 0.82 0.81 1.63 0.35 0.88 1 0.86 1.74 0.17 1 ... ## ..$ Outfit_Z : num [1:3627] -0.62 -0.56 1.03 -0.16 -0.12 0 0.17 3.31 -0.37 0 ... ## ..$ Displacement : num [1:3627] 0.0018 0.0019 0.0022 0.0023 0.0021 0.0024 0.0022 0.0017 0.0009 0.0024 ... ## ..$ PointMeasureCorr : num [1:3627] 0.42 0.42 0.3 0.27 0.31 0 0.14 -0.12 0.32 0 ... ## ..$ Weight : num [1:3627] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:3627] 75 80.6 91.7 97.2 86.1 100 94.4 50 97.2 100 ... ## ..$ ExpectMatch : num [1:3627] 68.3 72 91.7 97.2 86.1 100 94.4 65 97.2 100 ... ## ..$ PointMeasureExpected: num [1:3627] 0.35 0.33 0.2 0.12 0.25 0 0.17 0.38 0.17 0 ... ## ..$ RMSR : num [1:3627] 0.42 0.39 0.26 0.16 0.31 0 0.23 0.51 0.14 0 ... ## ..$ WMLE : num [1:3627] 1.25 1.68 3.13 4 2.59 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 1 2 4 11 2 10 6 2 6 11 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 1 3 1 1 10 1 1 1 1 ... ## $ : tibble [3,629 × 26] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:3629] 10129634 10926496 10616063 10139443 8887381 ... ## ..$ grade : int [1:3629] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:3629] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ testeventid : num [1:3629] 147564 151249 146976 151229 147676 ... 
## ..$ asmtprmrydsbltycd : num [1:3629] 10 80 10 10 80 82 10 10 40 82 ... ## ..$ asmtscndrydsbltycd : num [1:3629] 0 90 0 0 10 80 50 0 50 0 ... ## ..$ Entry : num [1:3629] 161 344 321 167 101 131 141 357 14 367 ... ## ..$ Theta : num [1:3629] 0.514 1.965 0.899 1.965 -4.724 ... ## ..$ Status : num [1:3629] 1 1 1 1 -1 1 1 1 1 1 ... ## ..$ Count : num [1:3629] 36 36 36 36 36 36 36 36 36 36 ... ## ..$ RawScore : num [1:3629] 19 29 22 29 0 14 18 25 19 22 ... ## ..$ SE : num [1:3629] 0.356 0.436 0.362 0.436 1.839 ... ## ..$ Infit : num [1:3629] 0.97 1.03 0.99 1 1 1.01 0.99 0.85 0.96 1.06 ... ## ..$ Infit_Z : num [1:3629] -0.18 0.26 -0.09 0.1 0 0.13 -0.06 -0.81 -0.39 0.5 ... ## ..$ Outfit : num [1:3629] 0.97 1.03 1.01 0.87 1 0.93 1 0.77 0.97 0.99 ... ## ..$ Outfit_Z : num [1:3629] -0.17 0.23 0.17 -0.46 0 -0.03 0.03 -0.87 -0.19 0.07 ... ## ..$ Displacement : num [1:3629] 5e-04 5e-04 5e-04 5e-04 -7e-04 4e-04 5e-04 5e-04 5e-04 5e-04 ... ## ..$ PointMeasureCorr : num [1:3629] 0.38 0.28 0.34 0.29 0 0.34 0.38 0.45 0.38 0.27 ... ## ..$ Weight : num [1:3629] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:3629] 66.7 80.6 66.7 80.6 100 61.1 69.4 75 63.9 61.1 ... ## ..$ ExpectMatch : num [1:3629] 64.4 80.5 66 80.5 100 68.3 64.6 70.3 64.4 66 ... ## ..$ PointMeasureExpected: num [1:3629] 0.35 0.25 0.33 0.25 0 0.35 0.35 0.3 0.35 0.33 ... ## ..$ RMSR : num [1:3629] 0.46 0.39 0.45 0.39 0 0.46 0.46 0.4 0.46 0.48 ... ## ..$ WMLE : num [1:3629] 0.512 1.915 0.888 1.915 -4.724 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 2 10 2 2 10 11 2 2 4 11 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 12 1 1 2 10 6 1 6 1 ... ## $ : tibble [3,627 × 26] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:3627] 18631185 18342736 10897771 7663935 6709613 ... ## ..$ grade : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ testeventid : num [1:3627] 145641 147717 145196 149014 146545 ... 
## ..$ asmtprmrydsbltycd : num [1:3627] 10 90 10 80 0 10 10 80 82 82 ... ## ..$ asmtscndrydsbltycd : num [1:3627] 0 0 0 10 0 82 0 70 0 0 ... ## ..$ Entry : num [1:3627] 444 444 343 74 3 77 181 261 297 182 ... ## ..$ Theta : num [1:3627] 2.24 3.48 2.72 -2.72 0.12 -1.39 -0.67 4.73 4.73 4.73 ... ## ..$ Status : num [1:3627] 1 1 1 1 1 1 1 0 0 0 ... ## ..$ Count : num [1:3627] 19 19 19 19 19 19 19 19 19 19 ... ## ..$ RawScore : num [1:3627] 16 18 17 1 8 3 5 19 19 19 ... ## ..$ SE : num [1:3627] 0.64 1.03 0.76 1.06 0.49 0.66 0.55 1.84 1.84 1.84 ... ## ..$ Infit : num [1:3627] 1.07 1.02 1.1 1 0.98 0.73 1.08 1 1 1 ... ## ..$ Infit_Z : num [1:3627] 0.29 0.31 0.53 0.29 -0.11 -0.54 0.37 0 0 0 ... ## ..$ Outfit : num [1:3627] 1.17 0.94 1.22 0.76 0.95 0.62 1.09 1 1 1 ... ## ..$ Outfit_Z : num [1:3627] 0.45 0.32 0.71 0 -0.22 -0.62 0.42 0 0 0 ... ## ..$ Displacement : num [1:3627] 0 0 0 0 0 0 0 0 0 0 ... ## ..$ PointMeasureCorr : num [1:3627] 0.01 0.08 0.05 0 0.32 0.68 0.2 0 0 0 ... ## ..$ Weight : num [1:3627] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:3627] 84.2 94.7 89.5 94.7 63.2 84.2 78.9 100 100 100 ... ## ..$ ExpectMatch : num [1:3627] 84.2 94.7 89.5 94.7 64.2 85.1 76.3 100 100 100 ... ## ..$ PointMeasureExpected: num [1:3627] 0.17 0.1 0.14 0.24 0.3 0.31 0.32 0 0 0 ... ## ..$ RMSR : num [1:3627] 0.38 0.22 0.31 0.22 0.47 0.34 0.36 0 0 0 ... ## ..$ WMLE : num [1:3627] 2.11 3.02 2.5 -2.27 0.14 -0.9 -0.6 4.73 4.73 4.73 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 2 12 2 10 1 2 2 10 11 11 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 1 1 2 1 11 1 8 1 1 ... ## $ : tibble [1,435 × 26] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:1435] 7617607 7642717 10341706 9811494 10469745 ... ## ..$ grade : int [1:1435] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:1435] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ testeventid : num [1:1435] 148838 144146 149634 146456 144426 ... 
## ..$ asmtprmrydsbltycd : num [1:1435] 10 82 80 10 90 10 10 80 80 10 ... ## ..$ asmtscndrydsbltycd : num [1:1435] 0 80 20 0 50 80 70 0 50 0 ... ## ..$ Entry : num [1:1435] 28 43 274 118 291 91 58 130 297 337 ... ## ..$ Theta : num [1:1435] 0.875 2.647 4.17 5.404 5.404 ... ## ..$ Status : num [1:1435] 1 1 1 0 0 1 1 1 1 1 ... ## ..$ Count : num [1:1435] 36 36 36 36 36 36 36 36 36 36 ... ## ..$ RawScore : num [1:1435] 22 32 35 36 36 26 2 34 34 28 ... ## ..$ SE : num [1:1435] 0.361 0.544 1.021 1.835 1.835 ... ## ..$ Infit : num [1:1435] 1.3 0.94 1.05 1.04 0.99 0.9 0.97 0.99 0.98 0.9 ... ## ..$ Infit_Z : num [1:1435] 2.35 0.02 0.37 0.35 0 -0.62 0.17 0.15 0.16 -0.52 ... ## ..$ Outfit : num [1:1435] 1.22 0.76 1.5 1.2 0.48 0.85 0.69 1.57 0.67 0.8 ... ## ..$ Outfit_Z : num [1:1435] 1.87 -0.27 0.91 0.6 -0.16 -0.87 -0.41 0.74 0.01 -0.76 ... ## ..$ Displacement : num [1:1435] 0.0011 0.0016 0.0016 0.0027 0.0027 0.0013 0.0007 0.0016 0.0016 0.0014 ... ## ..$ PointMeasureCorr : num [1:1435] 0.07 0.3 -0.07 0 0.29 0.46 0.28 0.05 0.22 0.45 ... ## ..$ Weight : num [1:1435] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:1435] 55.6 88.9 97.2 100 100 75 94.4 94.4 94.4 80.6 ... ## ..$ ExpectMatch : num [1:1435] 66.1 88.9 97.2 100 100 73.2 94.4 94.4 94.4 77.9 ... ## ..$ PointMeasureExpected: num [1:1435] 0.32 0.22 0.12 0 0 0.3 0.15 0.16 0.16 0.28 ... ## ..$ RMSR : num [1:1435] 0.52 0.3 0.17 0 0 0.39 0.22 0.22 0.22 0.36 ... ## ..$ WMLE : num [1:1435] 0.863 2.543 3.692 5.404 5.404 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 2 11 10 2 12 2 2 10 10 2 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 10 3 1 6 10 8 1 6 1 ... ## $ : tibble [3,627 × 26] (S3: tbl_df/tbl/data.frame) ## ..$ ssid : num [1:3627] 7653093 7640498 7650957 9305807 18060455 ... ## ..$ grade : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ year : int [1:3627] 11 11 11 11 11 11 11 11 11 11 ... ## ..$ testeventid : num [1:3627] 148893 148868 144397 148988 144865 ... 
## ..$ asmtprmrydsbltycd : num [1:3627] 10 10 90 0 82 10 80 0 10 82 ... ## ..$ asmtscndrydsbltycd : num [1:3627] 0 0 80 0 0 0 50 0 0 10 ... ## ..$ Entry : num [1:3627] 61 48 63 107 443 214 267 344 423 27 ... ## ..$ Theta : num [1:3627] 1.67 2.17 -1.56 5.01 2.17 0.38 0.79 5.01 -0.48 2.78 ... ## ..$ Status : num [1:3627] 1 1 1 0 1 1 1 0 1 1 ... ## ..$ Count : num [1:3627] 13 13 13 13 13 13 13 13 13 13 ... ## ..$ RawScore : num [1:3627] 9 10 2 13 10 6 7 13 4 11 ... ## ..$ SE : num [1:3627] 0.69 0.74 0.83 1.87 0.74 0.64 0.65 1.87 0.68 0.84 ... ## ..$ Infit : num [1:3627] 0.52 0.65 0.61 1 1.01 0.46 0.47 1 0.9 0.88 ... ## ..$ Infit_Z : num [1:3627] -2.07 -1.02 -0.79 0 0.24 -2.3 -2.2 0 -0.16 -0.21 ... ## ..$ Outfit : num [1:3627] 0.44 0.52 0.33 1 0.74 0.41 0.4 1 1.32 0.54 ... ## ..$ Outfit_Z : num [1:3627] -1.65 -0.81 -0.47 0 -0.06 -1.88 -1.85 0 -0.14 -0.16 ... ## ..$ Displacement : num [1:3627] 0 0 0 0 0 0 0 0 0 0 ... ## ..$ PointMeasureCorr : num [1:3627] 0.86 0.73 0.62 0 0.51 0.87 0.87 0 0.56 0.56 ... ## ..$ Weight : num [1:3627] 1 1 1 1 1 1 1 1 1 1 ... ## ..$ ObservMatch : num [1:3627] 100 92.3 84.6 100 76.9 100 100 100 92.3 84.6 ... ## ..$ ExpectMatch : num [1:3627] 75.5 79.7 84.6 100 79.7 72.9 73 100 74.5 84.6 ... ## ..$ PointMeasureExpected: num [1:3627] 0.48 0.44 0.34 0 0.44 0.5 0.51 0 0.45 0.38 ... ## ..$ RMSR : num [1:3627] 0.31 0.34 0.26 0 0.38 0.29 0.29 0 0.37 0.31 ... ## ..$ WMLE : num [1:3627] 1.61 2.08 -1.38 5.01 2.08 0.38 0.78 5.01 -0.43 2.61 ... ## ..$ primary : Factor w/ 12 levels "0","10","20",..: 2 2 12 1 11 2 10 1 2 11 ... ## ..$ secondary : Factor w/ 12 levels "0","10","20",..: 1 1 10 1 1 1 6 1 1 2 ... ``` --- # Explore a bit ```r map_dbl(by_content$data, nrow) ``` ``` ## [1] 3627 3629 3627 1435 3627 ``` ```r map_dbl(by_content$data, ncol) ``` ``` ## [1] 26 26 26 26 26 ``` ```r map_dbl(by_content$data, ~mean(.x$Theta)) ``` ``` ## [1] 1.28001056 -0.06683086 1.37068376 1.57850321 1.26090709 ``` --- # It's a data frame! 
We can add these summaries if we want ```r by_content %>% mutate(n = map_dbl(data, nrow)) ``` ``` ## # A tibble: 5 × 3 ## # Groups: content [5] ## content data n ## <chr> <list> <dbl> ## 1 ELA <tibble [3,627 × 26]> 3627 ## 2 Math <tibble [3,629 × 26]> 3629 ## 3 Rdg <tibble [3,627 × 26]> 3627 ## 4 Science <tibble [1,435 × 26]> 1435 ## 5 Wri <tibble [3,627 × 26]> 3627 ``` --- # `map_*` * Note on the previous example we used `map_dbl` and we got a vector in return. * What would happen if we just used `map`? -- ```r by_content %>% mutate(n = map(data, nrow)) ``` ``` ## # A tibble: 5 × 3 ## # Groups: content [5] ## content data n ## <chr> <list> <list> ## 1 ELA <tibble [3,627 × 26]> <int [1]> ## 2 Math <tibble [3,629 × 26]> <int [1]> ## 3 Rdg <tibble [3,627 × 26]> <int [1]> ## 4 Science <tibble [1,435 × 26]> <int [1]> ## 5 Wri <tibble [3,627 × 26]> <int [1]> ``` --- # Let's fit a model! ```r by_content %>% mutate(m1 = map(data, ~lm(Theta ~ primary, data = .x))) ``` ``` ## # A tibble: 5 × 3 ## # Groups: content [5] ## content data m1 ## <chr> <list> <list> ## 1 ELA <tibble [3,627 × 26]> <lm> ## 2 Math <tibble [3,629 × 26]> <lm> ## 3 Rdg <tibble [3,627 × 26]> <lm> ## 4 Science <tibble [1,435 × 26]> <lm> ## 5 Wri <tibble [3,627 × 26]> <lm> ``` --- # Extract the coefficients ```r by_content %>% mutate( m1 = map(data, ~lm(Theta ~ primary, data = .x)), coefs = map(m1, coef) ) ``` ``` ## # A tibble: 5 × 4 ## # Groups: content [5] ## content data m1 coefs ## <chr> <list> <list> <list> ## 1 ELA <tibble [3,627 × 26]> <lm> <dbl [11]> ## 2 Math <tibble [3,629 × 26]> <lm> <dbl [12]> ## 3 Rdg <tibble [3,627 × 26]> <lm> <dbl [11]> ## 4 Science <tibble [1,435 × 26]> <lm> <dbl [12]> ## 5 Wri <tibble [3,627 × 26]> <lm> <dbl [12]> ``` --- # Challenge * Continue with the above, but output a data frame with three columns: `content`, `intercept`, and `TBI` (which is code 74). 
* In other words, output the mean score for students coded as not having a disability (code 0), along with the mean score for students coded as having TBI.
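
A hint for the second bullet, as a minimal sketch on simulated data (the `toy` data frame, its `grp` factor, and the group means are made up for illustration and are not part of the course data): with R's default treatment contrasts, the intercept of `lm(y ~ factor)` is the mean of the reference level, and every other coefficient is a difference from that reference. A group's mean is therefore the intercept plus that group's coefficient.

```r
# Hypothetical toy data: three groups with different means
set.seed(42)
toy <- data.frame(
  grp = factor(rep(c("0", "10", "74"), each = 50)),
  y   = c(rnorm(50, 1), rnorm(50, 2), rnorm(50, 0))
)

m <- lm(y ~ grp, data = toy)
coefs <- coef(m)

# Intercept = mean of the reference level ("0")
ref_mean <- coefs[["(Intercept)"]]

# Intercept + coefficient = mean of group "74"
g74_mean <- ref_mean + coefs[["grp74"]]

# Compare against the group means computed directly
all.equal(ref_mean, mean(toy$y[toy$grp == "0"]))
all.equal(g74_mean, mean(toy$y[toy$grp == "74"]))
```

The same logic should carry over to `primary`, where code 0 is the reference level.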
---

```r
by_content %>%
  mutate(
    m1 = map(data, ~lm(Theta ~ primary, data = .x)),
    coefs = map(m1, coef),
    # The intercept (first coefficient) is the mean for code 0
    no_disab = map_dbl(coefs, 1),
    # The TBI coefficient is a difference from the intercept, so add them
    tbi = no_disab + map_dbl(coefs, "primary74")
  ) %>%
  select(content, no_disab, tbi)
```

```
## # A tibble: 5 × 3
## # Groups:   content [5]
##   content   no_disab       tbi
##   <chr>        <dbl>     <dbl>
## 1 ELA      0.9322336 0.1674462
## 2 Math    -0.1587907 0.1910821
## 3 Rdg      1.363101  1.629048
## 4 Science  1.491319  2.790971
## 5 Wri      1.571441  1.167429
```

--

Note - I wouldn't necessarily have expected you to add `no_disab` to the TBI coefficient.

---
# Compare models

* Back to our original task - fit all three models

### You try first

1. `lm(Theta ~ primary)`
1. `lm(Theta ~ primary + secondary)`
1. `lm(Theta ~ primary + secondary + primary:secondary)`
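
A side note on the third model: in R's formula syntax, `a * b` is shorthand for `a + b + a:b`. The sketch below uses the built-in `mtcars` data purely as a stand-in (it is not the course data, and the object names `mt`, `long`, and `short` are made up for illustration) to check that the two spellings fit the same model.

```r
# Stand-in data: treat two mtcars columns as factors
mt <- transform(mtcars, cyl = factor(cyl), am = factor(am))

# Main effects plus interaction, written out in full
long <- lm(mpg ~ cyl + am + cyl:am, data = mt)

# The same model using the `*` shorthand
short <- lm(mpg ~ cyl * am, data = mt)

# Same terms, same estimates
all.equal(coef(long), coef(short))
```

Either spelling should work for model 3; pick whichever reads more clearly to you.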
--- # Model fits ```r mods <- by_content %>% mutate( m1 = map(data, ~lm(Theta ~ primary, data = .x)), m2 = map(data, ~lm(Theta ~ primary + secondary, data = .x)), m3 = map(data, ~lm(Theta ~ primary * secondary, data = .x)) ) mods ``` ``` ## # A tibble: 5 × 5 ## # Groups: content [5] ## content data m1 m2 m3 ## <chr> <list> <list> <list> <list> ## 1 ELA <tibble [3,627 × 26]> <lm> <lm> <lm> ## 2 Math <tibble [3,629 × 26]> <lm> <lm> <lm> ## 3 Rdg <tibble [3,627 × 26]> <lm> <lm> <lm> ## 4 Science <tibble [1,435 × 26]> <lm> <lm> <lm> ## 5 Wri <tibble [3,627 × 26]> <lm> <lm> <lm> ``` --- # Brief foray into parallel iterations The `stats::anova` function can compare the fit of two models -- ### Pop Quiz How would we extract just ELA model 1 and 2? -- .pull-left[ ```r mods$m1[[1]] ``` ``` ## ## Call: ## lm(formula = Theta ~ primary, data = .x) ## ## Coefficients: ## (Intercept) primary10 primary20 primary40 primary50 primary60 ## 0.93223 0.38570 -0.03168 -1.84434 1.17372 0.77625 ## primary70 primary74 primary80 primary82 primary90 ## -1.83026 -0.76479 0.46765 0.33825 1.47936 ``` ] .pull-right[ ```r mods$m2[[1]] ``` ``` ## ## Call: ## lm(formula = Theta ~ primary + secondary, data = .x) ## ## Coefficients: ## (Intercept) primary10 primary20 primary40 primary50 primary60 ## 1.0043 0.4285 0.2807 -1.6282 1.1147 0.8676 ## primary70 primary74 primary80 primary82 primary90 secondary10 ## -1.4430 -0.1168 0.5283 0.3298 1.4101 -0.5685 ## secondary20 secondary40 secondary43 secondary50 secondary60 secondary70 ## -1.0558 -2.2707 -1.7943 0.2484 0.4831 -1.3336 ## secondary74 secondary80 secondary82 secondary90 ## -6.3695 -0.5142 -0.7259 1.6000 ``` ] --- # Which fits better? ```r compare <- anova(mods$m1[[1]], mods$m2[[1]]) compare ``` ``` ## Analysis of Variance Table ## ## Model 1: Theta ~ primary ## Model 2: Theta ~ primary + secondary ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 3616 20905 ## 2 3605 20100 11 804.26 13.113 < 2.2e-16 *** ## --- ## Signif. 
codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ``` --- # `map2` * Works the same as `map` but iterates over two vectors concurrently * Let's compare model 1 and 2 -- ```r mods %>% mutate(comp12 = map2(m1, m2, anova)) ``` ``` ## # A tibble: 5 × 6 ## # Groups: content [5] ## content data m1 m2 m3 comp12 ## <chr> <list> <list> <list> <list> <list> ## 1 ELA <tibble [3,627 × 26]> <lm> <lm> <lm> <anova [2 × 6]> ## 2 Math <tibble [3,629 × 26]> <lm> <lm> <lm> <anova [2 × 6]> ## 3 Rdg <tibble [3,627 × 26]> <lm> <lm> <lm> <anova [2 × 6]> ## 4 Science <tibble [1,435 × 26]> <lm> <lm> <lm> <anova [2 × 6]> ## 5 Wri <tibble [3,627 × 26]> <lm> <lm> <lm> <anova [2 × 6]> ``` -- Perhaps not terrifically helpful --- # Back to our `anova` object * Can we pull out useful things? ```r str(compare) ``` ``` ## Classes 'anova' and 'data.frame': 2 obs. of 6 variables: ## $ Res.Df : num 3616 3605 ## $ RSS : num 20905 20100 ## $ Df : num NA 11 ## $ Sum of Sq: num NA 804 ## $ F : num NA 13.1 ## $ Pr(>F) : num NA 7.66e-25 ## - attr(*, "heading")= chr [1:2] "Analysis of Variance Table\n" "Model 1: Theta ~ primary\nModel 2: Theta ~ primary + secondary" ``` -- Try pulling out the `\(p\)` value --- # Extract `\(p\)` value * *Note - I'd recommend looking at more than just a p-value, but I do think this is useful for a quick glance* ```r compare$`Pr(>F)` ``` ``` ## [1] NA 7.663566e-25 ``` ```r compare[["Pr(>F)"]] ``` ``` ## [1] NA 7.663566e-25 ``` -- ```r compare$`Pr(>F)`[2] ``` ``` ## [1] 7.663566e-25 ``` ```r compare[["Pr(>F)"]][2] ``` ``` ## [1] 7.663566e-25 ``` --- # All p-values *Note - this is probably the most compact syntax, but that doesn't mean it's the most clear* ```r mods %>% mutate(comp12 = map2(m1, m2, anova), p12 = map_dbl(comp12, list("Pr(>F)", 2))) ``` ``` ## # A tibble: 5 × 7 ## # Groups: content [5] ## content data m1 m2 m3 comp12 p12 ## <chr> <list> <list> <list> <list> <list> <dbl> ## 1 ELA <tibble [3,627 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 7.663566e-25 ## 2 Math <tibble 
[3,629 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 1.724262e-22 ## 3 Rdg <tibble [3,627 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 1.527172e-28 ## 4 Science <tibble [1,435 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 4.685885e-18 ## 5 Wri <tibble [3,627 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 5.785623e-11 ``` --- # Slight alternative * Write a function that pulls the p-value from model comparison objects ```r extract_p <- function(anova_ob) { anova_ob[["Pr(>F)"]][2] } ``` -- * Loop this function through the anova objects --- ```r mods %>% mutate(comp12 = map2(m1, m2, anova), p12 = map_dbl(comp12, extract_p)) ``` ``` ## # A tibble: 5 × 7 ## # Groups: content [5] ## content data m1 m2 m3 comp12 p12 ## <chr> <list> <list> <list> <list> <list> <dbl> ## 1 ELA <tibble [3,627 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 7.663566e-25 ## 2 Math <tibble [3,629 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 1.724262e-22 ## 3 Rdg <tibble [3,627 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 1.527172e-28 ## 4 Science <tibble [1,435 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 4.685885e-18 ## 5 Wri <tibble [3,627 × 26]> <lm> <lm> <lm> <anova [2 × 6]> 5.785623e-11 ``` --- # Brief sidetrack We can also create the function using `purrr::compose()`. -- ### Example Create a centering function (which subtracts the mean from each ob) ```r center <- compose(~.x - mean(.x, na.rm = TRUE)) ``` Use `~` and `.x`, just like with the `map()` functions. 
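
As a rough sketch, the composed function behaves like an ordinary function, and `compose()` can also chain several functions together. The toy vector `x` and the names `center_fn` and `center_abs` are made up for illustration.

```r
library(purrr)

# Same centering function as on the slide, built with compose()
center <- compose(~.x - mean(.x, na.rm = TRUE))

# An ordinary function that does the same thing
center_fn <- function(x) x - mean(x, na.rm = TRUE)

x <- c(1, 2, NA, 6)

center(x)     # -2 -1 NA  3
center_fn(x)  # -2 -1 NA  3

# compose() can also chain functions; with .dir = "forward" they are
# applied left to right (center first, then take absolute values)
center_abs <- compose(center_fn, abs, .dir = "forward")
center_abs(x) #  2  1 NA  3
```

With a single function, `compose()` is mostly an alternative spelling of `function(x) ...`; it earns its keep when several steps are chained.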
---

# Test it out

```r
library(palmerpenguins)
penguins$bill_length_mm %>%
  head()
```

```
## [1] 39.1 39.5 40.3   NA 36.7 39.3
```

```r
penguins$bill_length_mm %>%
  center() %>%
  head()
```

```
## [1] -4.82193 -4.42193 -3.62193       NA -7.22193 -4.62193
```

```r
penguins$bill_length_mm %>%
  center() %>%
  mean(na.rm = TRUE) %>%
  round()
```

```
## [1] 0
```

---

# Compose a p-val extractor

```r
p <- compose(~.x[["Pr(>F)"]][2])
```

### Use this instead

```r
mods %>%
  mutate(
    comp12 = map2(m1, m2, anova),
    p12 = map_dbl(comp12, p)
  )
```

```
## # A tibble: 5 × 7
## # Groups:   content [5]
##   content data                  m1     m2     m3     comp12          p12
##   <chr>   <list>                <list> <list> <list> <list>          <dbl>
## 1 ELA     <tibble [3,627 × 26]> <lm>   <lm>   <lm>   <anova [2 × 6]> 7.663566e-25
## 2 Math    <tibble [3,629 × 26]> <lm>   <lm>   <lm>   <anova [2 × 6]> 1.724262e-22
## 3 Rdg     <tibble [3,627 × 26]> <lm>   <lm>   <lm>   <anova [2 × 6]> 1.527172e-28
## 4 Science <tibble [1,435 × 26]> <lm>   <lm>   <lm>   <anova [2 × 6]> 4.685885e-18
## 5 Wri     <tibble [3,627 × 26]> <lm>   <lm>   <lm>   <anova [2 × 6]> 5.785623e-11
```

---
class: middle

# Functions

This was a quick intro - don't worry if it doesn't really make sense yet. We'll talk about them (a lot) more in the coming weeks.
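
For practice outside the course data, here is a minimal sketch of the same compare-and-extract pattern on the built-in `mtcars` data. The grouping variable (`cyl`), the two formulas, and the name `cyl_mods` are illustrative assumptions, not part of the course analysis; `extract_p()` is the helper defined earlier.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Pull the p-value from a two-model anova comparison
extract_p <- function(anova_ob) {
  anova_ob[["Pr(>F)"]][2]
}

cyl_mods <- mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  mutate(
    # Fit a smaller and a larger model within each group
    m1 = map(data, ~lm(mpg ~ wt, data = .x)),
    m2 = map(data, ~lm(mpg ~ wt + hp, data = .x)),
    # Compare the two models in parallel, then extract each p-value
    comp12 = map2(m1, m2, anova),
    p12 = map_dbl(comp12, extract_p)
  )

cyl_mods %>%
  select(cyl, p12)
```

The groups here are small, so the p-values are only there to show the mechanics, not to support any conclusion.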
---
class: inverse-red middle

# An alternative

## Conducting operations by row

---

# Operations by row

The `dplyr::rowwise()` function fundamentally changes the way a tibble behaves

```r
df <- tibble(name = c("Me", "You"), x = 1:2, y = 3:4, z = 5:6)
```

.pull-left[

```r
df %>%
  mutate(m = mean(c(x, y, z)))
```

```
## # A tibble: 2 × 5
##   name      x     y     z     m
##   <chr> <int> <int> <int> <dbl>
## 1 Me        1     3     5   3.5
## 2 You       2     4     6   3.5
```

]

.pull-right[

```r
df %>%
  rowwise() %>%
  mutate(m = mean(c(x, y, z)))
```

```
## # A tibble: 2 × 5
## # Rowwise:
##   name      x     y     z     m
##   <chr> <int> <int> <int> <dbl>
## 1 Me        1     3     5     3
## 2 You       2     4     6     4
```

]

---

# Add a group & summarize

```r
df %>%
  rowwise(name) %>%
  summarize(m = mean(c(x, y, z)))
```

```
## # A tibble: 2 × 2
## # Groups:   name [2]
##   name      m
##   <chr> <dbl>
## 1 Me        3
## 2 You       4
```

---

# List columns

If you apply a rowwise operation to a list column, you don't have to loop

```r
df <- tibble(var = list(1, 2:3, 4:6))
```

.pull-left[

```r
df %>%
  mutate(
    lngth = map_int(var, length)
  )
```

```
## # A tibble: 3 × 2
##   var       lngth
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3
```

]

.pull-right[

```r
df %>%
  rowwise() %>%
  mutate(lngth = length(var))
```

```
## # A tibble: 3 × 2
## # Rowwise:
##   var       lngth
##   <list>    <int>
## 1 <dbl [1]>     1
## 2 <int [2]>     2
## 3 <int [3]>     3
```

]

---

# Creating list columns

You can use the `dplyr::nest_by()` function to create a list column for each group, *and* convert it to a rowwise data frame.

--

```r
d %>%
  nest_by(content)
```

```
## # A tibble: 5 × 2
## # Rowwise:  content
##   content                data
##   <chr>   <list<tibble[,26]>>
## 1 ELA            [3,627 × 26]
## 2 Math           [3,629 × 26]
## 3 Rdg            [3,627 × 26]
## 4 Science        [1,435 × 26]
## 5 Wri            [3,627 × 26]
```

---

# Challenge

Given what we just learned, can you fit a model of the form `Theta ~ primary` to each content area (i.e., *not* using **{purrr}**)?
--

Wrap the model in `list()` (the error message should suggest this if you don't)

```r
d %>%
  nest_by(content) %>%
  mutate(m1 = list(lm(Theta ~ primary, data = data)))
```

```
## # A tibble: 5 × 3
## # Rowwise:  content
##   content                data m1
##   <chr>   <list<tibble[,26]>> <list>
## 1 ELA            [3,627 × 26] <lm>
## 2 Math           [3,629 × 26] <lm>
## 3 Rdg            [3,627 × 26] <lm>
## 4 Science        [1,435 × 26] <lm>
## 5 Wri            [3,627 × 26] <lm>
```

---

# Challenge 2

Can you extend it further and extract the coefficients with `coef`? What about creating a new column that has the intercept values?
--

```r
d %>%
  nest_by(content) %>%
  mutate(m1 = list(lm(Theta ~ primary, data = data)),
         coefs = list(coef(m1)))
```

```
## # A tibble: 5 × 4
## # Rowwise:  content
##   content                data m1     coefs
##   <chr>   <list<tibble[,26]>> <list> <list>
## 1 ELA            [3,627 × 26] <lm>   <dbl [11]>
## 2 Math           [3,629 × 26] <lm>   <dbl [12]>
## 3 Rdg            [3,627 × 26] <lm>   <dbl [11]>
## 4 Science        [1,435 × 26] <lm>   <dbl [12]>
## 5 Wri            [3,627 × 26] <lm>   <dbl [12]>
```

---

# Return atomic vectors

```r
d %>%
  nest_by(content) %>%
  mutate(m1 = list(lm(Theta ~ primary, data = data)),
         intercept = coef(m1)[1])
```

```
## # A tibble: 5 × 4
## # Rowwise:  content
##   content                data m1      intercept
##   <chr>   <list<tibble[,26]>> <list>      <dbl>
## 1 ELA            [3,627 × 26] <lm>    0.9322336
## 2 Math           [3,629 × 26] <lm>   -0.1587907
## 3 Rdg            [3,627 × 26] <lm>    1.363101
## 4 Science        [1,435 × 26] <lm>    1.491319
## 5 Wri            [3,627 × 26] <lm>    1.571441
```

---

# Fit all models

The code below gives us the same results as before

```r
mods2 <- d %>%
  nest_by(content) %>%
  mutate(
    m1 = list(lm(Theta ~ primary, data = data)),
    m2 = list(lm(Theta ~ primary + secondary, data = data)),
    m3 = list(lm(Theta ~ primary * secondary, data = data))
  )
mods2
```

```
## # A tibble: 5 × 5
## # Rowwise:  content
##   content                data m1     m2     m3
##   <chr>   <list<tibble[,26]>> <list> <list> <list>
## 1 ELA            [3,627 × 26] <lm>   <lm>   <lm>
## 2 Math           [3,629 × 26] <lm>   <lm>   <lm>
## 3 Rdg            [3,627 × 26] <lm>   <lm>   <lm>
## 4 Science        [1,435 × 26] <lm>   <lm>   <lm>
## 5 Wri            [3,627 × 26] <lm>   <lm>   <lm>
```

---

# Look at all `\(R^2\)`

### It's a normal data frame!
```r mods %>% pivot_longer( m1:m3, names_to = "model", values_to = "output" ) ``` ``` ## # A tibble: 15 × 4 ## # Groups: content [5] ## content data model output ## <chr> <list> <chr> <list> ## 1 ELA <tibble [3,627 × 26]> m1 <lm> ## 2 ELA <tibble [3,627 × 26]> m2 <lm> ## 3 ELA <tibble [3,627 × 26]> m3 <lm> ## 4 Math <tibble [3,629 × 26]> m1 <lm> ## 5 Math <tibble [3,629 × 26]> m2 <lm> ## 6 Math <tibble [3,629 × 26]> m3 <lm> ## 7 Rdg <tibble [3,627 × 26]> m1 <lm> ## 8 Rdg <tibble [3,627 × 26]> m2 <lm> ## 9 Rdg <tibble [3,627 × 26]> m3 <lm> ## 10 Science <tibble [1,435 × 26]> m1 <lm> ## 11 Science <tibble [1,435 × 26]> m2 <lm> ## 12 Science <tibble [1,435 × 26]> m3 <lm> ## 13 Wri <tibble [3,627 × 26]> m1 <lm> ## 14 Wri <tibble [3,627 × 26]> m2 <lm> ## 15 Wri <tibble [3,627 × 26]> m3 <lm> ``` --- # Extract all `\(R^2\)` *Note - might want to write a function here again* ```r r2 <- mods %>% pivot_longer( m1:m3, names_to = "model", values_to = "output" ) %>% mutate(r2 = map_dbl(output, ~summary(.x)$r.squared)) r2 ``` ``` ## # A tibble: 15 × 5 ## # Groups: content [5] ## content data model output r2 ## <chr> <list> <chr> <list> <dbl> ## 1 ELA <tibble [3,627 × 26]> m1 <lm> 0.04517421 ## 2 ELA <tibble [3,627 × 26]> m2 <lm> 0.08190917 ## 3 ELA <tibble [3,627 × 26]> m3 <lm> 0.1161187 ## 4 Math <tibble [3,629 × 26]> m1 <lm> 0.05326550 ## 5 Math <tibble [3,629 × 26]> m2 <lm> 0.08675264 ## 6 Math <tibble [3,629 × 26]> m3 <lm> 0.1185931 ## 7 Rdg <tibble [3,627 × 26]> m1 <lm> 0.04805713 ## 8 Rdg <tibble [3,627 × 26]> m2 <lm> 0.08926212 ## 9 Rdg <tibble [3,627 × 26]> m3 <lm> 0.1217497 ## 10 Science <tibble [1,435 × 26]> m1 <lm> 0.08683581 ## 11 Science <tibble [1,435 × 26]> m2 <lm> 0.1522437 ## 12 Science <tibble [1,435 × 26]> m3 <lm> 0.2170660 ## 13 Wri <tibble [3,627 × 26]> m1 <lm> 0.05171555 ## 14 Wri <tibble [3,627 × 26]> m2 <lm> 0.06977688 ## 15 Wri <tibble [3,627 × 26]> m3 <lm> 0.09820067 ``` --- # Plot ```r ggplot(r2, aes(model, r2)) + geom_col(aes(fill = model)) + 
facet_wrap(~content) + guides(fill = "none") + scale_fill_brewer(palette = "Set2") ``` ![](w4_files/figure-html/model-plot-1.png)<!-- --> --- # Unnesting * Sometimes you just want to `unnest` -- * Imagine we want to plot the coefficients by model... how? -- * `broom::tidy()` => `tidyr::unnest()` --- # Tidy ```r mods %>% pivot_longer( m1:m3, names_to = "model", values_to = "output" ) %>% mutate(tidied = map(output, broom::tidy)) ``` ``` ## # A tibble: 15 × 5 ## # Groups: content [5] ## content data model output tidied ## <chr> <list> <chr> <list> <list> ## 1 ELA <tibble [3,627 × 26]> m1 <lm> <tibble [11 × 5]> ## 2 ELA <tibble [3,627 × 26]> m2 <lm> <tibble [22 × 5]> ## 3 ELA <tibble [3,627 × 26]> m3 <lm> <tibble [132 × 5]> ## 4 Math <tibble [3,629 × 26]> m1 <lm> <tibble [12 × 5]> ## 5 Math <tibble [3,629 × 26]> m2 <lm> <tibble [23 × 5]> ## 6 Math <tibble [3,629 × 26]> m3 <lm> <tibble [144 × 5]> ## 7 Rdg <tibble [3,627 × 26]> m1 <lm> <tibble [11 × 5]> ## 8 Rdg <tibble [3,627 × 26]> m2 <lm> <tibble [22 × 5]> ## 9 Rdg <tibble [3,627 × 26]> m3 <lm> <tibble [132 × 5]> ## 10 Science <tibble [1,435 × 26]> m1 <lm> <tibble [12 × 5]> ## 11 Science <tibble [1,435 × 26]> m2 <lm> <tibble [22 × 5]> ## 12 Science <tibble [1,435 × 26]> m3 <lm> <tibble [132 × 5]> ## 13 Wri <tibble [3,627 × 26]> m1 <lm> <tibble [12 × 5]> ## 14 Wri <tibble [3,627 × 26]> m2 <lm> <tibble [22 × 5]> ## 15 Wri <tibble [3,627 × 26]> m3 <lm> <tibble [132 × 5]> ``` --- # Equivalently ```r mods %>% pivot_longer( m1:m3, names_to = "model", values_to = "output" ) %>% rowwise() %>% mutate(tidied = list(broom::tidy(output))) ``` ``` ## # A tibble: 15 × 5 ## # Rowwise: content ## content data model output tidied ## <chr> <list> <chr> <list> <list> ## 1 ELA <tibble [3,627 × 26]> m1 <lm> <tibble [11 × 5]> ## 2 ELA <tibble [3,627 × 26]> m2 <lm> <tibble [22 × 5]> ## 3 ELA <tibble [3,627 × 26]> m3 <lm> <tibble [132 × 5]> ## 4 Math <tibble [3,629 × 26]> m1 <lm> <tibble [12 × 5]> ## 5 Math <tibble [3,629 × 26]> m2 <lm> 
<tibble [23 × 5]> ## 6 Math <tibble [3,629 × 26]> m3 <lm> <tibble [144 × 5]> ## 7 Rdg <tibble [3,627 × 26]> m1 <lm> <tibble [11 × 5]> ## 8 Rdg <tibble [3,627 × 26]> m2 <lm> <tibble [22 × 5]> ## 9 Rdg <tibble [3,627 × 26]> m3 <lm> <tibble [132 × 5]> ## 10 Science <tibble [1,435 × 26]> m1 <lm> <tibble [12 × 5]> ## 11 Science <tibble [1,435 × 26]> m2 <lm> <tibble [22 × 5]> ## 12 Science <tibble [1,435 × 26]> m3 <lm> <tibble [132 × 5]> ## 13 Wri <tibble [3,627 × 26]> m1 <lm> <tibble [12 × 5]> ## 14 Wri <tibble [3,627 × 26]> m2 <lm> <tibble [22 × 5]> ## 15 Wri <tibble [3,627 × 26]> m3 <lm> <tibble [132 × 5]> ``` --- # Select and unnest ```r tidied <- mods %>% gather(model, output, m1:m3) %>% mutate(tidied = map(output, broom::tidy)) %>% select(content, model, tidied) %>% unnest(tidied) tidied ``` ``` ## # A tibble: 841 × 7 ## # Groups: content [5] ## content model term estimate std.error statistic p.value ## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> ## 1 ELA m1 (Intercept) 0.9322336 0.2150561 4.334839 1.498396e-5 ## 2 ELA m1 primary10 0.3856986 0.2242965 1.719593 8.559207e-2 ## 3 ELA m1 primary20 -0.03167527 0.7266436 -0.04359120 9.652327e-1 ## 4 ELA m1 primary40 -1.844343 0.5559031 -3.317741 9.164595e-4 ## 5 ELA m1 primary50 1.173722 0.2890447 4.060694 4.996391e-5 ## 6 ELA m1 primary60 0.7762539 0.3866313 2.007737 4.474555e-2 ## 7 ELA m1 primary70 -1.830257 0.3086128 -5.930595 3.301860e-9 ## 8 ELA m1 primary74 -0.7647874 0.5182670 -1.475663 1.401215e-1 ## 9 ELA m1 primary80 0.4676481 0.2428640 1.925556 5.423822e-2 ## 10 ELA m1 primary82 0.3382547 0.2267600 1.491686 1.358687e-1 ## # … with 831 more rows ``` --- # Plot ### Let's look at how the primary coefficients change ```r to_plot <- names(coef(mods$m1[[1]])) tidied %>% filter(term %in% to_plot) %>% ggplot(aes(estimate, term, color = model)) + geom_point() + scale_color_brewer(palette = "Set2") + facet_wrap(~content) ``` --- ![](w4_files/figure-html/coef-plot-eval-1.png)<!-- --> --- # Last bit * We've kind of been
running the wrong models this whole time * We forgot about grade! * No problem, just change the grouping factor --- # By grade ```r by_grade_content <- d %>% group_by(content, grade) %>% nest() by_grade_content ``` ``` ## # A tibble: 31 × 3 ## # Groups: content, grade [31] ## grade content data ## <int> <chr> <list> ## 1 11 ELA <tibble [453 × 25]> ## 2 11 Math <tibble [460 × 25]> ## 3 11 Rdg <tibble [453 × 25]> ## 4 11 Science <tibble [438 × 25]> ## 5 11 Wri <tibble [453 × 25]> ## 6 3 ELA <tibble [540 × 25]> ## 7 3 Math <tibble [536 × 25]> ## 8 3 Rdg <tibble [540 × 25]> ## 9 3 Wri <tibble [540 × 25]> ## 10 4 ELA <tibble [585 × 25]> ## # … with 21 more rows ``` --- # Fit models ```r mods_grade <- by_grade_content %>% mutate( m1 = map(data, ~lm(Theta ~ primary, data = .x)), m2 = map(data, ~lm(Theta ~ primary + secondary, data = .x)), m3 = map(data, ~lm(Theta ~ primary * secondary, data = .x)) ) mods_grade ``` ``` ## # A tibble: 31 × 6 ## # Groups: content, grade [31] ## grade content data m1 m2 m3 ## <int> <chr> <list> <list> <list> <list> ## 1 11 ELA <tibble [453 × 25]> <lm> <lm> <lm> ## 2 11 Math <tibble [460 × 25]> <lm> <lm> <lm> ## 3 11 Rdg <tibble [453 × 25]> <lm> <lm> <lm> ## 4 11 Science <tibble [438 × 25]> <lm> <lm> <lm> ## 5 11 Wri <tibble [453 × 25]> <lm> <lm> <lm> ## 6 3 ELA <tibble [540 × 25]> <lm> <lm> <lm> ## 7 3 Math <tibble [536 × 25]> <lm> <lm> <lm> ## 8 3 Rdg <tibble [540 × 25]> <lm> <lm> <lm> ## 9 3 Wri <tibble [540 × 25]> <lm> <lm> <lm> ## 10 4 ELA <tibble [585 × 25]> <lm> <lm> <lm> ## # … with 21 more rows ``` --- # Look at `\(R^2\)` ```r mods_grade %>% pivot_longer( m1:m3, names_to = "model", values_to = "output" ) %>% mutate(r2 = map_dbl(output, ~summary(.x)$r.squared)) ``` ``` ## # A tibble: 93 × 6 ## # Groups: content, grade [31] ## grade content data model output r2 ## <int> <chr> <list> <chr> <list> <dbl> ## 1 11 ELA <tibble [453 × 25]> m1 <lm> 0.03353818 ## 2 11 ELA <tibble [453 × 25]> m2 <lm> 0.1084394 ## 3 11 ELA <tibble [453 × 25]> m3 
<lm> 0.1536891 ## 4 11 Math <tibble [460 × 25]> m1 <lm> 0.1886003 ## 5 11 Math <tibble [460 × 25]> m2 <lm> 0.3161226 ## 6 11 Math <tibble [460 × 25]> m3 <lm> 0.4046634 ## 7 11 Rdg <tibble [453 × 25]> m1 <lm> 0.02066316 ## 8 11 Rdg <tibble [453 × 25]> m2 <lm> 0.1820512 ## 9 11 Rdg <tibble [453 × 25]> m3 <lm> 0.2337721 ## 10 11 Science <tibble [438 × 25]> m1 <lm> 0.1259080 ## # … with 83 more rows ``` --- # Plot ```r mods_grade %>% pivot_longer( m1:m3, names_to = "model", values_to = "output" ) %>% mutate(r2 = map_dbl(output, ~summary(.x)$r.squared)) %>% ggplot(aes(model, r2)) + geom_col(aes(fill = model)) + facet_grid(grade ~ content) + guides(fill = "none") + scale_fill_brewer(palette = "Set2") ``` --- ![](w4_files/figure-html/by_grade_r2_plot-eval-1.png)<!-- --> --- # Summary * List columns are really powerful and really flexible * Also help you stay organized * You can approach the problem either with **{purrr}** or `dplyr::rowwise()`. + **Important**: If you use `rowwise()`, remember to `ungroup()` when you want it to go back to being a normal data frame + I'm asking you to learn both - the row-wise approach might be a bit easier but is a little less general (only works with data frames) --- class: inverse-green middle # In-class Midterm ### Next time: Parallel iterations
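
---

# Backup: `ungroup()` after `rowwise()`

The summary's reminder about `ungroup()` can be sketched with the small `df` from the row-wise slides. The names `rw`, `by_row`, and `overall` are made up for illustration.

```r
library(dplyr)

df <- tibble(name = c("Me", "You"), x = 1:2, y = 3:4, z = 5:6)

# Row-wise mean across x, y, z; the result is still a rowwise data frame
rw <- df %>%
  rowwise() %>%
  mutate(m = mean(c(x, y, z)))

# Still rowwise: summarize() is computed separately for each row
by_row <- rw %>%
  summarize(total = sum(m))

# After ungroup(): summarize() is computed once over all rows
overall <- rw %>%
  ungroup() %>%
  summarize(total = sum(m))

by_row   # two rows: 3 and 4
overall  # one row: 7
```

Forgetting `ungroup()` rarely throws an error; it usually just gives per-row results where you expected one overall result, which is why it is easy to miss.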